This is the second part of a five-part series, collated here. Having covered writing tools in the last post, this time I’m focussing on creating something to write about.
Let’s assume that you’ve been out, conducted experiments or sampling regimes, and returned after much effort with a mountain of data. As scientists we invest much thought into how best to collect reliable data, and also in how to effectively analyse it. The intermediate stage — arranging, cleaning and processing the data — is often overlooked. Yet this can sometimes take as long as collecting the data in the first place, and specialist tools exist to make your life easier.
I’m not going to dwell here on good practices for data management; for that there’s an excellent guide produced by the British Ecological Society which says more than I could. The principles of data organisation are well covered in this paper by Hadley Wickham. Both are on the essential reading list for students in my group, and I’d recommend them to anyone. Instead my focus here is on the tools you can use to do it.
The familiar Microsoft Excel is fine for small datasets, but struggles with large spreadsheets, and if you’ve ever tried to load a sizeable amount of data into it then you’ll know that you might as well go away to make a cup of tea, come back and hope it hasn’t crashed. This is a problem with Excel, not your data. Incidentally, LibreOffice Calc is the free substitute for Excel if you want a straight replacement. Don’t even consider using either of them to do statistics or draw figures (on which there will be more next time). I consider this computational limitation more than enough reason to look elsewhere, even though there are many official and unofficial plug-ins which extend Excel’s capabilities. Excel can also reformat your data without you knowing about it.
One of the main functionalities lacking in Excel is a way to use GREP. Regular Expressions are powerful search terms that allow you to screen data, check for errors and fix problems. Learning how to use them properly will save all the time you used to spend scrolling through datasheets looking for problems until your mind went numb. Proper text editors allow this functionality. Personally I use jEdit to manage my data, which is available free for all operating systems. Learning to parse a .csv or .txt file that isn’t in a conventional box-format spreadsheet takes a little time but soon becomes routine.
For larger, linked databases, Microsoft Access used to be the class-leader. The later versions have compromised functionality for accessibility, leading many people to seek alternatives. Databases are compiled using SQL (Structured Query Language), and learning to use Access compels you to pick up the basics of this anyway. Given this, starting with a free alternative is no more difficult. I have always found MySQL to be easy and straightforward, but some colleagues strongly recommend SQLite. It might not have all the same functions of the larger database tools but most users won’t notice the difference. Most importantly, a database in SQL format can be transferred between any of these software tools with no loss of function. Migrating into (or out of) Access is trickier.
As a general rule, your data management software should be used for that alone. The criterion for choosing what software to use is that it should allow you to clean your data and load it into an analysis platform as quickly and easily as possible. Don’t waste time producing summaries, figures or reports when this can be done more efficiently using proper tools.
These days no-one looks further than R. As a working environment it’s the ideal way to load and inspect data, carry out statistical tests, and produce publication-quality figures. Many people — including myself — do pretty much all their data processing, analysis and visualisation in R*.
It’s interesting to note just how rapidly the landscape has changed. As an undergraduate in the 90s we were taught using Minitab. For my PhD I did all my statistics in SPSS, then as a post-doc I transitioned to GenStat. All are perfectly decent, serviceable solutions for basic statistical analyses. Each has its limitations but moving between them isn’t difficult.
I won’t hide the simple truth — learning R is hard, especially if you have no experience of programming. Why then did I bother? The simple answer is that R can do everything that all the above programs can do, and more. It’s also more efficient, reproducible and adaptable. Once you have the code to do a particular set of analyses you can tweak, amend and reapply at will. Never again do you have to work through a lengthy menu, drag-and-drop variables, tick the right boxes and remember the exact sequence for next time. Once a piece of code is written, you keep it.
If you’re struggling then there are loads of websites providing advice to all levels from beginners to experienced statistical programmers. It’s also worth looking at the excellent books by Alain Zuur which I can’t recommend highly enough. If you have a problem then a quick internet search will usually retrieve an answer in no time, while the mailing lists are filled with incredibly helpful people**. The other great thing about R is that it’s free***.
One word of warning is to not dive too deep at the beginning. Start by replicating analyses you’re already familiar with, perhaps from previous papers. The Quick-R page is a good entry point. A bad (but common) way of beginning with R is to be told that you need to use a particular analytical approach, and that R is the only way to do it. This way leads at best to frustration, at worst to errors. If someone tells you to use approximate Bayesian inference via integrated nested Laplace approximation, then you can do it with the R-INLA package. The responsibility is still on you to know what you’re doing though; don’t expect someone to hold your hand.
Because R is a language rather than a program, the default environment isn’t very easy to work in, and you’re much better using another program to interface with R. By far the most widely-used is RStudio, and it’s the one I recommend to my own post-graduate students. It will improve your R coding experience immensely. Some programmers use it for almost everything. An alternative is Tinn-R, which I used to use, but gave up on a few years ago because it was too buggy. It may have improved now so by all means try it out. If you’re desperate for a familiar-looking graphical user interface with menus then R Commander provides one, but I recommend using this as a gateway to learning more (or teaching students) rather than a long-term solution.
I’m a bit old-fashioned and prefer to use a traditional text editor to work in R. My choice, for historical reasons, is eMacs, which links neatly to R through ESS. The other tribe of programmers use Vim with the sensibly-named Vim-R-plugin, and we shall speak no more of them. If you’re already a programmer then you know about these, and can be assured that you can code in R just as easily. If not then stick to Rstudio, which is much easier. I also often use Geany as a tool for making quick edits to scripts.
Most of all, don’t type directly into R, it’s a recipe for disaster, and removes the greatest advantage which is its reproducibility. Likewise don’t keep a Word document open with R commands while continually copy-and-pasting them over. I’ve seen many students doing this, and it’s only recommended if you want to speed the onset of repetitive strain injury. Word will also keep reformatting and autocorrecting your text, introducing many errors. Use a proper editor and it’s done in one click.
One issue with R that more experienced users will come across is that it is relatively slow at processing very large datasets or large numbers of files. This is a problem that relatively few users will encounter, and by that point most will be competent programmers. In these cases it’s worth learning one of the major programming languages for file handling. Python is the easiest to pick up, for which Rosalind provides a nice series of scaled problems for learning and teaching (albeit with a bioinformatics focus). Serious programmers will know of or already use C, which is more widespread and has greater power. Finding out how to use a Bash shell efficiently is also immensely helpful. Learning to program in these other languages will open many doors, including to alternative careers, but is not essential for most people.
As a final aside, there is a recent attempt to link the power of C with the statistical capabilities of R in a new programming language called Julia. This is still in early development but is worth keeping an eye on if statistical programming is likely to become a major feature of your research.
Specialist software tools
Almost everything can be done in R, and those that can’t already, can be programmed. That said, there are some bespoke free software tools that are worth mentioning as they can be of great use to ecologists. They’re also valuable for those who prefer a GUI (Graphical User Interface) and aren’t ready to move over to a command-line tool just yet. Where I know of them, I’ve mentioned the leading R packages too.
Diversity statistics — the majority of people now use the vegan package in R. Outside R, the most widely-used free tool for diversity analysis is EstimateS. Much of the same functionality is contained in SPADE, written by Anne Chao (who has a number of other free programs on her website). I’ve always found the latter to be a little buggy, but it’s also reliably updated with the very latest methods. It has more recently been converted into an R package, spadeR, which has an accessible webpage that will do all the analyses for you. As a final mention, there is good commercial software available from Pisces Conservation, but apart from a cleaner-looking interface I’ve never seen any advantage to using it.
GIS — I’ll be returning to the issue of making maps in a later post, but will mention here that a direct replacement for the expensive ArcGIS is the free qGIS. I’ve never found any functionality lacking, but I’m not a serious GIS user either. There are a plethora of R packages which in combination cover the same range of functions but I wouldn’t like to make recommendations.
Macroecology — SAM (for Spatial Analysis in Macroecology) is a useful tool for quickly loading and inspecting patterns in spatial ecological data. I would personally still move into R for publication-grade analyses, but this can be a helpful stepping stone when exploring a new dataset.
Null models — these can be very useful in community ecology. The only time I’ve done this, I used the free version of EcoSim. I see that you now have to pay for the full version, so if someone can recommend a comparable R package in the comments then I’ll update this accordingly.
I’m happy to extend this list with further recommendations; please drop a note in the comments.
Practical Computing for Biologists is a great book. A little knowledge goes a long way, and learning how to use the shell, regular expressions and a small amount of Python will soon reap dividends for your research, whatever stage you’re at.
* The most mathematically-inclined biologists might hanker after something more like MATLAB, for which a direct free replacement is GNU Octave. You can even transfer MATLAB programs across, although there are some minor differences in the language.
** Normal forum protocol applies here, which is that you shouldn’t ask a question to which you could reasonably have found an answer by searching for yourself. If you ask a stupid question that implies no effort on your part then you can expect a curt answer (or none at all). That said, if you really can’t work something out then it’s well worth bringing up because you might be the first person to spot an issue. If your problem is an interesting one then often you’ll find yourself receiving support from some of the top names in the field, so long as you are willing to learn and engage. Please read the posting guide before you start.
*** A few years ago a graduate student declined my advice to use R, declaring in my office that if R was so good, someone would be charging for it. I was taken aback, perhaps because I take the logic of Free Open-Source Software for granted. If you’re unsure, then the main benefit is that it’s free to obtain and modify the original code. This means that someone has almost certainly created a specific tool to meet your research needs. Proprietary commercial software is aimed at the market and the average user, whereas open-source software can be tweaked and modified. The reason R is so powerful is that it’s used by so many people, many of whom are actively developing new tools and bringing them directly to your computer. Often these will be published in Journal of Statistical Software or more recently Methods in Ecology and Evolution.