Free software for biologists pt. 2 – data management and analysis

This is the second part of a five-part series, collated here. Having covered writing tools in the last post, this time I’m focussing on creating something to write about.

Data management

Let’s assume that you’ve been out, conducted experiments or sampling regimes, and returned after much effort with a mountain of data. As scientists we invest much thought into how best to collect reliable data, and also in how to effectively analyse it. The intermediate stage — arranging, cleaning and processing the data — is often overlooked. Yet this can sometimes take as long as collecting the data in the first place, and specialist tools exist to make your life easier.

I’m not going to dwell here on good practices for data management; for that there’s an excellent guide produced by the British Ecological Society which says more than I could. The principles of data organisation are well covered in this paper by Hadley Wickham. Both are on the essential reading list for students in my group, and I’d recommend them to anyone. Instead my focus here is on the tools you can use to do it.

The familiar Microsoft Excel is fine for small datasets, but it struggles with large spreadsheets: if you’ve ever tried to load a sizeable amount of data into it then you’ll know that you might as well go away to make a cup of tea, come back and hope it hasn’t crashed. This is a problem with Excel, not your data, and I consider this computational limitation more than enough reason to look elsewhere, even though there are many official and unofficial plug-ins which extend Excel’s capabilities. Excel can also reformat your data without you knowing about it, silently converting anything that looks like a date, for example. Incidentally, LibreOffice Calc is the free substitute for Excel if you want a straight replacement. Don’t even consider using either of them to do statistics or draw figures (on which there will be more next time).

One of the main capabilities missing from Excel is search with regular expressions (the pattern-matching behind the Unix tool grep). Regular expressions are powerful search patterns that allow you to screen data, check for errors and fix problems. Learning how to use them properly will save all the time you used to spend scrolling through datasheets looking for problems until your mind went numb. Proper text editors provide this functionality; personally I use jEdit to manage my data, which is available free for all operating systems. Learning to work with a .csv or .txt file that isn’t in a conventional box-format spreadsheet takes a little time but soon becomes routine.
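
To give a flavour, here’s a minimal sketch of regex-based cleaning using R’s built-in functions; the file and column names are invented for illustration, and the same patterns work in jEdit’s search-and-replace dialogue:

    # Read in a hypothetical field datasheet
    dat <- read.csv("survey_data.csv", stringsAsFactors = FALSE)

    # Flag any 'height' entries containing something other than digits or a
    # decimal point (stray units, trailing spaces, question marks...)
    bad <- grepl("[^0-9.]", dat$height)
    dat$height[bad]                  # inspect the offending values

    # Strip the non-numeric characters and convert to numbers
    dat$height <- as.numeric(gsub("[^0-9.]", "", dat$height))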

For larger, linked databases, Microsoft Access used to be the class-leader. The later versions have compromised functionality for accessibility, leading many people to seek alternatives. Databases are built and queried using SQL (Structured Query Language), and learning to use Access compels you to pick up the basics of this anyway. Given that, starting with a free alternative is no more difficult. I have always found MySQL to be easy and straightforward, but some colleagues strongly recommend SQLite. It might not have all the functions of the larger database tools, but most users won’t notice the difference. Most importantly, a database written in standard SQL can be transferred between any of these tools with little loss of function; migrating into (or out of) Access is trickier.
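
If you do choose SQLite, it links neatly to R as well. A minimal sketch, assuming the DBI and RSQLite packages are installed and using an invented database file and table:

    library(DBI)

    # Connect to the (hypothetical) database file
    con <- dbConnect(RSQLite::SQLite(), "plots.sqlite")
    dbListTables(con)                       # what tables does it contain?

    # Pull a subset straight into a data frame with an SQL query
    trees <- dbGetQuery(con, "SELECT species, dbh FROM trees WHERE dbh > 10")

    dbDisconnect(con)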

As a general rule, your data management software should be used for that alone. The criterion for choosing what software to use is that it should allow you to clean your data and load it into an analysis platform as quickly and easily as possible. Don’t waste time producing summaries, figures or reports when this can be done more efficiently using proper tools.

Data analysis

These days no-one looks further than R. As a working environment it’s the ideal way to load and inspect data, carry out statistical tests, and produce publication-quality figures. Many people — including myself — do pretty much all their data processing, analysis and visualisation in R*.
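
To give a flavour of what a working session looks like, here’s a minimal sketch (file and variable names invented): load a dataset, inspect it, fit a simple model and draw a figure, all in a handful of lines.

    dat <- read.csv("seedlings.csv")         # load the data
    str(dat)                                 # check what R thinks it contains

    model <- lm(growth ~ light, data = dat)  # fit a simple linear model
    summary(model)                           # the statistical output

    plot(growth ~ light, data = dat)         # scatterplot of the raw data
    abline(model)                            # add the fitted line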

It’s interesting to note just how rapidly the landscape has changed. As an undergraduate in the 90s we were taught using Minitab. For my PhD I did all my statistics in SPSS, then as a post-doc I transitioned to GenStat. All are perfectly decent, serviceable solutions for basic statistical analyses. Each has its limitations but moving between them isn’t difficult.

I won’t hide the simple truth — learning R is hard, especially if you have no experience of programming. Why then did I bother? The simple answer is that R can do everything that all the above programs can do, and more. It’s also more efficient, reproducible and adaptable. Once you have the code to do a particular set of analyses you can tweak, amend and reapply at will. Never again do you have to work through a lengthy menu, drag-and-drop variables, tick the right boxes and remember the exact sequence for next time. Once a piece of code is written, you keep it.
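
A small made-up example shows what I mean. If you summarise each field season’s data in the same way, wrap the steps in a function once and you can reapply it to every new file:

    # One reusable piece of code instead of a remembered sequence of menu clicks
    summarise_season <- function(file) {
      dat <- read.csv(file)
      list(
        n     = nrow(dat),
        means = colMeans(dat[sapply(dat, is.numeric)], na.rm = TRUE),
        model = summary(lm(biomass ~ treatment, data = dat))
      )
    }

    summarise_season("plots_2014.csv")
    summarise_season("plots_2015.csv")   # same analysis, new data, no extra work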

If you’re struggling then there are loads of websites providing advice to all levels from beginners to experienced statistical programmers. It’s also worth looking at the excellent books by Alain Zuur which I can’t recommend highly enough. If you have a problem then a quick internet search will usually retrieve an answer in no time, while the mailing lists are filled with incredibly helpful people**. The other great thing about R is that it’s free***.

One word of warning: don’t dive in too deep at the beginning. Start by replicating analyses you’re already familiar with, perhaps from previous papers. The Quick-R page is a good entry point. A bad (but common) way of beginning with R is to be told that you need to use a particular analytical approach, and that R is the only way to do it. This way leads at best to frustration, at worst to errors. If someone tells you to use approximate Bayesian inference via integrated nested Laplace approximation, then you can do it with the R-INLA package. The responsibility is still on you to know what you’re doing, though; don’t expect anyone to hold your hand.
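
To start gently, here’s the sort of one-way ANOVA most of us met as undergraduates, using a dataset that ships with R:

    # InsectSprays is a classic example dataset built into R
    model <- aov(count ~ spray, data = InsectSprays)
    summary(model)                              # the familiar ANOVA table
    boxplot(count ~ spray, data = InsectSprays) # and a quick figure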

Because R is a language rather than a program, the default environment isn’t very easy to work in, and you’re much better off using another program to interface with R. By far the most widely used is RStudio, and it’s the one I recommend to my own postgraduate students. It will improve your R coding experience immensely. Some programmers use it for almost everything. An alternative is Tinn-R, which I used to use but gave up on a few years ago because it was too buggy. It may have improved since, so by all means try it out. If you’re desperate for a familiar-looking graphical user interface with menus then R Commander provides one, but I recommend using this as a gateway to learning more (or for teaching students) rather than as a long-term solution.

I’m a bit old-fashioned and prefer to use a traditional text editor to work in R. My choice, for historical reasons, is Emacs, which links neatly to R through ESS. The other tribe of programmers use Vim with the sensibly-named Vim-R-plugin, and we shall speak no more of them. If you’re already a programmer then you know about these, and can be assured that you can code in R just as easily. If not, then stick to RStudio, which is much easier. I also often use Geany as a tool for making quick edits to scripts.

Most of all, don’t type directly into the R console: it’s a recipe for disaster, and it throws away R’s greatest advantage, which is reproducibility. Likewise, don’t keep a Word document of R commands open and continually copy-and-paste them across. I’ve seen many students doing this, and it’s only recommended if you want to speed the onset of repetitive strain injury. Word will also keep reformatting and autocorrecting your text (smart quotes are a common culprit), introducing errors. Use a proper editor and sending your code to R takes a single keystroke.
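
In practice the workflow is simple: keep your commands in a plain-text script, then send the whole thing to R in one go. A sketch, with invented file and variable names:

    # Contents of a plain-text script saved as, say, "analysis.R":
    #
    #   dat   <- read.csv("field_data.csv")
    #   model <- lm(growth ~ treatment, data = dat)
    #   summary(model)
    #
    # Run the whole script from a clean R session with one command, which is
    # what makes the analysis reproducible:
    source("analysis.R", echo = TRUE)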

One issue with R that more experienced users will come across is that it is relatively slow at processing very large datasets or large numbers of files. This is a problem relatively few users will encounter, and by that point most will be competent programmers. In these cases it’s worth learning one of the major programming languages for file handling. Python is the easiest to pick up, and Rosalind provides a nice series of graded problems for learning and teaching it (albeit with a bioinformatics focus). Serious programmers will know of, or already use, C, which is more widespread and more powerful. Finding out how to use a Bash shell efficiently is also immensely helpful. Learning to program in these other languages will open many doors, including to alternative careers, but it is not essential for most people.

As a final aside, there is a recent attempt to link the power of C with the statistical capabilities of R in a new programming language called Julia. This is still in early development but is worth keeping an eye on if statistical programming is likely to become a major feature of your research.

Specialist software tools

Almost everything can be done in R, and anything that can’t yet be done can be programmed. That said, there are some bespoke free software tools that are worth mentioning as they can be of great use to ecologists. They’re also valuable for those who prefer a GUI (Graphical User Interface) and aren’t ready to move over to a command-line tool just yet. Where I know of them, I’ve mentioned the leading R packages too.

Diversity statistics — the majority of people now use the vegan package in R (see the short sketch at the end of this list). Outside R, the most widely-used free tool for diversity analysis is EstimateS. Much of the same functionality is contained in SPADE, written by Anne Chao (who has a number of other free programs on her website). I’ve always found the latter to be a little buggy, but it’s also reliably updated with the very latest methods. It has more recently been converted into an R package, SpadeR, which has an accessible webpage that will do all the analyses for you. As a final mention, there is good commercial software available from Pisces Conservation, but apart from a cleaner-looking interface I’ve never seen any advantage to using it.

GIS — I’ll be returning to the issue of making maps in a later post, but will mention here that a direct replacement for the expensive ArcGIS is the free QGIS. I’ve never found any functionality lacking, but then I’m not a serious GIS user either. There are plenty of R packages which in combination cover the same range of functions, but I wouldn’t like to make recommendations.

Macroecology — SAM (for Spatial Analysis in Macroecology) is a useful tool for quickly loading and inspecting patterns in spatial ecological data. I would personally still move into R for publication-grade analyses, but this can be a helpful stepping stone when exploring a new dataset.

Null models — these can be very useful in community ecology. The only time I’ve done this, I used the free version of EcoSim. I see that you now have to pay for the full version, so if someone can recommend a comparable R package in the comments then I’ll update this accordingly.
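
As promised above, here’s a taste of vegan (assuming the package is installed; BCI is a tree-count dataset that comes bundled with it):

    library(vegan)
    data(BCI)                          # counts of tree species in 50 forest plots

    diversity(BCI, index = "shannon")  # Shannon diversity for each plot
    specnumber(BCI)                    # observed species richness per plot
    estimateR(BCI[1:5, ])              # Chao1 and ACE richness estimators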

I’m happy to extend this list with further recommendations; please drop a note in the comments.

Further reading

Practical Computing for Biologists is a great book. A little knowledge goes a long way, and learning how to use the shell, regular expressions and a small amount of Python will soon reap dividends for your research, whatever stage you’re at.


* The most mathematically-inclined biologists might hanker after something more like MATLAB, for which a direct free replacement is GNU Octave. You can even transfer MATLAB programs across, although there are some minor differences in the language.

** Normal forum protocol applies here, which is that you shouldn’t ask a question to which you could reasonably have found an answer by searching for yourself. If you ask a stupid question that implies no effort on your part then you can expect a curt answer (or none at all).  That said, if you really can’t work something out then it’s well worth bringing up because you might be the first person to spot an issue. If your problem is an interesting one then often you’ll find yourself receiving support from some of the top names in the field, so long as you are willing to learn and engage. Please read the posting guide before you start.

*** A few years ago a graduate student declined my advice to use R, declaring in my office that if R was so good, someone would be charging for it. I was taken aback, perhaps because I take the logic of Free Open-Source Software for granted. If you’re unsure, the main benefit is that the source code is free to obtain and modify. This means that someone has almost certainly created a specific tool to meet your research needs. Proprietary commercial software is aimed at the market and the average user, whereas open-source software can be tweaked and modified. The reason R is so powerful is that it’s used by so many people, many of whom are actively developing new tools and bringing them directly to your computer. Often these are published in the Journal of Statistical Software or, more recently, Methods in Ecology and Evolution.

 

11 thoughts on “Free software for biologists pt. 2 – data management and analysis”

  1. ScientistSeesSquirrel

    Markus – another great and very useful piece. I would quibble with one small thing – with respect to data management software, you say “Don’t waste time producing summaries, figures or reports when this can be done more efficiently using proper tools”. I would say this: it’s incredibly easy in Excel (say) to make a quick scatterplot or similar, and this is a really good way to (1) spot some kinds of data-entry errors, and (2) make sure you have an intuitive grasp of your data before heading over into R (or whatever). That way, if you get an unexpected result, you have some idea whether the problem lies in data management or in analysis. But I suspect you really meant presentation-purposed figures or analyses. Here I agree: your data-management tool isn’t for that.

    1. Markus Eichhorn Post author

      Thanks Stephen — horses for courses I suppose, and there is always an advantage to familiarity. I’d still contend that typing plot(A ~ B) in R isn’t particularly onerous, and there are other commands such as dotchart() that produce much more informative figures than the Excel defaults. This paper is one of my favourites and gives loads of tips on efficient data exploration in R: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2009.00001.x/abstract

  2. atiretoo

    Nice list and I look forward to the rest of the feature! One question I’m wrestling with on the data management side — how to actually ENTER the data into the system? This is a step where something like Microsoft Access forms can do a lot of validation to ensure the data are what they purport to be (like insisting on date-time field formatting and using dropdown lists for categorical variables). Some of that can also be done in Excel and presumably Calc, but it isn’t as powerful as doing the entry directly into a relational database. But, Access isn’t available on a Mac (unless you use a virtual machine). So what is the equivalent to Access forms for data entry on MySQL or SQLite?

    1. Markus Eichhorn Post author

      Thanks for your comment. You’re right that data entry is a missing step in these posts. There’s already a great collection of apps for field data collection on the Bruna Lab webpage: http://brunalab.org/apps/. Otherwise I just type stuff directly into a very basic spreadsheet. Drop-down lists for each item sound painfully slow.

      I wouldn’t expect my data processing software to detect variable types or perform validation for me. If your ultimate goal is something that can be loaded into R then you will need to save it as a .csv or .txt file anyway, which means keeping the data as simple as possible. R is very good at automatically detecting variable types and the str() command gives a concise summary of what it thinks a dataframe contains. This allows you to check and modify as necessary. My workflow would therefore be (1) get the data into a .csv and load it into R as quickly as possible, (2) inspect with str() to see what has actually loaded, (3) identify errors, (4) go back to a text editor like jEdit to fix them, then (5) reload in R.

      1. atiretoo

        The dropdown box is very fast if set up correctly, and avoids sooooo many typos (trailing spaces, capitalization variation etc) that I think it is worth it. Putting some validation on the front end really reduces your steps 3 and 4. R is good at recognizing data types except (IMHO) dates and times, and people enter them in so many different ways as to be incomprehensible.

        Generally I agree with R for inspection / exploration. But if data is in an SQL database then you can pull it out quickly into a data.frame without any fuss. Best of both worlds.

        But, horses for courses, as you say!

  3. Mike

    To add to your list of editors with an R interface, on Windows I’d recommend Notepad++ with the nppToR plugin, which supports syntax highlighting and auto-completion – but as you say, there are plenty of alternatives!
