
Free software for biologists pt. 5 – operating systems

If you’ve made it this far in the series then you’ll have already explored software for writing, analysing data, preparing figures and creating presentations, many of which are designed explicitly with scientists in mind. You’re clearly interested in learning how to make your computer work better, which is great. If you’re willing to do this then why not take the natural next step and choose an operating system for your computer which is designed with the scientific user in mind?

Put simply, Windows is not the ideal operating system for scientific computing. It takes up an unnecessarily large amount of space on your hard drive, uses your computer’s resources inefficiently, and slows down everything else you’re trying to do*. Ever wondered why you have to keep updating your anti-virus software, and worry about attachments or executable files? It’s because Windows is so large and unwieldy that it’s full of back-doors, loopholes and other vulnerabilities. You are not safe using Windows.**

What should you do? Macs are superior (and pretty), but they’re also expensive, and free software solutions are preferable. The alternative is to install a Linux operating system. If this sounds intimidating, bear in mind that if you own a smartphone you may already be a Linux user: Android is built on Linux. Many games consoles such as the PlayStation, along with TVs and other devices, also run on Linux. Do you own a Chromebook? Linux. You’ve probably been a Linux user for some time without realising it.


I have no idea why the Linux mascot is a penguin. It just is.

If you’re coming from Windows then you can get an operating system that looks and behaves almost identically. Try the popular Linux Mint or Mageia, which both offer complete desktops with many of the programs listed in earlier posts pre-installed. Mint is based on Ubuntu, another common distribution, though Ubuntu’s default desktop environment will take a few days to get used to. The best thing about Ubuntu and its derivatives is the vast support network: whatever problem you come across, however basic, a quick web search will show you how to resolve it in seconds.


Your Linux Mint desktop could look like this sample image from their website. See, Linux isn’t so intimidating after all.

Unlike Windows, all these distributions are free to download, easy to install, and everything works straight out of the box. Within a week you will be able to do everything you could do on Windows. Within two weeks you will be realising some of the benefits. Like any change, it takes a little time to get used to, but the investment is worth it. There are literally thousands of operating systems, each tailored to a particular group of users or devices. Rather than getting confused by them all, try one of the major distributions first, which offer plenty of support for beginners. Once you know what you need you can seek out an operating system that is specifically tailored for you (or, if you’re really brave, create one).


Yeah, I know, DistroWatch.com may not look like the most exciting website in the world, but it does contain download links to every Linux OS you could imagine, and many more.

It’s possible to boot many of these distributions from a DVD or even a USB stick. This means you can try them out and see whether they suit before taking the plunge and installing them on your hard drive (remembering to back all your files up first, of course). If it doesn’t work out then take the DVD out of the computer and all will return to normal. An alternative, once you’ve set it up, is VirtualBox, which allows you to run a different distribution inside your existing operating system.

If you have an old computer which appears to have slowed down to a standstill thanks to all the Windows updates and is not capable of running the newer versions, don’t throw it away! This is exactly what manufacturers want you to do, and is why it’s not in their interests to have an efficient operating system. Making your computer obsolete is how they make more money. Try installing one of the smaller operating systems designed for low-powered computers like elementaryOS. You will get years more use out of your old hardware. A really basic OS like Puppy Linux will run even on the most ancient of computers, and if all you need to do are the basics then it might be good enough.

My preferred operating system is Arch, which has a more accessible derivative, Manjaro, for moderately experienced users. Arch itself isn’t recommended for beginners though, so try one of the above first. Why bother? Well, there’s an old adage among computer geeks that ‘if it isn’t broken, break it’. You learn a lot by having to build your OS from the ground up, making active decisions about how your computer is going to work and fixing mistakes as you go along. I won’t pretend that it saves time, but there is a satisfaction to it***, even if it means having to remember to update the kernel manually every now and again. One of its best features is the Arch User Repository, which contains a vast array of programs and tools, all a quick download away.


Behold the intimidating greyness that I favour on my laptop, mainly to minimise distractions, which is one of the advantages of the OpenBox window manager. Files and links on the desktop just stress me out.

As with every other article in this series, I’ve made it clear that you will need to spend a little time learning to use new tools in order to break out of your comfort zone. In this case there are great resources both online and from more traditional sources, such as the magazine Linux Format, which is written explicitly with the general Linux user in mind. You might outgrow it after a few years but it’s an excellent entry point. If you’re going to spend most of your working life in front of a computer then why not learn to use it properly?

With that, my series is complete. Have I missed something out? Made a catastrophic error? Please let me know in the comments!


* To be fair to Microsoft, Windows 10 is much better in this regard. That said, if you don’t already have it then you’ll need to pay for an upgrade, which is unnecessary when there are free equivalents.

** If you think I’m kidding, and you’re currently on a laptop with an integral camera, read this. Then go away, find something to cover the camera, and come back. You’re also never completely safe on other operating systems, but their baseline security is much better. For the absolutely paranoid (or if you really need privacy and security), try the TAILS OS.

*** Right up until something goes snap when you need it most. For this reason I also have a computer in the office that runs safe, stable Debian, which is valued by many computer users for its reliability. It will always work even when I’ve messed up my main workstation.


Free software for biologists pt. 4 – presentations

This post is going to strike a slightly different note to previous pieces on software tools for writing, handling data and preparing figures. In each of those I emphasised the advantages of breaking away from the default proprietary software shipped with the average PC and exploring bespoke options designed for scientists. In the case of giving talks or lectures, I’m going to argue for the complete opposite position: it’s not so much what you use, but how you use it.

When delivering a talk, the slides that accompany it are visual aids. I’ve emphasised that term because its meaning has been lost through repetition. The key word is aids. The slides are there to support and enhance the understanding of the audience, and to back up what you say. They are not supposed to be the focus of attention. The slides are not your notes*.

What’s more, slides cause problems more often than they dramatically improve a talk. An ideal talk is one where the audience receive the message without anything getting in the way. How many times have you walked out of a conference talk thinking ‘great slides’? Perhaps never. On the other hand, how many times have you seen a perfectly good talk ruined by a distracting display or computing failure?** For me, that’s at least once a session.

With this in mind, I recommend starting to plan a talk with a simple question: do you need to have any slides at all? Yes, I know, I’ve just challenged the default assumption of almost every conference presenter these days. But I’m absolutely serious. Start by thinking about what you are going to tell the audience, in normal speech, while they look directly at you and listen to what you say. If you can convey all the information you need to without slides (or by using other visual aids, such as props or exhibits) then there is no obligation to have any.

Next ask yourself what elements would benefit from being presented visually as well. Note that I’m explicitly trying not to write the talk around the slides, but the visual aids around the talk. Once again there might be no need for slides — you could work through equations or models by sketching them on a blackboard. Nevertheless, for certain types of information, slides are the best means to present them. Data figures, photographs, diagrams, maps and so on are going to need to be put up on the big screen. Note that none of these involve much text, if any.

When you start from that perspective, the software you choose to prepare your slides should be the one that permits you to most clearly present your figures without distracting clutter.


Slides are there to help the audience understand your points, not to replicate the talk. Only include the bare minimum of text and be prepared to walk your audience through the details.

With this in mind, PowerPoint is fine for producing lecture slides, and easy to use. The main challenge is changing all the default settings to be as plain and simple as possible, and resisting the temptation to use features that only serve to distract the audience from your intended content (animations, background images, sound effects). These should be used sparingly, and only if they improve the transmission of information***. Remember: slides are there to inform, not to entertain. If you don’t want to pay for PowerPoint then the free LibreOffice Impress will do all the same things and serves as a direct replacement.

An online alternative is Slides (slides.com), which adds the neat trick of allowing remote control of presentations from a second computer or your mobile phone. It’s free for basic users, but if you want to download a copy of the presentation or collaborate with a colleague then a subscription is required. Under the hood it’s built on the open-source reveal.js framework, which you can use directly for free if you’re willing to write a little code.

 

If you’re using LaTeX then an alternative is the beamer document class. powerdot appears to do the same thing but I’ve never used it. The usual caveat about LaTeX applies — if you’re not already using it for everything then the time investment for presentations alone won’t be worth it. I have also yet to find a way to embed videos directly into slides.


All my slides are prepared in LaTeX using the intridea beamer theme. I like the look of them, but it takes time and expertise to set up. You could achieve something similar with much less effort.

One good reason to move away from Powerpoint or its analogues is frequency-dependent selection. You can stand out from the crowd simply by virtue of using something different. By the end of the first day of a meeting people are already suffering from Powerpoint fatigue, which makes anything else a pleasant relief.

 

To really change style and impress your audience, try Prezi. This is a different way of visualising your talk, and some time investment is required to get it right. As with PowerPoint, there are many tricks and decorations that can be inserted, but they will only distract from the information you’re trying to get across. In particular, try to minimise use of the ‘swooping’ movement, which can induce nausea in your audience.

The two main disadvantages of Prezi are that you need to be connected to the internet to use it, and that the free version requires your presentation to be visible online. The first is seldom an issue, while the latter only matters if what you’re showing is somehow private or confidential, and if so then why are you presenting it at all?

In general I don’t submit posters at conferences, though there are many good reasons to choose a poster over a talk, and a lot of guidance on how to do it well. I’m not going to repeat this because I have nothing to add, but also because I have no personal experience to draw from, and can’t therefore recommend any particular software.


* This is true for most public, professional presentations. Lectures for undergraduate students are a different matter though, at least within my experience. Many students now assume that the slides are the notes, and expect to be able to reconstruct the material from these alone. Some lecturers provide printouts of slides as their handouts. You can debate whether this means you should include more material on your slides to serve this function, or make a stand, expect students to take their own notes, and risk complaints.

** Many years ago — long enough for the scars to have healed — a collaborator of mine presented her work at a major international conference. It was a hot topic, and the theatre was packed. We had gone through the talk together the previous night on her laptop and I’d not seen any problems. But on the day it turned into a nightmare. For some unknown reason, every animation (in Powerpoint terms, that means lines or other elements appearing on the screen) was accompanied by a sound effect. Distorted by the conference room speakers it was transformed into something akin to the bellow of a caged animal. This happened every time she clicked, all the way through the talk. Even worse, none of the videos worked. Her evident mortification was met by the awkward, sympathetic unease of the audience. Everyone remembered that talk, though not for the right reasons.

*** A good general rule is: can I save it as a PDF file with no loss of features? If you can then do so; not only are PDFs smaller, they’re also more stable, and guaranteed to look identical on whatever computer you end up using. If there are features that would be lost then think carefully about whether you really need them.

Barnacles are much like trees

I am not a forest ecologist. OK, that’s not entirely true, as demonstrated by the strapline of this blog and the evidence on my research page. Nevertheless, having published papers on entomology, theoretical ecology and snail behaviour (that’s completely true), I’m not just a forest ecologist. Having now published a paper on barnacles, one could suspect that I’m having an identity crisis.

When a biologist is asked what they work on, the answer often depends on the audience. On the corridor that hosts my office, neighbouring colleagues might tell a generally-interested party that they work on spiders, snails, hoverflies or stickleback. Likewise, I usually tell people that I work on forests. When talking to a fellow ecologist, however, the answer is completely different, as it would be for every one of the colleagues mentioned above*.

If you walked up to me at a conference, or met me at a seminar, I would probably say that I work on spatial self-organisation in natural systems. If you were likely to be a mathematician or physicist** then I’d probably claim to study the emergent properties of spatially-structured systems. I might follow this up by saying that I’m mostly concerned with trees, but that would be a secondary point.

What I and all my colleagues have in common is that we are primarily interested in a question. The study organism is a means to an end. We might love the organism in question, rear them in our labs, grow them in our glasshouses, spend weeks catching or watching them in the field, learn the fine details of their taxonomy, or even collect them as a hobby… but in the end it is the fundamental question that drives our work. The general field of study always takes priority when describing your work to a fellow scientist.


Behold the high-tech equipment used to survey barnacles. This is the kind of methodology a forest ecologist can really get behind.

The work on barnacles was done by a brilliant undergraduate student, Beki Hooper, for her final-year project***. The starting point was the theory of spatial interactions among organisms most clearly set out by Iain Couzin in this paper****. His basic argument is that organisms often interact negatively at short distances: they compete for food, or territorial space, or just bump into one another. On the other hand, interactions at longer ranges are often positive: organisms are better protected against predators, able to communicate with one another, and can receive all the benefits of being in a herd. Individuals that get too close to one another will move apart, but isolated individuals will move closer to their nearest neighbour. At some distance the trade-off between these forces will result in the maximum benefit.

Iain’s paper was all about vertebrates, and his main interest has been in the formation of shoals of fish or herds of animals (including humans). I’m interested in sessile species, in other words those that don’t move. Can we apply the same principles? I would argue that we can, and in fact, I’ve already applied the same ideas to trees.

What about barnacles? They’re interesting organisms because, although they don’t move as adults, to some extent they get to choose where they settle. Their larvae drift in ocean currents until they reach a suitable rock surface to which they can cling. They then crawl around and decide whether they can find a good spot to fix themselves. It’s a commitment that lasts a lifetime; get it wrong, and that might not be a long life.

If you know one thing about barnacles, it’s probably that they have enormously long penises for their size. Many species, including acorn barnacles, require physical contact with another individual to reproduce. This places an immediate spatial constraint on their settlement behaviour: more than 2.5 cm from another individual and they can’t mate, which is potentially disastrous. Previous studies have focussed on settling rules based on this proximity principle. Settling near neighbours also brings protection from exposure and predators. On the other hand, settle too close to another barnacle and you run the risk of being crushed, pushed off the rock, or having to compete for other resources.


Barnacles can be expected to interact negatively at short distances, but positively at slightly longer distances. This disparity in the ranges of interactions gives rise to the observed patterning of barnacles in nature.

 

What Beki found was that barnacles are most commonly found just beyond the point at which two barnacles would come into direct contact. They cluster as close as they possibly can, even to the point of touching, and even though this will have the side effect of restricting their growth.

Furthermore, Beki found that dead barnacles had more neighbours at that distance than would be expected by chance, and that particularly crowded patches contained more dead barnacles. Together these results are evidence that the pattern is structured by a trade-off: barnacles benefit from being close together, but not too close.


On the left, the pattern of barnacles in a 20 cm quadrat. On the right, the weighted probability of finding another barnacle at increasing distance from any individual. A random pattern would have a value of 1. This shows that at short distances (less than 0.30 cm) you’re very unlikely to find another barnacle, but the most frequent distance is 0.36 cm. Where it crosses the line at 1 is where the benefits of being close exceed the costs.

Hence the title of our paper: too close for comfort. Barnacles deliberately choose to settle near to neighbours, even though this carries risks of being crowded out. The pattern we found was exactly that which would be expected if Iain Couzin’s model of interaction zones were determining the choices made by barnacles.
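For anyone curious to explore this kind of pattern themselves, below is a minimal sketch of how a statistic like this (a pair correlation function) can be computed in R with the spatstat package. To be clear, this is an illustration rather than the analysis from the paper, and the coordinate file is invented.

    # A rough sketch, not the published analysis: pair correlation function
    # for mapped barnacle positions using the spatstat package.
    library(spatstat)

    # Hypothetical x/y coordinates (cm) of barnacles in a 20 x 20 cm quadrat
    barnacles <- read.csv("barnacle_positions.csv")   # columns: x, y

    # Build a point pattern object on a 20 x 20 cm observation window
    pp <- ppp(barnacles$x, barnacles$y, window = owin(c(0, 20), c(0, 20)))

    # Pair correlation function g(r): values below 1 indicate repulsion at
    # that distance, values above 1 indicate clustering, and 1 is random
    g <- pcf(pp)
    plot(g)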

When trees disperse their seeds, they don’t get to decide where they land, they just have to put up with it. The patterns we see in tree distributions therefore reflect the mortality that takes place as they grow and compete with one another. This is also likely to take place in barnacles, but the interesting difference lies in the early decision by the larvae about where they settle.

Where do we go from here? I’m now developing barnacles as an alternative to trees for studying self-organisation in nature. The main benefit is that their life cycles are much shorter than trees, which means we can track the dynamics year-by-year. For trees this might take lifetimes. We can also scrape barnacles off rocks and see how the patterns actually assemble in real time. Clearing patches of forests for ecological research is generally frowned upon. The next step, working with Maria Dornelas at St. Andrews, will be to look at what happens when you have more than one species of barnacle. Ultimately we’re hoping to test these models of how spatial interactions can allow species to coexist. Cool, right?

The final message though is that as an ecologist you are defined by the question you work on rather than the study organism. If barnacles turn out to be a better study system for experimental tests then I can learn from them, and ultimately they might teach me to understand my forests a little bit better.


 

* Respectively: Sara Goodacre studies the effects of long-range dispersal on population genetics; Angus Davison the genetic mechanisms underpinning snail chirality; Francis Gilbert the evolution of imperfect mimicry; Andrew MacColl works on host-parasite coevolution. I have awesome colleagues.

** I’ve just had an abstract accepted for a maths conference, which will be a first for me, and slightly terrifying. I’ve given talks in mathematics departments before but this is an entirely new experience.

*** Beki is now an MSc student on the Erasmus+ program in Evolutionary Biology (MEME). Look out for her name, she’s going to have a great research career. Although I suspect that it won’t involve barnacles again.

**** Iain and I once shared a department at Leeds, many years ago. He’s now at Princeton. I’m in the East Midlands. I’m not complaining…

Free software for biologists pt. 3 – preparing figures

So far we’ve looked at software tools for handling and analysing data and for writing. Now it’s time to turn to the issue of making figures.

Early in my career, I wish someone had taken me to one side and explained just how important figures are. Too often I see students fretting over the text, reading endless reams of publications out of concern that they haven’t cited enough, or cited the right things. Or fine-tuning their statistical analyses far beyond the point at which it makes any meaningful difference. And yet when it comes to the figures, they slap something together using default formatting, almost as an afterthought.

Having recently written a textbook (shameless plug), I’ve had it brought home to me just how crucial figures are to whether your work will get used and cited*. The entry criterion for a study being used in a book isn’t necessarily the quality of science, volume of data or clarity of expression, though I would argue that all of these are high in the best papers. What really sets a paper apart is its figures. Most of us, when we read papers, look at the pictures, and often make a snap judgement based on those. If the figures are no good then the chances of anyone wading through your prose to pick out the gems of insight will be substantially reduced.

Here then is a useful rule of thumb: you should spend at least one working day preparing each figure in a manuscript. That’s after collecting and analysing the data, and after doing a first-pass inspection of the output. A whole day just fine-tuning and making sure that each final figure is carefully and concisely constructed. You might not do it all in one sitting; you may spend 75% of the time trying out multiple formats before settling on the best one. All this is time well spent. And if you’re going to put the time into preparing them then you should look into bespoke software that will improve the eventual output.


Easy to use does not mean good quality! Comic by XKCD.

Presenting statistical outputs

If you’ve been following this series of posts then it will come as no shock that I don’t recommend any of Microsoft’s products for scientific data presentation. The default options for figures in Excel are designed for business users and are unsuitable for academic publication. Trying to reformat an Excel figure so that it is of the required quality is a long task, and one that has to be repeated from scratch every time**. Then saving it in the format most journals require (a .tiff or .eps file) is even less straightforward. As an intermediate option, and for those who wish to remain in Excel, Daniel’s XL Toolbox is a free add-in providing analysis and presentation tools that improve its usefulness for scientists.

Needless to say, this is all easier in R with a few commands and, once you’ve figured it out, you can tweak and repeat with minimal effort (the ggplot2 package is especially good). The additional investment in learning R will be rewarded. In fact, I’d go so far as to say that R is worth the effort for preparing figures alone. No commercial product will offer the same versatility and quality.
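To give a flavour, here’s a minimal sketch of that workflow in ggplot2; the data file and column names are invented, and the dimensions and resolution are examples rather than any journal’s actual requirements.

    # A minimal sketch (hypothetical data and column names) of a plain,
    # publication-style figure in ggplot2, exported in common journal formats.
    library(ggplot2)

    dat <- read.csv("biomass.csv")   # columns: treatment, biomass

    p <- ggplot(dat, aes(x = treatment, y = biomass)) +
      geom_boxplot() +
      labs(x = "Treatment", y = expression(Biomass~(g~m^-2))) +
      theme_classic(base_size = 12)   # plain theme: no gridlines or grey panel

    # Export at print resolution; rerunning the script reproduces the figure
    ggsave("fig1.tiff", p, width = 84, height = 84, units = "mm", dpi = 600)
    ggsave("fig1.eps",  p, width = 84, height = 84, units = "mm")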


Here’s one I made earlier, showing foliage profiles in 40 woodlands across the UK. Try creating that in Excel.

One of the reasons I recommend ggplot2 is that it is designed to follow the principles of data presentation outlined in Edward Tufte’s seminal book The Visual Display of Quantitative Information. It’s one of those books that people get evangelical about. It will change the way you think about presenting data, and forms the basis for the better scientific graphing tools.


What do you mean you haven’t read it? OK, you don’t have to, but it will convince you that data can be aesthetically pleasing as well as functional.

If you’re not an R user then a good alternative is the trusty gnuplot. Older readers can be forgiven for shedding a nostalgic tear, as this is one of the ancient software tools from the pre-internet age, having been around for about 30 years. It lives on, and has been continually maintained and developed, making it just as useful today as it was then.

A colleague pointed me towards D3.js, which is a JavaScript library that manipulates documents based on data input. I haven’t played with it but it might be an option for those who want to quickly generate standardised and reproducible reports.

Finally, if your main aim is to plot equations, then Octave is a free alternative to the commercial standard MATLAB. Only the most mathematical of biologists will want to use this though.

Diagrams

Some people try to produce diagrams using PowerPoint. No. Don’t do it. They will invariably look rubbish and unprofessional.

For drawing scientific diagrams, the class-leader is the fearsomely expensive Adobe Illustrator. Don’t even consider paying for your own license though because the free Inkscape will do almost everything you’ll ever need, unless you’re a professional graphic designer, in which case someone else is paying. Another free option is sK1 which has even more technical features should you need them. Xara Xtreme may have an awful name but it’s in active development and looks very promising. It’s also worth mentioning LibreOffice Draw, which comes as part of the standard LibreOffice installation.

One interesting tool I’m itching to try is Fiziko, a MetaPost script for preparing black-and-white textbook illustrations that mimic the appearance of blocky woodcuts or ink drawings. It looks like some effort and experience are required to use it though.

Image editing

The expensive commercial option is Photoshop, which is so ubiquitous that it has even become its own verb. For most users the free GIMP program will do everything they desire. I also sometimes use ImageMagick for image transformation, but mostly the command-line tool sam2p. Metadata attached to image files can be read and edited with ExifTool.

A common task in manuscripts is to create a simplified vector image, perhaps using a photo as a template. You might need to draw a map, show the structure of an organ or demonstrate an animal’s behaviour. For this there are specialist tools like Blender, Cheetah3D for Mac users or SketchUp, though the latter only offers a limited version for free download. Incidentally, never use a raster image editor (like Photoshop) to trace an image. All you end up with is a simplified pixel image of the original, which looks terrible. Plus you’ve paid for Photoshop.

For the rather specialised task of cropping and assembling documents from pdf files, briss might be an ancient piece of software but it’s still the go-to application.

Preparing outline maps (e.g. of study sites) is a common task and an expensive platform like ArcGIS is unnecessary. Luckily the free QGIS is almost as good and improving rapidly. There’s a guide to preparing maps here.


A map showing the study site in a forthcoming paper (Hooper & Eichhorn 2016), prepared by Jon Moore in QGIS.

There are countless programs out there for sorting, handling and viewing photographs (e.g. digiKam, Shotwell). Not being much of a photographer I’m not a connoisseur.

Flowcharts

Flowcharts, organisational diagrams and other images with connected elements can be created in LibreOffice Draw. I’ve not used it for this though, and therefore can’t compare it effectively to commercial options like OmniGraffle, which is good but expensive for something you might not be doing regularly. A LaTeX-based option such as TikZ is my usual choice, and infinitely better than spending ages trying to get boxes to snap to a grid in PowerPoint. If you’re not planning to put the time into learning LaTeX then this is no help, but add it to the reasons why you might. If anyone knows of a particularly good FOSS solution to this problem then please add it in the comments and I will update this post.


I made this in TikZ to illustrate the publication process for my MSci class in research skills. I won’t lie, it took a long time (even as a LaTeX obsessive), and I’d like to find a more efficient means of creating these figures.

Animations

This is one task that R makes very easy. Take the output of a script that creates multiple PNG files from a loop and bundle them into an animation using QuickTime or the very straightforward FFmpeg. For something that looks so impressive, especially in a presentation, it’s surprisingly easy to do.
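As a rough illustration (file names and the plotting command are placeholders for whatever your own script produces), the loop looks something like this, with FFmpeg run afterwards from the command line:

    # A minimal sketch: write one PNG per time step, then stitch the frames
    # together outside R. The plotting call stands in for your real figure.
    for (i in 1:100) {
      png(sprintf("frame_%03d.png", i), width = 800, height = 600)
      plot(rnorm(50), rnorm(50), xlim = c(-3, 3), ylim = c(-3, 3),
           main = paste("Time step", i))
      dev.off()
    }

    # Then, at the command line, something like:
    #   ffmpeg -framerate 10 -i frame_%03d.png animation.mp4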

Collecting data

To collect data from images ImageJ is by far the best program, largely due to the immense number of specialist plug-ins. Some of these have been collected into a spin-off called Fiji, which provides a great set of tools for biologists. Whatever you need to do, someone has almost certainly written a plug-in for it. Note that R can also collect data from images, and even interfaces with ImageMagick via the EBImage package. JPEGs can be loaded with the ReadImages package and TIFF files with rtiff.
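As a hedged illustration of the R route, here’s a minimal sketch using EBImage from Bioconductor; the image file and threshold are invented, and real images will need more careful processing.

    # A minimal sketch: simple measurements from an image with EBImage.
    library(EBImage)

    img  <- readImage("leaf_scan.jpg")
    gray <- channel(img, "gray")      # convert to greyscale

    mask   <- gray < 0.5              # crude threshold; tune for your images
    labels <- bwlabel(mask)           # label connected objects

    # Area, perimeter and so on for each object, in pixels
    shapes <- computeFeatures.shape(labels)
    head(shapes)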

A common task if you’re redrawing figures, or preparing a meta-analysis, is to extract data from figures. This is especially common when trying to obtain data from papers published before the digital age, or when the authors haven’t put their original data online. For this, Engauge will serve your needs.

Next time: how to prepare presentations!


* At some point in the pre-digital age, maybe in the 90s, I recall an opinion piece by one textbook author making exactly this point. Was it Lawton, Krebs, Southwood… I really can’t remember. If anyone can point me in the right direction then I’d be grateful because I can’t track it down.

** I did overhear one very prominent ecologist declare only half-jokingly that they stopped listening to talks if they saw someone present an Excel figure because it indicated that the speaker didn’t know what they were doing. Obviously I wouldn’t advocate such an extreme position, but using Excel does send a signal, and it’s not a good one.

Free software for biologists pt. 2 – data management and analysis

This is the second part of a five-part series, collated here. Having covered writing tools in the last post, this time I’m focussing on creating something to write about.

Data management

Let’s assume that you’ve been out, conducted experiments or sampling regimes, and returned after much effort with a mountain of data. As scientists we invest much thought into how best to collect reliable data, and also in how to effectively analyse it. The intermediate stage — arranging, cleaning and processing the data — is often overlooked. Yet this can sometimes take as long as collecting the data in the first place, and specialist tools exist to make your life easier.

I’m not going to dwell here on good practices for data management; for that there’s an excellent guide produced by the British Ecological Society which says more than I could. The principles of data organisation are well covered in this paper by Hadley Wickham. Both are on the essential reading list for students in my group, and I’d recommend them to anyone. Instead my focus here is on the tools you can use to do it.

The familiar Microsoft Excel is fine for small datasets, but struggles with large spreadsheets, and if you’ve ever tried to load a sizeable amount of data into it then you’ll know that you might as well go away to make a cup of tea, come back and hope it hasn’t crashed. This is a problem with Excel, not your data. Incidentally, LibreOffice Calc is the free substitute for Excel if you want a straight replacement. Don’t even consider using either of them to do statistics or draw figures (on which there will be more next time). I consider this computational limitation more than enough reason to look elsewhere, even though there are many official and unofficial plug-ins which extend Excel’s capabilities. Excel can also reformat your data without you knowing about it.

One of the main functions missing from Excel is grep-style searching with regular expressions. Regular expressions are powerful search patterns that allow you to screen data, check for errors and fix problems. Learning how to use them properly will save all the time you used to spend scrolling through datasheets looking for problems until your mind went numb. Proper text editors provide this functionality. Personally I use jEdit to manage my data, which is available free for all operating systems. Learning to work with a .csv or .txt file rather than a conventional box-format spreadsheet takes a little time but soon becomes routine.
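To give a taste of what regular expressions can do, here’s a minimal sketch in R; the file and column names are invented.

    # A minimal sketch of regex-based checking and cleaning in R.
    dat <- read.csv("survey.csv", stringsAsFactors = FALSE)

    # Flag species names containing anything other than letters, spaces or dots
    suspect <- grepl("[^A-Za-z. ]", dat$species)
    dat[suspect, ]

    # Tidy up: collapse repeated spaces and strip leading/trailing whitespace
    dat$species <- trimws(gsub(" +", " ", dat$species))

    # Find plot codes that don't match an expected pattern such as "A01"
    bad_plots <- !grepl("^[A-Z][0-9]{2}$", dat$plot)
    table(bad_plots)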

For larger, linked databases, Microsoft Access used to be the class-leader. Later versions have sacrificed functionality for accessibility, leading many people to seek alternatives. Databases are queried and managed using SQL (Structured Query Language), and learning to use Access compels you to pick up the basics of this anyway. Given that, starting with a free alternative is no more difficult. I have always found MySQL to be easy and straightforward, but some colleagues strongly recommend SQLite. It might not have all the functions of the larger database tools, but most users won’t notice the difference. Most importantly, a database written in standard SQL can be transferred between any of these tools with little loss of function. Migrating into (or out of) Access is trickier.
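If you work in R anyway, the DBI and RSQLite packages let you talk to an SQLite database directly from your scripts. A minimal sketch, with invented file, table and column names:

    # A minimal sketch of using an SQLite database from R via DBI and RSQLite.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "fieldwork.sqlite")

    # Load a cleaned csv into a table (only needed once)
    quadrats <- read.csv("quadrats.csv")
    dbWriteTable(con, "quadrats", quadrats, overwrite = TRUE)

    # The same SQL would work in MySQL or any other SQL database
    counts <- dbGetQuery(con, "
      SELECT site, COUNT(*) AS n_quadrats
      FROM quadrats
      GROUP BY site
    ")

    dbDisconnect(con)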

As a general rule, your data management software should be used for that alone. The criterion for choosing what software to use is that it should allow you to clean your data and load it into an analysis platform as quickly and easily as possible. Don’t waste time producing summaries, figures or reports when this can be done more efficiently using proper tools.

Data analysis

These days no-one looks further than R. As a working environment it’s the ideal way to load and inspect data, carry out statistical tests, and produce publication-quality figures. Many people — including myself — do pretty much all their data processing, analysis and visualisation in R*.

It’s interesting to note just how rapidly the landscape has changed. As an undergraduate in the 90s we were taught using Minitab. For my PhD I did all my statistics in SPSS, then as a post-doc I transitioned to GenStat. All are perfectly decent, serviceable solutions for basic statistical analyses. Each has its limitations but moving between them isn’t difficult.

I won’t hide the simple truth — learning R is hard, especially if you have no experience of programming. Why then did I bother? The simple answer is that R can do everything that all the above programs can do, and more. It’s also more efficient, reproducible and adaptable. Once you have the code to do a particular set of analyses you can tweak, amend and reapply at will. Never again do you have to work through a lengthy menu, drag-and-drop variables, tick the right boxes and remember the exact sequence for next time. Once a piece of code is written, you keep it.
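If you haven’t seen what ‘keeping the code’ means in practice, here’s a trivial sketch; the file and variable names are invented, but the point is that rerunning these few lines repeats the entire analysis whenever the data change.

    # A minimal sketch of a script you write once and keep.
    growth <- read.csv("seedling_growth.csv")   # columns: light, height

    model <- lm(height ~ light, data = growth)  # simple linear regression
    summary(model)

    plot(height ~ light, data = growth)
    abline(model)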

If you’re struggling then there are loads of websites providing advice to all levels from beginners to experienced statistical programmers. It’s also worth looking at the excellent books by Alain Zuur which I can’t recommend highly enough. If you have a problem then a quick internet search will usually retrieve an answer in no time, while the mailing lists are filled with incredibly helpful people**. The other great thing about R is that it’s free***.

One word of warning is to not dive too deep at the beginning. Start by replicating analyses you’re already familiar with, perhaps from previous papers. The Quick-R page is a good entry point. A bad (but common) way of beginning with R is to be told that you need to use a particular analytical approach, and that R is the only way to do it. This way leads at best to frustration, at worst to errors. If someone tells you to use approximate Bayesian inference via integrated nested Laplace approximation, then you can do it with the R-INLA package. The responsibility is still on you to know what you’re doing though; don’t expect someone to hold your hand.

Because R is a language rather than a program, the default environment isn’t very easy to work in, and you’re much better off using another program to interface with R. By far the most widely used is RStudio, and it’s the one I recommend to my own post-graduate students. It will improve your R coding experience immensely; some programmers use it for almost everything. An alternative is Tinn-R, which I used to use but gave up on a few years ago because it was too buggy. It may have improved since, so by all means try it out. If you’re desperate for a familiar-looking graphical user interface with menus then R Commander provides one, but I recommend using this as a gateway to learning more (or for teaching students) rather than as a long-term solution.

I’m a bit old-fashioned and prefer to use a traditional text editor to work in R. My choice, for historical reasons, is Emacs, which links neatly to R through ESS. The other tribe of programmers use Vim with the sensibly-named Vim-R-plugin, and we shall speak no more of them. If you’re already a programmer then you know about these, and can be assured that you can code in R just as easily. If not then stick to RStudio, which is much easier. I also often use Geany as a tool for making quick edits to scripts.

Most of all, don’t type your commands directly into the R console: it’s a recipe for disaster, and it throws away R’s greatest advantage, which is reproducibility. Likewise, don’t keep a Word document of R commands open and continually copy-and-paste them across. I’ve seen many students doing this, and it’s only recommended if you want to speed the onset of repetitive strain injury. Word will also keep reformatting and autocorrecting your text, introducing errors. Use a proper editor and running a whole script is done in one click.

One issue with R that more experienced users will come across is that it is relatively slow at processing very large datasets or large numbers of files. This is a problem that relatively few users will encounter, and by that point most will be competent programmers. In these cases it’s worth learning one of the major programming languages for file handling. Python is the easiest to pick up, for which Rosalind provides a nice series of scaled problems for learning and teaching (albeit with a bioinformatics focus). Serious programmers will know of or already use C, which is more widespread and has greater power. Finding out how to use a Bash shell efficiently is also immensely helpful. Learning to program in these other languages will open many doors, including to alternative careers, but is not essential for most people.

As a final aside, there is a recent attempt to link the power of C with the statistical capabilities of R in a new programming language called Julia. This is still in early development but is worth keeping an eye on if statistical programming is likely to become a major feature of your research.

Specialist software tools

Almost everything can be done in R, and anything that can’t be done yet can be programmed. That said, there are some bespoke free software tools worth mentioning as they can be of great use to ecologists. They’re also valuable for those who prefer a GUI (graphical user interface) and aren’t ready to move over to a command-line tool just yet. Where I know of them, I’ve mentioned the leading R packages too.

Diversity statistics — the majority of people now use the vegan package in R (a short example appears after this list). Outside R, the most widely-used free tool for diversity analysis is EstimateS. Much of the same functionality is contained in SPADE, written by Anne Chao (who has a number of other free programs on her website). I’ve always found the latter to be a little buggy, but it’s also reliably updated with the very latest methods. It has more recently been converted into an R package, SpadeR, which has an accessible web page that will do the analyses for you. As a final mention, there is good commercial software available from Pisces Conservation, but apart from a cleaner-looking interface I’ve never seen any advantage to using it.

GIS — I’ll be returning to the issue of making maps in a later post, but will mention here that a direct replacement for the expensive ArcGIS is the free QGIS. I’ve never found any functionality lacking, but I’m not a serious GIS user either. There are a plethora of R packages which in combination cover the same range of functions but I wouldn’t like to make recommendations.

Macroecology — SAM (for Spatial Analysis in Macroecology) is a useful tool for quickly loading and inspecting patterns in spatial ecological data. I would personally still move into R for publication-grade analyses, but this can be a helpful stepping stone when exploring a new dataset.

Null models — these can be very useful in community ecology. The only time I’ve done this, I used the free version of EcoSim. I see that you now have to pay for the full version, so if someone can recommend a comparable R package in the comments then I’ll update this accordingly.

I’m happy to extend this list with further recommendations; please drop a note in the comments.
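To illustrate the vegan option mentioned above, here’s a minimal sketch of some standard diversity calculations; it assumes a sites-by-species matrix of counts, and the file name is invented.

    # A minimal sketch of common diversity calculations in vegan.
    library(vegan)

    comm <- read.csv("community_matrix.csv", row.names = 1)   # sites x species counts

    specnumber(comm)                      # species richness per site
    diversity(comm, index = "shannon")    # Shannon diversity per site
    estimateR(comm)                       # Chao1 and ACE richness estimators

    # Sample-based species accumulation across the whole dataset
    plot(specaccum(comm))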

Further reading

Practical Computing for Biologists is a great book. A little knowledge goes a long way, and learning how to use the shell, regular expressions and a small amount of Python will soon reap dividends for your research, whatever stage you’re at.


* The most mathematically-inclined biologists might hanker after something more like MATLAB, for which a direct free replacement is GNU Octave. You can even transfer MATLAB programs across, although there are some minor differences in the language.

** Normal forum protocol applies here, which is that you shouldn’t ask a question to which you could reasonably have found an answer by searching for yourself. If you ask a stupid question that implies no effort on your part then you can expect a curt answer (or none at all).  That said, if you really can’t work something out then it’s well worth bringing up because you might be the first person to spot an issue. If your problem is an interesting one then often you’ll find yourself receiving support from some of the top names in the field, so long as you are willing to learn and engage. Please read the posting guide before you start.

*** A few years ago a graduate student declined my advice to use R, declaring in my office that if R was so good, someone would be charging for it. I was taken aback, perhaps because I take the logic of Free Open-Source Software for granted. If you’re unsure, then the main benefit is that it’s free to obtain and modify the original code. This means that someone has almost certainly created a specific tool to meet your research needs. Proprietary commercial software is aimed at the market and the average user, whereas open-source software can be tweaked and modified. The reason R is so powerful is that it’s used by so many people, many of whom are actively developing new tools and bringing them directly to your computer. Often these will be published in Journal of Statistical Software or more recently Methods in Ecology and Evolution.

 

Free software for biologists pt. 1 – writing tools

This is the first in a planned series of five posts, to cover (1) writing tools, (2) data management and analysis, (3) preparing figures, (4) writing presentations and (5) choosing a new operating system. They will eventually be collated here.

Document-writing tools

Microsoft Word remains the default word processing software for the majority of people. Its main advantage is precisely that ubiquity, which makes collaboration relatively straightforward. The track changes function is appreciated by many people, though I would argue it’s unnecessary and can lead to problems; see below for tips on collaborative writing.

If you’re going to be spending a large proportion of your life writing then Word is not the ideal solution, especially for scientists. On this point it’s worth making clear that ‘scientist’ is just another word for ‘writer’. We write constantly — papers, grant proposals, lecture notes, articles and books. Professional writers use other commercial software such as Scrivener; this however is just paying for something different. Microsoft Word has improved in recent years, but there are still problems. The main limitations are:

  • It’s terrible at handling large documents (e.g. theses, or anything more than a couple of pages). Do you really need to do all that scrolling?
  • Including equations or mathematical script is difficult and always looks poor quality.
  • Embedded images are reproduced at low resolution.
  • Files are unnecessarily large in size.
  • The .docx format is very unstable. Send it to a collaborator on another computer (even with Windows) and it will appear different, with mangled formatting.
  • The default appearance doesn’t look very professional, and improving it takes forever.
  • It keeps reformatting everything as you go along, particularly when you combine sections from different documents.

I didn’t realise how much time was spent fighting Word’s defaults until I tried other software. Escaping isn’t tricky, as this blog post reveals. Several options are available to the scientific writer, and will improve both the quality and the experience of writing.

LibreOffice Writer. Want something that looks exactly like Microsoft Word, does everything that Word does, but don’t fancy paying for it? Just download LibreOffice and you’ll find it works equally well (if not better). This is perhaps the best option if you have an out-of-date or bootlegged version of Word and can’t access updates. With LibreOffice you will be able to open, edit and share all of your existing Word documents, and even save them in .doc format. The native format is .odt (for open document text). This is recommended as a stable document format by the British Government, which tells you something. Your Word-using colleagues will be able to open them as well.

Markdown. This has grown in popularity with scientists as it’s easier to use than professional tools such as LaTeX (see below) but provides many of the document-formatting features that scientists need. You can even write Markdown in Word, but why would you? Combining it with pandoc makes it even more powerful, because you can convert a Markdown document into almost any other format to match the requirements of a journal (or your collaborators). This is much easier than doing the same with LaTeX, which requires some programming nous. A good, free Markdown editor is ReText.
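For R users there’s a convenient route: the rmarkdown package provides a thin wrapper around pandoc (pandoc itself comes bundled with RStudio, or can be installed separately). A minimal sketch with invented file names; calling pandoc directly from the command line does exactly the same job.

    # A minimal sketch: converting a Markdown manuscript with the pandoc
    # wrapper in the rmarkdown package.
    library(rmarkdown)

    # Word format for a .docx-only journal or collaborator
    pandoc_convert("manuscript.md", to = "docx", output = "manuscript.docx")

    # PDF (via LaTeX) for your own records; the format is inferred from the
    # output file extension
    pandoc_convert("manuscript.md", output = "manuscript.pdf")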

LaTeX. The gold standard, as used by many professional writers and editors (it’s pronounced lay-tech; the final letter is a chi). All my handouts are prepared in LaTeX, as are my presentations, manuscripts, in fact pretty much everything I write apart from e-mails. The problem is that learning LaTeX takes time. Most word processor programs run on the principle of WYSIWYG (What You See Is What You Get), whereas in LaTeX you need to explicitly state the formatting as you go along.

There are a number of gateway programs which allow you to write in LaTeX but with a more familiar writing environment. These therefore ease the transition and can show you the potential. I know many people who swear by LyX. My preferred editor is Kile, though this will involve a steeper learning curve. A great help while writing in LaTeX is to be able to see what the document looks like as you write. I pair Kile with Okular, but there are many other options that are equally good.

As a health warning, before diving into the deep end, bear in mind that working in LaTeX will initially be much slower. It takes time to become competent, and there are annoying side issues that remain frustrating (installing new fonts, for example, is bizarrely complex). While the majority of journals and publishers accept LaTeX submissions, and most will provide a template to format your manuscripts, there are still a few who require .doc format. This is changing though due to demand on the part of authors.

Collaborative writing

In the old days, when you collaborated on writing a paper, it required dozens of e-mails to be sent round as each author added her comments. Version control became impossible as soon as there were multiple copies and it was easy to lose track. Some people persist in working this way despite the fact that there are loads of tools that make this unnecessary. By using an online collaborative-writing site, multiple authors can contribute simultaneously, and you can even chat to each other while you’re at it.

The best-known is of course Google Docs, which has the virtue of a familiar interface. It’s not designed for scientific writing though, and unsurprisingly there are more specific tools out there. While I’ve not used it, Fidus Writer looks like a promising option, with a layout familiar from Google Docs but better suited to the demands of science writing.

The one I’ve used most often is Authorea, which has the major advantage that anyone can write in any style and on any platform. This means that one person can write the technical parts in LaTeX while another adds sections in Markdown, or you can cut-and-paste text from a normal word processor. The final document can be exported in your format of choice. This solves the problem of all your collaborators needing to use the same software. My favoured option (for LaTeX users only) is ShareLaTeX, though writeLaTeX looks to be equally good.

I haven’t mentioned GitHub here, even though I know many people who use it to maintain version history in collaborative work. This is particularly true of programmers who need to trace changes in code as it’s being developed. The same functionality can be very helpful in writing manuscripts, but GitHub is not easy to learn, and it’s rare in biology that you will find yourself working with a pool of collaborators who know what they’re doing.

As a final note, I discourage the use of tracked changes due to many bad experiences. The main issue is that once more than one person has commented on a document it gets completely mangled, and it can take a long time to reconstruct the flow of the text once all the contradictory changes have been accepted. Furthermore, if your reason for having a WYSIWYG processor is that you want to see how the final document will look, then tracked changes remove that benefit and make your document unreadable. Lastly, whenever I’ve been forced into using them (on one notable occasion by a journal editor) it has invariably introduced errors into the text. By using some of the software recommended here there should be no need for the track changes function at all.

References and citations

The standard for reference management used to be EndNote, which is an expensive solution if you don’t have either an institutional license or a student discount. Much the same can be said of Papers, which I hear is excellent but have never used.

I strongly recommend Mendeley to all my students. Think of it as iTunes for papers. It’s free and integrates smoothly with all the word processing software above. Even better is the online functionality which means you can synchronise documents across all your devices, including a commenting function, and share with colleagues. So you can read a PDF on the train, make notes on it, then open your office computer and retrieve all the notes straight away before dropping the citation directly into your manuscript. There are many tutorials online and the few hours you spend learning to use it will be rewarded by much time saved. Apparently Zotero, which is also free, offers similar functionality, but I’ve not tried it.

Having said all that, I don’t use Mendeley. If you’re using LaTeX then citing references is done through BibTeX, and I prefer kBibTeX to manage my reference library as it integrates nicely with Kile. This is only a personal choice though, and Mendeley would achieve the same result.

 

In praise of backwards thinking

What is science? This is a favourite opening gambit of some external examiners in viva voce examinations. PhD students, be warned! Imagine yourself in that position, caught off-guard, expected to produce some pithy definition that somehow encompasses exactly what it is that we do.

It’s likely that in such a situation most of us would jabber something regarding the standard narrative progression from observation to hypothesis then testing through experimentation. We may even mumble about the need for statistical analysis of data to test whether the outcome differs from a reasonable null hypothesis. This is, after all, the sine qua non of scientific enquiry, and we’re all aware of such pronouncements on the correct way to do science, or at least some garbled approximation of them.* It’s the model followed by multiple textbooks aimed at biology students.

Pause and think about this in a little more depth. How many great advances in ecology, or how many publications on your own CV, have come through that route? Maybe some, and if so then well done, but many people will recognise the following routes:

  • You stumble upon a fantastic data repository. It takes you a little while to work out what to do with it (there must be something…) but eventually an idea springs to mind. It might even be your own data — this paper of mine only came about because I was learning about a new statistical technique and remembered that I still had some old data to play with.
  • In an experiment designed to test something entirely different, you spot a serendipitous pattern that suggests something more interesting. Tossing away your original idea, you analyse the data with another question in mind.
  • After years of monitoring an ecological community, you commence descriptive analyses with the aim of getting something out of it. It takes time to work out what’s going on, but on the basis of this you come up with some retrospective hypotheses as to what might have happened.

Are any of these bad ways to do science, or are they just realistic? Purists may object, but I would say that all of these are perfectly valid and can lead to excellent research. Why is it then that, when writing up our manuscripts, we feel obliged — or are compelled — to contort our work into a fantasy in which we had the prescience to sense the outcome before we even began?

We maintain this stance despite the fact that most major advances in science have not proceeded through this route. We need to recognise that descriptive science is both valid and necessary. Parameter estimation and refinement often has more impact than testing a daring new hypothesis. I for one am entranced by a simple question: over what range do individual forest trees compete with one another? The question is one that can only be answered with an empirical value. To quote a favourite passage from a review:

“Biology is pervaded by the mistaken idea that the formulation of qualitative hypotheses, which can be resolved in a discrete unequivocal way, is the benchmark of incisive scientific thinking. We should embrace the idea that important biological answers truly come in a quantitative form and that parameter estimation from data is as important an activity in biology as it is in the other sciences.” (Brookfield 2010)


Over what distance do these Betula ermanii trees in Kamchatka compete with one another? I reckon around three metres but it’s not straightforward to work that out. That’s me on the far left, employing the most high-tech equipment available.

It might appear that I’m creating a straw man of scientific maxims, but I’m basing this rant on tenets I have received from reviewers of manuscripts and grant applications, or been given as advice in person. Here are some things I’ve been told repeatedly:

  • Hypotheses should precede data collection. We all know this is nonsense. Take, for example, the global forest plot network established by the Center For Tropical Forest Science (CTFS). When Steve Hubbell and Robin Foster set up the first 50 ha plot on Barro Colorado Island, they did it because they needed data. The plots have led to many discoveries, with new papers coming out continuously. Much the same could be said of other fields, such as genome mapping. It would be absurd to claim that all the hypotheses should have been known at the start. Many people would refine this to say that the hypothesis should precede data analyses (as in most of macroecology) but that’s still not the way that our papers are structured.
  • Observations are not as powerful as experiments. This view is perhaps shifting with the acknowledgement that sophisticated methods of inference can draw patterns out of detailed observations. Take, for example, this nice paper using Bayesian analyses of a global dataset of tropical forests to discern the relationship between wood density and tree mortality. Ecologists frequently complain that there isn’t enough funding for long-term or large-scale datasets to be produced; we need to demonstrate that they are just as valuable as experiments, and recognising the importance of post-hoc explanations is an essential part of making this case. Perfect experimental design isn’t the ideal metric of scientific quality either; even weak experiments can yield interesting findings if interpreted appropriately.
  • Every good study should be a hypothesis test. We need to get over this idea. Many of the major questions in ecology are not hypothesis tests.** Over what horizontal scales do plants interact? To my mind the best element of this paper by Nicolas Barbier was that they determined the answer for desert shrubs empirically, by digging them up. If he’d tried to publish using that as the main focus, I doubt it would have made it into a top ecological journal. Yet that was the real, lasting contribution.

Still wondering what to say when the examiner turns to you and asks what science is? My answer would be: whatever gets you to an answer to the question at hand. I recommend reading up on the anarchistic model of science advocated by Paul Feyerabend. That’ll make your examiner pause for thought.


* What I’ve written is definitely a garbled approximation of Popper, but the more specific and doctrinaire one gets, the harder it becomes to achieve any form of consensus. Which is kind of my point.

** I’m not even considering applied ecology, where a practical outcome is in mind from the outset.

EDIT: added the direct quotation from Brookfield (2010) to make my point clearer.