IAG review of BIG
Rod Page
On Monday and Tuesday, 14-15 April, the MBL at Woods Hole hosted the first review of EOL’s Biodiversity Informatics Group (BIG). This meeting was a chance for the Informatics Advisory Group (IAG; sorry, there are still more acronyms to come) to hear about progress to date, and where BIG wanted to go next. Chris Freeland (from BHL) has posted some photos of the meeting on Flickr, which give a sense of the number of people involved: members of BIG and IAG, together with representatives from BHL, BioSynC, the Steering Committee, and interested observers. What the photos can’t convey is the spirited nature of the discussion, which made the two days hugely enjoyable.
It was my task as chair of the IAG to try and condense the detailed reports given by BIG, and the subsequent discussions, into a written report. That has now been done, and the result presented to BIG. In this post I will summarise two key areas, namely content and vetting. The report also addressed topics such as the site design, globally unique identifiers, and organisational matters, but I think content and vetting are the two that generated the most debate.
Content
Now that EOL is live and people have had a chance to look around, it is striking that 76% of visitors don’t return, and 44% of all visitors leave in under 10 seconds. After the initial launch where, if anything, EOL was too popular, interest seems to have dropped off markedly. One possible reason for this is the relative lack of content. As I noted elsewhere, for many pages EOL compares unfavourably with other sites, such as David Stang’s ZipcodeZoo, or my own mashup iSpecies. EOL’s current strategy has been to limit its content to “vetted” information from trusted providers. For 24 exemplar taxa EOL provides relatively detailed information, but for the rest of life the content it currently displays is pretty sparse.
The challenge is how to cover all life in reasonable detail. If we take the well-worn estimate of 1.8 million described species, and EOL’s 10 year time frame, then BIG needs to add around 500 species pages per day! Doing this without massive automation simply won’t scale. Assembling the 24 exemplar pages required considerable effort, yet simple aggregators such as iSpecies can generate a roughly similar level of detail within seconds. The diagram below compares pages for Anolis carolinensis (an EOL exemplar taxon) in EOL (left, or go here) and iSpecies (right, or go here). The iSpecies account is assembled automatically on the fly from sources such as GenBank, GBIF, Google Scholar, Yahoo Images, and Wikipedia.
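As a back-of-envelope check of the pages-per-day figure above, here is a minimal sketch. It assumes the 1.8 million estimate and the 10 year time frame quoted in the text, and (my assumption) that pages are added every day of a 365-day year.

```python
# Rough check of the "around 500 species pages per day" figure.
DESCRIBED_SPECIES = 1_800_000  # well-worn estimate quoted in the post
YEARS = 10                     # EOL's stated time frame
DAYS_PER_YEAR = 365            # assumption: pages are added every day, no breaks

pages_per_day = DESCRIBED_SPECIES / (YEARS * DAYS_PER_YEAR)
print(f"{pages_per_day:.0f} species pages per day")  # prints "493 species pages per day"
```

Restricting the work to weekdays would push the figure higher still, which only strengthens the case for automation.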

EOL is a long term project, and hence it may seem unfair to judge it so soon after it has been launched (and after Herculean efforts by BIG staff). However, given EOL’s current lack of content, and the existence of other web sites (such as ZipcodeZoo, DiscoverLife, and iSpecies) that already serve a much greater amount of information, my concern is that EOL risks being marginalised. I don’t think that EOL has anything like 10 years in which to prove itself.
How to add content quickly
For the report I prepared a cartoon plotting the cost of obtaining content against the amount of content obtained. “Costs” are in terms of developer time to import data (are they in a standard form, or a format unique to the provider), and time spent negotiating intellectual property agreements (such as how to display credit and attribution information, how the data will look, etc). At the bottom right (1) are large, freely available data sources such as GenBank, GBIF, and Wikipedia. At the left (2) are small sources that require tools to make their content available. In the middle (3) are well-established data providers that can require considerable effort to incorporate into EOL, due to both IPR issues and idiosyncratic data structures. The dotted line is an arbitrary cutoff, above which the effort required to obtain content outweighs the value that content would bring to EOL.

The report recommends going after content in category 1 first. These sources have massive amounts of data that are freely available and relatively easy to import. They include PubMed, GenBank, Wikipedia, ITIS, Flickr, and GBIF. As I noted on the iSpecies blog, GenBank records often contain metadata about organismal distribution, habitats, and ecological associations, which could be harvested. There are communities on Flickr building photo libraries of organisms, often tagged with scientific name and geographic location (e.g., Field Guide: Birds of the World). Harvesting these sources will provide considerable initial content for EOL. Of course, not all sources in this category are of comparable quality. GenBank and PubMed are publicly funded, curated archives of scientific research; Wikipedia and Flickr are not.
Category 2 is next, and this is where we need tools to enable smaller providers to manage their own content, and contribute to EOL at the same time. This content would be targeted by “LifeDesks” (similar to the scratchpads being developed at the Natural History Museum, London).
Content in category 3 may have high scientific value, but in the short term the effort involved in incorporating it may outweigh the value it brings.
It’s perhaps a glib phrase, but I’m reminded of the genius of “and” versus the tyranny of “or”. Harvesting resources in category 1 is not an argument against also going after resources in category 2; it’s a question of priorities. In the same way, tools developed for category 2 providers may well facilitate acquiring content from category 3 sources.
Vetting
The issue of “vetting” generated much discussion during the review meeting. It became clear that this term can mean different things:
- data that is error free (“correct”)
- data provided by scientific sources (“scientifically authenticated”)
- data that has been verified by experts
No data source is without error, so EOL will inevitably include erroneous information. Currently the bulk of its data comprise distribution maps from GBIF, which are known to contain errors. For example, some 16% of legume records are incorrect (doi:10.1371/journal.pone.0001124). The GBIF map below shows numerous erroneous records of the North American channel catfish (Ictalurus punctatus) in China.

At the scale at which EOL operates (hundreds of millions of items of information), manually vetting all information before it is displayed is not feasible, and indeed by displaying GBIF maps EOL tacitly acknowledges this.
Of course, EOL wants to be an authoritative resource (in other words, more than a simple mashup); hence one of its biggest challenges is to develop methods to catch errors. Innovative methods of annotation will need to be developed. Human Computation (see also Luis von Ahn’s talk at Google) is one approach, recently used by Google’s Image Labeler to annotate web images. BIG will need to develop easy-to-use interfaces so that EOL users can annotate data and flag possible errors. These annotations should be publicly visible, so that users who take the trouble to make annotations get instant feedback, and other users can see which records are contested.
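To make "methods to catch errors" a little more concrete, here is a minimal sketch of one automated check: flag occurrence records that fall outside a rough bounding box for a species' expected range, so that they can be queued for human annotation. This is my own illustration, not something proposed in the IAG report. The `Occurrence` class, the `EXPECTED_RANGE` table, the bounding box, and the two sample points are all hypothetical, and a flagged record is only a candidate for review, not a proven error.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    """A single occurrence record (hypothetical, simplified structure)."""
    record_id: str
    species: str
    lat: float
    lon: float


# Hypothetical, deliberately crude range boxes:
# species -> (min_lat, max_lat, min_lon, max_lon). Illustrative values only.
EXPECTED_RANGE = {
    "Ictalurus punctatus": (20.0, 55.0, -125.0, -65.0),
}


def flag_out_of_range(records, ranges=EXPECTED_RANGE):
    """Return the records lying outside their species' expected bounding box."""
    flagged = []
    for rec in records:
        box = ranges.get(rec.species)
        if box is None:
            continue  # no range on file, so nothing to check against
        min_lat, max_lat, min_lon, max_lon = box
        inside = min_lat <= rec.lat <= max_lat and min_lon <= rec.lon <= max_lon
        if not inside:
            flagged.append(rec)
    return flagged


# Made-up sample points, not real GBIF records.
records = [
    Occurrence("a1", "Ictalurus punctatus", 38.6, -90.2),  # central North America
    Occurrence("a2", "Ictalurus punctatus", 30.6, 114.3),  # China
]
flagged = flag_out_of_range(records)
print([r.record_id for r in flagged])  # prints ['a2']
```

A check this crude will also flag legitimate records, such as introduced populations, which is exactly why its output should feed a publicly visible annotation queue rather than delete anything automatically.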
Summary
A project on the scale of EOL is bound to take some time to settle in, and initial expectations were never going to be met, hence the generally underwhelmed reaction in the blogosphere (myself included). There is much to do, and the overall theme of the IAG report is that EOL needs more content, fast, and needs to tackle the issue of vetting in a way that will scale.

May 2nd, 2008 at 5:49 am
Go on!!! EOL will not have any competitor. This web will be The Reference for all the investigators and nature lovers.
We will be patient!
May 8th, 2008 at 12:52 pm
I think (hope?) my site (http://mushroomobserver.org) falls into category 2. I strongly agree with what you are saying about allowing users to annotate the data. This is crucial and in my opinion should be the number 1 priority. It is great to have scientifically vetted data, but it is abundantly clear that is not the same as error free. When someone sees data that they know is wrong and has no effective way to fix it, they walk away. The key is to make it clear where the data is coming from. Ultimately this comes down to a rating system for individual contributors. My own site allows for both annotation and voting. The votes are weighted based on each user’s contribution to the site. This is far from perfect, but I believe it is the direction systems like this have to go to really be successful. The next thing I want to add in this regard is a mechanism for users to recommend other users. This allows the community to both develop its own experts and acknowledge existing experts that join the community but have not had the time to develop their own reputation.
The other key feature for category 2 (and category 3) developers is to provide clear interfaces that the developers of those sites can hook into to do the work for you. I would love to make the data I’ve been collecting available to EOL, but I have no way to do it.
May 13th, 2008 at 1:23 pm
Nathan, I couldn’t agree more. Take GBIF, for example, which is a wonderful resource, but full of errors. It is possible to get things fixed (see my Fixing GBIF post), but it is not straightforward. It has to be made much, much easier. I particularly like your idea of a mechanism where existing users can recommend other users, notably those that haven’t yet acquired a reputation via the site, but who have an existing reputation.
In terms of making your data available, I suggest contacting David Shorthouse at EOL.