Archive for May, 2008

New Members of the Informatics Team

David Shorthouse
Friday, May 30th, 2008

Jonathan Clapp
Software Developer

clapp.jpg
I grew up on Cape Cod and have joined the EOL Informatics Group as a software developer. My experience is in database and web application development.
I will help ensure that the foundations of the Encyclopedia of Life are as solid as possible while allowing flexibility in the future. I have followed this important project for some time and am thrilled to be contributing to its success. It has the potential to be a great resource for learning about and advancing the conservation of the myriad species on earth.

Vitthal Kudal
Software Developer

vitthal.jpg
I am working as a Software Developer with the informatics group of the Encyclopedia of Life. I have a Master’s degree in Computer Science from the University of Pune, India. I have worked for the NCL center for Biodiversity Informatics (NCBI) as a Project Assistant. My dream is to make all species data available over the internet at a single command from the user, a dream I hope to fulfill by working with the EOL Group. What has it been like working at EOL? Interesting, inspiring, insightful, impactful, fun and more such words.

Jeremy Rice
Software Developer

rice.jpg
Working with the Encyclopedia of Life is the realization of a long-time dream for me. I love being at a university, advancing research. The Encyclopedia’s vision to synthesize all information about life present on Earth… that makes it something really essential for me. I’m joining the team with over ten years of experience developing a variety of applications. What really drives me is turning abstract ideas into working products that people appreciate. There’s an abundance of ideas here, and I hope that we can produce some amazing tools to facilitate them…

Dimitry Mozzherin
Software Developer

dimitry.jpg
I was born in Russia, and from my early school years I wanted to become a biologist and a wildlife photographer. I happened to become both later. At some point I started to learn programming languages, and after discovering the Open Source movement I decided to make programming my profession. And I am now at EOL because here I can express my passion for wildlife and my passion for developing Free Software at the same time!

Anne Thessen
Post-doctoral Investigator

thessen.jpg
I’m working on data mobilization for EOL and the International Census of Marine Microbes. Lots of biological data can be found on the printed page, which must be read to retrieve information. I’m trying to find ways to make this information easier to retrieve and use. Prior to joining the EOL team, I worked on Arctic primary production, toxin-producing diatoms and shellfish grazing.

SOS - State of Observed Species

Rod Page
Thursday, May 29th, 2008

Arizona State University’s “International Institute for Species Exploration” has released its first State of Observed Species Report. It reports that 16,969 new species were discovered in 2006 (approximately 46 species per day). Not surprisingly, most are insects:

sos.png

SOS have also published a list of the “top 10” species described in 2007.

2008_01th.jpg 2008_02th.jpg 2008_03th.jpg 2008_04th-1.jpg 2008_05th.jpg
2008_06th.jpg 2008_07th.jpg 2008_08th.jpg 2008_09th.jpg 2008_10th.jpg

This list has attracted some comment at The Other 95%, Zooillogix, and Catalogue of Organisms.

These lists have implications for EOL. The report gives us a lower bound on the rate of new species description: EOL will need to be able to add at least 46 species pages a day just to keep pace with new discoveries, never mind what has already been described. It isn’t doing anything like this at present, and hence none of the species in the SOS top ten list are in EOL (most are already in Wikipedia, and all return at least some information in iSpecies).
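
As a quick check on the per-day figure, the arithmetic can be written out in a few lines of Python. This is only a sketch: the function name is made up for illustration, and the one input is the 2006 total quoted from the SOS report above.

```python
# Total quoted in this post from the SOS report (new species in 2006).
NEW_SPECIES_2006 = 16_969
DAYS_PER_YEAR = 365


def species_per_day(species_per_year: int, days: int = DAYS_PER_YEAR) -> float:
    """Average number of newly described species per day (illustrative helper)."""
    return species_per_year / days


rate = species_per_day(NEW_SPECIES_2006)
print(f"{rate:.1f} new species per day")
```

Rounded down, this gives the 46 species per day quoted in the post, which is the minimum number of new pages EOL would need each day just to stand still.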

iNaturalist

Rod Page
Saturday, May 17th, 2008

logo-1.gif
Ken-ichi Ueda told me about iNaturalist.org, a wonderful site he, Nathan Agrin, and Jessica Kline have created for their Master’s at UC Berkeley’s School of Information. To quote from the web site:

iNaturalist.org is a place where you can record what you see in nature, meet other nature lovers, and learn about the natural world.

It looks gorgeous (lots of Flickr Creative Commons photos), and makes good use of Wikipedia and the TimeMap JavaScript library.
arachnida.png

Arguably the species pages are clearer than EOL’s (compare Anolis carolinensis on iNaturalist and EOL). But what makes it especially cool is the way it engages users with the ability to add observations of organisms, and request identifications. I like the emphasis on being

…a fun and efficient way to record, find, and share nature observations.

I think it’s a great project that could provide useful ideas for the design of EOL’s pages.

Citizen science podcast

Rod Page
Thursday, May 15th, 2008

PodcastLogo.png
Jon Udell has a great podcast where he interviews Janis Dickinson, who directs the citizen science program at Cornell’s Laboratory of Ornithology. On his blog Jon writes:

Extracting signal from noise is, of course, one of the classic bread-and-butter activities of information science. What’s fascinating here is the Web 2.0 angle. Birdwatchers are famously passionate data collectors who develop reputations among their peers. When they contribute their data to eBird — and thence to the Avian Knowledge Network — those reputations can begin to be measured, and used to tune the analysis of a large body of contributed data.

These are, of course, issues directly relevant to EOL. Jon has long been interested in integrating information (including digital libraries), social networking, and how people interact with technology. His podcast is a mine of useful information. Click this link to subscribe to it in iTunes.

IAG review of BIG

Rod Page
Thursday, May 1st, 2008

2415336890_84744a837e_t.jpg
On Monday and Tuesday, 14-15 April, the MBL at Woods Hole hosted the first review of EOL’s Biodiversity Informatics Group (BIG). This meeting was a chance for the Informatics Advisory Group (IAG; sorry, there are still more acronyms to come) to hear about progress to date, and where BIG wanted to go next. Chris Freeland (from BHL) has posted some photos of the meeting on Flickr, which give a sense of the number of people involved: members of BIG and IAG, together with representatives from BHL, BioSynC, the Steering Committee, and interested observers. What the photos can’t convey is the spirited nature of the discussion, which made the two days hugely enjoyable.

It was my task as chair of the IAG to try and condense the detailed reports given by BIG, and the subsequent discussions, into a written report. That has now been done, and the result presented to BIG. In this post I will summarise two key areas, namely content and vetting. The report also addressed topics such as the site design, globally unique identifiers, and organisational matters, but I think content and vetting are the two that generated the most debate.

Content

Now that EOL is live and people have had a chance to look around, it is striking that 76% of visitors don’t return, and 44% of all visitors leave in under 10 seconds. After the initial launch where, if anything, EOL was too popular, interest seems to have dropped off markedly. One possible reason for this is the relative lack of content. As I noted elsewhere, for many pages EOL compares unfavourably with other sites, such as David Stang’s ZipcodeZoo, or my own mashup iSpecies. EOL’s current strategy has been to limit its content to “vetted” information from trusted providers. For 24 exemplar taxa EOL provides relatively detailed information, but for the rest of life the content it currently displays is pretty sparse.

The challenge is how to cover all life in reasonable detail. If we take the well-worn estimate of 1.8 million described species, and EOL’s 10-year time frame, then BIG needs to add around 500 species pages per day! Doing this without massive automation simply won’t scale. Assembling the 24 exemplar pages required considerable effort, yet simple aggregators such as iSpecies can generate a roughly similar level of detail within seconds. The diagram below compares pages for Anolis carolinensis (an EOL exemplar taxon) in EOL (left, or go here) and iSpecies (right, or go here). The iSpecies account is assembled automatically on the fly from sources such as GenBank, GBIF, Google Scholar, Yahoo Images, and Wikipedia.

ispecies.png
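
The pages-per-day figure quoted above is easy to verify. The sketch below assumes 1.8 million described species, a 10-year time frame, and 365-day years; the function and constant names are my own, chosen for illustration.

```python
DESCRIBED_SPECIES = 1_800_000  # the "well-worn estimate" of described species
TIME_FRAME_YEARS = 10          # EOL's stated time frame
DAYS_PER_YEAR = 365            # assumption: leap days ignored


def pages_per_day(total_pages: int, years: int) -> float:
    """Pages that must be added each day to finish within the time frame."""
    return total_pages / (years * DAYS_PER_YEAR)


required = pages_per_day(DESCRIBED_SPECIES, TIME_FRAME_YEARS)
print(f"about {required:.0f} species pages per day")
```

The result is a little under 500 pages per day, and that is before counting species that are newly described while the backlog is being cleared.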

EOL is a long term project, and hence it may seem unfair to judge it so soon after it has been launched (and after Herculean efforts by BIG staff). However, given EOL’s current lack of content, and the existence of other web sites (such as ZipcodeZoo, DiscoverLife, and iSpecies) that already serve a much greater amount of information, my concern is that EOL risks being marginalised. I don’t think that EOL has anything like 10 years in which to prove itself.

How to add content quickly

For the report I prepared a cartoon plotting the cost of obtaining content against the amount of content obtained. “Costs” are in terms of developer time to import data (are they in a standard form, or a format unique to the provider), and time spent negotiating intellectual property agreements (such as how to display credit and attribution information, how the data will look, etc). At the bottom right (1) are large, freely available data sources such as GenBank, GBIF, and Wikipedia. At the left (2) are small sources that require tools to make their content available. In the middle (3) are well-established data providers that can require considerable effort to incorporate into EOL, due to both IPR issues and idiosyncratic data structures. The dotted line is an arbitrary cutoff, above which the effort required to obtain content outweighs the value that content would bring to EOL.
content.png

The report recommends going after content in category 1 first. These sources have massive amounts of data that are freely available, and relatively easy to import. They include PubMed, GenBank, Wikipedia, ITIS, Flickr, and GBIF. As I noted on the iSpecies blog, GenBank records often contain metadata about organismal distribution, habitats, and ecological associations, which could be harvested. There are communities on Flickr building photo libraries of organisms, often tagged with scientific name and geographic location (e.g., Field Guide: Birds of the World). Harvesting these sources will provide considerable initial content for EOL. Of course, not all sources in this category are of comparable quality. GenBank and PubMed are publicly funded, curated archives of scientific research; Wikipedia and Flickr are not.

Category 2 is next, and this is where we need tools to enable smaller providers to manage their own content, and contribute to EOL at the same time. This content would be targeted by “LifeDesks” (similar to the scratchpads being developed at the Natural History Museum, London).

Content in category 3 may have high scientific value, but in the short term the effort involved in incorporating it may outweigh the value it brings.

It’s perhaps a glib phrase, but I’m reminded of the genius of “and” versus the tyranny of “or”. Harvesting resources in category 1 is not an argument against also going after resources in category 2; it’s a question of priorities. In the same way, tools developed for category 2 providers may well facilitate acquiring content from category 3 sources.

Vetting

The issue of “vetting” generated much discussion during the review meeting. It became clear that this term can mean different things:

  1. data that is error free (“correct”)
  2. data provided by scientific sources (“scientifically authenticated”)
  3. data that has been verified by experts

No data source is without error, so EOL will inevitably include erroneous information. Currently the bulk of its data comprise distribution maps from GBIF, which are known to contain errors. For example, some 16% of legume records are incorrect (doi:10.1371/journal.pone.0001124). The GBIF map below shows numerous erroneous records of the North American channel catfish (Ictalurus punctatus) in China.

At the scale at which EOL operates (hundreds of millions of items of information), manually vetting all information before it is displayed is not feasible, and indeed by displaying GBIF maps EOL tacitly acknowledges this.

Of course, EOL wants to be an authoritative resource (in other words, more than a simple mashup); hence one of its biggest challenges is to develop methods to catch errors. Innovative methods of annotation will need to be developed. Human Computation (see also Luis von Ahn’s talk at Google) is one approach, recently used by Google’s Image Labeler to annotate web images. BIG will need to develop easy-to-use interfaces so that EOL users can annotate data and flag possible errors. These annotations should be publicly visible, so that users who take the trouble to make annotations get instant feedback, and other users can see which records are contested.
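
To make the idea of publicly visible annotations a little more concrete, here is a minimal sketch of how a record and its annotations might be modelled. It is not a description of anything EOL has built: the class names, fields, and the example record are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """A single user comment on a record (hypothetical structure)."""
    user: str
    comment: str
    flags_error: bool = False  # True if the user thinks the record is wrong


@dataclass
class Record:
    """One item of information, e.g. a single occurrence point on a map."""
    record_id: str
    description: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, user: str, comment: str, flags_error: bool = False) -> None:
        # Annotations are stored on the record itself, so they can be
        # displayed to every visitor as soon as they are made.
        self.annotations.append(Annotation(user, comment, flags_error))

    @property
    def contested(self) -> bool:
        """A record is contested once any annotation flags a possible error."""
        return any(a.flags_error for a in self.annotations)


# Hypothetical example: a user flags an occurrence record for checking.
record = Record("example-1", "Ictalurus punctatus occurrence, China")
record.annotate("alice", "Outside the native North American range; please check.",
                flags_error=True)
print(record.contested)
```

The point of the design is simply that flagging is cheap for the user and its effect is visible straight away; how flags are weighed or resolved is a separate question.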

Summary

A project on the scale of EOL is bound to take some time to settle in, and initial expectations were never going to be met, hence the generally underwhelmed reaction in the blogosphere (myself included). There is much to do, and the overall theme of the IAG report is that EOL needs more content, fast, and needs to tackle the issue of vetting in a way that will scale.