Archive for the ‘Biodiversity Informatics’ Category

Code Sprint!

David Shorthouse
Thursday, July 17th, 2008

UPDATE! (July 23, 2008): Because of the short notice advertising, the number of participants will no doubt be too small to have a successful event. So, we have decided to postpone until early autumn. Stay tuned for updates.

The Encyclopedia of Life has been keenly interested in content management systems and social networking phenomena, especially in how these might benefit practicing taxonomists who are under pressure to get online. So, we have been getting serious about Drupal and want to make a stab at hosting sites called “LifeDesks” that as a start will focus on particular groups of organisms and will be similar to Scratchpads, a most amazing collection of Drupal-based sites hosted at the Natural History Museum in London, England. LifeDesks would sit just to the side of EOL, but will have the advantage of providing some extra, distinct visibility for participants while still feeling part of the EOL dream. A bit of work needs to be done on Drupal to make this work and we’re interested in sharing developments with the wider Drupal community.

So, although it’s short notice, we’re going to host a Drupal code sprint August 11-14 in Chicago, Illinois to kick off our relationship with Drupal. Please visit http://sprint.eol.org to see what we have in store and please also spread the word to any Drupal developers you know.

New Members of the Informatics Team

David Shorthouse
Friday, May 30th, 2008

Jonathan Clapp
Software Developer

I grew up on Cape Cod and join the EOL Informatics Group as a software developer. My experience is in database and web application development.
I will help ensure that the foundations of the Encyclopedia of Life are as solid as possible while allowing flexibility in the future. I have followed this important project for some time and am thrilled to be contributing to its success. It has the potential to be a great resource for learning about and advancing the conservation of the myriad species on earth.

Vitthal Kudal
Software Developer

I am working as a Software Developer with the informatics group of the Encyclopedia of Life. I have a Master’s degree in Computer Science from the University of Pune, India. I have worked for the NCL Centre for Biodiversity Informatics (NCBI) as a Project Assistant. My dream is to make all species data available over the internet at a single user command, and I hope to fulfill it by working with the EOL Group. What has it been like working at EOL? Interesting, inspiring, insightful, impactful, fun and more such words.

Jeremy Rice
Software Developer

Working with the Encyclopedia of Life is the realization of a long-time dream for me. I love being at a university, advancing research. The Encyclopedia’s vision to synthesize all information about life present on Earth… that makes it something really essential for me. I’m joining the team with over ten years of experience developing a variety of applications. What really drives me is turning abstract ideas into working products that people appreciate. There’s an abundance of ideas here, and I hope that we can produce some amazing tools to facilitate them…

Dimitry Mozzherin
Software Developer

I was born in Russia, and from my early school years I wanted to become a biologist and a wildlife photographer. I happened to become both later. At some point I started to learn programming languages and after discovering the Open Source movement I decided to make programming my profession. And I am now at EOL because here I can express my passion for wildlife and my passion for developing Free Software at the same time!

Anne Thessen
Post-doctoral Investigator

I’m working on data mobilization for EOL and the International Census of Marine Microbes. Lots of biological data can be found on the printed page, which must be read to retrieve information. I’m trying to find ways to make this information easier to retrieve and use. Prior to joining the EOL team, I worked on Arctic primary production, toxin-producing diatoms and shellfish grazing.

iNaturalist

Rod Page
Saturday, May 17th, 2008

Ken-ichi Ueda told me about iNaturalist.org, a wonderful site he, Nathan Agrin, and Jessica Kline have created for their Master’s at UC Berkeley’s School of Information. To quote from the web site:

iNaturalist.org is a place where you can record what you see in nature, meet other nature lovers, and learn about the natural world.

It looks gorgeous (lots of Flickr Creative Commons photos) and makes good use of Wikipedia and the TimeMap JavaScript library.

Arguably the species pages are clearer than EOL’s (compare Anolis carolinensis on iNaturalist and EOL). But what makes it especially cool is the way it engages users with the ability to add observations of organisms, and request identifications. I like the emphasis on being

…a fun and efficient way to record, find, and share nature observations.

I think it’s a great project that could provide useful ideas for the design of EOL’s pages.

Citizen science podcast

Rod Page
Thursday, May 15th, 2008

Jon Udell has a great podcast where he interviews Janis Dickinson, who directs the citizen science program at Cornell’s Laboratory of Ornithology. On his blog Jon writes:

Extracting signal from noise is, of course, one of the classic bread-and-butter activities of information science. What’s fascinating here is the Web 2.0 angle. Birdwatchers are famously passionate data collectors who develop reputations among their peers. When they contribute their data to eBird — and thence to the Avian Knowledge Network — those reputations can begin to be measured, and used to tune the analysis of a large body of contributed data.

These are, of course, issues directly relevant to EOL. Jon has long been interested in integrating information (including digital libraries), social networking, and how people interact with technology. His podcast is a mine of useful information. Click this link to subscribe to it in iTunes.

IAG review of BIG

Rod Page
Thursday, May 1st, 2008

On Monday and Tuesday, 14-15 April, the MBL at Woods Hole hosted the first review of EOL’s Biodiversity Informatics Group (BIG). This meeting was a chance for the Informatics Advisory Group (IAG; sorry, there are still more acronyms to come) to hear about progress to date, and where BIG wanted to go next. Chris Freeland (from BHL) has posted some photos of the meeting on Flickr, which give a sense of the number of people involved: members of BIG and IAG, together with representatives from BHL, BioSynC, the Steering Committee, and interested observers. What the photos can’t convey is the spirited nature of the discussion, which made the two days hugely enjoyable.

It was my task as chair of the IAG to try and condense the detailed reports given by BIG, and the subsequent discussions, into a written report. That has now been done, and the result presented to BIG. In this post I will summarise two key areas, namely content and vetting. The report also addressed topics such as the site design, globally unique identifiers, and organisational matters, but I think content and vetting are the two that generated the most debate.

Content

Now that EOL is live and people have had a chance to look around, it is striking that 76% of visitors don’t return, and 44% of all visitors left in under 10 seconds. After the initial launch where, if anything, EOL was too popular, interest seems to have dropped off markedly. One possible reason for this is the relative lack of content. As I noted elsewhere, for many pages EOL compares unfavourably with other sites, such as David Stang’s ZipcodeZoo, or my own mashup iSpecies. EOL’s current strategy has been to limit its content to “vetted” information from trusted providers. For 24 exemplar taxa EOL provides relatively detailed information, but for the rest of life the content it currently displays is pretty sparse.

The challenge is how to cover all life in reasonable detail. If we take the well-worn estimate of 1.8 million described species, and EOL’s 10-year time frame, then BIG needs to add around 500 species pages per day! Doing this without massive automation simply won’t scale. Assembling the 24 exemplar pages required considerable effort, yet simple aggregators such as iSpecies can generate a roughly similar level of detail within seconds. The diagram below compares pages for Anolis carolinensis (an EOL exemplar taxon) in EOL (left, or go here) and iSpecies (right, or go here). The iSpecies account is assembled automatically on the fly from sources such as GenBank, GBIF, Google Scholar, Yahoo Images, and Wikipedia.

ispecies.png
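
The pages-per-day figure quoted above is simple division. Here is a minimal sketch of the arithmetic, assuming 1.8 million described species, a 10-year time frame, and 365-day years (the function name is purely illustrative):

```python
def pages_per_day(described_species: int, years: int, days_per_year: int = 365) -> float:
    """Average number of species pages that must be added each day."""
    return described_species / (years * days_per_year)


# 1,800,000 species spread over 10 years of 365 days each.
rate = pages_per_day(1_800_000, 10)
print(f"{rate:.0f} pages per day")  # prints "493 pages per day", i.e. roughly 500
```

Halving the time frame to 5 years would double the required rate, which is one way to see why the timescale matters so much here.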

EOL is a long term project, and hence it may seem unfair to judge it so soon after it has been launched (and after Herculean efforts by BIG staff). However, given EOL’s current lack of content, and the existence of other web sites (such as ZipcodeZoo, DiscoverLife, and iSpecies) that already serve a much greater amount of information, my concern is that EOL risks being marginalised. I don’t think that EOL has anything like 10 years in which to prove itself.

How to add content quickly

For the report I prepared a cartoon plotting the cost of obtaining content against the amount of content obtained. “Costs” are in terms of developer time to import data (are they in a standard form, or a format unique to the provider), and time spent negotiating intellectual property agreements (such as how to display credit and attribution information, how the data will look, etc). At the bottom right (1) are large, freely available data sources such as GenBank, GBIF, and Wikipedia. At the left (2) are small sources that require tools to make their content available. In the middle (3) are well-established data providers that can require considerable effort to incorporate into EOL, due to both IPR issues and idiosyncratic data structures. The dotted line is an arbitrary cutoff, above which the effort required to obtain content outweighs the value that content would bring to EOL.
content.png

The report recommends going after content in category 1 first. These sources have massive amounts of data that are freely available, and relatively easy to import. These include PubMed, GenBank, Wikipedia, ITIS, Flickr, and GBIF. As I noted on the iSpecies blog, GenBank records often contain metadata about organismal distribution, habitats, and ecological associations, which could be harvested. There are communities on Flickr building photo libraries of organisms, often tagged with scientific name and geographic location (e.g., Field Guide: Birds of the World). Harvesting these sources will provide considerable initial content for EOL. Of course, not all sources in this category are of comparable quality. GenBank and PubMed are publicly funded, curated archives of scientific research; Wikipedia and Flickr are not.

Category 2 is next, and this is where we need tools to enable smaller providers to manage their own content, and contribute to EOL at the same time. This content would be targeted by “LifeDesks” (similar to the scratchpads being developed at the Natural History Museum, London).

Content in category 3 may have high scientific value, but in the short term the effort involved in incorporating it may outweigh the value it brings.

It’s perhaps a glib phrase, but I’m reminded of the genius of “and” versus the tyranny of “or”. Harvesting resources in category 1 is not an argument against also going after resources in category 2; it’s a question of priorities. In the same way, tools developed for category 2 providers may well facilitate acquiring content from category 3 sources.

Vetting

The issue of “vetting” generated much discussion during the review meeting. It became clear that this term can mean different things:

  1. data that is error free (“correct”)
  2. data provided by scientific sources (“scientifically authenticated”)
  3. data that has been verified by experts

No data source is without error, so EOL will inevitably include erroneous information. Currently the bulk of its data comprise distribution maps from GBIF, which are known to contain errors. For example, some 16% of legume records are incorrect (doi:10.1371/journal.pone.0001124). The GBIF map below shows numerous erroneous records of the North American channel catfish (Ictalurus punctatus) in China.

At the scale at which EOL operates (hundreds of millions of items of information), manually vetting all information before it is displayed is not feasible, and indeed by displaying GBIF maps EOL tacitly acknowledges this.

Of course, EOL wants to be an authoritative resource (in other words, more than a simple mashup), hence, one of its biggest challenges is to develop methods to catch errors. Innovative methods of annotation will need to be developed. Human Computation (see also Luis von Ahn’s talk at Google) is one approach, recently used by Google’s Image Labeler to annotate web images. BIG will need to develop easy-to-use interfaces so that EOL users can annotate data and flag possible errors. These annotations should be publicly visible, so that users who take the trouble to make annotations get instant feedback, and other users can see which records are contested.

Summary

A project on the scale of EOL is bound to take some time to settle in, and initial expectations were never going to be met, hence the generally underwhelmed reaction in the blogosphere (myself included). There is much to do, and the overall theme of the IAG report is that EOL needs more content, fast, and needs to tackle the issue of vetting in a way that will scale.

We were at DrupalCon

David Shorthouse
Thursday, March 13th, 2008

DrupalCon

Here at the Encyclopedia of Life, we are on the constant lookout for cool technologies and user groups. One that we have had our eye on for a long time is Drupal, a content management system written in PHP that uses MySQL as its backend database. Drupal has an immensely active fanbase and an increasing number of installations throughout the world. DrupalCon was held March 3-6, 2008 at the Boston, Massachusetts Convention & Expo Center. If you’re interested in how the conference was organized and what sessions were held, take a peek at the program (PDF).

Peter Mangiafico and I attended DrupalCon and were floored by the enthusiasm. I was particularly interested in the jQuery and GIS/Mapping sessions. Brian Aker’s (MySQL) plenary was also very helpful. And, this from Peter:

I was at DrupalCon on Thursday, March 6 and the first thing I noticed was that this seemed to be the highest concentration of laptops per square inch I have ever seen, with a higher percentage of glowing Apple symbols on the back of them than the population in general. You could tell you were in the right rooms just by glancing around and noticing folks coding in the audience in real-time. The sessions I went to were well-attended, with the “Using Drupal with External Data Sources” session spilling out of the room. It was great to see how many folks are using Drupal to solve their specific needs across different domains, from education to business.

We’re too Popular!

David Shorthouse
Tuesday, February 26th, 2008

You may have noticed that the EOL site has been flaky at best since approximately 12 EST this afternoon. Although we are serving the site from a load balanced cluster of several machines, we are experiencing phenomenal loads.

I just churned through the web logs from web machines in this cluster and there were 5.8M hits in the span of 3 hours. Most of these happened within 1 hour. We were down (and continue to experience intermittent access) for a few hours, then flipped the machines back on. Since then, there were an additional 5.7M hits, totaling 11.5M hits since 9AM this morning and it is now 2:45PM here. Wow!
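
For readers curious how a tally like this might be produced, here is a minimal sketch that counts hits per hour. It assumes access logs in the Common Log Format written by web servers such as Apache (timestamps like [26/Feb/2008:12:01:33 -0500]); the log format and tooling actually used on the EOL cluster are not described in this post, and the function name and sample lines below are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the bracketed timestamp of a Common Log Format line and captures
# the date plus the hour, e.g. "26/Feb/2008:12".
TIMESTAMP_HOUR = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}):\d{2}:\d{2} [+-]\d{4}\]"
)


def hits_per_hour(lines: Iterable[str]) -> Counter:
    """Count log lines per date-and-hour bucket; lines with no timestamp are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = TIMESTAMP_HOUR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines standing in for a real access log.
sample = [
    '192.0.2.1 - - [26/Feb/2008:12:01:33 -0500] "GET / HTTP/1.1" 200 5120',
    '192.0.2.2 - - [26/Feb/2008:12:59:02 -0500] "GET /pages/1 HTTP/1.1" 200 20480',
    '192.0.2.3 - - [26/Feb/2008:13:00:10 -0500] "GET / HTTP/1.1" 503 312',
]

for hour, hits in sorted(hits_per_hour(sample).items()):
    print(hour, hits)
```

With real logs, an open file object can be passed in place of the sample list, and the per-machine counters can be added together to get a total for the cluster.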

We are working hard to resolve the issue so stay tuned and please have patience! I’ll post updates here as the day progresses.

Update Feb 27 @ 11:45AM EST:

In the first 24 hours (minus the approx. 3-4 hours we were completely down) there were:
18.5M hits, 13.3M page views, and 940GB of data transferred.

Biodiversity Informatics Team

David Shorthouse
Monday, February 25th, 2008

Although we are a small team and are in the midst of filling positions, we are working very hard to ensure the public launch is a success. The Biodiversity Informatics component of the Encyclopedia of Life is stationed in Woods Hole, MA at the Marine Biological Laboratory. Some of us are long-time Cape Cod residents, but others have travelled great distances to help fulfill the dream.

David J (Paddy) Patterson
Biodiversity Informatics Leader

David Patterson
As a taxonomist responsible wholly or in part for the discovery of about 250 taxa, my view is that taxonomists are the information managers of biology, and that bioinformatics is a domain within taxonomy. I am a member of the International Commission on Zoological Nomenclature. I am also a Senior Scientist at the Marine Biological Laboratory in Woods Hole, Massachusetts and hold professorial positions at Brown University (Rhode Island) and at the University of Sydney (Australia).

Jennifer Schopf
System Architect

Jennifer Schopf
I’m the System Architect for the informatics group of EoL, which means I’m responsible for the overall management of this portion of the project, including working with end users to define requirements, working with the development team to come up with specifications, architectures and plans, and overseeing the day-to-day deliverables and progress. For the previous 6 years I was part of a large distributed software group called Globus, which also took a research project and made it into production-quality code for a large set of users. I came to EoL in part because it’s such a fascinating idea - a Web page for every species! - and I wanted to be part of a team that worked closely with scientists to change how they do their research.

Patrick Leary
Portal & Aggregation Project Leader

Patrick Leary
I have been working at the Marine Biological Laboratory on biodiversity informatics projects since 2001. I received a BA in Computer Science and Mathematics from Skidmore College in 2005 and have been working full-time at the MBL since then. Before working for EOL I was the lead developer and database administrator for the uBio project. Some of my recent developments have included natural language processing tools for identifying scientific names, applications which index and aggregate recent biodiversity literature, and tools which automatically mark up documents with semantic annotations. For EOL, I develop and maintain our central data indices and repositories, as well as create the web services used to interact with these databases. I see the Encyclopedia of Life as a communal resource which can help connect those interested in learning about biodiversity directly to the most compelling resources available. I hope EOL can promote awareness of biodiversity, and engage a new generation of future biologists.

Peter Mangiafico
Research & Development Project Leader

I have an undergraduate degree in Physics, an MS in Education and an ME in Engineering Physics. I’ve worked for NASA as a data analyst, as a researcher on digital medical imaging, and in technology companies in every role from sales to marketing to technical project management, web software development and as director of web application development. For the EOL, I have worked on the presentation layer of the preliminary version of the EOL.org species pages to be released at TED 2008 and in the future will be working with new technologies and investigating partnerships, as well as creating prototypes and assisting in the integration of these technologies into the core EOL infrastructure. What I like best about the EOL is working with the enthusiastic and talented group of individuals that are undertaking this most incredible project.

David P. Shorthouse
WorkBench Project Leader

David Shorthouse
I have an undergraduate and Master’s of Science degree from Laurentian University in Sudbury, Ontario where I focused my interests on ecological questions using spiders as a focal taxon. I have done the same for my Ph.D. thesis at the University of Alberta where I used ground-dwelling spiders as indicators of whole-forest biodiversity. I am leading development of what we are internally calling the “WorkBench” environment. This will be where users create, mix, mash, and reuse materials. There are lots of exciting ideas for how this will work and I am diligently putting the pieces together. What excites me most about the Encyclopedia of Life is the unbounded enthusiasm people have, especially those who may not have formal training in biology but are just as impassioned about biodiversity as are taxonomists and ecologists.

Pam Fournier
Information Technology

I am in charge of the Information Technology area for the Encyclopedia of Life. I am responsible for all aspects of server system design, implementation, and support. The servers include infrastructure, domain, and site servers run on various operating systems. My background includes degrees in Computer Science and Accounting with certifications and experience in software development, project management, networking, and systems administration. I joined the EOL because I looked forward to the challenges offered by such an extreme project. What I like best about the EOL is working with such a dedicated group of people and the “can do” attitude which abounds here.

Alexey Shipunov
Cybertaxonomist

Alexey Shipunov
I am a botanist, taxonomist, and developer, all in one. My Ph.D. at Moscow State University was on the taxonomy of Russian plantains. My recent academic pursuits focused on molecular taxonomy and diversity of orchids and endophytic fungi. I like diversity of any kind, but am especially intrigued by the global diversity of species. The vast majority of species are still undescribed, but at the same time, many species have been described several times over and consequently have different names. To deal with this issue, I am leading what we are calling Union, a system that will intelligently inform an end-user of currently recognized names but will also inform him/her of other names including homonyms, synonyms, vernaculars, misspellings, and surrogates.

Sarah Bordenstein
Content Manager

I serve as Content Manager for the EOL Informatics Group. With an undergraduate and Master’s of Science degree in Biology, I have worked at the Marine Biological Laboratory for four years in the areas of Education and Outreach. Prior to joining the EOL team, I led development of Microbial Life, a digital library dedicated to the ecology, evolution and diversity of microbes. I also coordinated the collection of legacy data for the International Census of Marine Microbes and served as education liaison for the local community. I am particularly interested in empowering teachers and students to participate in the scientific discovery process by making learning resources, tools and datasets freely available online. I feel extremely lucky to be part of EOL as I believe this resource will greatly enhance our awareness of biodiversity and facilitate the collection and distribution of biological information.

Jon Ferguson
Scientific Informatics Analyst

I’m moving to Woods Hole to be part of the Biodiversity Informatics group. Having lived in Scotland with a large family for a number of years, it’s a big move. The kind of thing you do only when an opportunity to be part of something big comes along. What captures my imagination about EOL is both the beauty of the subject and the scale of the undertaking. To make this work we’ll need strong collaboration from biologists and semantic researchers around the globe. Meeting and working with these people will surely be the most exciting part of all.

Kristen Lans
Project Administrator

I bring a background in Environmental Education and Administration to the Encyclopedia of Life’s Biodiversity Informatics Group. I earned an MA in Education at Portland State University in Oregon, where my thesis focused on creating participatory, web-based sustainable design tools for K-12 students and teachers. In 2005 and 2006, I was awarded the US Environmental Protection Agency’s P3 (People, Prosperity, and the Planet) Award for this work. I am hopeful about EOL’s potential as a medium to engage students with the natural world and to empower them to participate in designing creative solutions to stop environmental degradation where they live. I also hope that EOL will play a role in preserving and cataloguing biodiversity in the developing world, particularly by allowing underrepresented populations to contribute knowledge about culturally-specific species names, uses, and social and ecological functions.