Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

A session at MCN 2012

Thursday 8th November, 2012

3:30pm to 4:00pm (PST)

Many historical documents contain records of interest to historians, scientists and the general public, from census records in government publications to tables in scientific journals and entries in personal diaries. In some cases, these records are not available from any other source. Extracting them can be time-consuming and expensive, requiring painstaking attention to detail; however, crowdsourcing the task to citizen scientists has the potential to involve a larger pool of interested transcribers working at the same time, thus parallelizing the work.

In this presentation, we outline a workflow to crowdsource the annotations for 352 pages of previously transcribed biology field notebook text. Within sixteen weeks, citizen scientists had identified 2,342 species, locations and dates and marked them up in a computer-readable format. We used freely available technology, in particular Wikisource and WordPress, to recruit volunteers, coordinate efforts and extract the records from the transcribed text while maintaining a link between annotation and content.
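To make the extraction step concrete, here is a minimal sketch of how annotated records might be pulled out of transcribed wikitext while keeping a link back to the source page. The abstract does not specify the markup used, so the template names (`taxon`, `place`, `dated`), the `Record` fields, and the sample page title are all hypothetical assumptions for illustration only.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical annotation templates of the form {{kind|value|displayed text}}.
# The actual markup used in the project is not given in the abstract.
ANNOTATION = re.compile(r"\{\{(taxon|place|dated)\|([^{}|]+)(?:\|([^{}]*))?\}\}")


class Record(NamedTuple):
    page: str      # title of the transcribed page the annotation came from
    kind: str      # "taxon", "place" or "dated"
    value: str     # normalized value supplied by the annotator
    display: str   # text as it appears in the transcription
    offset: int    # character position of the annotation within the page


def extract_records(page: str, wikitext: str) -> List[Record]:
    """Collect every annotation on a page, keeping its link to the content."""
    records = []
    for match in ANNOTATION.finditer(wikitext):
        kind = match.group(1)
        value = match.group(2).strip()
        display: Optional[str] = match.group(3)
        records.append(
            Record(page, kind, value, (display or value).strip(), match.start())
        )
    return records


# Invented sample text, not taken from the notebooks.
sample = (
    "Saw two {{taxon|Sciurus aberti|Abert squirrels}} near "
    "{{place|Boulder, Colorado|Boulder}} on {{dated|1905-07-14|July 14}}."
)
records = extract_records("Page:Example notebook.djvu/12", sample)
for record in records:
    print(record)
```

Storing the page title and character offset with each record is one simple way to preserve the link between annotation and content; other designs would work as well.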


About the speakers

Gaurav Vaidya

Taxonomic informatics, Wikipedia and other things

David Bloom

Biodiversist, museophile, community builder, data mobilizer, geek-in-training, eater, pop (bio from Twitter)



MCN 2012

Seattle, United States

7th to 10th November 2012


East Room, Renaissance Seattle
