The Virtual Taxonomist

Post on 19-Jun-2015


Scholarly communication for the facebook generation


Vincent S. Smith

The Virtual Taxonomist
Scholarly communication for the facebook generation

Goal…
• Inventory the Earth’s species
• Document their relationships
• “Publish” these data

Data set…
• 1.8M described species (10M names)
• 300M pages (over last 250 years)
• 1.5-3B specimens

People…
• 4-6,000 scientists
• 30-40,000 amateurs
• Many more citizen scientists?

Taxonomy
The foundation of biology

1.8 million species
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.
• Plants: 260k spp.
• Animals: 1.18M spp.
• Other: 193k spp.
• Fungi: 101k

Taxonomy is parochial
Information sits in the “long tail” of a power distribution

• Crustaceans: 39k
• Birds: 10k
• Reptiles: 7.1k
• Mammals: 5k
• Amphib.: 5k
• Sponges: 10k
• Cnidarians: 9k
• Rotifers: 1.8k
• Flatworms: 13.7k
• Insects: 0.82M spp.
• Molluscs: 117k
• Fish: 25k


• Beetles: 370k spp.
• Flies: 85k spp.
• Butterflies & moths: 165k spp.
• Bees, wasps & ants: 198k spp.

0.01 papers per species per year, i.e. 1 paper every 100 years

Birds: 1 paper per species per yr.
Mammals: 2 papers per species per yr.
Elephants: 47 papers per species per yr.
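To make the contrast between the head and the tail of the distribution concrete, the publication rates quoted above can be turned into average waiting times between papers. This is only a back-of-envelope sketch: the rates are the slide's figures, while the "typical species" label and the 365.25-day year are assumptions added for illustration.

```python
# Convert a publication rate (papers per species per year) into the
# average interval between papers on any one species.
DAYS_PER_YEAR = 365.25  # assumption: mean length of a year in days

# Rates as quoted on the slide; the label "typical species" is ours.
RATES = {
    "typical species": 0.01,
    "birds": 1.0,
    "mammals": 2.0,
    "elephants": 47.0,
}


def years_between_papers(rate_per_year: float) -> float:
    """Average number of years between papers at the given rate."""
    return 1.0 / rate_per_year


for group, rate in RATES.items():
    years = years_between_papers(rate)
    days = years * DAYS_PER_YEAR
    print(f"{group}: one paper every {years:.2f} years (about {days:.0f} days)")
```

Under these figures a typical species waits a century between papers, while an elephant species gets a new paper roughly every week.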


Taxonomy is slow
Most life on earth is still undescribed

[Timeline: 1758 to 2008 = 250 yrs; 2008 to 3008? = 1000 yrs!!!]


250 years and counting!

The story so far…
• Estimates range from 5-100 million species (prob. 80% undescribed)

• At present rates most species will be extinct before we get to describe them

• Most descriptions are formulaic; the publication process is slow and involves paper archival

Most biodiversity (data) is hidden
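The 250-year and 1,000-year figures on the timeline are consistent with a simple constant-rate extrapolation. The sketch below is a rough check, not the author's calculation: it assumes the 1.8M described species accumulated evenly between 1758 and 2008, and that description continues at that same rate.

```python
# Back-of-envelope extrapolation of the time needed to describe all species,
# assuming a constant description rate (an assumption, not a slide claim).
DESCRIBED_SPECIES = 1_800_000      # described species, from the slides
YEARS_ELAPSED = 2008 - 1758        # 250 years, per the timeline slide


def description_rate() -> float:
    """Average number of species described per year so far."""
    return DESCRIBED_SPECIES / YEARS_ELAPSED


def years_to_finish(total_species: float) -> float:
    """Years still needed to describe every species at the historical rate."""
    remaining = max(total_species - DESCRIBED_SPECIES, 0)
    return remaining / description_rate()


def total_if_fraction_undescribed(fraction: float) -> float:
    """Total species implied if `fraction` of all species is undescribed."""
    return DESCRIBED_SPECIES / (1.0 - fraction)


scenarios = {
    "80% undescribed": total_if_fraction_undescribed(0.8),
    "low estimate (5M)": 5_000_000,
    "high estimate (100M)": 100_000_000,
}
for label, total in scenarios.items():
    print(f"{label}: about {years_to_finish(total):,.0f} more years")
```

With 80% of species undescribed the total is about 9 million, and the remaining 7.2 million take roughly 1,000 years at 7,200 species per year.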

Taxonomy is hard to find
People & data distributed & highly fragmented

• Small communities working on biodiversity
- Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
- 1.5-3B specimens worldwide (type specimens)
- 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)

Mol. Phyl. Evol.: 21,964 pp. since 2000

[Figure labels, taxon names: Menopon gallinae, Numidicola antennatus, Amyrsidea ventralis, Somaphantus lusius, Menacanthus stramineus, Colimenopon urocolius, Trinoton anserinum, Meromenopon meropis, Gruimenopon longum, Hoazineus armiferus, Copocephalum zebra, Comatomenopon elbeli/elongatum, Psittacomenopon poicephalus, Odoriphila clayae/phoeniculi, Ardeiphilus trochioxus, Cuculiphilus fasciatus, Ciconiphilus quadripustulatus, Eomenopon denticulatum, Piagetiella bursaepelecani, Osborniella crotophagae, Hohorstiella lata, Neomenopon pteroclurus, Machaerilaemus laticorpus/latifrons, Austromenopon crocatum, Eidmanniella pellucida, Holomenopon brevithoracicum, Dennyus hirundinis, Myrsidea victrix, Ancistrona vagelli, Pseudomenopon pilosum, Bonomiella columbae, Chapinia robusta, Plegadiphilus threskiornis, Actornithophilus uniseriatus, MEGAMENOPON, Rediella mirabilis, Latumcephalum lesouefi/macropus, Paraboopia flava, Paraheterodoxus insignis, Boopia tarsata, Therodoxus oweni, Laemobothrion maximum, Ricinus fringillae, Trochiliphagus abdominalis, Trochiloecetes rupununi, Liposcelis bostrychophilus]


• Publications are data rich


DATA

• Linked by taxonomic names


What does this all mean…
• Taxonomy is an information science (formulaic, data rich, parochial, underfunded)

• Taxonomy lends itself to the Web


Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature


Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

• Biodiversity publications since 1469
- 5.4 million books
- 800,000 monographs
- 40,000 periodicals
• Held by Natural History libraries
- E.g., NHM holds more than 1M books, 250k monographs & periodicals, 0.5M artworks
• Sharing the digitisation of contents
• Focus on out of copyright materials
• Partnership with “Internet Archive”
• BHL partnership of 10 Nat. Hist. libraries
• Make the contents “findable”

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

1 scribe machine, 3,500 pages per shift per day
34 scribe machines now in operation

1. Scan (photograph)
2. Extract text (OCR)
3. Find keywords
- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs


Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32: 7-60.


4. Index
5. Put on the web
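The "find keywords" and "index" steps can be illustrated with a small sketch for one keyword type, taxonomic names. This is a deliberately naive illustration and not BHL's actual software: the `KNOWN_GENERA` set is a hypothetical stand-in for a real name authority, the regular expression is a simple assumption about how names look in OCR text, and the second sample page is made-up text.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a taxonomic name authority; a real pipeline
# would check candidates against a far larger list of names.
KNOWN_GENERA = {"Naubates", "Menopon", "Ricinus"}

# A capitalised word, optionally followed by a lowercase epithet of 3+ letters.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+)(?: ([a-z]{3,}))?\b")


def find_names(text):
    """Return candidate taxonomic names whose genus is in KNOWN_GENERA."""
    names = []
    for genus, epithet in NAME_PATTERN.findall(text):
        if genus in KNOWN_GENERA:
            names.append(f"{genus} {epithet}" if epithet else genus)
    return names


def build_index(pages):
    """Map each name to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for name in find_names(text):
            index[name].add(page_number)
    return {name: sorted(numbers) for name, numbers in index.items()}


# Page 7 uses the article title cited on the slide; page 8 is invented text.
ocr_pages = {
    7: "A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae).",
    8: "Menopon gallinae and Ricinus fringillae were also examined; Naubates is revised.",
}
for name, page_numbers in sorted(build_index(ocr_pages).items()):
    print(name, page_numbers)
```

A pattern this simple will misread any short lowercase word after a genus as an epithet, and OCR errors in old fonts would defeat the exact genus lookup, which is one reason OCR quality and better indexing matter for a project like this.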


Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

• NHM, London
- 1 scribe machine
- >500k pages
- Focus on exceptionally rare text
• Completed to date:
- 3,802 periodicals (journals)
- 9,181 books
- 5.5 million pages (2% of total)

http://www.biodiversitylibrary.org/

• Challenges
- Copyright (1923 USA)
- OCR quality (old fonts)
- Better indexing
- Foreign language content
- Needs a critical mass of content to be useful
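The scanning figures quoted for BHL (3,500 pages per machine per shift, 34 machines, 5.5 million pages being 2% of the total) allow a rough estimate of the work remaining. The sketch below is an illustration only: it assumes every machine runs exactly one shift every day with no downtime, which the slides do not state.

```python
# Rough estimate of remaining scanning effort from the figures on the slides.
PAGES_PER_MACHINE_PER_DAY = 3_500   # one shift per day (assumed to be daily)
MACHINES = 34                       # scribe machines in operation
PAGES_DONE = 5_500_000              # pages completed to date
PERCENT_DONE = 2                    # share of the total, per the slide


def total_pages() -> float:
    """Total pages implied by 'PAGES_DONE is PERCENT_DONE percent'."""
    return PAGES_DONE * 100 / PERCENT_DONE


def daily_capacity() -> int:
    """Pages scanned per day if every machine runs one shift."""
    return MACHINES * PAGES_PER_MACHINE_PER_DAY


def days_remaining() -> float:
    """Days needed to scan the rest at full daily capacity."""
    return (total_pages() - PAGES_DONE) / daily_capacity()


print(f"Implied total: {total_pages():,.0f} pages")
print(f"Capacity: {daily_capacity():,} pages per day")
print(f"Remaining: about {days_remaining() / 365.25:.1f} years of daily shifts")
```

The implied total of 275 million pages is close to the 300M pages quoted for the taxonomic literature as a whole.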


Data mining taxonomic publications
“Extracting factual information”

- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs


Experimental extraction of factual information

Plazi.org (D. Agosti et al.) (Manual, slow but accurate)
- Article (Hand selected)
- GoldenGate (Manual Software)
- Approx. 26 nested fields (TaxonX-XML)
- Repository (DSpace)

iPhylo (R. Page) (Automatic, fast but dirty)
- “Library” (Legal & minable)
- Crawler scripts & web services
- Approx. 12? data objects
- Entity-Attribute-Value Model (Database)
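For readers unfamiliar with the Entity-Attribute-Value model named above, the sketch below shows the general idea: every fact is one row of (entity, attribute, value), so new kinds of data need no schema change. It is a generic illustration using Python's built-in `sqlite3`, not iPhylo's actual database; the table name, the entity identifier and the attribute names are assumptions, and the facts come from the Palma & Pilgrim citation shown earlier.

```python
import sqlite3

# In-memory database with a single generic table: one row per fact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")

# Facts from the citation on the slides; the identifier and attribute
# names are hypothetical choices made for this example.
facts = [
    ("palma_pilgrim_2002", "type", "article"),
    ("palma_pilgrim_2002", "author", "Palma, R.L."),
    ("palma_pilgrim_2002", "author", "Pilgrim, R.L.C."),
    ("palma_pilgrim_2002", "year", "2002"),
    ("palma_pilgrim_2002", "journal", "J. R. Soc. N.Z."),
    ("palma_pilgrim_2002", "volume", "32"),
    ("palma_pilgrim_2002", "pages", "7-60"),
    ("palma_pilgrim_2002", "taxon", "Naubates"),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", facts)


def values_of(entity, attribute):
    """All values of one attribute for one entity, in insertion order."""
    rows = conn.execute(
        "SELECT value FROM eav WHERE entity = ? AND attribute = ? ORDER BY rowid",
        (entity, attribute),
    )
    return [row[0] for row in rows]


def entities_with(attribute, value):
    """All entities that carry the given attribute/value pair."""
    rows = conn.execute(
        "SELECT DISTINCT entity FROM eav WHERE attribute = ? AND value = ?",
        (attribute, value),
    )
    return [row[0] for row in rows]


print(values_of("palma_pilgrim_2002", "author"))
print(entities_with("taxon", "Naubates"))
```

The flexibility comes at a cost: values are untyped strings and queries across many attributes need self-joins, which fits the "fast but dirty" trade-off described for the automatic approach.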


RSS + TAPIR

Data visualizations

“A database of everything!”

RDF + RSS


Encyclopedia of Life (EOL)
“A web page for every species”

http://www.eol.org/

• A web page for all 1.8M species
• Multi-institution collaboration
• $50m funding (5 years)
- MacArthur and Sloan Foundations
• Megascience mashup
- Aggregating data from the web
• Multiple audiences
- Science & outreach
• 10 years to complete
- First draft 2008, “finished” 2017!

Encyclopedia of Life (EOL)
“A web page for every species”

• Huge interest
- 11.5 million hits in first 5 hours
- 500+ press articles
- Pages unavailable for first two days!
• First draft 27 Feb. 2008
- 24 “exemplar” pages
- 30,000 detailed pages (fish & amphib.)
- 1 million “stubs” (names & links)
• Much praise but some criticism
- Growth (needs 1,000 spp. per day)
- Quality vs. quantity of information
- Authoritative “vetting” process
- Credit for “authors”
• Nine more years to go
- Get more content online
- Better tools to engage more people


What is a Scratchpad?
“A Website & publishing platform for taxonomic communities”

1. Your data
2. Uploaded & tagged
3. Published & reviewed on your site


Fast, intuitive, fit for use


What can Scratchpads do?
Import, manage, search & browse:
• DNA & Phylogenies
• Specimens
• Literature
• Images

What can Scratchpads do?
Integration & connectivity within & between sites
• DNA & Phylogenies
• Specimens
• Literature
• Images
• Taxonomy

Current Scratchpads
Ants, Bees, Beetles, Big-headed flies, Birds, Blackflies, Ciliates, Cockroaches, Dragon Trees, Dung Beetles, False Buttonweed, Flat worms, Flies, Foraminifera, Fossil Insects, Fungus Gnats, Holometabola, Leaf-miner Flies, Lice, Lichens of Bermuda, Malvaceae, Megalastrum ferns, Milichiid flies, Mosquitoes, Mosses, Nannotax fossils, Nepticuloid moths, Palms, Pearl oysters, Polychaete worms, Scaleworms, Termites, Triticid grasses, Weevils, Wood Ferns, Sulawesi Ferns, Stick insects

Sites: 61
Users: 665
Pages: 130k
Since March 2007

Scratchpad applications
A multipurpose, flexible technology

eBooks
4th Edition Howard & Moore, Birds of the world (fact checking, data compilation, 2010, funding)

eJournals
European Mosquito Bulletin (ISSN 1460-6127), Phasmid Studies (ISSN 0966-0011) (submission, review, & dissemination of articles)

Image galleries
Nanno fossils, Cockroaches, Stick insects, Flatworms, Grasses, Lichens & many more… (rapid upload, annotation, & display of images)

Scratchpad usage
Content & contributors in the first 15 months

129,896 pages, 665 contributors (June 24 2008)

Pages:
- Across 61 sites
- In detail:
• Definitions (41%)
• References (26%)
• Associations (8.5%)
• DNA sequences (6%)
• Images (4.5%)
• Maps (2.8%)
• Specimens (2.1%)
• Others (1.3%)

Contributors:
- No more than 10% significantly active
- Contributors in more than 30 countries
- In detail [Jan. 08]:
• Europe (55%)
• Unknown (29%)
• North America (9%)
• Asia (3%)
• Australasia (2.5%)
• South America (2%)
• Russia (0.8%)
• Middle East (0.4%)

Scratchpad visitors
Tracking visitors across sites

March 2008

Scratchpad visitors
Popular content: what visitors are looking at

The “long tail” of taxonomy

Visitors want less of more, i.e. everyone wants something different

Scratchpad overview

Scratchpads are integrating taxonomy

Integrating taxonomy
“Small pieces loosely joined”

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Questions?

Scratchpad management
Scalable & sustainable technology

Hardware, software & user support

Virtual machine, open-source software, self-archiving, backed-up, multi-site configuration (easy to move & upgrade, secure & reliable, citable, screencasts, low admin., low marginal costs)

Scratchpad bibliometrics
Metrics of output and use

Impact (Web equivalent to journal impact factor & personal H-index)

130,000 pages

665 contributors

Content

Usage