Little eScience
Andrea Wiggins, June 18, 2009
Overview
• Background
• Exposition: Sociology of Science
• Broad generalizations about science
• Example: FLOSS Research
• Little science context for eScience research
• Expectations: What next?
http://www.flickr.com/photos/pmtorrone/304696349/
• BA: Maths with economics
• Nonprofit & IT industry work
• Adult literacy, nonprofit management support, professional theatre
• Web analytics
• MSI: Human-computer interaction, complex systems & network science
• PhD: Information science & technology
My Background
Science
• Systematic investigation for the production of knowledge
• Scientific method emphasizes reproducibility
• Not all phenomena are reproducible...
• Many categories
• Experimental, applied, social, etc.
• Categories are not mutually exclusive
http://www.flickr.com/photos/radiorover/419414206/
• Kuhn - Laws, theories, applications & instrumentation that create coherent traditions of scientific research
• Paradigms help us direct our research, but limit our view of the world
• New technologies can lead to scientific revolutions by revealing anomalies
Paradigms & Revolutions
http://www.flickr.com/photos/weichbrodt/644302381/
Normal Science
• Kuhn - “normal science” is research based on broadly accepted scientific paradigms
• Shared paradigms are based on rules and standards for scientific practice
• Key requirement: agreement on focus and conduct of research
• ∃(Grand Challenges) | Discipline
http://www.flickr.com/photos/themadlolscientist/2421152973/
Big Science
• de Solla Price - “Big Science” is...
• Inherently paradigmatic
• Always normal science
• Produces detailed insights into the minutiae of phenomena studied in the paradigm
http://www.flickr.com/photos/31333486@N00/1883498062/
• Paradigms require agreement on...
• Epistemology
• Ontology
• Methodology
• Most social sciences are pre-paradigmatic
• Primarily exploratory research
• Very little replication
Pre-paradigmatic Science
http://www.flickr.com/photos/askpang/327577395/
Little Science
• de Solla Price - “Little Science” is a romanticized precursor to Big Science, featuring lone, long-haired geniuses misunderstood by society, etc.
• If it’s not Big Science, it’s Little Science
• Pre-paradigmatic and fraught with ambiguity
• Often fundamentally exploratory
• Epistemological/theoretical/methodological divergence among researchers
http://www.flickr.com/photos/mrjoax/2548045246/
Social Science
• Social science is real science: the goal is systematic knowledge production
• Focuses on the study of the social life of human groups and individuals
• IMHO, fundamentally more difficult than “hard” sciences due to infinite complexity of social phenomena
• Replicability is a major challenge with respect to scientific method
• Not all social science can or should aspire to replicability
http://www.flickr.com/photos/smiteme/2379629501/
Normalizing Science
• Becoming a normal science requires community and convergence
• ∃(community) != ∃(agreement)
• Establishing grand challenges and methods are primary tasks of normalizing
• Resistance to change is pervasive
http://www.flickr.com/photos/9036026@N08/2949211479/
Scientific Collaboration
• Collaboration requires common focus, if not also epistemology and ontology
• Challenging enough in normal sciences
• Harder in pre-paradigmatic research
• Economics: systemic disincentives to collaborate, versus potential benefits and ideals of science
http://www.flickr.com/photos/richardsummers/542738965/
• LHC, CERN, etc.
• Thousands of collaborators
• Complex but coordinated, at least somewhat centralized
• Requires shared goals and resources, plus (lots of) communication
• Only happens in normal sciences
Big Science Collaboration
http://www.flickr.com/photos/8767020@N08/531355152/
• A Professor & a grad student, give or take
• Localized goals and resources
• -> localized research practices
• Small research teams
• Fundamentally difficult to achieve consensus that allows larger groups
• Restricts the ability to obtain funding and undertake ambitious projects
Little Science Collaboration
http://www.flickr.com/photos/lamazone/2735939345/
Scientific Collaboration Requirements
• Shared goals
• Establishes focus of research
• Shared research resources
• Both social and artifactual
• Social aspects include training and community socialization
http://www.flickr.com/photos/ryanr/142455033/
we can has share?
• Letters, Books, Journals, Lectures
• Also technologies: methods, instrumentation
• Sharing?
• Recordkeeping is not always a researcher’s main priority
• Without records, there’s not much to share except the research outputs
Historical Research Artifacts
http://www.flickr.com/photos/smailtronic/1535870363/
Today’s Research Artifacts
• Large scale datasets, scripts, software, workflows, papers, images, video, audio, annotations, ephemera, web sites...
• “Research objects” - bundling all the pieces together
• Hybrids of boundary objects and touchstones
• Technologies -> scientific revolution!
• Open science
http://www.flickr.com/photos/smiteme/2379630899/
Example: FLOSS Research
• Phenomenological & interdisciplinary
• Software Engineering, Information Systems, Anthropology, Sociology, CSCW, etc.
• Ethos
• (Idealistic) combination of open source values and scientific values
http://www.flickr.com/photos/themadlolscientist/2542236565/
FLOSS Phenomenon
• Free/Libre Open Source Software: “Free as in speech, free as in beer” - liberty versus cost
• Distributed collaboration to develop software
• Volunteers and sponsored developers
• Community-based model of development
http://www.flickr.com/photos/prawnwarp/541526661/
Typical FLOSS Research Topics
• Coordination and collaboration
• Growth and evolution (social and code)
• Code quality
• Business models and firm involvement
• Motivation, leadership, success
• Culture and community
• Intellectual property and copyright
http://www.flickr.com/photos/eean/519258881/
What we study @ SU
• Social aspects of FLOSS
• What practices make some distributed work teams more effective than others?
• How are these practices developed?
• What are the dynamics through which self-organizing distributed teams develop and work?
Sharing FLOSS Research Artifacts
• Community: Small but growing, maybe around 400 researchers worldwide, with lively face-to-face interaction but relatively low listserv activity
• Data: Lots of it, and readily available, though often difficult to use for several reasons
• Analyses and tools: Not quite as easy to get, but there if you can find them
• Papers: Repositories are as yet underdeveloped, but efforts are underway
http://www.flickr.com/photos/12698507@N08/2762563631/
FLOSS Research Community
• Handful of small research groups, mostly in UK & Europe
• Most often found in Software Engineering departments
• International conferences targeted to academics, developers, or both
• OSS, ICSE, FOSDEM, etc.
• IFIP WG 2.13
http://www.flickr.com/photos/steevithak/2883218362/
FLOSS Research Data
• Data sources include interviews, surveys, and ethnographic fieldwork
• Digital “trace” data: archival, secondary, by-product of work, easy but hard
• Repositories
• Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
• RoRs: Repositories of Repositories
• Data sources for research
We Built It...
• Motivations
• Stop hammering forge servers, getting entire campus IPs blocked...
• Stop reinventing the wheel!
• Adoption
• Shared data sources seeing increasing use
• Next step is harder: sharing tools and workflows
http://www.flickr.com/photos/circulating/997909242/
RoRs: FLOSSmole
• Multiple PIs @ Syracuse, Elon, & Carnegie Mellon
• One grad student @ SU (me), a couple of undergrads @ Elon
• Public access to 300+ GB data on
• 300K+ projects from 8 repositories
• Flat files & SQL datamarts
• Released via SourceForge (SF) & Google Code (GC)
• 5 TB allotment on TeraGrid @ SDSC
RoRs: FLOSSmetrics
• Produced by LibreSoft with academic and corporate partners
• Public access to data for 2800+ projects
• Analyzed & raw data from CVS, email, trackers
• Tools for:
• calculating code metrics
• parsing trackers
• parsing email lists
RoRs: SRDA
• SourceForge Research Data Archive
• One PI @ University of Notre Dame
• One massive 300 GB+ SQL db of monthly dumps from SourceForge
• Original obtuse structure, regular table deprecation, some documentation
• Gated access: researchers only, a condition of data release from SF
RoRs: Emerging Sources
• Ultimate Debian Database (UDD)
• 300 MB compressed Postgres DB, produced by Debian community
• Planning to add to FLOSSmole
• When available...
• Bespoke Scripts
• Taverna workflows
FLOSS Research Analyses
FLOSS Research Papers
• First, there was opensource.mit.edu
• They no longer maintain it, and gave us the data
• Work-in-progress working papers repository at FLOSSpapers.org
• Essential viability problem is that repositories require long-term stewardship...
• ...which requires long-term commitments of funding and personnel, not just volunteers
FLOSS Research Collaboration
• Multiple partners involved in producing FLOSSmole & FLOSSmetrics
• Federated data sources by choice, starting to develop ontologies
• As yet, a Little Science domain
• Cross-institutional collaboration poses many challenges
• Usual difficulties magnified by general lack of resources, both financial and human
Latest Initiatives
• Resource-oriented
• Expanding resources: data, research artifacts, and pedagogical materials
• DOIs: 10.4118/*
• Semantic data interoperability
• Community-oriented
• FLOSShub.org
Evangelizing eScience
• Made presentations at OSS conferences: well received, but hard to make converts for several reasons
• Tried to get other research group members to use Taverna: learning overhead is too high for most
• Submitted a paper on eScience to an IS conference: rejected because reviewers could not adequately evaluate eScience as a topic, as it’s too unfamiliar
• Currently just doing our work this way, as an exemplar
http://www.flickr.com/photos/naezmi/2418745377/
Barriers to Uptake
• Lack of agreement in research focus, theory, methods; researcher isolation
• Bimodal distribution of requisite skills
• “I can’t possibly do that! I can’t code!”
• “Why bother? I can code my own. You should too; just use Python.”
“Overheard” on Twitter:
Friend #1: i HATE that openoffice automatically took over my "open with..." defaults.
Friend #2: @Friend #1 <opensourcedeveloper> If you don't like it, then why don't you submit code to change the behavior!? </opensourcedeveloper>
http://www.flickr.com/photos/noner/1739876378/
What I had to learn to get this far
• Taverna
• A lot more Unix terminal & XML
• Relational DB management & SQL
• More R, plus packages and dependency management
• Java & Eclipse - just enough to write my own Beanshells
• SVN & SSH
• A little bit of OWL, RDF, & SPARQL
• I would not have taken this on if I had known what was in store, but once I got started, I was hooked
http://www.flickr.com/photos/sashala/292868436/
Sociotechnical Engineering
• Tools are part of the solution, thanks to brilliant CS and SE people
• Social elements are the true barrier
• Awareness of methods and benefits
• Incentive systems
• Resistance to change (paradigms again)
• Proof of concept is difficult
http://www.flickr.com/photos/pinprick/3117108495/
Using Taverna for Little eScience
• Implementing analysis is usually easy
• Data handling is almost always hard
• All data are in SQL databases, with consistent IDs
• Lots of data manipulation is required
• Avoiding web services as much as possible
• Infrastructure and resources are limited
• Benefit is truly questionable: AFAIK, I am 50% of the user base...
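The point about SQL databases with consistent IDs can be illustrated with a small sketch. Everything here is hypothetical: the `releases` and `downloads` tables, their columns, the sample rows, and the helper names are invented for illustration and do not reflect the actual FLOSSmole or SRDA schemas. It uses an in-memory SQLite database only to show the kind of join a shared `project_id` makes possible.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up release and download data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE releases (project_id INTEGER, version TEXT, release_date TEXT);
        CREATE TABLE downloads (project_id INTEGER, day TEXT, downloads INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO releases VALUES (?, ?, ?)",
        [(1, "0.5", "2006-01-10"), (1, "0.6", "2006-02-10")],
    )
    conn.executemany(
        "INSERT INTO downloads VALUES (?, ?, ?)",
        [
            (1, "2006-01-10", 50),
            (1, "2006-01-12", 30),
            (1, "2006-01-20", 5),    # outside the 7-day window after 0.5
            (1, "2006-02-10", 80),
            (1, "2006-02-11", 20),
            (2, "2006-01-11", 999),  # different project, ignored by the join
        ],
    )
    return conn


def downloads_after_release(conn, project_id, days=7):
    """Total downloads in the `days` days starting at each release date."""
    query = """
        SELECT r.version, SUM(d.downloads)
        FROM releases AS r
        JOIN downloads AS d
          ON d.project_id = r.project_id
         AND d.day >= r.release_date
         AND d.day < date(r.release_date, '+' || ? || ' days')
        WHERE r.project_id = ?
        GROUP BY r.version, r.release_date
        ORDER BY r.release_date
    """
    return conn.execute(query, (days, project_id)).fetchall()
```

With the sample rows above, `downloads_after_release(build_demo_db(), 1)` returns one total per release of project 1. Real trace data would need far more cleaning than this sketch suggests, which is the "data handling is almost always hard" part.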
• Estimating user base and potential user interest in FLOSS projects
• Based on common release-and-download patterns
• Proxy for project success, a common dependent variable
Example: Our Recent Research
“Normal” Download-Release Patterns
[Figure: downloads over time across releases Version 0.5, Version 0.6, and Version 0.7. Annotations: area under curve is active users updating; active user base growth; potential user experimentation growth (good publicity?).]
[Figure: BibDesk downloads, Oct 2005 to Apr 2007; y-axis: downloads (1000 to 5000); legend “measure”: user_base, baseline.]
Taverna’s Download-Release Patterns
[Figure: Taverna downloads over time, annotated “External effects!”; labels: 1.3.2-RC1+2 presentations 1.5.0; two “?” markers.]
Taverna’s Estimated Baseline & User Base
14 day baseline & drop-off
Taverna’s Estimated Baseline & User Base
7 day baseline & drop-off
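As a rough illustration of the baseline and drop-off idea, here is a minimal sketch. It is not the workflow used for these slides. It assumes the baseline is the median of daily downloads in the window before a release, and that the user base estimate is the download count above that baseline summed over the same-length window after the release (the "area under the curve"). The function name, parameters, and sample numbers are all hypothetical.

```python
from statistics import median


def estimate_user_base(daily_downloads, release_day, window=14):
    """Estimate (baseline, user_base) around one release.

    daily_downloads: list of download counts, one per day
    release_day:     index of the release day in that list
    window:          days used for both the baseline and the drop-off
    """
    before = daily_downloads[max(0, release_day - window):release_day]
    if not before:
        raise ValueError("need at least one day of data before the release")

    # Assumed baseline: typical daily downloads before the release spike.
    baseline = median(before)

    # Assumed user base: downloads above baseline during the drop-off window.
    after = daily_downloads[release_day:release_day + window]
    user_base = sum(max(0, day - baseline) for day in after)
    return baseline, user_base
```

Calling `estimate_user_base(series, release_day, window=7)` and again with `window=14` on the same series gives a quick sense of how sensitive the estimate is to the window choice, which is the comparison the two slides above make.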
Interpretation
• Taverna is not a “normal” open source project
• Speaking tours, tutorials, articles, and other events influence downloads
• What this demonstrates...
• Care is needed with quantitative measures
• Not all open source projects are the same
• Taverna users are just as reactive as any
http://www.flickr.com/photos/pagedooley/2121472112/
Where next?
• Adoption is a long-term agenda, as changing social practices doesn’t happen overnight
• For FLOSS research and our disciplinary communities
• We will keep doing our work this way, and hope to draw in others
“Won’t you come out and play?”
http://www.flickr.com/photos/atiq/2658884520/
Thanks!
• Credits where they are due
• Kevin Crowston, my advisor
• James Howison, my collaborator
• Everett Wiggins, my husband