Data Wrangling - University of Colorado Boulder Computer ...jbg/teaching/INST_414/lab01.pdf · Big...
Data Wrangling
Data Science: Jordan Boyd-Graber, University of Maryland, January 14, 2018
Data Science: Jordan Boyd-Graber | UMD Data Wrangling | 1 / 14
Download Data
Big Picture
- Data are messy (this isn’t so messy!)
- The first step to doing anything cool is using data
- Often need to use common sense and brute force
- You’ll see more in the first real homework
First Steps: Get Data
- From FEC
- Odd formatting
- Today: pure Python (easier with Pandas); this will help expose the level of Python you’ll need
Look at file ...
- Periods instead of commas (and vice versa)
- Odd New York parties
- Semi-colon delimiters
- Includes totals
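The number formatting noted above can be handled with small helper functions. The following is a minimal sketch, not the lab's own code: the helper names `parse_count` and `parse_percent` and the sample strings are invented for illustration, and whether percentages in the file carry a trailing `%` is an assumption.

```python
def parse_count(text):
    """Parse an integer that uses periods as thousands separators."""
    return int(text.replace(".", ""))


def parse_percent(text):
    """Parse a percentage that uses a comma as the decimal mark.

    Assumes an optional trailing percent sign; periods are treated
    as thousands separators and dropped.
    """
    cleaned = text.strip().rstrip("%").replace(".", "").replace(",", ".")
    return float(cleaned)


# Invented sample strings, not values from the FEC file.
print(parse_count("1.234.567"))
print(parse_percent("48,04%"))
```

Keeping the parsing in one place means a later change in the file's conventions only has to be handled once.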
Read in Data
from csv import DictReader
votes = list(DictReader(open("2012pres.csv", 'r'),
                        delimiter=";"))
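To try the same call without the FEC file on disk, the sketch below feeds `DictReader` an in-memory sample. The column names come from the slides; the rows, the state name "Alpha", and the candidate names are invented, and the layout of the totals row is an assumption.

```python
import io
from csv import DictReader

# Invented sample in the same shape as the lab file: semicolon-delimited,
# periods as thousands separators, and a per-state totals row.
SAMPLE = """STATE;LAST NAME;GENERAL RESULTS;TOTAL VOTES #
Alpha;Smith;1.200;
Alpha;Jones;800;
Alpha;;;2.000
"""

# For the real file, replace io.StringIO(SAMPLE) with
# open("2012pres.csv", newline="") so the file is closed automatically.
with io.StringIO(SAMPLE) as handle:
    votes = list(DictReader(handle, delimiter=";"))

# Each row is a dict keyed by the header line; every value is a string.
print(votes[0]["STATE"], votes[0]["GENERAL RESULTS"])
```

Every value stays a string, which is why the later slides strip the periods before converting to `int`.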
How many votes were cast?
Total votes 129085410
total_votes = sum(int(x["TOTAL VOTES #"].replace(".", ""))
                  for x in votes if x["TOTAL VOTES #"])
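The one-line sum above can be unrolled into an explicit loop, which may be easier to step through. This is a sketch on invented rows. It assumes that only totals rows fill in the "TOTAL VOTES #" column and that the file has no grand-total row, which would be counted twice.

```python
# Invented rows: candidate rows leave "TOTAL VOTES #" blank,
# totals rows fill it in.
votes = [
    {"STATE": "Alpha", "TOTAL VOTES #": ""},
    {"STATE": "Alpha", "TOTAL VOTES #": "2.000"},
    {"STATE": "Beta", "TOTAL VOTES #": ""},
    {"STATE": "Beta", "TOTAL VOTES #": "1.500.000"},
]

total_votes = 0
for row in votes:
    cell = row["TOTAL VOTES #"]
    if not cell:
        # Empty string: this row is not a totals row, so skip it.
        continue
    # Drop the thousands separators before converting.
    total_votes += int(cell.replace(".", ""))

print("Total votes %i" % total_votes)
```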
What state had the largest numerical margin between first and second place?
Largest numerical margin 3014327 in California
margins = {}
for ss in set(x["STATE"] for x in votes):
    margins[ss] = winner(votes, ss)[1] - second(votes, ss)[1]
num_margin = argmax(margins)
print("Largest numerical margin %i in %s" %
      (max(margins.values()), num_margin))
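The slides do not show `winner`, `second`, or `argmax`. One plausible way to write them is sketched below, assuming each helper returns a `(name, count, percent)` tuple so that index 1 is the vote count and index 2 the percentage. The `ranked` helper, the percentage computed from the candidate counts, and the sample rows are all assumptions made for illustration.

```python
def ranked(votes, state):
    """Return (name, count, percent) tuples for one state, largest count first."""
    rows = [(x["LAST NAME"], int(x["GENERAL RESULTS"].replace(".", "")))
            for x in votes
            if x["STATE"] == state and x["LAST NAME"] and x["GENERAL RESULTS"]]
    total = sum(count for _, count in rows)
    with_percent = [(name, count, 100.0 * count / total) for name, count in rows]
    return sorted(with_percent, key=lambda row: row[1], reverse=True)


def winner(votes, state):
    """Hypothetical helper: the first-place candidate in a state."""
    return ranked(votes, state)[0]


def second(votes, state):
    """Hypothetical helper: the second-place candidate in a state."""
    return ranked(votes, state)[1]


def argmax(scores):
    """Return the key whose value is largest."""
    return max(scores, key=scores.get)


# Invented rows; the blank row stands in for a totals line.
votes = [
    {"STATE": "Alpha", "LAST NAME": "Smith", "GENERAL RESULTS": "1.200"},
    {"STATE": "Alpha", "LAST NAME": "Jones", "GENERAL RESULTS": "800"},
    {"STATE": "Alpha", "LAST NAME": "", "GENERAL RESULTS": ""},
    {"STATE": "Beta", "LAST NAME": "Jones", "GENERAL RESULTS": "900"},
    {"STATE": "Beta", "LAST NAME": "Smith", "GENERAL RESULTS": "400"},
    {"STATE": "Beta", "LAST NAME": "Lee", "GENERAL RESULTS": "100"},
]

margins = {}
for ss in set(x["STATE"] for x in votes):
    margins[ss] = winner(votes, ss)[1] - second(votes, ss)[1]
num_margin = argmax(margins)
print("Largest numerical margin %i in %s" %
      (max(margins.values()), num_margin))
```

Sorting once in `ranked` keeps `winner` and `second` consistent with each other, at the cost of re-sorting on every call.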
What state had the largest percentage margin between first and second place?
Largest percentage margin 48.04 in Utah
margins = {}
for ss in set(x["STATE"] for x in votes
              if x["STATE"] != "District of Columbia"):
    margins[ss] = winner(votes, ss)[2] - \
        second(votes, ss)[2]
num_margin = argmax(margins)
print("Largest percentage margin %f in %s" %
      (max(margins.values()), num_margin))
What state had the largest numerical third party vote (and for whom)?
Johnson had largest third party vote in California with 143221
all_third_vote = {}
top_third_vote = {}
for ss in set(x["STATE"] for x in votes):
    try:
        all_third_vote[ss] = \
            dict((x["LAST NAME"],
                  parseint(x["GENERAL RESULTS"]))
                 for x in votes
                 if x["STATE"] == ss
                 and x["LAST NAME"] not in kMAJOR
                 and x["LAST NAME"])
    except ValueError:
        all_third_vote[ss] = {}

    if all_third_vote[ss]:
        top_third_vote[ss] = max(all_third_vote[ss].values())
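The listing relies on `parseint` and `kMAJOR`, which the slides do not define, and it stops before reporting the answer. The sketch below fills both gaps under stated assumptions: `kMAJOR` is taken to be a set of major-party last names, `parseint` is taken to strip period separators, and the rows are invented. Unlike the slide, it catches `ValueError` per row, so one bad cell does not discard a whole state.

```python
# Assumed definitions; the slides use these names without showing them.
kMAJOR = {"Obama", "Romney"}


def parseint(text):
    """Convert a count with period thousands separators to int."""
    return int(text.replace(".", ""))


# Invented rows; Stein's blank result in Alpha triggers ValueError.
votes = [
    {"STATE": "Alpha", "LAST NAME": "Obama", "GENERAL RESULTS": "1.200"},
    {"STATE": "Alpha", "LAST NAME": "Romney", "GENERAL RESULTS": "800"},
    {"STATE": "Alpha", "LAST NAME": "Johnson", "GENERAL RESULTS": "50"},
    {"STATE": "Alpha", "LAST NAME": "Stein", "GENERAL RESULTS": ""},
    {"STATE": "Beta", "LAST NAME": "Obama", "GENERAL RESULTS": "900"},
    {"STATE": "Beta", "LAST NAME": "Romney", "GENERAL RESULTS": "1.000"},
    {"STATE": "Beta", "LAST NAME": "Johnson", "GENERAL RESULTS": "75"},
    {"STATE": "Beta", "LAST NAME": "Stein", "GENERAL RESULTS": "20"},
]

third = {}
for x in votes:
    name = x["LAST NAME"]
    if not name or name in kMAJOR:
        continue
    try:
        third[(x["STATE"], name)] = parseint(x["GENERAL RESULTS"])
    except ValueError:
        # Blank or non-numeric result: skip just this row.
        continue

# Pick the (state, candidate) pair with the highest count.
(state, name), count = max(third.items(), key=lambda kv: kv[1])
print("%s had largest third party vote in %s with %i" % (name, state, count))
```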
What state had the largest third party percentage vote (and for whom)?
Johnson had largest third party percent in New Mexico with 3.55
all_third_vote = {}
top_third_vote = {}
for ss in set(x["STATE"] for x in votes):
    try:
        all_third_vote[ss] = \
            dict((x["LAST NAME"],
                  parseint(x["GENERAL RESULTS"]))
                 for x in votes
                 if x["STATE"] == ss
                 and x["LAST NAME"] not in kMAJOR
                 and x["LAST NAME"])
    except ValueError:
        all_third_vote[ss] = {}

    if all_third_vote[ss]:
        top_third_vote[ss] = max(all_third_vote[ss].values())
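The listing above ranks raw counts, so a percentage answer needs one more step. A possible approach, sketched here on invented rows, divides each third party count by the sum of that state's candidate counts. Using that sum as the denominator is an assumption; the real file may supply its own percentage or totals column. `kMAJOR` and `parseint` are again assumed definitions.

```python
from collections import defaultdict

kMAJOR = {"Obama", "Romney"}  # assumed set of major-party last names


def parseint(text):
    """Convert a count with period thousands separators to int."""
    return int(text.replace(".", ""))


# Invented rows for two made-up states.
votes = [
    {"STATE": "Alpha", "LAST NAME": "Obama", "GENERAL RESULTS": "1.200"},
    {"STATE": "Alpha", "LAST NAME": "Romney", "GENERAL RESULTS": "800"},
    {"STATE": "Alpha", "LAST NAME": "Johnson", "GENERAL RESULTS": "50"},
    {"STATE": "Beta", "LAST NAME": "Obama", "GENERAL RESULTS": "900"},
    {"STATE": "Beta", "LAST NAME": "Romney", "GENERAL RESULTS": "1.000"},
    {"STATE": "Beta", "LAST NAME": "Johnson", "GENERAL RESULTS": "75"},
    {"STATE": "Beta", "LAST NAME": "Stein", "GENERAL RESULTS": "25"},
]

# Keep only rows that name a candidate and carry a result.
rows = [(x["STATE"], x["LAST NAME"], parseint(x["GENERAL RESULTS"]))
        for x in votes if x["LAST NAME"] and x["GENERAL RESULTS"]]

# Denominator: all candidate votes cast in each state.
state_total = defaultdict(int)
for st, _, count in rows:
    state_total[st] += count

# Percentage share for every third party candidate.
third_pct = {(st, who): 100.0 * count / state_total[st]
             for st, who, count in rows if who not in kMAJOR}

(state, name), pct = max(third_pct.items(), key=lambda kv: kv[1])
print("%s had largest third party percent in %s with %.2f" % (name, state, pct))
```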
Summary
- Data are messy
- Easier with formatted data (e.g., CSV)
- Need basic data structures
- Check whether answers are reasonable
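One way to act on the last point is to cross-check two numbers that should agree. The sketch below compares the sum of each state's candidate rows against the state's reported total and lists any mismatches. It assumes that totals rows are the only rows with "TOTAL VOTES #" filled in and that candidate rows are meant to add up to that figure; both are assumptions about the file, and the rows are invented.

```python
from collections import defaultdict


def parseint(text):
    """Convert a count with period thousands separators to int."""
    return int(text.replace(".", ""))


# Invented rows; Beta's reported total is deliberately off by 100.
votes = [
    {"STATE": "Alpha", "GENERAL RESULTS": "1.200", "TOTAL VOTES #": ""},
    {"STATE": "Alpha", "GENERAL RESULTS": "800", "TOTAL VOTES #": ""},
    {"STATE": "Alpha", "GENERAL RESULTS": "", "TOTAL VOTES #": "2.000"},
    {"STATE": "Beta", "GENERAL RESULTS": "900", "TOTAL VOTES #": ""},
    {"STATE": "Beta", "GENERAL RESULTS": "400", "TOTAL VOTES #": ""},
    {"STATE": "Beta", "GENERAL RESULTS": "", "TOTAL VOTES #": "1.400"},
]

candidate_sum = defaultdict(int)
reported = {}
for x in votes:
    if x["TOTAL VOTES #"]:
        reported[x["STATE"]] = parseint(x["TOTAL VOTES #"])
    elif x["GENERAL RESULTS"]:
        candidate_sum[x["STATE"]] += parseint(x["GENERAL RESULTS"])

# States where the two figures disagree deserve a closer look.
mismatches = {st: (candidate_sum[st], reported[st])
              for st in reported if candidate_sum[st] != reported[st]}
print(mismatches)
```

A mismatch does not say which number is wrong, only that the parsing or the data needs a second look.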
Next Time ...
- Lecture: make sure to do the reading
- Probability foundations (if you found today boring ...)
- Math needed for the course (quiz likely)