DS504/CS586: Big Data Analytics -...

30
DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li Welcome to Time: 6:00pm – 8:50pm R Location: AK 232 Fall 2016

Transcript of DS504/CS586: Big Data Analytics -...

Page 1: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning

Prof. Yanhua Li

Welcome to

Time: 6:00pm – 8:50pm R Location: AK 232

Fall 2016

Page 2: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Ocean Biodiversity Informatics, Hamburg

29 Nov

2004

The Data Equation Oceans of

Data

Praia de Forte, Brazil

Page 3: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Data Quality Dimensions

v  Accuracy §  Errors in data Example:”Jhn” vs. “John”

v  Currency §  Lack of updated data Example: Residence (Permanent) Address: out-dated vs. up-to-dated

v  Consistency §  Discrepancies into the data Example: ZIP Code and City consistent

v  Completeness §  Lack of data §  Partial knowledge of the records in a table

Page 4: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Ocean Biodiversity Informatics, Hamburg

29 Nov

2004

Geographic outliers - GIS

Country, State, named district, etc.

Gazetteer of Brazilian localities

Page 5: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

What do we mean by ‘Data Quality’?

An essential or distinguishing characteristic necessary for data to be fit for use.

SDTS 02/92 The general intent of describing the quality of a

particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data.

Chrisman, 1991

Page 6: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Loss of data quality

Loss of data quality can occur at many stages: v  At the time of collection v  During digitisation v  During documentation v  During storage and archiving v  During analysis and manipulation v  At time of presentation v  And through the use to which they are put

Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an

effective contributor. (Redman 2001).

Page 7: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Data Cleaning

v  Data cleaning tasks

§ Accuracy: Smooth out noisy data

§ Currency: Update the records

§ Consistency: Correct inconsistent data

§ Completeness: Fill in missing values

Page 8: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map matching

Page 9: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matchingv  Problem: (Sampled data)

§ GPS trajectory = a sequence of GPS locations with time stamps

§  Map a GPS trajectory onto a road network §  a sequence of GPS points à a sequence of road segments

Page 10: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Spatial Data

v 

e1 e2e3

e3.start

e3.end

e4

Page 11: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-Matching

v  Why it is important § A fundamental step in many transportation

applications • Navigation and driving •  Traffic analysis •  Taxi dispatching and recommendations

§ Examples: •  Find the vehicles passing Institute Road • Calculate the average travel time from WPI

campus to MIT campus • When will the Bus 3 arrive at stop Highland St &

North Ashland St

Page 12: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-Matching

v  Simple solution for high-sampling-rate data § Weighted distance

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 13: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-Matching (for low sampling rate)v  Why difficult

? ?

??

??

b) Overpass(a) Parallel roads c) Spur

Page 14: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-Matching

v  According to the additional information used §  Geometric §  Topological §  Probabilistic §  Advanced techniques

v  According to the range of sampling points §  Local/incremental §  Global

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transaction on Intelligent Systems and Technology, 6, 3, 2015.

Page 15: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching

v  Insights

§  Consider both local and global information

§  Incorporating both spatial and temporal features

pa

pb

pc

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 16: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 17: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matchingv  Solution (incorporating spatial information)

§  (Observation Probability) Model local possibility

•  (Transmission Probability) Considering context (global)

§  Spatial analysis function

𝑒𝑖

3 𝑒𝑖

1

𝑒𝑖2

𝑐𝑖3

𝑝𝑖 𝑐𝑖

2

𝑐𝑖1

𝑐𝑖2 𝑝𝑖−1

𝑝𝑖 𝑐𝑖1

𝑝𝑖+1

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

N(cji ) =1p2⇡�

e(x

j

i

�µ)2

2�2

V (cti�1 ! csi ) =di�1!i

w(i�1,t)!(i,s)

Fs(cti�1 ! csi ) = V (cti�1 ! csi ) ⇤N(csi )

Page 18: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching•  Solution (Cosine Similarity)

•  Temporal analysis function (Considering temporal information) •  Shortest path is used.

Pi-1Pi

A Highway

A Service Road

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Ft(cti�1 ! csi ) =

Pku=1(e

0u.v ⇥ v̄(i�1,t)!(i,s))qPk

u=1(e0u.v)

2 ⇥qPk

u=1 v̄2(i�1,t)!(i,s)

v̄(i�1,t)!(i,s) =

Pku=1 lu

�ti�1!i

Page 19: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching •  Aggregating

–  Spatial and temporal information –  Local and global information

•  Dynamic programing

•  Spatio-temporal function

𝑐11

𝑐12

𝑐13

𝑐11 → 𝑐21

𝑐13 → 𝑐22

𝑐21

𝑐22

𝑐𝑛1

𝑐𝑛2

P1's candidates P2's candidates Pn's candidates

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

F (cti�1 ! csi ) = Fs(cti�1 ! csi ) ⇤ Ft(c

ti�1 ! csi ), 2 i n

Page 20: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching •  Path Selection

•  Dynamic programing

𝑐11

𝑐12

𝑐13

𝑐11 → 𝑐21

𝑐13 → 𝑐22

𝑐21

𝑐22

𝑐𝑛1

𝑐𝑛2

P1's candidates P2's candidates Pn's candidates

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

P = argmaxPcF (Pc)

F (Pc) = N(cs11 ) +nX

i=2

F (csi�1

i�1 ! csii )

Page 21: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching Example •  Path Selection

Page 22: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 23: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Localized ST-Matching Strategy •  Path Selection

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Page 24: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Evaluations

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

AN =#Correctly Matched Road Seg

#all road segments

AL =

PLength Matched Road Seg

Length of the trajectory

Page 25: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

25

Course Project

Page 26: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Project 1 directions What is your project goal? v  What new story you want to tell? v  New contents to sample? v  New sampling methods via API? v  New statistics of YouTube, view count distribution,

dynamics, or # uploaders/active users? v  Analysis on other websites, Twitter, Facebook,

Foursquare, Yelp, with API interfaces Broad impacts? (Keep in mind) v  How YouTube is evolving?

§  More business or personal videos? How to distinguish the two §  How special events, e.g., NBA game, breaking news, affect the

uploading, viewing behaviors v  Online Marketing, advertising?

Page 27: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

27

A Few Words on Course Project I Project I: Collecting and Measuring Online Data v  Team work; each team 3-4 students. v  Starting date: Week 3 (9/8 R) v  Proposal Due: Week 4 (9/15 R) 2 pages roughly v  Due date/time: Before Class on Week 8 (10/13 R) v  Presentation date/time: Class on Week 8 (10/13 R)

§  Selected teams only v  Requiring Programming in C/C++, Java, Python, and, etc

v  Choose one online site/service with APIs to download data, or use existing datasets.

v  Examples: v  (1) estimate site statistics, or v  (2) applying machine learning methods to predict future trends, or v  (3) perform time-series analysis to capture dynamic patterns, v  or something else, as long as your work can potentially bring research value to

the community.

Page 28: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

Transport Layer 3-28

A Few Words on Course Project I v  Group meeting with Prof Li by appointment)

§ Week 3 (9/8 R), Starting date § Week 4 (9/15 R), Proposal Due: 2 pages roughly (upload it

to discussion board) § Week 5 (9/22 R), Methodology due (upload it to discussion

board) § Week 6 (9/29 R), Results due (upload it to discussion board) § Week 7 (10/6 R), Conclusion due (upload it class discussion

board) § Week 8 (10/13 R), Final Report due at 11:59pm EST & Self

and Cross-evaluation due at 11:59pm EST § Week 8 (10/13 R), In-class Presentation (10 min) (Selected

teams only)

Page 29: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

29

Course Project II v  Projects will be in groups!

v 3-5 students per group, depending on enrollment

v  Topics on your choice (related to big data analytics) v Application-driven v Fundamental data analytics research v Data sources on course website

http://wpi.edu/~yli15/courses/DS504Fall16/Resources.html

Talk to me once you have an idea.

Page 30: DS504/CS586: Big Data Analytics - users.wpi.eduusers.wpi.edu/~yli15/courses/DS504Fall16/slides/BDA-2-Cleaning.pdfDS504/CS586: Big Data Analytics ... • Dynamic programing !! 1 1!

30

Next Class: Data Management

v  Do assigned readings before class v  Be prepared, read and review required readings on your own in

advance! v  Do literature survey: find and read related papers if any v  Bring your questions to the class and look for answers during

the class.

v  Submit reviews/critiques v  In myWPI before class v  Bring 2 hardcopies to the class v  Hand in one copy, and keep one copy with you.

Review Writing: http://users.wpi.edu/~yli15/courses/DS504Fall16/Critiques.html v  Attend in-class discussions

v  Please ask and answer questions in (and out of) class! v  Let’s try to make the class interactive and fun!