Big Data n Analytics
-
Upload
somil-asthana -
Category
Documents
-
view
150 -
download
0
Transcript of Big Data n Analytics
WHAT IS BIG DATA ?
According to Wiki “Big Data is a term for data
sets that are so large or complex that traditional
processing applications are inadequate “
To big to fit on a single machine RAM
Painfully show to process on a single machine
How Large is Big Data
GB (Any application )
TB ( call logs, user click information )
PB ( facebook , linked-in)
But that’s
WHAT IS (REALLY) BIG DATA ?
Big Data is mostly about Human Behaviour
What do people do ?
Where do they go ?
What do they buy ?
As a consequence, these observations are
recorded.
To predict what people will do ?
BIG DATA TYPES
What constitutes Big Data
Structured Data
Semi-Structured Data
Unstructured Data
Contradictory to popular belief
Big Data is NOT Hadoop or its variant
Big Data is NOT Cloud Computing
Big Data is NOT Analytics
VISUALIZE BIG DATA
There are two ways to look at it
...
.....
...
Field
1
Field
2
Field
3
Field
4
Field
X
Field
Y
Field
Z
Field
A
STATISTICAL CHALLENGE INHERENT IN
BIGDATA
Consider Linear Regression : basically fitting a
line on plotted data.
STATISTICAL CHALLENGE INHERENT IN
BIGDATA
Essentially what we are doing in Linear Regression
mathematically is minimizing R2
• Trying to find that Best
Fits and help predict
ESSENTIAL AND HIDDEN IN BIG DATA
What ? TRUTH
Consider rating data ( incorrect, biased )
Machine dysfunction ( incorrect, biased )
Human Error ( incorrect, biased )
Statistical Issue ( Model Over-fitting )
Validation & Verification
More efforts
CHALLENGES WITH BIG DATA IN
PRODUCTION SYSTEMS
Expect Out of Memory Error
Expect Network Failures
Expect Disk Failures
Expect I/O Performance Bottlenecks
Lack of Programming Skills
Statistical Model Fail
INDUSTRIAL DISSECTION OF BIG DATA
ANALYTICS MARKET
There are 3 different sub markets of Big Data
Infrastructure
Software
Services
BIG DATA INFRASTRUCTURE
Examples
Amazon AWS
Microsoft Azure
IBM SoftLayer
Rackspace Open Cloud
Google Compute Engine
BIG DATA INFRASTRUCTURE ( IAAS )
consist of
computing, storage, networking and data centre infrastructure and maintenance
IAAS Feature
On-Demand Use of Resource ( hence scalable )
Automation of Admin Tasks ( security to machine maintenance )
Facilitate platform ( which can be programmed )
Management Tools ( Backup, High Availability )
Load balancing
Machine Scaling ( Auto Scaling )
BI System
Data Warehouse
Work flow System
Machine Learning Tools & Software
BIG DATA INFRASTRUCTURE: WHY DO WE
CARE
Study Done ( courtesy IBM )
375 MB Data consumed per House Hold
20 Hours of You Tube Video uploaded per 60 sec
24 PB processed by Google
50 Million tweets per Day
700 Million minutes Spend on Face Book per Month
1.3 ExaBytes received or sent by Mobile Internet
User
72.9 Products ordered on Amazon per second.
BIG DATA SOFTWARE / PLATFORM
Distinguished by
Commercial Software (SAS, SAP, HANA, Tableau,
TeraData, SPSS, Talend)
Open Source Software ( Hadoop, Apache Spark,
Elastic Search )
Enterprise wrapping around or Extension of Open
Source Software (Cloudera Hadoop, Map R,
HortonWorks)
BIG DATA MARKET FRAGMENTED
Segments that provide Data Marts capabilities:
Data Sift & GNIP ( store tons of tweets )
Zomato & Indian Mart ( Directory Listing )
FreeBase ( acquired by Google, Knowledge Store)
Open Data initiatives
Infrastructure Extensions
MongoDB, Neo4j, DynamoDB, MarkLogic
SQL VS NO SQL
SQL supports ACID property ie Atomic
Consistent Isolated Durable ( require locks &
syncronization )
No SQL supports BASE ie Basic Availability,
Soft-state, Eventual consistency.
REDIS NOSQL
Key Space of the hashmap spread across numerous
machines
To load balance.
Introduce Redundancy.
Supports storage in form of
Strings
Hashes
Lists
Sets
Sorted sets
COLUMN NOSQL
4 Building Blocks
All data stored in the same column is of the same
type.
Row key used to dynamically partition the data
Load balancing
Locality of reference
MONGODB NOSQL
Document Database
Internally store data as key-pair value as JSON
However, support indexing for fast lookup and
aggregation.
Cannot Perform Joins
Find() query
NEO4J NOSQL
Unique : stores, process and query connections
Network Graph
Nodes
Relationship
Properties
Exploit Structural Properties of Network Graph
MANIPULATING BIG DATA
Traditional Approach Fails
SELECT COUNT(column_name) FROM table_name;
SELECT cli_detail FROM table_name where
cli_number = X
Approach Possible
Index Data fields
Use Probabilistic Data Structure
ETL PROCESS
Source 1
Source 2
Source 3
Source 4
Source 5
Data Volume Reduces
Data Complexity Increases
Extract Transform Load Destination
ETL DETAILS
Extraction
Annotation / Enriching
DeDuping
Cleaning
Transformation
Processing
Aggregating
Loading
Storing in easy consumable form / further processing.
TWO TYPES BIG DATA APPLICATIONS Batch Processing
Can handle Large Volume of Data
Can be used to implement iterative algorithms
Can reschedule failed job
Real-time Processing
The Data needs to be processed in buckets
Need to process the data at once
BIG DATA ARCHITECTURE BREAK-DOWN
(1)
Data Repository /
Processing Unit
Coll
ect
ion
Un
it
Sta
gin
g S
erv
er
BI
– Suitable for Business Intelligence
Applications
BIG DATA ARCHITECTURE BREAK-DOWN
(2)
Data Repository /
Processing Unit
Coll
ect
ion
Un
it
Mod
el
Tra
inin
g
Cla
ssif
ica
tion
/ S
egm
en
tati
on
– Suitable for Machine Learning Applications
Machine Learning
(Offline) Model
User
APACHE KAFKA A Pipe which handles hundreds of megabytes of reads and
writes per second
Scalable Data Streams are partitioned and spread over a cluster of
machines
Durable Persisted on disk
Replicated within the cluster
Feeds of messages in
categories called topics by
producers consumed by
Consumers managed by
a broker
APACHE KAFKA Maintains Strict Sequence
Essentially, consumers can retrieve data in two form
Queue
Publish – Subscriber Queue
QUESTIONS
Contact Address: [email protected]