Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and Open Source Tools for Real-time Big Data
● Concepts & Techniques: “Thinking with Lambda”
● Case studies in Practice
Trieu Nguyen - http://nguyentantrieu.info or @tantrieuf31
Principal Engineer at eClick Data Analytics team, FPT Online
All contents and thoughts in these slides are my own subjective ideas, compiled from the community.
Just a little introduction
● 2008: Java developer; developed a social trading network for a small startup (Yopco)
● 2011: worked at FPT Online as a software engineer on the Banbe project, RESTful API for the VnExpress mobile app
● 2012: joined Greengar Studios for 6 months, scaling the backend API for mobile games (iOS, Android)
● 2013: back at FPT Online, doing R&D on Big Data & Analytics and developing the new core Analytics Platform (on the JVM)
Contents for this talk
● The lessons from history
● Problems in practice
● What is the Lambda Architecture?
● Why the Lambda Architecture for real-time big data?
● Open source technology stack
● Lambda in practice (mobile data and web data)
● Lessons I have learned
● Questions & Answers
History?
The best way to predict the future is to look at the past and the present.
Big data is a buzzword for old problems
Explaining Big Data
http://www.youtube.com/watch?v=7D1CQ_LOizA
Learning ?
Working ?
Big Data + Old History
http://www.youtube.com/watch?v=tp4y-_VoXdA
This is the most valuable thing!
This is Big DATA
“We can't solve problems by using the same kind of thinking we used when we created them.” - Albert Einstein
Think more with Lambda and Reactive
Where Big Data can be used
BBC Horizon 2013 The Age of Big Data
http://www.youtube.com/watch?v=RE0ITQ7XQjM
Google’s mission is to organize
the world’s information and make it
universally accessible and useful.
Organize the world’s information?
How did Google scale their search engine?
How does Hadoop really work?
http://stackoverflow.com/questions/6087834/how-scalable-is-mapreduce-in-the-original-functional-languages
Trends of Now and the Future
● MapReduce Programming
● Reactive Programming
● Functional Programming
● Streaming Computation
=> All are just special cases of Lambda
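To make the claim about MapReduce concrete, here is a minimal sketch of a MapReduce-style count written with nothing but plain functions. The log lines and the names `map_phase` and `reduce_phase` are made up for illustration; a real Hadoop job distributes these two functions across machines, but the shape is the same.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines: each word is one tracked user action.
log_lines = [
    "click impression click",
    "impression click",
]


def map_phase(line):
    """Map: turn one raw line into partial counts."""
    return Counter(line.split())


def reduce_phase(left, right):
    """Reduce: merge two partial results into a new one (no mutation)."""
    return left + right


# The whole job is a composition of functions over the data.
totals = reduce(reduce_phase, map(map_phase, log_lines), Counter())
print(dict(totals))
```

Because `map_phase` looks at one line at a time and `reduce_phase` is associative, the same two functions could in principle run on many workers and be merged in any grouping.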
So what is the λ (Lambda) Architecture ?
The Lambda Architecture:
● applies the λ (Lambda) philosophy to the design of big data systems
● starts from the equation “query = function(all data)”, which is the basis of all data systems
● was proposed by Nathan Marz (http://nathanmarz.com/), a software engineer from Twitter, in his book “Big Data”
● is based on three main design principles:
○ human fault-tolerance – the system is not susceptible to data loss or data corruption, because at scale such damage could be irreparable. (BUGS?)
○ data immutability – store data in its rawest form, immutable and in perpetuity. (INSERT / SELECT / DELETE but no UPDATE!)
○ recomputation – with the two principles above, it is always possible to (re)compute results by running a function on the raw data.
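The three principles can be sketched in a few lines. This is only an illustration under simple assumptions: the `Event` fields, the `MasterDataset` class and the `clicks_per_user` function are hypothetical names, and an in-memory tuple stands in for real distributed storage.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)  # frozen: a stored event can never be updated
class Event:
    user: str
    action: str
    timestamp: int


class MasterDataset:
    """Append-only store: events can be added and read, never modified."""

    def __init__(self) -> None:
        self._events: Tuple[Event, ...] = ()

    def append(self, event: Event) -> None:
        # Build a new tuple instead of changing the old one in place.
        self._events = self._events + (event,)

    def all_data(self) -> Tuple[Event, ...]:
        return self._events


def query(function: Callable[[Tuple[Event, ...]], T], dataset: MasterDataset) -> T:
    """query = function(all data)"""
    return function(dataset.all_data())


def clicks_per_user(events: Tuple[Event, ...]) -> dict:
    """A pure function of the raw events, so it can be re-run at any time."""
    counts: dict = {}
    for event in events:
        if event.action == "click":
            counts[event.user] = counts.get(event.user, 0) + 1
    return counts


dataset = MasterDataset()
dataset.append(Event("alice", "click", 1))
dataset.append(Event("bob", "impression", 2))
dataset.append(Event("alice", "click", 3))

result = query(clicks_per_user, dataset)
print(result)
```

If `clicks_per_user` turns out to have a bug, the raw events are still intact: fix the function and run the query again. That is the recomputation principle, and it is what makes the system tolerant of human mistakes.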
Lambda in Practice
2 case studies from my experience
Case Study 1: Mobile Data
Monitor API Backend + System KPI
Problem: inside “mobile data”, what is the most valuable piece of information?
Backend System for mobile app
I applied “Lambda” here
Web vs Mobile App
Web: Visitors, Visits, Pageviews, Events
Mobile App: Users, Sessions, Events
Metrics: Cause and Effect
● Screen Size => App Design, UI/UX, Usability
● App version => Deployment, Marketing
● Connectivity => Code, User Experience
● Location => Marketing, User Behaviour
● OS => Marketing, Cost, Development
● Memory => User Experience
● Feature Session => How to engage app users
The data and the size, not too big for a small startup!
Where is the lambda?
I used Groovy + GPars (Groovy Parallel Systems) + MongoDB for fast parallel computation (actor model) on statistical data.
http://gpars.codehaus.org/
The GPars framework offers Java developers intuitive and safe ways to handle Java or Groovy tasks concurrently. It supports:
● Dataflow concurrency
● Actor programming model
● CSP (Communicating Sequential Processes)
● Agent - a thread-safe reference to mutable state
● Concurrent collection processing
● Composable asynchronous functions
● Fork/Join
● STM (Software Transactional Memory)
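As a rough illustration of concurrent collection processing on statistical data, here is a small sketch. It is not the Groovy + GPars code from the project: it uses Python's standard `concurrent.futures` as an analogy, and the session durations and the `summarize` helper are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical session durations in seconds, grouped by app version.
sessions_by_version = {
    "1.0": [30, 45, 60],
    "1.1": [20, 40],
    "2.0": [90],
}


def summarize(item):
    """Pure function: one (version, durations) pair in, one summary out."""
    version, durations = item
    return version, {"sessions": len(durations), "avg_seconds": mean(durations)}


# Each group is summarized independently, so the work can be handed to a pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    summary = dict(pool.map(summarize, sessions_by_version.items()))

print(summary)
```

The point is the shape, not the speed: because `summarize` shares no state between groups, the collection can be processed by any number of workers. In CPython, threads mainly help with I/O-bound work such as database reads.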
Mobile Apps => Backend APIs => Statistics => Find the Trends & Insights?
Reactive Data Analytics for Mobile Apps
It means real-time recommendation by:
➔ context (location, time)
➔ user profile (preferences, level, ...)
Big Data on Small Devices: Data Science goes Mobile
http://strataconf.com/strata2013/public/schedule/detail/27605
Case Study 2: Web Data
● Real-time Data Analytics
● Monitoring Stream Data (Reactive)
http://eclick.vn
At eClick we have
● 30~40 GB of logs in the stream
● 10~20 GB of bandwidth
just for tracking user actions (click, impression, ...) in ONE day!
At eClick we must check campaigns in near-real-time (seconds)!
At eClick we have many types of log (video, web, mobile, system logs, ad-campaign, articles, ...)
“lambda architecture” proposed by @nathanmarz
[Figure: the open-source lambda architecture at eClick. Diagram labels: Internet, Netty Http Server, Kafka, Storm, Redis, Hadoop Tools, KPI Report, Akka Workers, TCP Connection]
The big-data technology stack
● Netty (http://netty.io/): a framework that uses the reactive programming pattern to make scaling HTTP systems easier, by JBoss (http://www.jboss.org)
● Kafka (http://kafka.apache.org/): publish-subscribe messaging rethought as a distributed commit log, open sourced by LinkedIn
● Storm (http://storm-project.net/): a framework for distributed real-time computation, by Twitter
● Redis (http://redis.io/): an advanced in-memory key-value NoSQL database; all the fast statistical computations happen here
● Groovy: the scripting layer on the JVM, for ad-hoc queries on Redis
● Hadoop ecosystem: HDFS, Hive, HBase for batch processing
● RxJava (https://github.com/Netflix/RxJava): a library for composing asynchronous and event-based programs
● Hystrix (https://github.com/Netflix/Hystrix): latency and fault tolerance for distributed systems
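How a batch layer and a real-time layer combine in a lambda architecture can be sketched without any of the real tools. Everything below is an assumption made for illustration: the `LambdaCounter` class is hypothetical, a Python list stands in for the raw log storage, and two `Counter` objects stand in for the batch output and the in-memory real-time store. It does not use the Kafka, Storm, Redis or Hadoop APIs.

```python
from collections import Counter


class LambdaCounter:
    """Click counts per campaign, split into a batch view and a real-time view."""

    def __init__(self):
        self.master_log = []            # append-only raw events (stand-in for batch storage)
        self.batch_view = Counter()     # recomputed from the whole log on each batch run
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def ingest(self, campaign):
        """Speed layer: record the raw event and update the real-time view."""
        self.master_log.append(campaign)
        self.realtime_view[campaign] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all raw data, then reset the speed layer."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view = Counter()

    def query(self, campaign):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[campaign] + self.realtime_view[campaign]


counter = LambdaCounter()
for campaign in ["summer-sale", "summer-sale", "new-app"]:
    counter.ingest(campaign)

before_batch = counter.query("summer-sale")  # answered by the real-time view only
counter.run_batch()
counter.ingest("summer-sale")                # arrives after the batch run
after_batch = counter.query("summer-sale")   # batch view + real-time view
print(before_batch, after_batch)
```

A production system has to handle events that arrive while a batch run is in progress, which this sketch ignores. The design choice it shows is that the real-time view only needs to cover the gap since the last batch run, so any error in it is overwritten by the next recomputation.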
My new ideas for the future
Connecting the active functor pattern + reactive programming + stream computation + in-memory computing to make:
● real-time data analytics easier
● better recommendation systems
● big data more profitable
More Information:
● http://activefunctor.blogspot.com/ (a special case of Lambda that actively searches for the best connections to form an optimal topology), based on ideas from my internship at DRD with my advisor.
● Can a function be persistent (stored as data), distributed in a cluster (cloud), and reactive to the right data (best value in the network)?
● http://www.reactivemanifesto.org/ (reactive pattern)
Lessons
What I have learned from Lambda and the Big Data World
What I have learned
● Study lambda and read some books
● Ask questions => analytics => Profit & Value
● Collect any data you can, and learn what is inside!
● Implement it! Use the right tools for the right jobs.
● Turn your data into things everyone can "look & feel"
read papers
Study the “lambda”
I studied Haskell in 2007 with Dr. Peter Gammie (http://peteg.org/) during my internship at DRD (a non-profit organization).
● Imperative programs will always be vulnerable to data races because they contain mutable variables.
● There are no data races in purely functional languages because they don't have mutable variables.
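The contrast between the two styles can be sketched as follows. This is an illustration in Python rather than Haskell, and the numbers, the `state` dictionary and the `add_chunk` helper are invented for the example. The first half shares one mutable value between threads and therefore needs a lock; the second half has nothing shared to protect.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

events = list(range(1, 1001))
chunks = [events[i:i + 250] for i in range(0, len(events), 250)]

# Imperative style: all threads update one shared mutable value,
# so every update must be guarded by a lock to avoid a data race.
state = {"total": 0}
lock = threading.Lock()


def add_chunk(chunk):
    for value in chunk:
        with lock:
            state["total"] += value


threads = [threading.Thread(target=add_chunk, args=(chunk,)) for chunk in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Functional style: each worker returns a new value and nothing is shared,
# so there is no mutable variable to race on and no lock to forget.
with ThreadPoolExecutor() as pool:
    pure_total = sum(pool.map(sum, chunks))

print(state["total"], pure_total)
```

Both halves compute the same sum. The difference is that the first is only correct as long as nobody forgets the lock, while the second stays correct by construction.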
Reading some books
Improve your business knowledge!
=> read Behavioral Economics books
http://www.goodreads.com/shelf/show/behavioral-economics
Collect the data ?
Using your imagination matters more than just the knowledge you have
Think more about the Butterfly Effect!
“Logic will get you from A to Z;
imagination will get you
everywhere.” - Albert Einstein
Use your imagination with data analytics, not just logic
Learn Data Visualization
Questions & Answers
The link to these slides:
● http://nguyentantrieu.info/blog/lambda-architecture-and-open-source-tools-for-real-time-big-data/
More useful resources:
● http://nguyentantrieu.info/blog
● http://www.mc2ads.com