Big Data in Production Environments

Copyright by Brockhaus GmbH, all rights reserved, unauthorized reproduction prohibited

Proposal for establishing modern concepts of data storage and analytics for production data

Current situation

Current situation

Just the Numbers

✔ Approx. 270,000 sensors in one installation (AUDI Györ), but only 17,000 sensors are currently tracked

✔ Many 'unsynchronized' control desks, or rather their data

✔ A lot of duplicated data (because of the 'home-grown' failover/replication concept)

✔ No historical data (because the amount of data is overwhelming and cannot be handled)

✔ Problems with scalability

Current situation

Outdated technologies

✔ Trend Server is developed in Delphi: Who develops in that?

✔ Microsoft SQL Server: not fast enough

✔ Technology breaks between the several technologies in use

Bottlenecks

✔ Queries are slow for more than 750k events

✔ No more than 7,500 CSV files

✔ CSV files and SQL Server are used for the same tasks

Current situation

Scalability and fault tolerance

✔ Only a few sensors can be saved

✔ IOM synchronization problems

✔ Buffered data is saved with the same timestamp

✔ Different IOMs save the same data with different timestamps

No integration / standalone application

✔ Data cannot be accessed from every location (control desk)

✔ Data cannot be recorded in case of failure

Big data and NoSQL

Why the relational model

… sometimes isn't enough

✔ Can't handle extremely large amounts of data (in the extreme: 15 petabytes of data at the Government of India)

✔ Hard to scale (esp. scaling out → adding nodes to handle the load)

✔ Hard to deal with 'unstructured' data due to strict data model

✔ The valuable transactional model is sometimes overkill

Dealing with data

… an awful lot of data

… petabytes

… and even beyond that

plus NoSQL

Dealing with data

Characteristics and value proposition

✔ Big Data gains momentum (data generates value)

➢ High data velocity

➢ Data variety

➢ Data volume

➢ Data complexity

✔ Continuous availability

✔ Data location independence

✔ Flexible data model (schemaless databases)

✔ Improved architecture and enhanced analytics

Problems with NoSQL

Document-oriented: MongoDB, CouchDB

Column store: Bigtable, HBase

Key-value: Cassandra, DynamoDB, Azure Table Storage, Riak, BerkeleyDB

Graph: Neo4j

Many players, several concepts, no one-size-fits-all approach and no standards

Why Cassandra?

Because people with the same problems have chosen it ...

“I can create a Cassandra cluster in any region of the world in 10 minutes. When marketing decides we want to move into a certain part of the world, we’re ready.”

Why Cassandra

Scalability

✔ Add nodes to scale

✔ Millions of operations

✔ Low latency in read/write operations

Why Cassandra

Availability

✔ Created to be distributed

✔ Resilient and flexible in the face of failures

✔ Different data centers (possibly in different parts of the world)

Why Cassandra

Replication

Why Cassandra

Sometimes things go wrong:

✔ Hardware fails

✔ Bugs

✔ Power outages

✔ Natural disaster

and then...

✔ Fast node recovery

✔ Automatic rebalancing when a node fails

✔ Transparent to the client
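
The idea behind replication and failures that stay transparent to the client can be illustrated with a deliberately simplified sketch. It assumes one token per node, an MD5-based hash and replicas placed on the next nodes clockwise on the ring; the class `Ring` and the node names are hypothetical, and a real Cassandra cluster uses its own partitioner and replication strategies.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is an assumption for this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each row lives on `replication_factor` consecutive nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # One token per node, sorted clockwise around the ring.
        self._ring = sorted((token(node), node) for node in nodes)
        self.down = set()  # nodes currently considered failed

    def replicas(self, row_key):
        """Return the nodes responsible for row_key, walking clockwise."""
        tokens = [t for t, _ in self._ring]
        start = bisect_right(tokens, token(row_key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def live_replicas(self, row_key):
        """Replicas that can still serve the row when some nodes are down."""
        return [node for node in self.replicas(row_key) if node not in self.down]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("1234ABCD:2013-04-03")
ring.down.add(owners[0])  # simulate a hardware failure
print(ring.live_replicas("1234ABCD:2013-04-03"))  # two replicas remain available
```

With a replication factor of 3, losing one node still leaves two copies of every row, which is why the client does not need to notice the failure.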

Why Cassandra

Easy to use

✔ Large ecosystem

✔ Well documented

✔ Full Java support

✔ SQL-like syntax

INSERT INTO sensor_by_day (sensor_id, date, event_time, value)
VALUES ('1234ABCD', '2013-04-03', '2013-04-03 07:01:00', '72F');

Time Series in Cassandra

Cassandra can store up to 2 billion columns per row, but if we store data every millisecond, a single row would not even hold a month's worth of data.

The solution is a pattern called row partitioning: adding data (for example, the date) to the row key limits the number of columns per device and row.

Almost no limits!
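
The arithmetic behind this slide can be checked with a small sketch. The 2 billion column limit is taken from the slide; the helper names `days_until_row_full` and `partition_key`, and the choice of one row per sensor and day, are assumptions made for illustration.

```python
from datetime import datetime
from typing import Tuple

MAX_COLUMNS_PER_ROW = 2_000_000_000   # limit quoted on the slide
MS_PER_DAY = 24 * 60 * 60 * 1000      # one sample per millisecond = 86,400,000 per day


def days_until_row_full(samples_per_day: int = MS_PER_DAY) -> float:
    """How many days of data fit into a single unpartitioned row."""
    return MAX_COLUMNS_PER_ROW / samples_per_day


def partition_key(sensor_id: str, event_time: datetime) -> Tuple[str, str]:
    """Row partitioning: add the date to the row key, so each row holds one day."""
    return (sensor_id, event_time.strftime("%Y-%m-%d"))


print(round(days_until_row_full(), 1))                             # 23.1
print(partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1, 0)))    # ('1234ABCD', '2013-04-03')
```

At one sample per millisecond a single row fills up after roughly 23 days, while a row keyed by sensor and day holds at most 86.4 million columns, far below the limit.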

Data analysis goals

✔ Low latency (interactive) queries on historical data: enable faster decisions

✔ Low latency queries on live data (streaming): enable decisions on real-time data

✔ Sophisticated data processing: enable “better” decisions (e.g. anomaly detection, trend analysis)

Spark ecosystem

Well integrated with Cassandra and includes:

✔ SQL-like interface

✔ Machine learning: algorithms that can learn from data, used for predictions (predictive maintenance: exploit patterns found in historical and transactional data to identify risks and opportunities)

✔ Streaming: real-time streaming data, e.g. sliding windows
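
What a sliding window over a sensor stream computes can be sketched in plain Python. This is not the Spark Streaming API; the class `SlidingWindowAverage` and the sample readings are hypothetical and only illustrate the concept.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the readings of the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value), oldest first
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Add a reading, evict expired ones and return the current average."""
        self._events.append((timestamp, value))
        self._total += value
        # Drop readings that have fallen out of the window.
        while self._events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


window = SlidingWindowAverage(window_seconds=10)
print(window.add(0, 10.0))   # 10.0
print(window.add(5, 20.0))   # 15.0
print(window.add(12, 30.0))  # 25.0 (the reading at t=0 has expired)
```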

Use Cases

Use Case

✔ Data from the oven is collected

✔ Cassandra stores the data sequentially

✔ TrendPage reads the data sequentially for faster graph creation

Use Case

Data Model to support queries

✔ Store data per oven

✔ Store time series in order: first to last

✔ Get all data for one oven

Queries needed

✔ Get data for a single date and time

✔ Get data for a range of dates and times

Cassandra is really good for time-series data because you can write one column for each period in your series and then query across a range of time using sub-string matching.

This is best done using columns for each period rather than rows, as you get huge IO efficiency wins from loading only a single row per query.

– MyDrive Telemetry (15 billion records on average)
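
The two queries listed above (a single point in time and a time range) can be modelled in memory as follows. This is a sketch of the wide-row idea only, not Cassandra code: the class `WideRowStore`, the oven id and the readings are hypothetical, and timestamps are ISO-formatted strings so that they sort chronologically.

```python
from bisect import bisect_left, bisect_right, insort


class WideRowStore:
    """One row per (oven, date); columns inside a row are kept sorted by time."""

    def __init__(self):
        self._times = {}   # row key -> sorted list of event times
        self._values = {}  # row key -> {event time: value}

    def insert(self, oven_id, date, event_time, value):
        key = (oven_id, date)
        values = self._values.setdefault(key, {})
        if event_time not in values:
            insort(self._times.setdefault(key, []), event_time)
        values[event_time] = value  # a later write for the same time overwrites

    def at(self, oven_id, date, event_time):
        """Query 1: data for a single date and time (None if absent)."""
        return self._values.get((oven_id, date), {}).get(event_time)

    def between(self, oven_id, date, start, end):
        """Query 2: data for a time range, read as one contiguous slice of a row."""
        key = (oven_id, date)
        times = self._times.get(key, [])
        lo, hi = bisect_left(times, start), bisect_right(times, end)
        return [(t, self._values[key][t]) for t in times[lo:hi]]


store = WideRowStore()
store.insert("oven-1", "2013-04-03", "2013-04-03 07:02:00", 181.0)
store.insert("oven-1", "2013-04-03", "2013-04-03 07:01:00", 180.0)
store.insert("oven-1", "2013-04-03", "2013-04-03 07:03:00", 182.5)
print(store.between("oven-1", "2013-04-03", "2013-04-03 07:01:30", "2013-04-03 07:03:00"))
```

Because the columns are stored in time order, a range query touches a single row and reads one contiguous slice, which is the I/O advantage the quote refers to.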

Time Series in Cassandra

The data model

✔ Row Key is Time Identifier

✔ Column Values are Events

✔ Column Values are Measurements

✔ Rows Can be Very Wide

1 s schema

✔ Faster data storage in the database

1 min schema

✔ Avoids network overload

✔ Data can be compressed (prior to sending)

✔ Extra data such as min, max and avg can be calculated before storing

✔ Increases data retrieval speed
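
The pre-aggregation mentioned for the 1 min schema could look roughly like the following sketch. It assumes readings arrive as (epoch seconds, value) pairs; the function name `aggregate_per_minute` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(samples):
    """Group (epoch_seconds, value) readings into one-minute buckets.

    Returns {minute_start: {"min", "max", "avg", "count"}}, so that only one
    record per minute has to be sent over the network and stored.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % 60].append(value)
    return {
        minute: {
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
            "count": len(values),
        }
        for minute, values in sorted(buckets.items())
    }


readings = [(0, 1.0), (30, 3.0), (60, 5.0)]
print(aggregate_per_minute(readings))
```

The trade-off is the usual one: the 1 s schema keeps every raw reading, while the 1 min schema sends and stores less but only retains the summary values that were computed up front.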

Architectural options

Architectural options

Unreplicated databases

Architectural options

Redundant and replicated databases

Architectural options

Replicated databases plus analytics

What is next

Discussion

Enough propaganda ... Get in touch!

Contact information:

Brockhaus Consulting GmbH
Gustav Stresemann Ring 1
D - 65189 Wiesbaden
Germany

Phone: +49-611-97774-332
Fax: +49-611-97774-432

Web: www.brockhaus-gruppe.de
Mail: [email protected]