Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time...


Page 1: Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights by Josh Baer

Shortening the Feedback Loop How Spotify’s Big Data Ecosystem Has Evolved to Produce Real-time Insights

Josh Baer ([email protected])

Note: Opinions expressed in these slides are the author’s and not necessarily those of Spotify.

Who am I?

• Technical Product Owner at Spotify

• Working with fast processing infrastructure

• Previously, built out Spotify’s 2,500-node Hadoop cluster

@l_phant

In 2008

• Spotify Launches

• Instant Access to a gigantic catalog of music

• Click to play: instantaneous!

Behind the Scenes: Days to Insights

Behind the Scenes

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)

Operational Monitoring

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

To leverage actionable insights, we need a faster feedback loop!

What is Spotify?

• Music Streaming Service

• Launched in 2008

• Premium and Free Tiers

• Available in 59 Countries

100+ Million Active Users

30+ Million Songs

1+ Billion Plays/Day

And we have Data

Hadoop at Spotify

• 2,500 Nodes

• 130 PB Capacity

• 120 TB Memory accessible by jobs

• 20K Jobs/Day

Apache Kafka at Spotify

• 500 Kafka-related machines

• 40 TB/day from logs
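
To put the daily log volume in perspective, it can be converted to an average rate. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and perfectly even traffic over 24 hours; the slide states neither, and real traffic would peak well above this average.

```python
# Assumptions (not from the slides): decimal units and evenly spread traffic.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(tb_per_day: float) -> float:
    """Convert a daily volume in TB to an average rate in MB/s."""
    bytes_per_day = tb_per_day * 10**12
    return bytes_per_day / SECONDS_PER_DAY / 10**6


kafka_rate = average_rate_mb_per_s(40)  # 40 TB/day, the figure on the slide
print(f"{kafka_rate:.0f} MB/s on average")
```

Under those assumptions the Kafka pipeline carries several hundred megabytes of log data per second, around the clock.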

“Real-Time” at Spotify

• Storm Topologies fed via Kafka

• Powering

✦ Ad Targeting

✦ Real-time recommendations

✦ Real-time stream counts

Migrating to the Cloud

In the Beginning…

• Spotify was almost completely on-premises/bare metal

• 2,500-node Hadoop cluster and over 10K machines in production at four globally distributed data centers

• Grew with users: from 1M in 2009 to over 100M in 2016

Why Move to the Cloud?

• Cloud providers have matured: costs have decreased, while reliability and the variety of services offered have increased

• Owning and operating physical machines is not a competitive advantage for Spotify

Why Google’s Cloud?

• We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage

Google Cloud “Primitives”

BigQuery

• Ad-hoc and interactive querying service for massive datasets

• Like Hive, but without needing to manage Hadoop and servers

• Leverages Google’s internal tech

✦ Dremel (query execution engine)

✦ Colossus (distributed storage)

✦ Borg (distributed compute)

✦ Jupiter (network)

Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

BigQuery vs. Hive

• Example Query: Find the top 10 songs by popularity in Spain during October

• BigQuery (1.50 TB processed): 108s

• Hive (15.5 TB processed): 2647s

Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark.
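
The slides do not show the query text, so the following is only a sketch of what a "top 10 songs in Spain during October" query might look like. The `plays` table, its columns, and the sample rows are hypothetical, "popularity" is assumed to mean play count, and SQLite stands in for BigQuery or Hive so that the example runs anywhere.

```python
import sqlite3

# Hypothetical schema and sample rows; the real Spotify tables are not in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (track TEXT, country TEXT, played_at TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("Safari", "ES", "2016-10-01"),
        ("Safari", "ES", "2016-10-02"),
        ("Closer", "ES", "2016-10-03"),
        ("Closer", "SE", "2016-10-03"),  # other country, filtered out
        ("Safari", "ES", "2016-11-01"),  # other month, filtered out
    ],
)

# Filter by country and month, count plays per track, keep the ten most played.
TOP_TRACKS_SQL = """
    SELECT track, COUNT(*) AS play_count
    FROM plays
    WHERE country = 'ES'
      AND played_at >= '2016-10-01' AND played_at < '2016-11-01'
    GROUP BY track
    ORDER BY play_count DESC, track
    LIMIT 10
"""
top_tracks = conn.execute(TOP_TRACKS_SQL).fetchall()
print(top_tracks)
```

The filter-group-sort-limit shape is the same in any SQL engine; what differs between the two systems on the slide is how much data has to be scanned to answer it.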

BigQuery vs. Hive (example #2)

• Example Query: Find the total hours of music listening in Spain during October

• BigQuery (780 GB processed): 33s

• Hive (15.5 TB processed): 969s

Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark.
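
The ratios implied by the two example queries can be worked out directly from the figures on the slides. The only assumption in this sketch is that 780 GB equals 0.78 TB (decimal units); as the slides note, the comparison is not a thorough benchmark.

```python
# Figures copied from the two example slides (seconds, terabytes processed).
examples = {
    "top 10 songs": {"bq_s": 108, "hive_s": 2647, "bq_tb": 1.50, "hive_tb": 15.5},
    "listening hours": {"bq_s": 33, "hive_s": 969, "bq_tb": 0.78, "hive_tb": 15.5},
}


def ratios(example):
    """Return (speedup, data-processed ratio) of Hive relative to BigQuery."""
    speedup = example["hive_s"] / example["bq_s"]
    scan_ratio = example["hive_tb"] / example["bq_tb"]
    return speedup, scan_ratio


for name, example in examples.items():
    speedup, scan_ratio = ratios(example)
    print(f"{name}: {speedup:.1f}x faster, {scan_ratio:.1f}x less data processed")
```

In both cases BigQuery finished roughly 25 to 30 times sooner while processing 10 to 20 times less data.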

Top 10 Tracks in Spain during October 2016

Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios

Time Spent Listening to Spotify by users in Spain during October

Nearly 10,000 Years!
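
Since the example query asked for total hours, it may help to convert the headline figure back into that unit. This sketch assumes an average year of 365.25 days; because the slide says "nearly" 10,000 years, the result is an upper bound rather than the exact query output.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, leap years included


def years_to_hours(years: float) -> float:
    """Convert a duration in years to hours."""
    return years * HOURS_PER_YEAR


print(f"{years_to_hours(10_000):,.0f} hours")
```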

BigQuery at Spotify

• Interactive and ad-hoc querying started moving to BigQuery as soon as the data was available in the cloud

• The pace of learning increases as the friction of asking questions decreases

Cloud Pub/Sub

• At-least-once, globally distributed message queue

• For high-volume publish-subscribe workloads with a low topic count (<10,000)

• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
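
At-least-once delivery means a subscriber can see the same message more than once, so consumers generally need to be idempotent. The sketch below simulates this in plain Python; it does not use the Cloud Pub/Sub client library, and the `(message_id, payload)` message shape is a simplification chosen for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

Message = Tuple[str, str]  # (message_id, payload); simplified, hypothetical shape


def consume_idempotently(messages: Iterable[Message],
                         handle: Callable[[str], None]) -> int:
    """Call `handle` once per unique message ID; return the number of duplicates skipped."""
    seen: Set[str] = set()  # a real consumer would need bounded, durable storage
    skipped = 0
    for message_id, payload in messages:
        if message_id in seen:
            skipped += 1  # redelivery of a message that was already processed
            continue
        seen.add(message_id)
        handle(payload)
    return skipped


# Simulated delivery in which message "m2" arrives twice.
delivered = [("m1", "play"), ("m2", "skip"), ("m2", "skip"), ("m3", "pause")]
processed = []
duplicates = consume_idempotently(delivered, processed.append)
print(processed, duplicates)
```

Tracking IDs in memory is the simplest possible approach; it is shown here only to make the delivery guarantee concrete.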

Cloud Pub/Sub at Spotify

• 800K events/second? No problem

• P99 latency of ingestion into Elasticsearch (ES): 500 ms

• Ingestion from globally distributed non-GCP datacenters is painless

Cloud Dataflow

• Managed Service for running batch and streaming jobs

• Unified API for batch and streaming mode

• Inspired by internal Google tools like FlumeJava and MillWheel

• Programming model open-sourced as Apache Beam (currently incubating)
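
The idea of one API for both modes can be illustrated without any Dataflow dependency: the same transform is written once and fed either a bounded collection or an unbounded source. The sketch below is plain Python, not the Dataflow or Beam API, and it leaves out what a real runner adds (windowing, triggers, distribution); every function name in it is made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def count_by_key(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield the running count per key after every event.

    The same function accepts a finite list (batch) or an endless generator (stream).
    """
    counts: Dict[str, int] = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield dict(counts)


def final_result(snapshots: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Batch mode: drain the bounded input and keep only the last snapshot."""
    last: Dict[str, int] = {}
    for last in snapshots:
        pass
    return last


def live_events() -> Iterator[str]:
    """Stand-in for an unbounded source; it never terminates."""
    while True:
        yield "play"


batch_counts = final_result(count_by_key(["play", "skip", "play"]))

stream = count_by_key(live_events())
first_two = [next(stream), next(stream)]  # streaming mode: consume incrementally
print(batch_counts, first_two)
```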

Cloud Dataflow (Batch) at Spotify

• Usually run via Scio: https://github.com/spotify/scio

• Scio provides a Scala API for running Dataflow jobs and integrates easily with BigQuery

• New batch processing jobs at Spotify are being written in Scio/Dataflow

Cloud Dataflow (Streaming) at Spotify

• Exactly-once stream processing framework

• A replacement for Spark/Flink streaming and Storm workloads at Spotify

• Optimizes for consistency, which can complicate real-time workloads

Spotify + Google Cloud Timeline

2015 2016

Beginning of Google Cloud evaluation

BigQuery begins to replace Hive

Cloud Pub/Sub begins to replace Kafka

Dataflow (streaming) begins to replace Storm

Spotify + Google Cloud Announcement

Dataflow (batch) replacing Map/Reduce

Note: Dates are approximations

Putting It All Together

The Problem

• We want to detect within minutes if we’ve introduced a bug in a client release that affects critical event logging behavior

Before…

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

HOURS TO INSIGHTS

Introducing “Pulsar”

• An internal name for the system aggregating data from Access Points and feeding it into Cloud Pub/Sub

• Replaces the Kafka real-time event feed

Pulsar

Pub/Sub

• Aggregates global event feed from Pulsar

• Makes data available to multiple zones in milliseconds

Dataflow

• Subscribes to critical event Pub/Sub topics

• Aggregates events into one-minute windows

• Always running, no need to schedule or wait for results
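
Minute-window aggregation can be sketched as fixed (tumbling) windows keyed by event time. This is a plain-Python illustration rather than Dataflow code: it assumes timestamps in epoch seconds and ignores late or out-of-order data, which a streaming runner would handle with watermarks and triggers.

```python
from collections import Counter
from typing import Iterable, Tuple

WINDOW_SECONDS = 60


def window_start(timestamp: float) -> int:
    """Map an event timestamp (epoch seconds) to the start of its one-minute window."""
    return int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS


def count_per_window(events: Iterable[Tuple[float, str]]) -> Counter:
    """Count events per (window start, event type)."""
    counts: Counter = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return counts


# Hypothetical events: (timestamp, event type).
events = [(0.5, "play"), (59.9, "play"), (60.0, "play"), (61.2, "skip")]
per_minute = count_per_window(events)
print(per_minute)
```

Each window closes independently, which is what lets results appear continuously instead of after a scheduled batch run.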

BigQuery

• Receives aggregates from Dataflow

• Allows for ad-hoc inspection or slicing on different dimensions
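
Slicing the aggregates by a dimension such as client version is one way the stated problem (a release that breaks critical event logging) could be spotted. The sketch below is an assumption about how such a check might look, not the system described in the talk; the row shape, version numbers, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical aggregate rows: (minute, client_version, event_count).
Row = Tuple[int, str, int]


def versions_with_drop(rows: List[Row], baseline_minute: int,
                       current_minute: int, max_drop: float = 0.5) -> List[str]:
    """Return client versions whose count fell by more than `max_drop` versus the baseline."""
    by_version: Dict[str, Dict[int, int]] = defaultdict(dict)
    for minute, version, count in rows:
        by_version[version][minute] = count
    flagged = []
    for version, counts in sorted(by_version.items()):
        before = counts.get(baseline_minute, 0)
        now = counts.get(current_minute, 0)
        if before > 0 and now < before * (1 - max_drop):
            flagged.append(version)
    return flagged


rows = [
    (0, "8.4.1", 1000), (1, "8.4.1", 980),
    (0, "8.4.2", 1000), (1, "8.4.2", 120),  # this version stops logging most events
]
suspects = versions_with_drop(rows, baseline_minute=0, current_minute=1)
print(suspects)
```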

Tableau

• Data Visualization Tool that integrates nicely with BigQuery

• Pulls data from BigQuery periodically and caches for quick inspection

Putting it all together

Putting it all together

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

Putting it all together

Faster Insights to Client Behavior

Problem

As a developer, I want to be able to instantly explore data being logged by the clients.

Solution

• Produce a topic for all employee client events

• Store in Elasticsearch

• Visualize in Kibana
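
As a rough illustration of the first two steps, the sketch below filters a feed down to employee events and serializes them as the newline-delimited action/document pairs that the Elasticsearch bulk API accepts. The `is_employee` flag and the index name are hypothetical (the slides do not say how employee traffic is identified), and no request is actually sent.

```python
import json
from typing import Dict, Iterable, List


def is_employee_event(event: Dict) -> bool:
    """Hypothetical flag; the slides do not say how employee traffic is identified."""
    return bool(event.get("is_employee"))


def to_bulk_body(events: Iterable[Dict], index: str) -> str:
    """Build a newline-delimited body of action/document line pairs."""
    lines: List[str] = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                         # document line
    return "\n".join(lines) + "\n"  # the bulk format ends with a newline


events = [
    {"user": "a", "type": "play", "is_employee": True},
    {"user": "b", "type": "play", "is_employee": False},
]
body = to_bulk_body(filter(is_employee_event, events), index="employee-events")
print(body)
```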

Benefits

• Able to understand what’s being sent by the client as it happens

• Exploring events and visualizing distributions (e.g., does this field actually get populated?)

• Prototyping analysis based on a sample

• Dashboards for Employee Releases
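
The "does this field actually get populated" question amounts to a simple ratio over a sample of events. A minimal sketch, assuming events arrive as dictionaries and that a missing, `None`, or empty-string value counts as unpopulated; the field name and sample data are made up.

```python
from typing import Dict, Iterable, List


def population_rate(events: Iterable[Dict], field: str) -> float:
    """Fraction of events in which `field` is present and neither None nor empty."""
    events = list(events)
    if not events:
        return 0.0
    populated = sum(1 for e in events if e.get(field) not in (None, ""))
    return populated / len(events)


sample: List[Dict] = [
    {"type": "play", "track_id": "t1"},
    {"type": "play", "track_id": ""},
    {"type": "play"},
    {"type": "play", "track_id": "t2"},
]
rate = population_rate(sample, "track_id")
print(f"track_id populated in {rate:.0%} of sampled events")
```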

Faster Insights on New Features

Problem

The previous dashboard is great for prototyping, but what if you want all the data?

Solution

Allow developers to funnel feature-specific data to their own Elasticsearch cluster

Dataflow to the Rescue!

• We created a library that allows teams to build maps/filters with simple Java code

• Code gets translated into a Dataflow job
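
The slides do not show the library itself, so the following is only a guess at the general pattern: a small fluent builder records map and filter steps, and a separate "run" stage turns them into executable work. The class, method names, and event fields are hypothetical, and the example is written in Python rather than Java; here the translation step simply replays the steps locally instead of producing a Dataflow job.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class EventPipeline:
    """Hypothetical fluent builder; records map/filter steps and replays them lazily."""

    def __init__(self) -> None:
        self._steps: List[Tuple[str, Callable[[Any], Any]]] = []

    def filter(self, predicate: Callable[[Any], bool]) -> "EventPipeline":
        self._steps.append(("filter", predicate))
        return self

    def map(self, fn: Callable[[Any], Any]) -> "EventPipeline":
        self._steps.append(("map", fn))
        return self

    def run(self, events: Iterable[Any]) -> Iterator[Any]:
        """Stand-in for job translation: apply each recorded step in order."""
        for event in events:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(event):
                        keep = False
                        break
                else:
                    event = fn(event)
            if keep:
                yield event


pipeline = (EventPipeline()
            .filter(lambda e: e["feature"] == "new-search")
            .map(lambda e: {"user": e["user"], "latency_ms": e["latency_ms"]}))

events = [
    {"feature": "new-search", "user": "a", "latency_ms": 120, "extra": 1},
    {"feature": "home", "user": "b", "latency_ms": 80, "extra": 2},
]
result = list(pipeline.run(events))
print(result)
```

Keeping the user-facing surface this small is what lets a feature team describe its data slice without learning the underlying streaming framework.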

Abstract Away the Complexity

No Ops!

• For our users:

✦ Event-feed managed through Cloud Pub/Sub

✦ Dataflow managed by Google

✦ Shared Elasticsearch cluster (managed by an infra team)

Low Ops :/

• Dataflow is improving, but it’s had some stability issues with streaming jobs

• Teams may need to set up their own Elasticsearch cluster if they require a higher SLA than the default

Other Uses

Ad Targeting

• Real-time genre targeting

• Session insights — explicit filter

Real-time Recommendations

Live Results for X-Factor

• X-Factor: television music competition

• Contest songs get loaded onto Spotify immediately after the show airs

• Listener behavior determines the order of contestants on the playlist
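
The slides do not say which listener signals drive the ordering. Assuming, purely for illustration, that it is play count, the ranking step could be as small as the sketch below; the contestant names and play events are made up.

```python
from collections import Counter
from typing import Iterable, List


def playlist_order(plays: Iterable[str]) -> List[str]:
    """Order contestants by play count, most played first; ties break alphabetically."""
    counts = Counter(plays)
    return sorted(counts, key=lambda name: (-counts[name], name))


# Hypothetical stream of play events, one contestant name per play.
plays = [
    "Contestant B", "Contestant A", "Contestant B",
    "Contestant C", "Contestant A", "Contestant B",
]
order = playlist_order(plays)
print(order)
```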

Review

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

To leverage actionable insights, we need a faster feedback loop!

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)

Operational Monitoring

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Cloud to the Rescue!

• Spotify has leveled up its ability to gain actionable insights by leveraging Google Cloud tools such as Pub/Sub, Dataflow, and BigQuery

The Value of a Fast Feedback Loop

• Detecting data problems early avoids long backfills or long-term data loss

• Instant insights on newly developed features allow teams to iterate more quickly and take risks

• Providing a faster ad-hoc querying engine allows teams to ask more questions and learn faster

Use Anything and Everything

• Open source and other cloud providers offer many alternatives to the stack we’ve used

• Open-source tools, like Elasticsearch/Kibana, and proprietary solutions, like Tableau, have also been useful additions

Where Are We Going?

• The real-time mission is in the early stages at Spotify

Stream Processing First

• The sun never sets on Spotify, so why impose boundaries on our datasets?

• What’s the shortest distance between two lines? Zero!

• Can we reduce the feedback cycle to zero?

We’re Hiring!

Engineers, Managers, Product Owners needed in NYC and Stockholm

https://www.spotify.com/jobs

Questions?