Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time...

Post on 14-Jan-2017

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Produce Real-time Insights

Josh Baer (jbx@spotify.com)

Note: opinions expressed in these slides are the author's and not necessarily those of Spotify

Who am I?

• Technical Product Owner at Spotify

• Working with fast processing infrastructure

• Previously built out Spotify’s 2,500-node Hadoop cluster

@l_phant

In 2008

• Spotify Launches

• Instant Access to a gigantic catalog of music

• Click to play instantaneous!

Behind the Scenes: Days to Insights

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

[Figure: Real-time Processing, Batch Processing (Hadoop, Hive, BigQuery), Operational Monitoring]

Source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

To leverage actionable insights, we need a faster feedback loop!

What is Spotify?

• Music Streaming Service

• Launched in 2008

• Premium and Free Tiers

• Available in 59 Countries

100+ Million Active Users

30+ Million Songs

1+ Billion Plays/Day

And we have Data

Hadoop at Spotify

• 2,500 Nodes

• 130 PB Capacity

• 120 TB Memory accessible by jobs

• 20K Jobs/Day

Apache Kafka at Spotify

• 500 Kafka-related machines

• 40 TB/day from logs

“Real-Time” at Spotify

• Storm Topologies fed via Kafka

• Powering

✦ Ad Targeting

✦ Real-time recommendations

✦ Real-time stream counts

Migrating to the Cloud

In the Beginning…

• Spotify was almost completely on-premise/bare metal

• 2,500-node Hadoop cluster and over 10K machines in production across four globally distributed data centers

• Grew with users: from 1M in 2009 to over 100M in 2016

Why Move to the Cloud?

• Cloud providers have matured: costs have decreased, while reliability and the variety of services offered have increased

• Owning and operating physical machines is not a competitive advantage for Spotify

Why Google’s Cloud?

• We believe Google’s industry leading background in Big Data technologies will give us a data processing advantage

Google Cloud “Primitives”

BigQuery

• Ad-hoc and interactive querying service for massive datasets

• Like Hive, but without needing to manage Hadoop and servers

• Leverages Google’s internal tech

✦ Dremel (query execution engine)

✦ Colossus (distributed storage)

✦ Borg (distributed compute)

✦ Jupiter (network)

Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

BigQuery vs. Hive

• Example Query: Find the top 10 songs by popularity in Spain during October

• BigQuery (1.50 TB processed): 108s

• Hive (15.5 TB processed): 2647s

Note: Hive performance was not optimized. Hive version 0.14, Avro input format, run on a ~2,500-node YARN cluster. This should not be considered a thorough benchmark

BigQuery vs. Hive (example #2)

• Example Query: Find the total hours of music listening in Spain during October

• BigQuery (780 GB processed): 33s

• Hive (15.5 TB processed): 969s

Note: Hive performance was not optimized. Hive version 0.14, Avro input format, run on a ~2,500-node YARN cluster. This should not be considered a thorough benchmark
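As a quick worked check of the two comparisons above, the following Python sketch computes the ratios directly from the numbers on the slides. The only assumption is that 780 GB is treated as 0.78 TB (decimal units). The ratios inherit every caveat of the note above and are not a benchmark result in their own right.

```python
# Timings and data volumes exactly as quoted on the two comparison slides.
# Assumption: 780 GB is expressed as 0.78 TB (decimal units).
benchmarks = {
    "top 10 songs": {"bigquery_s": 108, "hive_s": 2647, "bigquery_tb": 1.50, "hive_tb": 15.5},
    "total listening hours": {"bigquery_s": 33, "hive_s": 969, "bigquery_tb": 0.78, "hive_tb": 15.5},
}


def ratio(larger, smaller):
    """How many times larger the first value is than the second."""
    return larger / smaller


for name, b in benchmarks.items():
    time_ratio = ratio(b["hive_s"], b["bigquery_s"])
    scan_ratio = ratio(b["hive_tb"], b["bigquery_tb"])
    print(f"{name}: {time_ratio:.1f}x faster, {scan_ratio:.1f}x less data processed")
```

Both queries come out at roughly 25x to 29x faster in BigQuery. Part of the gap is that BigQuery processed far less data for the same question.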

• Top 10 Tracks in Spain during October 2016

Rank  Artist(s)                                   Track Name
1     J Balvin                                    Safari
2     DJ Snake                                    Let Me Love You
3     Ricky Martin                                Vente Pa' Ca
4     Sebastian Yatra                             Traicionera
5     Zion & Lennox (feat. J Balvin)              Otra Vez
6     Carlos Vives, Shakira                       La Bicicleta
7     The Chainsmokers                            Closer
8     Major Lazer (feat. Justin Bieber & MØ)      Cold Water
9     Sia                                         The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)   Ay Mi Dios

Time Spent Listening to Spotify by users in Spain during October

Nearly 10,000 Years!

BigQuery at Spotify

• Interactive and ad-hoc querying began moving to BigQuery (BQ) as soon as the data was available in the cloud

• The pace of learning increases as the friction of asking questions decreases

Cloud Pub/Sub

• At-least-once, globally distributed message queue

• For high-volume publish-subscribe workloads with a modest number of topics (<10,000)

• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
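At-least-once delivery means a subscriber can receive the same message more than once, so consumers are usually written to be idempotent. The following is a minimal Python sketch of that idea. It does not use the Pub/Sub client library, and the class name, message IDs and in-memory set are all hypothetical illustrations.

```python
class IdempotentHandler:
    """Processes each message ID once, even if the queue delivers it repeatedly."""

    def __init__(self):
        # Assumption: an in-memory set is enough for illustration. A real
        # consumer would need bounded or durable storage for seen IDs.
        self.seen_ids = set()
        self.processed = []

    def handle(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # duplicate delivery: safe to acknowledge and skip
        self.seen_ids.add(message_id)
        self.processed.append(payload)
        return True


handler = IdempotentHandler()
# "m1" is delivered twice, as an at-least-once queue is allowed to do.
deliveries = [("m1", "play"), ("m2", "skip"), ("m1", "play")]
results = [handler.handle(message_id, payload) for message_id, payload in deliveries]
```

Deduplicating on a message ID is only one option. Making the downstream write itself idempotent (for example, an upsert keyed on the event) achieves the same effect.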

Cloud Pub/Sub at Spotify

• 800K events/second? No problem

• P99 latency of ingestion into Elasticsearch (ES): 500 ms

• Ingestion from globally distributed non-GCP datacenters is painless

Cloud Dataflow

• Managed Service for running batch and streaming jobs

• Unified API for batch and streaming mode

• Inspired by internal Google tools like FlumeJava and MillWheel

• Programming model open-sourced as Apache Beam (currently incubating)

Cloud Dataflow (Batch) at Spotify

• Usually run via Scio: https://github.com/spotify/scio

• Scio provides a Scala API for running Dataflow jobs and integrates easily with BigQuery

• New batch processing jobs at Spotify are being written in Scio/Dataflow

Cloud Dataflow (Streaming) at Spotify

• Exactly-once stream processing framework

• A replacement for Spark/Flink streaming and Storm workloads at Spotify

• Optimizes for consistency, which can complicate real-time workloads

Spotify + Google Cloud Timeline

2015 2016

Beginning of Google Cloud evaluation

BigQuery begins to replace Hive

Cloud Pub/Sub begins to replace Kafka

Dataflow (streaming) begins to replace Storm

Spotify + Google Cloud Announcement

Dataflow (batch) replacing Map/Reduce

Note: Dates are approximations

Putting It All Together

The Problem

• We want to detect within minutes if we’ve introduced a bug in a client release that affects critical event logging behavior

Before…

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

HOURS TO INSIGHTS

Introducing “Pulsar”

• An internal name for the system aggregating data from Access Points and feeding it into Cloud Pub/Sub

• Replaces the Kafka real-time event feed

Pulsar

Pub/Sub

• Aggregates global event feed from Pulsar

• Makes data available to multiple zones in milliseconds

Dataflow

• Subscribes to critical event Pub/Sub topics

• Aggregates events into one-minute windows

• Always running, no need to schedule or wait for results
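To make the windowing step concrete, here is a minimal Python sketch of fixed one-minute windows. It is a plain batch stand-in, not the Dataflow or Scio API, and the event shape (epoch seconds plus an event type) is an assumption for illustration. A real streaming job also has to deal with late and out-of-order data, which this sketch ignores.

```python
from collections import Counter


def minute_window_counts(events):
    """Count events per (window_start, event_type) in fixed one-minute windows.

    `events` is an iterable of (epoch_seconds, event_type) pairs.
    """
    counts = Counter()
    for timestamp, event_type in events:
        # Round the timestamp down to the start of its one-minute window.
        window_start = int(timestamp) - int(timestamp) % 60
        counts[(window_start, event_type)] += 1
    return counts


# Hypothetical events: two plays in the first minute, a play and a skip in
# the second minute, and one play in the third.
events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = minute_window_counts(events)
```

Each (window, event type) pair becomes one small aggregate row, which is the kind of output a downstream store can then query cheaply.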

BigQuery

• Receives aggregates from Dataflow

• Allows for ad-hoc inspection or slicing on different dimensions
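Slicing on different dimensions amounts to re-grouping the same aggregate rows by a different column. The Python sketch below shows the idea on in-memory rows. In practice this would be a query against BigQuery, and the column names (`client_version`, `platform`, `count`) and values are hypothetical.

```python
from collections import defaultdict


def slice_by(rows, dimension):
    """Sum the `count` column of aggregate rows, grouped by one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["count"]
    return dict(totals)


# Hypothetical one-minute aggregates for a single critical event type.
rows = [
    {"minute": 0, "client_version": "1.0", "platform": "ios", "count": 120},
    {"minute": 0, "client_version": "1.1", "platform": "ios", "count": 4},
    {"minute": 0, "client_version": "1.0", "platform": "android", "count": 110},
    {"minute": 0, "client_version": "1.1", "platform": "android", "count": 95},
]

by_version = slice_by(rows, "client_version")
by_platform = slice_by(rows, "platform")
```

In this made-up data, slicing by version alone looks unremarkable, while combining version and platform would expose the very low count for version 1.1 on iOS. That is the kind of question ad-hoc slicing is meant to answer.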

Tableau

• Data Visualization Tool that integrates nicely with BigQuery

• Pulls data from BigQuery periodically and caches for quick inspection

Putting it all together

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

Faster Insights to Client Behavior

Problem

As a developer, I want to be able to instantly explore data being logged by the clients.

Solution

• Produce a topic for all employee client events

• Store in Elasticsearch

• Visualize in Kibana

Benefits

• Able to understand what’s being sent by the client as it happens

• Exploring events and visualizing distributions (e.g., does this field actually get populated?)

• Prototyping analysis based on a sample

• Dashboards for Employee Releases
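The "does this field actually get populated" question can be answered on a sample with a simple fill-rate calculation. The Python sketch below works on in-memory dictionaries rather than Elasticsearch or Kibana, and the event fields (`track_id`, `device`) are hypothetical.

```python
def field_fill_rate(events, field):
    """Fraction of events in which `field` is present and non-empty."""
    if not events:
        return 0.0
    populated = sum(1 for event in events if event.get(field) not in (None, ""))
    return populated / len(events)


# Hypothetical sample of client events; `device` is missing or empty in two.
sample = [
    {"event": "play", "track_id": "t1", "device": "phone"},
    {"event": "play", "track_id": "t2", "device": ""},
    {"event": "play", "track_id": "t3"},
    {"event": "play", "track_id": "t4", "device": "desktop"},
]

device_rate = field_fill_rate(sample, "device")
track_rate = field_fill_rate(sample, "track_id")
```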

Faster Insights on New Features

Problem

The previous dashboard is great for prototyping, but what if you want all the data?

Solution

Allow developers to funnel feature-specific data to their own Elasticsearch cluster

Dataflow to the Rescue!

• We created a library that allows teams to build maps/filters with simple Java code

• Code gets translated into a Dataflow job
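The slides do not show the library itself, so the following Python sketch is only a guess at the general shape of such an abstraction: teams declare map and filter steps, and something else decides how to run them. The class, method names and event fields are hypothetical, and the sketch runs in memory instead of generating a Dataflow job.

```python
class EventPipeline:
    """Collects map/filter steps and applies them, in order, to each event."""

    def __init__(self):
        self.steps = []

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self  # returning self allows chained declarations

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def run(self, events):
        """Yield the transformed events that pass every filter."""
        for event in events:
            keep = True
            for kind, fn in self.steps:
                if kind == "filter":
                    if not fn(event):
                        keep = False
                        break
                else:
                    event = fn(event)
            if keep:
                yield event


# Hypothetical feature team: keep only their feature's events, then trim fields.
pipeline = (
    EventPipeline()
    .filter(lambda e: e["feature"] == "new_playlist")
    .map(lambda e: {"user": e["user"], "action": e["action"]})
)

events = [
    {"feature": "new_playlist", "user": "u1", "action": "open", "debug": "x"},
    {"feature": "search", "user": "u2", "action": "query"},
    {"feature": "new_playlist", "user": "u3", "action": "share"},
]
output = list(pipeline.run(events))
```

Keeping the declaration separate from the execution is what lets the same team code be handed to a managed runner without the team operating anything.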

Abstract Away the Complexity

No Ops!

• For our users:

• Event-feed managed through Cloud Pub/Sub

• Dataflow managed by Google

• Shared Elasticsearch cluster (managed by an infra team)

Low Ops :/

• Dataflow is improving, but it’s had some stability issues with streaming jobs

• Teams may need to set up their own Elasticsearch cluster if they require a higher SLA than the default

Other Uses

Ad Targeting

• Real-time genre targeting

• Session insights — explicit filter

Real-time Recommendations

Live Results for X-Factor

• X-Factor: television music competition

• Contest songs get loaded onto Spotify immediately after the show airs

• Listener behavior determines the order of contestants on the playlist
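The slides do not say which listener signals drive the ordering. As a minimal Python sketch, assume the signal is simply the number of plays per contestant. The contestant names below are made up.

```python
from collections import Counter


def rank_contestants(plays):
    """Order contestants by play count, highest first.

    `plays` is an iterable of contestant names, one entry per play.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(plays)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked]


# Hypothetical stream of plays after the show airs.
plays = ["Ana", "Ben", "Ana", "Cleo", "Ben", "Ana"]
playlist_order = rank_contestants(plays)
```

Re-running the ranking as new plays arrive is what turns it into a live result rather than a nightly one.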

Review

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

To leverage actionable insights, we need a faster feedback loop!

[Figure: Real-time Processing, Batch Processing (Hadoop, Hive, BigQuery), Operational Monitoring]

Source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Cloud to the Rescue!

• Spotify has leveled up its ability to gain actionable insights by leveraging Google Cloud tools such as Pub/Sub, Dataflow, and BigQuery

The Value of a Fast Feedback Loop

• Detecting data problems early avoids long backfills and long-term data loss

• Instant insights on newly developed features allow teams to iterate more quickly and take risks

• Providing a faster ad-hoc querying engine allows teams to ask more questions and learn faster

Use Anything and Everything

• Open source projects and other cloud providers offer many alternatives to the stack we’ve used

• Open source tools, like Elasticsearch/Kibana, and proprietary solutions, like Tableau, have also been useful additions

Where Are We Going?

• The real-time mission is in the early stages at Spotify

Stream Processing First

• The sun never sets on Spotify, so why impose boundaries on our datasets?

• What’s the shortest distance between two lines? Zero!

• Can we reduce the feedback cycle to zero?

We’re Hiring!

Engineers, Managers, Product Owners needed in NYC and Stockholm

https://www.spotify.com/jobs

Questions?