Architecture at PBS

Architecture of PBS.org DCPython - June 7, 2011


Edgar and I had the pleasure of presenting at the DCPython meetup last night about how PBS uses Python, Django, Celery, Solr and Amazon Web Services (autoscaling EC2, RDS) to power many of our sites and services. We focused primarily on the COVE (video) and Merlin (content) APIs, since those probably have the most interesting architectures. We had a blast and received many smart questions from the crowd about Solr, Amazon Web Services, Celery and the recent Tupac incident, in roughly that order. Thanks for having us, DCPython! Check out DCPython at http://dcpython.org or follow @DCPython.

Transcript of Architecture at PBS

Page 1: Architecture at PBS

Architecture of PBS.org

DCPython - June 7, 2011

Page 2: Architecture at PBS

PBS is…

• PBS is a national federation of independently owned and operated public television stations and producers
  – Each with their own management and development resources

• 1500+ highly trafficked websites:
  – http://www.pbs.org/
  – http://www.pbs.org/nova/
  – http://pbskids.org/
  – http://pbskids.org/sesame/
  – http://video.pbs.org/

• Enterprise services/APIs

Page 3: Architecture at PBS

PBS is not!

• Radio is easy… We do television!

• Or any of the other ~200 local stations.

Page 4: Architecture at PBS

What we do

• Technology leadership within public broadcasting community

• Distribution of national programming content

• Services to local stations

• Core application development

Yeah!!!

Page 5: Architecture at PBS

A few of our sites

Page 6: Architecture at PBS

History of PBS.org

Early 1990’s: Hand rolled static html

Late 1990’s: Hand crafted static html + CGI!

Most of 2000’s: Zope/Plone CMS generated static html

2008-10: Django generated static html

Launched Oct 2010: Django all the way

Page 7: Architecture at PBS

COVE API

• Contains the metadata for all PBS videos online including pointers to streaming video

• Needed to be:
  – Secure
  – Fast
  – Scalable

Page 8: Architecture at PBS

COVE API – Technology Stack

• Amazon Elastic Compute Cloud (EC2)
• Amazon Relational Database Service (RDS)
• Linux
• Python
• Django
• Piston for REST API

Page 9: Architecture at PBS

COVE API - Architecture

[Diagram: Internet; Elastic Load Balancer; Auto Scale Array (App Server 1 … App Server N); HA Proxy; RDS Master; RDS Slave 1 (three boxes shown); App Sync Server; S3 Backups]

Page 10: Architecture at PBS

COVE API – Management Tools

• Amazon Web Services Console
• RightScale
• Splunk

Page 11: Architecture at PBS

COVE API – Interesting Stuff

• Easy to load test
  – Duplicate environment for several days

• Easy to scale
  – Autoscale array grows automatically

• Easy to upgrade
  – Each server built from vanilla base

Page 12: Architecture at PBS

COVE API – Lessons learned

• Use normalized data for administration and de-normalized data for API
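
A minimal sketch of that split, under assumed names: the `PROGRAMS`, `VIDEOS` and `STREAMS` tables, their fields and the `denormalize` helper are all hypothetical and are not taken from COVE. The idea is that administrators edit small normalized rows, while the API serves a flattened copy that needs no joins at request time.

```python
# Hypothetical normalized records, roughly as an admin database might hold them.
PROGRAMS = {1: {"title": "NOVA"}}
VIDEOS = {10: {"program_id": 1, "title": "Example Episode", "stream_id": 100}}
STREAMS = {100: {"url": "http://example.org/stream/100"}}


def denormalize(video_id):
    """Flatten one video and its related rows into a single API document."""
    video = VIDEOS[video_id]
    return {
        "id": video_id,
        "title": video["title"],
        "program_title": PROGRAMS[video["program_id"]]["title"],
        "stream_url": STREAMS[video["stream_id"]]["url"],
    }


# Build the read-optimized copy when data changes, not on every API request.
API_DOCUMENTS = {video_id: denormalize(video_id) for video_id in VIDEOS}
```

The trade-off is that the flattened copy has to be rebuilt whenever any of the underlying rows changes.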

Page 13: Architecture at PBS

COVE API – Lessons learned

• Piston is fine, but lacks flexibility without significant customization
  – TastyPie?

• JSON is probably good enough
• Don’t get fancy with your endpoints
• Stick to REST principles
• Don’t get fancy with your authentication
  – Use OAuth2 or simple token
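
One way to read "simple token" is sketched below using only the Python standard library. This is an illustration under assumptions, not the scheme COVE uses: `SECRET_KEY`, `make_token` and `is_authorized` are hypothetical names, and a real service would load the secret from configuration and serve the API over HTTPS.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would load this from config.
SECRET_KEY = b"change-me"


def make_token(client_id):
    """Derive a per-client token from the client id and the server secret."""
    return hmac.new(SECRET_KEY, client_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authorized(client_id, token):
    """Check a presented token using a constant-time comparison."""
    return hmac.compare_digest(make_token(client_id), token)
```

Because the token is derived from the client id, the server does not need to store one token per client; rotating `SECRET_KEY` invalidates all tokens at once.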

Page 14: Architecture at PBS

PBS.org and Merlin API

• PBS.org
  – Slim, fast layer
  – Pulls data from Merlin API
  – Uses memcache extensively
  – Currently Django, but could be anything (Flask?)

• Merlin API
  – Aggregate content from distributed CMSes
  – Expose via standardized API
  – Power PBS.org and more
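
The "slim layer that pulls from an API and caches heavily" pattern can be sketched as a cache-aside lookup. Everything here is an assumption for illustration: `TTLCache` is a small in-process stand-in for memcache, and `get_content` and the `fetch` callable are hypothetical names, not PBS.org code.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcache (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            # Expired entries are dropped and treated as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.time() + ttl)


def get_content(cache, fetch, key, ttl=300):
    """Return cached content, calling fetch(key) only on a cache miss."""
    value = cache.get(key)
    if value is None:
        value = fetch(key)
        cache.set(key, value, ttl)
    return value
```

In a deployment, `fetch` would be the HTTP call to the content API and `cache` a memcache client with the same get/set shape.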

Page 15: Architecture at PBS

Merlin API – Technology stack

• Python
• Django
• MySQL
• Piston
• Solr
• Celery
• RabbitMQ

• Amazon Web Services (“cloud”)
  – EC2
  – RDS - Relational Database Service
  – ELB - Elastic Load Balancing
  – CloudFront CDN
  – S3 Storage

Page 16: Architecture at PBS

Data flow

[Diagram: RSS Feed; Ingestor; Standardized API]
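
As a rough sketch of what an ingestion step in such a flow might do, the snippet below parses RSS items into uniform records using only the standard library. The `ingest` function, the record fields and the sample feed are hypothetical; the actual Merlin ingestor is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for a station's RSS output.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example Station</title>
  <item><title>First story</title><link>http://example.org/1</link></item>
  <item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""


def ingest(rss_text):
    """Parse RSS items into a list of uniform dicts."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title", default="")
    return [
        {
            "source": source,
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
        }
        for item in channel.findall("item")
    ]
```

Every downstream consumer then sees the same record shape, whichever CMS produced the feed.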

Page 17: Architecture at PBS

Merlin API architecture

API Endpoint – Django Piston

Search service – Django-haystack

Indexing service – Solr

Data layer – MySQL (RDS)

Administration – Django admin

Feed ingestion – Celery

Page 18: Architecture at PBS

Merlin API server topology

[Diagram: Internet; Elastic Load Balancer; Autoscaling array (App #N, four app servers shown); Celery; Master DB (RDS); Solr Index; S3 backups]

Page 19: Architecture at PBS

Merlin API – Management Tools

• Amazon Web Services Console
• RightScale
• Splunk

Page 20: Architecture at PBS

API - Piston/Haystack/Solr

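from haystack.query import SearchQuerySet
from piston.handler import BaseHandler

class WebObjectIndexHandler(BaseHandler):
    ...
    def get_queryset(self):
        ...
        # Serve API reads from the search index rather than the database.
        return PistonSearchQuerySet().models(*models)

class PistonSearchQuerySet(SearchQuerySet):
    ...
    def __getitem__(self, k):
        ...
        # Wrap each search result in the serializer.
        # As written, this assumes k is a slice, not a single index.
        return [IndexSerializer(i) for i in
                super(PistonSearchQuerySet, self).__getitem__(k)]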

Page 21: Architecture at PBS

Feed ingestor - Celery

from datetime import timedelta

from celery.decorators import task, periodic_task

# Runs every five minutes.
@periodic_task(run_every=timedelta(seconds=300))
def update_webobject_states():
    ...
    solr_visible = WebObject.children.filter(visible=True)
    solr_visible = solr_visible.exclude(
        flag__api_visible=True, available__isnull=True)
    ...
    # Bulk-hide the remaining objects and mark them for reindexing.
    updated = solr_visible.update(visible=False, is_indexed=False)
    ...
    signals.bulk_update.send('tasks.update_webobject_states')

Page 22: Architecture at PBS

Merlin API - Lessons learned

• Memcached was not necessary
• Denormalized search data via Solr index is much faster than querying database
• Asynchronous task delegation is awesome
• Celery prone to memory leaks
• App server array for easy horizontal scaling
  – Even if not autoscaling, increase min servers

• Never trust data you don’t control (validate!)
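
A small sketch of what "validate!" could look like for an ingested record. The required fields and the `validate_record` helper are assumptions for illustration and do not reflect Merlin's actual schema.

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("title", "url")  # hypothetical schema


def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append("missing or empty field: %s" % field)
    url = record.get("url")
    if isinstance(url, str) and url.strip():
        parsed = urlparse(url.strip())
        # Accept only absolute http(s) URLs from external feeds.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("url is not an absolute http(s) URL")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a bad feed item in one pass.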

Page 23: Architecture at PBS

Resources

• http://lucene.apache.org/solr/
• http://haystacksearch.org/
• http://celeryproject.org/
• http://celeryproject.org/docs/django-celery/
• http://aws.amazon.com/

Page 24: Architecture at PBS

PBS Developer Community

• Dedicated to making open.PBS the industry standard in open development communities.

http://open.pbs.org/

https://github.com/pbs

[email protected]

Page 25: Architecture at PBS

Questions?

Drew [email protected]

tomatohater.com

Edgar [email protected]