Strata Hadoop Hopsworks

Post on 16-Apr-2017



HopsWorks: Multi-Tenant Hadoop-as-a-Service

Jim Dowling, Associate Prof @ KTH Stockholm

Senior Researcher @ SICS, CEO @ Logical Clocks AB

Hadoop Strata, London, June 3rd 2016

www.hops.io @hopshadoop

Open, Collaborative Software Development

Hadoop Administrator

Enterprise-Level Hadoop: Admin-in-the-Loop


Access Control in Hadoop: Apache Ranger

>hdfs dfs -chmod -R 000 /apps/hive

[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]

Access Control in Hadoop: Apache Sentry

How do you ensure the consistency of the policies and the data?

[Mujumdar’15]

Access Control in Relational Databases

# Multi-tenancy for alice and bob on db1 and db2
grant all privileges on db1.* to 'alice'@'%';
grant all privileges on db2.* to 'bob'@'%';

Consistency of security and privileges guaranteed with foreign keys.

drop database db2; -- deletes associated privileges
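The foreign-key guarantee on this slide can be sketched with Python's built-in sqlite3 module. The `dbs` and `privileges` tables and the helper names are hypothetical and exist only for illustration; this is not how any particular database server stores its grants. The point is that a privilege row cannot outlive, or be created without, the database row it references.

```python
import sqlite3

def make_catalog():
    """Create an in-memory catalog (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE dbs (name TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE privileges ("
        " grantee TEXT NOT NULL,"
        " db_name TEXT NOT NULL REFERENCES dbs(name) ON DELETE CASCADE,"
        " PRIMARY KEY (grantee, db_name))"
    )
    return conn

def grant(conn, grantee, db_name):
    """Record that grantee has privileges on db_name."""
    conn.execute(
        "INSERT INTO privileges (grantee, db_name) VALUES (?, ?)",
        (grantee, db_name),
    )

def drop_db(conn, db_name):
    """Remove a database row; the cascade removes its privilege rows."""
    conn.execute("DELETE FROM dbs WHERE name = ?", (db_name,))

def privileges(conn):
    """Return all (grantee, db_name) pairs, ordered by grantee."""
    return conn.execute(
        "SELECT grantee, db_name FROM privileges ORDER BY grantee"
    ).fetchall()

conn = make_catalog()
conn.executemany("INSERT INTO dbs (name) VALUES (?)", [("db1",), ("db2",)])
grant(conn, "alice", "db1")
grant(conn, "bob", "db2")
drop_db(conn, "db2")
print(privileges(conn))  # [('alice', 'db1')]
```

Granting a privilege on a database that does not exist fails with `sqlite3.IntegrityError`, so policy and data cannot drift apart in either direction.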

Metadata Totem Poles in Hadoop

Eventual Consistency

Why the separation of Metadata and Data?


HDFS v2

[Diagram: HDFS Client, NameNode, Standby NameNode, Journal Nodes, Zookeeper Nodes, DataNodes. Annotation: Max 200 GB]

YARN

[Diagram: YARN Client, ResourceMgr, Standby ResourceMgr, Zookeeper Nodes, NodeManagers]

Tinker-Friendly?

There is another way…

Hops: Distributed Metadata for Hadoop


HopsFS Architecture

[Diagram: HDFS Client, NameNodes, Leader, NDB, DataNodes. Annotations: > 12 TB; > 2.6X Throughput]

HopsYARN Architecture

[Diagram: YARN Client, ResourceMgrs, Scheduler, Resource Trackers, NDB, NodeManagers. Annotations: Leader Election for Failed Scheduler; Up to 10K Node Clusters]

Challenges for Project-Level Multi-Tenancy


(How can we introduce GitHub-style projects to Hadoop?)

Problem: Sensitive Data needs its own Cluster

[Diagram: Alice has access to the NSA DataSet and has access to the User DataSet. Copy/cross-link between data sets.]

Alice has only one Kerberos Identity. Neither attribute-based access control nor dynamic roles are supported in Hadoop.

Solution: Project-Specific UserIDs

[Diagram: NSA__Alice is a member of Project NSA; Users__Alice is a member of Project Users.]

HDFS enforces access control
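A minimal sketch of how project-specific user IDs such as NSA__Alice could be derived and parsed. The `projectName__userID` format comes from the slides; the function names and the rule that a project name may not contain the separator are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Tuple

SEPARATOR = "__"  # separator used in the slides' projectName__userID format

def project_user(project: str, user: str) -> str:
    """Build the per-project identity for a user (hypothetical helper)."""
    if not project or not user:
        raise ValueError("project and user must be non-empty")
    if SEPARATOR in project:
        # Assumption: forbid the separator in project names so that
        # parsing the identity back is unambiguous.
        raise ValueError("project name must not contain the separator")
    return f"{project}{SEPARATOR}{user}"

def split_project_user(name: str) -> Tuple[str, str]:
    """Split a per-project identity back into (project, user)."""
    project, sep, user = name.partition(SEPARATOR)
    if not sep or not project or not user:
        raise ValueError(f"not a project-specific user ID: {name!r}")
    return project, user

def identities(user: str, projects: Iterable[str]) -> Dict[str, str]:
    """One distinct identity per project the user is a member of."""
    return {p: project_user(p, user) for p in projects}

ids = identities("Alice", ["NSA", "Users"])
print(ids)  # {'NSA': 'NSA__Alice', 'Users': 'Users__Alice'}
```

Because the file system only ever sees NSA__Alice or Users__Alice, never a single global "Alice", ordinary user and group permissions are enough to keep the two projects apart.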

How can we share DataSets between Projects?

Sharing Data with First-Class DataSets

[Diagram: NSA__Alice is a member of Project NSA; Users__Alice is a member of Project Users; a project owns the DataSet.]

Add members of Project NSA to the DataSet group
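The sharing step on this slide can be modelled as plain group membership. This is a sketch under assumptions: the `Project` and `DataSet` classes are hypothetical, the DataSet is assumed to be owned by Project Users, and the group is copied at share time rather than kept in sync with later membership changes.

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    members: set = field(default_factory=set)  # project-specific user IDs

@dataclass
class DataSet:
    name: str
    owner: Project
    group: set = field(default_factory=set)  # IDs allowed to access the DataSet

    def __post_init__(self):
        # Members of the owning project start out in the DataSet group.
        self.group |= self.owner.members

def share(dataset: DataSet, project: Project) -> None:
    """Share a DataSet by adding the other project's members to its group."""
    dataset.group |= project.members

def can_access(dataset: DataSet, user_id: str) -> bool:
    """Access check as the file system would see it: group membership only."""
    return user_id in dataset.group

users = Project("Users", {"Users__Alice"})
nsa = Project("NSA", {"NSA__Alice"})
logs = DataSet("logs", owner=users)

print(can_access(logs, "NSA__Alice"))  # False
share(logs, nsa)
print(can_access(logs, "NSA__Alice"))  # True
```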

HopsWorks: Project-Level Multi-Tenancy


System User Authentication Provider

- JDBC Realm
- 2-Factor Authentication
- LDAP

HopsWorks Enforces Dynamic Roles

[Diagram: Alice@gmail.com authenticates with HopsWorks and acts as NSA__Alice or Users__Alice. Labels: Projects, Secure Impersonation, HopsFS, HopsYARN, Kafka, X.509 Certificates]

X.509 Certificate Per Project-Specific User

[Diagram labels: Alice@gmail.com, Authenticate, Project Mgr, Add/Del Users, Insert/Remove Certs, Distributed Database, Cert Signing Requests, Root CA, Services (Hadoop, Spark, Kafka, etc.)]

Project

A project has an owner

A project is a collection of

- Members
- HDFS DataSets
- Kafka Topics
- Notebooks and Jobs

A project has quotas

[Diagram: a project containing dataset 1 … dataset N in HDFS and Topic 1 … Topic N in Kafka]

Project Roles

Data Owner Privileges

- Import/Export data
- Manage Membership
- Share DataSets, Topics

Data Scientist Privileges

- Write and Run code

We delegate administration of privileges to users, just like GitHub

Elastic Hadoop

Each Project has:

- YARN CPU Quota (in mins)
- HDFS Storage Quota (in GB/TB)

Uber-Style Pricing
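The two per-project quotas could be tracked along the following lines. This is only a sketch: the `ProjectQuota` class, its method names, and the choice to reject a request that would exceed the quota are assumptions, and pricing is deliberately left out because the slides give no detail on it.

```python
from dataclasses import dataclass

class QuotaExceeded(Exception):
    """Raised when a project would exceed one of its quotas."""

@dataclass
class ProjectQuota:
    yarn_cpu_minutes: float      # YARN CPU quota, in minutes
    hdfs_storage_gb: float       # HDFS storage quota, in GB
    used_cpu_minutes: float = 0.0
    used_storage_gb: float = 0.0

    def charge_cpu(self, minutes: float) -> None:
        """Account for CPU time, refusing to go over the quota."""
        if self.used_cpu_minutes + minutes > self.yarn_cpu_minutes:
            raise QuotaExceeded("YARN CPU quota exhausted")
        self.used_cpu_minutes += minutes

    def store(self, gigabytes: float) -> None:
        """Account for stored data, refusing to go over the quota."""
        if self.used_storage_gb + gigabytes > self.hdfs_storage_gb:
            raise QuotaExceeded("HDFS storage quota exhausted")
        self.used_storage_gb += gigabytes

    def remaining(self) -> dict:
        """Quota still available to the project."""
        return {
            "cpu_minutes": self.yarn_cpu_minutes - self.used_cpu_minutes,
            "storage_gb": self.hdfs_storage_gb - self.used_storage_gb,
        }

quota = ProjectQuota(yarn_cpu_minutes=600, hdfs_storage_gb=100)
quota.charge_cpu(90)
quota.store(40)
print(quota.remaining())  # {'cpu_minutes': 510.0, 'storage_gb': 60.0}
```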

Sharing DataSets/Topics between Projects


The same as Sharing Folders in Dropbox

Delegate Access Control to HDFS

HDFS enforces access control

- UserID per Project
- GroupID per Project and DataSet

Metadata Integrity using Foreign Keys

- Removing a project removes all users, groups, and (optionally) DataSets

Delegate Access Control to Kafka

Kafka brokers enforce access control with certificates

Principal name extracted from the X.509 Certificate is: projectName__userID

HopsAuthorizer enforces ACLs for the topic

ACLs are stored in the distributed database
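The check described on this slide can be sketched as a lookup from certificate principal to topic ACL. Everything concrete here is an assumption for illustration: that the principal sits in the certificate subject's CN field, the naive comma-based subject parsing, the `authorize` function, and the in-memory dictionary standing in for the distributed database. It is not the HopsAuthorizer implementation.

```python
def principal_from_subject(subject_dn: str) -> str:
    """Extract the CN from a subject string such as 'CN=NSA__Alice,O=Example'.

    Naive parsing for illustration; it does not handle escaped commas.
    """
    for part in subject_dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN" and value:
            return value
    raise ValueError("subject has no CN field")

def authorize(subject_dn: str, topic: str, operation: str, acls: dict) -> bool:
    """Allow the operation only if the topic's ACL lists it for the principal."""
    principal = principal_from_subject(subject_dn)
    return operation in acls.get(topic, {}).get(principal, set())

# Stand-in for ACLs stored in the distributed database:
# topic -> principal (projectName__userID) -> allowed operations.
acls = {
    "events": {"NSA__Alice": {"read", "write"}},
}

print(authorize("CN=NSA__Alice,O=Example", "events", "read", acls))    # True
print(authorize("CN=Users__Alice,O=Example", "events", "read", acls))  # False
```

The same person gets a different answer depending on which project-specific certificate she presents, which is what makes the roles dynamic.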

Free Text Search for Metadata


Free-Text Search

Distributed Database, Elasticsearch

The Distributed Database is the Single Source of Truth.

Zero overhead, streaming API synchronizes with Elasticsearch.
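One way to picture a search index that follows a single source of truth is to replay an ordered stream of change events into it. The sketch below assumes a hypothetical event format of `(operation, id, document)` tuples and uses a plain dictionary in place of Elasticsearch; it says nothing about the actual streaming API.

```python
def apply_events(index: dict, events) -> dict:
    """Replay ordered change events from the database into the index."""
    for op, doc_id, doc in events:
        if op == "upsert":
            index[doc_id] = doc
        elif op == "delete":
            index.pop(doc_id, None)  # deleting a missing document is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return index

def search(index: dict, term: str) -> list:
    """Case-insensitive substring search over every field of every document."""
    term = term.lower()
    return sorted(
        doc_id
        for doc_id, doc in index.items()
        if any(term in str(value).lower() for value in doc.values())
    )

events = [
    ("upsert", 1, {"name": "genomes", "description": "Human genome samples"}),
    ("upsert", 2, {"name": "logs", "description": "Web server logs"}),
    ("delete", 2, None),
    ("upsert", 3, {"name": "reads", "description": "Genome sequencing reads"}),
]

index = apply_events({}, events)
print(search(index, "genome"))  # [1, 3]
print(search(index, "logs"))    # []
```

Because the index is derived purely from the event stream, it can be rebuilt from the database at any time, which is the property "single source of truth" refers to.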

MetaDataDesigner

MetaDataEntry

Design your own Extended Metadata


Zeppelin (with multi-tenancy and Livy/Spark)

Automated Installation


Vagrant/Chef to spin up on a single host

Karamel/Chef to deploy on AWS/GCE/OpenStack or on-premises

name: HopsWorks
ec2:
  type: m3.medium
cookbooks:
  hadoop:
    github: "hopshadoop/hopsworks-chef"
    version: "v0.1"
groups:
  ui:
    size: 1
    recipes:
      - hopsworks
  metadata:
    size: 2
    recipes:
      - hops::nn
      - hops::rm
  datanodes:
    size: 50
    recipes:
      - hops::dn
      - hops::nm

www.hops.site


A 2 MW datacenter research and test environment

5 lab modules, planned up to 3-4000 servers, 2-3000 square meters

[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]

Demo


Summing Up

HopsWorks is providing a world’s first: Multi-Tenant Hadoop-as-a-Service

- Open-Source
- Self-service
- Tinker Friendly

www.hops.io @hopshadoop

The Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremeskel, Theofilos Kakantousis, Johan Svedlund Nordström, Someya Sayeh, Vasileios Giannokostas, Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan, Ahmad Al-Shishtawy, Ali Gholami, Paul Mälzer.

Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Hops