Map Reduce with Bash - An Example of the Unix Philosophy in Action

Transcript
Page 1: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Jumping Bean

Map Reduce With Bash (the power of the Unix philosophy)

Page 2: Map Reduce with Bash - An Example of the Unix Philosophy in Action

About Me

● Solutions integrator at Jumping Bean
  – Developer & Trainer
  – Technologies
    ● Java
    ● PHP
    ● HTML5/JavaScript
    ● Linux
  – What I am planning to do:
    ● The Internet of Things

Page 3: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Map/Reduce with Bash

● The purpose of this presentation is:
  – to demonstrate the power and flexibility of the Unix philosophy,
  – to show what awesome solutions can be created using simple bash scripts and userland tools,
  – to show off some cool utilities and tools
● The purpose is not:
  – to suggest that Map/Reduce is best done with bash
  – bash is only the best fit given the constraints (see the business problem)

Page 4: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Unix Philosophy

“is a set of cultural norms and philosophical approaches to developing small yet capable software” - Wikipedia

Page 5: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Unix Philosophy

“Early Unix developers were important in bringing the concepts of modularity and reusability into software engineering practice, spawning a 'software tools' movement” - Wikipedia

Page 6: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● Nuclear Engineering department needs to run Monte Carlo methods on data to calculate something to do with the core temperature of nuclear reactors :)
● Post-grad students need to run the analysis as part of their course work
● Analysis can take days or weeks to run
● University has invested in a 900 node cluster
● Cluster is used for research when not used by students
● Tool used for analysis is
  – written in Fortran,
  – single threaded
● No money for a fancy-pants solution

Page 7: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● As-Is system
  – Professor uses laptop and desktop,
  – Manually starts the application with a simple script,
  – Starts the script x times, where x = number of cores,
  – Waits for days,
  – Manually checks progress,
  – Not scalable to 900 nodes!
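
The as-is approach of one manual start per core can be sketched roughly as below. This is only an illustration: `run_simulation` is a hypothetical stand-in for the Fortran tool's start script, and `nproc` (GNU coreutils) is assumed to be available.

```shell
# Hypothetical stand-in for the professor's start script.
run_simulation() {
    echo "run $1 finished"
}

# One background process per core; fall back to 1 if nproc is missing.
cores=$(nproc 2>/dev/null || echo 1)
outdir=$(mktemp -d)

i=1
while [ "$i" -le "$cores" ]; do
    run_simulation "$i" > "$outdir/run_$i.log" &
    i=$((i + 1))
done

wait   # block until every background run has exited

# "Manually checks progress": count the log files that were written.
finished=$(ls "$outdir" | wc -l | tr -d ' ')
echo "$finished of $cores runs finished"
```

This works on one machine, but everything here (starting, waiting, checking) is tied to a single shell session, which is why it does not stretch to 900 nodes.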

Page 8: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● Unknowns
  – How is the 900 node cluster set up, i.e. is it using any cluster software or virtualisation?
    ● OpenStack?
    ● OpenNebula?
    ● KVM?
  – Tools available to the IT department, i.e. how they do deploys, monitoring, user management etc.
● Requirements
  – Independence from the IT department or experts for help,
  – Student & lecturer IT skills are limited to Fortran & some bash scripting,
  – Due to security concerns, prevent IT staff from gaining access to research
● Keep it simple: proof of concept

Page 9: Map Reduce with Bash - An Example of the Unix Philosophy in Action

What is Map/Reduce?

● Programming model for
  – Processing and generating large datasets,
  – Using a parallel, distributed algorithm,
  – On a cluster or set of distributed nodes
● Popularised by Google and the advent of cloud computing
● Apache Hadoop: full-blown map/reduce framework. Used to analyse your social media data, “understand the customer”, and by numerous agencies with three-letter acronyms.
  – “Really, we're only trying to help you know yourself better”

Page 10: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Map/Reduce Steps

● Map – Master node takes a large dataset and distributes pieces of it to the compute nodes for analysis. Each compute node returns a result,

● Reduce – Gather the results from the compute nodes and aggregate them into the final answer
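
The two steps can be tried on a single machine before any cluster is involved. The sketch below is a toy, not the talk's actual scripts: the dataset (the numbers 1 to 1000), the chunk size and the "analysis" (a sum) are all assumptions chosen for illustration.

```shell
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/input.txt"      # toy dataset: one number per line

# Map: split the dataset into 250-line chunks and compute one partial
# result per chunk. On a cluster each chunk would go to a different node.
split -l 250 "$workdir/input.txt" "$workdir/chunk_"
for chunk in "$workdir"/chunk_*; do
    awk '{ s += $1 } END { print s }' "$chunk" > "$chunk.partial"
done

# Reduce: gather the partial results and aggregate them into one answer.
total=$(cat "$workdir"/chunk_*.partial | awk '{ s += $1 } END { print s }')
echo "total=$total"
```

The map loop runs the chunks one after another here. The point is only the shape: independent work per chunk, then a single aggregation over the partial results.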

Page 11: Map Reduce with Bash - An Example of the Unix Philosophy in Action

What we need

● Controller node functions
  – distribute data to nodes,
  – execute calculation functions,
  – collect results
● Management node functions
  – distribute application and scripts to compute nodes
● Compute node functions
  – scripts to run the single-threaded application in parallel on multi-core processors
● Security requirements
  – prevent system administrators from gaining access to the core application, scripts or data

Page 12: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Controller Functions

● How to distribute files to a node (map), execute calculations & gather results (reduce)?
  – Use split to split input files,
  – Use ssh to distribute files and execute processes
● How to do this on multiple (900) nodes?
  – Use parallel ssh (pssh) and parallel scp
● Issues:
  – Copying a public key to 900 machines?
  – Give each student their own account?
● Solution
  – Set up LDAP authentication (password based), or
  – Include the controller node's root public key in the compute node image and distribute secondary keys via scripts using pssh
  – Fancy pants: Chef, Ansible
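
A controller script along these lines might look like the dry-run sketch below. It only prints the remote commands unless `DRY_RUN=0` is set. The host names, `/tmp/mapreduce`, `run_analysis.sh` and `output.txt` are hypothetical, and the pssh tools are assumed to be installed under their Debian names (`parallel-ssh`, `parallel-slurp`); other distributions ship them as `pssh` and `pslurp`.

```shell
DRY_RUN=${DRY_RUN:-1}
run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

workdir=$(mktemp -d)
remote_dir=/tmp/mapreduce                                   # hypothetical path
printf 'node01\nnode02\nnode03\n' > "$workdir/hosts.txt"    # hypothetical hosts
seq 1 90 > "$workdir/input.txt"                             # toy dataset

# Split the input into one chunk per host (rounding the chunk size up).
nhosts=$(wc -l < "$workdir/hosts.txt" | tr -d ' ')
lines=$(wc -l < "$workdir/input.txt" | tr -d ' ')
per_chunk=$(( (lines + nhosts - 1) / nhosts ))
split -l "$per_chunk" "$workdir/input.txt" "$workdir/chunk_"

# Pair each host with one chunk: "host chunk-path" per line.
ls "$workdir"/chunk_* | paste -d ' ' "$workdir/hosts.txt" - > "$workdir/plan.txt"

# Map: create the remote directory, copy one chunk to each node, start the job.
run parallel-ssh -h "$workdir/hosts.txt" -i "mkdir -p $remote_dir"
while read -r host chunk; do
    # </dev/null keeps scp/ssh from swallowing the rest of plan.txt.
    run scp "$chunk" "$host:$remote_dir/input.txt" < /dev/null
done < "$workdir/plan.txt"
run parallel-ssh -h "$workdir/hosts.txt" -t 0 -i "$remote_dir/run_analysis.sh $remote_dir/input.txt"

# Reduce, part one: pull each node's output into results/<host>/output.txt.
run parallel-slurp -h "$workdir/hosts.txt" -L "$workdir/results" "$remote_dir/output.txt" output.txt

plan_lines=$(wc -l < "$workdir/plan.txt" | tr -d ' ')
echo "planned $plan_lines chunk transfers"
```

Plain `scp` is used for the chunks because parallel scp copies the same file to every host, while each node here needs a different chunk. `-t 0` disables the pssh timeout, which matters for jobs that run for days.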

Page 13: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Management Node Functions

● Use parallel ssh to distribute scripts from management node to compute nodes,

● Using Ansible or Chef could be a next evolutionary step to automate system maintenance

Page 14: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Compute Node Functions

● Basically a bash script. How to parallelise a single-threaded application to use multiple cores on modern CPUs?
● xargs
  – pass through the list of input files,
  – -n 1 makes each invocation run on one input file,
  – -P sets the number of processes to run in parallel,
  – script waits for processing to complete & checks the output
● GNU parallel
  – Can run commands in parallel on one or more hosts,
  – More options for input placement ({}) and string replacement,
  – Can pass output as input to another process
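
A compute-node script built on xargs could be sketched as follows. The input files and the doubling "analysis" are placeholders for the real Fortran tool, and `-P` is assumed to be supported by the local xargs (it is in GNU and BSD xargs, but it is not in POSIX).

```shell
workdir=$(mktemp -d)
# Eight toy input files, each holding one number.
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > "$workdir/input_$i.dat"
done

cores=$(nproc 2>/dev/null || echo 2)

# -n 1: one input file per invocation; -P: processes to run in parallel.
# The sh -c body stands in for the single-threaded application.
printf '%s\n' "$workdir"/input_*.dat |
    xargs -n 1 -P "$cores" sh -c 'echo $(( $(cat "$1") * 2 )) > "$1.out"' _
status=$?   # xargs returns only after every child process has exited

# Check the output: one result file per input file.
done_count=$(ls "$workdir"/*.out | wc -l | tr -d ' ')
echo "xargs exit status: $status, results: $done_count"

# A roughly equivalent GNU parallel call, if it is installed (not run here):
#   parallel -j "$cores" ./analyse {} ::: "$workdir"/input_*.dat
```

xargs exits non-zero if any invocation fails, so `status` gives the script a cheap first check before it inspects the individual result files.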

Page 15: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Compute/Controller Node

● At the end of the compute node process, either the compute node pings the controller node,

● or the controller node waits for pssh to return before carrying out the next step, i.e. the reduce process, or starting the next script with the output of the 1st step as input to the 2nd,

● Check for errors and reschedule failed computes
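
Rescheduling failed computes can be as simple as a bounded retry loop on the controller. In the sketch below, `compute` is a hypothetical stand-in that fails once for chunk "b"; in practice the failure signal would be the non-zero exit status of the ssh or pssh call for that node.

```shell
workdir=$(mktemp -d)

# Hypothetical compute step: chunk "b" fails on its first attempt only.
compute() {
    chunk=$1
    if [ "$chunk" = "b" ] && [ ! -e "$workdir/b.tried" ]; then
        : > "$workdir/b.tried"
        return 1
    fi
    echo "result for $chunk" > "$workdir/$chunk.out"
}

max_attempts=3
pending="a b c"      # chunks still waiting for a successful run
attempt=1
while [ -n "$pending" ] && [ "$attempt" -le "$max_attempts" ]; do
    failed=""
    for chunk in $pending; do
        if ! compute "$chunk"; then
            echo "chunk $chunk failed on attempt $attempt, rescheduling"
            failed="$failed $chunk"
        fi
    done
    pending=$failed  # only the failures go into the next round
    attempt=$((attempt + 1))
done

if [ -z "$pending" ]; then
    echo "all chunks done"
else
    echo "giving up on:$pending"
fi
```

Capping the attempts keeps a permanently broken node or input from stalling the reduce step forever.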

Page 16: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Security

● Each student should have a separate account
  – Linux multi-user system. User home directory for storing files and results
● Each user should be limited in resource usage
  – Simple
    ● ulimit
    ● psacct
  – Advanced
    ● cgroups
    ● namespaces
● Students can execute but not read the bash script file; special permissions
  – Use sudo, or
  – Linux capabilities
    ● setcap, e.g. setcap "cap_kill=+ep" script.sh
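
For the "simple" option, a per-job limit with ulimit might look like the sketch below. The 60 second value is purely illustrative, and the sketch assumes the hard CPU-time limit of the current session is at least that high.

```shell
# Command substitution runs in a subshell, so the limit set here does not
# leak into the calling shell. -S sets/reads the soft limit, -t is CPU seconds.
limited_cpu=$(ulimit -S -t 60 && ulimit -S -t)
echo "soft CPU-time limit inside the subshell: ${limited_cpu}s"

# A real job would be started inside the same kind of subshell, e.g.:
#   ( ulimit -S -t 60; ./run_analysis.sh )    # run_analysis.sh is hypothetical
```

Limits set this way apply per process, so they cap a runaway job but not a user's total usage across many jobs. That is where the cgroups option comes in.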

Page 17: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Security

● Limit the root user
  – Linux capabilities
    ● setcap, capsh, pscap
    ● Disable the root account and grant CAP_SYS_ADMIN as needed,
    ● /etc/security/capability.conf

Page 18: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Resources

● Parallel SSH
● xargs
● GNU parallel
● cgroups
● namespaces
● Linux capabilities

● Twitter - @mxc4
● Gplus – Mark Clarke
● Jumping Bean
● Cyber Connect

● Jozi Linux User Group
● Jozi Java User Group
● Maker Labs