Map Reduce with Bash - An Example of the Unix Philosophy in Action
Transcript of Map Reduce with Bash - An Example of the Unix Philosophy in Action
Jumping Bean
Map Reduce with Bash (the power of the Unix philosophy)
About Me
● Solutions integrator at Jumping Bean – Developer & Trainer
– Technologies: Java, PHP, HTML5/JavaScript, Linux
– What I am planning to do: the Internet of Things
Map/Reduce with Bash
● The purpose of this presentation is:
– To demonstrate the power and flexibility of the Unix philosophy
– To show what awesome solutions can be created using simple bash scripts and userland tools
– To show off cool utilities and tools
● The purpose is not:
– To suggest that Map/Reduce is best done with bash
– Though it may be the best approach given the constraints – see the business problem
Unix Philosophy
“is a set of cultural norms and philosophical approaches to developing small yet capable software” - Wikipedia
Unix Philosophy
“Early Unix developers were important in bringing the concepts of modularity and reusability into
software engineering practice, spawning a 'software tools' movement” - Wikipedia
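The "software tools" idea is easiest to see in a pipeline, where each small program does one job and pipes compose them. A toy example (the data is made up purely for illustration):

```shell
# Each small tool does one job; the pipe composes them.
# Find the most frequent line in a stream of text.
printf 'b\na\nb\nc\nb\n' | sort | uniq -c | sort -rn | head -n 1
# prints the count and value of the most frequent line (here: 3 b)
```

`sort` groups identical lines, `uniq -c` counts each group, `sort -rn` orders by count descending, and `head` keeps the winner — four reusable tools, zero new code.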
Business Problem
● Nuclear Engineering department needs to run Monte Carlo methods on data to calculate something to do with the core temperature of nuclear reactors :)
● Post-grad students need to run the analysis as part of their course work
● Analysis can take days or weeks to run
● University has invested in a 900-node cluster
● Cluster is used for research when not used by students
● Tool used for the analysis is:
– written in Fortran
– single threaded
● No money for a fancy-pants solution
Business Problem
● As-Is system:
– Professor uses a laptop and desktop
– Manually starts the application with a simple script
– Starts the script x number of times, where x = number of cores
– Waits for days
– Manually checks progress
– Not scalable to 900 nodes!
Business Problem
● Unknowns:
– How is the 900-node cluster set up, i.e. is it using any cluster software or virtualisation?
● OpenStack? OpenNebula? KVM?
– Tools available to the IT department, i.e. how they do deploys, monitoring, user management, etc.
● Requirements:
– Independence from the IT department or experts for help
– Student & lecturer IT skills are limited to Fortran & some bash scripting
– Due to security concerns, prevent IT staff from gaining access to the research
● Keep it simple – proof of concept
What is Map/Reduce?
● Programming model for:
– Processing and generating large datasets
– Using a parallel, distributed algorithm
– On a cluster or set of distributed nodes
● Popularised by Google and the advent of cloud computing
● Apache Hadoop – a full-blown map/reduce framework. Used to analyse your social media data, to “understand the customer”, and by numerous agencies with three-letter acronyms
– “Really, we are only trying to help you know yourself better”
Map/Reduce Steps
● Map – the master node takes a large dataset and distributes it to compute nodes to perform the analysis on. The compute nodes return a result
● Reduce – gather the results from the compute nodes and aggregate them into the final answer
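The two steps can be shown end-to-end on a single machine with nothing but userland tools. A toy word-count, not the actual reactor analysis; file names are made up:

```shell
set -euo pipefail

# Dataset: one word per line
printf 'hot\ncold\nhot\nhot\n' > words.txt

# Map: split the dataset into two chunks, one per "node"
# (-n l/2 splits on line boundaries; -d gives numeric suffixes)
split -n l/2 -d words.txt part_

# Each "node" emits a partial count of the word "hot"
for chunk in part_*; do
  grep -c '^hot$' "$chunk" > "$chunk.count" || true
done

# Reduce: aggregate the partial counts into the final answer
awk '{ s += $1 } END { print s }' part_*.count
# prints 3
```

On the real cluster the chunks would travel over ssh instead of staying in one directory, but the shape of the computation is the same.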
What we need
● Controller node functions:
– Distribute data to nodes
– Execute calculation functions
– Collect results
● Management node functions:
– Distribute application and scripts to compute nodes
● Compute node functions:
– Scripts to run the single-threaded application in parallel on multi-core processors
● Security requirements:
– Prevent system administrators from gaining access to the core application, scripts or data
Controller Functions
● How to distribute files to a node (map), execute calculations & gather results (reduce)?
– Use split to split input files
– Use ssh to distribute files and execute processes
● How to do this on multiple (900) nodes?
– Use parallel ssh (pssh) and parallel scp
● Issues:
– Copying a public key to 900 machines?
– Give each student their own account?
● Solution:
– Set up LDAP authentication (password based), or
– Include the controller node's root public key in the compute node image and distribute secondary keys via scripts using pssh
– Fancy pants – Chef, Ansible
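A minimal sketch of the controller's map-side file distribution, assuming GNU split. The hosts file, target path and the prsync call are hypothetical and left commented out, since they need a real cluster:

```shell
set -euo pipefail

# Stand-in dataset (100 lines) for the sketch
seq 1 100 > data.txt

# Split into one chunk per node without breaking lines (-n l/N);
# -d gives numeric suffixes: chunk_00 .. chunk_03
NODES=4
split -n l/"$NODES" -d data.txt chunk_

ls chunk_*

# Distribution step (hypothetical hosts file; requires the pssh tools):
# prsync -h hosts.txt chunk_00 /home/research/input/
```

With 900 nodes the same two commands apply; only `NODES` and the hosts file grow.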
Management Node Functions
● Use parallel ssh to distribute scripts from management node to compute nodes,
● Using Ansible or Chef could be a next evolutionary step to automate system maintenance
Compute Node Functions
● Basically a bash script – how to parallelise a single-threaded application to use multiple cores on modern CPUs?
● xargs:
– Pass through a list of input files
– -n sets each iteration to run on one input file
– -P sets the number of processes to start in parallel
– Script waits for completion of processing & checks output
● GNU parallel:
– Can run commands in parallel using one or more hosts
– More options for target input placement {} and string replacement
– Can pass output as input to another process
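A runnable sketch of the xargs approach, with `wc -l` standing in for the single-threaded Fortran tool; the file names and directories are invented for the example:

```shell
set -euo pipefail
mkdir -p inputs out

# Fake input files standing in for the real datasets
for i in 1 2 3 4; do seq 1 $((i * 10)) > "inputs/run$i.dat"; done

# -n 1: one input file per invocation; -P 4: up to four processes in parallel.
# Each invocation writes its result into out/ alongside the others.
find inputs -name '*.dat' -print0 |
  xargs -0 -n 1 -P 4 sh -c 'wc -l < "$1" > "out/$(basename "$1").result"' _

# xargs only returns once all workers have finished, so the script can
# now check the output. A GNU parallel equivalent (untested sketch):
# find inputs -name '*.dat' | parallel -j 4 'wc -l < {} > out/{/}.result'
ls out
```

Because xargs blocks until the last child exits, the "wait for completion & check output" step from the slide is just the next line of the script.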
Compute/Controller Node
● At the end of the compute node process, either the compute node pings the controller node, or
● The controller node waits for pssh to return before carrying out the next step, i.e. the reduce process, or starting the next script with the output of the first step as input to the second
● Check for errors and reschedule failed computations
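Once the partial results are back on the controller, the reduce step can be as small as an awk one-liner. A sketch with fabricated per-node numbers:

```shell
set -euo pipefail
mkdir -p results

# Pretend these were copied back from three compute nodes
echo 10 > results/node1.result
echo 25 > results/node2.result
echo 7  > results/node3.result

# Reduce: aggregate the partial results into the final answer
awk '{ s += $1 } END { print s }' results/*.result
# prints 42
```

A missing `nodeN.result` file at this point is also the natural signal that a compute failed and should be rescheduled.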
Security
● Each student should have a separate account
– Linux is a multi-user system; user home directories store files and results
● Each user should be limited in resource usage
– Simple: ulimit, psacct
– Advanced: cgroups, namespaces
● Students can execute but not read the bash script file – special permissions
– Use sudo, or
– Linux capabilities: setcap, e.g. setcap "cap_kill=+ep" script.sh
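On the simple end, ulimit caps resources for a shell session and everything it spawns; the specific limits below are arbitrary examples:

```shell
# Run in a child shell so the limits do not stick to the calling shell.
# -t caps CPU seconds, -u caps the number of user processes.
bash -c 'ulimit -t 600; ulimit -u 100; ulimit -t'
# prints 600
```

Dropping lines like these into each student's login profile is the low-tech version of what cgroups would enforce properly.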
Security
● Limit the root user – Linux capabilities
– setcap, capsh, pscap
– Disable the root account – grant CAP_SYS_ADMIN as needed
– /etc/security/capabilities.conf
Resources
● Parallel SSH
● xargs
● GNU parallel
● cgroups
● namespaces
● Linux capabilities
● Twitter – @mxc4
● Gplus – Mark Clarke
● Jumping Bean
● Cyber Connect
● Jozi Linux User Group
● Jozi Java User Group
● Maker Labs