Developing & Managing A Large Linux Farm – The Brookhaven Experience
CHEP2004 – Interlaken
September 27, 2004
Tomasz Wlodek - BNL
Background
Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government.
BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments.
The RHIC Computing Facility (RCF) was formed in the mid-90s to address the computing needs of the RHIC experiments.
Background (cont.)
BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN.
RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).
Background (cont.)
The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF
RCF/ACF is transforming itself from a local resource into a national and global resource
Growing design and operational complexity
Increasing staffing levels to handle additional responsibilities
RCF/ACF Structure
Staff Growth at the RCF/ACF
[Chart: staff levels (FTE) by year, 1997 to 2005 (est.); vertical axis 0 to 35]
The Pre-Grid Era
Rack-mounted commodity hardware
Self-contained, localized resources
Resources available only to local users
Little interaction with external resources at remote locations
Considerable freedom to set own usage policies
The (Near-Term) Future
Resources available globally
Distributed computing architecture
Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth
Constraints on freedom to set own policies
How do we get there?
Change in management philosophy
Evolution in hardware requirements
Evolution in software packages
Different security protocol(s)
Change in access policy
Change in Management Philosophy
Automated monitoring & management of servers in large clusters is a must
Remote power management, predictive hardware failure analysis and preventive maintenance are important
High availability based on a large number of identical servers, not on 24-hour support
Increasingly large clusters are only manageable if servers are identical → avoid specialized servers
Evolution in Hardware Requirements
Early acquisitions emphasized CPU power over local storage capacity
Increasing affordability of local disk storage has changed this philosophy
Hardware chosen by optimal combination of CPU power, storage capacity, server density and price
Buy from high-quality vendors to avoid labor-intensive maintenance issues
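One way to picture "optimal combination of CPU power, storage capacity, server density and price" is a weighted ranking of candidate servers. The sketch below is only an illustration under stated assumptions: the class `ServerOffer`, the function `rank_offers`, the equal default weights, the normalization against the best candidate, and the sample figures are all hypothetical and are not the procedure used at the RCF/ACF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOffer:
    """Hypothetical description of one candidate server model."""
    name: str
    specint2000: float  # CPU power per server
    disk_gb: float      # local storage per server
    rack_units: float   # height in rack units (smaller means denser)
    price_usd: float


def rank_offers(offers, w_cpu=1.0, w_disk=1.0, w_density=1.0):
    """Return (score, offer) pairs sorted from best to worst.

    Each criterion is scaled so the best candidate gets 1.0, then the
    three scaled values are combined with the (assumed) weights.
    """
    if not offers:
        return []

    cpu_per_dollar = [o.specint2000 / o.price_usd for o in offers]
    disk_per_dollar = [o.disk_gb / o.price_usd for o in offers]
    density = [1.0 / o.rack_units for o in offers]

    def normalized(values):
        top = max(values)
        return [v / top if top > 0 else 0.0 for v in values]

    rows = zip(normalized(cpu_per_dollar), normalized(disk_per_dollar),
               normalized(density), offers)
    scored = [(w_cpu * c + w_disk * d + w_density * s, o) for c, d, s, o in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    candidates = [
        ServerOffer("model-a", 1000, 200, 1, 2000),
        ServerOffer("model-b", 800, 100, 2, 2500),
    ]
    for value, offer in rank_offers(candidates):
        print(f"{offer.name}: {value:.2f}")
```

Changing the weights shifts the ranking toward CPU, storage or density, which mirrors the move from CPU-centric purchases to a balanced choice described on this slide.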
The Growth of the Linux Farm
[Chart: farm computing capacity in KSpecInt2000 by year, 1999 to 2004; vertical axis 0 to 1400]
Drop in Server Price as a Function of Performance
[Chart: cost per SpecInt2000 (in U.S. dollars) by year, 1999 to 2004; vertical axis 0 to 14]
Drop in Cost of Local Storage
[Chart: cost per GB (in U.S. dollars) by year, 1999 to 2004; vertical axis 0 to 70]
Total Distributed Storage Capacity
[Chart: total storage capacity (TB) by year, 1999 to 2004; vertical axis 0 to 250]
Growth of Storage Capacity per Server
[Chart: storage capacity per server (GB/server) by year, 1999 to 2004; vertical axis 0 to 450]
Server Reliability
[Chart: failures per machine per month by year, 2000 to 2004; vertical axis 0 to 0.012]
Failure rate: about 1/week at current size
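The chart's unit (failures per machine per month) and a farm-wide figure such as "about 1/week" are related by simple arithmetic. The sketch below shows the conversion; the function name is hypothetical, and the farm size and per-machine rate in the example call are assumed values chosen for illustration, not numbers taken from the talk.

```python
# Illustrative arithmetic only. The inputs in the example call are
# assumptions, not measurements from the RCF/ACF.

DAYS_PER_MONTH = 365.25 / 12  # average month length in days


def expected_failures_per_week(rate_per_machine_month: float, machines: int) -> float:
    """Convert a per-machine monthly failure rate into farm-wide failures per week."""
    if rate_per_machine_month < 0 or machines < 0:
        raise ValueError("rate and machine count must be non-negative")
    failures_per_month = rate_per_machine_month * machines
    return failures_per_month * 7 / DAYS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical farm: 1500 machines, 0.003 failures per machine per month.
    print(round(expected_failures_per_week(0.003, 1500), 2))
```

The same per-machine rate produces proportionally more failures as the farm grows, which is why a low per-server failure rate still translates into regular hardware interventions at this scale.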
The Factors Enforcing Evolution in Software Packages
Cost
Farm size / scalability
Security
External influences / wide acceptance
Cost
Red Hat Linux → Scientific Linux
LSF → Condor
Farm Size / Scalability
Home-built batch system for data reconstruction → Condor-based batch system
Home-built monitoring system → Ganglia
Security
Started with NIS/telnet in the 90s
Cyber-security threats prompted the installation of firewalls, gatekeepers and migration to ssh → stricter security standards than in the past
Ongoing change to Kerberos 5. Ongoing phase-out of NIS passwords.
Testing GSI → limited support for GSI
Security Changes (cont.)
Authorization & authentication controlled by local site (NIS and Kerberos)
Migration to GSI requires a central CA and regional VOs for authentication → local sites perform final authentication before granting access
Accept certificates from multiple CAs?
Difficult transition from complete to partial control over security issues
External Influences / Wide Acceptance
Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission.
HRM / dCache – used by other labs
Condor – widely used by the ATLAS community
Software Evolution - summary
Package               Old               New                 Date
OS                    RedHat Linux      Scientific Linux    2004
Batch                 Home-Built/LSF    Condor/LSF          2004/2000
Monitoring            Home-Built        Ganglia             2003
Security              NIS               K5/GSI              2003/2004
Distributed Storage   ---               HRM/dCache          2004/?
Ganglia at the RCF/ACF
Condor at the RCF/ACF
Summary
RCF/ACF going through a transition from a local facility to a regional (global) facility → many changes
Linux Farm built with commodity hardware is increasingly affordable and reliable
Distributed storage is also increasingly affordable → management software issues.
Summary (cont.)
Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices
Transition with security and access issues
Migration will take longer and be more difficult than generally expected → change in hardware and software needs to be complemented by a change in management philosophy