HTCondor 소개 및 - scent.gist.ac.kr · PDF filethe language that HTCondor uses to represent...
Transcript of HTCondor 소개 및 - scent.gist.ac.kr · PDF filethe language that HTCondor uses to represent...
Korea Institute of Science and Technology Information
2014 SCENT HPC Summer School@GIST
국가슈퍼컴퓨팅연구소 슈퍼컴퓨팅서비스센터
슈퍼컴퓨팅서비스통합실
박 주 원
HTCondor 소개 및
HTC@PLSI 구축 사례
Korea Institute of Science and Technology Information
1
2
3
2
Korea Institute of Science and Technology Information
1
2
3
3
Korea Institute of Science and Technology Information
What
4
HTC (High Throughput Computing)
서로 독립적인 다수의 하위 작업
처리를 위한 컴퓨팅자원 사용 방식
Why
HPC HTC
Computation Need
Large computing power Large computing power
Duration Short(hours, days) Long(months, years)
InterestHow fast an individual job can complete
How many jobs can complete overa long period of time
Communication Tightly coupled Loosely coupled
Resource Reqs.Must execute within a particular site with low-latency interconnects
Independent, sequential jobs can be individually scheduled on many different computing resources across multiple sites
< HPC vs HTC >
슈퍼컴퓨터 지원 분야 확대
천문우주, 의료 등 다양한
분야에서 1만개 이상의 하위
작업으로 구성된 작업 처리 요청
슈퍼컴퓨터 이용 효율성 증대
고성능 슈퍼 컴퓨팅 자원이
필요한 분야에 효율적으로 제공
HTC 서비스 개요
Korea Institute of Science and Technology Information 5
XSEDE에서의 HTC 서비스는 Open Science
Grid 자원을 활용함.
OSG에는 CHTC에서 제공하는 Condor 풀
(GLOW)이 있음.
Condor-G 기반의 meta-scheduling 지원
서로 다른 종류의 Local Scheduler를
사용하는 클러스터를 통해 HTC 서비스를
제공하기 위해 BOSCO라는 프로젝트 진행중.
① HTC Service in XSEDE (미국)
HTC 서비스 개요
Korea Institute of Science and Technology Information 6
유럽에서는 HTC 서비스 제공을 위해
EGI (European Grid Infra.)를 이용함.
기존의 Grid 뿐만 아니라 클라우드
자원을 통합하여 a grid of clouds에
대한 연구도 진행 중임.
EGI에서는 meta-scheduler로
GridWay를 이용함.
② HTC Service in EGI (유럽)
HTC 서비스 개요
Korea Institute of Science and Technology Information 7
Meta-Job(OGF JSDL Standard)에 기반한
대규모 계산 작업 제출 및 자동 분할 기능
Agent(Pilot Job)에 기반한 Multi-level
Scheduling
계산 자원의 Local Storage 활용을 위한 Data
Management Framework 제공
웹 인터페이스, JAVA API, 클라이언트
프로그램을 비롯한 다양한 클라이언트
인터페이스 제공
② HTC Service @ KISTI
HTCaaS
HTC 서비스 개요
Korea Institute of Science and Technology Information
• Introduction • Architecture • Getting Started • Administration • Use Cases
2
1
3
8
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 9
Korea Institute of Science and Technology Information
Job Scheduler
10
작업 실행
결과 확인
User Computing resource
Korea Institute of Science and Technology Information
Job Scheduler
11
Korea Institute of Science and Technology Information
Job Scheduler
12
HTCondor
Korea Institute of Science and Technology Information
Introduction to using HTCondor
13
Miron Livny
John (TJ) Knoeller
Todd Tannenbaum
Zach Miller
Open source project out of the University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
Established in 1985 to do research and development of distributed high-
throughput computing
Korea Institute of Science and Technology Information
Introduction to using HTCondor
14
Open Source (Free!!)
Interoperates with many types of computing grids
Manages both dedicated CPUs (clusters) and non-dedicated resources
Very configurable, adaptable
Flexible multi-clustering (flocking)
HTCondor 특징
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 15
Korea Institute of Science and Technology Information
HTCondor Architecture
16
ClassAd
HTCondor’s internal data representation
Matchmaking
associating a job with a machine resource
Central Manager
central repository for the whole pool
Submit Host
the computer from which jobs are submitted to HTCondor
Execute Host
Definition
Korea Institute of Science and Technology Information
HTCondor Architecture
17
Korea Institute of Science and Technology Information
HTCondor Architecture
18
Execute Machine Submit Machine
submit
schedd
starter Job shadow
startd
Central Manager
collector negotiator
Q
J
S
Q
S
J
J S
J J S S
master
master master
Korea Institute of Science and Technology Information
HTCondor Architecture
19
…
Out = "out/out.2699“
Cmd = "monte_int“
TransferInput = "data.2699“
UserLog = "log.2699“
Owner = "p377han“
Requirements = ( TARGET.Arch ==
"X86_64" ) && ( TARGET.OpSys ==
"LINUX" ) && ( TARGET.Disk >=
RequestDisk ) &&
( TARGET.Memory >= RequestMemory )
&& ( TARGET.HasFileTransfer
…
Job ClassAD
…
Machine = "glory254.plsi.or.kr"
OpSysAndVer = "CentOS5"
JavaVersion = "1.4.2"
CondorVersion = "$CondorVersion:
7.8.8 Mar 20 2013 BuildID: 110288
$"
HardwareAddress =
"00:1b:24:78:37:65"
COLLECTOR_HOST_STRING = "glory-
mg01.plsi.or.kr"
SubnetMask = "255.255.255.0“
…
Machine ClassAD
ClassAd
the language that HTCondor uses to represent information about: jobs (job ClassAd),
machines (machine ClassAd), and programs that implement HTCondor's functionality
(called Deamon)
Korea Institute of Science and Technology Information
HTCondor Architecture
20
Job ClassAD Machine ClassAD
Matchmaking
The matchmaker matches job ClassAds with machine ClassAds, taking into account:
• Requirements of both the machine and the job
• Rank of both the job and the machine
• Priorities, such as those of users and group
Matchmaking
Matchmaking
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 21
Korea Institute of Science and Technology Information
Getting Started
22
① Choose a universe for the job
② Make the job batch-ready
③ Create a job description file
④ Run condor_submit to put the job in the queue
Korea Institute of Science and Technology Information
Getting Started
23
① Choose a universe for the job
Controls how HTCondor handles jobs
the many universes include:
Standard
Vanilla
Grid
Parallel
VM
Allows running almost any “serial” job
Provides automatic file transfer for input and output files
Korea Institute of Science and Technology Information
Getting Started
24
Must be able to run in the background
No interactive input
No GUI/window clicks
Job can still use stdin (keyboard), stdout (screen) , and stderr , but
files are used instead of the actual devices
② Make the job batch-ready
Korea Institute of Science and Technology Information
Getting Started
25
A plain text file
File name extensions are irrelevant,
although many use .sub or .submit as
suffixes
File handling
Input ( ex. $a < a.in)
Output (ex. $a > a.out)
Error (ex. $a 2> a.err)
③ Create a job description file
Universe = vanilla
Executable = a.out
Input = a.in
Output = a.out
Error = a.err
Log = a.log
queue
test.jds
Korea Institute of Science and Technology Information
Getting Started
26
④ Run condor_submit to put the job in the queue
Run condor_submit, providing the name of the submit description file:
condor_submit will
Parse the submit description files, checking for errors
Create a ClassAd that describes the job
Place the job in the queue
$ condor_submit test.jds
Korea Institute of Science and Technology Information
Example
27
count.sh multiple.jds
executable = count.sh
universe = vanilla
output = out/out.txt
error = out/err.txt
log = out/log.txt
Queue 50
Korea Institute of Science and Technology Information
Example
28
$ condor_status
Korea Institute of Science and Technology Information
Example
29
$ condor_submit multiple.jds
Job ID
Korea Institute of Science and Technology Information
Example
30
$ condor_status
Korea Institute of Science and Technology Information
Example
31
$ condor_q
Job ID (ClusterId.ProcId)
작업 제출자 작업 상태
Korea Institute of Science and Technology Information
Example
32
$ condor_release $ condor_hold
Korea Institute of Science and Technology Information
More that you do with HTCondor
33
Requirements
A boolean expression
Evaluated with respect to attributes from machine ClassAd(s)
Korea Institute of Science and Technology Information
More that you do with HTCondor
34
Rank
All matches which meet the requirements can be sorted by preference with a
Rank expression
Like Requirements, is evaluated against attributes from machine ClassAd(s)
Universe = vanilla
Executable = a.out
Input = a.in
Output = a.out
Error = a.err
Log = a.log
Requirements = machine == “test.plsi.or.kr”
Rank = KFLOPS
Queue 500
test.jds
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 35
Korea Institute of Science and Technology Information
Administrating HTCondor
36
① Determine the machine role
② Obtain HTCondor & Install
③ Configure HTCondor
Korea Institute of Science and Technology Information
Administrating HTCondor
37
① Determine the machine role (Central manager)
하나의 pool에 하나의 Central manager (Negotiator, Collector)
다수의 Submit host 가능
다수의 Execute host
[jwpark@apu1 condor_example]$ ps -ef | grep condor_
condor 2796 1 0 Feb07 ? 02:14:20 condor_master -f
jwpark 3190 2936 0 09:56 pts/0 00:00:00 grep condor_
condor 30896 2796 0 Jun20 ? 00:04:09 condor_collector -f
condor 30897 2796 0 Jun20 ? 00:03:42 condor_negotiator -f
condor 30898 2796 0 Jun20 ? 00:02:15 condor_schedd -f
root 30900 30898 0 Jun20 ? 00:01:37 condor_procd -A
/tmp/condor-lock.apu10.901845565605075/procd_pipe.SCHEDD -L
/opt/condor/local/log/ProcLog.SCHEDD -R 10000000 -S 60 -C 780
[jwpark@apu3 ~]$ ps -ef | grep condor_
condor 2132 1 0 Feb06 ? 00:43:49 condor_master -f
condor 2386 2132 0 Feb18 ? 00:19:26 condor_startd -f
root 7701 2386 0 Feb19 ? 00:39:39 condor_procd -A
/tmp/condor-lock.apu30.70573943301952/procd_pipe.STARTD -L
/opt/condor/local/log/ProcLog.STARTD -R 10000000 -S 60 -C 780
jwpark 24247 24223 0 09:51 pts/0 00:00:00 grep condor_
Central Manager Execute host
Korea Institute of Science and Technology Information
Administrating HTCondor
38
② Obtain HTCondor & Install
Obtain HTCondor (http://research.cs.wisc.edu/htcondor/downloads/)
Install on central manager
Install on execute hosts
$ groupadd condor --gid 780
$ useradd -d /home/condor -g condor --uid
780 --shell /bin/bash condor
$ su
$ ./condor_install --prefix=/opt/condor --
local-dir=/opt/condor/local --
type=manager,submit --owner=condor
Central Manager Execute host
$ groupadd condor --gid 780
$ useradd -d /home/condor -g condor --uid
780 --shell /bin/bash condor
$ su
$ ./condor_install --prefix=/opt/condor --
local-dir=/opt/condor/local --
type=execute --owner=condor --
central-manager=apu1-ib
Korea Institute of Science and Technology Information
Administrating HTCondor
39
③ Configure HTCondor
/opt/condor/etc/condor_config
Local 설정 파일 수정 (/opt/condor/local/condor_config.local)
Central Manager 확인
policy 설정
security 설정
flocking 설정
cat /opt/condor/local/condor_config.local
## What machine is your central manager?
CONDOR_HOST = apu1-ib
…
Korea Institute of Science and Technology Information
Administrating HTCondor
40
③ Configure HTCondor
Policy expression
Start
Rank
Suspend
Continue
Preempt
Kill
Korea Institute of Science and Technology Information
Administrating HTCondor
41
③ Configure HTCondor
Always run jobs
START = True
RANK =
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
Korea Institute of Science and Technology Information
Administrating HTCondor
42
③ Configure HTCondor
Prefer “chemistry” job
START = True
RANK = Department == "Chemistry"
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
Korea Institute of Science and Technology Information
Administrating HTCondor
43
③ Configure HTCondor
Security configuration
ALLOW_WRITE = \*.plsi.or.kr
ALLOW_READ = \*.plsi.or.kr
Korea Institute of Science and Technology Information
Administrating HTCondor
44
③ Configure HTCondor
Flocking 설정
FLOCK_FROM = titan, venus
FLOCK_TO = titan, venus, kobic
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 45
Korea Institute of Science and Technology Information
Use Cases
46
미국 지질조사국 (Monte Carlo Simulation)
지하수 분포도 모델링
Korea Institute of Science and Technology Information
Use Cases
47
Duke University
Protein Structure 모델링
Korea Institute of Science and Technology Information
Use Cases
48
University of Wisconsin – Madison
IceCube 중성미자 분석
Korea Institute of Science and Technology Information
3
1
2
49
Korea Institute of Science and Technology Information 50
Job
Job
Job
Job
Job
Job
JobJob
Job
Job
Job Job JobJob
Job
Job Job Job Job Job Job Job Job
Job Job Job Job Job Job Job Job
Job Job Job Job Job Job Job Job4호기
4호기
PLSI
HPC
HTC
Tachyon & GAIA
주요 미션
소규모 자원
장비의 노후화
분산된 환경
HTC@PLSI 구축 사례
Korea Institute of Science and Technology Information 51
HTCondor의 기능을 활용하여 HTC 서비스 인프라 구축 (@PLSI)
지리적으로 분산된 6개 사이트 1000+ core 규모
기관명 시스템명 CPU core Mem 비고
KU Venus Intel Xeon
@2.5GHz 112 16GB
(2014년도 편
입 완료)
KIAS gene
(x86 클러스터)
AMD Opteron
@2GHz 128 8GB
GIST titan
(x86 클러스터)
Intel Xeon
@2.5GHz 128 16GB
KOBIC KOBIC
(SUN x4100)
AMD Opteron
@2.1GHz 184 4GB
KISTI GLORY
(SUN x2100)
AMD Opteron
@1.8GHz 514 2GB
KISTI kairos
(VM cluster)
Intel Xeon
@2.5GHz 200 16GB
2014 하반기
편입 예정
UOS t2c
(x86 클러스터)
Intel(R) Xeon
@ 2.13GHz 80 ~ 120 8GB 편입 협의중
HTC@PLSI 구축 사례
Korea Institute of Science and Technology Information 52
HTC@PLSI 구축 사례
서비스 정책 (default)
Korea Institute of Science and Technology Information 53
HTC@PLSI 구축 사례
서비스 정책 (전용자원 요청시)
Korea Institute of Science and Technology Information 54
HTC@PLSI 활용 사례
연세대학교
주 소행성대 소행성 종족 연구 분야 (Monte Carlo Simulation)
서울시립대학교
의료/고에너지물리 분야 지원 (GEANT4 지원)
한국천문연구원
남천황도대 태양계소천체 집중탐사연구 분야 협력 (예정)
HTC@PLSI 구축 사례
Korea Institute of Science and Technology Information Korea Institute of Science and Technology Information 55
Korea Institute of Science and Technology Information
HTC@PLSI 구축 사례
56
Korea Institute of Science and Technology Information
HTC@PLSI 구축 사례
57
Demo!!