
Monitoring HTCondor

Andrew Lahiff

STFC Rutherford Appleton Laboratory

European HTCondor Site Admins Meeting 2014

Introduction

• Two aspects of monitoring

– General overview of the system
• How many running/idle jobs? By user/VO? By schedd?
• How full is the farm?
• How many draining worker nodes?

– More detailed views
• What are individual jobs doing?
• What’s happening on individual worker nodes?

• Health of the different components of the HTCondor pool

• ...in addition to Nagios

Introduction

• Methods

– Command line utilities
– Ganglia
– Third-party applications (which run command-line tools or use the Python API); see the sketch below
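For scripting and third-party tools, the HTCondor Python bindings expose much of the same information as the command-line utilities. A minimal sketch, assuming the htcondor Python module is installed and run on a host with access to the pool (the attributes queried are standard; the output format is illustrative):

import htcondor

# Locate the collector for the local pool.
coll = htcondor.Collector()

# Roughly equivalent to "condor_status -schedd": one ClassAd per schedd.
for ad in coll.query(htcondor.AdTypes.Schedd):
    print("%s running=%s idle=%s" % (ad["Name"],
                                     ad.get("TotalRunningJobs", 0),
                                     ad.get("TotalIdleJobs", 0)))

# Roughly equivalent to "condor_q" against the local schedd: idle jobs only.
schedd = htcondor.Schedd()
for job in schedd.query("JobStatus == 1"):
    print("%d.%d %s" % (job["ClusterId"], job["ProcId"], job["Owner"]))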

Command line

• Three useful commands

– condor_status
• Overview of the pool (including jobs, machines)
• Information about specific worker nodes

– condor_q
• Information about jobs in the queue

– condor_history
• Information about completed jobs

Overview of jobs

-bash-4.1$ condor_status -collector

Name Machine RunningJobs IdleJobs HostsTotal

RAL-LCG2@condor01.gridpp.rl. condor01.gridpp.rl 10608 8355 11347

RAL-LCG2@condor02.gridpp.rl. condor02.gridpp.rl 10616 8364 11360

Overview of machines

-bash-4.1$ condor_status -total

Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 11183 95 10441 592 0 0 0

Total 11183 95 10441 592 0 0 0

Jobs by schedd

-bash-4.1$ condor_status -schedd

Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs

arc-ce01.gridpp.rl.a arc-ce01.g 2388 1990 13

arc-ce02.gridpp.rl.a arc-ce02.g 2011 1995 31

arc-ce03.gridpp.rl.a arc-ce03.g 4272 1994 9

arc-ce04.gridpp.rl.a arc-ce04.g 1424 2385 12

arc-ce05.gridpp.rl.a arc-ce05.g 1 0 6

cream-ce01.gridpp.rl cream-ce01 266 0 0

cream-ce02.gridpp.rl cream-ce02 247 0 0

lcg0955.gridpp.rl.ac lcg0955.gr 0 0 0

lcgui03.gridpp.rl.ac lcgui03.gr 3 0 0

lcgui04.gridpp.rl.ac lcgui04.gr 0 0 0

lcgvm21.gridpp.rl.ac lcgvm21.gr 0 0 0

TotalRunningJobs TotalIdleJobs TotalHeldJobs

Total 10612 8364 71

Jobs by user, schedd

-bash-4.1$ condor_status -submitters

Name Machine RunningJobs IdleJobs HeldJobs

group_ALICE.alice.alice043@g arc-ce01.gridpp.rl 0 0 0

group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl 540 0 1

group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl 142 0 0

group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl 82 5 0

group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl 1 0 0

group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl 214 390 0

group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl 68 100 0

group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl 78 476 4

group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl 12 910 0

group_CMS.prodcms_multicore. arc-ce01.gridpp.rl 47 102 0

group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl 0 0 0

group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl 992 0 2

group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl 0 0 0

…

Jobs by user

RunningJobs IdleJobs HeldJobs

group_ALICE.alice.al 0 0 0

group_ALICE.alice.al 3500 368 5

group_ALICE.alice_pi 0 0 0

group_ATLAS.atlas.at 0 0 0

group_ATLAS.atlas.at 0 0 0

group_ATLAS.atlas_pi 414 12 10

group_ATLAS.atlas_pi 0 0 2

group_ATLAS.prodatls 354 36 11

group_CMS.cms.cmssgm 1 0 0

group_CMS.cms_pilot. 371 2223 0

group_CMS.cms_pilot. 0 0 1

group_CMS.cms_pilot. 68 200 0

group_CMS.prodcms.pc 188 1905 10

group_CMS.prodcms.pc 312 3410 0

group_CMS.prodcms_mu 47 102 0

condor_q

[root@arc-ce01 ~]# condor_q

-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

794717.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794718.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794719.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794720.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794721.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794722.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794723.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794725.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794726.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended

Multi-core jobs

-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

832677.0 pcms004 12/5 14:33 0+00:15:07 R 0 2.0 (gridjob )

832717.0 pcms004 12/5 14:37 0+00:12:02 R 0 0.0 (gridjob )

832718.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )

832719.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )

832893.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )

832894.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )

Multi-core jobs

• Custom print format

-bash-4.1$ condor_q -global -pr queue_mc.cpf

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>

ID OWNER SUBMITTED RUN_TIME ST SIZE CMD CORES

832677.0 pcms004 12/5 14:33 0+00:00:00 R 2.0 (gridjob) 8

832717.0 pcms004 12/5 14:37 0+00:00:00 R 0.0 (gridjob) 8

832718.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8

832719.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8

832893.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8

832894.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats
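The queue_mc.cpf file itself is not shown in the slides. A rough sketch of what such a custom print format might look like, following the examples on the wiki page above (the column labels, widths and exact set of columns are illustrative and may need adjusting):

# queue_mc.cpf -- condor_q-style columns plus the number of requested cores
SELECT
   ClusterId   AS " ID"         NOSUFFIX WIDTH 7
   ProcId      AS " "           NOPREFIX PRINTF ".%-3d"
   Owner       AS "OWNER"       WIDTH -11 PRINTAS OWNER
   QDate       AS "  SUBMITTED" WIDTH 11  PRINTAS QDATE
   JobStatus   AS ST            PRINTAS JOB_STATUS
   Cmd         AS CMD           WIDTH -18 PRINTAS JOB_DESCRIPTION
   RequestCpus AS CORES         PRINTF "%5d"
SUMMARY STANDARD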

Jobs with specific DN

-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'

-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

678275.0 tatls015 12/2 17:57 2+06:07:15 R 0 2441.4 (arc_pilot )

681762.0 tatls015 12/3 03:13 1+21:12:31 R 0 2197.3 (arc_pilot )

705153.0 tatls015 12/4 07:36 0+16:49:12 R 0 2197.3 (arc_pilot )

705807.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )

705808.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )

706612.0 tatls015 12/4 09:16 0+15:09:37 R 0 2197.3 (arc_pilot )

706614.0 tatls015 12/4 09:16 0+15:09:26 R 0 2197.3 (arc_pilot )

Jobs killed

• Jobs which were removed

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'

ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD

823881.0 alicesgm 12/5 01:01 1+06:13:22 X ??? /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi

831849.0 tlhcb005 12/5 13:19 0+18:52:26 X ??? /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi

832753.0 tlhcb005 12/5 14:38 0+17:07:07 X ??? /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi

819636.0 alicesgm 12/4 19:27 1+12:13:56 X ??? /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi

825511.0 alicesgm 12/5 03:03 0+18:52:10 X ??? /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi

823799.0 alicesgm 12/5 00:56 1+05:58:15 X ??? /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi

820001.0 alicesgm 12/4 19:48 1+06:43:22 X ??? /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi

833589.0 alicesgm 12/5 16:01 0+14:06:34 X ??? /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi

778644.0 tlhcb005 12/2 05:56 4+00:00:10 X ??? /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi

Jobs killed

• Jobs removed for exceeding memory limit

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory

823953 alicesgm 3500000 3000

824438 alicesgm 3250000 3000

820045 alicesgm 3500000 3000

823881 alicesgm 3250000 3000

[root@arc-ce04 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c

515 alice

5 cms

70 lhcb
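The same information can be pulled from a script through the Python bindings. A minimal sketch, assuming the htcondor module is available on the CE and the schedd history still contains the jobs of interest (the match limit of 100 is arbitrary):

import htcondor

schedd = htcondor.Schedd()
# Removed jobs (JobStatus == 3) whose resident set size exceeded the request.
constraint = "JobStatus == 3 && ResidentSetSize > 1024*RequestMemory"
for ad in schedd.history(constraint,
                         ["ClusterId", "Owner", "ResidentSetSize", "RequestMemory"],
                         100):
    print("%s %s %s %s" % (ad["ClusterId"], ad["Owner"],
                           ad["ResidentSetSize"], ad["RequestMemory"]))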

condor_who

• What jobs are currently running on a worker node?

[root@lcg1211 ~]# condor_who

OWNER CLIENT SLOT JOB RUNTIME PID PROGRAM

tatls015@gridpp.rl.ac.uk arc-ce02.gridpp.rl.ac.uk 1_2 654753.0 0+00:01:54 15743 /usr/libexec/condor/co

tatls015@gridpp.rl.ac.uk arc-ce02.gridpp.rl.ac.uk 1_5 654076.0 0+00:56:50 21916 /usr/libexec/condor/co

pcms004@gridpp.rl.ac.uk arc-ce04.gridpp.rl.ac.uk 1_10 1337818.0 0+02:51:34 31893 /usr/libexec/condor/co

pcms004@gridpp.rl.ac.uk arc-ce04.gridpp.rl.ac.uk 1_7 1337776.0 0+03:06:51 32295 /usr/libexec/condor/co

tlhcb005@gridpp.rl.ac.uk arc-ce02.gridpp.rl.ac.uk 1_1 651508.0 0+05:02:45 17556 /usr/libexec/condor/co

alicesgm@gridpp.rl.ac.uk arc-ce03.gridpp.rl.ac.uk 1_4 737874.0 0+05:44:24 5032 /usr/libexec/condor/co

tlhcb005@gridpp.rl.ac.uk arc-ce04.gridpp.rl.ac.uk 1_6 1336938.0 0+08:42:18 26911 /usr/libexec/condor/co

tlhcb005@gridpp.rl.ac.uk arc-ce01.gridpp.rl.ac.uk 1_8 826808.0 1+02:50:16 3485 /usr/libexec/condor/co

tlhcb005@gridpp.rl.ac.uk arc-ce03.gridpp.rl.ac.uk 1_3 722597.0 1+08:44:28 22966 /usr/libexec/condor/co

Startd history

• If STARTD_HISTORY is defined on your WNs

[root@lcg1658 ~]# condor_history

ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD

841989.0 tatls015 12/6 07:58 0+00:02:39 C 12/6 08:01 /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi

841950.0 tatls015 12/6 07:56 0+00:02:40 C 12/6 07:59 /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi

841889.0 tatls015 12/6 07:53 0+00:02:33 C 12/6 07:56 /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi

841847.0 tatls015 12/6 07:50 0+00:02:35 C 12/6 07:54 /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi

841816.0 tatls015 12/6 07:48 0+00:02:36 C 12/6 07:51 /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi

841791.0 tatls015 12/6 07:45 0+00:02:33 C 12/6 07:48 /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi

716804.0 alicesgm 12/4 18:28 1+13:15:07 C 12/6 07:44 /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI
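Enabling this only requires telling the startd where to write its history file; a typical worker-node configuration line (the exact path is just an example) would be:

STARTD_HISTORY = $(LOG)/startd_history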

Ganglia

• condor_gangliad

– Runs on a single host (can be any host)

– Gathers daemon ClassAds from the collector

– Publishes metrics to ganglia with host spoofing

• At RAL we have the following on one host:

GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD
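Additional or modified metrics can be defined in GANGLIAD_METRICS_CONFIG_DIR; each metric is described by a ClassAd. A rough sketch of such a file, modelled on the format of the default metrics shipped with condor_gangliad (the chosen attributes, descriptions and file name are illustrative):

# /etc/condor/ganglia.d/99_local_metrics (illustrative)
[
  Name       = "TotalRunningJobs";
  Desc       = "Running jobs reported by this schedd";
  Units      = "jobs";
  TargetType = "Scheduler";
]
[
  Name       = "LastNegotiationCycleDuration0";
  Desc       = "Duration of the last negotiation cycle";
  Units      = "seconds";
  TargetType = "Negotiator";
]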

Ganglia

• Small subset from schedd

Ganglia

• Small subset from central manager

Easy to make custom plots

Total running, idle, held jobs

Running jobs by schedd

Negotiator health

Negotiation cycle duration
Number of AutoClusters

Draining & multi-core slots

(Some) Third party tools

Job overview

• Condor Job Overview Monitor

http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html

Mimic

• Internal RAL application

htcondor-sysview

htcondor-sysview

• Hover the mouse over a core to get job information

Nagios

• Most (all?) sites probably use Nagios or an alternative
• At RAL

– Process checks for condor_master on all nodes
– Central managers

• Check for at least 1 collector
• Check for the negotiator
• Check for worker nodes

Number of startd ClassAds needs to be above a threshold

Number of non-broken worker nodes above a threshold

– CEs
• Check for schedd
• Job submission test
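As an illustration of the worker-node check, a minimal Nagios-style plugin could count startd ClassAds via the Python bindings and compare against thresholds. This is a sketch, not the actual RAL plugin; the thresholds are made up and the exit codes follow the usual Nagios convention:

#!/usr/bin/env python
# Nagios-style check: alarm if the pool has too few startd ClassAds.
import sys
import htcondor

WARNING_THRESHOLD = 10000   # illustrative values
CRITICAL_THRESHOLD = 9000

coll = htcondor.Collector()
num_startds = len(coll.query(htcondor.AdTypes.Startd))

if num_startds < CRITICAL_THRESHOLD:
    print("CRITICAL: only %d startd ClassAds" % num_startds)
    sys.exit(2)
elif num_startds < WARNING_THRESHOLD:
    print("WARNING: only %d startd ClassAds" % num_startds)
    sys.exit(1)

print("OK: %d startd ClassAds" % num_startds)
sys.exit(0)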