Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014

Monitoring HTCondor

Andrew Lahiff

STFC Rutherford Appleton Laboratory

European HTCondor Site Admins Meeting 2014

Introduction
• Two aspects of monitoring
  – General overview of the system
    • How many running/idle jobs? By user/VO? By schedd?
    • How full is the farm?
    • How many draining worker nodes?
  – More detailed views
    • What are individual jobs doing?
    • What’s happening on individual worker nodes?
    • Health of the different components of the HTCondor pool
• ...in addition to Nagios

Introduction
• Methods
  – Command line utilities
  – Ganglia
  – Third-party applications (which run command-line tools or use the Python API)

Command line
• Three useful commands
  – condor_status
    • Overview of the pool (including jobs, machines)
    • Information about specific worker nodes
  – condor_q
    • Information about jobs in the queue
  – condor_history
    • Information about completed jobs

Overview of jobs

-bash-4.1$ condor_status -collector
Name               Machine            RunningJobs IdleJobs HostsTotal
[email protected]. condor01.gridpp.rl       10608     8355      11347
[email protected]. condor02.gridpp.rl       10616     8364      11360

Overview of machines

-bash-4.1$ condor_status -total
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 11183    95   10441       592       0          0        0
       Total 11183    95   10441       592       0          0        0
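
One way to turn this output into an answer to "how full is the farm?" is to divide the Claimed count by the Total count. The sketch below does that for a single summary row; the function name claimed_fraction is illustrative, and the column order is assumed to be the one printed on the slide (other HTCondor versions may print additional columns).

```python
def claimed_fraction(total_line):
    """Return Claimed / Total for one summary row of `condor_status -total`.

    Assumed column order (as on the slide):
    <label> Total Owner Claimed Unclaimed Matched Preempting Backfill
    """
    fields = total_line.split()
    total = int(fields[1])
    claimed = int(fields[3])
    # Guard against an empty pool to avoid dividing by zero.
    return claimed / total if total else 0.0


# Summary row copied from the slide output.
sample = "Total 11183 95 10441 592 0 0 0"
print(f"{claimed_fraction(sample):.1%} of slots are claimed")
```

In practice the row would come from running condor_status -total rather than from a hard-coded string.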

Jobs by schedd

-bash-4.1$ condor_status -schedd
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
arc-ce01.gridpp.rl.a arc-ce01.g             2388          1990            13
arc-ce02.gridpp.rl.a arc-ce02.g             2011          1995            31
arc-ce03.gridpp.rl.a arc-ce03.g             4272          1994             9
arc-ce04.gridpp.rl.a arc-ce04.g             1424          2385            12
arc-ce05.gridpp.rl.a arc-ce05.g                1             0             6
cream-ce01.gridpp.rl cream-ce01              266             0             0
cream-ce02.gridpp.rl cream-ce02              247             0             0
lcg0955.gridpp.rl.ac lcg0955.gr                0             0             0
lcgui03.gridpp.rl.ac lcgui03.gr                3             0             0
lcgui04.gridpp.rl.ac lcgui04.gr                0             0             0
lcgvm21.gridpp.rl.ac lcgvm21.gr                0             0             0

                                TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total                                      10612          8364            71
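
The Total row is simply the column-wise sum of the per-schedd rows. As a rough sketch of how a third-party tool might aggregate this output itself, the following sums the three job counters over the rows printed above. The function name sum_schedd_jobs is an assumption for illustration; the rows are copied from the slide.

```python
def sum_schedd_jobs(rows):
    """Sum running, idle and held jobs over per-schedd rows.

    Each row is assumed to look like the slide output:
    <Name> <Machine> <TotalRunningJobs> <TotalIdleJobs> <TotalHeldJobs>
    """
    running = idle = held = 0
    for row in rows:
        # Use the last three fields so truncated names do not matter.
        r, i, h = (int(x) for x in row.split()[-3:])
        running += r
        idle += i
        held += h
    return running, idle, held


# Per-schedd rows copied from the slide output.
rows = [
    "arc-ce01.gridpp.rl.a arc-ce01.g 2388 1990 13",
    "arc-ce02.gridpp.rl.a arc-ce02.g 2011 1995 31",
    "arc-ce03.gridpp.rl.a arc-ce03.g 4272 1994 9",
    "arc-ce04.gridpp.rl.a arc-ce04.g 1424 2385 12",
    "arc-ce05.gridpp.rl.a arc-ce05.g 1 0 6",
    "cream-ce01.gridpp.rl cream-ce01 266 0 0",
    "cream-ce02.gridpp.rl cream-ce02 247 0 0",
    "lcg0955.gridpp.rl.ac lcg0955.gr 0 0 0",
    "lcgui03.gridpp.rl.ac lcgui03.gr 3 0 0",
    "lcgui04.gridpp.rl.ac lcgui04.gr 0 0 0",
    "lcgvm21.gridpp.rl.ac lcgvm21.gr 0 0 0",
]
print(sum_schedd_jobs(rows))  # matches the Total row: (10612, 8364, 71)
```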

Jobs by user, schedd

-bash-4.1$ condor_status -submitters
Name                         Machine            RunningJobs IdleJobs HeldJobs
group_ALICE.alice.alice043@g arc-ce01.gridpp.rl           0        0        0
group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl         540        0        1
group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl         142        0        0
group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl          82        5        0
group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl           1        0        0
group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl         214      390        0
group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl          68      100        0
group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl          78      476        4
group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl          12      910        0
group_CMS.prodcms_multicore. arc-ce01.gridpp.rl          47      102        0
group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl           0        0        0
group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl         992        0        2
group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl           0        0        0

Jobs by user

…
                     RunningJobs IdleJobs HeldJobs
group_ALICE.alice.al           0        0        0
group_ALICE.alice.al        3500      368        5
group_ALICE.alice_pi           0        0        0
group_ATLAS.atlas.at           0        0        0
group_ATLAS.atlas.at           0        0        0
group_ATLAS.atlas_pi         414       12       10
group_ATLAS.atlas_pi           0        0        2
group_ATLAS.prodatls         354       36       11
group_CMS.cms.cmssgm           1        0        0
group_CMS.cms_pilot.         371     2223        0
group_CMS.cms_pilot.           0        0        1
group_CMS.cms_pilot.          68      200        0
group_CMS.prodcms.pc         188     1905       10
group_CMS.prodcms.pc         312     3410        0
group_CMS.prodcms_mu          47      102        0
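
Because submitter names start with the accounting group, a per-VO view can be derived by grouping on the text before the first dot. The sketch below does this for the rows visible above; the function name running_by_group is illustrative, and since the slide output is truncated, the sums cover only the rows shown.

```python
from collections import defaultdict


def running_by_group(rows):
    """Sum RunningJobs per accounting group (text before the first '.').

    Each row is assumed to be: <Name> <RunningJobs> <IdleJobs> <HeldJobs>
    """
    totals = defaultdict(int)
    for row in rows:
        name, running, _idle, _held = row.split()
        group = name.split(".", 1)[0]
        totals[group] += int(running)
    return dict(totals)


# Rows copied from the (truncated) slide output.
rows = [
    "group_ALICE.alice.al 0 0 0",
    "group_ALICE.alice.al 3500 368 5",
    "group_ALICE.alice_pi 0 0 0",
    "group_ATLAS.atlas.at 0 0 0",
    "group_ATLAS.atlas.at 0 0 0",
    "group_ATLAS.atlas_pi 414 12 10",
    "group_ATLAS.atlas_pi 0 0 2",
    "group_ATLAS.prodatls 354 36 11",
    "group_CMS.cms.cmssgm 1 0 0",
    "group_CMS.cms_pilot. 371 2223 0",
    "group_CMS.cms_pilot. 0 0 1",
    "group_CMS.cms_pilot. 68 200 0",
    "group_CMS.prodcms.pc 188 1905 10",
    "group_CMS.prodcms.pc 312 3410 0",
    "group_CMS.prodcms_mu 47 102 0",
]
for group, running in sorted(running_by_group(rows).items()):
    print(group, running)
```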

condor_q

[root@arc-ce01 ~]# condor_q

-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk
ID       OWNER   SUBMITTED  RUN_TIME   ST PRI SIZE CMD
794717.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794718.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794719.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794720.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794721.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794722.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794723.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794725.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )
794726.0 pcms054 12/3 12:07 0+00:00:00 I  0   0.0  (gridjob )

3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended

Multi-core jobs

-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
ID       OWNER   SUBMITTED  RUN_TIME   ST PRI SIZE CMD
832677.0 pcms004 12/5 14:33 0+00:15:07 R  0   2.0  (gridjob )
832717.0 pcms004 12/5 14:37 0+00:12:02 R  0   0.0  (gridjob )
832718.0 pcms004 12/5 14:37 0+00:00:00 I  0   0.0  (gridjob )
832719.0 pcms004 12/5 14:37 0+00:00:00 I  0   0.0  (gridjob )
832893.0 pcms004 12/5 14:47 0+00:00:00 I  0   0.0  (gridjob )
832894.0 pcms004 12/5 14:47 0+00:00:00 I  0   0.0  (gridjob )

Multi-core jobs
• Custom print format

-bash-4.1$ condor_q -global -pr queue_mc.cpf

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
ID       OWNER   SUBMITTED  RUN_TIME   ST SIZE CMD       CORES
832677.0 pcms004 12/5 14:33 0+00:00:00 R  2.0  (gridjob) 8
832717.0 pcms004 12/5 14:37 0+00:00:00 R  0.0  (gridjob) 8
832718.0 pcms004 12/5 14:37 0+00:00:00 I  0.0  (gridjob) 8
832719.0 pcms004 12/5 14:37 0+00:00:00 I  0.0  (gridjob) 8
832893.0 pcms004 12/5 14:47 0+00:00:00 I  0.0  (gridjob) 8
832894.0 pcms004 12/5 14:47 0+00:00:00 I  0.0  (gridjob) 8

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats

Jobs with specific DN

-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'

-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>
ID       OWNER    SUBMITTED  RUN_TIME   ST PRI SIZE   CMD
678275.0 tatls015 12/2 17:57 2+06:07:15 R  0   2441.4 (arc_pilot )
681762.0 tatls015 12/3 03:13 1+21:12:31 R  0   2197.3 (arc_pilot )
705153.0 tatls015 12/4 07:36 0+16:49:12 R  0   2197.3 (arc_pilot )
705807.0 tatls015 12/4 08:16 0+16:09:27 R  0   2197.3 (arc_pilot )
705808.0 tatls015 12/4 08:16 0+16:09:27 R  0   2197.3 (arc_pilot )
706612.0 tatls015 12/4 09:16 0+15:09:37 R  0   2197.3 (arc_pilot )
706614.0 tatls015 12/4 09:16 0+15:09:26 R  0   2197.3 (arc_pilot )

Jobs killed
• Jobs which were removed

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'
ID       OWNER    SUBMITTED  RUN_TIME   ST COMPLETED CMD
823881.0 alicesgm 12/5 01:01 1+06:13:22 X  ???       /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
831849.0 tlhcb005 12/5 13:19 0+18:52:26 X  ???       /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
832753.0 tlhcb005 12/5 14:38 0+17:07:07 X  ???       /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
819636.0 alicesgm 12/4 19:27 1+12:13:56 X  ???       /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
825511.0 alicesgm 12/5 03:03 0+18:52:10 X  ???       /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
823799.0 alicesgm 12/5 00:56 1+05:58:15 X  ???       /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
820001.0 alicesgm 12/4 19:48 1+06:43:22 X  ???       /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
833589.0 alicesgm 12/5 16:01 0+14:06:34 X  ???       /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
778644.0 tlhcb005 12/2 05:56 4+00:00:10 X  ???       /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi

Jobs killed
• Jobs removed for exceeding memory limit

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory
823953 alicesgm 3500000 3000
824438 alicesgm 3250000 3000
820045 alicesgm 3500000 3000
823881 alicesgm 3250000 3000

[root@arc-ce04 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c
515 alice
5 cms
70 lhcb
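
The factor of 1024 in the constraint comes from the units of the two attributes: ResidentSetSize is reported in KiB, while RequestMemory is in MiB. A minimal sketch of the same comparison, applied to the values printed above (the function name exceeded_memory is illustrative):

```python
def exceeded_memory(resident_set_size_kib, request_memory_mib):
    """Mirror the ClassAd test ResidentSetSize > 1024*RequestMemory."""
    return resident_set_size_kib > 1024 * request_memory_mib


# (ClusterId, ResidentSetSize in KiB, RequestMemory in MiB) from the slide.
jobs = [
    (823953, 3500000, 3000),
    (824438, 3250000, 3000),
    (820045, 3500000, 3000),
    (823881, 3250000, 3000),
]
for cluster_id, rss_kib, request_mib in jobs:
    over_mib = rss_kib / 1024 - request_mib
    print(cluster_id, exceeded_memory(rss_kib, request_mib),
          f"{over_mib:.0f} MiB over request")
```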

condor_who
• What jobs are currently running on a worker node?

[root@lcg1211 ~]# condor_who
OWNER             CLIENT                   SLOT JOB       RUNTIME    PID   PROGRAM
[email protected] arc-ce02.gridpp.rl.ac.uk 1_2  654753.0  0+00:01:54 15743 /usr/libexec/condor/co
[email protected] arc-ce02.gridpp.rl.ac.uk 1_5  654076.0  0+00:56:50 21916 /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_10 1337818.0 0+02:51:34 31893 /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_7  1337776.0 0+03:06:51 32295 /usr/libexec/condor/co
[email protected] arc-ce02.gridpp.rl.ac.uk 1_1  651508.0  0+05:02:45 17556 /usr/libexec/condor/co
[email protected] arc-ce03.gridpp.rl.ac.uk 1_4  737874.0  0+05:44:24 5032  /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_6  1336938.0 0+08:42:18 26911 /usr/libexec/condor/co
[email protected] arc-ce01.gridpp.rl.ac.uk 1_8  826808.0  1+02:50:16 3485  /usr/libexec/condor/co
[email protected] arc-ce03.gridpp.rl.ac.uk 1_3  722597.0  1+08:44:28 22966 /usr/libexec/condor/co

Startd history
• If STARTD_HISTORY is defined on your WNs

[root@lcg1658 ~]# condor_history
ID       OWNER    SUBMITTED  RUN_TIME   ST COMPLETED  CMD
841989.0 tatls015 12/6 07:58 0+00:02:39 C  12/6 08:01 /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi
841950.0 tatls015 12/6 07:56 0+00:02:40 C  12/6 07:59 /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi
841889.0 tatls015 12/6 07:53 0+00:02:33 C  12/6 07:56 /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi
841847.0 tatls015 12/6 07:50 0+00:02:35 C  12/6 07:54 /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi
841816.0 tatls015 12/6 07:48 0+00:02:36 C  12/6 07:51 /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi
841791.0 tatls015 12/6 07:45 0+00:02:33 C  12/6 07:48 /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi
716804.0 alicesgm 12/4 18:28 1+13:15:07 C  12/6 07:44 /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI

Ganglia
• condor_gangliad
  – Runs on a single host (can be any host)
  – Gathers daemon ClassAds from the collector
  – Publishes metrics to ganglia with host spoofing
• At RAL we have the following on one host

GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD

Ganglia
• Small subset from schedd

Ganglia
• Small subset from central manager

Easy to make custom plots

Total running, idle, held jobs

Running jobs by schedd

Negotiator health

[Plots: negotiation cycle duration; number of AutoClusters]

Draining & multi-core slots

(Some) Third party tools

Job overview
• Condor Job Overview Monitor

http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html

Mimic
• Internal RAL application

htcondor-sysview

htcondor-sysview
• Hover mouse over a core to get job information

Nagios
• Most (all?) sites probably use Nagios or an alternative
• At RAL
  – Process checks for condor_master on all nodes
  – Central managers
    • Check for at least 1 collector
    • Check for the negotiator
    • Check for worker nodes
      – Number of startd ClassAds needs to be above a threshold
      – Number of non-broken worker nodes needs to be above a threshold
  – CEs
    • Check for schedd
    • Job submission test
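
A threshold check of this kind can be quite small. The following is only a sketch of the decision logic, not the check used at RAL: the function name, the two-level thresholds and the example numbers are all assumptions, and obtaining the count (for example by counting startd ClassAds in the collector) is left to the site. The return values follow the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_startd_count(count, warn_below, crit_below):
    """Return (exit_code, message) for a count of startd ClassAds.

    `count` is None when the number could not be determined.
    The thresholds are site-specific assumptions, not values from the talk.
    """
    if count is None:
        return UNKNOWN, "UNKNOWN: could not determine number of startd ClassAds"
    if count < crit_below:
        return CRITICAL, f"CRITICAL: {count} startd ClassAds (< {crit_below})"
    if count < warn_below:
        return WARNING, f"WARNING: {count} startd ClassAds (< {warn_below})"
    return OK, f"OK: {count} startd ClassAds"


# Hypothetical numbers, for illustration only.
code, message = check_startd_count(450, warn_below=500, crit_below=400)
print(code, message)
```

A real plugin would print the message and exit with the returned code so that Nagios can pick up the state.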