Commands for Job Monitoring
rtracejob
rtracejob compares resource requests to resource usage for an individual job. It can display information for a single job or a summary of an array job. Typing rtracejob -h displays the help text explaining its arguments and available options:
usage: rtracejob [-h] [-l] [--dump_failed_joblist] jobID
positional arguments:
jobID the slurm job ID for displaying job information.
For array jobs if root job ID is given (for
example, in job ID 1234567_12 the root job ID
is 1234567) the summary of array jobs will be
displayed.
optional arguments:
-h, --help show this help message and exit
-l, --list_subjobs with this option rtracejob will print out all
of sub jobs information automatically, in
default it's off
--dump_failed_joblist
with this option rtracejob will dump the failed
sub jobs ID to a file in name of
"failed_joblist_(your jobid).txt",
in default it's off
rtracejob is useful for troubleshooting when something goes wrong with your job. For example, rtracejob jobID can be used to display information for a single job:
[bob@vmps12 ~]$ rtracejob 1234567
+------------------+--------------------------+
| User: bob | JobID: 1234567 |
+------------------+--------------------------+
| Account | chemistry |
| Job Name | python.slurm |
| State | Completed |
| Exit Code | 0:0 |
| Wall Time | 00:10:00 |
| Requested Memory | 1000Mc |
| Memory Used | 13712K |
| CPUs Requested | 1 |
| CPUs Used | 1 |
| Nodes | 1 |
| Node List | vmp505 |
| Wait Time | 0.4 minutes |
| Run Time | 0.4 minutes |
| Submit Time | Thu Jun 18 09:23:32 2015 |
| Start Time | Thu Jun 18 09:23:57 2015 |
| End Time | Thu Jun 18 09:24:23 2015 |
+------------------+--------------------------+
| Today's Date | Thu Jun 18 09:25:08 2015 |
+------------------+--------------------------+
A user might want to check how much memory a job used compared to how much was requested, or how long a job ran relative to the wall time requested. In this example, note that the Requested Memory is reported as 1000Mc, meaning 1000 megabytes per core (the "c" stands for "core"); this is the default for jobs that do not specify a memory requirement. A lowercase "n" on the Requested Memory line instead stands for "node" and appears when a --mem= line is included in the SLURM script, which allocates the listed amount of memory per node in the allocation.
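The two suffixes map onto two different sbatch directives. A minimal sketch of a SLURM script header illustrating both (the values are placeholders, not recommendations):

```shell
#!/bin/bash
# Request 1000 MB per allocated core; rtracejob would report this as "1000Mc".
#SBATCH --mem-per-cpu=1000M

# Alternatively, request 4 GB per node in the allocation; rtracejob would
# report this with the "n" suffix. Use one of the two directives, not both
# (the second is commented out here).
##SBATCH --mem=4G
```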
Additionally, rtracejob can also be used to display a summary of an array job. For example:
[lf8@gw342 ~]$ rtracejob 5270444
+---------------------------+-------------------------------------------+
+ SUMMARY of ARRAY JOBS +
+ User name: liuf8 | job ID: 5270444
+---------------------------+-------------------------------------------+
+ Account | accre
+ Job Name | array.slurm
+ No. of Submitted SubJobs | 5
+ No. of Finished SubJobs | 5
+ No. of Successful SubJobs | 5
+ No. of Failed SubJobs | 0
+ Requested Memory | 500mn
+ Max Memory Used by SubJobs| 154472k
+ Original Requested Time | 00:10:00
+ Max Running Time | 00:00:42
+ Min Running Time | 00:00:26
+ Max Waiting Time | 00:00:00
+ Min Waiting Time | 00:00:00
+---------------------------+-------------------------------------------+
The job ID 5270444 is the root job ID; 5270444_1, 5270444_2, and so on are the sub jobs belonging to this array job. You can pass a sub job ID directly to rtracejob, or get information about all sub jobs by passing the --list_subjobs flag (e.g. rtracejob 5270444 --list_subjobs). Given the root job ID, rtracejob scans all of the sub jobs' information and prints a summary of the array job.
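The file written by --dump_failed_joblist can be used to resubmit only the failed tasks of an array job. A sketch, assuming the file contains one failed sub job ID per line (the actual file format may differ; the IDs and script name here are hypothetical):

```shell
# Hypothetical contents of the file --dump_failed_joblist would produce:
# one failed sub job ID per line.
printf '5270444_4\n5270444_2\n' > failed_joblist_5270444.txt

# Extract the array task indices (the part after the underscore) and join
# them into a comma-separated --array specification for resubmission.
indices=$(cut -d_ -f2 failed_joblist_5270444.txt | sort -n | paste -sd, -)
echo "sbatch --array=${indices} array.slurm"   # -> sbatch --array=2,4 array.slurm
```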
qSummary
qSummary provides a summary of the jobs and cores running across all groups on the cluster. The results can be filtered to a specific account with the -g option.
[jill@vmps12 ~]$ qSummary
GROUP USER ACTIVE_JOBS ACTIVE_CORES PENDING_JOBS PENDING_CORES
-----------------------------------------------------------------------------
science 18 34 5 7
jack 5 5 4 4
jill 13 29 1 3
-----------------------------------------------------------------------------
economics 88 200 100 100
emily 88 200 100 100
-----------------------------------------------------------------------------
Totals: 106 234 105 107
As shown, the output from qSummary provides a basic view of active and pending jobs and cores across groups, and across users within a group. qSummary also supports a -g argument followed by a group name, a -p argument followed by a partition name, and a -gpu switch if you would like to see GPU rather than CPU information. For example:
[jill@vmps12 ~]$ qSummary -p batch_gpu -gpu
GROUP USER ACTIVE_JOBS ACTIVE_GPUS PENDING_JOBS PENDING_GPUS
-----------------------------------------------------------------------------
science 4 8 1 2
jack 0 0 1 2
jill 4 8 0 0
-----------------------------------------------------------------------------
economics 4 16 0 0
emily 4 16 0 0
-----------------------------------------------------------------------------
Totals: 8 24 1 2
slurm_resources
To list your current SLURM group membership, type slurm_resources.
This command lists all your SLURM accounts, organized by partition (batch, interactive, batch_gpu, etc.), along with any special resources such as GPUs they have access to. If you do have GPU access, it will also print an example snippet for your SLURM script.
[bob@gw01 ~]$ slurm_resources

Accounts to use for accessing the batch (default) partition:

Account
-------------------
fe_accre_lab
accre_guests

Accounts and GPU types for accessing the batch_gpu partition:

Accounts             GPU Type                       GPU Limit
------------------   ----------------------------   ----------
fe_accre_lab_acc     nvidia_geforce_rtx_2080_ti     2
fe_accre_lab_acc     nvidia_rtx_a6000               2
fe_accre_lab_acc     nvidia_titan_x                 4
accre_guests_acc     nvidia_geforce_rtx_2080_ti     2
accre_guests_acc     nvidia_rtx_a6000               6

You have access to accelerated GPU resources in the batch_gpu partition. As a usage example, if you wanted to request 2 GPUs of type "nvidia_geforce_rtx_2080_ti" for a job with account "fe_accre_lab_acc" on the partition "batch_gpu", then you would add the following lines to your SLURM script:

#SBATCH --account=fe_accre_lab_acc
#SBATCH --partition=batch_gpu
#SBATCH --gres=gpu:nvidia_geforce_rtx_2080_ti:2

Note that the "GPU Limit" column in the table above lists the burst limit for that GPU type for the specified account. For example, the scheduler will not allow more than 2 GPUs of type "nvidia_geforce_rtx_2080_ti" to be used simultaneously by account "fe_accre_lab_acc". If more GPUs of that type are requested by that account, the scheduler will not allow them to run until other running jobs have completed. You may not request more than the GPU limit for a single job.

Accounts and QOSs for accessing the interactive partition:

Account              QOS                  CPU limit   Memory Limit (GiB)
------------------   --------------------  ----------   -------------------
fe_accre_lab_int     debug_int            80          960.0
accre_guests_int     debug_int            80          960.0

Use the "qosstate [QOS_NAME]" command to get current usage for the QOS specified by QOS_NAME, with a breakdown of usage by user and SLURM account. Note that the "debug_int" QOS is a special QOS available to all cluster users for quick debugging of SLURM scripts.
On this special QOS there are additional restrictions: a maximum wall clock time of 00:30:00, and each user can use at most cpu=16,mem=192G at one time. As an example, to submit a job to the interactive partition using the account "accre_int" and QOS "debug_int", you would add the following lines to your SLURM script:

#SBATCH --account=accre_int
#SBATCH --partition=interactive
#SBATCH --qos=debug_int
showLimits
As the name suggests, showLimits will display the resource limits imposed on accounts and groups on the cluster for the batch partition. Running the command without any arguments will list all parent accounts and group accounts on the cluster. Optionally, showLimits also accepts a -g argument followed by the name of a group or account. For example, to see a list of resource limits imposed on a parent account named science_account (this account does not actually exist on the cluster):
[jill@vmps12 ~]$ showLimits -g science_account
ACCOUNT GROUP FAIRSHARE MAXCPUS MAXMEM(GB) MAXCPUTIME(HRS)
-----------------------------------------------------------------------------
science_account 12 3600 2400 23040
biology 1 2400 1800 -
chemistry 1 800 600 -
physics 1 600 600 8640
science 1 - 2200 20000
-----------------------------------------------------------------------------
Limits are always imposed on the parent account level, and occasionally on the group level when multiple group accounts fall under a single parent account. If a particular limit is not defined on the group level, the group is allowed access to the entire limit under its parent account. For example, the science group does not have a MAXCPUS limit defined, and therefore can run across a maximum of 3600 cores so long as no other groups under `science_account` are running and no other limits (MAXMEM or MAXCPUTIME) are exceeded.
We leave FAIRSHARE defined on the parent account level only, so groups within the same account do not receive elevated priority relative to one another. The value 1 for FAIRSHARE defined at the group level means that all groups under the parent account receive equal relative priority.
qosstate
The qosstate command will display the current usage of all Quality of Services (QoSs) available to your user in the interactive and interactive_gpu partitions:
[bob@cn1287 ~]$ qosstate
Live interactive QOS usage report for QOS available to user: bob
QOS                  CPUs (used/limit)    Memory GiB (used/limit)
===================================================================
debug_int            1 / 80               1.0 / 960.0
bob_lab_2_int        0 / 4                0.0 / 30.0
bob_lab_int          4 / 8                16.0 / 60.0
When run with a specific QoS as an argument, it displays the users and accounts that are using the QoS and how many resources each is consuming:
[bob@cn1287 ~]$ qosstate bob_lab_int
Live interactive QOS usage report for QOS available to user: bob
QOS CPUs (used/limit) Memory GiB (used/limit)
===================================================================
bob_lab_int 4 / 8 16.0 / 60.0
Breakdown of bob_lab_int usage by slurm account and user
Account User CPUs Memory GiB
========================================================
bob_lab_int
bob 4 16.0
SlurmActive
This command is under re-development.