- Using Moab Job Priorities – Creating a Prioritization Strategy
- Using Moab Job Priorities – Understanding mdiag -p Output
- Using Moab Job Priorities – Exploring Priority Sub-Components
Just over a year ago, I wrote my first post for this blog. It was (only in my opinion) a quaint little flight of fancy dealing with job prioritization in the world of TRON. Today, I want to be a little more grounded. Let’s talk about how Moab calculates job priorities, or in other words, how Moab interprets job priority configuration to create a prioritization strategy.
The Scheduling Cycle
In the recent “Understanding Moab Scheduling” series, I briefly discussed the topic of job prioritization.
Job prioritization happens as part of the second stage in the Moab scheduling cycle. In review, those five stages are:
- Update Information from Resource Managers
- Manage Workload
- Refresh Reservations
- Update Statistics
- Handle User Requests
As the first part of the second stage (Manage Workload), job priorities are continually reevaluated: every scheduling iteration recalculates the priorities of all running and eligible jobs. Let’s look at exactly how that happens.
There are 41 different attributes (known as sub-components) that can be used for calculating job priorities. Only one of these, QueueTime, is enabled by default, so without any configuration Moab’s default behavior is that of a FIFO scheduler with backfill enabled. Each of these sub-components has a numerical value that is recalculated each scheduling iteration in which it is used.
These sub-components are then divided into seven different logical buckets (known as components), as can be seen in the table below:
| Component | Sub-Components |
|---|---|
| Job Credentials (CRED) | User, Group, Account, QoS, Class |
| Fairshare Usage (FS) | FSUser, FSGroup, FSAccount, FSQoS, FSClass, FSGUser, FSGGroup, FSGAccount, FSJPU, FSPPU, FSPSPU, WCAccuracy |
| Requested Job Resources (RES) | Node, Proc, Mem, Swap, Disk, PS, PE, Walltime |
| Current Service Levels (SERV) | QueueTime, XFactor, Bypass, StartCount, Deadline, SPViolation, UserPrio |
| Target Service Levels (TARGET) | TargetQueueTime, TargetXFactor |
| Consumed Resources (USAGE) | Consumed, Remaining, Percent, ExecutionTime |
| Job Attributes (ATTR) | AttrAttr, AttrState, AttrGres |
To enable or “turn on” a specific sub-component for job priority calculation, entries must be placed in the moab.cfg file setting the associated component and sub-component weights to be non-zero. For example, let’s say we want to disable QueueTime and replace it with XFactor. Both are part of the Current Service Level component (SERV). So, the entries in moab.cfg would be the following:
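Something along these lines would do it (the parameter names SERVICEWEIGHT, QUEUETIMEWEIGHT, and XFACTORWEIGHT match the weights discussed in this post; zeroing QUEUETIMEWEIGHT disables the default QueueTime sub-component):

```
SERVICEWEIGHT    1
QUEUETIMEWEIGHT  0
XFACTORWEIGHT    1
```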
The calculation of the job priority is done using the following function:
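In general form (matching the component/sub-component weights and values described just below):

```
Priority = Σ over components [ ComponentWeight × Σ over sub-components ( SubComponentWeight × SubComponentValue ) ]
```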
For each running and eligible job, Moab goes through all of the sub-components and adds up the Component-Weight x Sub-Component-Weight x Sub-Component-Value products, resulting in a numeric value capped at 1,000,000,000 (one billion). So, using the above configuration (XFactor only), the job’s priority would be:
Priority = 1 x 1 x XFactorValue
The first 1 comes from serviceweight and the second from xfactorweight.
Using this numeric score, Moab then orders the jobs from the highest number (priority) to the lowest. In other words, a priority score of 1 is probably very low.
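As a rough sketch of the scoring-and-sorting step, here is what the calculation looks like in Python. This is purely illustrative; all names and structures here are hypothetical, not Moab internals:

```python
# Illustrative sketch of Moab-style priority calculation.
# All names here are hypothetical, not Moab's actual internals.

PRIORITY_CAP = 1_000_000_000  # priorities are capped at one billion

# Component and sub-component weights, as they would be set in moab.cfg.
# This mirrors the XFactor-only example above: SERVICEWEIGHT 1, XFACTORWEIGHT 1.
component_weights = {"SERV": 1}
subcomponent_weights = {"XFactor": ("SERV", 1)}  # sub-component -> (component, weight)

def job_priority(job_values):
    """Sum ComponentWeight x SubComponentWeight x SubComponentValue, capped."""
    total = 0
    for sub, value in job_values.items():
        component, sub_weight = subcomponent_weights.get(sub, (None, 0))
        comp_weight = component_weights.get(component, 0)
        total += comp_weight * sub_weight * value
    return min(total, PRIORITY_CAP)

# Two eligible jobs with their current XFactor values (made-up numbers)
jobs = {"job.1": {"XFactor": 4.2}, "job.2": {"XFactor": 1.3}}

# Moab orders jobs highest priority first
ordered = sorted(jobs, key=lambda j: job_priority(jobs[j]), reverse=True)
```

Recalculating these scores every iteration is what lets a job's position in the queue shift as its XFactor (or queue time, fairshare usage, etc.) changes.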
A Little Experiment
As this is simply mathematics, it is possible to create very specific, specialized and complex functions for calculating priority. In other words, it gives one very fine-grained control. However, I was curious to see how these were actually being used at our customer sites.
I decided to do a little (non-scientific) experiment.
As part of our support process, customers have the opportunity to upload a “snapshot” of their system configuration. Part of this is the mdiag -p output, which contains their configured Component and Sub-Component weights. Going back through the archive, I was able to identify 119 unique systems, and extracted what they had configured. The results were interesting, and are presented below (ordered by most used to least used):
Let’s take a quick look at some of the most-used sub-components from the “survey.” Again, because of the data-gathering approach, this isn’t scientific, but it is interesting. It should also be noted that “weights” are not factored into this data, only whether or not the sub-component is being used.
- **QUEUETIME**: In my opinion, there really isn’t a whole lot of surprise here. QUEUETIME is the only one of the attributes that is turned on by default. Naturally, this will result in a high position on this list. What is interesting is there are nine systems that have gone out of their way to turn it off. My guess is they, like the example above, have swapped it out for XFACTOR.
- **CLASS**: Many traditional HPC schedulers are heavily queue based. As such, I wasn’t too surprised to see CLASS, which is just another name for a queue, to be this high in the list. Many of us in the industry, admins and users alike, are comfortable with thinking about jobs in terms of queues. Moab’s priority system is very flexible, thus allowing this traditional mindset to be readily modeled.
- **QOS**: I’m fairly certain this one earned its position in the list for several different reasons, including QoS’s ability to easily modify the priorities given by Classes inside of Moab. End-users use the basic queuing facilities of CLASS and then modify it some way through QOS. It makes sense.
- **FSUSER**: Here we have our first Fairshare entry in the list. Fairshare is a great way to softly balance the cluster with different usage targets. I’m not overly surprised the most common approach is to choose User as the credential for doing the balancing.
- **XFACTOR**: To be honest, I was a little surprised by the number of sites using XFACTOR. However, in retrospect, as it is similar to QUEUETIME, but favors short (i.e., short wallclock limit) jobs, it does make sense on certain large systems.
- **USER**: Again, like with Fairshare, it appears the most common credential on which to base priorities is USER. No surprises here.
- **PROC**: Finally, we have PROC. This is almost certainly being used to favor large (i.e., many processor) jobs. The idea here is that if one places the large jobs first, it leaves open spaces/holes that can then be backfilled by the smaller jobs. It is similar to Stephen R. Covey's “The Big Rocks of Life” analogy.
We could go on through the list, but I think that’s sufficient for now.
This is the first in a three-part series on Moab’s job prioritization. Part II will deal with using mdiag -p and understanding its output. Part III will then cover some of the less-used and less-understood sub-components in more detail.
Until next time…