Why is the system slow?
In Understanding Moab Scheduling: Part I, I discussed the Moab scheduling iteration and the different stages it passes through. Today, we are going to look at answering the above question by learning about the command that allows the administrator to monitor these different stages. Additionally, I will provide some of the common issues seen with the scheduling iteration and how to identify and troubleshoot them.
The “magic” command: mdiag -S -v -v
Running mdiag -S -v -v from the command line as an administrator provides very verbose diagnostic information for the scheduler. You will see something similar to the following:
Let’s look at some specific parts of this output.
Information about how Moab is currently licensed for your site is contained in the top part of the output.
The host name of the server on which Moab is running and the port to which it is bound is displayed here.
Here is the total number of processors for which the system is licensed. This number should be at least as large as the number of physical processor cores in your cluster. With hyperthreading enabled, the required number doubles.
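The sizing rule is simple arithmetic. A minimal sketch, where the function names are hypothetical, the node and core counts are made-up illustrative values, and the factor of two assumes two hardware threads per core:

```python
def processors_to_license(nodes, cores_per_node, hyperthreading=False):
    """Processors the license should cover, following the rule of thumb in the text."""
    # Assumption: hyperthreading exposes exactly two hardware threads per core.
    threads_per_core = 2 if hyperthreading else 1
    return nodes * cores_per_node * threads_per_core


def license_is_sufficient(licensed, nodes, cores_per_node, hyperthreading=False):
    """True when the licensed processor count covers the whole cluster."""
    return licensed >= processors_to_license(nodes, cores_per_node, hyperthreading)


# Illustrative cluster: 100 nodes with 16 physical cores each.
print(processors_to_license(100, 16))
print(processors_to_license(100, 16, hyperthreading=True))
```

Compare the result with the licensed processor count shown in your own mdiag output.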
Finally, we have the license’s expiration date and time. Make sure you have renewed and installed a new license before then. Moab looks for its license file in both /opt/moab/ and /opt/moab/etc/, so make sure you haven’t installed it in both places, as this can cause confusion about which one Moab is using.
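A quick way to catch a duplicated license file is to look in both directories. This is only a sketch: the two directories come from the paragraph above, but the file name moab.lic is an assumption, so substitute whatever your license file is actually called.

```python
import os

LICENSE_DIRS = ("/opt/moab", "/opt/moab/etc")  # the two locations named in the text
LICENSE_NAME = "moab.lic"  # assumption: adjust to your license file's actual name


def find_license_copies(dirs=LICENSE_DIRS, name=LICENSE_NAME):
    """Return every path under `dirs` where a regular file called `name` exists."""
    candidates = [os.path.join(d, name) for d in dirs]
    return [path for path in candidates if os.path.isfile(path)]


copies = find_license_copies()
if len(copies) > 1:
    print("License file installed in more than one place:", ", ".join(copies))
```

If more than one path is reported, remove or rename the copy you do not intend Moab to use.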
As was discussed before, Moab passes through several different phases as part of the scheduling iteration. The high-level phases are:
- Update Information from Resource Managers
- Manage Workload
- Refresh Reservation
- Update Statistics
- Handle User Requests
Moab keeps track of the amount of time it spends in each one of these phases. They are represented in the output of mdiag -S, and it is important to understand how this information is displayed. It should also be noted Moab is more specific here for some of the phases. In other words, some of the high-level phases are split into multiple stages for display purposes. Here is the relevant section of the output:
Here we see this information is presented in three different sections:
- Time(sec) – Average amount of time in each sub-phase measured in seconds
- Load(5m) – Percentage of time over the last 5 minutes associated with each sub-phase
- Load(24h) – Percentage of time over the last 24 hours associated with each sub-phase
As we are mainly concerned with troubleshooting in this article, I’m going to focus on the Load(5m) metrics, as I’ve found they are the best to use in this case. When debugging a problem, I find it difficult to get a good feel for exactly what is going on from the Time(sec) metric. And if you’ve noticed the problem soon enough, the Load(24h) metric may not have been sufficiently affected yet to be useful.
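If you want to track these numbers over time, it helps to pull them into a script. The sketch below assumes each row looks like a label followed by "Name: value%" pairs; that layout and the sample values are assumptions for illustration, so check them against the output of your own Moab version before relying on the parser.

```python
import re

# Assumed row shape (hypothetical, with made-up numbers):
SAMPLE = (
    "Load(5m)   Sched: 4.10%  RMLoad: 2.35%  RMProcess: 0.40%  "
    "RMAction: 0.25%  Trigger: 0.10%  User: 1.80%  Idle: 91.00%"
)

# One "Name: number" pair, with an optional trailing percent sign.
PAIR = re.compile(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)%?")


def parse_metric_line(line):
    """Split 'Label  Name: value ...' into (label, {name: float})."""
    label, _, rest = line.strip().partition(" ")
    values = {name: float(number) for name, number in PAIR.findall(rest)}
    return label, values


label, load_5m = parse_metric_line(SAMPLE)
print(label, load_5m)
```

The same function works for the Time(sec) and Load(24h) rows as long as they follow the same assumed layout.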
Let’s go through each sub-phase individually. Note that they will not necessarily be displayed in the order listed above.
The Sched sub-phase is the amount of time Moab spends doing the actual scheduling, that is, determining which jobs are going to run where. Spikes here are uncommon, though they can occur when overall system policies are changed or there is a major change in the workload patterns coming in to the system.
The RMLoad sub-phase is the amount of time Moab is spending reading information from the resource managers. If Moab is having a problem contacting the resource managers (a blocking operation), the value for this sub-phase will spike. If that’s the case, the troubleshooting investigation should look at what is happening with the resource managers and the communication channels between them and Moab.
The other instance where this number may spike is if there is a significant increase in the number of resources being reported by a resource manager. This is very, very uncommon in an established system.
The RMProcess sub-phase records the amount of time it takes Moab to move the information received from its resource managers into Moab’s internal data structures. It is very uncommon for this metric to spike, as it would only happen if the resource managers start reporting a significantly larger number of resources.
The RMAction sub-phase is the amount of time Moab spends sending its scheduling decisions to the resource managers (e.g., “Start Job X”). Spikes in this metric are generally caused by there being some issue either with the resource managers or the communication channel between them and Moab. Generally, a spike seen here will correlate with a spike in RMLoad.
While triggers are handled by a separate scheduling cycle, their use does have some overhead, which is recorded here with the Trigger sub-phase. Generally, most HPC systems do not have a significant number of triggers, and this number remains fairly small and steady. Only in specialized cases where a large number of triggers (i.e., hundreds or thousands) are being constantly added and deleted is it likely for this metric to spike. If it does, trigger debugging is the next step.
The User sub-phase is a very common one for problems, as it represents the amount of time Moab is spending answering end-user requests, including those coming through web portals and MWS. Spikes can be caused by users having their scripts or jobs constantly issuing commands to Moab, such as continual checkjob requests to find job status.
If this metric is found to have spiked, the logs can be consulted to determine the source of the unusually high number of requests.
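One way to keep a job-monitoring script from flooding Moab is to space out its status checks. The sketch below is a suggestion rather than anything Moab requires; the function name is hypothetical, the interval values are arbitrary assumptions, and the actual checkjob call is left as a comment.

```python
import itertools


def poll_delays(first=30.0, factor=2.0, cap=600.0):
    """Yield wait times in seconds that grow geometrically up to `cap`."""
    delay = first
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# In a monitoring script, sleep for each delay between status checks, e.g.:
#   for delay in poll_delays():
#       time.sleep(delay)
#       ...run checkjob for the job and stop once it has finished...
first_six = list(itertools.islice(poll_delays(), 6))
print(first_six)
```

A script that waits longer between checks as a job ages puts far less pressure on the User sub-phase than one that polls in a tight loop.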
The final one is Idle. This one signifies the amount of time Moab is just waiting around for user requests. Some might think we want this number to be 0, but in fact that would indicate there was a problem. When Moab can’t fulfill everything it needs to do in a scheduling iteration, it first sacrifices the Idle time and then sacrifices the User time if the former wasn’t sufficient. One always wants the Idle metric to be positive and not particularly close to zero.
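The sub-phase walkthrough can be condensed into a small triage helper. This is a sketch under assumptions: the 10% Idle floor is an arbitrary placeholder (use your own baseline), the sample reading is made up, and the next-step hints simply paraphrase the descriptions above.

```python
# Next steps per sub-phase, paraphrased from the sub-phase descriptions.
NEXT_STEP = {
    "Sched": "review recent policy changes or shifts in the incoming workload",
    "RMLoad": "check the resource managers and the communication channel to them",
    "RMProcess": "check whether the resource managers are reporting many more resources",
    "RMAction": "check the resource managers and the communication channel to them",
    "Trigger": "debug triggers",
    "User": "consult the logs for the source of the requests",
}


def triage(load_5m, idle_floor=10.0):
    """Return (idle_ok, busiest_sub_phase, suggested_next_step)."""
    # idle_floor is an assumed threshold, not a Moab default.
    idle_ok = load_5m.get("Idle", 0.0) >= idle_floor
    busy = {name: pct for name, pct in load_5m.items() if name != "Idle"}
    busiest = max(busy, key=busy.get) if busy else None
    return idle_ok, busiest, NEXT_STEP.get(busiest)


# Made-up reading in which user requests dominate and Idle has collapsed.
reading = {
    "Sched": 6.0, "RMLoad": 3.0, "RMProcess": 1.0, "RMAction": 1.0,
    "Trigger": 0.5, "User": 86.5, "Idle": 2.0,
}
print(triage(reading))
```

The busiest sub-phase is only a starting point; a value matters when it is high relative to what is normal for your system.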
Scheduling Iteration Tuning
Just a little farther down in the mdiag -S -v -v output, one will find the following section. Right off, there are two important numbers here.
The first highlighted number is the configured time for the scheduling iteration (RMPollInterval). The second is the actual/effective average time for each iteration. It is very common that these two numbers do not match. Let’s look at the two different possibilities:
- Actual is less than Configured – This is generally not a problem. There are a number of different things that may cause a scheduling iteration to start early. As long as the other sub-phase metrics are good, there isn’t a problem.
- Actual is more than Configured – This may be a problem. The first several scheduling iterations after Moab is initially started will generally be extraordinarily long, as Moab is doing additional “start-up” activities. After these are completed, one should see the average value moving back down towards the configured value. On stable, long-running systems, we really shouldn’t see the average be more than configured. When it is, the Idle metric will usually be at or near zero. If this does occur, one needs to determine which of the other sub-phases is the problem. Alternatively, in some cases the configured scheduling iteration time simply needs to be increased, as there just plain isn’t enough time for Moab to complete all of its tasks within the allotted time.
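The two cases can be written down as a small check. This is a sketch: the function name and return labels are hypothetical, the 5% Idle floor is an assumed threshold, and the numbers in the example calls are made up.

```python
def classify_iteration(configured_s, actual_s, idle_pct, idle_floor=5.0):
    """Classify the scheduling iteration from configured vs. actual time and Idle load."""
    if actual_s <= configured_s:
        # Iterations starting early are generally not a problem.
        return "ok"
    if idle_pct <= idle_floor:
        # Running long with Idle near zero: find the sub-phase that is
        # consuming the time, or consider a longer configured iteration.
        return "investigate"
    # Running long but Idle is healthy, e.g. shortly after a restart.
    return "watch"


print(classify_iteration(configured_s=30, actual_s=22, idle_pct=85.0))
print(classify_iteration(configured_s=30, actual_s=75, idle_pct=0.5))
print(classify_iteration(configured_s=30, actual_s=40, idle_pct=60.0))
```

Remember that the average is inflated for a while after Moab starts, so a "watch" result right after a restart is expected.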
Correlating all of this information will provide a good starting point for understanding what is going on when things appear to be running slowly from an end-user command point of view.
Steps to Take Today
There is one final step that is very important. In order to know when one of the sub-phase metrics is out of whack, one needs to know what the stable baseline is for each of them.
ACTION: Run mdiag -S -v -v today to get your baseline sub-phase metrics. Write them down and keep them in a safe place.
Every system is unique, with different numbers and types of resources and a different workload pattern, so your numbers will be unique to your system. Then, if a slowdown is noticed, you’ll be able to run this command again and compare the results with what is normal for your system.
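Writing the baseline down can be as simple as saving it to a file and comparing later readings against it. A minimal sketch, assuming you have already copied the Load(5m) values into a dictionary; the function names, the 5-point threshold, and all numbers shown are illustrative assumptions.

```python
import json


def save_baseline(path, load_5m):
    """Store the baseline sub-phase loads as JSON."""
    with open(path, "w") as fh:
        json.dump(load_5m, fh, indent=2, sort_keys=True)


def load_baseline(path):
    """Read a baseline previously written by save_baseline."""
    with open(path) as fh:
        return json.load(fh)


def drift(baseline, current, threshold=5.0):
    """Sub-phases whose load differs from baseline by more than `threshold` points."""
    names = sorted(set(baseline) | set(current))
    deltas = {n: current.get(n, 0.0) - baseline.get(n, 0.0) for n in names}
    return {n: round(d, 2) for n, d in deltas.items() if abs(d) > threshold}


# Made-up numbers: a quiet baseline versus a reading taken during a slowdown.
baseline = {"Sched": 4.0, "RMLoad": 2.0, "User": 2.0, "Idle": 92.0}
current = {"Sched": 5.0, "RMLoad": 3.0, "User": 40.0, "Idle": 52.0}
print(drift(baseline, current))
```

Keeping the file under version control or alongside your site documentation makes it easy to find when a slowdown is reported.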
Now you know, and knowing is half the battle.
~ G.I. Joe