Walk into any hardware store and ask to buy a hammer. You’ll be taken to a wall filled with hammers of every size, shape and weight. Some will have waffle heads; others will be rounded. Some will have a claw, while others appear to have a hatchet sticking out the back. They will be made of steel, aluminum, plastic, and even hard rubber.
Why am I being presented with a menagerie when all I want to do is drive a nail?
This is the antithesis of Abraham Maslow’s law of the instrument, “if all you have is a hammer, everything looks like a nail.” In HPC, we tend to have far more nails (jobs) than hammers (supercomputer resources). This being the case, the selection of the tool is of utmost importance.
My father is a general building contractor, and as such he has used and can explain the purposes and properties of all of the different hammers. He can quickly relate how the drywall hammer’s rounded head will seat the nail without breaking the drywall’s paper coating, while a framing hammer can easily destroy the paper, which needs to remain intact to hold up the drywall.
He knows the right tool for the job.
An HPC system administrator faces a number of different concerns every day. Competing interests and agendas constantly attempt to preempt one another for attention, each one believing it is the most important concern of the moment. Some of the questions constantly on the mind of an HPC system administrator include:
- Is the cluster running at an optimal level?
- Do I have jobs that are not running that should be? Why?
- How do I train end users to increase their productivity and reduce questions to the support staff?
- When and how can I do upgrades so as to impact the work as little as possible?
- How do I reduce operating costs while still increasing productivity?
- How do I model our organizational politics in our usage policies so everyone is satisfied (or at least so no one is loudly complaining)?
The ability to answer these questions quickly and succinctly is one of the hallmarks of a good HPC system administrator. This mission is supported or hampered by the tools at one’s disposal.
“Our Age of Anxiety is, in great part, the result of trying to do today’s job with yesterday’s tools and yesterday’s concepts.”
~ Marshall McLuhan
Here are a couple of key points for consideration.
Goals, Goals, Goals
Every organization has a set of goals, whether codified or not, which defines what is important to the organization. As an HPC system administrator, it is important to understand and believe in these goals. They will define how the HPC cluster is configured and managed.
Every organization is different, and so what works for another organization, system or previous job may not be optimal for the current situation. Always strive to understand what the organization is doing and how HPC fits into that overall picture. Then do everything possible to leverage HPC in the attainment of those goals.
HPC is really an engine for innovation.
Once the goals are clearly understood, one needs to then define “optimize.”
optimize /ˈäptəˌmīz/ (verb): make the best or most effective use of (a situation, opportunity, or resource).
The problem with the word optimize is that it is ambiguous. Is one trying to optimize for:
- Pure utilization
- Power-to-work ratio
- Priority workload first
- Some combination of these and others
The answer to this question will vary from organization to organization, but it is a key metric that helps the HPC system administrator know whether they are using the system effectively to meet the organizational goals. From a practical point of view, it’s almost always some mixture of multiple, competing goals and priorities.
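One way to make that mixture concrete is to write it down as a weighted score. The sketch below is purely illustrative: the `ClusterSnapshot` fields, the `optimization_score` function, and the two weight profiles are hypothetical names and made-up numbers, not measurements from any real system or features of any particular scheduler. It only shows that the same cluster state can look "optimal" to one organization and not to another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSnapshot:
    """Hypothetical, normalized metrics (each between 0 and 1)."""
    utilization: float        # fraction of available node-hours in use
    work_per_kwh: float       # normalized power-to-work ratio
    priority_on_time: float   # fraction of priority jobs started on time


def optimization_score(snap: ClusterSnapshot, weights: dict) -> float:
    """Weighted average of the metrics named in `weights`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * getattr(snap, name) for name, w in weights.items()) / total


# Two made-up cluster states: one busy, one kinder to priority work.
busy = ClusterSnapshot(utilization=0.95, work_per_kwh=0.6, priority_on_time=0.5)
responsive = ClusterSnapshot(utilization=0.70, work_per_kwh=0.6, priority_on_time=0.95)

# Two made-up organizations with different definitions of "optimized".
throughput_site = {"utilization": 0.7, "work_per_kwh": 0.1, "priority_on_time": 0.2}
priority_site = {"utilization": 0.2, "work_per_kwh": 0.1, "priority_on_time": 0.7}

for label, weights in [("throughput", throughput_site), ("priority", priority_site)]:
    best = max([busy, responsive], key=lambda s: optimization_score(s, weights))
    print(label, "site prefers:", "busy" if best is busy else "responsive")
```

The weights are where the organizational goals enter. Choosing them is a management conversation, not a technical one.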
Truth be told, everyone always wants every facet optimized, whether or not that’s possible. Unfortunately, sometimes these goals are mutually exclusive, which brings us to our next point: policies.
Policies are at the heart of any HPC system. Some systems use (and only need) a simple First-In-First-Out queuing policy. Other systems have hundreds of directives that describe how the system is supposed to optimize in myriad situations. The level of policy needed is determined by the preceding questions, the actual workload and organizational politics.
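To illustrate the difference between the two ends of that spectrum, here is a minimal sketch of a pure First-In-First-Out ordering next to an ordering with a single policy directive layered on top. The `Job` type, the group names, and the `group_priority` table are assumptions made up for this example. Real schedulers express policy in their own configuration languages and weigh many more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    submit_time: int   # seconds since an arbitrary reference point
    group: str         # hypothetical owning group or department


def fifo_order(jobs):
    """Pure First-In-First-Out: the earliest submission runs first."""
    return sorted(jobs, key=lambda j: j.submit_time)


def policy_order(jobs, group_priority):
    """One policy directive: higher-priority groups go first.

    Groups missing from `group_priority` default to 0, and ties fall
    back to FIFO by submit time.
    """
    return sorted(
        jobs,
        key=lambda j: (-group_priority.get(j.group, 0), j.submit_time),
    )


queue = [
    Job("a", submit_time=10, group="physics"),
    Job("b", submit_time=20, group="chemistry"),
    Job("c", submit_time=30, group="physics"),
]

print([j.name for j in fifo_order(queue)])
print([j.name for j in policy_order(queue, {"chemistry": 10})])
```

Even this single directive changes who waits, which is why each additional policy tends to map back to a decision someone in the organization had to make.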
Politics exist within any organization, large or small, and they aren’t necessarily a bad thing. However, in my experience, most policies on HPC systems are put in place to deal with one level of politics or another. Every system is subjected to some level of politics.
Therefore, your tools need to be able to deal with the politics of your organization through policy. A mistake that’s commonly made is to underestimate the effect of politics on the running of the HPC cluster. Consequently, tools necessary to handle said politics are not obtained, resulting in a constant state of struggle. As much as we’d like to just be left alone to “run” the system, the human factor always comes into play. After all, we are trying to help the organization succeed with its goals.
So, what are the next steps?
- Identify your organizational goals.
- Identify the localized meaning of “optimized”.
- Identify the organizational politics that need to be supported.
- Ensure your tools (e.g., scheduler) can handle the goals and politics. Upgrade or change, if necessary.
Making sure the right tools are in place is paramount in freeing the HPC system administrators to spend more time optimizing the system to meet the organizational goals.
Go forth and pick the best tool for your job!