
Systems Management 

 

The EuroEXA project delivered a range of systems management tools covering every stage of the compute workflow – from scheduling and file management to debugging and performance analysis. 

We harnessed Proximity Optimised Scheduling to ensure jobs are scheduled on the closest possible nodes – driving both speed and energy efficiency. For similar reasons, we used BeeGFS to optimise the location of data storage across the system – reducing both the volume and the distance of data transmission. 

For Hardware Management, we used the Linux Foundation-backed OpenBMC ported onto an UltraZed FPGA board, giving us the tools to manage, monitor and drive proactive maintenance. 

Finally, we have implemented a graphical performance analysis toolkit called Aftermath, together with a debugging and profiling toolkit from Arm – both of which offer the chance to identify further optimisation opportunities and improvements in our programming approaches.

Proximity Optimised Scheduling

 

The EuroEXA prototype employed Slurm, an open-source cluster management and job scheduling tool widely used in HPC systems. We extended Slurm’s functionality in two directions – memory utilisation and resource allocation. 

First, Slurm can improve memory utilisation thanks to UNIMEM, the project’s innovative global shared memory architecture. HPC workloads vary widely in their memory footprint, depending on the type of application, the number of processes and whether the run is strong or weak scaling. Classical HPC clusters therefore over-allocate memory capacity – isolating memory that isn’t required so that it can’t be used by applications running on other nodes. Our improved batch scheduler allocates memory as a global resource, taking proximity into account to maintain performance while maximising overall system throughput.
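
To make the idea concrete, the following Python sketch shows proximity-aware allocation from a global memory pool: the job’s local memory is used first, then the nearest nodes with spare capacity. The node names, capacities and hop distances are invented for illustration and do not reflect the actual plugin implementation.

# Illustrative sketch of proximity-aware allocation from a global (UNIMEM-style)
# memory pool: local memory is used first, then the nearest nodes with spare
# capacity. Node names, capacities and hop distances are invented for this
# example; the real logic lives inside our Slurm plugins.

# Free memory (GB) currently available on each node.
free_mem = {"node0": 4, "node1": 32, "node2": 64, "node3": 128}

# Hop count from the job's home node (node0 here) to every node in the system.
distance = {"node0": 0, "node1": 1, "node2": 1, "node3": 2}

def allocate_global_memory(required_gb):
    """Borrow memory from the nearest nodes until the request is satisfied."""
    allocation, remaining = {}, required_gb
    for node in sorted(free_mem, key=lambda n: distance[n]):   # nearest first
        if remaining <= 0:
            break
        take = min(free_mem[node], remaining)
        if take > 0:
            allocation[node] = take
            free_mem[node] -= take
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free memory in the system")
    return allocation

# A 48 GB request: 4 GB locally, the rest borrowed from the two nearest nodes.
print(allocate_global_memory(48))   # -> {'node0': 4, 'node1': 32, 'node2': 12}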

Second, Slurm also performs resource allocation for MPI jobs – taking into account the application’s communication profile, the system topology and a time series of node failures. Using a custom profiling tool for MPI applications, we can obtain an application’s communication pattern. From there, Slurm can assign processes with a heavy communication profile to nearby nodes, as well as avoiding failure-prone nodes – cutting down on wasted resources and the need to restart jobs.

We have integrated all of this functionality into Slurm via plugins – with just a single line of code changed in Slurm itself.
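
As a rough illustration of the communication-aware placement described above, the Python sketch below packs heavily communicating rank pairs onto the same healthy node while skipping failure-prone nodes. The traffic matrix, node names, failure counts and the greedy heuristic itself are illustrative assumptions, not the actual plugin logic.

# Rough sketch of communication-aware placement: ranks that exchange the most
# data are packed onto the same node, and failure-prone nodes are skipped.
# The traffic matrix, node list and failure history below are made up for
# illustration – the real logic is implemented as Slurm plugins.

# Bytes exchanged between pairs of MPI ranks (from the custom profiling tool).
traffic = {(0, 1): 9e9, (2, 3): 8e9, (0, 2): 1e6, (1, 3): 2e6}

nodes = ["n0", "n1", "n2", "n3"]
recent_failures = {"n0": 0, "n1": 0, "n2": 3, "n3": 1}   # failures per node
RANKS_PER_NODE = 2

def place(ranks):
    """Greedily co-locate heavily communicating ranks on healthy nodes."""
    healthy = [n for n in nodes if recent_failures[n] < 2]  # skip flaky nodes
    placement, used = {}, {n: 0 for n in healthy}

    def score(node, rank):
        if used[node] >= RANKS_PER_NODE:
            return (-1, 0)                                  # node already full
        partners = sum(1 for r, n in placement.items() if n == node
                       and ((r, rank) in traffic or (rank, r) in traffic))
        return (partners, -used[node])        # prefer partners, then light load

    # Heaviest-communicating pairs are placed first.
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for rank in (a, b):
            if rank not in placement:
                node = max(healthy, key=lambda n: score(n, rank))
                placement[rank] = node
                used[node] += 1

    # Ranks with no recorded traffic go wherever there is space.
    for rank in ranks:
        if rank not in placement:
            node = min(healthy, key=lambda n: used[n])
            placement[rank] = node
            used[node] += 1
    return placement

print(place(range(4)))   # -> {0: 'n0', 1: 'n0', 2: 'n1', 3: 'n1'}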

Hardware Management

 

EuroEXA uses a hardware management approach built upon OpenBMC, an open-source Baseboard Management Controller (BMC) firmware stack developed through a collaborative project run by the Linux Foundation. It’s currently used in high-performance servers and embedded systems from a range of manufacturers, including most of today’s Open Compute Project systems.

We ported OpenBMC onto an UltraZed FPGA board, allowing us to add functionality that’s specific to the challenges of our hardware. Produced by Avnet, the UltraZed-EG is a self-contained single-board computer with programmable logic, which allows us to integrate 32 UARTs, 8 I2C interfaces, multiple general-purpose I/O pins and a built-in 1Gbit Ethernet link. With onboard storage, it provides a self-hosted Linux-based software environment. Within the EuroEXA architecture, the board controls all power, monitoring and health operations of the 10GE management network, the EuroEXA-specific high-speed FPGA-based switch and the compute nodes themselves. Redfish is fully supported, together with web application interfaces.
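
As an example of what this standard environment enables, the Python sketch below reads temperature sensors from the board over Redfish, walking the standard /redfish/v1/Chassis collection and its Thermal resources as exposed by OpenBMC. The BMC address and credentials are placeholders, and the exact resources available depend on the platform and the Redfish schema version it implements.

# Example of reading sensor data from the board's BMC over Redfish, using the
# standard /redfish/v1 resource tree that OpenBMC exposes. The BMC address and
# credentials are placeholders; exact chassis names depend on the platform.

import requests

BMC = "https://bmc.example.local"          # hypothetical management address
AUTH = ("admin", "password")               # placeholder credentials

session = requests.Session()
session.auth = AUTH
session.verify = False                     # many BMCs use self-signed certificates

# Walk the chassis collection and print temperature readings for each chassis.
chassis_list = session.get(f"{BMC}/redfish/v1/Chassis").json()
for member in chassis_list.get("Members", []):
    chassis_url = member["@odata.id"]
    thermal = session.get(f"{BMC}{chassis_url}/Thermal").json()
    for sensor in thermal.get("Temperatures", []):
        print(f'{sensor.get("Name")}: {sensor.get("ReadingCelsius")} °C')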

Using this industry-standard environment for control and monitoring allows us to harness existing datacentre management tools for overall system monitoring. Moreover, this approach shows how an ExaScale system based on our hybrid design could be controlled alongside traditional HPC systems.

As an open-source project supported by several industry leaders (IBM, Intel, Facebook, Google, Microsoft and others), OpenBMC offers several benefits. Primarily, this means that the codebase is mature and well tested, while security fixes are open and readily available. It also ensures that the development work behind our EuroEXA platform, with the UltraZed-EG, will feed back into the OpenBMC community – making a wider contribution, in line with the ultimate goals of the EuroEXA project.

Multi-tier Holistic Resilience

 

As the size and complexity of an HPC cluster increases, the mean time between failures decreases. Checkpointing is the current solution, with limited innovation aimed specifically at reducing the likelihood and impact of a failure. The EuroEXA precursor project, ExaNeSt, specifically considered the efficiency of checkpointing by leveraging the contained abstraction of a node through virtualisation. EuroEXA has focused on reducing the performance impact of checkpointing while investigating solutions to reduce the occurrence and severity of any failure. This includes techniques to resolve failures locally, while communicating across the stack as necessary.

Infrastructure Resilience

The infrastructure includes the mechanics, the power and the cooling of the IT equipment in an HPC deployment. Together with EuroEXA’s performance advances in exposing locality, this creates a hierarchy against which failures can be managed. No longer are all nodes equal to all other nodes in the system; we created a hierarchy both physically in the hardware and in the design of the firmware/software. Nodes are independently powered and networked together in units of four. Four such units create a blade, with its own independent hardware management and localised networking. Eight blades create a network group, the next level of network isolation. Four network groups create the cabinet, which provides redundancy of power supply and is the unit of system replication. The liquid cooling system is designed with redundant paths and pumps, and uses an innovative solution to prevent leaks causing catastrophic failures: by running at sub-atmospheric pressure, any leak pulls in air, which can then be vented, turning a leak into a simple scheduled-maintenance issue.
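
The Python sketch below illustrates this hierarchy, mapping a flat node index onto (cabinet, network group, blade, unit, node) coordinates using the 4/4/8/4 fan-out described above; the flat numbering scheme itself is purely illustrative.

# Sketch of the physical hierarchy described above, mapping a flat node index
# onto (cabinet, group, blade, unit, node) coordinates. The 4/4/8/4 fan-out
# comes from the text; the flat numbering scheme itself is illustrative.

NODES_PER_UNIT = 4
UNITS_PER_BLADE = 4
BLADES_PER_GROUP = 8
GROUPS_PER_CABINET = 4

NODES_PER_BLADE = NODES_PER_UNIT * UNITS_PER_BLADE          # 16
NODES_PER_GROUP = NODES_PER_BLADE * BLADES_PER_GROUP        # 128
NODES_PER_CABINET = NODES_PER_GROUP * GROUPS_PER_CABINET    # 512

def locate(node_index):
    """Translate a flat node index into its position in the hierarchy."""
    cabinet, rest = divmod(node_index, NODES_PER_CABINET)
    group, rest = divmod(rest, NODES_PER_GROUP)
    blade, rest = divmod(rest, NODES_PER_BLADE)
    unit, node = divmod(rest, NODES_PER_UNIT)
    return {"cabinet": cabinet, "group": group, "blade": blade,
            "unit": unit, "node": node}

# Failures can then be contained at the smallest enclosing level:
print(locate(700))  # -> {'cabinet': 1, 'group': 1, 'blade': 3, 'unit': 3, 'node': 0}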

Topology Resilience

Today’s HPC network topology is a web of equally interconnected nodes. Any loss in a region of the network causes significant re-routing pressure, and most likely application failure. The EuroEXA network topology is built using three different approaches which together reduce the impact of any failure in the topology. The direct links at blade and network-group level protect those resources from failures elsewhere in the system, and these topology regions adopt a geographical addressing scheme able to compensate for disconnects independently of the higher-level topology decisions, which are otherwise driven by system size.

Communications Resilience

The ExaNET communications protocol used across the topology has been designed with resilience in the end-to-end communications. The hardware designs have been partitioned to ensure physical and link-level failures are isolated so that routing paths can be adjusted – a task simplified by the locality hierarchy of the overall topology. Novel approaches to flow control and back-pressure management also reduce failures caused by overdriving the network, a risk that follows from adopting a hierarchy of increasing bandwidths to reduce port counts and the total hop count across the topology.

Checkpointing Efficiency

While the industry continues to enhance how an individual node is checkpointed, EuroEXA focuses on increasing the efficiency – and hence reducing the downtime – associated with checkpointing. Adopting the lessons from ExaNeSt’s virtualised checkpointing, we note that the traditional centralised HPC storage system is unable to scale to support checkpointing at exascale. The fastest storage solutions in the IO500 are currently capable of a few hundred GBytes/s, while current TOP500 systems comprise hundreds of thousands of nodes – leaving less than 1 MByte/s of bandwidth per node to share machine snapshots between nodes for resilience. This leads to low-fidelity checkpoints and a significant impact on overall performance. The EuroEXA system instead creates a distributed HPC storage system between and across all nodes, which would give a 100,000-node system the ability to snapshot at over 200 TBytes/s, with the EuroEXA network design able to distribute checkpoints for resilience without overloading the interconnect.
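
The back-of-the-envelope arithmetic behind this argument can be sketched as follows, using round illustrative figures (100 GBytes/s of centralised storage bandwidth, 100,000 nodes and 2 GBytes/s of node-local storage bandwidth per node) rather than measured EuroEXA values.

# Illustrative arithmetic only – round numbers, not measured EuroEXA figures.
NODES = 100_000

central_storage_bw = 100e9            # ~fastest IO500-class filesystem, bytes/s
per_node_share = central_storage_bw / NODES
print(f"Centralised: {per_node_share / 1e6:.1f} MB/s per node")   # 1.0 MB/s

local_bw_per_node = 2e9               # node-local storage bandwidth, bytes/s
aggregate_bw = local_bw_per_node * NODES
print(f"Distributed: {aggregate_bw / 1e12:.0f} TB/s aggregate")   # 200 TB/s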

Memory Resilience

The physical protection of data in DRAM has been limited to ECC, typically with the ability to correct a single-bit error and detect a second, with any further errors going unnoticed. Although bit-flip errors are reduced on ECC DRAM, research has shown that such multi-bit errors can and do occur. Adding ECC to a memory reduces its effective capacity and, given the complexity of its generation (a 10-level XOR tree), also limits performance. The EuroEXA memory compression technologies show that compression could reduce the cost of ECC protection, while the design shows that being more intelligent about how memory is accessed can provide more time for manipulation of the data stream. We see this as a foundation for a future investigation of software-defined, multi-bit ECC schemes able to move beyond single-correct/double-detect schemes while maintaining capacity and performance.
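
For readers unfamiliar with the single-correct/double-detect scheme, the Python sketch below implements a toy extended-Hamming SECDED code for a 4-bit word; real DRAM ECC protects 64-bit words with 8 check bits, but the XOR-based syndrome logic is the same idea. This is purely illustrative and is not the EuroEXA design.

# Minimal SECDED (single-error-correct, double-error-detect) sketch for a
# 4-bit data word, using an extended Hamming(8,4) code. Toy example only.

def encode(data4):
    """Encode 4 data bits (int 0..15) into an 8-bit SECDED codeword."""
    d = [(data4 >> i) & 1 for i in range(4)]           # d0..d3
    bits = [0] * 8                                     # positions 1..7 = Hamming(7,4)
    bits[3], bits[5], bits[6], bits[7] = d[0], d[1], d[2], d[3]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]              # p1 covers 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]              # p2 covers 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]              # p4 covers 4,5,6,7
    bits[0] = sum(bits[1:]) & 1                        # overall parity bit
    return sum(b << i for i, b in enumerate(bits))

def decode(code8):
    """Return (data4, status) where status is 'ok', 'corrected' or 'double'."""
    bits = [(code8 >> i) & 1 for i in range(8)]
    s1 = bits[1] ^ bits[3] ^ bits[5] ^ bits[7]
    s2 = bits[2] ^ bits[3] ^ bits[6] ^ bits[7]
    s4 = bits[4] ^ bits[5] ^ bits[6] ^ bits[7]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    overall = sum(bits) & 1                            # 0 if overall parity holds
    if syndrome and overall:                           # single-bit error: fix it
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome and not overall:                     # two errors: detect only
        status = "double"
    elif overall:                                      # error in the parity bit itself
        status = "corrected"
    else:
        status = "ok"
    data4 = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data4, status

word = 0b1011
cw = encode(word)
assert decode(cw) == (word, "ok")
assert decode(cw ^ (1 << 6)) == (word, "corrected")    # single bit flip fixed
assert decode(cw ^ 0b01010000)[1] == "double"          # two flips detected only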

SerDes and Device Resilience

The foundation of an HPC system is the analogue and digital devices from which everything else is built. The security and resilience of these devices is fundamental to the correct operation of the higher-level circuits and cores. EuroEXA investigated various structures and approaches to increase resilience to wear and security issues – which together reduce the occurrence of failures and hence increase system-level resilience.

Distributed File System

 

BeeGFS is a parallel cluster file system developed both for exceptional performance and for ease of installation and management. Just as the performance of modern processors and network technologies is ever-increasing, so too is the size of the data sets being processed. To handle this huge amount of data and deliver it to the computing cores as fast as possible, the Fraunhofer Competence Center for High Performance Computing (CC HPC) has spent several years developing BeeGFS.

A free, open-source piece of software, BeeGFS distributes individual files across multiple servers chunk by chunk, so that they can be read and written in parallel. Today, it’s in use on a diverse range of computer clusters – from installations with only a few machines to several systems in the Top500 list of the world's fastest supercomputers. It’s also a fundamental component of a wide range of projects led by various research organisations and government bodies.
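
Conceptually, the chunk-by-chunk striping works as in the Python sketch below, which splits a file into fixed-size chunks and distributes them round-robin across storage targets so that different chunks can be accessed in parallel. The chunk size and target names are placeholders; this mirrors the idea only, not BeeGFS’s actual on-disk layout or API.

# Conceptual illustration of chunk-by-chunk striping across storage targets.
# Chunk size and target names are placeholders, not BeeGFS internals.

CHUNK_SIZE = 1 * 1024 * 1024          # 1 MiB chunks (illustrative)
TARGETS = ["storage0", "storage1", "storage2", "storage3"]

def stripe(data: bytes):
    """Return a mapping of storage target -> list of (offset, chunk) pairs."""
    layout = {t: [] for t in TARGETS}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        target = TARGETS[(i // CHUNK_SIZE) % len(TARGETS)]   # round-robin
        layout[target].append((i, chunk))
    return layout

layout = stripe(b"x" * (10 * CHUNK_SIZE))
for target, chunks in layout.items():
    print(target, [offset for offset, _ in chunks])
# storage0 holds the chunks at offsets 0, 4 MiB and 8 MiB; storage1 at 1, 5, 9 MiB; ...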

Our goal of fitting more compute power in less space – hence, minimising the distance the data has to travel – fits perfectly with the BeeGFS concept. It even offers the tools to precisely control where data is stored. With the chance to make the most of storage capacity built into the compute nodes, BeeGFS can help us reduce these distances even further – achieving ever-greater performance.

Debugging and Profiling Toolkit

 

The EuroEXA system uses the Arm Forge tool package, which includes the Arm DDT debugger, Arm MAP profiler and Arm Performance Reports application analysis tool. The package is designed for remote systems and can be configured for remote use – allowing code to be executed on an HPC cluster while the front end runs on a local system, avoiding common responsiveness issues with remote graphics. 

While it’s a commercial product, Arm is providing licences and support for the package to partners throughout the project.

Debugger: Arm DDT

Arm DDT is a leading parallel debugger, which can run on any scale of system – from simple desktop computers to PetaScale supercomputers. During operation, DDT runs processes at both the front and back end of the system, collecting data from the executing software and returning it to the user for inspection in a GUI (Figure 1). To do this, it supports MPI, OpenMP and GPGPU parallelism for debugging, with an efficient and scalable communication structure to minimise latency during the process. While it supports languages popular in HPC, like Fortran and C/C++, initial support is also available for Python.

[Image: arm_forge_fig1_ddt.jpg]

Figure 1: Arm DDT in action, showing a debugging session of an 8-process, MPI-parallelised Fortran program, with the source code viewer and the processes split between two subprograms, together with the array viewer showing the data split between multiple processes (colours), with the rows and columns represented on the X and Y axes and the array value on the Z axis.

 

Profiler: Arm MAP

Arm MAP is a highly scalable parallel profiler used in HPC development, supporting the same programming languages and parallel environments as Arm DDT. 

Using a sampling-based approach, it profiles without having to modify any code – determining per-process or per-thread load imbalance and highlighting which functions, subroutines or lines are causing performance bottlenecks. Usefully, it offers all this functionality within a simple and user-friendly GUI (Figure 2).

[Image: arm_forge_fig2_map copy.png]

Figure 2: A MAP example from an MPI-parallelised C program, showing the split between compute and MPI operations and the associated power draw of the system.

 

Application Analysis: Arm Performance Reports

Arm Performance Reports relies on the same sampler and performance metrics as Arm MAP and can be used on an application without recompiling. However, instead of providing source-level performance data, it provides a summary of the application’s behaviour in a simple HTML file – guiding users on how to maximise the application’s efficiency (Figure 3).

[Image: arm_forge_fig3_pr.png]

Figure 3: A performance report for an MPI program running 32 processes, showing a breakdown of execution time split between compute, MPI operations and I/O.

Graphical Performance Analysis

 

Aftermath is a graphical tool for the trace-based performance analysis of OpenMP and OpenStream programs. It allows programmers to relate performance data from a parallel execution to the programming model, for example, to loops and tasks in OpenMP.

It also allows programmers to explore trace files interactively, by generating detailed statistics for arbitrary subsets of a trace on the fly, while providing the detail of individual events for close inspection. 

Aftermath is free, released under the GNU GPL v2 and its tracing library is licensed under the GNU LGPL v2.1.

 
[Image: Aftermath.png]
 

By using Aftermath, the EuroEXA project benefits from:

  • Reactive user interface, even for large trace files

  • Multiple, connected views for exploration, statistics and detailed analysis

  • OpenMP support

    •  Visualisation and detailed inspection of parallel loops

    •  Instrumented run-time based on the LLVM/clang OpenMP run-time

    •  OMPT support with extension for detailed parallel loop chunk analysis

  • OpenStream support

    •  Native support for dependent tasks

    •  Support for data-flow and task graph analysis

    •  Native support for NUMA

  • Machine learning support for identifying performance anomalies either at the task-instance level or loop-chunk level, and for identifying the main contributing factors.

[Image: Aftermath 2and3.jpg]