
Application Development 

 

We developed the EuroEXA prototypes in close collaboration with experts across key application domains – from forecasting the weather and modelling particle physics, to simulating the human brain and searching for new drugs to treat disease.

By focusing on the needs of these 14 applications, we ensured that EuroEXA delivers performance and energy efficiency shaped for real-world usage – ultimately providing a world-beating exascale architecture that drives economic prosperity and scientific progress.

These applications have been adapted to exploit the EuroEXA architecture – enabling high performance across various aspects of the system – and can be tuned further to achieve maximum performance and efficiency.

In particular, the project has achieved significant progress with the UK weather and climate model and with the virtual screening drug discovery project.

Delivering performance for demanding applications

 

The EuroEXA architecture was co-designed and evaluated through 14 applications across various areas of HPC: Climate and Weather, Physics and Energy, and Life Sciences and Bioinformatics. 

These applications have been used to drive the co-design of the architecture, and they are being ported and optimised to make the most of the EuroEXA architecture’s unique capabilities. For instance, the Trifecta Scalable Interconnect and the Data Fluent Processing approach are being harnessed through the use of FPGAs via the portable programming environments (OpenStream, OmpSs, Maxeler toolchain and PSyClone) and communication APIs (MPI and GPI).

Our applications are representative of the most important types used across the application CoEs – including the Square Kilometre Array (SKA) and the Human Brain Project (HBP). They also provided the project with a diverse set of application requirements, including a mix of compute-bound (neuromarketing, Quantum ESPRESSO, Macau, B-PMF and LBm), memory-bound (LFRic, NEMO, ALYA, MIRIAD, AVU-GSR and NEST), communication-bound (GADGET, NEST and InfOli) and I/O-bound (MIRIAD) workloads – together with one that mixes these requirements (IFS).

The applications also span capability computing (MIRIAD) and capacity computing (neuromarketing, matrix factorisation for model building), while many of them work with extreme data (MIRIAD, AVU-GSR, FRTM, NEST, B-PMF and Macau).

Put simply, the 14 applications within the EuroEXA project group offer a diverse and challenging array of approaches, parameters and requirements – creating a demanding testing ground for honing the system’s capabilities and performance.

Case Study: Porting the LFRic Weather and Climate Model

 

One of the EuroEXA target applications is the new weather and climate model, LFRic (named in honour of Lewis Fry Richardson), which is being developed by the UK Met Office and its partners to be deployed within the next few years.

Much of the LFRic model’s runtime consists of compute-intensive operations, which are well suited to acceleration using FPGAs. We used the Xilinx Vivado toolset, including High-Level Synthesis (HLS), to generate bitstreams for programming the FPGA from standard C code. Running on a small FPGA – the Xilinx Zynq UltraScale+ ZU9 – we achieved 5.3 GigaFLOPS for a matrix-vector kernel, and we are now working on scaling this up to state-of-the-art chips.
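To illustrate the HLS approach, the sketch below shows a matrix-vector kernel in the style of C code that Vivado HLS accepts. It is a minimal, hypothetical example – the block size, names and pragma choices are assumptions, not the actual LFRic kernel – and the HLS pragmas are directives that a CPU compiler simply ignores:

```c
#include <stddef.h>

#define NDOF 8  /* illustrative block size; the real LFRic kernel differs */

/* Minimal matrix-vector kernel (y = A * x) in HLS-style C.
 * The pragmas hint at loop pipelining and array partitioning for the
 * FPGA; in a plain CPU build they are ignored. */
void matvec(const double a[NDOF][NDOF], const double x[NDOF],
            double y[NDOF])
{
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=x complete
row_loop:
    for (size_t i = 0; i < NDOF; i++) {
#pragma HLS PIPELINE II=1
        double acc = 0.0;
col_loop:
        for (size_t j = 0; j < NDOF; j++) {
            acc += a[i][j] * x[j];
        }
        y[i] = acc;
    }
}
```

Partitioning the arrays lets the synthesised hardware read a whole matrix row per cycle, which is what makes the pipelined inner loop profitable on the FPGA.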

We have shown how multiple kernels from the LFRic model can be offloaded onto the FPGA. Combined with the halo exchange between sub-domains – already implemented in LFRic using MPI message passing – this will let us demonstrate the seamless use of multiple FPGAs in a multi-node cluster.

We engineered the LFRic model code through a separation of concerns – keeping the science code apart from the code necessary for performance on parallel systems. This is achieved by writing the science code against an application programming interface (API), which enables the PSyclone domain-specific compiler to generate the parallel code. Recent PSyclone developments allow it to generate OpenCL code specifically targeting FPGA acceleration.

A detailed description of optimising the matrix-vector kernel using Vivado HLS is given in the paper: M. Ashworth, G. D. Riley, A. Attwood and J. Mawer, “First Steps in Porting the LFRic Weather and Climate Model to the FPGAs of the EuroExa Architecture”, Scientific Programming, 2019.

 

Case Study: Virtual Screening on FPGA

 

Virtual Molecule Screening (VMS) is a computational technique for evaluating potential drug treatments, using machine learning to predict whether a chemical compound is likely to bind to a drug target. As drug discovery involves testing many different compounds, virtual screening demands a lot of computational power – making it an ideal opportunity for FPGA accelerators.

Thanks to some key properties of FPGAs, we can obtain a 10x increase in speed compared to CPUs or GPUs, at a much lower power consumption.

The key properties we needed to exploit to achieve this result are: 

  • Parallelism: FPGAs have a clock speed roughly 10x lower than CPUs. They compensate for this by providing many more parallel resources (DSP blocks, memory, LUTs, etc.). We were able to benefit from these extra resources by exploiting parallelism at many levels: inside the prediction pipeline, but also across proteins and molecules.

  • Code Complexity: Mapping code efficiently onto an FPGA is a time-consuming task, even with the help of high-level synthesis tools. Thanks to the simplicity of the code and the use of a code generator, we were able to achieve a good increase in speed with relatively little effort.

  • Memory Bandwidth: As with all accelerators, getting data in and out efficiently is key to good performance. In this case, we streamed the fingerprints in and the predictions out in a linear fashion, and stored the model itself in the FPGA’s on-chip memory beforehand.

  • Bit-Width Reduction: FPGAs deal much better with reduced bit-width fixed-point numbers than with double- or single-precision floating-point numbers. We were able to reduce the bit width of the input and output streams, and of the model itself, from 64-bit floating point to 16-bit fixed point, without a significant loss in prediction accuracy.
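The combination of streaming and bit-width reduction can be sketched as follows. This is an illustrative model only – the Q8.8 fixed-point format, the linear scoring and all names are assumptions for the example, not the EuroEXA implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAC_BITS 8  /* assumed Q8.8 format: 8 integer bits, 8 fraction bits */

/* Quantise a double-precision model weight to 16-bit fixed point. */
static int16_t to_fixed(double v)
{
    double scaled = v * (1 << FRAC_BITS);
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* De-quantise an accumulated fixed-point score back to double. */
static double to_double(int32_t v)
{
    return (double)v / (1 << FRAC_BITS);
}

/* Score one binary fingerprint against quantised model weights.
 * The 32-bit accumulator keeps sums of many 16-bit weights from
 * overflowing.  On the FPGA the weights would sit in on-chip memory
 * while fingerprints stream past this loop. */
double predict(const uint8_t *fingerprint, const int16_t *weights, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        if (fingerprint[i])
            acc += weights[i];
    return to_double(acc);
}
```

The point of the sketch is that once the weights are 16-bit, each multiply-accumulate fits a single DSP block, so many can run in parallel per clock cycle – which is where the speed-up over a CPU comes from.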

More details can be found in the paper: Vander Aa, T., Ashby, T., & Wuyts, R. (2019). Virtual Screening on FPGA. BNAIC/BENELEARN.

 