Publications

Experiences on Intel Knights Landing at the One-Year Mark

Abstract:

Brain modeling has been presenting significant challenges to the world of high-performance computing (HPC) over the years. The field of computational neuroscience has been developing a demand for physiologically plausible neuron models that feature increased complexity and thus require greater computational power. We explore Intel’s newest generation of Xeon Phi computing platforms, named Knights Landing (KNL), as a way to match the need for processing power and as an upgrade over the previous generation of Xeon Phi models, the Knights Corner (KNC). Our neuron simulator of choice features a Hodgkin-Huxley-based (HH) model which has been ported to both generations of Xeon Phi platforms and aggressively draws on both platforms’ computational assets. The application uses the OpenMP interface for efficient parallelization and the Xeon Phi’s vectorization buffers for Single-Instruction Multiple-Data (SIMD) processing. In this study we offer insight into the efficiency with which the application utilizes the assets of the two Xeon Phi generations and we evaluate the merits of utilizing the KNL over its predecessor. In our case, an out-of-the-box transition to Knights Landing offers on average a 2.4× speedup while consuming 48% less energy than the KNC.
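For readers unfamiliar with the workload, the core of a Hodgkin-Huxley simulator is an arithmetic-dense, per-timestep update of each neuron’s membrane potential and channel gating variables; this element-wise structure is what makes the model amenable to OpenMP parallelization and SIMD vectorization. The following sketch shows the classic single-compartment HH update (a textbook illustration, not the extended model used in the paper):

```python
import numpy as np

def hh_step(V, m, h, n, I_ext, dt=0.01):
    """One forward-Euler step of the classic Hodgkin-Huxley equations.

    V in mV, dt in ms; gating variables m, h, n are dimensionless.
    Textbook single-compartment model, not the paper's extended model.
    """
    # Channel opening/closing rate constants (alpha/beta) per gating variable
    a_m = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * np.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * np.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    a_n = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * np.exp(-(V + 65.0) / 80.0)

    # Ionic currents (conductances in mS/cm^2, reversal potentials in mV)
    I_Na = 120.0 * m**3 * h * (V - 50.0)
    I_K = 36.0 * n**4 * (V + 77.0)
    I_L = 0.3 * (V + 54.387)

    # Forward-Euler update; element-wise over a whole neuron population,
    # which is what makes this loop SIMD-friendly
    V = V + dt * (I_ext - I_Na - I_K - I_L)
    m = m + dt * (a_m * (1.0 - m) - b_m * m)
    h = h + dt * (a_h * (1.0 - h) - b_h * h)
    n = n + dt * (a_n * (1.0 - n) - b_n * n)
    return V, m, h, n
```

Vectorizing this update over a whole neuron population, as the NumPy version does implicitly, mirrors how the simulator fills the Xeon Phi’s SIMD lanes.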

DOI: https://doi.org/10.1007/978-3-319-67630-2_27

Enabling Shared Memory Communication in Networks of MPSoCs

Abstract:

Ongoing transistor scaling and the growing complexity of embedded system designs has led to the rise of MPSoCs (Multi‐Processor System‐on‐Chip), combining multiple hard‐core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, who are increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting‐edge MPSoC device, enabling networks of these devices to communicate within both a distributed and shared memory context, with reduced need for costly software networking system calls. We will present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq Ultrascale+, a state‐of‐the‐art MPSoC, and the challenges to be overcome given the device’s limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between processor and remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas of improvement to make this solution production‐ready.

DOI: https://doi.org/10.1002/cpe.4774 

Download Paper Here

High Frequency Functional Ultrasound in Mice

Abstract:

Functional ultrasound (fUS) is a relatively new imaging modality to study the brain with a high spatiotemporal resolution and a wide field-of-view. In fUS, detailed images of cerebral blood flow and volume are used to derive functional information, as changes in local flow and/or volume may reflect neuronal activation through neurovascular coupling. Most fUS studies so far have been performed in rats. Translating fUS to mice, which is a favorable animal model for neuroscience, calls for a higher spatial resolution than what has been reported so far. As a consequence, the temporal sampling of the blood flow should also be increased in order to adequately capture the wide range of blood velocities, as the Doppler shifts are inversely proportional to the spatial resolution. Here we present our first detailed images of the mouse brain vasculature at high spatiotemporal resolution. In addition, we show some early experimental work on tracking brain activity upon local electrical stimulation.
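The coupling between spatial resolution and temporal sampling mentioned above follows from the Doppler equation: the axial Doppler shift grows with the transmit frequency, and a higher transmit frequency (shorter wavelength) is exactly what yields finer spatial resolution. A minimal sketch of the standard textbook formula (the 1540 m/s soft-tissue sound speed is a common assumption, not a value from the paper):

```python
def doppler_shift_hz(velocity_m_s, f0_hz, c_m_s=1540.0):
    """Axial Doppler shift f_d = 2 * v * f0 / c for ultrasound (zero angle).

    c = 1540 m/s is the conventional soft-tissue sound speed.
    """
    return 2.0 * velocity_m_s * f0_hz / c_m_s
```

For example, at a 15.4 MHz transmit frequency a 1 cm/s blood velocity already produces a 200 Hz Doppler shift, so the pulse repetition frequency must rise in step with the transmit frequency.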

DOI: 10.1109/ULTSYM.2018.8579865

Download Paper Here

Efficient and Flexible Spatiotemporal Clutter Filtering of High Frame Rate Images Using Subspace Tracking

Abstract:

Current methods to measure blood flow using ultrafast Doppler imaging often make use of a Singular Value Decomposition (SVD). The SVD has been shown to be an effective way to remove clutter signals associated with slow moving tissue. Conventionally, the SVD is calculated from an ensemble of frames, after which the first dominant eigenvectors are removed. The Power Doppler Image (PDI) is then computed by averaging over the remaining components. The SVD method is computationally intensive and lacks flexibility due to the fixed ensemble length. We propose a method, based on the Projection Approximation Subspace Tracking (PAST) algorithm, which is computationally efficient and allows us to sequentially estimate and remove the principal components, while also offering flexibility for calculating the PDI, e.g. by using any convolutional filter. During a functional ultrasound (fUS) measurement, the intensity variations over time for every pixel were correlated to a known stimulus pattern. The results show that for a pixel chosen around the location of the stimulation electrode, the PAST algorithm achieves a higher Pearson correlation coefficient than the state-of-the-art SVD method, highlighting its potential to be used for fUS measurements.
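The conventional SVD pipeline the paper compares against can be summarized compactly: form the Casorati matrix of the frame ensemble, zero out the dominant singular components (clutter from slow-moving tissue), and average the power of the remainder. The sketch below illustrates that baseline (the PAST-based method proposed in the paper replaces the batch SVD with sequential subspace tracking and is not shown here):

```python
import numpy as np

def power_doppler_svd(frames, n_clutter=2):
    """Compute a Power Doppler Image via SVD clutter filtering.

    frames: (n_frames, n_pixels) Casorati matrix of an ultrafast ensemble.
    The first `n_clutter` singular components (slow-moving tissue) are
    removed; the PDI is the mean power of the remainder per pixel.
    """
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    s_filtered = s.copy()
    s_filtered[:n_clutter] = 0.0          # suppress dominant (clutter) components
    blood = (U * s_filtered) @ Vt         # reconstruct the blood signal
    return np.mean(np.abs(blood) ** 2, axis=0)
```

Note the fixed ensemble length baked into this formulation; this is the inflexibility the sequential PAST approach is designed to avoid.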

DOI: 10.1109/ULTSYM.2018.8579775 

Download Paper Here

Performance effects on HPC workloads of global memory capacity sharing

Abstract:

While most HPC systems use the traditional “shared nothing” system architecture, with self-contained nodes communicating via the device I/O and associated software layers, there are several ongoing initiatives to build systems that share resources at a coarser granularity, even across the whole machine. Such systems open up new opportunities for resource allocation, in which memory is managed as a shared resource, with an aim to improve time to completion and/or throughput of production workloads. This is expected to be especially beneficial due to the large differences in HPC workloads’ memory bandwidths and per-node memory footprints. As a first step towards tackling the resource allocation problem, this paper provides a simulation-based methodology for characterising the effect of memory capacity sharing on performance. The simulation methodology allows characterisation of performance on architectures that are not yet readily available, and it enables co-design of the architecture. To that end, we extend MUSA, a multi-node simulator [13], to simulate an infrastructure that implements inter-node memory capacity sharing. We carried out experiments on real-world HPC workloads (AMG, SPECFEM3D and HYDRO) and evaluated the simulations when 50% and 100% of the requests from the remote workload are contending for memory on the local node. Our results show that complementary workloads experience a performance impact of 20% relative to their baselines.

An Experimental Analysis of the Opportunities to Use Field Programmable Gate Array Multiprocessors for On-board Satellite Deep Learning Classification of Spectroscopic Observations from Future ESA Space Missions

Abstract:

Satellite-to-earth data transmissions are increasingly becoming a bottleneck, as transmission speed improvements do not keep up with the pace of on-board data generation. Hence, on-board satellite payload data processing becomes essential, provided such processing can be performed with a sufficiently small energy footprint. In this work we demonstrate that with appropriate pruning of weights, suitable data structures to reduce off-chip memory requirements, and a highly parallel application-specific architecture, Field Programmable Gate Array (FPGA) technology can be used for on-board satellite processing of observations by Convolutional Neural Network (CNN) architectures, with order-of-magnitude smaller energy requirements compared to Graphics Processing Units (GPUs) running the same algorithms. We demonstrate a 0.4% error vs. results from TensorFlow running on GPUs towards estimation of the galaxy redshift from spectroscopic observations. The results are from actual executions on FPGAs which have space-qualified equivalent parts. The main contribution of this work is the demonstration that accurate observation-analysis tasks can be performed in space, so that only critical information is transmitted to ground stations instead of raw data.

Multinode Implementation of An Extended Hodgkin-Huxley Simulator

Abstract:

Mathematical models with varying degrees of complexity have been proposed and simulated in an attempt to represent the intricate mechanisms of the human neuron. One of the most biochemically realistic and analytical models, based on the Hodgkin–Huxley (HH) model, has been selected for study in this paper. In order to satisfy the model’s computational demands, we present a simulator implemented on Intel Xeon Phi Knights Landing manycore processors. This high-performance platform features an x86-based architecture, allowing our implementation to be portable to other common manycore processing machines. This is reinforced by the fact that the Phi adopts the popular OpenMP and MPI programming models. The simulator performance is evaluated when calculating neuronal networks of varying sizes, densities and network connectivity maps. The evaluation leads to an analysis of the neuronal synaptic patterns and their impact on performance when tackling this type of workload on a multinode system. It is shown that the simulator can calculate 100 ms of simulated brain activity for up to 2 million biophysically accurate neurons and 2 billion neuronal synapses within one minute of execution time. This level of performance renders the application an efficient solution for large-scale detailed model simulation.

DOI: https://doi.org/10.1016/j.neucom.2018.10.062 

Download Paper Here

Prototyping a Biologically Plausible Neuron Model on a Heterogeneous CPU-FPGA Board

Abstract:

A heterogeneous hardware-software system, implemented on an Avnet ZedBoard Zynq SoC platform, is proposed for the computation of an extended Hodgkin-Huxley (eHH), biologically plausible neural model. The SoC’s ARM A9 is in charge of handling the execution of each single neuron as defined in the eHH model, each with O(N) computational complexity, while the computation of the gap-junction interactions for each cell is offloaded to the SoC’s FPGA, cutting its O(N²) complexity by exploiting parallel-computing hardware techniques. The proposed hw-sw solution allows for speed-ups of about 18 times vis-à-vis a vectorized software implementation on the SoC’s cores, and is comparable to the speed of the same model optimized for a 64-bit Intel Quad Core i7 at 3.9 GHz.

DOI: 10.1109/LASCAS.2019.8667538 

Download Paper Here

Adaptive Word Reordering for Low-Power Inter-Chip Communication

Abstract:

The energy for data transfer has an increasing effect on the total system energy as technology scales, often overtaking computation energy. To reduce the power of inter-chip interconnects, an adaptive encoding scheme called Adaptive Word Reordering (AWR) is proposed that effectively decreases the number of signal transitions, leading to a significant power reduction. AWR outperforms other adaptive encoding schemes in terms of the decrease in transitions, yielding up to 73% reduction in switching. Furthermore, complex bit-transition computations are represented as delays in the time domain to limit the power overhead due to encoding. The saved power outweighs the overhead beyond a moderate wire length where the I/O voltage is assumed equal to the core voltage. For a typical I/O voltage, the decrease in power is significant, reaching 23% at just 1 mm.
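The objective AWR optimizes, minimizing bit transitions between consecutive bus words, can be made concrete with a much simpler classic scheme, bus-invert coding. The sketch below only illustrates that objective; it is not the AWR algorithm, whose adaptive reordering logic is described in the paper:

```python
def transitions(prev, cur, width=32):
    """Number of signal transitions between two consecutive bus words."""
    return bin((prev ^ cur) & ((1 << width) - 1)).count("1")

def bus_invert(prev, cur, width=32):
    """Classic bus-invert encoding: transmit the complement of `cur` (plus
    an invert flag line) whenever that reduces the transition count below
    half the bus width. AWR goes further by adaptively reordering words;
    this simpler scheme only illustrates the transition-minimization goal.
    """
    mask = (1 << width) - 1
    if transitions(prev, cur, width) > width // 2:
        return (~cur) & mask, 1   # inverted word, flag bit set
    return cur & mask, 0          # word sent as-is, flag bit clear
```

Any such encoder trades extra logic (and, for bus-invert, one extra wire) against the switching energy saved on the link, which is why the paper's break-even analysis depends on wire length and I/O voltage.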


DOI: 10.23919/DATE.2019.8714820

Download Paper Here

INRFlow: An interconnection networks research flow-level simulation framework

Abstract:

This paper presents INRFlow, a mature, frugal, flow-level simulation framework for modelling large-scale networks and computing systems. INRFlow is designed to carry out performance-related studies of interconnection networks for both high performance computing systems and datacentres. It features a completely modular design in which adding new topologies, routings or traffic models requires minimum effort. Moreover, INRFlow includes two different simulation engines: a static engine that is able to scale to tens of millions of nodes and a dynamic one that captures temporal and causal relationships to provide more realistic simulations. We will describe the main aspects of the simulator, including system models, traffic models and the large variety of topologies and routings implemented so far. We conclude the paper with a case study that analyses the scalability of several typical topologies. INRFlow has been used to conduct a variety of studies including evaluation of novel topologies and routings (both in the context of graph theory and optimization), analysis of storage and bandwidth allocation strategies and understanding of interferences between application and storage traffic.

DOI: https://doi.org/10.1016/j.jpdc.2019.03.013 

Download Paper Here

Cut to Fit: Tailoring the Partitioning to the Computation

Abstract:

Graph analytics applications are very often built using off-the-shelf analytics frameworks, which are profiled and optimized for the general case and have to perform for all kinds of graphs. As performance is affected by the choice of partitioning strategy, analytics frameworks often offer a selection of partitioning algorithms. In this paper we evaluate the impact of partitioning strategies on the performance of graph computations. We evaluate eight graph partitioning algorithms on a diverse set of graph datasets, using four standard graph algorithms, by measuring a set of five partitioning metrics. We analyze the performance of each partitioning strategy with respect to (i) the properties of the graph dataset, (ii) each analytics computation and (iii) the number of partitions. We confirm that there is no optimal partitioner across all experiments and, moreover, find no metric that is always correlated with performance and could be targeted by novel partitioners. Finally, we find that partitioning time may become a significant part of total time, and investing a lot of time to approximate perfect partitioning may not be worth it. We propose that a “good enough” strategy may be to use a very fast (and locally computable) heuristic to select among the best-performing partitioners for any given problem instance. We demonstrate this by proposing PARSEL, a very simple partitioner selector. PARSEL selects among the two best-performing partitioners very fast; we demonstrate that even such a simple heuristic can outperform either partitioning strategy alone.
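The idea behind a PARSEL-style selector can be illustrated with a toy sketch: compute a cheap, locally available graph statistic and use it to pick one of two candidate partitioners. The statistic, the partitioner names and the threshold below are all hypothetical placeholders, not the actual heuristic from the paper:

```python
def select_partitioner(num_vertices, num_edges, threshold=10.0):
    """Hypothetical sketch of a fast partitioner selector.

    Picks between two candidate partitioners using a cheap, locally
    computable statistic (here, average degree). The partitioner names
    and the threshold are illustrative, not those used by PARSEL.
    """
    avg_degree = 2.0 * num_edges / max(num_vertices, 1)
    # Dense graphs: favor an edge-cut-oriented partitioner;
    # sparse graphs: a cheap hash-based one is often "good enough".
    return "edge_partitioner" if avg_degree > threshold else "hash_partitioner"
```

The point of such a selector is that its cost is negligible next to both partitioning and computation time, so even a modest hit rate improves total time-to-solution.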

Prospects for Low-power Acceleration of HPC Workloads in EuroEXA: FPGA Acceleration of a Numerical Weather Forecast Code

Abstract:

The EuroEXA project proposes a High-Performance Computing (HPC) architecture which is both scalable to Exascale performance levels and delivers world-leading power efficiency. This is achieved through the use of low-power ARM processors accelerated by closely-coupled FPGA programmable components. In order to demonstrate the efficacy of the design, the EuroEXA project includes application porting work across a rich set of applications. One such application is the new weather and climate model, LFRic (named in honour of Lewis Fry Richardson), which is being developed by the UK Met Office and its partners for operational deployment in the middle of the next decade. Much of the run-time of the LFRic model consists of compute-intensive operations which are suitable for acceleration using FPGAs. We have selected the Xilinx Vivado toolset, including High-Level Synthesis (HLS), which generates IP blocks that can be combined with other standard IP blocks in the Vivado Design Suite, after which a bitstream is generated for programming the FPGA. A design using twelve matrix-vector IP blocks achieves 5.34 double-precision Gflop/s. We shall discuss the implementation, the performance achieved and the prospects for acceleration of the full LFRic weather model.

A Unified Novel Neural Network Approach and a Prototype Hardware Implementation for Ultra-Low Power EEG Classification

Abstract:

This paper introduces a novel electroencephalogram (EEG) data classification scheme together with its implementation in hardware using an innovative approach. The proposed scheme integrates into a single, end-to-end trainable model a spatial filtering technique and a neural-network-based classifier. The spatial filters, as well as the coefficients of the neural network classifier, are simultaneously estimated during training. By using different time-locked spatial filters, we introduce for the first time the notion of “attention” in EEG processing, which allows for the efficient capturing of the temporal dependencies and/or variability of the EEG sequential data. One of the most important benefits of our approach is that the proposed classifier is able to construct highly discriminative features directly from raw EEG data and, at the same time, to exploit the function approximation properties of neural networks in order to produce highly accurate classification results. The evaluation of the proposed methodology, using publicly available EEG datasets, indicates that it outperforms the standard EEG classification approach based on filtering and classification as two separate steps. Moreover, we present a prototype implementation of the proposed scheme in state-of-the-art reconfigurable hardware; our novel implementation outperforms by more than one order of magnitude, in terms of power efficiency, the conventional CPU-based approaches.

Receive-side notification for enhanced RDMA in FPGA-based networks

Abstract:

FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how to better support their communications, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming too much FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. This way, it provides completion notifications directly to the remote node, which saves a round-trip latency. The entire mechanism is designed to sit within the fabric of the FPGA, requiring no software intervention. Our solution is able to reduce the latency of a receive operation by around 20% for small message sizes (4KB) over a single hop (longer distances would experience even higher improvement). Results from synthesis over a wide parameter range confirm that this optimization is scalable both in terms of the number of concurrent outstanding RDMA operations and the maximum message size.

First steps in Porting the LFRic Weather and Climate Model to the FPGAs of the EuroEXA Architecture

Abstract:

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroEXA architecture. We have used Vivado High-Level Synthesis (HLS) to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiency, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published performance, and we conclude with some comments on the prospects for future progress with FPGA acceleration of the weather forecast model. The realization of practical exascale-class high-performance computing systems requires significant improvements in the energy efficiency of such systems and their components. This has generated interest in computer architectures which utilize accelerators alongside traditional CPUs. FPGAs offer huge potential as an accelerator which can deliver performance for scientific applications at high levels of energy efficiency. The EuroEXA project is developing and building a high-performance architecture based upon ARM CPUs with FPGA acceleration, targeting exascale-class performance within a realistic power budget.

DOI: https://doi.org/10.1155/2019/7807860 

Download Paper Here

Design Exploration of Multi-tier Interconnection Networks for ExaScale Systems

Abstract:

Interconnection networks are one of the main limiting factors when it comes to scaling out computing systems. In this paper, we explore what role the hybridization of topologies plays in the design of a state-of-the-art exascale-capable computing system. More precisely, we compare several hybrid topologies with common single-topology ones when dealing with large-scale application-like traffic. In addition, we explore how different aspects of the hybrid topology can affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fat-tree networks as long as the density of connections is high enough (one connection every two or four nodes seems to be the sweet spot) and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives for the upper tiers of the interconnect, a fat-tree and a generalised hypercube, and found little difference between the topologies, with results mostly depending on the workload to be executed.

DOI: 10.1145/3337821.3337903

Download Paper Here

Enabling Standalone FPGA Computing

Abstract:

One of the key obstacles to the advancement of large-scale distributed FPGA platforms is the ability of the accelerator to act autonomously from the CPU whilst maintaining tight coupling to system memory. This work details our efforts in decoupling the networking capabilities of the FPGA from CPU resources using a custom transport layer and network protocol. We highlight the reasons that previous solutions are insufficient for the requirements of HPC, and we show the performance benefits of offloading our transport into the FPGA fabric. Our results show promising throughput and latency benefits and show competitive flop/s being achievable for network-dependent computing in a distributed environment.

DOI: https://doi.org/10.5281/zenodo.3342921

Download Paper Here

Porting a Lattice Boltzmann simulation to FPGAs using OmpSs

Abstract:

Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has become of great interest to both academia and industry research thanks to the possibility of greatly accelerating a variety of applications. The interest has been further boosted by recent developments in FPGA programming frameworks which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches. In this work we describe our first experiences in porting to FPGAs an HPC application used to simulate the Rayleigh-Taylor instability of fluids with different density and temperature using Lattice Boltzmann Methods. This activity is done in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system at exascale level based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and allows targeting FPGA devices as accelerators, but also commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC embedding 16nm FinFET+ programmable logic and a multi-core Arm CPU.

Energy-Efficiency Evaluation of FPGAs for Floating-Point Intensive Workload

Abstract:

In this work we describe a method to measure the computing performance and energy-efficiency to be expected of an FPGA device. The motivation for this work is their possible usage as accelerators in the context of floating-point intensive HPC workloads. In fact, FPGA devices were not considered in the past to be an efficient option for floating-point intensive computations, but more recently, with the advent of dedicated DSP units and the increased amount of resources in each chip, interest in these devices has risen. Another obstacle to a wide adoption of FPGAs in the HPC field has been the low-level hardware knowledge commonly required to program them, using Hardware Description Languages (HDLs). This issue has also recently been mitigated by the introduction of higher-level programming frameworks adopting so-called High-Level Synthesis approaches, reducing the development time and shortening the gap between the skills required to program FPGAs and those commonly held by HPC software developers. In this work we apply the proposed method to estimate the maximum floating-point performance and energy-efficiency of the FPGA embedded in a Xilinx Zynq Ultrascale+ MPSoC hosted on a Trenz board.

Scalability Analysis of Optical Beneš Networks Based on Thermally/Electrically tuned Mach-Zehnder Interferometers

Abstract:

Silicon photonic interconnects are a promising technology for scaling computing systems into the exascale domain. However, significant challenges exist in terms of optical losses and complexity. In this work, we examine the applicability of thermally/electrically tuned Beneš networks based on Mach-Zehnder Interferometers for on-chip interconnects, as regards their scalability and how optical loss and laser power scale with the number of endpoints. In addition, we propose three hardware-inspired routing strategies that leverage the inherent asymmetry present in the switching components. We evaluate a range of NoC sizes, from 16 up to 1024 endpoints, using four realistic workloads, and find very promising results. Our routing strategies offer an optical loss reduction of up to 32% as well as a laser power reduction of 33% for 32 endpoints.
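The scaling question studied above has a simple structural core: a Beneš network on N endpoints consists of 2·log₂(N) − 1 stages of 2×2 switches, so the number of MZIs a worst-case path traverses, and hence its insertion loss, grows logarithmically with N. A small sketch (the 0.5 dB per-stage loss is an illustrative assumption, not a figure from the paper):

```python
import math

def benes_stages(n_endpoints):
    """Number of 2x2-switch stages in a Benes network on n endpoints
    (n must be a power of two)."""
    return 2 * int(math.log2(n_endpoints)) - 1

def worst_case_loss_db(n_endpoints, loss_per_stage_db=0.5):
    """Optical loss of a path crossing every stage. The 0.5 dB/stage
    figure is an illustrative assumption, not a number from the paper."""
    return benes_stages(n_endpoints) * loss_per_stage_db
```

Since required laser power must overcome the worst-case path loss, routing strategies that steer traffic through lower-loss switch states directly reduce the laser power budget, which is the effect the paper quantifies.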

DOI: https://doi.org/10.1145/3356045.3360715

Download Paper Here

Low Power High Performance Computing on Arm System-on-Chip in Astrophysics

Abstract:

In this paper, we quantitatively evaluate the impact of computation on the energy consumption of Arm MPSoC platforms, exploiting both CPUs and embedded GPUs. Performance and energy measurements are made on a direct N-body code, a real scientific application from the astrophysical domain. The time-to-solution, energy-to-solution and energy delay product obtained using different software configurations are compared with those obtained on a general-purpose x86 desktop and PCIe GPGPU. With this work, we investigate the possibility of using commodity single boards based on Arm MPSoCs as an HPC computational resource for real astrophysical production runs. Our results show to what extent those boards can be used and which modifications are necessary for a production code to profit from them. A crucial finding of this work is the effect of emulated double precision on GPU performance, which allows embedded and gaming GPUs to be used as excellent HPC resources.

DOI: 10.1007/978-3-030-32520-6_33 


Direct N-body Application on Low-Power and Energy-Efficient Parallel Architectures

Abstract:

The aim of this work is to quantitatively evaluate the impact of computation on the energy consumption of ARM MPSoC platforms, exploiting CPUs, embedded GPUs and FPGAs. One of them possibly represents the future of High Performance Computing systems: a prototype of an Exascale supercomputer. Performance and energy measurements are made using a state-of-the-art direct N-body code from the astrophysical domain. We provide a comparison of the time-to-solution and energy delay product metrics for different software configurations. We have shown that FPGA technologies can be used for application kernel acceleration and are emerging as a promising alternative to “traditional” technologies for HPC, which focus purely on peak performance rather than on power efficiency.
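The benchmarked kernel is the direct-summation N-body method: for every body, accumulate the softened gravitational pull of every other body, an O(N²) computation. A minimal NumPy sketch of that kernel (an illustration, not the production code used in the paper):

```python
import numpy as np

def nbody_accelerations(pos, mass, softening=1e-3):
    """Direct O(N^2) gravitational accelerations (G taken as 1).

    pos: (N, 3) positions; mass: (N,) masses.
    Plain NumPy sketch, not the production code benchmarked in the paper.
    """
    # Pairwise displacement vectors r_ij = x_j - x_i
    dx = pos[None, :, :] - pos[:, None, :]
    # Softened inverse-cube distances avoid the i == j singularity
    inv_r3 = (np.sum(dx**2, axis=-1) + softening**2) ** -1.5
    np.fill_diagonal(inv_r3, 0.0)
    # a_i = sum_j m_j * r_ij / |r_ij|^3
    return np.einsum("ij,ijk,j->ik", inv_r3, dx, mass)
```

The arithmetic is dominated by the inverse-cube term, which is where the paper's observation about emulated double precision on GPUs and DSP-based FPGA pipelines matters most.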

arXiv: https://arxiv.org/abs/1910.14496

Download Paper Here

Software and Hardware co-design for low-power HPC platforms

Abstract:

In order to keep an HPC cluster viable in terms of economy, serious cost limitations on the hardware and software deployment should be considered, prompting researchers to reconsider the design of modern HPC platforms. In this paper we present a cross-layer communication architecture suitable for emerging HPC platforms based on heterogeneous multiprocessors. We propose simple hardware primitives that enable protected, reliable and virtualized user-level communication and can easily be integrated in the same package with the processing unit. Using an efficient user-space software stack, the proposed architecture provides efficient, low-latency communication mechanisms to HPC applications. Our implementation of the MPI standard that exploits the aforementioned capabilities delivers point-to-point and collective primitives with low overheads, including an eager protocol with an end-to-end latency of 1.4 μs. We port and evaluate our communication stack using real HPC applications in a cluster of 128 ARMv8 processors that are tightly coupled with FPGA logic. The network interface primitives occupy less than 25% of the FPGA logic and only 3 Mbits of SRAM, while they can easily saturate the 16 Gb/s links in our platform.

DOI: https://doi.org/10.1007/978-3-030-34356-9_9 

Download Paper Here

Exploring Complex Brain-Simulation Workloads on Multi-GPU Deployments

Abstract:

In-silico brain simulations are the de-facto tools computational neuroscientists use to understand large-scale and complex brain-function dynamics. Current brain simulators do not scale efficiently enough to large-scale problem sizes (e.g., >100,000 neurons) when simulating biophysically complex neuron models. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA’s GPUDirect technology on computationally challenging brain models and to assess their scalability. The brain model used is a state-of-the-art, extended Hodgkin-Huxley, biophysically meaningful, three-compartmental model of the inferior-olivary nucleus. The Hodgkin-Huxley model is the most widely adopted conductance-based neuron representation, and thus the results from simulating this representative workload are relevant for many other brain experiments. Not only the actual network-simulation times but also the network-setup times were taken into account when designing and benchmarking the multi-GPU version, an aspect often ignored in similar previous work. Network sizes varying from 65K to 2M cells, with 10 and 1,000 synapses per neuron were executed on 8, 16, 24, and 32 GPUs. Without loss of generality, simulations were run for 100 ms of biological time. Findings indicate that communication overheads do not dominate overall execution while scaling the network size up is computationally tractable. This scalable design proves that large-network simulations of complex neural models are possible using a multi-GPU design with GPUDirect.

DOI: https://doi.org/10.1145/3371235

Download Paper Here