Supercomputing Frontiers and Innovations

Accelerating Seismic Redatuming Using Tile Low-Rank Approximations on NEC SX-Aurora TSUBASA

2021-09-14T23:53:25+05:00

With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of the Earth must be numerically re-positioned inside the subsurface where reflections have originated, a process referred to as redatuming. The recently developed Marchenko method is able to handle full-wavefield data including multiple arrivals. A downside of this approach is that a multi-dimensional convolution operator must be repeatedly evaluated to solve an expensive inverse problem. As such an operator applies multiple dense matrix-vector multiplications (MVM), we identify and leverage the data sparsity structure for each frequency matrix and propose to accelerate the MVM step using tile low-rank (TLR) matrix approximations. We study the TLR impact on time-to-solution for the MVM using different accuracy thresholds whilst at the same time assessing the quality of the resulting subsurface seismic wavefields and show that TLR leads to a minimal degradation in terms of signal-to-noise ratio on a 3D synthetic dataset. We mitigate the load imbalance overhead and provide performance evaluation on two distributed-memory systems. Our MPI+OpenMP TLR-MVM implementation reaches up to 3X performance speedup against the dense MVM counterpart from NEC scientific library on 128 NEC SX-Aurora TSUBASA cards. Thanks to the second generation of high bandwidth memory technology, it further attains up to 67X performance speedup compared to the dense MVM from Intel MKL when running on 128 dual-socket 20-core Intel Cascade Lake nodes with DDR4 memory. This corresponds to 110 TB/s of aggregated sustained bandwidth for our TLR-MVM implementation, without suffering deterioration in the quality of the reconstructed seismic wavefields.

Porting and Optimizing Molecular Docking onto the SX-Aurora TSUBASA Vector Computer

2021-09-14T23:53:25+05:00

In computer-aided drug design, the rapid identification of drugs is critical for combating diseases. A key method in this field is molecular docking, which aims to predict the interactions between two molecules. Molecular docking involves long simulations running compute-intensive algorithms, and thus, can profit a lot from hardware-based acceleration. In this work, we investigate the performance efficiency of the SX-Aurora TSUBASA vector computer for such simulations. Specifically, we present our methodology for porting and optimizing AutoDock, a widely-used molecular docking program. Using a number of platform-specific code optimizations, we achieved executions on the SX-Aurora TSUBASA that are in average 3.6× faster than on modern 128-core CPU servers, and up to a certain extent, competitive to V100 and A100 GPUs. To the best of our knowledge, this is the first molecular docking implementation for the SX-Aurora TSUBASA.

First Experience of Accelerating a Field-Induced Chiral Transition Simulation Using the SX-Aurora TSUBASA

2021-09-14T23:53:25+05:00

An analysis method based on the Ginzburg-Landau equation for the superconductivity is applied to the field-induced chiral transition simulation (FICT). However, the FICT is time consuming because it takes approximately 10 hours on a single SX-ACE vector processor. Moreover, the FICT must be repeatedly performed with parameters changed to understand the mechanism of the phenomenon. The newly emerged SX-Aurora TSUBASA, the successor of the SX-ACE processor, is expected to provide much higher performance to the programs executed on the SX-ACE as is. However, the SX-Aurora TSUBASA processor has changed its architecture of compute nodes and gives users three different execution models, which leads to users’ concerns and questions in terms of how three execution models should be selectively used. In this paper, we report the first experience of using the SX-Aurora TSUBASA processor for the FICT. Specifically, we have developed three implementations of the FICT corresponding to the three execution models suggested by the SX-Aurora TSUBASA. For acceleration of the FICT, improvement of the vectorization ratio in the program execution and the efficient transfer of data to the general purpose processor as the vector host from the vector processor as the vector engine is explored. The evaluation in this paper shows how acceleration of the FICT is achieved as well as how much effort of users is required.

Evaluating the Performance of OpenMP Offloading on the NEC SX-Aurora TSUBASA Vector Engine

2021-09-14T23:53:26+05:00

The NEC SX-Aurora TSUBASA vector engine (VE) follows the tradition of long vector processors for high-performance computing (HPC). The technology combines the vector computing capabilities with the popularity of standard x86 architecture by integrating it as an accelerator. To decrease the burden of code porting for different accelerator types, the OpenMP specification is designed to be single parallel programming model for all of them. Besides the availability of compiler and runtime implementations, the functionality as well as the performance is important for the usability and acceptance of this paradigm. In this work, we present LLVM-based solutions for OpenMP target device offloading from the host to the vector engine and vice versa (reverse offloading). Therefore, we use our source-to-source transformation tool sotoc as well as the native LLVM-VE code path. We assess the functionality and present the first performance numbers of real-world HPC kernels. We discuss the advantages and disadvantage of the different approaches and show that our implementation is competitive to other GPU OpenMP runtime implementations. Our work gives scientific programmers new opportunities and flexibilities for the development of scalable OpenMP offloading applications for SX-Aurora TSUBASA.

Performance and Power Analysis of a Vector Computing System

2021-09-14T23:53:26+05:00

The performance of recent computing systems has drastically improved due to the increase in the number of cores. However, this approach is reaching the limitation due to the power constraints of facilities. Instead, this paper focuses on a vector processing with long vector length that has a potential to realize high performance and high power efficiency. This paper discusses the potential through the optimization of two benchmarks, the Himeno and HPCG benchmarks, for the latest vector computing system SX-Aurora TSUBASA. The architecture of SX-Aurora TSUBASA owes the high efficiency to making good of its long vector length. Considering these characteristics, various levels of optimizations required for a large-scale vector computing system are examined such as vectorization, loop unrolling, use of cache, domain decomposition, process mapping, and problem size tuning. The evaluation and analysis suggest that the optimizations improve the sustained performance, power efficiency, and scalability of both benchmarks. Therefore, it is clarified that the SX-Aurora TSUBASA architecture can achieve higher power efficiency due to its high sustained memory bandwidth paired with the long vector computing.

Distributed Graph Algorithms for Multiple Vector Engines of NEC SX-Aurora TSUBASA Systems

2021-09-14T23:53:26+05:00

This paper describes the world-first attempt to develop distributed graph algorithm implementations, aimed for modern NEC SX-Aurora TSUBASA vector systems. Such systems are equipped with up to eight powerful vector engines, which are capable to significantly accelerate graph processsing and simultaneously increase the scale of processed input graphs. This paper describes distributed implementations of three widely-used graph algorithms: Page Rank (PR), Bellman-Ford Single Source Shortest Paths (further referred as SSSP) and Hyperlink-Induced Topic Search (HITS), evaluating their performance and scalability on Aurora 8 system. In this paper we describe graph partitioning strategies, communication strategies, programming models and single-VE optimizations used in these implementations. The developed implementations achieve 40, 6.6 and 1.3 GTEPS performance on PR, SSSP and HITS algorithm on 8 vector engines, at the same time achieving up to 1.5x, 2x and 2.5x acceleration on 2, 4 and 8 vector engines of Aurora 8 systems. Finally, this paper describes an approach to incorporate distributed graph processing support into our previously developed Vector Graph Library (VGL) framework – a novel framework for graph analytics on NEC SX-Aurora TSUBASA architecture.

Optimizing Load Balance in a Parallel CFD Code for a Large-scale Turbine Simulation on a Vector Supercomputer

2021-09-14T23:53:27+05:00

A turbine for power generation is one of the essential infrastructures in our society. A turbine's failure causes severe social and economic impacts on our everyday life. Therefore, it is necessary to foresee such failures in advance. However, it is not easy to expect these failures from a real turbine. Hence, it is required to simulate various events occurring in the turbine by numerical simulations of the turbine. A multiphysics CFD code, ‘‘Numerical Turbine,’' has been developed on vector supercomputer systems for large-scale simulations of unsteady wet steam flows inside a turbine. To solve this problem, the Numerical Turbine code is a block structure code using MPI parallelization, and the calculation space consists of grid blocks of different sizes. Therefore, load imbalance occurs when executing the code in MPI parallelization. This paper creates an estimation model that finds the calculation time from each grid block's calculation amount and calculation performance. It proposes an OpenMP parallelization method for the load balance of MPI applications. This proposed method reduces the load imbalance by considering the vector performance according to the calculation amount based on the model. Moreover, this proposed method recognizes the need to reduce the load imbalance without pre-execution. The performance evaluation shows that the proposed method improves the load balance from 24.4 % to 9.3 %.