
Publications

Google Scholar profile

 
High Performance Computing code optimizations: Tuning performance and accuracy. Pablo de Oliveira Castro. October 2022. Habilitation à diriger des recherches, Université Paris-Saclay. [ bib | http | .pdf ]
Since the beginning of the field of high performance computing (HPC) after World War II, there has been a rapid increase in computing resources and simulation complexity. HPC fine-grained simulation of complex systems, such as fluid dynamics or molecular interactions, has enabled important advances in many scientific fields and in industry. Reducing the cost, both in time and energy, of computer simulations is critical. The precision of simulations should be sufficient to provide scientific insights but as low as possible to save energy and computation time. HPC relies on complex heterogeneous architectures with massive concurrency at different execution levels, a deep memory hierarchy, and dedicated interconnect networks. This inherent complexity generates an optimization space composed of many factors such as the chosen architecture, the algorithmic variant, the floating-point precision and format, the compiler optimization passes, and the thread mapping. The first part of the manuscript presents methods to accelerate the optimization of HPC applications. We present the design of CERE, an open-source tool that automatically decomposes applications into standalone regions, called codelets. Instead of studying the whole application, a minimal set of codelets capturing its performance behavior serves as a proxy for optimization. The optimization space is further reduced by the use of adaptive sampling techniques that estimate the performance from a limited number of factor combinations. We demonstrate these techniques in different domains, such as optimizing a seismic imaging proto-application and reducing simulation time for hardware-software co-design. The second part of the manuscript uses alternative floating-point models, such as Monte Carlo arithmetic, to explore the trade-off between numerical precision and performance. A probabilistic definition of the number of significant digits is introduced and used to estimate the accuracy of a computation. We discuss the design of verificarlo, an open-source framework for numerical optimization, and demonstrate how it can be applied to pinpoint numerical bugs in large HPC codes such as neuroimaging pipelines, Density Functional Theory quantum mechanical modeling, or structure simulations. Verificarlo is also used to identify the parts of the code that can use smaller floating-point formats, reducing the computation cost. We apply these techniques to optimize the speed and energy consumption of a conjugate-gradient solver used in Computational Fluid Dynamics. Finally, we examine the challenge of reducing the power consumption of HPC through a survey of the literature. We advocate for sobriety in our usage of computing resources: instead of always reaching for more complex simulations, we should find the right fit for our problem.

Environmental impact of computation

Computer Arithmetic

 
Verificarlo CI: continuous integration for numerical optimization and debugging. Aurélien Delval, François Coppens, Eric Petit, Roman Iakymchuk, and Pablo de Oliveira Castro. In Parallel Computational Fluid Dynamics (ParCFD) 2024, Bonn, Germany, September 2024. [ bib | http | .pdf ]
Floating-point accuracy is an important concern when developing numerical simulations or other compute-intensive codes. Tracking the introduction of numerical regressions is often delayed until they provoke unexpected bugs for the end-user. In this paper, we introduce Verificarlo CI, a continuous integration workflow for the numerical optimization and debugging of a code over the course of its development. We demonstrate the applicability of Verificarlo CI on two test-case applications.
 
Bounds on non-linear errors for variance computation with stochastic rounding. El-Mehdi El Arar, Devan Sohier, Pablo de Oliveira Castro, and Eric Petit. SIAM Journal on Scientific Computing, 46(5):B579--B599, 2024. [ bib | DOI | .pdf ]
The main objective of this work is to investigate nonlinear errors and pairwise summation using stochastic rounding (SR) in variance computation algorithms. We estimate the forward error of computations under SR through two methods: the first is based on a bound of the variance and the Bienaymé–Chebyshev inequality, while the second is based on martingales and the Azuma–Hoeffding inequality. The study shows that for pairwise summation, using SR results in a probabilistic bound of the forward error proportional to √(log n)·u rather than the deterministic bound in O(log(n)·u) when using the default rounding mode. We examine two algorithms that compute the variance, one called “textbook” and the other “two-pass,” which both exhibit nonlinear errors. Using the two methods mentioned above, we show that the forward errors of these algorithms have probabilistic bounds under SR in O(√n·u) instead of O(n·u) for the deterministic bounds. We show that this advantage holds using pairwise summation for both textbook and two-pass, with probabilistic bounds of the forward error proportional to √(log n)·u.
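As a rough, runnable illustration of the objects studied in this paper — not the authors' implementation — the NumPy sketch below emulates SR-nearness on a toy 12-bit significand format and applies it to pairwise summation and the two-pass variance algorithm. The format width, array size, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sr_round(x, bits=12):
    """Stochastically round x to `bits` significand bits (SR-nearness).

    The candidates are the two neighbouring representable values; rounding up
    with probability proportional to the distance to the lower neighbour makes
    the rounding unbiased in expectation.
    """
    x = np.asarray(x, dtype=np.float64)
    _, e = np.frexp(x)
    ulp = np.ldexp(1.0, e - bits)          # spacing of the toy format around x
    lo = np.floor(x / ulp) * ulp
    p_up = (x - lo) / ulp
    return np.where(rng.random(x.shape) < p_up, lo + ulp, lo)

def pairwise_sum_sr(a, bits=12):
    """Pairwise (recursive) summation, rounding every partial sum with SR."""
    if a.size == 1:
        return a[0]
    mid = a.size // 2
    return sr_round(pairwise_sum_sr(a[:mid], bits) + pairwise_sum_sr(a[mid:], bits), bits)

def two_pass_variance_sr(a, bits=12):
    """Two-pass variance: compute the mean first, then the sum of squared deviations."""
    n = a.size
    mean = sr_round(pairwise_sum_sr(a, bits) / n, bits)
    dev2 = sr_round(sr_round(a - mean, bits) ** 2, bits)
    return sr_round(pairwise_sum_sr(dev2, bits) / (n - 1), bits)

a = rng.random(1 << 12)
print("SR two-pass variance:", float(two_pass_variance_sr(a)))
print("float64 reference   :", np.var(a, ddof=1))
```

Running the sketch several times gives a spread of results whose variance reflects the probabilistic bounds discussed in the paper.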
 
TREXIO: A file format and library for quantum chemistry. Evgeny Posenitskiy, Vijay Gopal Chilkuri, Abdallah Ammar, Michał Hapka, Katarzyna Pernal, Ravindra Shinde, Edgar Josué Landinez Borda, Claudia Filippi, Kosuke Nakano, Otto Kohulák, Sandro Sorella, Pablo de Oliveira Castro, William Jalby, Pablo López Ríos, Ali Alavi, and Anthony Scemama. The Journal of Chemical Physics, 158(17), 05 2023. [ bib | DOI | http | .pdf ]
TREXIO is an open-source file format and library developed for the storage and manipulation of data produced by quantum chemistry calculations. It is designed with the goal of providing a reliable and efficient method of storing and exchanging wave function parameters and matrix elements, making it an important tool for researchers in the field of quantum chemistry. In this work, we present an overview of the TREXIO file format and library. The library consists of a front-end implemented in the C programming language and two different back-ends: a text back-end and a binary back-end utilizing the hierarchical data format version 5 library, which enables fast read and write operations. It is compatible with a variety of platforms and has interfaces for the Fortran, Python, and OCaml programming languages. In addition, a suite of tools has been developed to facilitate the use of the TREXIO format and library, including converters for popular quantum chemistry codes and utilities for validating and manipulating data stored in TREXIO files. The simplicity, versatility, and ease of use of TREXIO make it a valuable resource for researchers working with quantum chemistry data.
 
Stochastic Rounding Variance and Probabilistic Bounds: A New Approach. El-Mehdi El Arar, Devan Sohier, Pablo de Oliveira Castro, and Eric Petit. SIAM Journal on Scientific Computing, 45(5):C255--C275, 2023. [ bib | DOI | http | .pdf ]
Stochastic rounding (SR) offers an alternative to the deterministic IEEE-754 floating-point rounding modes. In some applications such as PDEs, ODEs, and neural networks, SR empirically improves the numerical behavior and convergence to accurate solutions while the theoretical background remains partial. Recent works by Ipsen, Zhou, Higham, and Mary have computed SR probabilistic error bounds for basic linear algebra kernels. For example, the inner product SR probabilistic bound of the forward error is proportional to √n·u instead of n·u for the default rounding mode. To compute the bounds, these works show that the errors accumulated in computation form a martingale. This paper proposes an alternative framework to characterize SR errors based on the computation of the variance. We pinpoint common error patterns in numerical algorithms and propose a lemma that bounds their variance. For each probability and through the Bienaymé–Chebyshev inequality, this bound leads to a better probabilistic error bound in several situations. Our method has the advantage of providing a tight probabilistic bound for all algorithms fitting our model. We show how the method can be applied to give SR error bounds for the inner product and Horner polynomial evaluation.
 
The Positive Effects of Stochastic Rounding in Numerical Algorithms. El-Mehdi El Arar, Devan Sohier, Pablo de Oliveira Castro, and Eric Petit. In 29th IEEE Symposium on Computer Arithmetic ARITH 2022, Virtual conference, France, 2022. [ bib | http | .pdf ]
Recently, stochastic rounding (SR) has been implemented in specialized hardware but most current computing nodes do not yet support this rounding mode. Several works empirically illustrate the benefit of stochastic rounding in various fields such as neural networks and ordinary differential equations. For some algorithms, such as summation, inner product or matrix-vector multiplication, it has been proved that SR provides probabilistic error bounds better than the traditional deterministic bounds. In this paper, we extend this theoretical ground for a wider adoption of SR in computer architecture. First, we analyze the biases of the two SR modes: SR-nearness and SR-up-or-down. We demonstrate on a case study of Euler's forward method that IEEE-754 default rounding modes and SR-up-or-down accumulate rounding errors across iterations and that SR-nearness, being unbiased, does not. Second, we prove an O(√n) probabilistic bound on the forward error of Horner's polynomial evaluation method with SR, improving on the known deterministic O(n) bound.
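A minimal sketch of the kind of case study described above, under stated assumptions (a toy 12-bit format, the ODE y' = -y, and simple scalar helpers, none of which come from the paper): it contrasts the unbiased SR-nearness mode with SR-up-or-down on forward Euler, using an unrounded float64 run as the reference.

```python
import numpy as np

rng = np.random.default_rng(1)

def neighbours(x, bits):
    """Lower and upper representable neighbours of x in a toy `bits`-bit significand format."""
    _, e = np.frexp(x)
    ulp = np.ldexp(1.0, int(e) - bits)
    lo = np.floor(x / ulp) * ulp
    return lo, lo + ulp

def sr_nearness(x, bits):
    lo, hi = neighbours(x, bits)
    return hi if rng.random() < (x - lo) / (hi - lo) else lo   # unbiased: E[round(x)] = x

def sr_up_or_down(x, bits):
    lo, hi = neighbours(x, bits)
    return hi if rng.random() < 0.5 else lo                    # biased in general

def euler(rounding, bits=12, h=1e-3, steps=1000):
    """Forward Euler for y' = -y, y(0) = 1, rounding each update with `rounding`."""
    y = 1.0
    for _ in range(steps):
        y = rounding(y - h * y, bits)
    return y

reference = euler(lambda x, bits: x)            # plain float64, no extra rounding
runs = 50                                       # average over runs to observe the bias
for name, mode in [("SR-nearness", sr_nearness), ("SR-up-or-down", sr_up_or_down)]:
    mean = np.mean([euler(mode) for _ in range(runs)])
    print(f"{name:14s} mean error vs float64 Euler: {abs(mean - reference):.2e}")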
 
Numerical uncertainty in analytical pipelines lead to impactful variability in brain networks. Gregory Kiar, Yohan Chatelain, Pablo de Oliveira Castro, Eric Petit, Ariel Rokem, Gaël Varoquaux, Bratislav Misic, Alan C. Evans, and Tristan Glatard. PLOS ONE, 16(11):1--16, 11 2021. [ bib | DOI | http ]
The analysis of brain-imaging data requires complex processing pipelines to support findings on brain function or pathologies. Recent work has shown that variability in analytical decisions, small amounts of noise, or computational environments can lead to substantial differences in the results, endangering the trust in conclusions. We explored the instability of results by instrumenting a structural connectome estimation pipeline with Monte Carlo Arithmetic to introduce random noise throughout. We evaluated the reliability of the connectomes, the robustness of their features, and the eventual impact on analysis. The stability of results was found to range from perfectly stable (i.e., all digits of data significant) to highly unstable (i.e., 0–1 significant digits). This paper highlights the potential of leveraging induced variance in estimates of brain connectivity to reduce the bias in networks without compromising reliability, alongside increasing the robustness and potential upper-bound of their applications in the classification of individual differences. We demonstrate that stability evaluations are necessary for understanding error inherent to brain imaging experiments, and how numerical analysis can be applied to typical analytical workflows both in brain imaging and other domains of computational sciences, as the techniques used were data and context agnostic and globally relevant. Overall, while the extreme variability in results due to analytical instabilities could severely hamper our understanding of brain organization, it also affords us the opportunity to increase the robustness of findings.
 
Confidence Intervals for Stochastic Arithmetic. Devan Sohier, Pablo de Oliveira Castro, François Févotte, Bruno Lathuilière, Eric Petit, and Olivier Jamond. ACM Transactions on Mathematical Software, 47(2), April 2021. [ bib | DOI | http | .pdf | .pdf ]
Quantifying errors and losses due to the use of Floating-point (FP) calculations in industrial scientific computing codes is an important part of the Verification, Validation, and Uncertainty Quantification process. Stochastic Arithmetic is one way to model and estimate FP losses of accuracy, which scales well to large, industrial codes. It exists in different flavors, such as CESTAC or MCA, implemented in various tools such as CADNA, Verificarlo, or Verrou. These methodologies and tools are based on the idea that FP losses of accuracy can be modeled via randomness. Therefore, they share the same need to perform a statistical analysis of program results to estimate the significance of the results. In this article, we propose a framework to perform a solid statistical analysis of Stochastic Arithmetic. This framework unifies all existing definitions of the number of significant digits (CESTAC and MCA), and also proposes a new quantity of interest: the number of digits contributing to the accuracy of the results. Sound confidence intervals are provided for all estimators, both in the case of normally distributed results and in the general case. The use of this framework is demonstrated by two case studies of industrial codes: Europlexus and code_aster.
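The sketch below gives the flavour of such an analysis, under assumptions that are mine rather than the article's exact estimators: a magnitude-based significant-bits estimate -log2(σ/|μ|) plus a one-sided chi-square confidence bound that presumes approximately normal samples. The SciPy call and the synthetic samples are illustrative.

```python
import numpy as np
from scipy import stats

def significant_bits(samples, confidence=0.95):
    """Point estimate and lower confidence bound on the significant bits of a stochastic result.

    Uses -log2(sigma / |mu|) as the point estimate and a one-sided chi-square
    upper bound on sigma for the confidence bound (normality assumed).
    """
    x = np.asarray(samples, dtype=np.float64)
    n, mu, sigma = x.size, x.mean(), x.std(ddof=1)
    if sigma == 0.0:
        return np.inf, np.inf
    s_hat = -np.log2(sigma / abs(mu))
    chi2 = stats.chi2.ppf(1.0 - confidence, df=n - 1)
    sigma_up = sigma * np.sqrt((n - 1) / chi2)      # upper bound on sigma at `confidence`
    return s_hat, -np.log2(sigma_up / abs(mu))

# Thirty synthetic "stochastic arithmetic" samples of the same scalar result.
rng = np.random.default_rng(2)
samples = 1.0 + rng.normal(0.0, 1e-7, size=30)
estimate, lower = significant_bits(samples)
print(f"significant bits: {estimate:.1f} (95% lower bound: {lower:.1f})")
```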
 
Shadow computation with BFloat16 to compute numerical accuracy. David Defour, Pablo De Oliveira Castro, Matei Istoan, and Eric Petit. In IEEE 28th Symposium on Computer Arithmetic (ARITH), June 2021. [ bib | http ]
 
A Study of the Effects and Benefits of Custom-Precision Mathematical Libraries for HPC Codes. E. Brun, D. Defour, P. De Oliveira Castro, M. Istoan, D. Mancusi, E. Petit, and A. Vaquet. IEEE Transactions on Emerging Topics in Computing, 9(3):1467--1478, 2021. [ bib | DOI ]
Mathematical libraries are being specifically developed to use fixed-width data-paths on processors and target common floating-point formats like binary32 and binary64. In this article, we propose a framework to evaluate the effects of the accuracy of mathematical library calls on scientific computations. First, for each call-site of a mathematical function, our tool collects the input-data profile. Then, using a heuristic exploration algorithm, we estimate the minimal required accuracy by rounding the result to lower precisions. The data profile and accuracy measurement per call-site are used to speculatively select the mathematical function implementation with the most appropriate accuracy for a given scenario. We have tested the methodology with the Intel MKL VML library with predefined accuracy levels. We demonstrate the benefits of our approach on two real-world applications: SGP4, a satellite tracking application, and PATMOS, a Monte Carlo neutron transport code. We experiment and discuss its generalization across data-sets, and finally propose a speculative runtime implementation for PATMOS. The experiment provides an insight into the performance improvements that can be achieved by leveraging the control of per-function call-site accuracy-mode execution of the Intel MKL VML library.
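To make the exploration idea concrete, here is a small self-contained sketch (not the paper's tool, which works on Intel MKL VML call-sites): it degrades every math.exp result to a given number of significand bits and lowers that number while a toy kernel's output stays within a tolerance. The kernel, tolerance, and bit schedule are assumptions.

```python
import math
import numpy as np

def round_bits(x, bits):
    """Round x to `bits` significand bits (round-to-nearest), emulating a less accurate result."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                     # x = m * 2**e, 0.5 <= |m| < 1
    return math.ldexp(round(m * 2.0**bits) / 2.0**bits, e)

def kernel(xs, exp_bits):
    """Toy call-site: every math.exp result is rounded to `exp_bits` significand bits."""
    return sum(round_bits(math.exp(x), exp_bits) for x in xs)

xs = np.random.default_rng(3).uniform(-1.0, 1.0, 1000)
reference = kernel(xs, 53)                   # full binary64 accuracy
tolerance = 1e-6                             # relative error the application tolerates

bits = 53
while bits > 3:                              # lower the precision while the output is acceptable
    if abs(kernel(xs, bits - 1) - reference) / abs(reference) > tolerance:
        break
    bits -= 1
print("minimal significand bits for this call-site:", bits)
```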
 
Comparing perturbation models for evaluating stability of neuroimaging pipelines. Gregory Kiar, Pablo de Oliveira Castro, Pierre Rioux, Eric Petit, Shawn T Brown, Alan C Evans, and Tristan Glatard. The International Journal of High Performance Computing Applications, 34(5):491--501, 2020. [ bib | DOI | www: | http ]
With an increase in awareness regarding a troubling lack of reproducibility in analytical software tools, the degree of validity in scientific derivatives and their downstream results has become unclear. The nature of reproducibility issues may vary across domains, tools, data sets, and computational infrastructures, but numerical instabilities are thought to be a core contributor. In neuroimaging, unexpected deviations have been observed when varying operating systems, software implementations, or adding negligible quantities of noise. In the field of numerical analysis, these issues have recently been explored through Monte Carlo Arithmetic, a method involving the instrumentation of floating-point operations with probabilistic noise injections at a target precision. Exploring multiple simulations in this context allows the characterization of the result space for a given tool or operation. In this article, we compare various perturbation models to introduce instabilities within a typical neuroimaging pipeline, including (i) targeted noise, (ii) Monte Carlo Arithmetic, and (iii) operating system variation, to identify the significance and quality of their impact on the resulting derivatives. We demonstrate that even low-order models in neuroimaging such as the structural connectome estimation pipeline evaluated here are sensitive to numerical instabilities, suggesting that stability is a relevant axis upon which tools are compared, alongside more traditional criteria such as biological feasibility, computational efficiency, or, when possible, accuracy. Heterogeneity was observed across participants, which clearly illustrates a strong interaction between the tool and data set being processed, requiring that the stability of a given tool be evaluated with respect to a given cohort. We identify use cases for each perturbation method tested, including quality assurance, pipeline error detection, and local sensitivity analysis, and make recommendations for the evaluation of stability in a practical and analytically focused setting. Identifying how these relationships and recommendations scale to higher-order computational tools, distinct data sets, and their implication on biological feasibility remain exciting avenues for future work.
 
Custom-Precision Mathematical Library Explorations for Code Profiling and Optimization. David Defour, Pablo de Oliveira Castro, Matei Istoan, and Eric Petit. In 27th IEEE Symposium on Computer Arithmetic, ARITH 2020, pages 121--124, 2020. [ bib | http ]
The typical processors used for scientific computing have fixed-width data-paths. This implies that mathematical libraries were specifically developed to target each of these fixed precisions (binary16, binary32, binary64). However, to address the increasing energy consumption and throughput requirements of scientific applications, library and hardware designers are moving beyond this one-size-fits-all approach. In this article, we propose to study the effects and benefits of using user-defined floating-point formats and target accuracies in calculations involving mathematical functions. Our tool collects input-data profiles and iteratively explores lower precisions for each call-site of a mathematical function in user applications. This profiling data will be a valuable asset for specializing and fine-tuning mathematical function implementations for a given application. We demonstrate the tool's capabilities on SGP4, a satellite tracking application. The profile data shows the potential for specialization and provides insight into where it is useful to provide variable-precision designs for elementary function evaluation.
 
Automatic exploration of reduced floating-point representations in iterative methods. Yohan Chatelain, Eric Petit, Pablo de Oliveira Castro, Ghislain Lartigue, and David Defour. In Euro-Par 2019 Parallel Processing - 25th International Conference, Lecture Notes in Computer Science. Springer, 2019. [ bib | .pdf | .pdf ]
With the ever-increasing need for computation of scientific applications, new application domains, and major energy constraints, the landscape of floating-point computation is changing. New floating-point representation formats are emerging and there is a need for tools to simulate their impact in legacy codes. In this paper, we propose an automatic tool to evaluate the effect of adapting the floating-point precision for each operation over time, which is particularly useful in iterative schemes. We present a backend to emulate any IEEE-754 floating-point operation in lower precision. We tested the resilience of our approach to numerical errors using Monte Carlo Arithmetic and demonstrated the effectiveness of this methodology on YALES2, a large Combustion-CFD HPC code, achieving a 28% to 67% reduction in communication volume by lowering precision.
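The following sketch conveys the idea on a toy Jacobi solver rather than on YALES2: each iterate is rounded to an emulated lower precision, and the working precision grows as the residual shrinks. The precision schedule, the solver, and the matrix are illustrative assumptions, not the paper's backend or application.

```python
import numpy as np

def round_bits(x, bits):
    """Round each entry of x to `bits` significand bits (emulated lower precision)."""
    m, e = np.frexp(x)
    scale = 2.0 ** bits
    return np.ldexp(np.round(m * scale) / scale, e)

def jacobi_adaptive(A, b, iters=200):
    """Jacobi iterations whose working precision grows as the residual shrinks.

    Early iterations only need a few significand bits; the schedule below
    (8 -> 16 -> 24 -> 53 bits) is an illustrative choice, not the paper's.
    """
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b)
    for _ in range(iters):
        res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
        bits = 8 if res > 1e-1 else 16 if res > 1e-3 else 24 if res > 1e-6 else 53
        x = round_bits((b - R @ x) / D, bits)   # only the final result of the update is rounded
    return x

rng = np.random.default_rng(4)
n = 50
A = rng.random((n, n)) + n * np.eye(n)          # diagonally dominant => Jacobi converges
b = rng.random(n)
x = jacobi_adaptive(A, b)
print("final relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```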
 
VeriTracer: Context-enriched tracer for floating-point arithmetic analysis. Yohan Chatelain, Pablo de Oliveira Castro, Eric Petit, David Defour, Jordan Bieder, and Marc Torrent. In 25th IEEE Symposium on Computer Arithmetic, ARITH 2018, Amherst, MA, USA. June 25th-27th, 2018, pages 65--72. IEEE, 2018. [ bib | .pdf | .pdf ]
VeriTracer automatically instruments a code and traces the accuracy of floating-point variables over time. VeriTracer enriches the visual traces with contextual information such as the call site path in which a value was modified. Contextual information is important to understand how the floating-point errors propagate in complex codes. VeriTracer is implemented as an LLVM compiler tool on top of Verificarlo. We demonstrate how VeriTracer can detect accuracy loss and quantify the impact of using a compensated algorithm on ABINIT, an industrial HPC application for Ab Initio quantum computation.
 
Verificarlo: Checking Floating Point Accuracy through Monte Carlo Arithmetic. Christophe Denis, Pablo de Oliveira Castro, and Eric Petit. In 23rd IEEE Symposium on Computer Arithmetic, ARITH 2016, Silicon Valley, CA, USA, July 10-13, 2016, pages 55--62, 2016. [ bib | DOI | http | .pdf ]
The numerical accuracy of floating-point computation is a well-studied topic that has not yet made its way to the end-user in scientific computing. Yet, it has become a critical issue with the recent requirements for code modernization to harness new highly parallel hardware and perform higher resolution computation. To democratize numerical accuracy analysis, it is important to propose tools and methodologies to study large use cases in a reliable and automatic way. In this paper, we propose verificarlo, an extension to the LLVM compiler to automatically use Monte Carlo Arithmetic in a transparent way for the end-user. It supports all the major languages including C, C++, and Fortran. Unlike source-to-source approaches, our implementation captures the influence of compiler optimizations on the numerical accuracy. We illustrate how Monte Carlo Arithmetic using the verificarlo tool outperforms the existing approaches on various use cases and is a step toward automatic numerical analysis.
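For readers unfamiliar with Monte Carlo Arithmetic, here is a minimal sketch of the arithmetic model only — verificarlo itself instruments code at the LLVM IR level and is not reproduced here. The virtual precision t, the summation example, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def inexact(x, t=24):
    """MCA perturbation: add uniform noise of magnitude 2**(e_x - t) (virtual precision t)."""
    x = np.asarray(x, dtype=np.float64)
    _, e = np.frexp(x)
    noise = np.ldexp(rng.uniform(-0.5, 0.5, x.shape), e - t)
    return np.where(x == 0.0, x, x + noise)    # exact zeros are left untouched

def mca_sum(a, t=24, samples=30):
    """Accumulate `a` left to right, perturbing inputs and outputs of every addition."""
    results = []
    for _ in range(samples):
        acc = 0.0
        for v in a:
            acc = inexact(inexact(acc, t) + inexact(v, t), t)
        results.append(float(acc))
    return np.array(results)

a = np.full(1000, 0.1)                         # 0.1 is not exactly representable in binary64
r = mca_sum(a)
mu, sigma = r.mean(), r.std(ddof=1)
print(f"mean = {mu:.10f}, estimated significant bits ~ {-np.log2(sigma / abs(mu)):.1f}")
```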

Performance Characterization

 
Scalable Work-Stealing Load-Balancer for HPC Distributed Memory Systems. Clement Fontenaille, Eric Petit, Pablo de Oliveira Castro, Seijilo Uemura, Devan Sohier, Piotr Lesnicki, Ghislain Lartigue, and Vincent Moureau. In COLOC: 2nd Workshop on Data Locality, in conjunction with Euro-Par 2018, 2018. [ bib ]
 
Piecewise holistic autotuning of parallel programs with CERE. Mihail Popov, Chadi Akel, Yohan Chatelain, William Jalby, and Pablo de Oliveira Castro. Concurrency and Computation: Practice and Experience, page e4190, 2017. [ bib | DOI | http | http ]
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve best performance. Autotuning substantially improves default parameters in many scenarios, but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: Each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate the CERE tuning of compiler optimizations, number of threads, thread affinity, and scheduling policy on both nonuniform memory access and heterogeneous architectures. Over the NAS benchmarks, we achieve an average speedup of 1.08x after tuning. Tuning a codelet is 13x cheaper than whole-program evaluation and predicts the tuning impact with 94.7% accuracy. Similarly, exploring thread configurations and scheduling policies for a Black-Scholes solver on a heterogeneous big.LITTLE architecture is over 40x faster using CERE.
 
Piecewise Holistic Autotuning of Compiler and Runtime Parameters. Mihail Popov, Chadi Akel, William Jalby, and Pablo de Oliveira Castro. In Christos Kaklamanis, Theodore S. Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2016 Parallel Processing - 22nd International Conference, volume 9833 of Lecture Notes in Computer Science, pages 238--250. Springer, 2016. [ bib | .pdf | .pdf ]
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve full potential performance. Autotuning substantially improves default parameters in many scenarios, but it is a costly process requiring a long iterative evaluation. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate CERE tuning of compiler optimizations, number of threads and thread affinity on a NUMA architecture. On average over the NAS 3.0 benchmarks, we achieve a speedup of 1.08x after tuning. Tuning a single codelet is 13x cheaper than whole-program evaluation and estimates the tuning impact on the original region with 94.7% accuracy. On a Reverse Time Migration (RTM) proto-application, we achieve a 1.11x speedup with a 200x cheaper exploration.
 
PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction. Mihail Popov, Chadi Akel, Florent Conti, William Jalby, and Pablo de Oliveira Castro. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 1151--1160. IEEE, 2015. [ bib | .pdf | .pdf ]
Evaluating the strong scalability of OpenMP applications is a costly and time-consuming process. It traditionally requires executing the whole application multiple times with different numbers of threads. We propose the Parallel Codelet Extractor and REplayer (PCERE), a tool to reduce the cost of scalability evaluation. PCERE decomposes applications into small pieces called codelets: each codelet maps to an OpenMP parallel region and can be replayed as a standalone program. To accelerate scalability prediction, PCERE replays codelets while varying the number of threads. Prediction speedup comes from two key ideas. First, the number of invocations during replay can be significantly reduced. Invocations that have the same performance are grouped together and a single representative is replayed. Second, sequential parts of the programs do not need to be replayed for each different thread configuration. PCERE codelets can be captured once and replayed accurately on multiple architectures, enabling cross-architecture parallel performance prediction. We evaluate PCERE on a C version of the NAS 3.0 Parallel Benchmarks (NPB). We achieve an average speed-up of 25x when evaluating the scalability of OpenMP applications, with an average error of 4.9% (median error of 1.7%).
 
CERE: LLVM Based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization. Pablo de Oliveira Castro, Chadi Akel, Eric Petit, Mihail Popov, and William Jalby. ACM Transactions on Architecture and Code Optimization (TACO), 12(1):6, 2015. [ bib | DOI | .pdf ]
This article presents Codelet Extractor and REplayer (CERE), an open-source framework for code isolation. CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and allows piecewise optimization of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation (IR) level. Therefore, CERE is language-agnostic and supports many input languages such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that have the same performance behavior. Then, it selects a reduced set of representative codelets and invocations, much faster to replay, which still accurately captures the original application. In addition, CERE supports recompiling and retargeting the extracted codelets. Therefore, CERE can be used for cross-architecture performance prediction or piecewise code optimization. On the SPEC 2006 FP benchmarks, CERE codelets cover 90.9% and accurately replay 66.3% of the execution time. We use CERE codelets in a realistic study to evaluate three different architectures on the NAS benchmarks. CERE accurately estimates each architecture's performance and is 7.3x to 46.6x cheaper than running the full benchmark.
 
Fine-grained Benchmark Subsetting for System Selection. Pablo de Oliveira Castro, Yuriy Kashnikov, Chadi Akel, Mihail Popov, and William Jalby. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '14, pages 132:132--132:142, New York, NY, USA, 2014. ACM. [ bib | DOI | http | .pdf | .pdf ]
System selection aims at finding the best architecture for a set of programs and workloads. It traditionally requires long-running benchmarks. We propose a method to reduce the cost of system selection. We break down benchmarks into elementary fragments of source code, called codelets. Then, we identify two causes of redundancy: first, similar codelets; second, codelets called repeatedly. The key idea is to minimize redundancy inside the benchmark suite to speed it up. For each group of similar codelets, only one representative is kept. For codelets called repeatedly and for which the performance does not vary across calls, the number of invocations is reduced. Given an initial benchmark suite, our method produces a set of reduced benchmarks that can be used in place of the original one for system selection. We evaluate our method on the NAS SER benchmarks, producing a reduced benchmark suite that is on average 30 times faster than the original suite, with a maximum of 44 times. The reduced suite predicts the execution time on three target architectures with a median error between 3.9% and 8%.
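The sketch below illustrates the redundancy-elimination idea on made-up data — the codelet names, feature choice, and distance threshold are hypothetical, and the paper's actual clustering and feature set differ: codelets whose normalized performance-feature vectors are close to an already-kept representative are dropped.

```python
import numpy as np

def subset_benchmarks(features, names, threshold=0.15):
    """Greedy clustering of codelets by normalized performance-feature vectors.

    A codelet closer than `threshold` (scaled by the feature-space dimension)
    to an already-kept representative is considered redundant and dropped.
    """
    f = np.asarray(features, dtype=np.float64)
    f = (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-12)     # normalize each feature
    representatives = []
    for i, v in enumerate(f):
        if all(np.linalg.norm(v - f[j]) > threshold * np.sqrt(f.shape[1])
               for j in representatives):
            representatives.append(i)
    return [names[i] for i in representatives]

# Toy feature vectors: (vectorization ratio, cache miss rate, IPC) per codelet.
names = ["bt_rhs", "bt_xsolve", "lu_blts", "ft_fftz", "cg_spmv"]
features = [[0.9, 0.02, 2.1],
            [0.9, 0.02, 2.0],     # near-duplicate of bt_rhs -> dropped
            [0.1, 0.15, 0.8],
            [0.7, 0.05, 1.6],
            [0.2, 0.30, 0.5]]
print("representatives:", subset_benchmarks(features, names))
```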
 
Adaptive Sampling for Performance Characterization of Application Kernels. Pablo de Oliveira Castro, Eric Petit, Asma Farjallah, and William Jalby. Concurrency and Computation: Practice and Experience, 2013. [ bib | DOI | .pdf ]
Characterizing performance is essential to optimize programs and architectures. The open-source Adaptive Sampling Kit (ASK) measures the performance trade-off in large design spaces. Exhaustively sampling all sets of parameters is computationally intractable. Therefore, ASK concentrates exploration in the most irregular regions of the design space through multiple adaptive sampling strategies. The paper presents the ASK architecture and a set of adaptive sampling strategies, including a new approach called Hierarchical Variance Sampling. ASK's usage is demonstrated on three performance characterization problems: memory stride accesses, Jacobian stencil code, and an industrial seismic application using 3D stencils. ASK builds accurate models of performance with a small number of measurements. It considerably reduces the cost of performance exploration. For instance, the Jacobian stencil code design space, which has more than 31 × 10^8 combinations of parameters, is accurately predicted using only 1500 combinations.
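Here is a compact sketch of the variance-guided idea only — ASK's actual strategies, including Hierarchical Variance Sampling, are more elaborate — on a synthetic two-factor design space (stride, working-set size) with a made-up measurement function; every name and constant below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

def measure(stride, size):
    """Stand-in for a real measurement: latency jumps when the working set exceeds a cache level."""
    base = 1.0 if size < 32_000 else 4.0 if size < 2_000_000 else 12.0
    return base + 0.05 * stride + rng.normal(0.0, 0.1)

def adaptive_sample(bounds, budget=300, init=40, refine=20):
    """Variance-guided sampling: repeatedly refine the cell whose measurements vary the most."""
    cells = [bounds]                                   # each cell: ((s_lo, s_hi), (n_lo, n_hi))
    samples = []                                       # (stride, size, measurement)

    def draw(cell, k):
        (a, b), (c, d) = cell
        for _ in range(k):
            s, n = rng.uniform(a, b), rng.uniform(c, d)
            samples.append((s, n, measure(s, n)))

    draw(cells[0], init)
    while len(samples) < budget:
        variances = [np.var([y for s, n, y in samples if a <= s < b and c <= n < d] or [0.0])
                     for (a, b), (c, d) in cells]
        (a, b), (c, d) = cells.pop(int(np.argmax(variances)))
        mid = (c + d) / 2                              # split the most irregular cell along `size`
        for cell in (((a, b), (c, mid)), ((a, b), (mid, d))):
            cells.append(cell)
            draw(cell, refine // 2)
    return samples

samples = adaptive_sample(((1, 64), (1_000, 8_000_000)))
print("collected", len(samples), "measurements")
```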
 
Is Source-code Isolation Viable for Performance Characterization? Chadi Akel, Yuriy Kashnikov, Pablo de Oliveira Castro, and William Jalby. In International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI). IEEE Computer Society, 2013. [ bib | .pdf | .pdf ]
Source-code isolation finds and extracts the hotspots of an application as independent isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Source-code isolation reduces benchmarking cost and allows piece-wise optimization of an application. Source-code isolation is faster than whole-program benchmarking and optimization since the user can concentrate only on the bottlenecks. This paper examines the viability of using isolated codelets in place of the original application for performance characterization and optimization. On the NAS benchmarks, we show that codelets capture 92.3% of the original execution time. We present a set of techniques for keeping codelets as faithful as possible to the original hotspots: 63.6% of the codelets have the same assembly as the original hotspots and 81.6% of the codelets have the same run time performance as the original hotspots.
 
Evaluating Architecture and Compiler Design through Static Loop Analysis. Yuriy Kashnikov, Pablo de Oliveira Castro, Emmanuel Oseret, and William Jalby. In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pages 535 -- 544. IEEE Computer Society, 2013. [ bib | DOI | .pdf ]
Using the MAQAO loop static analyzer, we characterize a corpus of binary loops extracted from common benchmark suites such as SPEC and NAS, as well as several industrial applications. For each loop, MAQAO extracts low-level assembly features such as the integer and floating-point vectorization ratio, the number of registers used and spill-fill, and the number of concurrent memory streams accessed. The distributions of these features on a large representative code corpus can be used to evaluate compilers and architectures and tune them for the most frequently used assembly patterns. In this paper, we present the MAQAO loop analyzer and a characterization of the 4857 binary loops. We evaluate register allocation and vectorization on two compilers and propose a method to tune loop buffer size and stream prefetcher based on static analysis of benchmarks.
 
ASK: Adaptive Sampling Kit for Performance Characterization. Pablo de Oliveira Castro, Eric Petit, Jean Christophe Beyler, and William Jalby. In Christos Kaklamanis, Theodore S. Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2012 Parallel Processing - 18th International Conference, volume 7484 of Lecture Notes in Computer Science, pages 89--101. Springer, 2012. [ bib | .pdf | .pdf ]
Characterizing performance is essential to optimize programs and architectures. The open-source Adaptive Sampling Kit (ASK) measures the performance trade-offs in large design spaces. Exhaustively sampling all points is computationally intractable. Therefore, ASK concentrates exploration in the most irregular regions of the design space through multiple adaptive sampling methods. The paper presents the ASK architecture and a set of adaptive sampling strategies, including a new approach: Hierarchical Variance Sampling. ASK's usage is demonstrated on two performance characterization problems: memory stride accesses and stencil codes. ASK builds precise models of performance with a small number of measurements. It considerably reduces the cost of performance exploration. For instance, the stencil code design space, which has more than 31 × 10^8 points, is accurately predicted using only 1500 points.
 
Computing-Kernels Performance Prediction Using DataFlow Analysis and Microbenchmarking. Eric Petit, Pablo de Oliveira Castro, Tarek Menour, Bettina Krammer, and William Jalby. In International Workshop on Compilers for Parallel Computers, 2012. [ bib ]

Dataflow Parallelism

 
DSL Stream Programming on Multicore Architectures. Pablo de Oliveira Castro, Stéphane Louise, and Denis Barthou. In Sabri Pllana and Fatos Xhafa, editors, Programming Multi-core and Many-core Computing Systems, to appear. John Wiley and Sons, 2012. [ bib | .pdf ]
To effectively program parallel architectures, it is important to combine a simple expression of the parallelism with efficient compiler optimizations. We propose a novel stream programming framework based on two domain-specific languages that separate these two issues. A high-level declarative language describes data dependencies between filters, while an intermediate language enables powerful optimizations through a set of stream graph transformations. This two-level approach offers a clean separation between programming complexity and target-specific optimization.
 
Automatic mapping of stream programs on multicore architectures. Pablo de Oliveira Castro, Stéphane Louise, and Denis Barthou. In International Workshop on Compilers for Parallel Computers, 2010. [ bib ]
 
A Multidimensional Array Slicing DSL for Stream Programming. Pablo de Oliveira Castro, Stéphane Louise, and Denis Barthou. In Complex, Intelligent and Software Intensive Systems, International Conference, pages 913--918. IEEE Computer Society, 2010. [ bib | DOI | .pdf ]
Stream languages offer a simple multi-core programming model and achieve good performance. Yet expressing data rearrangement patterns (like a matrix block decomposition) in these languages is verbose and error-prone. In this paper, we propose a high-level programming language to elegantly describe n-dimensional data reorganization patterns. We show how to compile it to stream languages.
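As a concrete instance of the kind of reorganization pattern the language targets, here is the classic matrix block decomposition written with NumPy reshaping (the DSL itself is not shown; block sizes and names are illustrative).

```python
import numpy as np

def block_decomposition(a, bh, bw):
    """Split a 2-D array into (bh x bw) blocks, the kind of reorganization a slicing DSL expresses.

    Result shape: (rows // bh, cols // bw, bh, bw); blocks stream out in row-major block order.
    """
    r, c = a.shape
    assert r % bh == 0 and c % bw == 0, "blocks must tile the matrix exactly"
    return a.reshape(r // bh, bh, c // bw, bw).swapaxes(1, 2)

a = np.arange(16).reshape(4, 4)
blocks = block_decomposition(a, 2, 2)
print(blocks[0, 1])        # top-right 2x2 block: [[2, 3], [6, 7]]
```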
 
Reducing memory requirements of stream programs by graph transformations. Pablo de Oliveira Castro, Stéphane Louise, and Denis Barthou. In High Performance Computing and Simulation (HPCS), 2010 International Conference on, pages 171--180. IEEE Computer Society, 2010. [ bib | DOI | .pdf ]
Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper, we present a new approach to reduce the memory footprint required to run stream programs on MPSoC. Through an exploration of equivalent program variants, the method selects parallel code minimizing memory consumption. For large program instances, a heuristic accelerating the exploration phase is proposed and evaluated. We demonstrate the benefits of our method on a panel of ten significant benchmarks. Using a multi-core modulo scheduling technique, our approach considerably lowers the minimal amount of memory required to run seven of these benchmarks while preserving throughput.
 
Design-Space Exploration of Stream Programs through Semantic-Preserving Transformations. Pablo de Oliveira Castro, Stéphane Louise, and Denis Barthou. [ bib | .pdf ]
Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper, we present a design-space exploration technique to reduce the minimal memory required when running stream programs on MPSoC; this allows targeting memory-constrained systems and, in some cases, obtaining better performance. Using a set of semantically preserving transformations, we explore a large number of equivalent program variants; we select the variant that minimizes a buffer evaluation metric. To cope efficiently with large program instances, we propose and evaluate a heuristic for this method. We demonstrate the benefits of our method on a panel of ten significant benchmarks. As an illustration, we measure the minimal memory required using a multi-core modulo scheduling. Our approach considerably lowers the minimal memory required for seven of the ten benchmarks.
 
Expression et optimisation des réorganisations de données dans du parallélisme de flots [Expression and optimization of data reorganizations in stream parallelism]. Pablo de Oliveira Castro. PhD thesis, Université de Versailles Saint-Quentin-en-Yvelines, 2010. [ bib | .pdf | .pdf ]

If you download IEEE publications, please consider the copyright notice.