- Title:

On Efficient Parallelization of Line-Sweep Computations - Author:

Daniel Chavarria-Miranda, Alain Darte, Robert Fowler, John Mellor-Crummey

Rice University, Houston, TX, USA and LIP, ENS-Lyon, France - Full Paper(.ps version)
- Abstract:

Strategies for partitioning an application's data and computation determine an application's possible parallelizations and ultimately its parallel efficiency. BLOCK partitionings supported by High Performance Fortran (HPF) and OpenMP can lead to good performance for loosely synchronous computations, they can be inefficient for more tightly-coupled codes. One important class of tightly-coupled computations not well supported by the BLOCK partitionings are line sweeps that solve one-dimensional recurrences along each dimension of a multi-dimensional discretized physical domain. Computational methods based on line sweeps include Alternating Direction Implicit (ADI) integration (a common technique for solving partial differential equations), and fractional step methods among others (Naik, IJHSC93). For this class of computations, applying a standard block partitioning to any of the spatial dimensions is problematic -- recurrences along the partitioned dimension partially serialize execution.Multipartitionings, also known as Bruno-Capella partitionings, are a strategy for partitioning multidimensional arrays among a collection of processors so that line-sweep computations can be solved efficiently. The principal property of these partitionings is that for any directional sweep across a multipartitioned array, all processors have a tile to compute in each phase of the computation yielding a perfectly balanced parallelization with no serialization. A secondary benefit of these partitionings is that only coarse-grain communication is required. Any number of processors can be utilized with a two-dimensional multipartitioning, but a 3-dimensional multipartioning requires that the number of processors be a perfect square. In this work we explore the problem of generalizing partitionings in three or more dimensions to any number of processors that is a composite number. Two interesting problems must be solved for such partitionings. First, we must determine the number of cuts to be applied to each partitioned dimension to maintain full parallelism in each phase of a sweep across any data dimension. The second problem is determining the assignment of tiles to each processor. We are exploring how to best compute effective assignments for the more general case.

We are in the process of completing an implementation of multipartitioning in the Rice dHPF compiler for High Performance Fortran. Our prototype has been used to parallelize a sequential version of the NAS BT computational fluid dynamics benchmark from NASA Ames Laboratory, and, to date, this compiler-based parallelization achieves 80% of the parallel efficiency of the hand-coded parallelization on 25 processors. We intend to incorporate our strategy for achieving partitioning in three or more dimensions for a composite number of processors into a run-time library for the dHPF compiler.

Please contact our
webadmin with any comments or changes.

Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh.

Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh.