Inciter software design
This page discusses the high-level software design and some implementation aspects of Inciter. The discussion roughly follows the order of execution in time, covering what happens from program start to the beginning of time stepping. For any of the Charm++ concepts mentioned below, consult the Charm++ manual.
1. Startup and migration of read-only global-scope data
Startup
All executables in Xyst use the Charm++ runtime system. Runtime execution starts in the constructor of the Charm++ mainmodule's main chare, defined in Main/Main. In Main's constructor, the command line is parsed, followed by echoing some information to the screen. Then the input (control) file is parsed and a log file is written that contains what was parsed, which helps in debugging input-related problems.
Global-scope data
After the main chare constructor has finished, the runtime system initializes global-scope data and migrates it to all other processing elements (PEs), after which it is considered read-only. This global-scope data is defined at the beginning of Main/.
Create single Charm++ chare Transporter
Up to this point execution is serial, since there is only a single main Charm++ chare. Main then creates the single Transporter chare, after the runtime system has finished migrating all global-scope data. Transporter, defined in Inciter/Transporter, drives the calculation; at the end of the run it calls back to Main::, calling CkExit(), which signals the runtime system to exit.
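As a rough sketch only (the structure mirrors the description above, but the identifiers CmdLine, g_cmdline, parseCmdLine(), parseControlFile(), execute(), and finalize() are illustrative assumptions, not Xyst's actual names), a Charm++ main chare that starts such an execution could look like this:

    // Illustrative Charm++ main chare; not Xyst's actual code.
    // Corresponding .ci interface file (sketch):
    //   mainmodule main {
    //     readonly CmdLine g_cmdline;        // read-only global-scope data
    //     mainchare Main {
    //       entry Main( CkArgMsg* msg );
    //       entry void execute();
    //       entry void finalize();
    //     };
    //   };

    #include "main.decl.h"

    CmdLine g_cmdline;    // global-scope data, read-only after startup

    class Main : public CBase_Main {
      public:
        explicit Main( CkArgMsg* msg ) {
          g_cmdline = parseCmdLine( msg->argc, msg->argv );  // parse command line
          parseControlFile( g_cmdline );   // parse control file, write log file
        }
        // Called once the runtime has migrated all read-only global-scope data
        // to all PEs (the trigger mechanism is elided in this sketch).
        void execute() { CProxy_Transporter::ckNew(); }   // create the driver
        // Called back at the very end of the run.
        void finalize() { CkExit(); }
    };

    #include "main.def.h"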
2. Important classes
Here are the important classes that interoperate within inciter:
- Transporter (single chare, driver)
- Partitioner (chare nodegroup, mesh partitioner)
- tk::MeshWriter (chare group, mesh writer, performing file output in parallel)
- Refiner (chare array, mesh refiner)
- Sorter (chare array, performs distributed mesh-node reordering)
- Discretization (chare array, generic solver base class)
Chare above means a single Charm++ chare: there is a single instance of this class. By design, Transporter is a single chare and is used as a driver that creates objects and serves as the target of global parallel reductions, and thus of global synchronization points. Transporter also does most of the printouts to screen, collects statistics, and is the end-point of execution in Transporter::.
Group above means a Charm++ chare group. A group is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a group per PE, and that instance does not migrate. MeshWriter is a group because it calls the MPI-only library ExodusII for outputting mesh and mesh-based solution field data to files. Note that while MeshWriter is a group, the way we use it makes it similar to a nodegroup but with group semantics: we call its Charm++ entry method MeshWriter::write() from multiple array elements (from Discretization), targeting only the first PE of each logical node. This effectively serializes the calls on each node, as required by the underlying non-thread-safe NetCDF/HDF5 library calls made by ExodusII. This way writing large solution data works correctly and efficiently in both non-SMP and SMP mode, since the parallel output load is configurable separately from the number of work units used for computation.
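The calling pattern described above can be sketched as follows (MeshWriter::write() is named in the text; the function and the way the proxy is obtained are illustrative):

    // Called from a Discretization array element (sketch): invoke the
    // MeshWriter group's entry method only on the first PE of the caller's
    // logical node, which serializes per-node file output.
    void writeFields( CProxy_MeshWriter meshwriter ) {
      int firstPeOnMyNode = CkNodeFirst( CkMyNode() );   // first PE of my node
      meshwriter[ firstPeOnMyNode ].write( /* mesh and field data elided */ );
    }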
Nodegroup above means a Charm++ chare nodegroup. A nodegroup is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a nodegroup per logical (e.g., compute) node, and that instance does not migrate. Partitioner is a nodegroup because it calls MPI-only libraries.
Array above means a Charm++ chare array, whose elements can migrate (if enabled) and thus actively participate in automatic load balancing. With nonzero overdecomposition, there may be more array elements (workers) than available PEs. Arrays do the bulk of the heavy lifting in a calculation, i.e., computing right-hand sides for PDE operators, and hold the bulk of the unknown/solution arrays. The degree of overdecomposition can be specified by the -u command line argument, which accepts a real value between 0.0 and 1.0, inclusive. 0.0 means no overdecomposition, which corresponds to partitioning the mesh into a number of pieces equal to the number of available PEs. (-u 0.0 yields an execution style most similar to how MPI codes are traditionally used; this is the default.) Nonzero overdecomposition yields a larger number of mesh partitions than available PEs. The extreme of 1.0 represents the largest degree of overdecomposition, which also results in the smallest work units.
Bound arrays: Discretization and its specialized children, e.g., RieCG, as well as Refiner, are bound arrays. This means that the runtime system migrates their corresponding array elements together. Bound arrays facilitate modularization among workers that migrate. Since array elements that are bound always appear together on a given PE, even after migration, they can also be thought of as parts of the same class, because they can access data from each other (while still respecting the C++ rules of public, private, etc.). However, dividing functionality into classes, as always, helps readability and makes reasoning about the code easier.
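In Charm++ terms, binding one chare array to another is configured through CkArrayOptions when the second array is created. A minimal sketch, using the array names from the text but otherwise illustrative code (the real code creates and populates the arrays differently, as described in the next section):

    void createBoundArrays( int nchare ) {
      // Create the base array, then bind the child array to it so elements
      // with the same index are always placed (and migrated) together.
      CkArrayOptions opts( nchare );
      CProxy_Discretization disc = CProxy_Discretization::ckNew( opts );

      CkArrayOptions boundOpts( nchare );
      boundOpts.bindTo( disc );               // bind child array to base array
      CProxy_RieCG riecg = CProxy_RieCG::ckNew( boundOpts );

      // Inside a RieCG element, the co-located Discretization element can be
      // accessed directly, e.g.:  disc[ thisIndex ].ckLocal();
    }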
3. Setup
Transporter constructor
Inciter's setup starts with Transporter's constructor. After some printouts on configuration, some of the important classes introduced above are created. In some cases, we simply call Charm++'s ckNew() without arguments. This means an empty Charm++ chare array is created: we get hold of its proxy, but no constructors are run yet. An example of this is Sorter. We do empty array creation for two main reasons: (1) we don't yet have all the data needed to create the given class (more preparation is required, but we need its proxy already), and/or (2) we need to pass specific data to each array (or group) element's constructor, which can only be done via Charm++'s dynamic insertion feature. (Passing it via Charm++'s ckNew() would pass the same data to all elements to be created, in a broadcast fashion, and this is not what we want most of the time.) For example, Sorter needs part of the mesh connectivity, coordinates, etc., for which we need to have the mesh partitioned first. This approach also allows adhering nicely to the RAII idiom, which helps in writing clean and correct code.
Transporter thus creates tk::MeshWriter.
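The empty-creation plus dynamic-insertion pattern described above looks roughly like this in Charm++ (Sorter is from the text; constructor arguments are elided and the surrounding function is illustrative):

    // Schematic: in the real code the two steps happen in different places.
    void createSorterArray( int nchare ) {
      // Step 1: create an empty chare array. No element constructors run yet,
      // but the proxy can already be handed to other chares.
      CProxy_Sorter sorter = CProxy_Sorter::ckNew();

      // Step 2, normally much later (e.g., once the mesh has been partitioned):
      // insert each element, passing element-specific data to its constructor.
      for (int c = 0; c < nchare; ++c)
        sorter[ c ].insert( /* this element's mesh chunk, coordinates, ... */ );
      sorter.doneInserting();   // tell the runtime no more elements will come
    }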
In many cases, we pass a number of callbacks to chare arrays and groups when their Charm++ proxy is created with ckNew(). These callbacks are of type CkCallback. This is a technique that works well with Charm++. We use these callbacks as a kind of type erasure so that these classes can interoperate with each other as well as with their host, Transporter. The callbacks mostly denote reduction targets in Transporter, which serve as necessary synchronization points during setup. A CkCallback is similar to std::function or a C-style function pointer, but it can also store a callback to a Charm++ chare object that happens to reside across the network, i.e., an entry method call via a proxy.
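A sketch of how such a callback is created, passed, and used (Transporter and Partitioner are from the text; the entry method name load(), the member m_cb, and the function report() are illustrative):

    // In Transporter's constructor (sketch): create a callback to one of its
    // reduction targets and hand it to a chare collection at creation time.
    CkCallback cb( CkReductionTarget( Transporter, load ), thisProxy );
    CProxy_Partitioner part = CProxy_Partitioner::ckNew( cb );

    // In Partitioner (sketch): contribute to a sum reduction; the result
    // arrives in Transporter::load() without Partitioner knowing anything
    // about Transporter's type (m_cb stores the callback received above).
    void Partitioner::report( std::size_t nelem ) {
      contribute( sizeof(nelem), &nelem, CkReduction::sum_ulong, m_cb );
    }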
Partitioner constructor
While from Transporter's constructor we create a number of Charm++ chare arrays, most of them start out empty. An exception is Partitioner, whose constructor starts by reading a chunk of the mesh, in blocks as they appear in the file: each Partitioner branch reads a different chunk of the mesh together with the node coordinates associated with the cells in that chunk. Partitioner's constructor also reads/computes the triangle element connectivity associated with side sets as well as the node lists associated with side sets. Note that each nodegroup branch reads in only the portion of this data that belongs to its chunk of the mesh. At this point this results in a simple partitioning that is suitable for reading the mesh file in parallel, but it is almost certainly not ideal for computing equations later: topology has not yet been considered, so these partitions of the computational domain have a large surface-area-to-volume ratio, since we cannot assume that the ordering in the mesh file corresponds to close physical-space proximity. Partitioner's constructor finishes with a global reduction to Transporter::, which sums the number of elements in the complete problem (read in a distributed fashion).
Compute total load and overdecomposition
After reading the mesh in a distributed fashion, we now know the total number of cells, and Transporter:: computes the total number of worker chares into which the problem will be partitioned. If there is nonzero overdecomposition, configured by the command line argument -u, the number of chares (partitions) may be larger than the number of available PEs. This is computed by tk::.
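As an illustration only, the following shows one possible mapping from the virtualization parameter u to a chare count that matches the behavior described above (u = 0.0 gives one chare per PE, increasing u gives more, hence smaller, work units); it is not claimed to be the formula actually implemented in tk::.

    #include <cmath>
    #include <cstdint>

    // Hypothetical mapping of virtualization u in [0,1] to a number of chares:
    // u = 0.0 -> one chare per PE, u = 1.0 -> the maximum overdecomposition.
    std::uint64_t numChares( double u, std::uint64_t nelem, int npe ) {
      double n = npe + u * ( static_cast<double>(nelem) - npe );
      if (n < npe) n = npe;                              // never fewer than PEs
      if (n > static_cast<double>(nelem)) n = static_cast<double>(nelem);
      return static_cast<std::uint64_t>( std::llround(n) );
    }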
Mesh partitioning
Partitioner:: sets up the necessary data for calling an external mesh partitioner. Currently, various coordinate-based partitioners are hooked up from Zoltan. This works in a distributed-memory-parallel fashion and calls MPI under the hood (inside the library call). Note that the number of desired mesh partitions equals the number of Charm++ (worker) chares we want, which can be larger than the number of PEs (or compute nodes). The output of mesh partitioning is a map that assigns a chare id to each mesh cell. Next, all cells (their connectivity and node coordinates) assigned to the same chare by the partitioner are categorized (grouped) together and sent to their owner chare; a plain C++ sketch of this grouping is given below. This is started in Partitioner:: and the data are communicated, in a point-to-point fashion, to the other Partitioner chares; completion is signaled by a contribute (reduction) call to Transporter::. When all contribute calls have arrived, communication of the different parts of the mesh has finished. This is followed by optional initial mesh refinement.
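The categorization step can be pictured with plain C++, independent of the actual Partitioner code (variable names are illustrative):

    #include <cstddef>
    #include <map>
    #include <vector>

    // che[e]  : chare id assigned to tetrahedron e by the partitioner
    // inpoel  : tetrahedron connectivity, 4 global node ids per element
    // Returns, for each chare id, the connectivity of the cells it owns; the
    // per-chare chunks are then sent to their owner chares.
    std::map< int, std::vector< std::size_t > >
    categorize( const std::vector< int >& che,
                const std::vector< std::size_t >& inpoel )
    {
      std::map< int, std::vector< std::size_t > > chunks;
      for (std::size_t e = 0; e < che.size(); ++e) {
        auto& conn = chunks[ che[e] ];
        conn.insert( conn.end(), inpoel.begin() + 4*e, inpoel.begin() + 4*(e+1) );
      }
      return chunks;
    }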
Optional initial mesh refinement
Transporter:: issues a broadcast to Partitioner::, which uses dynamic array insertion to create all Refiner chare array elements. Note that multiple Refiner objects may be created on each Partitioner PE (or compute node in SMP mode). Each Refiner constructor gets a chunk of the mesh, corresponding to the total number of chares (these chunks are now smaller than the ones originally read from file). Besides the mesh connectivity and node coordinates, the boundary face connectivity and the boundary node lists associated with multiple side sets (used for boundary conditions) are also passed to Refiner, and only those portions of these data structures that belong to the particular Refiner chare.
Refiner's constructor evaluates the user configuration and decides whether initial (i.e., before time stepping starts) mesh refinement is to be performed. If not, execution simply skips ahead to Refiner::.
There are two ways Refiner is used:
- Mesh refinement before time stepping (t<0), t0ref, and
- Mesh refinement during time stepping (t>0), dtref.
Initial mesh refinement consists of potentially multiple steps of different refinement types, e.g., uniform, ic, and coordref. These stand, respectively, for uniform refinement (split each tetrahedron into 8 new ones), initial-conditions-based refinement (non-uniform refinement based on estimating the error of the initial conditions on a given initial mesh), and coordinate-based refinement (which allows specifying planes and simple extents in 3-dimensional space and tags edges for refinement between the given extremes). Additionally, the user can also tag edges manually by giving a list of pairs of global node ids.
When initial mesh refinement has finished, Refiner:: not only creates the Sorter chare array elements but also reports back to Transporter::.
Optional PE-locality mesh node reordering
Sorter takes a chunk of the mesh, together with its physical boundary face and node data structures, and sets up a symmetric node communication map, which stores, associated to chare IDs, the global IDs of mesh nodes shared along chare boundaries. This map is symmetric in the sense that if two chares share a node after partitioning, each chare stores that node ID associated with the other chare's ID. This map is used for point-to-point communication during time stepping.
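Conceptually, the node communication map is a structure along these lines (a simplified sketch, not the actual data type used in Sorter):

    #include <cstddef>
    #include <map>
    #include <unordered_set>

    // For each neighboring chare id: the set of global mesh node ids shared
    // with that chare along the chare boundary. Symmetric: if chares a and b
    // share global node g, chare a stores { b -> { ..., g, ... } } and chare b
    // stores { a -> { ..., g, ... } }.
    using NodeCommMap = std::map< int, std::unordered_set< std::size_t > >;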
Sorter performs optional PE-locality mesh node reordering in parallel. This reordering is off by default. It assigns new global mesh node IDs that roughly increase with chare ID, which is useful for schemes that need to solve a global linear system. For more details on the global sorting algorithm, see Papers. Whether node reordering is to be performed is evaluated in Sorter::.
Creating workers
The worker Charm++ chare array elements, which store the mesh and associated data structures and compute PDE operators during time stepping, are organized into a single-level base-child relationship for code reuse and organization. However, this base-child relationship is implemented slightly differently from the inheritance familiar from standard object-oriented programming.
There is a single base class, Discretization, that encapsulates data and member functions that are generic to all mesh-based discretization schemes. It stores the mesh, connectivity, node coordinates, etc. As Discretization is a chare array, its elements are distributed across the whole problem on all available compute nodes and PEs, and they can also migrate for load balancing. The derived (or child) classes are the ones that implement a particular discretization scheme, e.g., RieCG.
Sorter:: creates the Discretization chares. This is done via the usual Charm++ dynamic array insertion, passing the mesh chunk and associated data structures to their constructors. Eventually, this is followed by creating the child workers in Sorter::.
Execution from the Discretization constructor continues to Discretization::, which computes nodal volumes as well as the total volume of the complete problem; the latter ends up in Transporter::. The volume calculations are followed by computing various statistics of the mesh cell sizes, including min/max edge lengths and histograms, which are useful diagnostics for estimating load imbalance.
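One common way to compute nodal volumes, shown here as a self-contained sketch rather than the actual Discretization code, is to scatter a quarter of each tetrahedron's volume to each of its four nodes; summing the nodal volumes then gives the total volume contributed by a mesh chunk:

    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // x,y,z  : node coordinates;  inpoel : 4 node ids per tetrahedron
    std::vector< double >
    nodalVolumes( const std::vector< double >& x,
                  const std::vector< double >& y,
                  const std::vector< double >& z,
                  const std::vector< std::size_t >& inpoel )
    {
      std::vector< double > vol( x.size(), 0.0 );
      for (std::size_t e = 0; e < inpoel.size()/4; ++e) {
        const std::size_t* N = &inpoel[4*e];
        // edge vectors from node N[0]
        std::array<double,3> a{ x[N[1]]-x[N[0]], y[N[1]]-y[N[0]], z[N[1]]-z[N[0]] };
        std::array<double,3> b{ x[N[2]]-x[N[0]], y[N[2]]-y[N[0]], z[N[2]]-z[N[0]] };
        std::array<double,3> c{ x[N[3]]-x[N[0]], y[N[3]]-y[N[0]], z[N[3]]-z[N[0]] };
        // tetrahedron volume = |(a x b) . c| / 6
        double det = a[0]*(b[1]*c[2]-b[2]*c[1])
                   - a[1]*(b[0]*c[2]-b[2]*c[0])
                   + a[2]*(b[0]*c[1]-b[1]*c[0]);
        double V = std::abs(det) / 6.0;
        for (int i = 0; i < 4; ++i) vol[ N[i] ] += 0.25*V;  // scatter to nodes
      }
      return vol;
    }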
The various mesh statistics are aggregated and arrive independently in reduction target member functions of Transporter: minstat(), maxstat(), sumstat(), and pdfstat(). When all of these are complete (see src/), execution continues.
After the child constructors have started, one way or another, they are expected to call Transporter::comfinal(), which signals the end of setting up any (potentially additional) communication maps needed by the specific discretization schemes. After this, the child classes are called again via their member function setup(). All specific discretization schemes, "deriving from" Discretization, are expected to define certain member functions, such as setup(), which performs discretization-scheme-specific setup and is eventually expected to call dt(), which starts time stepping by computing the size of the next time step. dt() is also the member function that is called again and again to start each new time step during time stepping.
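In outline, a child scheme therefore provides an interface along these lines (RieCG, Discretization, setup(), and dt() are from the text; everything else, including the signatures, is illustrative):

    // Sketch of the contract a specific discretization scheme is expected to
    // fulfill; not the actual RieCG declaration.
    class RieCG : public CBase_RieCG {
      public:
        explicit RieCG( /* mesh chunk, callbacks, ... */ );
        void setup();   // scheme-specific setup; eventually calls dt()
        void dt();      // compute next time step size and take the next step
      private:
        CProxy_Discretization m_disc;  // proxy to the bound Discretization array
    };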
4. Time stepping
After all the above, time stepping starts. Time stepping may be done very differently by the different types of discretizations, but they are all expected to define a few common member functions so that they can interoperate with their host, Transporter. These are Charm++ entry methods and must also have the same function signatures (as in a derived class). Since the set of these functions may change, the best way to find out what is required is to compare the Charm++ interface (.ci) files of the specific discretization schemes, e.g., RieCG, and look for entry methods that are defined by all child schemes.