Inciter software design

This page discusses the high-level software design and some implementation aspects of Inciter. The discussion roughly follows as execution happens in time, covering what happens from program start to the beginning of time stepping. On any of the Charm++ concepts below, consult the Charm++ manual.

1. Startup and migration of read-only global-scope data

Startup

All executables in Xyst use the Charm++ runtime system. Runtime execution starts in the Charm++ mainmodule constructor, defined in Main/Inciter.cpp as Main. In Main's constructor, the command line is parsed, followed by echoing some information to the screen. Then the input (control) file is parsed and a log file is written that contains what was parsed, which helps debugging input-related problems.

Global-scope data

After the main chare constructor has finished, the runtime system initializes global scope data and migrates it to all other processing elements (PEs), which from that point is considered read-only.

This global-scope data are defined at the beginning of Main/Inciter.cpp in the same order as they appear in the companion Charm++ interface file Main/inciter.ci. This is the order in which these data are initialized and migrated to all PEs. Global scope data is limited to such read-only data. It stores data initialized during parsing the command line and the input (control) file.

Create single Charm++ chare Transporter

Up to this point execution is serial, since there is only a single main Charm++ chare. Main::execute() then fires up a single Charm++ chare instance of Transporter, after the runtime system has finished migrating all global-scope data. Transporter, defined in Inciter/Transporter.cpp, is the main driver class of Inciter that is a Charm++ chare from which all execution happens, e.g., via broadcasts, and to which all execution ends up in, leading to Transporter::finish(), which eventually calls back to Main::finalize(), calling CkExit(), signaling the runtime system to exit.

2. Important classes

Here are the important classes that interoperate within inciter:

Transporter (single chare, driver)
Partitioner (chare nodegroup, mesh partitioner)
tk::MeshWriter (chare group, mesh writer, performing file output in parallel)
Refiner (chare array, mesh refiner)
Sorter (chare array, performs distributed mesh-node reordering)
Discretization (chare array, generic solver base class)
- RieCG (chare array, solver child class, specialized to RieCG)
- KozCG (chare array, solver child class, specialized to KozCG)
- ZalCG (chare array, solver child class, specialized to ZalCG)
- LaxCG (chare array, solver child class, specialized to LaxCG)
- ChoCG (chare array, solver child class, specialized to ChoCG)
- LohCG (chare array, solver child class, specialized to LohCG)
- ...

Chare above means a single Charm++ chare. There is a single instance of this class. By design, Transporter is a single chare and is used as a driver that creates objects, used as a target of global parallel reductions, and thus global synchronization points. Transporter also does most of the printouts to screen, collects statistics, and is the end-point of execution in Transporter::finish().

Group above means a Charm++ chare group. A group is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a group per PE which does not migrate. MeshWriter is a group because it calls the MPI-only library, ExodusII, for outputting mesh and mesh-based solution field data to files. Note that while MeshWriter is a group, the way we use it makes it similar to a nodegroup but with group semantics: we call its Charm++ entry method MeshWriter::write() from multiple array elements (from Discretization) targeting ony the first PE of each logical node. This yields "serializing" every call per node, required for the underlying non-thread-safe NetCDF/HDF5 library calls, made by ExodusII. This way writing large solution data works in both non-SMP and SMP mode correctly and efficiently, since parallel output load is configurable separately from the number of work-units for computation.

Nodegroup above means a Charm++ chare nodegroup. A nodegroup is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a nodegroup per logical (e.g., compute) node which does not migrate. Partitioner is a nodegroup because it calls MPI-only libraries.

Note

Charm++ nodegroups, are similar to groups in that they are processor-aware but instead of one per PE, nodegroups are guaranteed to be one per logical (compute) node, where a compute node is understood to be a logical node which may or may not be equivalent to a physical compute node (i.e., a shared-memory node), depending on how a given execution environment is configured in Charm++'s symmetric multi-processing mode (SMP) mode. (For example, a run in SMP mode can be configured with a logical node corresponding to a shared-memory compute node or to a socket within a compute node). Similar to groups, nodegroups do not migrate but there are significant differences in their parallel computing semantics in how the runtime system invokes their member functions compared to groups. In particular, while member functions of chare arrays and groups execute exclusively in a single instance, node groups allow multiple non-exclusive member functions to execute concurrently in a single nodegroup instance. Note that there is no difference between groups and nodegroups if Charm++ is built in non-SMP mode, which is the default. For more details see the Charm++ manual.

Array above means a Charm++ chare array, whose elements can migrate (if enabled) and thus they actively participate in automatic load balancing. With nonzero overdecomposition, there may be more array elements (workers) than the number of available PEs. Arrays do the bulk of the heavy lifting in a calculation, i.e., computing right-hand sides for PDE operators, and hold the bulk of unknown/solution arrays. The degree of overdecomposition can be specified by the -u command line argument. This argument accepts a real value between 0.0 and 1.0, inclusive. 0.0 means no overdecomposition, which corresponds to partitioning the mesh into a number of pieces equalling the number of PEs available. (-u 0.0 yields an execution style that is most similar to how MPI codes are traditionally used, which is the default.) Nonzero overdecomposition yields larger number of mesh partitions than the available PEs. The extreme of 1.0 represents the largest degree of overdecomposition, which also results in the smallest work units.

Bound arrays: Discretization and its specialized children, RieCG, etc., Refiner, are bound arrays. This means that the runtime system migrates its corresponding array elements together. Bound arrays facilitate modularization among workers that migrate. Since array elements that are bound always appear together on a given PE, even after migration, they can also be thought of as part of the same class, because they can access data from each other (but still respecting the C++ rules of public, private, etc). However, dividing functionality into classes, as always, helps readability and makes reasoning about code easier.

3. Setup

Transporter constructor

Inciter's setup starts with Transporter's constructor. After some printouts on configuration, some of the important classes are created, introduced above. In some cases, we simply call Charm++'s ckNew() without arguments. This means an empty Charm++ chare array is created, we get hold of its proxy, but no constructors are run yet. An example for this is Sorter. We do empty array creation for two main reasons: (1) we don't yet have all data that is needed to create the given class (more preparation is required but we need its proxy already), and/or (2) we need to pass specific data to each array (or group) element's constructor, which can only be done via Charm++'s dynamic insertion feature. (Passing it via Charm++'s ckNew() would pass the same data to all elements to be created in a broadcast fashion, and this is not what we want most of the time.) For example, Sorter needs part of the mesh connectivity, coordinates, etc., for which we need to have the mesh partitioned first. This also allows nicely adhering to the RAII idiom, which helps writing clean and correct code.

Transporter thus creates tk::MeshWriter, Sorter, Refiner, Partitioner, etc. These classes play importart roles during setup, but some are also (re-)used during time stepping.

In many cases, we pass a number of callbacks to chare arrays and groups when their Charm++ proxy is created with ckNew(). These callbacks are of type CkCallback. This is one technique that Charm++ works well with. We use these callbacks as a kind of type erasure so that these classes can interoperate with each other as well as with their host, Transporter. These callbacks mostly denote reduction targets to Transporter, that are necessary synchronization points during setup. A CkCallback is similar to std::function or a C-style function pointer, but can also store a callback to a Charm++ chare object that happens to reside across the network, i.e., an entry method call via a proxy.

Partitioner constructor

While from Transporter's constructor we create a number of Charm++ chare arrays, most of them are empty to start with. An exception is Partitioner, whose constructor starts by reading a different chunk of the mesh in blocks as they appear in the file. We read a chunk of the mesh and associated node coordinates for all mesh cells that are in the chunk read. Partitioner's constructor also reads/computes the triangle element connectivity associated to side sets as well as the node lists associated to side sets. Note that only the portion of all this data is read in on a nodegroup that belongs to a given chunk of the mesh. At this point this results in a simple partitioning which is suitable for parallel reading from the mesh file, but almost certainly not ideal for computing equations later, because topology information has not been considered yet resulting in a large surface-area-to-volume ratio of these partitions of the computational domain, since we cannot assume that the ordering in the mesh file also corresponds to close physical-space proximity. Partitioner's constructor finishes with a global reduction to Transporter::load(), which sums the number of elements in the complete problem (read in in a distributed fashion).

Compute total load and overdecomposition

After reading the mesh in a distributed fashion, we now know the total number of cells and Transporter::load() computes the total number of worker chares that will be used to partition the problem into. If there is a nonzero overdecomposition configured by the command line argument -u, the number of chares (partitions) may be more than the available PEs. This is computed by tk::linearLoadDistributor(). We then start mesh partitioning by issuing a broadcast to Partitioner::partition().

Mesh partitioning

Partitioner::partition() sets up the necessary data for calling an external mesh partitioner. Currently, various coordinate-based partitioners are hooked up from Zoltan. This works in distributed-memory parallel fashion, and calls MPI under the hood (inside the library call). Note that the number of desired mesh partitions equals the number of Charm++ (worker) chares we want, which can be larger than the number of PEs (or compute nodes). The output of mesh partitioning is a map that assigns a chare id to each mesh cell. Next is to categorize (or group) all cells (their connectivity and node coordinates) together that are assigned to each chare by the partitioner and send them to their (owner) chare. This is started out in Partitioner::distribute(), and communicated, in a point-to-point fashion, to other Partitioner chares by calling the Partitioner::addMesh() entry method. When every Partitioner object has received their assigned mesh, they issue a global contribute call to Transporter::distributed(). When all contribute calls have arrived, communication of the different parts of the mesh has finished. This is followed by optional initial mesh refinement.

Optional initial mesh refinement

Transporter::distributed() issues a broadcast to Partitioner::refine(), which uses dynamic array insertion to create all Refiner chare array elements. Note that there may be multiple Refiner objects created on each Partitioner PE (or compute node in SMP mode). Each Refiner constructor gets a chunk of the mesh (which now are smaller chunks than the chunks that were originally read from file), corresponding to the total number of chares. Not only the mesh connectivity and node coordinates, but also the boundary face connectivity and boundary node lists associated to multiple side sets (used for boundary conditions) are also passed to Refiner, and only those portions of these data structures that belong to the particular Refiner chare.

Refiner's constructor evaluates user configuration and decides if initial (before time stepping starts) mesh refinement is to be performed or not. If not, execution is simply skipped ahead to Refiner::endt0ref(). If initial refinement is configured, we descend into Refiner::t0ref().

There are two ways Refiner is used:

Mesh refinement before time stepping (t<0), t0ref, and
Mesh refinement during time stepping (t>0), dtref.

Initial mesh refinement consists of potentially multiple steps of different refinement types, e.g., uniform, ic, coordref, which respectively stand for uniform refinement, i.e., split each tetrahedron into 8 new ones, initial-conditions-based, i.e., non-uniform refinement based on estimating the error in the initial conditions on a given initial mesh, and coordinate-based, which allows specifying planes and simple extents in 3 dimensional space and allows tagging edges for refinement between two extremes. Additionally the user can also tag edges manually, creating a list of pairs of global node ids.

Refiner::next() evaluates, at the end of each initial refinement step, if a next initial refinement step is to be performed. In the last step, execution continues to Refiner::endt0ref(), which creates Sorter.

Refiner::endt0ref() not only creates the Sorter array chares but also reports back to Transporter::refined(), signaling the end of the initial mesh refinement. In this reduction target the final number of mesh cells and nodes are also aggregated across the whole problem.

Optional PE-locality mesh node reordering

Sorter takes a chunk of the mesh, together with its physical boundary face and node data structures, and sets up a symmetric node communication map, which stores the global node IDs of chare boundaries associated to chare IDs on each chare. This map is symmetric in a sense that if two chares share a node after partitioning, the same node ID will be stored on each chare assigned to the other chare ID. This map is used for point-point communication during time stepping.

Sorter performs optional PE-locality mesh node reordering in parallel. This reordering is optional and is off by default. The reordering assigns new global mesh node IDs that roughly increase with chare ID. This is useful for schemes that need to solve a global linear system. For more details on the global sorting algorithm see Papers.

Whether node reordering is to be performed or not is evaluated in Sorter::start(). If reordering is done, it starts in Sorter::mask(). If reordering is not done, we skip to Sorter::createDiscWorkers(), which starts creating the worker chares, that will eventually perform the heavy lifting during time stepping: computing the PDE operators during time integration.

Creating workers

The worker Charm++ chare array elements, that store the mesh and associated data structures and compute PDE operators during time stepping, are organized into a single-level base-child relationship for code reuse and organization. However, the base-child relationship is slightly differently implemented compared to what would be familiar with inheritance in standard object-oriented programming.

There is a single base class, Discretization, that encapsulates data and member functions that are generic to all mesh-based discretization schemes. It stores the mesh, connectivity, node coordinates, etc. As Discretization is a chare array, its elements are distributed across the whole problem on all available compute nodes and PEs, and can also migrate for load balancing. The derived (or child) classes are the classes that implement a particular discretization scheme, e.g., RieCG.

Sorter::createDiscWorkers() starts creating the Discretization chares. This is done via the usual Charm++ dynamic array insertion passing the mesh chunk and associated data structures to their constructors. Eventually, this is followed by creating the child workers in Sorter::createWorkers().

Execution from the Discretization constructor follows to Transporter::disccreated() which fires up Discretization::vol(), which computes nodal volumes as well as the total volume of the complete problem, which ends up in Transporter::totalvol(). Volume calculations are followed by computing various statistics on the mesh cell sizes, including min/max edge lengths, and histograms, which are useful diagnostics to estimate load imbalance.

The various mesh statistics aggregate and arrive independently in reduction target member functions of Transporter, minstat(), maxstat(), sumstat(), and pdfstat(). When all of these are complete (see src/Inciter/transporter.ci how structured direct acyclic graphs (DAGs) are used to communicate this condition to the runtime system). Eventually, Transporter::stat() is called in which execution continues by creating the child (or derived) workers in Sorter::createWorkers(). As usual, the children are also created by dynamic array insertion. See Papers for more information on how DAGs are used to implement task-parallelism.

After the child constructors have started, one way or another, they are expected to call Transporter::comfinal(), which signals the end of setting up any (potentially additional) communication maps needed by the specific discretization schemes. After this the child classes are called again via their member function setup(). All specific discretization schemes, "deriving from" Discretization, are expected to define certain member functions, such as setup(), which do discretization-scheme-specific setup and is eventually expected to call dt(), which starts time stepping (by computing the size of the next time step). dt() is also the member function that is called again and again starting a new time step during time stepping.

4. Time stepping

After all the above, time stepping starts. Time stepping is and may be done very differently by the different types of discretizations but they are all expected to define a few member functions that are common so they can interoperate with their host, Transporter. These are Charm++ entry methods and must also have the same function signature (as in a derived class). Since the set of these function may change, the best way to find out what is required is to compare the Charm++ interface (.ci) files for the specific discretization schemes, e.g., RieCG, etc., and look for entry methods that are defined by all child schemes.