Using Actors to increase scalability and fault tolerance of SUMMA

Authors: Kyle Klenk, Kevin Green, Raymond Spiteri – University of Saskatchewan

Abstract: SUMMA is a modeling framework used for hydrological simulations over large-scale domains, such as the North American continent, which consists of more than half a million hydrological response units (HRUs). In the standard approach to performing such simulations on shared computing resources, the HRUs are divided into batches, and the batches are then submitted as individual jobs. To keep scheduler usage fair, limits are imposed on the number of jobs that can be submitted at a time, which severely limits the number of CPUs that can be utilized when each job uses a single CPU core. Thus, in order to utilize more CPUs, users need to employ some creative bash scripting to run batches as sub-tasks of jobs that use more than one CPU core. There are a number of drawbacks to this approach. First, scripting tasks to specific CPUs within a job can lead to suboptimal performance and utilization of compute resources because not all jobs take the same amount of time to run, i.e., there is a straggler effect. Second, if an HRU fails within a job, the job is halted. The failed HRU then has to be identified, corrected, and manually resubmitted. Both of these problems require careful attention on the part of the user to ensure that the simulation is completed in a timely manner and that the results are correct. To address these issues, we wrapped SUMMA in an actor model framework called the C++ Actor Framework (CAF). The actor model is an abstraction of concurrent computation that uses actors as the basic units of computation. An actor has a private state and its own thread of execution and can only communicate with other actors through messages. Our implementation, SUMMA-Actors, represents HRUs as HRU-Actors. Separating HRUs into actors allows them to run concurrently, thus increasing scalability as well as adaptability while alleviating the need to script jobs or sub-tasks to specific CPUs.
With SUMMA-Actors, jobs automatically utilize all available CPUs, increasing a user’s flexibility in the job submission process. With the implementation of the actor model, we observe essentially perfect scaling and a massive reduction in the straggler effect, reducing the total wall-clock time of an array job submission. To enable fault tolerance, SUMMA-Actors employs state separation to confine the failure of an HRU to its own actor. The actor’s private state prevents HRU failures from propagating and crashing the entire job. In addition, SUMMA-Actors uses a hierarchical supervision strategy enabled by the actor model’s ability to spawn and monitor other actors. Our approach implements a supervisor actor called the job-actor. The job-actor allows SUMMA-Actors to address failures at run time by modifying the HRU settings and restarting the failed HRU, removing the need to resubmit the job to the queue. All told, SUMMA-Actors provides a substantive reduction in the wall-clock time and human effort required to complete large-scale SUMMA simulations.
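
The supervision pattern described in the abstract can be illustrated with a minimal sketch. It is written in Python with standard-library threads and queues rather than CAF, and it is not SUMMA-Actors code. The names run_hru, HRUActor, job_actor, and the max_steps setting are hypothetical, as is the assumption that one HRU fails until its settings are relaxed. Each HRU actor keeps a private copy of its settings and reports to the supervisor only through messages; the supervisor modifies the settings of a failed HRU and restarts it while the other HRUs continue.

```python
import queue
import threading


def run_hru(hru_id, settings):
    """Stand-in for one HRU simulation (hypothetical; not SUMMA code)."""
    # Assumption for illustration: HRU 3 fails unless given at least 200 steps.
    if hru_id == 3 and settings["max_steps"] < 200:
        raise RuntimeError("solver did not converge")
    return hru_id * 10.0


class HRUActor(threading.Thread):
    """One HRU with private state; it reports to its supervisor by message only."""

    def __init__(self, hru_id, settings, supervisor_mailbox):
        super().__init__()
        self.hru_id = hru_id
        self.settings = dict(settings)  # private copy, not shared with other actors
        self.supervisor_mailbox = supervisor_mailbox

    def run(self):
        try:
            result = run_hru(self.hru_id, self.settings)
            self.supervisor_mailbox.put(("done", self.hru_id, result))
        except Exception:
            # The failure stays inside this actor; only a message leaves it.
            self.supervisor_mailbox.put(("failed", self.hru_id, self.settings))


def job_actor(hru_ids, settings, max_retries=2):
    """Supervisor: spawns HRU actors and restarts failed ones with new settings."""
    mailbox = queue.Queue()
    attempts = {hru_id: 0 for hru_id in hru_ids}
    results, gave_up = {}, {}
    for hru_id in hru_ids:
        HRUActor(hru_id, settings, mailbox).start()
    pending = len(hru_ids)
    while pending:
        kind, hru_id, payload = mailbox.get()
        if kind == "done":
            results[hru_id] = payload
            pending -= 1
        elif attempts[hru_id] < max_retries:
            # Modify the failed HRU's settings and restart it at run time.
            attempts[hru_id] += 1
            relaxed = dict(payload, max_steps=payload["max_steps"] * 2)
            HRUActor(hru_id, relaxed, mailbox).start()
        else:
            gave_up[hru_id] = payload
            pending -= 1
    return results, attempts, gave_up
```

Calling job_actor([1, 2, 3, 4], {"max_steps": 100}) in this sketch completes all four HRUs, with the hypothetical HRU 3 restarted once with doubled max_steps and no effect on the others.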

CIROH Training and Developers Conference 2023 Abstracts