Ian Lumb, Chris Smith, Brian MacDonald, Rodney Jones
Platform Computing Corporation, Markham, Ontario, Canada

Ralph Stroup
Hewlett Packard Corporation, Richardson, Texas, USA

'Mission-Critical' Technical Computing

The concept of `mission-critical' is well known and well established in commercial computing. In practice, it translates to a `five 9s' (99.999%) service-level expectation intended to maximize infrastructure availability on a per-transaction basis.
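To make the `five 9s' figure concrete, the following minimal sketch converts an availability target expressed as a number of nines into the downtime it permits over a period. The function name and the choice of a 365-day year are illustrative assumptions, not part of any product discussed here.

```python
def allowed_downtime_seconds(nines: int, period_seconds: float = 365 * 24 * 3600) -> float:
    """Downtime permitted by an availability target of `nines` nines.

    Five nines means 99.999% availability, i.e. an unavailable
    fraction of 10**-5 of the period (a 365-day year by default).
    """
    unavailable_fraction = 10.0 ** (-nines)
    return unavailable_fraction * period_seconds


if __name__ == "__main__":
    seconds = allowed_downtime_seconds(5)
    print(f"five 9s allows about {seconds / 60:.2f} minutes of downtime per year")
```

Under these assumptions, five 9s corresponds to a little over five minutes of downtime per year, a budget that is meaningful for short transactions but small compared with a task that runs for days.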

In contrast, technical computing typically involves tasks whose resource requirements extend well beyond the scope of a single transaction. Examples range from CFD simulations of airflow and crash dynamics in the automotive industry, to numerical weather prediction and climate modeling in education and research, to the rendering of everything from animated shorts to full-length features in digital content creation, to the commonplace design and test simulations of the electronic design automation sector.


As a direct consequence, it is often impractical to guarantee infrastructure availability for the duration of a technical-computing task, since such tasks may execute on timescales of hours, days, or weeks. Moreover, when an effective workload management solution is in place, insisting on such guarantees may not yield optimal overall usage of the compute infrastructure. Hence, `mission-critical' in the technical computing space refers much more to recoverability, in tandem with robust and effective workload management, than to `classic high availability'.
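The argument above can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that each node fails independently at a constant rate (an exponential failure model) and that a task is lost if any of its nodes fails; the function name and the figures in the usage example are hypothetical and do not come from measurements.

```python
import math


def uninterrupted_completion_probability(duration_hours: float,
                                         node_mtbf_hours: float,
                                         nodes: int = 1) -> float:
    """Probability that a task finishes with no node failure.

    Assumes independent nodes with exponentially distributed time to
    failure (mean `node_mtbf_hours`), so the combined failure rate of
    `nodes` nodes is nodes / node_mtbf_hours.
    """
    combined_rate = nodes / node_mtbf_hours
    return math.exp(-combined_rate * duration_hours)


if __name__ == "__main__":
    # Hypothetical figures: a one-week task on 64 nodes, each with an
    # assumed mean time between failures of 10,000 hours.
    p = uninterrupted_completion_probability(7 * 24, 10_000, nodes=64)
    print(f"probability of an uninterrupted run: {p:.2f}")
```

Under this simplified model, the chance of an uninterrupted run falls off exponentially with both task duration and node count, which is why the ability to recover a long-running task matters more than a per-transaction availability figure.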

Achieving this `mission-critical edge' for technical computing is considered here through practical examples involving Hewlett Packard Corporation's MC/ServiceGuard together with Platform Computing Corporation's Load Sharing Facility (LSF) and SiteAssure. In particular, we illustrate how this integration reinvents the traditional notion of checkpoint/restart while allowing I/O bandwidth to be maximized through local disk access. In addition to providing application recoverability, the creation of an MC/ServiceGuard application package for LSF makes the resource management system itself highly available and recoverable.

Together, these products illustrate how a workload-management framework can be integrated into a high-availability infrastructure and thereby meet the requirements of the technical computing `data center'.