Sharon Brunett
California Institute of Technology
Center for Advanced Computing Research

An Initial Study of Multithreaded Coarse Grain Application Performance on the Hewlett Packard V2500

Multithreading is a technique often used to attain performance and scalability for well-suited coarse-grain applications. The underlying hardware and software supporting such an application can and do have a profound effect on a code's performance and scaling behavior.

This paper uses a thermal explosion application kernel to: analyze effects of runtime process management on the HP V2500; exploit available tools for tracking and tuning performance; study scalability characteristics on the HP V2500 and HP X2000.

The Thermal Explosion benchmark used in our investigation simulates an explosive wave initiating chemical reactions in reactive materials. The implementation is meant to stress a system's ability to keep memory close to processors, without special care being paid to data locality. Benchmark execution across various spatial parameters will help assess system performance characteristics as parallelism is increased and caches are saturated.

The simulation begins with a detonation wave, composed of a lead shock wave, initiating a chemical reaction in a reactive material. In turn, chemical energy is released which sustains the shock wave. The mesh is kept static throughout the simulation, causing highly time-variable work-loads per grid point to move through the mesh along with the propagating explosive wave. As the explosive wave moves through the spatial extent, the system's ability to keep memory "close" to processors becomes increasingly important, particularly for large and complex problems, which add to the work-load imbalance. In addition, sufficient numbers of concurrent threads must be available for good performance.
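
The moving work-load imbalance described above can be illustrated with a toy model. The following is a minimal sketch and not the benchmark's code: the one-dimensional mesh, the Gaussian cost profile, and the names point_cost, thread_loads, and imbalance are all assumptions made for illustration. Each thread owns a fixed block of a static mesh, and the cost of a grid point rises sharply near the wave front.

```python
import math


def point_cost(x, front, base=1.0, peak=20.0, width=2.0):
    """Hypothetical cost of updating grid point x.

    Cheap far from the wave front, expensive near it. The Gaussian
    shape and the constants are illustrative assumptions only.
    """
    return base + peak * math.exp(-((x - front) / width) ** 2)


def thread_loads(n_points, n_threads, front):
    """Total cost per thread for a static, equal-sized block partition."""
    block = n_points // n_threads
    loads = []
    for t in range(n_threads):
        start = t * block
        # The last thread also takes any leftover points.
        stop = n_points if t == n_threads - 1 else start + block
        loads.append(sum(point_cost(x, front) for x in range(start, stop)))
    return loads


def imbalance(loads):
    """Busiest thread's load divided by the mean load (1.0 is balanced)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    # As the front moves, the overloaded thread changes while the mesh
    # partition stays fixed.
    for front in (8, 24, 56):
        loads = thread_loads(64, 4, front)
        print(front, loads.index(max(loads)), round(imbalance(loads), 2))
```

In this sketch the busiest thread is always the one whose block currently contains the front, so the imbalance travels across the threads as the wave propagates.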



The Hewlett Packard X2000 and V2500 servers are symmetric multiprocessor (SMP) cache coherent nonuniform memory-access (ccNUMA) systems. The notion of a simple to program, large, integrated memory is appealing to many types of scientific codes. The fundamental building block of the X and V-Class systems is the hypernode. Each hypernode is an SMP containing multiple processors, local memory, and an I/O subsystem. A crossbar switch on each node provides nonblocking access from CPUs and I/O devices to the memory subsystem.

Since our application is multithreaded by virtue of hand-placed compiler directives, studying the scheduling policies applicable to threads is worthwhile. "Mpsched" allows a user to specify a variety of options for controlling the processor or locality domain on which a specific process executes. The mpsched option "-T policy" applies the specified scheduling policy to newly created threads of a process. The launch, or scheduling, policies are straightforward: Round Robin, Least Loaded, Fill First, and Packed.
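
To make the difference between the four launch policies concrete, the sketch below simulates where each policy would place newly created threads across locality domains. It is a simplified model, not the operating system's scheduler: the function place_threads, its parameters, and the exact behavior assumed for each policy (inferred from the policy names) are assumptions for illustration.

```python
def place_threads(policy, n_threads, n_domains, cpus_per_domain, load=None):
    """Return the locality-domain index chosen for each new thread.

    `load` is the (hypothetical) number of threads already running on
    each domain; it defaults to an idle system.
    """
    load = list(load) if load is not None else [0] * n_domains
    placement = []
    for i in range(n_threads):
        if policy == "round_robin":
            # Cycle through the domains, one thread per domain per pass.
            d = i % n_domains
        elif policy == "least_loaded":
            # Pick the domain with the fewest threads (first one on ties).
            d = min(range(n_domains), key=lambda k: load[k])
        elif policy == "fill_first":
            # Fill a domain up to one thread per CPU before moving on;
            # fall back to round robin if every domain is full.
            d = next(
                (k for k in range(n_domains) if load[k] < cpus_per_domain),
                i % n_domains,
            )
        elif policy == "packed":
            # Keep every thread in the same domain.
            d = 0
        else:
            raise ValueError(f"unknown policy: {policy}")
        load[d] += 1
        placement.append(d)
    return placement


if __name__ == "__main__":
    for policy in ("round_robin", "least_loaded", "fill_first", "packed"):
        print(policy, place_threads(policy, 6, 2, 4))
```

In this model, Round Robin and Least Loaded spread threads across domains, which favors aggregate memory bandwidth, while Fill First and Packed keep threads together, which favors sharing data within one domain.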

Performance tests use increasingly large grid sizes, chosen either to fit in cache or to exceed it. Various scheduling options are employed to determine which policies work best under which circumstances.
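
Whether a given grid size fits in cache can be estimated with simple arithmetic. The sketch below assumes a two-dimensional mesh of double-precision values, a hypothetical number of fields per grid point, and a caller-supplied cache size. None of these figures come from the benchmark or from the hardware specifications, and the function names are illustrative.

```python
def working_set_bytes(nx, ny, fields=4, bytes_per_value=8):
    """Bytes needed to hold an nx-by-ny mesh.

    `fields` (values stored per grid point) is a hypothetical figure.
    """
    return nx * ny * fields * bytes_per_value


def fits_in_cache(nx, ny, n_threads, cache_bytes, fields=4, bytes_per_value=8):
    """True if each thread's equal share of the mesh fits in one cache.

    Ignores cache associativity, line granularity, and data shared
    between threads, so it is only a first-order estimate.
    """
    per_thread = working_set_bytes(nx, ny, fields, bytes_per_value) / n_threads
    return per_thread <= cache_bytes


if __name__ == "__main__":
    cache = 1 << 20  # assumed 1 MiB per-processor data cache
    for n in (128, 256, 512, 1024):
        print(n, [fits_in_cache(n, n, t, cache) for t in (1, 4, 16)])
```

An estimate of this kind shows why a grid that exceeds cache on a few threads may fit again once it is divided among more threads, which can affect measured scaling.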

Investigation with cxperf, pmon, and gpm on the V-Class shows large memory latencies and significant cache misses for particular problem sizes.

The application kernel simulates a highly time-variable work imbalance per grid point. The resulting scaling behavior and performance on the HP V-Class are reasonable compared to the X-Class, provided that the problem size is large enough to yield adequate parallelism and a suitable launch policy is selected.