Caltech Center for Advanced Computing Research » Posts for tag 'hardware'

SHC Software Stack Upgrade – Update

Important Information for SHC Users

As of Sept 8, 2009, SHC has been transitioned to the new software stack (RHEL+OpenIB). There are currently 115 core4 nodes and 65 core8 nodes in production. For more information, please visit the SHC Getting Started / System Guide.

SHC Software Stack Upgrade

Important Information for SHC Users

Over the next couple of days, more backend nodes from shc-a will be transitioned to shc-[c,new]’s cluster of backend nodes, which runs the new software stack. By Sept 4, only 24 shc-a backend nodes will remain; all other compute nodes will be running the new software stack and will be visible from shc-[new,c].

  • Please port your codes to the new software environment if you’ve not already done so!
  • Please report any porting problems you’re having; we’ll help asap.
  • Details on how to rebuild your code for the new SHC environment can be found here
  • Your MPI-based code must be rebuilt for the new and improved SHC software stack.

Preventive Maintenance on Sept 8 from 0800 to 1400 will encompass testing the complete transition of SHC compute and head node resources to the upgraded software stack environment. The fully upgraded production SHC cluster configuration will be two head nodes (shc-[a,b]) and 1180 Opteron compute node cores (163 dual cpu/dual core + 66 dual cpu/quad core).
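The core total quoted above can be checked with a short calculation. This is only an illustrative sketch: the node counts and CPU layouts come from the announcement, while the function and variable names are made up for the example.

```python
# Node counts and CPU layouts are taken from the announcement above.
# All names below are hypothetical, chosen only for this illustration.
SOCKETS_PER_NODE = 2      # every SHC compute node is dual-CPU

DUAL_CORE_NODES = 163     # dual CPU / dual core -> 4 cores per node
QUAD_CORE_NODES = 66      # dual CPU / quad core -> 8 cores per node


def total_cores(nodes: int, cores_per_cpu: int,
                sockets: int = SOCKETS_PER_NODE) -> int:
    """Return the number of cores contributed by a group of identical nodes."""
    return nodes * sockets * cores_per_cpu


dual = total_cores(DUAL_CORE_NODES, cores_per_cpu=2)   # 163 * 2 * 2
quad = total_cores(QUAD_CORE_NODES, cores_per_cpu=4)   # 66 * 2 * 4
total = dual + quad

print(f"dual-core nodes: {dual} cores")
print(f"quad-core nodes: {quad} cores")
print(f"total:           {total} cores")
```

The two groups contribute 652 and 528 cores respectively, which sums to the 1180 cores stated in the announcement.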

Questions or concerns about the upgrade? Just let us know.

SHC Cluster Expansion

CACR’s 163-node Shared Heterogeneous Cluster (SHC) has recently expanded by an additional 20 nodes. Each of these new nodes contains 16 GB of memory and has two quad-core, 2.5 GHz AMD Opteron processors (model 2380). As with the existing SHC nodes, each of the new nodes is connected via Infiniband to CACR’s Infiniband switch.

The SHC provides computing capabilities specifically configured to meet the needs of applications from Caltech’s PSAAP, Turbulent Mixing, Applied and Computational Mathematics, and Numerical Relativity communities. For more information about the SHC, including information for test users of the new nodes, see this page.

CACR’s Shared Heterogeneous Cluster (SHC) Now Online

The nature of financial support for high-end computing resources has evolved given the widespread adoption of Beowulf clusters. Research groups that need computing often obtain funds for clusters as part of their grants. CACR participates in some of these efforts, and supports significant dedicated resources for high-energy physics, astronomy, geophysics, physics-based simulation, and others. Unfortunately, the balkanization of computation by this model has created inefficiencies. The clusters do not take advantage of economies of scale, can be underutilized, and are often poorly administered.

CACR has developed a shared cluster model, and Professors Paul Dimotakis, Dan Meiron, and Kip Thorne have agreed to be pioneer partners in this effort. CACR has purchased a machine optimized for parallel numerical codes that can sustain over 1 trillion floating point operations per second. It consists of 352 2.2 GHz AMD Opteron cores and 700+ gigabytes of memory, all interconnected by an Infiniband networking fabric that can move 160+ gigabytes/s between the compute nodes. The cluster is administered by CACR with funds from the partner groups, and each group has an allocation of time on the machine proportionate to its contribution.

By sharing, the groups get better pricing from vendors, professional systems administration by experienced CACR staff, and the ability to use a much larger machine than each group could afford separately. Some of the partners are also supporting efforts at CACR in visualization and code tuning. The shared cluster model is extremely scalable, and CACR is interested in expanding the machine to increase simulation capability and add support for data-intensive science. Please contact CACR’s Executive Director, Mark Stalzer (stalzer at caltech.edu), for more information.

First Phase of TeraGrid Goes into Production

The first computing systems of the National Science Foundation’s TeraGrid project are in production mode, making 4.5 teraflops of distributed computing power available to scientists across the country who are conducting research in a wide range of disciplines, from astrophysics to environmental science.

The TeraGrid is a multi-year effort to build and deploy the world’s largest, most comprehensive distributed infrastructure for open scientific research. The TeraGrid also offers storage, visualization, database, and data collection capabilities. Hardware at multiple sites across the country is networked through a 40-gigabit per second backplane — the fastest research network on the planet.

The systems currently in production represent the first of two deployments, with the completed TeraGrid scheduled to provide over 20 teraflops of capability. The phase two hardware, which will add more than 11 teraflops of capacity, was installed in December 2003 and is scheduled to be available to the research community this spring.

“We are pleased to see scientific research being conducted on the initial production TeraGrid system,” said Peter Freeman, head of NSF’s Computer and Information Sciences and Engineering directorate. “Leading-edge supercomputing capabilities are essential to the emerging cyberinfrastructure, and the TeraGrid represents NSF’s commitment to providing high-end, innovative resources.”

The TeraGrid sites are: Argonne National Laboratory; the Center for Advanced Computing Research (CACR) at the California Institute of Technology; Indiana University; the National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign; Oak Ridge National Laboratory; the Pittsburgh Supercomputing Center (PSC); Purdue University; the San Diego Supercomputer Center (SDSC) at the University of California, San Diego; and the Texas Advanced Computing Center at The University of Texas at Austin.

“This is an exciting milestone for scientific computing — the TeraGrid is a new concept and there has never been a distributed computing system of its size and scope,” said NCSA interim director Rob Pennington, the TeraGrid site lead for NCSA. “In addition to its immediate value in enabling new science, the TeraGrid project is a tool for the development of a national cyberinfrastructure, and the cooperative relationships forged through this effort provide a framework for future innovation and collaboration.”

“The TeraGrid partners have worked extremely hard during the two-year construction phase of this project and are delighted that this initial phase of what will be an unprecedented level of computing and data resources is now online for the nation’s researchers to use,” said Fran Berman, SDSC director and co-principal investigator of the TeraGrid project. “The TeraGrid is one of the foundations of cyberinfrastructure that will provide even more computing resources later this year.”

The computing systems that entered production this month consist of more than 800 Itanium-family IBM processors running Linux. NCSA maintains a 2.7-teraflop cluster, which was installed in spring 2003, and SDSC has a 1.3-teraflop cluster. The 6-teraflop, 3,000-processor HP AlphaServerSC Terascale Computing System (TCS) at PSC is also a component of the TeraGrid infrastructure.

“The launch of the National Science Foundation’s TeraGrid project provides scientists and researchers across the nation with access to unprecedented computational power,” said David Turek, vice president of Deep Computing with IBM. “Working with the NSF, IBM is committed to the continued development of breakthrough Grid technologies that benefit our scientific/technical and commercial customers.”

Allocations for use of the TeraGrid were awarded by the NSF’s Partnerships for Advanced Computational Infrastructure (PACI) last October. Among the first wave of researchers to use the TeraGrid are scientists studying the evolution of the universe and the cleanup of contaminated groundwater, simulating seismic events, and analyzing biomolecular dynamics.

The allocations awarded included one for Caltech physicist Harvey Newman. Newman leads a team of investigators who are developing codes for CERN’s Compact Muon Solenoid (CMS) collaboration. The CMS experiment will begin operation at the Large Hadron Collider (LHC) in 2007. The Caltech team’s planned use of the TeraGrid will be a valuable and possibly critical factor in the success of several planned “Data Challenges” for CMS. These Challenges are designed to test the readiness of the global Grid-enabled computing system being developed for the experiment, in collaboration with partner projects such as PPDG, GriPhyN, iVDGL, DataTAG, LCG, and others. The TeraGrid will further a program of developing optimized search strategies for the Higgs particles, thought to be responsible for mass in the Universe, and for supersymmetry, as well as investigating new physics processes beyond the Standard Model of particle physics.

To learn more about the TeraGrid, go to www.teragrid.org