System Guides: Shared Heterogeneous Cluster

1. How to Get An Account

http://www.cacr.caltech.edu/resources/accounts/

2.  How to Access the Machine

Connect to the front end, shc.cacr.caltech.edu:

ssh -l username shc.cacr.caltech.edu

You should edit, compile, build and submit your compute node jobs on the front end.

NOTE: Password logins to the head node of the Opteron cluster are not allowed; access is permitted only via ssh public keys. If you do not have a public key, please see http://www.cacr.caltech.edu/resources/sshkey-instructions.cfm.

3.  System Configuration Information / Technical Summary

The Shared Heterogeneous Cluster provides computing capabilities specifically configured to meet the needs of applications from Caltech's ASC, Turbulent Mixing, and Numerical Relativity communities. The configuration, integrated by Hewlett-Packard and CACR's technical support team, consists of 163 dual-processor, dual-core AMD Opteron nodes connected via Voltaire's InfiniBand and an Extreme Networks BlackDiamond 8810 GigE switch. The typical programming model used for SHC jobs is MPI, with flexible queue policies allowing for development and production runs.

Cluster Configuration Overview Diagram (PDF file)

Technical Summary:

Architecture
  • Opteron Linux Cluster
Head Nodes
  • 8 processors, single core
  • 16 GB ECC SDRAM memory
  • 2 nodes
Compute Nodes
  • dual-processor, dual core
  • 8 GB ECC SDRAM memory
  • 162 nodes (648 cores)
Processor
  • 86 AMD Opteron 275, 2.2 GHz
  • 77 AMD Opteron 280, 2.4 GHz
Network Interconnect
  • PCI-x InfiniBand
  • Copper Gigabit Ethernet
Disk
  • ~84 TB raw, RAID6 nfs project work area /shc/datastore-[01,02,03]
  • 180 GB local scratch/node
Operating System
  • Linux 2.6.15.6 (SuSE SLES 9.0)
Compilers
  • GNU 3.4.5, 4.2.[1,2]: Fortran77, C, C++
  • PGI 6.2, 7.[0,1]
  • Pathscale 2.[4,5]
Batch System
  • Torque with Maui
MPI
  • Open MPI, MPICH

4.  Available File Systems and Descriptions/Intended Usage

5.  Compiling/Preparing to Run

Available Compilers:

Available MPIs:

Useful "home grown" commands for accessing the compilers and MPI utilities:

/usr/local/bin/pkgs  lists available packages on shc. Packages are commonly used items in the software stack that need environment variables and paths updated in order to be used properly. Available packages are:

To set your environment to use a particular package, run use pkgname. For example, to use the PathScale compilers, run

use pathscale

which will update your PATH, LD_LIBRARY_PATH, and MANPATH and set the PathScale environment variable to the appropriate installation base directory. To drop a package from your environment, run  drop pkgname. For example, to drop the PGI compilers from your environment, run  drop pgi
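
As a rough illustration of what use does behind the scenes, the sketch below prepends a package's directories to the usual search paths. The function name use_sketch, the variable PKG_HOME, and the base directory /opt/pathscale are hypothetical stand-ins, not the actual shc implementation or directory layout.

```shell
# Hypothetical sketch of a "use"-style helper; not the real shc command.
use_sketch() {
    base=$1                      # package installation base directory
    PKG_HOME=$base               # stand-in for the package's own variable
    PATH="$base/bin:$PATH"
    # Append the old value only if it was non-empty, avoiding a stray ":".
    LD_LIBRARY_PATH="$base/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    # A trailing ":" (when MANPATH was empty) keeps the system default man path.
    MANPATH="$base/man:${MANPATH:-}"
    export PKG_HOME PATH LD_LIBRARY_PATH MANPATH
}

use_sketch /opt/pathscale        # assumed location, for illustration only
echo "$PATH" | cut -d: -f1       # shows the first PATH entry after the update
```

Because the changes are made to exported variables in the current shell, they last only for that login session, which is why a matching drop operation is useful.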

6.  Supported Debuggers and Debugging Tips

7. How to Launch and Manage Parallel Jobs

Interactive jobs on the compute nodes are initiated from the head node via qsub -I -l nodes=X -l walltime=HH:MM:SS, where X is the number of nodes you want for the specified wall time. Batch jobs are submitted via qsub sample.pbs, where sample.pbs looks like:
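
A minimal sketch of such a script is written out below with a here-document. The job name, the ppn=4 setting (one process per core on a dual-processor, dual-core node), the process count, and the executable name my_mpi_app are assumptions for illustration; adjust them to your job.

```shell
# Write an illustrative batch script; every value here is an example.
cat > sample.pbs <<'EOF'
#!/bin/sh
#PBS -N sample_job
#PBS -l nodes=4:ppn=4
#PBS -l walltime=01:00:00
#PBS -j oe

# Torque starts the script in $HOME; move to the submission directory.
cd "$PBS_O_WORKDIR"

# 4 nodes x 4 cores = 16 MPI processes; my_mpi_app is a placeholder name.
mpirun -np 16 -machinefile "$PBS_NODEFILE" ./my_mpi_app
EOF

# Submit from the head node with: qsub sample.pbs
```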

Submitting a job to the weekend queue (24 hour runtime max) requires -q weekend as an option to qsub or a #PBS -q weekend line in your batch script.

SHC has two types of compute nodes, Opteron 275 and 280. There are eighty-six 275s running at 2.2 GHz and seventy-seven 280s running at 2.4 GHz.
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_9240,00.html

Submitting a run to use the 2.4 GHz nodes requires the tag opteron280 in the resource spec. For example, to request four 2.4 GHz nodes, use

#PBS -l nodes=4:opteron280

Specifically requesting the slower 2.2GHz nodes is done via

#PBS -l nodes=4:opteron275
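
To see how the node count, processor tag, and wall time combine on one command line, the hypothetical helper below prints an interactive qsub invocation without running it. The function name shc_qsub_cmd is an invention for this sketch; only the qsub options and the opteron275/opteron280 tags come from this guide.

```shell
# Hypothetical dry-run helper: prints the command, never calls qsub.
# Usage: shc_qsub_cmd NODES WALLTIME [TAG]
shc_qsub_cmd() {
    spec="nodes=$1"
    # Append the processor tag (opteron275 or opteron280) when one is given.
    if [ -n "${3:-}" ]; then
        spec="$spec:$3"
    fi
    echo "qsub -I -l $spec -l walltime=$2"
}

shc_qsub_cmd 4 02:00:00 opteron280   # four 2.4 GHz nodes for two hours
shc_qsub_cmd 2 00:30:00              # two nodes of either type
```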

qstat -Q -f gives a detailed description of all the shc queues.

Wrappers for compiling MPI codes

7a. Commonly Used Job Scheduler Commands

8. Getting Help/Communicating with Other Users/Staff

Send mail to shc-support at cacr.caltech.edu. CACR technical support is available during standard business hours (M-F, 8am to 5pm); after hours responses are as time permits.

9. Expectations About System Down Times/News

Scheduled Preventive Maintenance (PM) is Monday from 0900 to 1300. CACR Operations will not always take the full PM for systems work. If you would like to schedule benchmarking or test runs with dedicated access to the entire system, asking shc-support for a portion of PM time is fine and encouraged.

News about operational changes (e.g. system software upgrades, file system policy changes, dedicated runs) will be posted using news. New news items since last ssh'ing to the head node will be displayed automatically.

Using news:

10. Known Problems

pbsnodes -a   reports nodes only as free or down. Allocated nodes also appear as free, so check the jobs field: allocated nodes have a job id assigned there, and free nodes do not.
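
One way to pick out the allocated nodes is to filter the pbsnodes -a output for entries that carry a jobs line. The sketch below assumes the usual Torque layout (node name at the start of a line, indented "key = value" attributes beneath it); the function name allocated_nodes is hypothetical.

```shell
# Hypothetical filter: reads pbsnodes -a output on stdin and prints the
# names of nodes that have a "jobs = ..." attribute.
allocated_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                     # unindented line: a node name
        $1 == "jobs" && $2 == "=" { print node }    # this node has a job assigned
    '
}

# On the head node: pbsnodes -a | allocated_nodes
```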

qstat -a  lets you monitor your job in a basic way by watching the Elapsed Time field grow. qstat with no arguments doesn't report a useful walltime/cputime in the Time Use field, because the time reported is the negligible time consumed by the script that launches the MPI job, not the MPI job itself.

11. Miscellaneous System Software

11a. System Monitoring

Monitoring activity on shc (as a whole or at a finer-grained, per-node level) can be done using Ganglia, an open source distributed monitoring system for clusters. Just point your browser at http://shc.cacr.caltech.edu/ganglia/ to see an overview of system-wide and node-specific activity.

12. Performance

Single core Linpack, 3.811 Gflops (86% peak)
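
The percentage can be roughly reproduced with a short calculation. This sketch assumes a peak of 2 double-precision floating-point operations per cycle per core and the 2.2 GHz clock of the Opteron 275; neither assumption is stated in this guide, and the result differs slightly from the rounded figure above. The function name linpack_efficiency is hypothetical.

```shell
# Usage: linpack_efficiency MEASURED_GFLOPS CLOCK_GHZ FLOPS_PER_CYCLE
# Prints the measured rate as a percentage of the assumed theoretical peak.
linpack_efficiency() {
    awk -v m="$1" -v ghz="$2" -v fpc="$3" \
        'BEGIN { printf "%.1f\n", 100 * m / (ghz * fpc) }'
}

linpack_efficiency 3.811 2.2 2   # assumed 4.4 Gflops single-core peak
```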

13. Policies

File system policies will regularly be reviewed. When 70% utilization is realized on pvfs, users will be warned to clean up.

All users are expected to adhere to the CACR Computing Policies.

14. Accounting and Job Priority Policies

Jobs are scheduled according to "weight." Many factors are taken into consideration when determining a job's weight, including cpu time consumed recently by user/group, time spent waiting in the queue, priority of the group, runtime and node count being requested, etc.

The intent of FairShare job scheduling is to prevent a user or group from dominating compute resources - striking a balance between utilization of cpu resources, job throughput, and fair scheduling between projects and project members.

Current fair share reports can be obtained by running diagnose -f from an shc head node.

Notice that there are 8 columns, numbered 0 to 7. If an entry associated with a user in these columns shows a value other than "---", there was cpu consumption X days ago, where X is the column number (0 to 7). Consumption a week ago (column 7) is weighted less heavily (0.4783) than consumption today (column 0). The value listed reflects job duration, job size and when the job ran.
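
The 0.4783 weight is consistent with a per-day decay factor of 0.9 applied seven times (0.9^7 is approximately 0.4783). The sketch below tabulates the weights under that assumption; the 0.9 factor is inferred from the number above, not taken from the shc scheduler configuration, and the function name fs_weights is hypothetical.

```shell
# Print "column weight" pairs, assuming weight = decay^(days ago).
# Usage: fs_weights DECAY LAST_COLUMN
fs_weights() {
    awk -v decay="$1" -v last="$2" 'BEGIN {
        for (day = 0; day <= last; day++)
            printf "%d %.4f\n", day, decay ^ day
    }'
}

fs_weights 0.9 7    # columns 0 through 7
```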

Notice the column with the "%" heading. This shows the "penalty" or weight associated with a given user and/or group. If a value > 8.00 (the target value) appears by a user, newly submitted jobs from that user will be "penalized" due to recent consumption by the user or the user's group, allowing queued jobs from users with lower "%" values to run sooner.

Groups have different priorities - note that vtf3d and tmx have 2x priority over sxs, but as consumption by tmx or vtf3d reaches the target, jobs from these groups decrease in priority, so group domination of the cpu resources does not generally occur.

Queues have different priorities and policies. When special runs are requested, these jobs are placed in the system queue and given highest priority. The system queue is typically used for long jobs ahead of a conference or other upcoming deadline. Jobs submitted to the weekend queue have priority over the dque on weekends, but the weekend queue is only active (running jobs) from Friday 1700 to Monday 0800. Holidays extend weekend queue active hours.
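
The weekend queue window described above can be expressed as a small check. This is a sketch only: it ignores the holiday extensions, and the function name weekend_queue_active is hypothetical. The day is numbered as in date +%u (1 = Monday through 7 = Sunday) and the hour runs from 0 to 23.

```shell
# Succeeds (exit 0) when the weekend queue would be running jobs:
# Friday 1700 through Monday 0800. Holidays are not modeled.
# Usage: weekend_queue_active DAY_OF_WEEK HOUR
weekend_queue_active() {
    dow=$1
    hour=${2#0}                     # drop one leading zero, e.g. "08" -> "8"
    case "$dow" in
        5) [ "$hour" -ge 17 ] ;;    # Friday: from 1700 on
        6|7) true ;;                # Saturday, Sunday: all day
        1) [ "$hour" -lt 8 ] ;;     # Monday: until 0800
        *) false ;;                 # Tuesday through Thursday
    esac
}

# Example: check the current time.
if weekend_queue_active "$(date +%u)" "$(date +%H)"; then
    echo "weekend queue is active"
else
    echo "weekend queue is idle"
fi
```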

 

[ last updated 3 November 2008]