System Guides: Shared Heterogeneous Cluster
1. How to Get An Account
http://www.cacr.caltech.edu/resources/accounts/
2. How to Access the Machine
Connect to the front end, shc.cacr.caltech.edu:
ssh -l username shc.cacr.caltech.edu
You should edit, compile, build and submit your compute node jobs on the front end.
NOTE: password logins to the head node of the Opteron cluster are not allowed; access is permitted only via ssh public keys. If you do not have a public key, please see http://www.cacr.caltech.edu/resources/sshkey-instructions.cfm.
3. System Configuration Information / Technical Summary
The Shared Heterogeneous Cluster provides computing capabilities specifically configured to meet the needs of applications from Caltech's ASC, Turbulent Mixing, and Numerical Relativity communities. The configuration, integrated by Hewlett-Packard and CACR's technical support team, consists of 163 dual-processor, dual-core AMD Opteron nodes, connected via Voltaire's InfiniBand and an Extreme Networks BlackDiamond 8810 GigE switch. The typical programming model used for SHC jobs is MPI, with flexible queue policies allowing for development and production runs.
Cluster Configuration Overview Diagram (PDF file)
Technical Summary:
| Architecture |
|
| Head Nodes |
|
| Compute Nodes |
|
| Processor |
|
| Network Interconnect |
|
| Disk |
|
| Operating System |
|
| Compilers |
|
| Batch System |
|
| MPI |
|
4. Available File Systems and Descriptions/Intended Usage
5. Compiling/Preparing to Run
Available Compilers:
Available MPIs:
/usr/local/bin/pkgs lists the packages available on shc. Packages are commonly used items on the software stack that need environment variables and paths updated before they can be used properly. Available packages are:
To set up your environment for a particular package, run use pkgname. For example, to use the PathScale compilers, run
use pathscale
which will update your PATH, LD_LIBRARY_PATH, and MANPATH and set the PathScale environment variable to the appropriate installation base directory. To drop a package from your environment, run drop pkgname. For example, to drop the PGI compilers from your environment, run drop pgi.
6. Supported Debuggers and Debugging Tips
7. How to Launch and Manage Parallel Jobs
Interactive jobs on the compute nodes are initiated from the head node via qsub -I -l nodes=X -l walltime=HH:MM:SS, where X is the number of nodes you want for the specified wall time. Batch jobs are submitted via qsub sample.pbs, where sample.pbs looks like:
#!/bin/csh -f
# ask for 2 nodes for 1 hour
#PBS -l nodes=2
#PBS -l walltime=01:00:00
#
# Direct stdout/err as desired
#PBS -o /home/sharon/examples/hello_out
#PBS -e /home/sharon/examples/hello_err
#
# Let's make sure each node can load the executable
/usr/bin/ldd $HOME/examples/hello
# cd to the executable area if desired or launch with absolute path
# Do an 8 way hello run
/usr/local/openmpi/bin/mpirun -np 8 $HOME/examples/hello
Submitting a job to the weekend queue (24 hour runtime max) requires -q weekend as an option to qsub or a #PBS -q weekend line in your batch script.
SHC has two types of compute nodes, Opteron 275 and Opteron 280. There are eighty-six 275s running at 2.2GHz and seventy-seven 280s running at 2.4GHz.
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_9240,00.html
Submitting a run to use the 2.4GHz nodes requires the tag opteron280 in the resource spec. For example, to request four 2.4GHz nodes, do
#PBS -l nodes=4:opteron280
Specifically requesting the slower 2.2GHz nodes is done via
#PBS -l nodes=4:opteron275
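The resource requests above can be assembled mechanically. The following is a minimal sketch, written for a POSIX shell rather than csh, of a helper that prints a batch script for a given node count, walltime, executable, and optional node tag. The function name make_pbs is hypothetical and not part of the shc software stack. It assumes 4 MPI processes per node (two dual-core processors), which matches the 2-node, 8-way sample run.

```shell
#!/bin/sh
# Hypothetical helper, not an shc-provided tool: print a PBS batch script.
# Assumption: 4 cores per node (dual-core, dual-processor Opterons), so the
# MPI process count is nodes * 4, as in the 2-node, 8-way sample script.
make_pbs() {
    nodes=$1          # number of nodes, e.g. 4
    walltime=$2       # HH:MM:SS
    exe=$3            # absolute path to the MPI executable
    nodetype=${4:-}   # optional tag: opteron275 or opteron280

    spec="nodes=$nodes"
    if [ -n "$nodetype" ]; then
        spec="$spec:$nodetype"
    fi
    np=$((nodes * 4))

    cat <<EOF
#!/bin/csh -f
#PBS -l $spec
#PBS -l walltime=$walltime
/usr/local/openmpi/bin/mpirun -np $np $exe
EOF
}
```

For example, make_pbs 4 01:00:00 /path/to/hello opteron280 > run.pbs followed by qsub run.pbs would request four 2.4GHz nodes for one hour and launch a 16-way run.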
qstat -Q -f gives a detailed description of all the shc queues.
Wrappers for compiling MPI codes
7a. Commonly Used Job Scheduler Commands
When monitoring jobs with qstat, look at "Elap Time" (elapsed time) rather than "Time Use". This is because "Elap Time" is the time since the job started, while "Time Use" is the CPU time used by the user process; this number is usually zero or close to it, since it counts the script that actually launches the MPI job, not the job itself.
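To pull just the elapsed time out of the qstat listing, a small filter can help. This is a sketch only: the function name running_elapsed is hypothetical, and it assumes the usual PBS/Torque qstat -a layout of 11 whitespace-separated columns per job, with the job state in column 10 and "Elap Time" in column 11. Check the header printed on shc before relying on it.

```shell
#!/bin/sh
# Hypothetical filter: read `qstat -a` output on stdin and print the job ID
# and elapsed time of each running job.
# Assumption: each job line has 11 columns, state in column 10 ("R" means
# running) and Elap Time in column 11. Header lines do not match and are skipped.
running_elapsed() {
    awk 'NF == 11 && $10 == "R" { print $1, $11 }'
}
```

On the head node this could be used as qstat -a | running_elapsed.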
8. Getting Help/Communicating with Other Users/Staff
Send mail to shc-support at cacr.caltech.edu. CACR technical support is available during standard business hours (M-F, 8am to 5pm); after hours responses are as time permits.
9. Expectations About System Down Times/News
Scheduled Preventive Maintenance (PM) is Monday from 0900 to 1300. CACR Operations will not always take the full PM for systems work. If you would like to schedule benchmarking or test runs with dedicated access to the entire system, asking shc-support for a portion of PM time is fine and encouraged.
News about operational changes (e.g. system software upgrades, file system policy changes, dedicated runs) will be posted using news. News items posted since your last login to the head node will be displayed automatically.
Using news:
10. Known Problems
pbsnodes -a reports nodes only as free or down. Allocated nodes appear free, so look at the job field: allocated nodes have a job ID assigned there and free nodes do not.
qstat -a allows you to monitor your job in a basic way by watching the Elapsed Time field grow with time. qstat with no arguments doesn't report a useful walltime/cputime for jobs in the Time Use field, because the time reported is the negligible time consumed by the script which launches the MPI job, not the MPI job itself.
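Since the node state cannot be trusted, the job field can be checked with a short filter. This is a sketch under assumptions: the function name allocated_nodes is hypothetical, and it assumes the usual pbsnodes -a layout, where each node name starts in the first column and its attributes follow on indented lines, with assigned jobs printed as a line of the form "jobs = ...".

```shell
#!/bin/sh
# Hypothetical filter: read `pbsnodes -a` output on stdin and print the names
# of nodes that have at least one job assigned.
# Assumption: node names are unindented, attributes are indented, and assigned
# jobs appear on an attribute line of the form "jobs = <job list>".
allocated_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                             # unindented line: node name
        $1 == "jobs" && $2 == "=" && NF > 2 { print node }  # node has a job assigned
    '
}
```

On the head node this could be used as pbsnodes -a | allocated_nodes.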
11. Miscellaneous System Software
11a. System Monitoring
Monitoring activity on shc (as a whole or at a somewhat finer-grained level) can be done using Ganglia, an open-source distributed monitoring system for clusters. Just point your browser at http://shc.cacr.caltech.edu/ganglia/ to see an overview of system-wide and node-specific activity.
Single core Linpack, 3.811 Gflops (86% peak)
File system policies will regularly be reviewed. When 70% utilization is realized on pvfs, users will be warned to clean up.
All users are expected to adhere to the CACR Computing Policies.
Jobs are scheduled according to "weight." Many factors are taken into consideration when determining a job's weight, including cpu time consumed recently by user/group, time spent waiting in the queue, priority of the group, runtime and node count being requested, etc.
The intent of FairShare job scheduling is to prevent a user or group from dominating compute resources - striking a balance between utilization of cpu resources, job throughput, and fair scheduling between projects and project members.
Current fair share reports can be obtained by executing diagnose -f from an shc head node.
Notice that there are 8 columns, numbered 0 to 7. If a user's entry in one of these columns shows a value other than "---", that user consumed cpu time that many days ago, from today (column 0) to a week ago (column 7). Consumption a week ago (column 7) is weighted less heavily (0.4783) than consumption today (column 0). The value listed reflects job duration, job size and when the job ran.
Notice the column with the "%" heading. This shows the "penalty" or weight associated with a given user and/or group. If a value > 8.00 (the Target value) appears by a user, newly submitted jobs from that user will be "penalized" due to recent consumption by the user or the user's group, allowing queued jobs from users with lower "%" values to run sooner.
Groups have different priorities - note that vtf3d and tmx have 2x priority over sxs, but as consumption by tmx or vtf3d reaches the target, jobs from these groups decrease in priority so group domination of the cpu resources does not generally occur.
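The 0.4783 weight for column 7 is consistent with a decay factor of 0.9 per day, since 0.9 to the 7th power is 0.4783 to four places. The decay factor is an inference from that one number, not a documented shc setting. Under that assumption, the per-column weights can be tabulated with a short sketch (the function name fairshare_weights is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print the weight applied to consumption from 0 to 7
# days ago, one "days-ago weight" pair per line.
# Assumption: weights decay by a factor of 0.9 per day, inferred from the
# column-7 weight of 0.4783 (0.9^7 = 0.4782969).
fairshare_weights() {
    awk 'BEGIN { for (d = 0; d <= 7; d++) printf "%d %.4f\n", d, 0.9 ^ d }'
}
```

Running fairshare_weights prints 8 lines, starting with a weight of 1.0000 for today and ending with 0.4783 for a week ago.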
Queues have different priorities and policies. When special runs are requested, these jobs are placed in the system queue and given highest priority. The system queue is typically used for long jobs prior to a conference with an upcoming deadline, etc. Jobs submitted to the weekend queue have priority on weekends, over the dque, but the weekend queue is only active (running jobs) Friday 1700 to Monday 0800. Holidays extend weekend queue active hours.
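The weekend queue's active window can be checked before deciding whether a job submitted with -q weekend will start soon. This is a sketch only: the function name weekend_queue_active is hypothetical, it uses the local clock, and it ignores the holiday extension mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the weekend queue is in its
# active window, Friday 1700 to Monday 0800. Holidays are NOT handled.
weekend_queue_active() {
    day=$1             # ISO weekday from `date +%u`: 1 = Monday ... 7 = Sunday
    hour=$2            # hour of day from `date +%H`: 00 to 23
    hour=${hour#0}     # drop one leading zero so 08 compares as 8
    hour=${hour:-0}    # an input of "0" becomes empty above; restore it
    case $day in
        5)   [ "$hour" -ge 17 ] ;;   # Friday, from 1700 on
        6|7) return 0 ;;             # all of Saturday and Sunday
        1)   [ "$hour" -lt 8 ] ;;    # Monday, until 0800
        *)   return 1 ;;             # Tuesday through Thursday
    esac
}
```

For example, weekend_queue_active "$(date +%u)" "$(date +%H)" && echo "weekend queue is active" reports the state at the current time.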
[ last updated 3 November 2008]