System Guides: SHC
1. How to Get An Account
2. How to Access the Machine
Connect to the front end, shc.cacr.caltech.edu:
ssh -l username shc.cacr.caltech.edu
Edits, compiles, builds, and job submissions are done on the front end.
NOTE: Password login to the head node is not allowed; access is permitted only via ssh public keys. If you do not have a public key, please see the SSH key generation instructions.
3. System Configuration Information / Technical Summary
The Shared Heterogeneous Cluster provides computing capabilities specifically configured to meet the needs of applications from Caltech’s Predictive Science Academic Alliance Program, Turbulent Mixing, Applied and Computational Mathematics, and Numerical Relativity communities.
The configuration, integrated by Hewlett-Packard and CACR’s technical support team, consists of 203 AMD dual processor nodes, connected via Voltaire’s InfiniBand and an Extreme Networks BlackDiamond 8810 GigE switch.
The typical programming model used for SHC jobs is MPI, with flexible queue policies allowing for development and production runs.
Cluster Configuration Overview Diagram (coming soon)
Technical Summary:
| Architecture |
|
| Head Nodes |
|
| Compute Nodes |
|
| Processor |
|
| Network Interconnect |
|
| Disk |
|
| Operating System |
|
| Compilers |
|
| Batch System |
|
| MPI |
|
4. Available File Systems and Descriptions/Intended Usage
- nfs mounted project work area, /nfs/ds[01,02,03,04]/project-id, seen globally. RAID6, not backed up
- psaap project users have a dedicated nfs mounted project area, /nfs/ds02/psaap/username
- sxs project users have their own dedicated nfs mounted project space: /nfs/ds0[3,4]/sxs/username and /nfs/ds0[3,4]/sxs_bbhdata for group readable data sets
- a local home – seen globally and backed up daily.
- shc-storage01, hosting /nfs/ds01/, is a dual processor, dual core AMD Opteron 870 (2 GHz) system with 16 GB of memory. A 16-port Areca controller hosts an 8 TB (raw) RAID6 disk array. A 24-port Areca controller hosts 24×750 GB disks.
- shc-storage02, hosting /nfs/ds02, is a dual core AMD Opteron 875 (2.2 GHz) system with 16 GB of memory and 40×750 GB disks.
- shc-storage03, hosting /nfs/ds03, is a dual processor, quad core AMD Opteron 8350 (2 GHz) system with 16 GB of memory and 40×750 GB SATA disks.
- Each compute node has a local scratch disk (~180 GB) which is freshly purged when the scheduler gives you access to the nodes. Local scratch is visible only to the compute node controlling the disk.
- Local scratch is not backed up.
5. Compiling/Preparing to Run
Available Compilers:
- PGI: pgcc, pgCC, pgf90, pgf77
- PathScale: pathcc, pathCC, pathf90
- Gnu: gcc, g++, g77
- install root = /usr/bin
Available MPIs:
- OpenMPI-1.3.3, Open MPI
/usr/local/bin/pkgs lists the packages available on shc. Packages are commonly used items in the software stack that need environment variables and paths updated in order to be used properly. Available packages are:
- ATLAS-3.9.14
- comsol
- dakota-4.2
- fftw3_3.2.1
- gcc-4.3.3
- gnuplot_4.2.[4,5]
- grace-5.1.22
- HDF[4,5.1.8.[1,2,3]]
- ls-dyna – The LS-DYNA package
- matlab-R2009a
- mpiP
- openmpi_1.3.3
- Paraview-3.4.0
- pathscale_3.[1,2]
- petsc-2.3.[1.3]
- pgi_[8,9].0
- python-2.6.2 (for users who need numpy, scipy, matplotlib and mpi4py)
- python-3.0.1
- python-3.1
- qt-4.[3.5,4.3,5.2]
- tecplot-2009
- totalview_8.6
- vtk-5.2.1
- use – -list (shows available packages, equivalent to “pkgs”)
- use – -what (shows what packages you are using, equivalent to “using”)
In order to set your environment to use a particular package, do use pkgname. For example, to use the PathScale compilers do
use pathscale
which will update your PATH, LD_LIBRARY_PATH, and MANPATH and set the PathScale environment variables appropriately. To drop a package from your environment, do drop pkgname. For example, to drop the PGI compilers from your environment do
drop pgi
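After a use or drop, a quick way to confirm the effect is to check whether the package's tool is reachable on PATH. The sketch below is an illustration only; have_tool is a hypothetical helper name, not part of the shc package system, and it is written for a Bourne-style shell rather than csh.

```shell
#!/bin/sh
# have_tool is a hypothetical helper, not an shc command.
# Succeeds if the named tool can be found on the current PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Typical check after "use pathscale" (pathcc is listed under Available Compilers):
#   have_tool pathcc || echo "pathcc not found; try: use pathscale"
```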
6. Supported Debuggers and Debugging Tips
- pathdb (PathScale Debugger)
- pgdebug (PGI Debugger)
- libefence.a – electric fence library for boundary violation reporting
- valgrind – To use valgrind with MPI jobs, do [l,h]mpirun -n X /usr/bin/valgrind --log-file=memlog a.out where memlog is a name of your choosing; you will see a memlog.pid file for each task with information about memory and pointer usage.
- TotalView Documentation
- Examples are located in /usr/local/TotalView/linux-x86-64/examples
- 32 processor, single user license
- To run your MPI application under TotalView:
- compile and link your code with -g
- add -tv to the mpirun argument list
- Example requesting 4 interactive nodes for 2 hrs, including X11 forwarding, then running a 16 way a.out executable:
qsub -I -l nodes=4 -l walltime=2:0:0 -X
mpirun -tv -np 16 ./a.out
7. How to Launch and Manage Parallel Jobs
Interactive jobs on the compute nodes are initiated from the head node via
qsub -I -l nodes=X:core4 -l walltime=HH:MM:SS
where X is the number of nodes you want for the specified wall time. Batch jobs are submitted via
qsub sample.pbs
where sample.pbs looks like:
#!/bin/csh -f
# ask for 2 core4 nodes for 1 hour
#PBS -l nodes=2:core4
#PBS -l walltime=01:00:00
#
# Direct stdout/err as desired
#PBS -o /home/sharon/examples/hello_out
#PBS -e /home/sharon/examples/hello_err
#
# Let's make sure each node can load the executable
/usr/bin/ldd $HOME/examples/hello
# cd to the executable area if desired or launch with absolute path
# Do an 8 way hello run
/usr/local/openmpi/bin/mpirun -np 8 $HOME/examples/hello
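The sample script hard-codes the process count (-np 8) to match 2 core4 nodes. A possible alternative is to derive the count from the node file that PBS writes for each job. This is a minimal sketch that assumes $PBS_NODEFILE contains one line per allocated core, which is the usual PBS/Torque behavior but is not confirmed for shc in this guide; count_ranks is a hypothetical helper name, and the sketch uses Bourne shell syntax rather than csh.

```shell
#!/bin/sh
# count_ranks is a hypothetical helper, not an shc command.
# Assumption: the node file lists one line per allocated core
# (usual PBS/Torque behavior, not confirmed for shc here).
count_ranks() {
    # $1: path to a node file; blank lines are not counted
    grep -c . "$1"
}

# Inside a job script this could replace the hard-coded "-np 8":
#   NP=$(count_ranks "$PBS_NODEFILE")
#   /usr/local/openmpi/bin/mpirun -np "$NP" "$HOME/examples/hello"
```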
SHC has three types of compute nodes: Opteron 275, 280, and 2380. There are eighty-six 275s running at 2.2 GHz, seventy-seven 280s running at 2.4 GHz, and sixty-six 2380s running at 2.5 GHz.
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_9240,00.html
- Short (<2 hour) jobs are given scheduling priority on 6 “core4” nodes Mon-Fri, 8am to 6pm.
- Available queues on shc are “productionQ”, “weekdayQ”, “weekendQ”, and “dedicatedQ”. The philosophy is that you generally should not have to specify a queue. Instead, specify the resources you need and your job will be routed to the proper queue.
- To accommodate dual processor, quad core nodes as well as dual processor, dual core nodes in the shc environment, resource specification is as follows:
-l nodes=NN[:<type>][:ppnval][+MM[:<type>][:ppnval]]...
| NN, MM | Number of nodes of the specified type required by the job; default is 1. |
| type | Node type: “core8” for eight-core nodes and “core4” for four-core nodes; default value is “core4”. |
| ppnval | Number of processes that will be run on each node, specified as “ppn=K”, where “K” is a value not exceeding the core count; default value is the number of cores the allocated nodes have. |
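As a sanity check on a resource request, the total process count it implies can be worked out from the defaults in the table. The sketch below is an illustration only: nodes_to_ranks is a hypothetical helper, not an shc command, and it applies just the three rules stated above (1 node, type core4, ppn equal to the core count).

```shell
#!/bin/sh
# nodes_to_ranks is a hypothetical helper, not an shc command.
# Prints the total process count implied by a "-l nodes=" value,
# using the documented defaults: 1 node, core4, ppn = core count.
nodes_to_ranks() {
    echo "$1" | awk '{
        nseg = split($0, seg, "[+]")
        if (nseg == 0) { nseg = 1; seg[1] = "" }   # empty spec: all defaults
        total = 0
        for (i = 1; i <= nseg; i++) {
            count = 1; cores = 4; ppn = 0
            npart = split(seg[i], part, ":")
            for (j = 1; j <= npart; j++) {
                if (part[j] ~ /^[0-9]+$/) count = part[j] + 0
                else if (part[j] == "core8") cores = 8
                else if (part[j] == "core4") cores = 4
                else if (part[j] ~ /^ppn=[0-9]+$/) ppn = substr(part[j], 5) + 0
            }
            if (ppn == 0) ppn = cores
            total += count * ppn
        }
        print total
    }'
}
```

For example, nodes_to_ranks 6:core8 prints 48 and nodes_to_ranks 2:core4 prints 8, matching the process count used in sample.pbs.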
Jobs submitted with a runtime of > 12 hours will automatically be routed to the weekendQ.
dedicatedQ runs need special approval.
weekdayQ runs are M-F, 0800 to 1700.
weekendQ runs are F-M, 1700 to 0800. Caltech holidays extend the active window for the weekendQ.
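To see in advance whether a request will be held for the weekendQ, the requested walltime can be compared against the 12 hour threshold. This is a rough sketch with hypothetical helper names (walltime_seconds, routes_to_weekendQ); it encodes only the single rule stated above and does not model any other routing policy the scheduler may apply.

```shell
#!/bin/sh
# Hypothetical helpers, not shc commands.
# Convert a walltime of the form [[HH:]MM:]SS to seconds.
walltime_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Succeeds when the request exceeds 12 hours (43200 seconds),
# the threshold stated for automatic routing to the weekendQ.
routes_to_weekendQ() {
    [ "$(walltime_seconds "$1")" -gt 43200 ]
}

# Example:
#   routes_to_weekendQ 18:00:00 && echo "this job will wait for the weekendQ"
```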
Job submission on shc examples:
A standard, single-node, 4 way MPI executable which needs to run
for 30 minutes on a four-core node:
qsub -l walltime=00:30:00 -l nodes=1:core4:ppn=4 jobA
qsub -l walltime=00:30:00 -l nodes=1:core4 jobA
qsub jobA
*all of the above are equivalent
Equivalent ways to submit a six-node, 48 way, jobB which needs to
run for 1.5 hours on eight-core nodes:
qsub -l walltime=1:30:00 -l nodes=6:core8:ppn=8 -q productionQ jobB
qsub -l walltime=1:30:00,nodes=6:core8 jobB
Equivalent ways to submit a job that should run on a mixture of
node types (three eight-core nodes plus six four-core nodes)
for four hours:
qsub -l walltime=4:00:00 -l nodes=3:core8+6:core4 jobC
qsub -l walltime=4:00:00,nodes=3:core8+6:core4 jobC
- How many nodes are available vs. free, node_status
- Display the status of a batch job, qstat -a When monitoring jobs with qstat, look at “Elap Time” (elapsed time) rather than “Time Use”. This is because “Elap Time” is the time since the job started, while “Time Use” is the CPU time used by the user process; this number is usually zero or close to it, since it counts the script that actually launches the MPI job, not the job itself.
- Delete (cancel) a job, qdel PBS_JOBID
- Show all running jobs on the system, qstat -r
- Show detailed information for a specified job, qstat -f PBS_JOBID
- Show all queues, qstat -q
- Show queue limits for all queues, qstat -Q
- Show quick information of the server, qstat -B
- Show compute node status, pbsnodes -a
- Show summaries of running, idle, blocked jobs, showq
Jobs are routed and prioritized depending on the walltime you request and your fairshare value. A system queue exists for special jobs. Send mail to shc-support and request that this queue be activated for your benchmark runs, long runs, special needs runs, etc.
8. Getting Help/Communicating with Other Users/Staff
Send mail to shc-support. CACR technical support is available during standard business hours (M-F, 8am to 5pm); after-hours responses are as time permits.
9. Expectations About System Down Times/News
Scheduled Preventive Maintenance (PM) is Monday from 0900 to 1300. CACR Operations will not always take the full PM for systems work. If you would like to schedule benchmarking or test runs with dedicated access to the entire system, asking shc-support for a portion of PM time is fine and encouraged.
News about operational changes (e.g. system software upgrades, file system policy changes, dedicated runs) will be posted using news. New news items since last ssh’ing to the head node will be displayed automatically.
Using news:
- /usr/local/bin/news -help
- Usage:
- news prints all new news items
- news X prints news item “X”
- news -a prints all news items
- news -l prints names of all news items
- news -h prints this message
Single core Linpack, 3.811 Gflops (86% peak)
11. Policies
All users are expected to adhere to the CACR Computing Policies.
12. Accounting and Job Priority Policies
Jobs are scheduled according to “weight.” Many factors are taken into consideration when determining a job’s weight, including cpu time consumed recently by user/group, time spent waiting in the queue, priority of the group, runtime and node count being requested, etc.
The intent of FairShare job scheduling is to prevent a user or group from dominating compute resources – striking a balance between utilization of cpu resources, job throughput, and fair scheduling between projects and project members.
Current fair share reporting output can be obtained by executing diagnose -f from an shc head node.
Notice that there are 8 columns, numbered 0 to 7. If a user's entry in column X shows a value other than “—”, there was cpu consumption X days ago, where X ranges from 0 to 7. Consumption a week ago (column 7) is weighted less heavily (0.4783) than consumption today (column 0). The value listed reflects job duration, job size and when the job ran.
Notice the column with the “%” heading. This shows the “penalty” or weight associated with a given user and/or group. If a value > 8.00 (the Target value) appears by a user, newly submitted jobs from that user will be “penalized” due to recent consumption by the user or the user's group, allowing queued jobs from users with lower “%” values to run sooner.
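The per-day weighting can be illustrated with a small calculation. The sketch below assumes a decay factor of 0.9 per day. That factor is an inference from the column 7 weight quoted above (0.9^7 = 0.4783) and is not stated in this guide; the helper names fs_weight and fs_effective_usage are hypothetical, and the scheduler's actual calculation may include other terms.

```shell
#!/bin/sh
# Hypothetical helpers, not shc commands.
# Weight applied to usage from N days ago, assuming a per-day decay
# factor of 0.9 (inferred from the column 7 weight, not documented).
fs_weight() {
    awk -v d="$1" 'BEGIN { printf "%.4f\n", 0.9 ^ d }'
}

# Decay-weighted total of per-day usage values, column 0 first.
fs_effective_usage() {
    echo "$*" | awk '{
        s = 0
        for (i = 1; i <= NF; i++) s += $i * 0.9 ^ (i - 1)
        printf "%.4f\n", s
    }'
}

# Example: 10 units today plus 10 units a week ago
#   fs_effective_usage 10 0 0 0 0 0 0 10
```

Under this assumption, usage from a week ago counts for a little under half of the same usage today.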
Groups have different priorities, based on their contributions to the cluster and recent cpu usage. qsub -p priority, where “high” and “low” are the acceptable values, changes a job’s priority within the project’s collection of submitted jobs. Accounts are charged accordingly. Use this option sparingly! The default priority is “low”.
Queues have different priorities and policies. When special runs are requested, these jobs are placed in the dedicatedQ queue and given highest priority. The dedicatedQ queue is typically used for long jobs prior to a conference with an upcoming deadline, etc. Jobs submitted to the weekendQ queue have priority on weekends, over the weekdayQ, but the weekendQ queue is only enabled (running jobs) Friday 1700 to Monday 0800. Holidays extend weekendQ queue active hours.






