ssh -l username shc.cacr.caltech.edu
Edits, compiles, builds, and job submissions are done on the front end.
NOTE: Password entry on the head node is not allowed; access is permitted via SSH public keys only. If you do not have a public key, please see the SSH key generation instructions.
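A minimal sketch of generating a key pair on your local machine (assuming an RSA key is acceptable here; follow the SSH key generation instructions referenced above for the site's actual requirements):
ssh-keygen -t rsa
# The public key (by default ~/.ssh/id_rsa.pub) is what gets installed on shc;
# the private key never leaves your local machine.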
3. System Configuration Information / Technical Summary
The Shared Heterogeneous Cluster provides computing capabilities specifically configured to meet the needs of applications from Caltech's Predictive Science Academic Alliance Program, Turbulent Mixing, Applied and Computational Mathematics, and Numerical Relativity communities. The configuration, integrated by Hewlett-Packard and CACR's technical support team, consists of 203 AMD dual-processor nodes, connected via Voltaire InfiniBand and an Extreme Networks BlackDiamond 8810 GigE switch. The typical programming model used for SHC jobs is MPI, with flexible queue policies allowing for development and production runs.
Cluster Configuration Overview Diagram (coming soon)
Technical Summary:
| Architecture |
| Head Nodes |
| Compute Nodes |
| Processor |
| Network Interconnect |
| Disk |
| Operating System |
| Compilers |
| Batch System |
| MPI |
/usr/local/bin/pkgs lists available packages on shc. Packages are commonly used items on the software stack that need environment variables and paths set in order to be used properly. Available packages are:
In order to set your environment to use a particular package, do use pkgname. For example, to use the Intel compilers do
use intel
which will update your PATH, LD_LIBRARY_PATH, MANPATH and set the Intel environment variables appropriately. To drop a package from your environment, do
drop pkgname
For example, to drop the PGI compilers from your environment do
drop pgi
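A quick way to confirm the effect on your environment (a sketch; assumes the Intel C compiler driver is named icc):
use intel
which icc    # should now resolve to the Intel installation
drop intel
which icc    # should no longer be found on your PATH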
As an alternative to the "use" environment, "module" can be used to set the user environment.
To see all available modules:
module avail
To add a module (openmpi for Open64) to your environment:
module add openmpi/open64
Some of the modules depend on others, and if not all prerequisites are met, you may see a message like:
openmpi/open64(29):ERROR:151: Module 'openmpi/open64' depends on one of the module(s) 'open64/4.2.3.1'
openmpi/open64(29):ERROR:102: Tcl command execution failed: prereq
This tells us that before adding the openmpi/open64 module to the environment, the open64 module has to be added first:
module add open64
module add openmpi/open64
To remove a module from the user environment:
module del openmpi
(This will remove any version of openmpi in the user's environment.)
To remove all loaded modules:
module clear
or
module purge
To see what modules are already in the user's environment:
module list
For more help on module, do:
module help
or
module help <modulename>
6. Supported Debuggers and Debugging Tips
Interactive jobs are submitted via,
qsub -I -l nodes=X:core4 -l walltime=HH:MM:SS
where X is the number of nodes you want for the specified wall time.
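For example, to request two four-core nodes interactively for one hour (values chosen purely for illustration):
# 2 core4 nodes for 1 hour, interactive shell on the first allocated node
qsub -I -l nodes=2:core4 -l walltime=01:00:00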
Batch jobs are submitted via,
qsub sample.pbs
where sample.pbs looks like:
#!/bin/csh -f
#
# ask for 2 core4 nodes for 1 hour
#PBS -l nodes=2:core4
#PBS -l walltime=01:00:00
#
# Direct stdout/err as desired
#PBS -o /home/sharon/examples/hello_out
#PBS -e /home/sharon/examples/hello_err
#
# Let's make sure each node can load the executable
/usr/bin/ldd $HOME/examples/hello
#
# cd to the executable area if desired or launch with absolute path
# Do an 8 way hello run
/usr/local/openmpi/bin/mpirun -np 8 $HOME/examples/hello
Short (< 2 hour) jobs are given scheduling priority on 6 "core4" nodes Mon-Fri, 8am to 6pm.
Node requests take the form:
-l nodes=NN[:<type>][:ppnval][+MM[:<type>][:ppnval]]...
| NN, MM | Number of nodes of the specified type required by the job; default is 1. |
| type | Node type: "core8" for eight-core nodes and "core4" for four-core nodes; default value is "core4". |
| ppnval | Number of processes that will be run on each node, specified as "ppn=K", where "K" is a value not exceeding the core count; default value is the number of cores the allocated nodes have. |
Jobs requesting >12 hours will automatically be routed to the weekendQ.
dedicatedQ runs need special approval
WeekdayQ runs are M-F, 0800 to 1700
WeekendQ runs are F-M, 1700 to 0800. Caltech holidays extend the active window for the WeekendQ
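For instance, a submission like the following (job name jobD chosen for illustration) would be routed to the weekendQ because its requested wall time exceeds 12 hours:
# jobD is a placeholder job script name
qsub -l walltime=14:00:00,nodes=2:core4 jobD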
Job submission on shc examples:
A standard, single-node, 4 way MPI executable which needs to run
for 30 minutes on a four-core node:
qsub -l walltime=00:30:00 -l nodes=1:core4:ppn=4 jobA
qsub -l walltime=00:30:00 -l nodes=1:core4 jobA
qsub jobA
*all of the above are equivalent
Equivalent ways to submit a six-node, 48 way jobB which needs to run for 1.5 hours on eight-core nodes:
qsub -l walltime=1:30:00 -l nodes=6:core8:ppn=8 -q productionQ jobB
qsub -l walltime=1:30:00,nodes=6:core8 jobB
Equivalent ways to submit a job that should run on a mixture of node types (three eight-core nodes plus six four-core nodes) for four hours:
qsub -l walltime=4:00:00 -l nodes=3:core8+6:core4 jobC
qsub -l walltime=4:00:00,nodes=3:core8+6:core4 jobC
Send mail to
. CACR technical support is available during standard business hours (M-F, 8am to 5pm); after hours responses are as time permits.
News about operational changes (e.g. system software upgrades, file system policy changes, dedicated runs) will be posted using news. New news items posted since you last logged in to the head node will be displayed automatically.
Using news:
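A few common invocations (a sketch based on the traditional news(1) utility; the flags available on shc may differ):
news        # display items you have not yet read
news -a     # display all items
news -n     # list only the names of current items
news -s     # report how many unread items there are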
Single core Linpack, 3.811 Gflops (86% peak)
All users are expected to adhere to the CACR Computing Policies.
Jobs are scheduled according to "weight." Many factors are taken into consideration when determining a job's weight, including cpu time consumed recently by user/group, time spent waiting in the queue, priority of the group, runtime and node count being requested, etc.
The intent of FairShare job scheduling is to prevent a user or group from dominating compute resources, striking a balance between utilization of cpu resources, job throughput, and fair scheduling between projects and project members. Current fair share reporting can be obtained by executing diagnose -f from an shc head node.
Notice that there are 8 columns, numbered 0 to 7. If entries associated with a user in these columns show a value other than "---", there was cpu consumption X days ago, where X ranges from 0 (today) to 7 (a week ago). Consumption a week ago (column 7) is weighted less heavily (0.4783) than consumption today (column 0). The value listed reflects job duration, job size, and when the job ran.
Notice the column with the "%" heading. This shows the "penalty" or weight associated with a given user and/or group. If a value > 8.00 (the Target value) appears by a user, newly submitted jobs will be "penalized" due to recent consumption by that user or their group, allowing queued jobs from users with lower "%" values to run sooner.
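As a point of reference, the published weights are consistent with a per-day decay factor of roughly 0.9 (an inference from the numbers above, not a documented scheduler parameter): 0.9^7 ≈ 0.4783, so cpu time consumed seven days ago counts a bit less than half as much as the same consumption today.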
Groups have different priorities, based on their contributions to the cluster and recent cpu usage. qsub -p priority, where "high" and "low" are acceptable values, changes a job's priority within the project's collection of submitted jobs. Accounts are charged accordingly. Use this option sparingly! The default priority is "low".
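For example (job name jobE and resource values chosen for illustration), to raise one job's priority relative to your project's other submitted jobs:
# jobE is a placeholder job script name
qsub -p high -l walltime=00:30:00,nodes=1:core4 jobE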
Queues have different priorities and policies. When special runs are requested, these jobs are placed in the dedicatedQ queue and given highest priority. The dedicatedQ queue is typically used for long jobs prior to a conference with an upcoming deadline, etc. Jobs submitted to the weekendQ queue have priority over the weekdayQ on weekends, but the weekendQ queue is only enabled (running jobs) from Friday 1700 to Monday 0800. Holidays extend the weekendQ queue's active hours.
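A long job can also be directed explicitly to the weekend queue in the same way productionQ is selected in the examples above (a sketch; job name jobF and resource values chosen for illustration, and such jobs are routed there automatically once they exceed 12 hours):
# jobF is a placeholder job script name
qsub -q weekendQ -l walltime=20:00:00,nodes=4:core8 jobF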