
CLIC usage

- experience of users for users -





This is a collection of notes on issues I encountered during the first weeks and months of testing on CLIC.
It is something like an FAQ, but there is no guarantee that every point is correct.

If anyone has other or additional experience of interest to other users, please let me know (m.pester@mathematik.tu-chemnitz.de).

Note:
There may be changes (version numbers, paths, ...) due to system upgrades. Thus, parts of this page may become obsolete some day.
The author is not responsible for the contents of other pages that are linked here.


What do I need to run a program on CLIC?

  1. You need a valid login for the domain hrz.tu-chemnitz.de and an account (=project name) for CLIC
    (refer to URZ-Benutzerservice)
  2. Define a "job" (interactive or batch job, see below).
  3. Login to any host (HP, Sun or Linux) in the domain hrz.tu-chemnitz.de.
  4. You may use xpbs to define a lot of options for your job ...
    WARNING: xpbs can be considered a good idea, but it would need improvements to become a real "user interface".
    Please have a look at those tangled xpbs windows and find out what the options mean :-)
  5. Submit your job to the queue system (PBS = Portable Batch System):
      qsub -I my_interactive_job
    or
      qsub my_batch_job

    Note: The former command  pbs_qsub  (from /uni/global/bin) is deprecated and has been replaced by  qsub  (from /usr/local/bin).

  6. It is strongly recommended to have the following features (possibly by changing your login scripts)

Interactive jobs

"Submitting" an interactive job by  qsub -I ... will give you a shell  in your current terminal. If you use xpbs to submit an interactive job, xpbs opens a window (xterm) to execute that shell. The file $PBS_NODEFILE contains the list of hostnames (nodes) assigned to your job. The interactive shell runs on the first of them.

A simple example:
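For instance, assuming 8 nodes and a two-hour time limit (both values are placeholders, not prescribed settings):

  # request an interactive shell on 8 nodes for two hours
  qsub -I -l nodes=8,walltime=2:00:00

  # once the shell has started, the assigned nodes are listed in
  cat $PBS_NODEFILE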

Batch jobs

If you can redirect all input and output of your program to files, you should use the real batch mode to run it.
The job definition file has to contain some PBS-specific options, written as comments for the shell (#PBS -option),
and all the shell commands to be executed on the first node of your subcluster.

A simple example:
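For instance, a job file  my_batch_job  could look like this (resource values, job name and program are placeholders):

  #!/bin/sh
  # PBS options written as shell comments: resources, job name, join stdout/stderr
  #PBS -l nodes=16,walltime=2:00:00
  #PBS -N my_batch_job
  #PBS -j oe
  # the commands below are executed on the first node of the subcluster
  cd $PBS_O_WORKDIR            # directory the job was submitted from
  ./myprogram < input.dat > output.log

Submit it with  qsub my_batch_job .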

Access to a special queue

In an "emergency" case (defect switch) we had access to CLIC by a special queue only. In such a case, your batch job should contain, e.g.,

 #PBS -q clicDefectQ@clic0a1.hrz.tu-chemnitz.de

and use only hostfiles with the suffix eth0 instead of eth1.
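A hedged sketch of such a job, using the init scripts described further down (there the argument "eth0" selects the service network; node count and program are placeholders):

  #PBS -q clicDefectQ@clic0a1.hrz.tu-chemnitz.de
  #PBS -l nodes=16,walltime=2:00:00
  clic_init_lam eth0 <<'EOF'
  mpirun -np 16 <myexecutable>
  EOF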

Using LAM-MPI on CLIC

For general information concerning the usage of LAM-MPI refer to this (German) document.

On CLIC, there are three (marginally different) versions of LAM-MPI installed under /usr/local/packages. I decided to use the TCP version, since the others seem to have advantages only for dual-processor boards.

Notes on LAM-MPI 6.5.1: With very long messages and highly parallel simultaneous exchange, LAM 6.3.2 managed to transfer 140 Mbit/s per node, whereas LAM 6.5.1 did not exceed 128 Mbit/s per node.
The most recently installed version is LAM-MPI 6.5.6:
 MPIHOME=/usr/local/packages/lam-rpi-tcp-6.5.6
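A minimal sketch of using this version by hand (node count and program are placeholders; the eth1 host file is the one mentioned in the Remarks, and most of these steps are automated by the clic_init_lam script described below):

  # put the chosen LAM version first in your search path
  MPIHOME=/usr/local/packages/lam-rpi-tcp-6.5.6
  PATH=$MPIHOME/bin:$PATH; export MPIHOME PATH

  # boot the LAM daemons on the nodes of your subcluster
  lamboot -v ${PBS_NODEFILE}.lam.eth1

  mpirun -np 16 <myexecutable>

  # shut the daemons down when you are done
  wipe -v ${PBS_NODEFILE}.lam.eth1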

Remarks:

Using MPICH on CLIC

MPICH is installed locally on CLIC under /usr/local/packages/mpich-1.2.4.ssh/. Just as described above for LAM-MPI, you can do the following:
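A minimal sketch of these steps (node count and program are placeholders):

  # put the local MPICH installation first in your search path
  MPIHOME=/usr/local/packages/mpich-1.2.4.ssh
  PATH=$MPIHOME/bin:$PATH; export MPIHOME PATH

  # no separate boot step: mpirun itself starts the processes via ssh
  mpirun -machinefile $PBS_NODEFILE -np 16 <myexecutable>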

Using PVM on CLIC

PVM can be used from /afs/tucz/project/sfb393/pvm3 for the PVM architecture LINUX. The PVM daemon is started by
 $PVM_ROOT/lib/pvm [-n<master_hostname>] <hostfile>
where <hostfile> may be $PBS_NODEFILE or ${PBS_NODEFILE}.lam.eth1 as described above in the Remarks.
The flag -n<master_hostname> is important for the correct use of the communication network (eth1). You can obtain <master_hostname> via `head -1 ${PBS_NODEFILE}.lam.eth1`.
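For example (using the commands described above; at the PVM console prompt, "quit" leaves the console with the daemons still running, "halt" shuts the whole virtual machine down):

  MASTER=`head -1 ${PBS_NODEFILE}.lam.eth1`
  $PVM_ROOT/lib/pvm -n$MASTER ${PBS_NODEFILE}.lam.eth1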

Typical problems using PVM:

A few little scripts

For simplification there are a few shell scripts to initialize a subcluster either for LAM-MPI or for MPICH (and for PVM, too):
  clic_init_lam      [ < input_file ]
  clic_init_mpich    [ < input_file ]
  clic_init_mpichmpd [ < input_file ]
  clic_init_pvm [-x] [ < input_file ]
By default they will select the corresponding machine file (using the communication network), start the corresponding daemon and then (in case of success) run a shell interactively. Before entering the subshell, the scripts will add the bin directory of the appropriate MPI version to the top of your search path (if it is not there yet).
You may specify the argument "eth0" if you explicitly want the service network instead of the communication network.
If you leave the shell (exit) the daemons are killed and temporary files are deleted.
For simplicity, each of the scripts defines an environment variable CLIC_NUMNODES with the number of nodes defined in $PBS_NODEFILE. This variable is available for the subshell.
For usage in batch mode you may redirect the input from a file which contains the mpirun command and data for your program.
Of course, you may also write it in your batch job, e.g.
  clic_init_lam <<EOF
  mpirun -np 16 <myexecutable>
  EOF
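Since CLIC_NUMNODES is set by the scripts for the subshell, the node count need not be hard-coded; note the quoted delimiter, which keeps the outer shell from expanding the variable too early:
  clic_init_lam <<'EOF'
  mpirun -np $CLIC_NUMNODES <myexecutable>
  EOF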
Another helpful script is the following, which executes a specified command via ssh on each of the nodes listed in $PBS_NODEFILE:
  clic_chk [-b] [command]
The flag "-b" means to execute the ssh commands in the background instead of one by one. If no command is specified, clic_chk will only echo an "OK" from each node (to check whether ssh works).
As a special case of clic_chk you may run the script
  clic_chk_load
which extracts those nodes from $PBS_NODEFILE that have a load average of more than 0.10. This will take a while for a large number of nodes; the program pload (see below) may be better for a quick test.
If you want to check the connection via another machine file than $PBS_NODEFILE, please use
  chkhosts [-b] machine_file [command]
instead.
The script  clic_init_pvm  is similar to those for MPI. The flag  -x  is for interactive use only, since it opens an additional xterm running the PVM console. Hence, you need a working DISPLAY connection (xhost [+]).

The current state of CLIC may be displayed by

   clic_show
The output of this script looks like this (you may also see the current state here) :
Server           Max Tot Que Run Hld Wat Trn Ext Status
---------------- --- --- --- --- --- --- --- --- ----------
clic0a1.hrz.tu-c   0  41  24  17   0   0   0   0 Scheduling

clic0a1.hrz.tu-chemnitz.de: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1075.clic0a1.hr fci      clicNode thin.job    14190  16  --    --  2000: R 189:1
1269.clic0a1.hr frank    clicNode pbs_clc.sh  11888   1  --    --  150:0 R 96:23
1270.clic0a1.hr frank    clicNode pbs_clc.sh  10806   1  --    --  150:0 R 96:15
1272.clic0a1.hr klpa     clicNode STDIN       13630 111  --    --  250:0 R 94:20
1309.clic0a1.hr tnc      clicNode inter.sh    27726   1  --   512b 250:0 R 87:35
1310.clic0a1.hr tnc      clicNode inter.sh    21408   1  --   512b 250:0 R 87:20
1333.clic0a1.hr frank    clicNode pbs_clc.sh   1314   1  --    --  150:0 R 22:40
1340.clic0a1.hr frank    clicNode pbs_clc.sh  10339   1  --    --  150:0 R 21:19
1346.clic0a1.hr ikondov  clicNode set1a_2-d1  15939  48  --    --  25:00 R 17:00
1350.clic0a1.hr mibe     clicNode bdmpitest     --  238  --    --  08:00 Q   -- 
1351.clic0a1.hr klpa     clicNode STDIN       12615  44  --    --  250:0 R 14:39
1352.clic0a1.hr ikondov  clicNode set1a_34-d  32096  48  --    --  25:00 R 10:23
1353.clic0a1.hr ikondov  clicNode set1a_34-d  20435  48  --    --  25:00 R 10:23
1355.clic0a1.hr ikondov  clicNode set1a_34-d  19408  48  --    --  25:00 R 10:24
1356.clic0a1.hr ikondov  clicNode set1a_34-d  10725  48  --    --  25:00 R 10:22
1357.clic0a1.hr ikondov  clicNode set1a_34-d   9506  48  --    --  25:00 R 10:23
1358.clic0a1.hr ikondov  clicNode set1a_34-d  19348  48  --    --  25:00 R 08:03
1359.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1360.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1361.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1362.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1363.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
 ..........
1379.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1380.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1381.clic0a1.hr ikondov  clicNode set1a_34-d    --   48  --    --  25:00 Q   -- 
1382.clic0a1.hr pester   clicNode STDIN         --    4  --    --  04:00 R 01:07 
    522 nodes  in use,
      0 nodes  free,
      7 nodes  offline.

Not a script but a small program may be used to find out whether another user has left some of your nodes "unclean":
  mpirun -np ... pload.CLIC.lamXXX   (for LAM-MPI, XXX=632 or 656 depending on the current LAM version)
  mpirun -np ... pload.CLIC.mpich    (for MPICH)
This program runs for a few seconds and then shows a time diagram with one row per node. Nodes that get much more CPU time than others should be inspected in order to find hanging processes (please send a message to clicadmin if you find such processes of other users, or system processes such as klogd).

Where can you find the scripts?
  /afs/tucz/project/sfb393/bin/
  or /usr/local/bin/
    (some of them modified by Mike Becher; more options and help)
 

Local compiling and linking?

It is very annoying if you have to wait for a CLIC node assigned by PBS when all you want is to compile and link your program.
In my tests I found no problems using locally installed versions of LAM-MPI or MPICH to compile and link the programs on my desktop; the executable then runs on CLIC. It is also no problem that the Linux distributions differ (local: S.u.S.E., CLIC: RedHat).
The local installations (not really "local") can be used by anyone else:
 
 
LAM-MPI 6.3.2 /afs/tucz/project/sfb393/packages/lammpi.CLIC
LAM-MPI 6.5.9 /afs/tucz/project/sfb393/lammpi
MPICH 1.1.1 /afs/tucz/project/sfb393/mpich
PVM 3.4 /afs/tucz/project/sfb393/pvm3

NOTE for LAM-MPI:
By default mpif77 calls "f77". In our local installation, however, f77 is not usable, so I modified the mpif77 script to use g77 by default.
You may check the resulting command line with

   mpif77 -showme

Using our private libraries?

The libraries we have been developing and using for several years are also usable for CLIC. The library path is
 /afs/tucz/project/sfb393/FEM/libs/$archi
where $archi is an environment variable defining the architecture and/or the parallel system to use.
Here is an overview for Linux:
 
(For each value of archi: where to run "make", and which hypercube communication library to use with which message passing library.)

archi=LINUX       any Linux computer (*.mathematik)
                    MPICH:    libMPIcom.a
                    PVM:      libCubecom.a
archi=LINUX_lam   any Linux computer (*.mathematik)
                    LAM-MPI:  libMPIcubecom.a or libMPIcom.a
archi=CLIC        CLIC nodes (clicxxxx.hrz)
                    LAM-MPI:  libMPIcubecom.a or libMPIcom.a
                    MPICH:    libMPICHcom.a
                    PVM:      libCubecom.a
archi=LinuxPGI    Linux with access to /afs
                    LAM-MPI:  libMPIcubecom.a or libMPIcom.a
archi=Intel       Linux with access to /afs (the compiler needs a handful of environment variables)
                    LAM-MPI:  libMPIcubecom.a or libMPIcom.a
                    MPICH:    libMPICHcom.a or libMPICHcubecom.a
What else do you need? In each case, have a look at the file
/afs/tucz/project/sfb393/FEM/libs/$archi/default.mk
to verify the default paths and variables (which you may override in your Makefile).
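A hedged sketch of the typical steps (assuming your Makefile reads $archi from the environment, as the description above suggests):

  archi=CLIC; export archi
  # inspect the default paths and variables for this architecture
  cat /afs/tucz/project/sfb393/FEM/libs/$archi/default.mk
  make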

An attempt to compare ...

Each message passing system has some particular features. I will try to split them into advantages and disadvantages:
 
LAM-MPI
  Advantages:
  • the communication network (eth1) can be used
  • after lamboot the mpirun command is very quick
  • very good behavior of MPI_sendrecv (up to 140 Mbit/s, no loss of performance caused by the switches for more than 100 nodes)
  Disadvantages:
  • problems with more than 228 nodes (fixed)
    (workaround: mpirun -lamd ...)
  • lamboot and wipe need "some time" and may hang up if one node has trouble
    (because they use ssh sequentially)
  • bad implementation of global communication (MPI_Allreduce, ...)
  • the directory where the executable is started from must be readable for urz:clicnodes or system:anyuser (no ssh connection - no AFS token)
MPICH
  Advantages:
  • good implementation of global communication (MPI_All...)
    [but does not use the duplex mode of the ethernet cards (so only up to 100 Mbit/s ?)]
  • with serv_p4 the first two disadvantages listed below disappear
  Disadvantages:
  • mpirun takes a long time (as lamboot does for LAM-MPI)
  • mpirun creates a lot of ssh processes on node_0
  • programs use 100% CPU time while they are waiting in send/recv
  • memory problems for very long messages (P4_GLOBMEMSIZE must be increased)
PVM
  Advantages:
  • can use 512 nodes or more (if they are available :-)
  • the PVM daemon starts faster than that of LAM-MPI
  Disadvantages:
  • communication performance worse than MPI
    (total bandwidth is 30...60 % of MPI)
libCubecom.a
  Advantages:
  • needs only send and recv from any message passing library (to be used for PVM)
  Disadvantages:
  • number of nodes must be a power of 2
libMPIcom.a or libMPICHcom.a
  Advantages:
  • uses more features of MPI
  • "Cube" communication works with any number of nodes
  Disadvantages:
  • bad global communication performance for LAM-MPI
libMPIcubecom.a
  Advantages:
  • private implementation of global communications (Cube_DoD, Cube_DoI, Cube_Cat) using only MPI_sendrecv
    (best performance with LAM-MPI)
  Disadvantages:
  • number of nodes must be a power of 2

DISPLAY Problems

[since ∼ March 2006]
Due to a software upgrade concerning OpenSSH and the X server, some problems have occurred with receiving graphical output from a parallel program running on CLIC.

Reason: ssh tunneling for X11 data does not work backwards from CLIC, and a new default configuration of the X server on local machines rejects any connection other than such secure tunnels.

Workaround: You must "forward" the DISPLAY variable that was obtained via ssh on the compute server where you logged in from outside.
Assume "remhost" is the hostname of this compute server; then the value of $DISPLAY will be something like remhost:xx.0. You can forward this variable to CLIC as an argument of qsub:
qsub -I ... -v DISPLAY=$DISPLAY
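Putting it together (the compute server name and the node count are placeholders):

  # on your workstation: log in to a compute server with X11 forwarding
  ssh -X remhost.hrz.tu-chemnitz.de
  # on remhost: check which DISPLAY the ssh login has set up
  echo $DISPLAY
  # pass exactly this value on to the interactive job on CLIC
  qsub -I -l nodes=4 -v DISPLAY=$DISPLAY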

Last Changes

Fakultät für Mathematik, TU Chemnitz
Matthias Pester, 12.12.2000