processchunks(1)                                              processchunks(1)



NAME
       processchunks - Run command files in parallel on multiple processors

SYNOPSIS
       processchunks  [options]  machine_list  root_name_of_command_files

       processchunks  [options]  -q #  queue_command_and_arguments  root_name

DESCRIPTION
       processchunks will process a set of command files in parallel on multi-
       ple machines or processors that have access to a common filesystem.  It
       can also be used to run a single command file on a remote machine.


   Command Files
       The command files should be in the format runnable with Vmstocsh
       (e.g., using subm) and be named like files produced by Chunksetup.
       Specifically, the files for processing individual components should be
       named rootname-001.com, rootname-002.com, etc., where "rootname" is the
       root name of the command files, entered as the last command line argu-
       ment.  If the command files create temporary files, they must be named
       uniquely, such as by ending their names in .$$ and placing them in
       /usr/tmp.  (See Chunksetup for an alternative using hostnames).  The
       temporary files should be removed by the individual command files with
       statements like
       $if (-e /usr/tmp/tmpfil.$$) rm -f /usr/tmp/tmpfil.$$

       In addition to the files for individual chunks, there may also be two
       additional files: rootname-start.com, to perform initial tasks, and
       rootname-finish.com, for final assembly and cleanup.  If the start file
       exists, Processchunks will run this on the first machine before start-
       ing to distribute the chunks; and if the finish file exists, it will
       also be run on the first machine when all chunks are finished.

       It is also possible to execute groups of command files in parallel,
       then run one command file when all of these are done, then go on to
       another group of files in parallel.  This is done by adding "-sync"
       after the number for each file to be run for synchronization.  The
       files should all still have unique, sequential numbers.  For example,
       if the 10th and 20th files are named rootname-010-sync.com and root-
       name-020-sync.com, then files 1 to 9 will be run to completion in par-
       allel, then rootname-010-sync.com will be run alone on the first
       machine, then 11 to 19 will be run in parallel, then root-
       name-020-sync.com will be run alone, then all remaining files will be
       run in parallel.

       To run any single command file on a remote machine, simply use the -s
       option and provide the machine name and the name of the command file.


   Running Processchunks
       The machine list provided on the command line should be a list of
       machine names or IP addresses, separated by commas and with no embedded
       spaces.  Machines with multiple processors can be entered as many times
       as the number of processors that you want to use, or a machine name can
       be entered with a "#" and the number of processors to use.  For exam-
       ple, "tubule#4,eclipse#6" would use 4 processors on tubule and 6 on
       eclipse.  (See the -G option for the different syntax and restrictions
       when using a GPU and running on machines where you want to select spe-
       cific GPUs.)  The local machine on which you are running Processchunks
       can be identified by its hostname or as "localhost"; in either case the
       command will be run directly on the machine.  If this is the only
       machine, then it can be identified by a single number specifying the
       number of processors to use.  For example, you can enter "4" instead of
       "localhost,localhost,localhost,localhost".  For all other machines, the
       command will be run with ssh.  You must be set up to log into each of
       the machines with ssh without having to enter a password (see below).
       Each machine must have access to the current directory.  The -w option
       can be used to specify the path by which remote machines can access the
       current directory.

       Processchunks first probes all of the machines to see if connections
       can be established and to show you the first line of the "w" output on
       each machine (or the output of "imodwincpu" on a Windows machine).
       This will also allow you to provide any one-time interactive confirma-
       tion needed by ssh, such as when you first log in to the machine.  If
       the "w" or "imodwincpu" command cannot be run on a machine, it is auto-
       matically removed from the list.  After this probe, the program asks
       you to confirm that you want to proceed with this list of machines
       (unless you use the -g option to skip this confirmation).  Thus, you
       could run Processchunks with a large list of machines, examine this
       output, then exit and restart after eliminating machines with too much
       load.

       After the confirmation, Processchunks starts jobs on all of the
       machines and watches for completion.  It detects successful completion
       by using a command that makes "CHUNK DONE" be written at the end of the
       log.  Jobs are started with a reduced priority specified by the "nice-
       ness" entry.  When the job on a machine finishes, that machine is given
       another job and the program reports the total number of chunks done so
       far.  When a job appears to have failed, the job is started again.
       Initially the job may tend to be restarted on the same machine where it
       failed, but near the end when there are other machines free, it will be
       sent to another machine.

       Except on Windows, if you type "Esc" and "Enter" at the beginning of a
       line, the program will give you four choices: killing any running jobs
       and terminating, stopping after letting any running jobs finish,
       attempting to restart with the current list of machines, or just con-
       tinuing.  If you choose to terminate or restart, existing jobs will
       first be killed to avoid having the same job running twice at the same
       time.  When you rerun this program after stopping in this way, be sure
       to use the -r option so that existing results will be retained.  This
       method of interacting with the program replaces the use of Ctrl-C with
       the earlier script version of processchunks.

       On Windows, interaction in the terminal running the program is not pos-
       sible, and the program is automatically run in the background via an
       alias to protect it from being interrupted with Ctrl-C.  Thus the only
       way to issue instructions to it is by entering them into a file that
       the program watches.  The default name of this file is "process-
       chunks.input" but you can set a different name with the -c option.

       Processchunks has mechanism for deciding when a job appears to be hung,
       at which point it will kill it and assign it to a different machine if
       possible.  Essentially, it tests for whether the job has taken a crite-
       rion amount longer than the longest job run so far (where at least 2
       prior jobs must have completed for this test to be used), and also
       whether the log file is older than a certain amount (see the -C and -T
       options for details).  By default, no tests are applied to non-sync
       chunks, although this can be enabled.  In addition, the information
       about times of completed jobs is reset after every sync chunk, since
       different segments of a computation may have different intrinsic times.
       The default settings aim to be very conservative, so the operation
       could easily hang for an hour or more until the criteria are met, but
       this feature should allow overnight runs to complete.


   Running on a Cluster Queue
       Processchunks can submit all command files to a cluster queue instead
       of running them on specific machines with ssh, provided that there is a
       program that can perform the operations needed to place a chunk on the
       queue and kill jobs.  IMOD contains a script, Queuechunk, which can
       do these operations for some specific queue types.  For other queues or
       variations, this script would have to be modified or replaced.  To use
       a queue, enter the -q option with the maximum number of jobs to be
       placed on the queue at any time, and replace the list of machine name
       with the command that Processchunks needs to issue, e.g.:
          "queuechunk -t pbs -q fast"
       Note that the command will contain multiple words and must be enclosed
       in quotes as above.  With the version distributed in IMOD, the type
       must be specified with -t, and there is an option -q to specify the
       name of the queue, and an option -h to specify the name of the head
       node, which would then be contacted via ssh.


   Initial Configuration
       To set up ssh keys for access without passwords, simply run
          ssh-keygen -t dsa
       and type Enter for all of the queries.  Then enter:
          cp $HOME/.ssh/id_dsa.pub $HOME/.ssh/authorized_keys
       If necessary, distribute this authorized_keys file to the .ssh directo-
       ries on other machines that do not have the same home directory.  The
       three different kinds of platforms (Linux/Unix, Mac OSX, Windows) may
       require separately generated keys.  If this seems to be the case, run
       ssh-keygen once on each type of machine then combine the
       .ssh/id_dsa.pub files from each into one authorized_keys file.  Dis-
       tribute this combined authorized_keys file to the .ssh directories.

       Processchunks runs jobs via ssh by starting a non-interactive bash
       login shell on the remote computer.  Any programs to be run must be on
       your path when logging in as a bash user.  If you use only IMOD pro-
       grams in the command file and IMOD has been installed by the default
       method, there should be no problem.  However, if you set up the IMOD
       environment in your own startup files instead of in the default way,
       you need to make sure that the environment is defined when running this
       bash shell.  Specifically, you need to set the IMOD_DIR environment
       variable then source $IMOD_DIR/IMOD-....sh in your ~/.bash_profile and
       not just in your ~/.bashrc, since the latter is not run by the non-
       interactive shell.  Similarly, if you are running programs outside IMOD
       whose environment is set up in your own startup files, you need to do
       that in to set environment only in ~/.bash_profile.  Thus, the recom-
       mended approach is to set the environment in ~/.bashrc and source that
       file from ~/.bash_profile.

       To diagnose problems, there are a few tests that you can run by issuing
       the same kind of command that Processchunks uses to run jobs, of the
       form:
          ssh -x machine bash --login -c \'"command"\'
       where the command can be relatively complex because of the quoting; but
       the quoting can be omitted for simple commands.  Specifically, use
          ssh -x machine bash --login -c \'"cd directory"\'
       to test if you can change to the given directory.  Use
          ssh -x machine bash --login -c imodinfo
       to test if the remote machine can run IMOD this way, and use
          ssh -x machine bash --login -c env > env.out
       to collect the environment if there is trouble running IMOD.

       Parallel processing through eTomo uses Processchunks.  The available
       machines are generally defined in a file "cpu.adoc" that is located in
       the directory pointed to by the environment variable IMOD_CALIB_DIR
       (default /usr/local/ImodCalib).  See the example cpu.adoc file in
       $IMOD_DIR/autodoc and the manual page for cpuadoc for full details
       on configuring this file.  For a single machine with multiple CPUs,
       there are three simple options for enabling the parallel processing:
          1) In eTomo, open the Settings dialog from the Options menu, check
       "Enable Parallel processing" and enter the number of processors.
          2) Make a file /usr/local/ImodCalib/cpu.adoc with these two lines:
              [Computer = localhost]
              number = 4
       where the number should be set to the number of processors.
           3) set the environment variable IMOD_PROCESSORS to the number of
       processors, e.g.,
           setenv IMOD_PROCESSORS 4     (for tcsh users)
           export IMOD_PROCESSORS=4     (for bash users)


OPTIONS
       Processchunks uses the PIP package for input (see the manual page for
       pip).  Options can be specified either as command line arguments
       (with the -) or one per line in a command file (without the -).
       Options can be abbreviated to unique letters; the currently valid
       abbreviations for short names are shown in parentheses.

       -r     Resume processing and retain all existing log files.  The
              default is to remove all existing log files, run rootname-
              start.com if it exists, and then run all of the individual com-
              mand files, finishing with rootname-finish.com if it exists.
              With this option, the program will not rerun any command files
              whose corresponding log files end with "CHUNK DONE", including
              the start and finish files.

       -s     Run a single command file on a remote machine (i.e., the first
              machine in machine list).  The command file is not required to
              be numbered.  The rootname given on the command line can be
              either the full name or the name excluding ".com".

       -G     Distribute jobs to multiple or specific GPUs on each machine.
              With this option, each machine name can be followed by a colon
              and one or more specific GPU device numbers separated by colons.
              Devices are numbered from 1 and the numbers must be positive.
              The program will run each job with the environment variable
              IMOD_USE_GPU2 set to the given GPU number, or set to 0 for a
              machine where no GPU numbers were entered (0 requests the best
              available GPU on that machine.)  No machine name can be entered
              more than once, and the "#" syntax cannot be used to specify
              multiple CPUs.  For example, "tubule:2:1,eclipse:2,druid" would
              use two GPUs on tubule, a specific one on eclipse, and the best
              or only one on druid.

       -O    Integer
              Set the limit on the number of threads processes will run in
              parallel by setting environment variable OMP_NUM_THREADS to the
              given number.  A value of 0 means do not limit multiple threads.
              The default is 0 (no limit) when a single file is run with -s or
              when the -G option is given; otherwise the default is 1.

       -g     Go start processing after probing the machines, without waiting
              for confirmation from the user.

       -n    Integer
              Run jobs with "niceness" set to the given value, which can range
              from 0 for no reduction in priority to 19 for maximum reduction.
              The default nice value is 18.

       -w    File name
              The full path for reaching the current directory on the remote
              machines.  This entry is needed when working on a local disk
              whose mounted path on the other machines is different from its
              path (as given by pwd) on the local machine.

       -d    Integer
              Drop a machine from the list if it fails this number of times in
              a row.  The default criterion is 5.

       -e    Integer
              Quit if a chunk gives a processing error (as opposed to failing
              to start) this number of times.  All running jobs will be
              killed.  The default limit is 5.

       -C    Three floats
              This option enters three factors that control the program's
              decision to kill a chunk that appears to have hung up.  They
              are: 1) A criterion for how much longer a chunk can take than
              the slowest chunk so far on the same machine.  The default is 4;
              this criterion needs to reflect the intrinsic variability among
              chunks and extent to which machine performance could vary during
              a run.  A value of 0 completely disables chunk timing tests.  2)
              A multiplier to that criterion when comparing a chunk time with
              the slowest time so far on any machine.  The default is 3, so
              the default overall criterion is 12; this factor needs to
              reflect the intrinsic spread in machine capabilities.  3) A mul-
              tiplier for both of these criteria when considering sync chunks.
              The default is 0, for no testing for sync chunks; the factor
              could be 1 if sync chunks are comparable or faster than non-sync
              chunks.

       -T    Two integers
              Timeouts for log file activity for non-sync and sync chunks, in
              seconds.  The default is 300,0.  A value of 0 disables this
              test.  A non-sync chunk is killed if it is too slow by the tim-
              ing tests controlled with the "-C" option, AND if it log is
              older than the first number entered here.  The same dual test is
              applied for a sync chunk if the log activity test is enabled
              with the second number here and the timing test is enabled with
              a non-zero factor in the third number of the "-C" entry.  How-
              ever, if the log activity test is enabled and the timing test is
              not enabled, non-sync chunks will be killed solely on the basis
              of the log becoming too old.

       -c    File name
              Check the given file periodically for lines with commands to
              quit, pause, or drop a machine (Q, P, or D machine_name).  The
              default name of this file is "processchunks.input".

       -q    Integer
              Put chunks on a cluster queue instead of sending them to indi-
              vidual machines via ssh.  The given value indicates the maximum
              number of chunks to submit at any one time.  With this option,
              the list of machine names must be replaced by the command needed
              to interact with the queue.

       -Q    Text string
              When running on a queue, this option can be used to specify the
              name that Processchunks will use when it reports chunks being
              started and finishing.  The entry must be a single word with no
              embedded spaces.  It need not match the actual name of the
              queue; the default is "queue".

       -P     Output process ID.  When this option is entered together with
              "-g", the program will skip probing machine loads.

       -v     Verbose output

       -V    Text string
              This option enables debugging output from a specific class or
              function.  It has no effect if "-v" is not entered too.  The
              entry has the form "?|?,2|class[,function[,...]][,2]|2" where 2
              means more verbose and ? or function[,...] lists the functions
              that have information to print.  The matching to class and func-
              tion names is case-insensitive and to the end of the name,
              rather than to the whole name.

       -help  Print usage message

       -StandardInput
              Read parameter entries from standard input


FILES
       Log files will be generated for all command files that are run.  The C-
       shell script produced by Vmstocsh for rootname-nnn.com is saved to
       rootname-nnn.csh.  This file is removed after the command file com-
       pletes.

BUGS
       Ctrl-C cannot be blocked on Windows, at least when running from a
       mintty terminal, so the program needs to be run in the background to
       protect it from being killed inadvertently.


AUTHOR
       Sue Held  <sueh at colorado dot edu>

HISTORY
       Up until IMOD 4.2.8, processchunks was a script written in C shell.  It
       is now a program in C++ using Qt.

SEE ALSO
       chunksetup, vmstocsh, queuechunk, cpuadoc



BL3DEMC                              4.7.3                    processchunks(1)