processchunks(1) General Commands Manual processchunks(1)
NAME
processchunks - Run command files in parallel on multiple processors
SYNOPSIS
processchunks [options] machine_list root_name_of_command_files
processchunks [options] -q ncpus "queue_cmd_and_arguments" root_name
DESCRIPTION
processchunks will process a set of command files in parallel on multi-
ple machines or processors that have access to a common filesystem. It
can also be used to run a single command file on a remote machine.
Command Files
The command files should be in the format runnable with Vmstopy
(e.g., using subm) and be named like files produced by Chunksetup.
Specifically, the files for processing individual components should be
named rootname-001.com, rootname-002.com, etc., where "rootname" is the
root name of the command files, entered as the last command line argu-
ment. As of IMOD 4.10.3, the extension ".pcm" may be used instead of
".com", provided that all files have the same extension. As of IMOD
4.12.15, command files need not be in the current directory; the root
name entry can include the path to a subdirectory or directory located
elsewhere.
If the command files create temporary files, they must be named
uniquely, such as by ending their names in .$$ and placing them in
/usr/tmp. (See Chunksetup for an alternative using hostnames). The
temporary files should be removed by the individual command files with
statements like
$if (-e /usr/tmp/tmpfil.$$) rm -f /usr/tmp/tmpfil.$$
In addition to the files for individual chunks, there may also be two
additional files: rootname-start.com, to perform initial tasks, and
rootname-finish.com, for final assembly and cleanup. If the start file
exists, Processchunks will run this on the first machine before start-
ing to distribute the chunks; and if the finish file exists, it will
also be run on the first machine when all chunks are finished.
It is also possible to execute groups of command files in parallel,
then run one command file when all of these are done, then go on to
another group of files in parallel. This is done by adding "-sync"
after the number for each file to be run for synchronization. The
files should all still have unique, sequential numbers. For example,
if the 10th and 20th files are named rootname-010-sync.com and root-
name-020-sync.com, then files 1 to 9 will be run to completion in par-
allel, then rootname-010-sync.com will be run alone on the first
machine, then 11 to 19 will be run in parallel, then root-
name-020-sync.com will be run alone, then all remaining files will be
run in parallel.
If there are more than 999 chunks, the files can be numbered with 3
digits below 1000, 4 digits for 1000 to 9999, 5 digits for 10000 to
99999, and 6 digits for 100000 and above. Alternatively, all files can
be given the same number of digits, e.g., "0001" to "4567". The pro-
gram will not run more than 999,999 chunks. Large numbers of chunks
will use a lot of resources both within processchunks and in the
filesystem, so it is desirable to keep the number of chunks under 10000
if possible. However, the capability for using up to 1,000,000 chunks
is there if needed.
To run any single command file on a remote machine, simply use the -s
option and provide the machine name and the name of the command file.
Running Processchunks
The machine list provided on the command line should be a list of
machine names or IP addresses, separated by commas and with no embedded
spaces. Machines with multiple processors can be entered as many times
as the number of processors that you want to use, or a machine name can
be entered with a ":" and the number of processors to use. (Prior to
IMOD 4.8.25, a "#" was used, but this cannot be distinguished from the
start of a comment when reading options from standard input.) For
example, "tubule:4,eclipse:6" would use 4 processors on tubule and 6 on
eclipse. (See the -G option for the different syntax and restrictions
when using a GPU and running on machines where you want to select spe-
cific GPUs.) The local machine on which you are running Processchunks
can be identified by its hostname or as "localhost"; in either case the
command will be run directly on the machine. If this is the only
machine, then it can be identified by a single number specifying the
number of processors to use. For example, you can enter "4" instead of
"localhost,localhost,localhost,localhost". For all other machines, the
command will be run with ssh. You must be set up to log into each of
the machines with ssh without having to enter a password (see below).
Each machine must have access to the current directory. The -w option
can be used to specify the path by which remote machines can access the
current directory.
Processchunks first probes all of the machines to see if connections
can be established and to show you the first line of the "w" output on
each machine (or the output of "imodwincpu" on a Windows machine).
This will also allow you to provide any one-time interactive confirma-
tion needed by ssh, such as when you first log in to the machine. If
the "w" or "imodwincpu" command cannot be run on a machine, it is auto-
matically removed from the list. After this probe, the program asks
you to confirm that you want to proceed with this list of machines
(unless you use the -g option to skip this confirmation). Thus, you
could run Processchunks with a large list of machines, examine this
output, then exit and restart after eliminating machines with too much
load.
After the confirmation, Processchunks starts jobs on all of the
machines and watches for completion. It detects successful completion
by using a command that makes "CHUNK DONE" be written at the end of the
log. Jobs are started with a reduced priority specified by the "nice-
ness" entry. When the job on a machine finishes, that machine is given
another job and the program reports the total number of chunks done so
far. When a job appears to have failed, the job is started again.
Initially the job may tend to be restarted on the same machine where it
failed, but near the end when there are other machines free, it will be
sent to another machine.
Except on Windows, if you type "Esc" and "Enter" at the beginning of a
line, the program will give you four choices: killing any running jobs
and terminating, stopping after letting any running jobs finish,
attempting to restart with the current list of machines, or just con-
tinuing. If you choose to terminate or restart, existing jobs will
first be killed to avoid having the same job running twice at the same
time. When you rerun this program after stopping in this way, be sure
to use the -r option so that existing results will be retained. This
method of interacting with the program replaces the use of Ctrl-C with
the earlier script version of processchunks.
On Windows, interaction in the terminal running the program is not pos-
sible, and the program is automatically run in the background via an
alias to protect it from being interrupted with Ctrl-C. Thus the only
way to issue instructions to it is by entering them into a file that
the program watches. The default name of this file is "process-
chunks.input" but you can set a different name with the -c option.
Processchunks has a mechanism for deciding when a job appears to be
hung, at which point it will kill it and assign it to a different
machine if possible. Essentially, it tests for whether the job has
taken a criterion amount longer than the longest job run so far (where
at least 2 prior jobs must have completed for this test to be used),
and also whether the log file is older than a certain amount (see the
-C and -T options for details). By default, no tests are applied to
sync chunks, although this can be enabled. In addition, the informa-
tion about times of completed jobs is reset after every sync chunk,
since different segments of a computation may have different intrinsic
times. The default settings aim to be very conservative, so the opera-
tion could easily hang for an hour or more until the criteria are met,
but this feature should allow overnight runs to complete.
Running on a Cluster Queue
Processchunks can submit all command files to a cluster queue instead
of running them on specific machines with ssh, provided that there is a
program that can perform the operations needed to place a chunk on the
queue and kill jobs. IMOD contains a script, Queuechunk, which can
do these operations for some specific queue types (see that man page
for the definitive list of supported types). For other queues or vari-
ations, this script would have to be modified or replaced. To use a
queue, enter the -q option with the maximum number of jobs to be placed
on the queue at any time, and replace the list of machine name with the
command that Processchunks needs to issue, e.g.:
"queuechunk -t pbs -q fast"
Note that the command will contain multiple words and must be enclosed
in quotes as above. With the version distributed in IMOD, the type
must be specified with -t, and there is an option -Q to specify the
name of the queue, and an option -h to specify the name of the head
node, which would then be contacted via ssh.
Initial Configuration
Information on configuring your system to use Processchunks has been
move to the IMOD User's Guide, section 1.9, Setting up Parallel Pro-
cessing in IMOD.
OPTIONS
Processchunks uses the PIP package for input (see the manual page for
pip). Options can be specified either as command line arguments
(with the -) or one per line in a command file (without the -).
Options can be abbreviated to unique letters; the currently valid
abbreviations for short names are shown in parentheses.
-r Resume processing and retain all existing log files. The
default is to remove all existing log files, run rootname-
start.com if it exists, and then run all of the individual com-
mand files, finishing with rootname-finish.com if it exists.
With this option, the program will not rerun any command files
whose corresponding log files end with "CHUNK DONE", including
the start and finish files.
-s Run a single command file on a remote machine (i.e., the first
machine in machine list). The command file is not required to
be numbered. The rootname given on the command line can be
either the full name or the name excluding ".com".
-m Run multiple, independently named command files, whose full
names or names without extension are given as non-option argu-
ments after the machine list. The command files are not
required to be numbered. The program behavior changes in sev-
eral ways: processing will continue even if some chunks fail;
chunks that fail will be retried only once unless a different
value is entered with the -e option; the detection of slow/hung
chunks is disabled unless the -C option is entered.
-G Distribute jobs to multiple or specific GPUs on each machine.
With this option, each machine name can be followed by a colon
and one or more specific GPU device numbers (not number of
devices) separated by colons. Devices are numbered from 1 and
the numbers must be positive. The program will run each job
with the environment variable IMOD_USE_GPU2 set to the given GPU
number, or set to 0 for a machine where no GPU numbers were
entered (0 requests the best available GPU on that machine.) No
machine name can be entered more than once. For example,
"tubule:2:1,eclipse:2,druid" would use two GPUs on tubule, a
specific one (not two) on eclipse, and the best or only one on
druid.
-O Integer
Set the limit on the number of threads processes will run in
parallel by setting environment variable OMP_NUM_THREADS to the
given number. A value of 0 means do not limit multiple threads.
The default is 0 (no limit) when a single file is run with -s or
when the -G option is given; otherwise the default is 1.
-M Integer
Run a limited number of chunks at a time, giving each chunk a
unique set of CPU resources that it can use to run parallel
operations. Specifically, all of the entered processors are
divided among the entered number of jobs, with the goal of giv-
ing each job as many cores as possible on the machine on which
the job runs. Two environment variables are set for each job:
MULTI_PROC_CPU_LIST with a list of processors that the job can
use when running Processchunks; and MULTI_PROC_THREAD_LIMIT with
the number of cores on the local machine, which the job can use
to limit the number of cores used for operations parallelized
with OpenMP. If a machine is dropped either due to failures or
by an external "D" entry, the program will allow running chunks
to complete, reallocate the processors without that machine, and
continue.
-p Text string
Specification for a pool of GPUs when doing multi-processor
jobs. This list of machines is independent of the CPU machine
list. The format is the same as for a machine list entered when
the -G option is used. When this pool is provided, the environ-
ment variable MULTI_PROC_GPU_POOL is set for each job. If a
machine on this list is dropped, the pool will be revised; also,
machines that are not in the set of ones used to run the multi-
processor jobs will be probed and removed from the pool if they
can no longer be connected to.
-g Go start processing after probing the machines, without waiting
for confirmation from the user.
-n Integer
Run jobs with "niceness" set to the given value, which can range
from 0 for no reduction in priority to 19 for maximum reduction.
The default nice value is 18.
-w File name
The full path for reaching the current directory on the remote
machines. This entry is needed when working on a local disk
whose mounted path on the other machines is different from its
path (as given by pwd) on the local machine.
-d Integer
Drop a machine from the list if it fails this number of times in
a row. If there are still other jobs running on the machine, it
will be put on hold with no new jobs assigned. If all those
jobs fail, it will be dropped then, but if any succeed, the
failure count is reset. The default criterion is 5 for regular
machines and 10 for a cluster queue.
-e Integer
Quit if a chunk gives a processing error (as opposed to failing
to start) this number of times, unless running multiple named
command files with the -m option. All running jobs will be
killed when quitting. The default limit is 5 for numbered com-
mand files on multiple machines, 10 if running on a queue, or 2
when running with the -m option or when running on one machine.
-C Three floats
This option enters three factors that control the program's
decision to kill a chunk that appears to have hung up. They
are: 1) A criterion for how much longer a chunk can take than
the slowest chunk so far on the same machine. The default is 4;
this criterion needs to reflect the intrinsic variability among
chunks and extent to which machine performance could vary during
a run. A value of 0 completely disables chunk timing tests. 2)
A multiplier to that criterion when comparing a chunk time with
the slowest time so far on any machine. The default is 3, so
the default overall criterion is 12; this factor needs to
reflect the intrinsic spread in machine capabilities. 3) A mul-
tiplier for both of these criteria when considering sync chunks.
The default is 0, for no testing for sync chunks; the factor
could be 1 if sync chunks are comparable or faster than non-sync
chunks.
-T Two integers
Timeouts for log file activity for non-sync and sync chunks, in
seconds. The default is 300,0. A value of 0 disables this
test. A non-sync chunk is killed if it is too slow by the tim-
ing tests controlled with the "-C" option, AND if it log is
older than the first number entered here. The same dual test is
applied for a sync chunk if the log activity test is enabled
with the second number here and the timing test is enabled with
a non-zero factor in the third number of the "-C" entry. How-
ever, if the log activity test is enabled and the timing test is
not enabled, non-sync chunks will be killed solely on the basis
of the log becoming too old.
-c File name
Check the given file periodically for lines with commands to
quit, pause, or drop a machine (Q, P, or D machine_name). The
default name of this file is "processchunks.input". With a quit
command, the program will kill all running chunks. With a
pause, it will allow everything that is running to finish but
not start any new ones. When pausing on a cluster, it will can-
cel any jobs on the queue that have not started yet but allow
started jobs to finish.
-L Integer
Set the limit on the number of jobs started sequentially during
one cycle of checking jobs and machines. The time between
cycles is 1 second. The default is 20.
-q Integer
Put chunks on a cluster queue instead of sending them to indi-
vidual machines via ssh. The given value indicates the maximum
number of chunks to submit at any one time. With this option,
the list of machine names must be replaced by the command needed
to interact with the queue.
-Q Text string
When running on a queue, this option can be used to specify the
name that Processchunks will use when it reports chunks being
started and finishing. The entry must be a single word with no
embedded spaces. It need not match the actual name of the
queue; the default is "queue".
-I Text string
When running on a queue, this option can be used to specify a
command that will be issued to the queue before submitting jobs,
such as for allocating nodes. It can have embedded spaces.
-D Text string
When running on a queue, this option can be used to specify a
command that will be issued to the queue before exiting. It can
have embedded spaces.
-W Integer
Seconds to wait for the queue initialization command given with
the -I to finish. Enter -1 to wait indefinitely, or a positive
value up to 2000000. The default is 30, or a value set with the
environment variable IMOD_QUEUE_INIT_WAIT. An entry here over-
rides a value set with that variable.
-JC Integer
Number of cores per job on cluster queue when running multi-pro-
cessor jobs. Both -q and -M must be entered as well. With this
entry, Processchunks will set the environment variable
MULTI_PROC_JOB_CORES instead of MULTI_PROC_QUEUE_COMMAND, which
will cause Batchruntomo to run chunks directly on that number
of cores instead of submitting them to a queue, and to allow
processes to use that number of threads.
-JG Integer
Number of GPUs per job on cluster when running multi-processor
jobs; cores per job must also be specified with -JC.
-SQ Text string
Specification of secondary queue to be used by running jobs when
running multi-processor jobs, instead of the main queue speci-
fied by -q. With this entry, Processchunks will set the envi-
ronment variable MULTI_PROC_GPU_QUEUE. The primary use for this
option is to provide a separate queue with a GPU, to be used
only with processes that require it, so that GPU resources are
obtained only when needed.
-SN Integer
Maximum number of jobs to submit together to secondary queue
-P Output process ID. When this option is entered together with
"-g", the program will skip probing machine loads.
-v Verbose output
-V Text string
This option enables debugging output from a specific class or
function. It has no effect if "-v" is not entered too. The
entry has the form "?|?,2|class[,function[,...]][,2]|2" where 2
means more verbose and ? or function[,...] lists the functions
that have information to print. The matching to class and func-
tion names is case-insensitive and to the end of the name,
rather than to the whole name.
-help Print usage message
-StandardInput
Read parameter entries from standard input
FILES
Log files will be generated for all command files that are run. The C-
shell script produced by Vmstocsh for rootname-nnn.com is saved to
rootname-nnn.csh. This file is removed after the command file com-
pletes.
BUGS
Ctrl-C cannot be blocked on Windows, at least when running from a
mintty terminal, so the program needs to be run in the background to
protect it from being killed inadvertently.
AUTHOR
Sue Held <sueh at colorado dot edu>
HISTORY
Up until IMOD 4.2.8, processchunks was a script written in C shell. It
is now a program in C++ using Qt.
SEE ALSO
chunksetup, vmstocsh, queuechunk, cpuadoc
IMOD 5.2.5 processchunks(1)