Batch Tomogram Reconstruction with Batchruntomo in IMOD 5.1

University of Colorado, Boulder



Introduction
Setting General Batch Processing Parameters
The Stack Table
Setting Parameters for the Data Sets
Using the Advanced Parameter Interface
Running the Data Sets
Running on a Cluster

Introduction

Etomo provides an interface for reconstructing multiple tomograms automatically using Batchruntomo. The data sets should be sufficiently similar so that, for the most part, the same parameters and procedures can be applied to all of them. The interface allows you to set a number of parameters, but in each case a different value can be used for an individual data set. The parameters that we think most likely to vary are included in a table of data sets. For the other parameters, there is one tab of the interface to set the values to apply in general, which are referred to as global values. If necessary, you can open a copy of this screen for an individual data set and set different values there.

You may want to go through the Example of Batch Reconstruction either before or after reading this document.

For simplicity, the basic interface presents a selected subset of the many parameters that can be set from the regular reconstruction interface. However, you can now switch to an advanced interface that presents a much larger number of other parameters. Initially, we relied on templates as a mechanism for controlling the values of parameters not exposed in the interface. Templates still have an important role even with the advanced parameter screen available. Templates and the current editor for saving them are described in Using Etomo. In brief, they are text files with the extension ".adoc" containing name-value pairs called directives, whose format is described in the Batchruntomo man page. The available directives are listed in the directive table. If you want to make a template for personal use and do so by hand, put it in the directory .etomotemplate under your home directory (this is where Etomo's template editor places user templates by default). Using Etomo describes what to do with templates for general use.
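For concreteness, here is a minimal sketch of what a small personal template might look like. The particular directive names and values are meant only as illustrations; check them against the directive table before relying on them.

    # mySettings.adoc - personal template placed in ~/.etomotemplate
    # Directive names and values shown here are illustrative; see the directive table
    setupset.copyarg.gold = 10
    setupset.copyarg.rotation = 90
    runtime.Fiducials.any.trackingMethod = 0
    runtime.AlignedStack.any.binByFactor = 2

Each line consists of a directive name, an equals sign, and a value; comment lines start with "#".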

The interface is organized into four tabs that would generally be visited in sequence from left to right. However, they can all be accessed and changed at any time. Also note that you can close the interface and reopen it to resume working on a project; you should find all of the settings as you left them, although a few may no longer be changeable. The project file has the extension ".ebt".

Setting General Batch Processing Parameters

The Batch Setup tab has items that should be filled in first.

The Stack Table

On the Stacks tab, you add the tilt series that you want to process to a table. When you press Add Stack(s), a file chooser will open to allow you to select the stack files. You can select multiple files and add them together. If you have many dual-axis data sets, you can select all of the "a" and "b" files together, and the program will show just the "a" files in the table. The stacks can have an extension of either ".st" or ".mrc"; the latter will be renamed to ".st" for processing.
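For example, selecting cella.st and cellb.st together would produce a single line in the table showing cella.st, treated as a dual-axis data set (the file names here are just illustrative).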

For your first addition to the table, the Dual axis, Montage, and Beads on Two Surfaces checkboxes are set based on the defaults that you have set in the Options - Settings dialog, as modified by any templates you have chosen. The setting of the Dual axis box will also be adjusted as appropriate when both "a" and "b" stacks are entered, or when the stack root name does not end in "a" or "b". Further entries inherit the settings of these three boxes from the previous line in the table. The Copy Down button copies these three settings from the selected line only to the line below, which is of limited use and should be changed to copy to all lines below. For now, the easiest way to get these boxes set for a large number of data sets is to add one stack, set the checkboxes, then add the rest.

The Boundary Model is used to indicate regions where the fiducial seed model should be selected for tracking, or where patch tracking should be done. If you have data sets needing such models, check the box before pressing the 3dmod icon to draw the model, so that 3dmod can be given the right filename and location. The file is named with the data set root name plus "_rawbound.mod" and is placed in the current location of the data set. For a dual-axis data set, the model is transformed for use with the second axis. When using fiducials, the transformation is based on the run of Transferfid used to transfer fiducials to the second axis; with patch tracking, Transferfid is run at this point just to find the transformation. You need to draw one or more contours on just one view, preferably the zero-degree view.
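For example, for a single-axis data set whose stack is cell.st (an illustrative name), the boundary model would be written as cell_rawbound.mod in the same directory as the stack.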

If entries are made in the Exclude Views column, those views will currently be carried through into the coarse and fine aligned stacks but skipped in tracking, alignment, and reconstruction. To remove the excluded views instead, an option can be set in the Preprocessing section of the Advanced screen, or a directive can be supplied in a template or starting batch file. That same section has options to enable automatic exclusion of dark images near the ends of the tilt series stack (e.g., by setting the SD criterion for excluding high tilt views to 0.5).
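If you prefer to set this through a template or starting batch file, the entry would be a single Preprocessing directive. The name shown here is an assumption and should be confirmed against the directive table before use:

    # Assumed directive name - verify in the directive table
    runtime.Preprocessing.any.removeExcludedViews = 1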

Setting Parameters for the Data Sets

Using the Advanced Parameter Interface

Press the Advanced button on the Dataset Values tab to switch to the advanced interface. The same button can be used on the Dataset Values dialog for an individual data set to set advanced parameters for that set. The button then changes to Basic, which can be used to switch back to the basic interface. Unlike Advanced mode in the reconstruction interface, the Advanced dialog here does not include any of the parameters in the basic interface.

This interface is organized as a set of stacked sections that can be individually opened and closed by clicking the bar with the name of the section. The sections correspond to the ones in the master directive table, and all directives described in that table but not included in the basic interface are available in the advanced interface. If you are using a template or starting batch file that includes any directives not listed in the master table, then there will be an additional section at the bottom with those directives.

The directives that appear in the interface are controlled by the two choices in the Which Directives to Show box. Turning on Only items containing a value shows just the directives that have a value set, whether from a template, a starting batch file, or your own entry. With Only items output to batch file checked, only the directives that will actually be written to the batch file are shown, not ones whose values come only from templates.

When the directives being shown are restricted, sections with no directives to show are closed and disabled.

The entry fields depend on the type of parameter.

Values from templates and their labels will be shown in blue, as well as values from the "batchDefaults.adoc" file, which is treated like a bottom-level template. When there is a non-boolean value from a template, the X to the right of the field will be enabled. This button allows you to override the template and revert to a default value by placing a blank directive into the batch file. Press this button to select an override; the field will then turn black and display ">OVERRIDE<". The override button will remain enabled and you can press it again to revert to the template value.
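As an illustration of this mechanism (using a directive name chosen only for the example), a template might supply a binning value, and pressing the override button would cause the batch file to contain a blank entry for the same directive, which overrides the template value:

    # In a template:
    runtime.AlignedStack.any.binByFactor = 4

    # Written to the batch file after pressing the override button:
    runtime.AlignedStack.any.binByFactor =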

The program checks whether entries match the expected type and number and do not have extraneous characters. Erroneous entries will be displayed in red. However, there is no check for whether more complex restrictions are satisfied, such as two options being mutually exclusive or negative values for entries that should be positive. These errors will not be caught until the program in question runs.

Running the Data Sets

When you select the Run tab, you should first make selections in the Resources to Use section to indicate whether to use multiple CPUs and one or more GPUs. If you make no selections, only a single CPU will be used. After you select either Use multiple CPUs or Parallel GPUs, a table appears at the top with computing resources. If both are selected, the table will show available CPUs on the left and available GPUs on the right.

With Use multiple CPUs selected, you have the option of running batch jobs in parallel. When not using that option, reconstructions are run sequentially. Your selections in the Resources section determine what resources are used for each single reconstruction. When Run multiple batch jobs in parallel is selected, there is one command file per data set and they are all passed to Processchunks to run, along with the number to run at one time. If a cluster is available, the Use a cluster checkbox in the Parallel Processing section will be enabled and can be selected to use the cluster instead. See Running on a Cluster for details. When running with regular computer resources, the selected CPUs are divided as equally as possible among the different jobs, whereas the GPUs are managed dynamically so that each job can have access to multiple GPUs when it reaches a step that needs GPUs. See the Splitbatch man page for details on how this dynamic allocation works. The entry for Maximum # of GPUs to use by one job determines how many GPUs a job will request, but it may get fewer or only one. If Local GPU is selected, it does not mean that there is one GPU on each machine; select this only if you want to use just a single GPU on the machine running Etomo.
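For example, if you select 12 CPUs in the table and choose to run 4 jobs at one time, each Batchruntomo job would be given 3 CPUs; the selected GPUs, by contrast, are held in a common pool and granted to whichever job reaches a GPU-capable step, up to the Maximum # of GPUs to use by one job.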

If you enable Email notification and enter an address, Batchruntomo will send an email whenever a data set is aborted and when all processing is complete. For the email to work, you may need to define an SMTP mail server; this can be done in the Options - Settings dialog.

The Subset of steps to run section allows you to control a stopping or starting point for the run. If you turn on Stop after, you can select one of the available stopping points. All data sets selected for running (by means of the checkbox in the Run column of the Datasets table) will be run to the same selected stopping point. When the run stops at such a point, you can then turn on Start from and select a starting point. Generally you would want to select the starting point paired with the stopping point, but you can go back to an earlier one if desired. Ordinarily, it would not work to select a starting point later than the earliest point reached by any of the data sets included in the run, so this is not allowed by default. However, if you have completed a step manually, such that it would not be a problem to start at a later step, then you can select Enable starting from any step and select any step. When starting past the Fine Alignment step, Batchruntomo no longer recomputes the fine alignment (which involves adjusting alignment parameters as needed).

Press Run to start a run. During the run, you can use Kill Process to stop processing as quickly as possible, or Pause to make it stop after the current data set finishes or reaches the stopping point (or when all running data sets reach that point, if running in parallel). When running multiple jobs in parallel, a Kill will not take effect for each job until it finishes its current step and checks for a quit signal. After a Pause or Kill, the Resume button can be used to restart the run from where it left off. When resuming from a Kill, each data set that was killed will be run from the beginning or from the selected starting point, not from the step where it was killed.

Almost nothing can be changed after a Pause or Kill: data set parameters and starting and stopping points are disabled; data sets marked for running can be dropped out, but none can be added. The situation is more flexible when all data sets have reached a selected stopping point; it is possible to manipulate which sets are included in the run. However, data set parameters currently still cannot be changed (we plan to enable a subset of parameters that will have an effect when changed). To remove all these restrictions in either situation, press Reset. This has several consequences: 1) data set parameters can be edited again; 2) the Resume button is disabled and the program forgets what would be needed to resume; 3) all data sets selected for running will be run from the selected starting point or the beginning, even if they have already reached a later point. Thus, if you use Reset after a Pause or Kill, you have to manually turn off the Run checkboxes for any data sets that have already been run and that you do not wish to rerun.

When not running in parallel, Etomo saves and runs a single command file in the project directory, named "rootname.com", where "rootname" is the project root name from the Batch Setup tab. During the run, a corresponding log file will be created in the project directory and will contain all of the log output from the run. There will be a copy of the portion of that log for each data set in its respective directory, named "batchruntomo.log"; this can be opened with the Open Log button in the Run table. The full log for all runs can be opened from the menu brought up by right-clicking over the panel. Selected extracts are shown in the Project Log. The portion of the project log for each data set can be opened with the Proj Log button.

The situation is similar when running in parallel, except that there is a command file, and eventually a corresponding log file, for each data set in the project directory. This log is essentially the same as the "batchruntomo.log" in the data set directory, but if there is an error before the latter is started, you would have to examine the log for that data set in the project directory.
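As a rough sketch of where these files end up, assuming a project root name of "batch1" and a single-axis data set "setA" whose stack happens to sit in a subdirectory of the project directory (in general, data sets can be located elsewhere; all names here are illustrative):

    projectdir/
        batch1.ebt              project file
        batch1.com, batch1.log  command and log files for a non-parallel run
        setA.com, setA.log      per-data-set command and log files when running jobs in parallel
        setA/
            setA.st             tilt series stack
            batchruntomo.log    copy of the log output for this data set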

You can exit Etomo after starting a run and reopen it later. The program will "reconnect" to the run, whether it is finished or not, and update the status for all data sets.

Running on a Cluster

A cluster can be used only when running multiple Batchruntomo jobs in parallel. (If the cpu.adoc file on the system defines only cluster queues and no computers, running in parallel is obligatory.) When using a cluster is selected, there are several changes in the interface: 1) The Maximum # of GPUs to use by one job entry is disabled; instead the number of GPUs is determined by the properties of the queue being run on. 2) The Resources to Use items are disabled. 3) The cluster-specific section How a Batch Job Should Run Processes is enabled and has a set of choices that depend on what kind of queues are available in the table.

A variety of ways of running on a cluster are supported. The Splitbatch man page has a complete description of the different possibilities and how to configure the cpu.adoc file for them (a sketch of such a file appears at the end of this section). The choices enabled in How a Batch Job Should Run Processes reflect which possibilities are available and let you control which kind of queues can be selected in the Parallel Processing table. The possibilities are:

  1. Each job gets access to a single core and no GPU. Batchruntomo will submit chunks for parallelized operations like reconstruction to the queue and run other operations directly on that core. Select Job submits processes to single-core queue to run in this mode. This option is available only if there is a queue defined that provides just one core and no GPU. In this case, the spinner is available in the Used column of the Parallel Processing table to select the maximum number of jobs to submit to the queue.
  2. Each job is allocated several cores on a node, and possibly one or more GPUs. Batchruntomo will run all CPU-based operations using these resources and not submit any of them to the queue. It will allow multi-threaded operations (parallelized with OpenMP) to use all of the cores and run other parallelized operations in multiple chunks. If GPU(s) are provided, it will use those for operations that can be run on a GPU, splitting them into chunks if there are multiple GPUs. Select Job runs on a node and runs processes directly on that node to run in this mode, which is available only if there is a queue with either a GPU or multiple cores. In this case, the Used column of the table contains no spinner and cannot be edited; its value is kept equal to the value for Run up to # jobs.
  3. Either of these modes can be combined with a different way of accessing a GPU: using a "secondary" queue that provides one GPU. In this case, all operations that can be done on a GPU are split into chunks and the chunks are submitted to that secondary queue. In this way, GPUs are requested only when needed, similar to the dynamic allocation of GPUs when not running on a cluster. Check Use one GPU on secondary queue for this option, which is available only if there is a queue that provides just one GPU. Once this is checked, the radio buttons in the 2nd column of the table are enabled, and an eligible queue can be selected there. Also, the radio buttons in the 1st column are disabled for queues offering a GPU; a CPU-only queue must be selected there. The spinner in the Used column for the secondary queue should be set to the maximum number of GPU jobs that one Batchruntomo job will submit at once. If this is set to one, the operation will not be split into chunks, which is probably more efficient.
After selecting the desired mode of operation (if there are any such choices), select the appropriate queue(s) from among the enabled ones.
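To make these choices concrete, here is a minimal sketch of queue entries in a cpu.adoc file. The queue names are made up, and only basic attributes are shown; the queuechunk type and the additional attributes that indicate how many cores and GPUs a queue provides depend on your cluster and are described in the Splitbatch man page.

    # Illustrative cpu.adoc queue entries; attribute details vary by cluster
    [Queue = onecore]
    command = queuechunk -t pbs
    number = 48

    [Queue = gpunode]
    command = queuechunk -t pbs
    number = 8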