Manpage of pca


Section: User Commands (1)


pca - principal components analysis of aligned subvolumes


pca prmFile iterationNumber numParticles
pca prmFile iterationNumber numParticles avgFilename
pca prmFile iterationNumber numParticles avgFilename autoclose
pcaSP prmFile iterationNumber numParticles
pcaSP prmFile iterationNumber numParticles avgFilename
pcaSP prmFile iterationNumber numParticles avgFilename autoclose  


Perform principal components analysis on a set of aligned subvolumes, optionally using the wedge-masked difference (WMD) corrected covariance matrix. Alternatively, compute the singular value decomposition of the re-scaled or constrained cross-correlation. See Heumann et al., "Clustering and variance maps for cryo-electron tomography using wedge-masked differences", Journal of Structural Biology (2011) 175:288-299 (doi:10.1016/j.jsb.2011.05.011) for details. The resulting output is suitable for clustering (i.e., unsupervised classification) using the program clusterPca.
prmFile
The name of the parameter file. See the PEET and averageAll man pages and below for descriptions of parameter file settings. The parameter file must contain the same settings previously used to align and average the subvolumes; it may also contain the additional parameters described below, which control the behavior of pca.
iterationNumber
An integer specifying the alignment iteration number to analyze.
numParticles
An integer specifying the number of particles to analyze.
avgFilename (optional, but highly recommended)
The name of the MRC file containing the averaged subvolume to subtract when computing wedge-masked differences. Typically, this should be an average of numParticles subvolumes. If this parameter is omitted, the wedge-masked difference correction will not be performed. Specifying avgFilename has no effect when pcaMethod (below) is 3, so it can be omitted in that case.
autoclose (optional)
Normally, pca produces a number of plots and waits for the user to manually close these windows before exiting. If autoclose is non-zero, it will exit on completion without waiting.
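
As an illustration of how the positional arguments fit together, the following Python sketch assembles a pca command line. The helper name build_pca_command and the file names run1.prm and run1_avg.mrc are hypothetical; only the argument order is taken from the synopsis above.

```python
def build_pca_command(prm_file, iteration, num_particles,
                      avg_filename=None, autoclose=None,
                      single_precision=False):
    """Return the argument list for a pca (or pcaSP) invocation."""
    if autoclose is not None and avg_filename is None:
        # autoclose is the fifth positional argument, so avgFilename
        # has to be supplied before it.
        raise ValueError("autoclose requires avgFilename")
    program = "pcaSP" if single_precision else "pca"
    cmd = [program, prm_file, str(iteration), str(num_particles)]
    if avg_filename is not None:
        cmd.append(avg_filename)
    if autoclose is not None:
        cmd.append(str(int(autoclose)))
    return cmd


# Hypothetical file names, iteration 3, 600 particles, no waiting on plots.
cmd = build_pca_command("run1.prm", 3, 600, "run1_avg.mrc", autoclose=1)
print(" ".join(cmd))  # pca run1.prm 3 600 run1_avg.mrc 1
```

On a machine where PEET is installed and on the PATH, the resulting list could be passed to subprocess.run(cmd, check=True).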

The following .prm file parameters are specific to pca:

pcaSzSubvol = integer vector of length 3
The size (less than or equal to szVol) of a central subvolume to analyze. Default is to use the full particle size (szVol).
pcaMethod = < 1 | 2 | 3 >
Specify the type of calculation desired: 1 = PCA, 2 = SVD of re-scaled cross-correlation, 3 = SVD of constrained cross-correlation. Method 1 in conjunction with WMD correction (i.e., with avgFilename specified) is recommended for normal use. Method 2 is less accurate and slower; method 3 is only slightly less accurate but much slower.
pcaFnParticleMask = string
If desired, restrict analysis to specific region(s) within the subvolume by specifying the name of an MRC volume of size szVol containing a binary mask, in which non-zero entries mark the voxels to be analyzed and zeros mark the voxels to ignore. (Default = no mask; i.e., use all voxels).
pcaNumEigenimages = <int>
The number of eigenimages to save when using pcaMethod 1. (Default = 4).
pcaMaxNumComponents = <int>
An upper limit on the number of principal components and corresponding coefficients/features to be saved (Default = 20). Saving many more components than will be used for clustering will consume large amounts of disk space and result in long write times.
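
The pca-specific settings above can be sketched as a small Python script that prints candidate .prm lines. This assumes the parameter file accepts MATLAB-style assignments (bracketed vectors, single-quoted strings); check an existing PEET parameter file before relying on that. The values, the mask file name mask.mrc, and the helper names are hypothetical.

```python
def format_prm_value(value):
    """Format a Python value as a MATLAB-style literal (assumed .prm syntax)."""
    if isinstance(value, str):
        return "'" + value + "'"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(str(v) for v in value) + "]"
    return str(value)


def subvol_fits(pca_sz_subvol, sz_vol):
    """Check that pcaSzSubvol has 3 entries, none larger than szVol."""
    return (len(pca_sz_subvol) == 3 and len(sz_vol) == 3
            and all(0 < s <= v for s, v in zip(pca_sz_subvol, sz_vol)))


sz_vol = [96, 96, 96]                   # hypothetical particle size
pca_settings = {
    "pcaSzSubvol": [64, 64, 64],        # central region to analyze
    "pcaMethod": 1,                     # 1 = PCA (recommended with avgFilename)
    "pcaFnParticleMask": "mask.mrc",    # hypothetical binary mask volume
    "pcaNumEigenimages": 4,             # documented default
    "pcaMaxNumComponents": 20,          # documented default
}

assert subvol_fits(pca_settings["pcaSzSubvol"], sz_vol)
lines = [f"{name} = {format_prm_value(value)}"
         for name, value in pca_settings.items()]
print("\n".join(lines))
```

Omitting any of these lines from the parameter file leaves the corresponding default in effect.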

If the output base name (fnOutput) specified in the .prm file is <basename>, the primary output of pca will be a file pca<numParticles>_<basename>.mat containing the results of the eigendecomposition and other data required by program clusterPca. Depending on the volume size, the number of particles, and the method selected, this file can require tens of gigabytes.

Additionally, pca will produce 3 graphs showing 1) the fractional, cumulative sum of the singular values versus the number of features, 2) a histogram showing the size of the first 10 singular values, and 3) histograms presenting the distributions of the first 8 coefficients/features. For method 1, singular values are proportional to the variance explained by the corresponding feature. These graphs are helpful for choosing features to use for subsequent clustering. Copies in PDF format will also be written to files pcaFig<n>.pdf. Other formats may be selected from each figure's pull-down File menu.
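
To make the first graph concrete, here is a minimal Python sketch of a fractional cumulative sum and of one possible way to pick a feature count from it. The singular values are made up, the 0.8 threshold is arbitrary, and the function names are hypothetical; this is not pca's own code.

```python
from itertools import accumulate


def cumulative_fraction(singular_values):
    """Running sum of the singular values divided by their total."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    return [running / total for running in accumulate(singular_values)]


def num_features_for(singular_values, target):
    """Smallest number of leading features whose fraction reaches target."""
    fractions = cumulative_fraction(singular_values)
    for count, fraction in enumerate(fractions, start=1):
        if fraction >= target:
            return count
    return len(singular_values)


values = [8.0, 4.0, 2.0, 1.0, 1.0]      # made-up singular values
print(cumulative_fraction(values))      # [0.5, 0.75, 0.875, 0.9375, 1.0]
print(num_features_for(values, 0.8))    # 3
```

For method 1, where singular values are proportional to explained variance, such a fraction can be read as the share of variance captured by the leading features.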

Finally, for method 1 only, MRC images corresponding to the first 4 features (eigenvectors) will be saved by default to files eigenImage<n>.mrc; if desired, the number of eigenimages to save can be specified by setting pcaNumEigenimages in the parameter file. In these volumes, large magnitudes (displayed as either black or white) correspond to voxels which change rapidly (with opposite signs) along the corresponding feature vector, while 0 (medium gray) indicates voxels with little or no change.
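
The output files described above can be summarized with a short Python sketch that lists the names to expect for a given run. It assumes that <n> counts up from 1, which the text does not state; the function name and the base name run1 are hypothetical.

```python
def expected_outputs(basename, num_particles, pca_method=1,
                     num_eigenimages=4, num_figures=3):
    """List the output file names described above for one pca run."""
    files = [f"pca{num_particles}_{basename}.mat"]
    # Assumption: figure and eigenimage numbering starts at 1.
    files += [f"pcaFig{n}.pdf" for n in range(1, num_figures + 1)]
    if pca_method == 1:
        # Eigenimages are written for method 1 only.
        files += [f"eigenImage{n}.mrc"
                  for n in range(1, num_eigenimages + 1)]
    return files


for name in expected_outputs("run1", 600):
    print(name)
```

A list like this can be used to check for leftovers from an earlier run in the working directory before starting a new one.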

NOTES: This program is both compute and memory intensive. A fast, multi-core machine with at least 32 GB of RAM is suggested for typical applications (e.g., 600 particles of size 140x140x140 voxels). Specific requirements scale roughly with the product of volume size and number of particles. Insufficient memory will result in thrashing (the system will become unresponsive while showing very low CPU usage) or an error message. When full resolution is not required, prior binning may make a previously unworkable situation tractable, and will also reduce noise sensitivity. Alternatively, program pcaSP is functionally identical to pca except that key data structures and computations are single- rather than double-precision, reducing memory requirements by nearly a factor of 2. Another solution is to perform principal components analysis on a representative subset of the data, followed by decomposition of the entire data set along these principal components with program usePreviousPca.
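
The scaling described in these notes can be illustrated with a rough Python estimate of the size of a single particles-by-voxels data matrix. This is a lower bound under a simplifying assumption: the program's actual memory use is higher because of working copies and intermediate results, which are not modeled here. The function name is hypothetical.

```python
def data_matrix_bytes(num_particles, size, bytes_per_value=8):
    """Bytes needed to hold one value per voxel per particle."""
    nx, ny, nz = size
    return num_particles * nx * ny * nz * bytes_per_value


full = (140, 140, 140)
binned = (70, 70, 70)   # the same particles after binning by 2

double_gb = data_matrix_bytes(600, full) / 1e9               # double precision
single_gb = data_matrix_bytes(600, full, 4) / 1e9            # single precision
binned_gb = data_matrix_bytes(600, binned) / 1e9             # binned, double

print(f"double: {double_gb:.1f} GB")   # double: 13.2 GB
print(f"single: {single_gb:.1f} GB")   # single: 6.6 GB
print(f"binned: {binned_gb:.1f} GB")   # binned: 1.6 GB
```

The comparison shows why pcaSP roughly halves the requirement and why binning by 2, which cuts the voxel count by a factor of 8, can turn an unworkable case into a tractable one.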


John Heumann  


PEET(1), alignSubset(1), averageAll(1), clusterPca(1), removeDuplicates(1), and usePreviousPca(1).




This document was created by man2html, using the manual pages.
Time: 18:16:05 GMT, January 11, 2021