Manpage of clusterPca


Section: User Commands (1)


clusterPca - K-means clustering (unsupervised classification) of aligned subvolumes.  


clusterPca prmFile matFilename nClusters features
clusterPca prmFile matFilename nClusters features autoclose
clusterPca prmFile matFilename nClusters features autoclose flgPdf
clusterPca prmFile matFilename nClusters features autoclose flgPdf...
  method
clusterPca prmFile matFilename nClusters features autoclose flgPdf...
  method maxIter
clusterPca prmFile matFilename nClusters features autoclose flgPdf...
  method maxIter nReplicates  


Cluster selected, aligned subvolumes into nClusters classes by repeated application of K-means to the specified features/coefficients. Coefficients and other required data will be read from the .mat file matFilename previously created by executing "pca" or "usePreviousPca". See the pca(1) and usePreviousPca(1) man pages for more details. Arguments to clusterPca are:
prmFile
  The file name of the parameter file previously used to perform the alignment and principal component analysis (pca). See the PEET, averageAll, and pca man pages for descriptions of applicable parameter file settings.
matFilename
  The name of the .mat file previously computed by the pca program. E.g. if the base output name specified in the .prm file is <basename> and 400 particles were selected during pca, this file would be named pca400_<basename>.mat.
nClusters
  The integer number of clusters desired.
features
  A string selecting which of the features/coefficients to use for clustering. pca will have computed the set of features and attempted to arrange them in order of decreasing importance. To select features 1 through 4, for example, you would specify "1:4" or "1,2,3,4". Similarly, to select features 1, 3, and 4, you would specify "1,3,4" or "1,3:4".
autoclose
  Normally, clusterPca produces a scatter plot showing the results of clustering and waits for the user to manually close this window before exiting. If autoclose is non-zero, it will exit on completion without waiting.
flgPdf
  If true (non-zero), save the cluster scatter plot in pdf rather than png format. Pdf will be higher quality but can take much longer to save for large datasets.
method
  The method to use for clustering. Select either 'kmeans' for k-means clustering (default) or 'hac' for hierarchical ascendant clustering.
maxIter
  The maximum number of iterations (Default = 100) to allow during each run of the K-means algorithm. If multiple warnings about failure to converge occur during k-means clustering, try increasing this number. This may be necessary for very large datasets.
nReplicates
  The number of times (Default = 10) to run the K-means algorithm with different starting seeds.
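
The features argument mixes comma-separated indices with colon ranges. The following is a minimal sketch of how such a string expands, assuming inclusive ranges as in the "1:4" example above. parse_features is a hypothetical helper written for illustration and is not part of PEET.

```python
def parse_features(spec):
    """Expand a selection such as "1,3:4" into a list of 1-based feature indices.

    Hypothetical helper: assumes "a:b" is an inclusive range, matching the
    "1:4" == "1,2,3,4" example in the text.
    """
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if ":" in token:
            start, stop = (int(part) for part in token.split(":"))
            indices.extend(range(start, stop + 1))
        else:
            indices.append(int(token))
    return indices


print(parse_features("1:4"))    # [1, 2, 3, 4]
print(parse_features("1,3:4"))  # [1, 3, 4]
```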

Repeated execution of clusterPca will typically be required to choose the desired features and number of clusters. To facilitate this process, Akaike and Bayes information criteria (AIC and BIC, respectively) and their improvement relative to a single (i.e. homogeneous) cluster are reported. Both improvements should be well above zero for the chosen clustering to be judged significant, with larger improvements corresponding to larger (penalized, negative log) likelihoods for the specified clustering.
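
For orientation, the textbook definitions of the two criteria can be sketched as below. This man page does not state how clusterPca computes the likelihood or counts parameters, so the functions take those as inputs. The sign convention for the improvement (single-cluster value minus clustered value) is an assumption chosen so that positive means better, and the numbers are made up.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayes information criterion: k ln(n) - 2 ln L (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def improvement(single_cluster_value, clustered_value):
    """Positive when the clustering scores better than one homogeneous cluster."""
    return single_cluster_value - clustered_value


# Made-up log-likelihoods and parameter counts, for illustration only.
n_obs = 400
aic_gain = improvement(aic(-1500.0, 8), aic(-1200.0, 24))
bic_gain = improvement(bic(-1500.0, 8, n_obs), bic(-1200.0, 24, n_obs))
print(aic_gain > 0 and bic_gain > 0)
```

BIC penalizes each extra parameter by ln(n) rather than 2, so for more than a handful of particles it is the stricter of the two tests.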

On completion, clusterPca will display a graph of the clustering results. This graph will also be saved to clusterPca.png unless flgPdf is true. (Prior to 1.10.1, pdf was the default, but this proved excessively slow for large datasets). If more than 3 features were used, the clusters will be projected down to the space spanned by the first 3 features for viewing. 3D plots can be rotated for better visibility using the toolbar, and can be saved manually in a variety of formats using the plot's File menu.

An IMOD model for tomogram <T> with particles from class I in object I will be saved to a file named class_tom<T>_<fnModParticle>.mod, where <fnModParticle> is the name of the model corresponding to tomogram <T>. Objects nClusters+1 and nClusters+2 will be associated with particles flagged as duplicates and unclassified particles, respectively. These models are intended for viewing (e.g. with 3dmodv) only, and will not be compatible with either the input or output motive lists, since particles will have been re-ordered and dispersed among multiple objects.
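
The object layout described above can be summarized in a few lines. object_number is a hypothetical helper used only to illustrate the numbering. Letting the duplicate flag take precedence over a class ID is an assumption of this sketch, not something this page specifies.

```python
def object_number(n_clusters, class_id=None, is_duplicate=False):
    """Return the 1-based IMOD object holding a particle.

    Layout from the text: class I -> object I, duplicates -> nClusters + 1,
    unclassified -> nClusters + 2. The precedence of the duplicate flag over
    a class ID is an assumption made for this sketch.
    """
    if is_duplicate:
        return n_clusters + 1
    if class_id is None:
        return n_clusters + 2
    if not 1 <= class_id <= n_clusters:
        raise ValueError("class_id must lie in 1..n_clusters")
    return class_id


print(object_number(4, class_id=2))        # 2
print(object_number(4, is_duplicate=True)) # 5
print(object_number(4))                    # 6
```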

Finally, output motive lists with the assigned classes stored in column 20 of the motive list will be written to files named pca_<basename>_MOTL_Tom<T>_Iter<I>.csv, where <basename> is the specified output name, <T> is the tomogram number, and <I> is 1 plus the iteration number used to generate the alignment. Subvolume averages corresponding to an individual class can be generated by backing up the original output motive lists (i.e. those without the "pca_" prefix), replacing them with the new ones, and then running averageAll with selectClassID in the prm file set to the desired class number. Typically, one would also set lstThresholds to a single, large number to avoid generating multiple averages for each class.
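
A quick way to check class sizes before running averageAll is to tally column 20 of an output motive list. The sketch below builds the file name from the pattern above and counts particles per class. Both functions are hypothetical helpers, "run1" is a made-up basename, and the parser assumes plain comma-separated text, skipping any row whose class field is not numeric (such as a header row).

```python
import csv
from collections import Counter

CLASS_COLUMN = 20  # 1-based column holding the assigned class, per the text


def motl_name(basename, tomogram, alignment_iteration):
    """Build pca_<basename>_MOTL_Tom<T>_Iter<I>.csv with I = iteration + 1."""
    return f"pca_{basename}_MOTL_Tom{tomogram}_Iter{alignment_iteration + 1}.csv"


def class_counts(path):
    """Count particles per class in one output motive list.

    Assumption: the file is plain comma-separated text; rows whose class
    field is not numeric (for example a header row) are skipped.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < CLASS_COLUMN:
                continue
            try:
                counts[int(float(row[CLASS_COLUMN - 1]))] += 1
            except ValueError:
                continue
    return counts


print(motl_name("run1", 1, 3))  # pca_run1_MOTL_Tom1_Iter4.csv
```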

NOTE: like pca, this program is potentially memory intensive. 32 GB or more of RAM may be required for typical applications (e.g. approximately 600 particles of size 140x140x140 voxels). Binning or single-precision computation may be required for larger data sets or systems with insufficient memory. See the pca man page for additional details.
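
As a rough check on the figures in the note, the arithmetic below estimates the size of a dense matrix holding every voxel of every particle. It assumes, purely for illustration, that each particle is stored as a full vector of voxels. The program's actual memory layout is not described here, and working copies made during computation would add to the total.

```python
def data_matrix_bytes(n_particles, volume_shape, bytes_per_voxel):
    """Bytes needed to hold every particle as one dense vector of voxels."""
    voxels = 1
    for extent in volume_shape:
        voxels *= extent
    return n_particles * voxels * bytes_per_voxel


double_precision = data_matrix_bytes(600, (140, 140, 140), 8)
single_precision = data_matrix_bytes(600, (140, 140, 140), 4)
print(f"{double_precision / 1e9:.1f} GB double, {single_precision / 1e9:.1f} GB single")
# 13.2 GB double, 6.6 GB single
```

Under the same assumption, binning by 2 in each dimension reduces the voxel count, and hence this estimate, by a factor of 8.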


John Heumann  


PEET(1), alignSubset(1), averageAll(1), pca(1), removeDuplicates(1), usePreviousPca(1).




This document was created by man2html, using the manual pages.
Time: 18:16:05 GMT, January 11, 2021