Classification workflow

Once installation is successful, the usage for this workflow can be obtained by typing either of the following commands in the terminal.

neuropredict_classify
np_classify -h

The options for this workflow are also shown below:

Easy, standardized and comprehensive predictive analysis.

usage: neuropredict_classify [-h] [-m META_FILE] [-o OUT_DIR]
                             [-y PYRADIGM_PATHS [PYRADIGM_PATHS ...]]
                             [-u USER_FEATURE_PATHS [USER_FEATURE_PATHS ...]]
                             [-d DATA_MATRIX_PATHS [DATA_MATRIX_PATHS ...]]
                             [-t TRAIN_PERC] [-n NUM_REP_CV]
                             [-k REDUCED_DIM_SIZE]
                             [-g {none,light,exhaustive}]
                             [-is {median,mean,most_frequent,raise}]
                             [-cl COVARIATES [COVARIATES ...]]
                             [-cm COVAR_METHOD]
                             [-dr {selectkbest_mutual_info_classif,selectkbest_f_classif,variancethreshold,isomap,lle,lle_modified,lle_hessian,lle_ltsa}]
                             [-z MAKE_VIS] [-c NUM_PROCS] [--po PRINT_OPT_DIR]
                             [-v] [-f FS_SUBJECT_DIR]
                             [-a ARFF_PATHS [ARFF_PATHS ...]]
                             [-p POSITIVE_CLASS]
                             [-sg [SUB_GROUPS [SUB_GROUPS ...]]]
                             [-e {randomforestclassifier,extratreesclassifier,decisiontreeclassifier,svm,xgboost}]

Named Arguments

-m, --meta_file

Absolute path to the file containing metadata for the subjects to be included in the analysis.

At a minimum, each row should contain a subject ID followed by the class it belongs to.

E.g.:

sub001,control
sub002,control
sub003,disease
sub004,disease
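
For illustration, a minimal sketch (using Python's standard csv module; subject IDs and labels are placeholders) of generating such a metadata file:

    import csv

    # hypothetical subject IDs paired with the class each belongs to
    subjects = [('sub001', 'control'), ('sub002', 'control'),
                ('sub003', 'disease'), ('sub004', 'disease')]

    # one row per subject: ID followed by its class
    with open('meta_data.csv', 'w', newline='') as meta_file:
        csv.writer(meta_file).writerows(subjects)
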
-o, --out_dir

Output folder to store gathered features & results.

-f, --fs_subject_dir

Absolute path to SUBJECTS_DIR containing the finished runs of Freesurfer parcellation. Each subject will be looked up by its ID from the metadata file. E.g. --fs_subject_dir /project/freesurfer_v5.3

Input data and formats

Only one of the following types can be specified.

-y, --pyradigm_paths

Path(s) to pyradigm datasets.

Each path is a self-contained dataset identifying each sample, its class and its features.

-u, --user_feature_paths

List of absolute paths to user’s own features.

Format: Each of these folders contains a separate folder for each subject (named after its ID in the metadata file) containing a file called features.txt with one number per line. All the subjects (in a given folder) must have the same number of features (#lines in the file). Different parent folders (describing one feature set) can have different numbers of features per subject, but they must all have the same number of subjects (folders) within them.

The name of each folder is used to annotate the results in visualizations. Hence name them uniquely and meaningfully, keeping in mind these figures will be included in your papers. For example,

--user_feature_paths /project/fmri/ /project/dti/ /project/t1_volumes/

Only one of the --pyradigm_paths, user_feature_paths, data_matrix_path or arff_paths options can be specified.
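
To illustrate the expected layout, here is a rough sketch (paths, IDs and feature values are hypothetical placeholders) that writes a features.txt for each subject under one parent folder describing a single feature set:

    from pathlib import Path
    import numpy as np

    parent = Path('fmri')   # one parent folder per feature set, e.g. /project/fmri (placeholder)
    subject_ids = ['sub001', 'sub002', 'sub003', 'sub004']

    for sid in subject_ids:
        sub_dir = parent / sid          # folder named after the subject ID in the metadata file
        sub_dir.mkdir(parents=True, exist_ok=True)
        features = np.random.rand(100)  # placeholder: same number of features for every subject
        np.savetxt(sub_dir / 'features.txt', features)   # one number per line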

-d, --data_matrix_paths

List of absolute paths to text files containing one matrix of size N x p (num_samples x num_features).

Each row in a data matrix file must represent the data for the sample in the same row of the metadata file (the metadata file and the data matrix must be in row-wise correspondence).

Name of this file will be used to annotate the results and visualizations.

E.g. --data_matrix_paths /project/fmri.csv /project/dti.csv /project/t1_volumes.csv

Only one of --pyradigm_paths, user_feature_paths, data_matrix_path or arff_paths options can be specified.

File format could be:

  • a simple comma-separated text file (with extension .csv or .txt), which can easily be read back with numpy.loadtxt(filepath, delimiter=','), or

  • a numpy array saved to disk (with extension .npy or .numpy), which can be read back with numpy.load(filepath).

One could use numpy.savetxt(filepath, data_array, delimiter=',') or numpy.save(filepath, data_array) to save the features.

File format is inferred from its extension.
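
As a quick sketch (file names are arbitrary placeholders), saving a data matrix and reading it back in either format could look like:

    import numpy as np

    data_matrix = np.random.rand(90, 300)   # placeholder N x p matrix (num_samples x num_features)

    # comma-separated text format (.csv or .txt)
    np.savetxt('fmri.csv', data_matrix, delimiter=',')
    reloaded_csv = np.loadtxt('fmri.csv', delimiter=',')

    # binary numpy format (.npy), more compact for high-dimensional data
    np.save('fmri.npy', data_matrix)
    reloaded_npy = np.load('fmri.npy')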

-a, --arff_paths

List of paths to files saved in Weka’s ARFF dataset format.

Note:

  • this format does NOT allow IDs for each subject.

  • given feature values are saved in text format, this can lead to large files with high-dimensional data, compared to numpy arrays saved to disk in binary format.

More info: https://www.cs.waikato.ac.nz/ml/weka/arff.html
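
If needed, an ARFF file can be inspected with scipy before passing it to neuropredict; a small sketch (the file name is a placeholder):

    from scipy.io import arff

    data, meta = arff.loadarff('features.arff')   # placeholder file name
    print(meta)          # attribute names and types
    print(data.shape)    # one record per sample; note there is no subject ID column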

Cross-validation

Parameters related to training and optimization during cross-validation

-t, --train_perc

Percentage of the smallest class to be reserved for training.

Must be in the interval [0.01, 0.99].

If sample size is sufficiently big, we recommend 0.5. If sample size is small, or class imbalance is high, choose 0.8.
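
As a rough sketch of how this percentage translates into a training-set size (class sizes here are made up, and the exact sampling scheme is internal to neuropredict):

    # hypothetical class sizes and training percentage
    class_sizes = {'control': 45, 'disease': 30}
    train_perc = 0.5

    # the count reserved for training is based on the smallest class
    n_train = int(train_perc * min(class_sizes.values()))
    print(n_train)   # 15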

-n, --num_rep_cv

Number of repetitions of the repeated-holdout cross-validation.

The larger the number, the more stable the estimates will be.

-k, --reduced_dim_size

Number of features to select as part of feature selection. Options:

  • ‘tenth’

  • ‘sqrt’

  • ‘log2’

  • ‘all’

  • or an integer k <= the smallest dimensionality across all feature sets

Default: tenth of the number of samples in the training set.

For example, if your dataset has 90 samples and you choose 50 percent for training (the default), then the training set will have 90*0.5=45 samples, leading to 5 features being selected for training. If you choose a fixed integer k, ensure all the feature sets under evaluation have at least k features.
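
A sketch of how these options could map to a number of selected features, mirroring the example above (the exact rounding used inside neuropredict may differ):

    import numpy as np

    def reduced_dim(option, n_train, n_features_total):
        """Illustrative mapping of the -k options to a feature count (an assumption, not neuropredict's code)."""
        if option == 'tenth':
            return int(np.ceil(n_train / 10))
        if option == 'sqrt':
            return int(np.ceil(np.sqrt(n_train)))
        if option == 'log2':
            return int(np.ceil(np.log2(n_train)))
        if option == 'all':
            return n_features_total
        return int(option)   # a fixed integer k

    print(reduced_dim('tenth', 45, 300))   # 5, matching the example above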

-g, --gs_level

Possible choices: none, light, exhaustive

Flag to specify the level of grid search during hyper-parameter optimization on the training set.

Allowed options are: ‘none’, ‘light’ and ‘exhaustive’, in the order of how many parameters/values will be optimized. More parameters and more values demand more resources and much longer time for optimization.

The ‘light’ option relies on “folk wisdom” to try the fewest values (no more than one or two) for the parameters of the given classifier (e.g. a large number, say 500 trees, for a random forest). The ‘light’ option will be the fastest and should give a “rough idea” of predictive performance. The ‘exhaustive’ option will try the most values for the most parameters that can be optimized.
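
To make the distinction concrete, here is a hedged sketch of what a 'light' versus an 'exhaustive' grid could look like for a random forest, using scikit-learn's GridSearchCV (these particular grids are illustrative assumptions, not neuropredict's actual search spaces):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # 'light': one or two values for a single parameter, with sensible fixed defaults elsewhere
    light_grid = {'max_features': ['sqrt', 'log2']}

    # 'exhaustive': more parameters and more values, hence many more fits
    exhaustive_grid = {'n_estimators': [100, 250, 500],
                       'max_features': ['sqrt', 'log2', 0.25],
                       'min_samples_leaf': [1, 3, 5]}

    search = GridSearchCV(RandomForestClassifier(n_estimators=500), light_grid, cv=5)
    # search.fit(train_features, train_targets)   # optimized on the training set only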

-p, --positive_class

Name of the positive class (e.g. Alzheimers, MCI etc) to be used in calculation of area under the ROC curve. This is applicable only for binary classification experiments.

Default: class appearing last in order specified in metadata file.
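
The positive class plays the same role as pos_label in scikit-learn's ROC utilities; a tiny sketch (labels and scores are made up):

    from sklearn.metrics import roc_curve, auc

    y_true   = ['control', 'disease', 'disease', 'control', 'disease']
    y_scores = [0.10, 0.80, 0.65, 0.30, 0.90]   # hypothetical predicted probabilities for 'disease'

    # 'disease' is treated as the positive class when computing the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label='disease')
    print(auc(fpr, tpr))   # area under the ROC curve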

-sg, --sub_groups

This option allows the user to study different combinations of classes in a multi-class (N>2) dataset.

For example, in a dataset with 3 classes CN, FTD and AD, two pair-wise combinations can be studied separately with the following flag: --sub_groups CN,FTD CN,AD. This allows the user to focus on a few interesting subgroups depending on their dataset/goal.

Format: Different subgroups must be separated by space, and each sub-group must be a comma-separated list of class names defined in the meta data file. Hence it is strongly recommended to use class names without any spaces, commas, hyphens and special characters, and ideally just alphanumeric characters separated by underscores.

Any number of subgroups can be specified, but each subgroup must have at least two distinct classes.

Default: 'all', leading to the inclusion of all available classes in an all-vs-all multi-class setting.

Predictive Model

Parameters of the pipeline comprising the predictive model

-is, --impute_strategy

Possible choices: median, mean, most_frequent, raise

Strategy to impute any missing data (as encoded by NaNs).

Default: ‘raise’, which raises an error if there is any missing data anywhere. Currently available imputation strategies are: (‘median’, ‘mean’, ‘most_frequent’)
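
These strategy names match those of scikit-learn's SimpleImputer; a minimal sketch of what such imputation does (the data are made up, and whether neuropredict uses this exact class internally is not asserted here):

    import numpy as np
    from sklearn.impute import SimpleImputer

    features = np.array([[1.0, 2.0],
                         [np.nan, 3.0],
                         [4.0, np.nan]])

    # replace NaNs with the per-feature median learned from the given data
    imputed = SimpleImputer(strategy='median').fit_transform(features)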

-cl, --covariates

List of covariates to be taken into account. They must be present in the original feature set in pyradigm format, which is required to implement the deconfounding (covariate regression) properly. The pyradigm data structure allows you to specify data type (categorical or numerical) for each covariate/attribute, which is necessary to encode them accurately.

Specify them as a comma-separated list of strings without any spaces or special characters, exactly as you encoded them in the input pyradigm dataset. Example: -cl age,site

-cm, --covar_method

Type of “deconfounding” method used to handle confounds/covariates. This method is trained on the training set only (using features and covariates, not the targets). The trained model is then used to deconfound the training features (prior to fitting the predictive model) and to transform the test set prior to making predictions on it.

Available choices: (‘residualize’, ‘augment’)
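
As a rough illustration of the 'residualize' idea (a generic sketch with scikit-learn, not neuropredict's internal implementation; all arrays are random placeholders):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # hypothetical training/test features and covariates (e.g. age, site encoded numerically)
    train_X, train_covar = np.random.rand(40, 10), np.random.rand(40, 2)
    test_X,  test_covar  = np.random.rand(10, 10), np.random.rand(10, 2)

    # fit the deconfounder on the training set only (features vs covariates; targets are not used)
    deconfounder = LinearRegression().fit(train_covar, train_X)

    # residualize: remove the part of the features explained by the covariates
    train_X_clean = train_X - deconfounder.predict(train_covar)
    test_X_clean = test_X - deconfounder.predict(test_covar)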

-dr, --dim_red_method

Possible choices: selectkbest_mutual_info_classif, selectkbest_f_classif, variancethreshold, isomap, lle, lle_modified, lle_hessian, lle_ltsa

Feature selection, or dimensionality reduction method to apply prior to training the classifier.

NOTE: when feature ‘selection’ methods are used, we are able to keep track of which features in the original input space were selected, and hence can visualize their importance after the repetitions of CV. When the more generic ‘dimensionality reduction’ methods are used, features often get transformed to new subspaces, wherein the link to the original features is lost. Hence, importance values for the original input features cannot be computed and are not visualized.

Default: VarianceThreshold, removing the features with the lowest 0.001 percent of variance (e.g. those that are all zeros).
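
For orientation, the default and the SelectKBest options correspond to scikit-learn transformers; a hedged sketch (the threshold and k below are illustrative, not neuropredict's internal settings):

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

    X = np.random.rand(45, 300)       # hypothetical training features
    y = np.random.randint(0, 2, 45)   # hypothetical class labels

    # drop near-constant features (e.g. all zeros)
    X_var = VarianceThreshold(threshold=1e-5).fit_transform(X)

    # univariate feature selection keeps track of which original features were chosen
    selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
    selected_idx = selector.get_support(indices=True)   # indices into the original feature space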

-e, --classifier

Possible choices: randomforestclassifier, extratreesclassifier, decisiontreeclassifier, svm, xgboost

String specifying one of the implemented classifiers. (Classifiers are carefully chosen to allow for the comprehensive report provided by neuropredict).

Default: ‘RandomForestClassifier’

Visualization

Parameters related to generating visualizations

-z, --make_vis

Option to make visualizations from existing results in the given path. This is helpful when neuropredict fails to generate the result figures automatically, e.g. on an HPC cluster or in another environment where DISPLAY is not available.

Computing

Parameters related to computations/debugging

-c, --num_procs

Number of CPUs to use to parallelize CV repetitions.

Default : 4.

The number of CPUs will be capped at the number available on the machine if a higher number is requested.

--po, --print_options

Prints the options used in the run in the given output folder.

-v, --version

show program’s version number and exit