Usage
Note
Since version 0.6, to leverage the advanced features for handling confounds/covariates within nested cross-validation, neuropredict requires the input datasets in the pyradigm format. This is needed to accurately infer the names and data types of the confounding variables and their values, which is very hard or impossible with CSV files, especially when dealing with multiple modalities/feature-sets. Learn more about this data structure at http://raamana.github.io/pyradigm/.
pyradigm datasets not only enable full use of neuropredict, but also help with reproducibility in a number of ways: they make it easier to share datasets with collaborators and colleagues, and to track their provenance with various user-defined attributes. More @ http://raamana.github.io/pyradigm/.
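For orientation, here is a minimal sketch of building such a dataset. The class and method names below (ClassificationDataset, add_samplet, save) follow recent pyradigm versions, and the features, IDs and file path are toy placeholders; check the pyradigm documentation if your installed version differs.

```python
# A minimal sketch of assembling a pyradigm dataset for neuropredict.
# API names here are assumptions based on recent pyradigm versions;
# verify against http://raamana.github.io/pyradigm/ for your install.
import numpy as np
from pyradigm import ClassificationDataset

ds = ClassificationDataset()
rng = np.random.default_rng(0)
for subj_id, klass in [('sub01', 'A'), ('sub02', 'B'), ('sub03', 'A')]:
    # 10 toy features per subject; replace with your real feature vector
    ds.add_samplet(samplet_id=subj_id,
                   features=rng.standard_normal(10),
                   target=klass)

ds.save('toy_dataset.MLDataset.pkl')  # path and extension are placeholders
```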
Using the command line interface for neuropredict is strongly recommended, given its focus on batch processing of multiple datasets and comparisons. There are two main streams in neuropredict: the Classification workflow and the Regression workflow. Check their respective pages for instructions on their usage.
The high-level differences between the two workflows are the following:
The targets for prediction in the classification workflow are categorical and discrete (e.g. “health” vs. “disease”, “monkey” vs. “chair”, “1” vs. “2”), whereas in the regression workflow they are numerical and continuous (e.g. test score, income, age, disease severity).
In the classification workflow, you can have more than two classes (from now on referred to as targets, to be consistent), and hence we offer the ability to select a sub-group (a subset of classes) for analysis. For example, if your dataset has 4 classes A, B, C, and D, you can choose to analyze the binary comparison A vs. B with the flag --sub_groups A,B. You can also add a 3-class comparison B vs. C vs. D by listing both sub-groups: -sg A,B B,C,D (see the sketch after this paragraph). The concept of sub-grouping does not exist in the regression workflow, as there is no meaningful or universal way of categorizing regression target values. In addition, the concepts of class imbalance, and of stratifying the training set to control the class sizes in the training and/or test sets, exist only in the classification workflow.
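For concreteness, a classification run with those two sub-groups might look like the following. Only -sg/--sub_groups is confirmed above; the neuropredict_classify entry point and the -y (pyradigm dataset path) and -o (output directory) flags are assumptions from memory, and the paths are placeholders, so verify everything against --help for your installed version.

```bash
# a sketch; entry-point name, -y and -o are assumptions -- check --help
neuropredict_classify -y dataset_modality1.MLDataset.pkl -o results/ \
    -sg A,B B,C,D
```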
The performance metrics change drastically between the two workflows: classification analyses often focus on accuracy, AUC and confusion matrices, whereas regression analyses discuss r², MAE, explained variance, MSE etc. Hence, the results saved to disk differ in their structure at the metric level (as well as in additional attributes), and need to be handled separately depending on the workflow. However, the results are stored in the CVResults class in a comprehensive manner, including the original predictions for each CV repetition. So, you can easily compute different or additional metrics yourself, as sketched below.
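Because the original predictions are retained, any extra metric can be recomputed offline with scikit-learn. A minimal sketch, assuming you have already extracted the true targets and predictions for one CV repetition (the toy arrays below stand in for them; consult the CVResults API for the actual attribute names):

```python
# Computing additional metrics from saved predictions; y_true / y_pred
# stand in for arrays extracted from a CVResults object.
import numpy as np
from sklearn import metrics

# --- classification workflow: targets are discrete labels ---
y_true = np.array(['A', 'B', 'A', 'B', 'B'])
y_pred = np.array(['A', 'B', 'B', 'B', 'A'])
print(metrics.accuracy_score(y_true, y_pred))
print(metrics.balanced_accuracy_score(y_true, y_pred))
print(metrics.confusion_matrix(y_true, y_pred, labels=['A', 'B']))

# --- regression workflow: targets are continuous values ---
y_true = np.array([23.0, 45.5, 31.2, 50.1])
y_pred = np.array([25.1, 43.0, 30.0, 48.7])
print(metrics.r2_score(y_true, y_pred))
print(metrics.mean_absolute_error(y_true, y_pred))
print(metrics.mean_squared_error(y_true, y_pred))
print(metrics.explained_variance_score(y_true, y_pred))
```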
If something is unclear, please let me know by opening an issue on GitHub. Thanks.