Frequently Asked Questions

  • What is the overarching goal for neuropredict?

    • To offer a comprehensive report on predictive analysis effortlessly while following all the best practices!

    • It aims to interface directly with the outputs of popular neuroscience (and other) tools, to lower the barrier to entry for labs without software expertise and to save time for labs that have it.

    • This tool is generic, as it is simply about proper estimation of predictive performance. Hence, nothing in it is tied to neuroscience data (despite its name), so users can input an arbitrary set of features and targets from any domain (astronomy, nutrition, medicine, pharma or otherwise) and leverage it to produce a comprehensive report.

  • What is your default predictive model/pipeline?

    • Predictive analysis is performed, by default, with a Random Forest classifier/regressor, after some basic preprocessing comprising robust scaling and removal of low-variance features.

    • Model selection (grid search over hyperparameters) is performed in an inner cross-validation loop, following best practices (see the sketch below).
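
    • For readers who think in scikit-learn terms, the default pipeline corresponds roughly to the sketch below; the hyperparameter grid and CV settings here are illustrative, not the exact values neuropredict uses:

      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import VarianceThreshold
      from sklearn.model_selection import GridSearchCV, StratifiedKFold
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import RobustScaler

      # basic preprocessing: robust scaling + removal of low-variance features
      pipe = Pipeline([('scale', RobustScaler()),
                       ('var_thresh', VarianceThreshold()),
                       ('rf', RandomForestClassifier())])

      # illustrative hyperparameter grid (not the exact set neuropredict tunes)
      param_grid = {'rf__n_estimators': [100, 250, 500],
                    'rf__max_features': ['sqrt', 'log2']}

      # model selection via grid search within an inner cross-validation loop;
      # the tuned model is then evaluated on a held-out test split
      inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      tuned_model = GridSearchCV(pipe, param_grid, cv=inner_cv)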

  • Why did you pick random forests to be the default model?

    • Because they have consistently demonstrated top performance across multiple domains:

      • Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, 3133–3181.

      • Lebedev, A. V., Westman, E., Van Westen, G. J. P., Kramberger, M. G., Lundervold, A., Aarsland, D., et al. (2014). Random Forest ensembles for detection and prediction of Alzheimer’s disease with a good between-cohort robustness. NeuroImage: Clinical, 6, 115–125.

    • Because random forests natively handle multi-class problems (ideal for classification) and automatically estimate feature importance.

  • Can I use a different model?

    • Yes. Users can choose among several techniques offered by scikit-learn and xgboost.

    • We plan to integrate and support other useful machine learning libraries as well. Let me know if you would like something that is not already integrated: as long as it is implemented in Python, we will integrate it.

  • Can I try multiple models?

    • To produce results with different models, you can run neuropredict multiple times with a different model/combination each time.

    • We do not yet offer an option to batch-process and compare multiple models at once, to discourage a p-hacking-style hunt for the maximum possible / publishable performance. Review the figure below for some examples of subtle sources of bias, and check out the broader discussion in my cross-validation tutorial here.

      [Figure: subtle types of bias and hacking in cross-validation]
    • If you do try different models and different combinations of the choices offered, we strongly recommend documenting all your trials and reporting them when publishing your results.

  • What are the options for my feature selection / dimensionality reduction?

    • By default, neuropredict selects the top k = n_train/10 features, ranked by their variable importance as computed by the Random Forest classifier/regressor, where n_train is the number of training samples. The value of n_train depends on the size of the smallest class in the dataset: n_train = train_perc*n_smallest*n_C, where train_perc is the fraction of the dataset reserved for training, n_C is the number of classes, and n_smallest is the size of the smallest class. For example, with train_perc = 0.5, n_smallest = 40 and n_C = 2, n_train = 0.5 * 40 * 2 = 40 and k = 4. (A rough scikit-learn sketch of this default selection appears at the end of this answer.)

    • You can also choose the value of k via the -k / --reduced_dim_size option.

    • Stratifying the training set by the size of the smallest class n_smallest in the given dataset helps alleviate class-imbalance problems and improves the robustness of the model.

    • The following methods are available at the moment, via the -dr / --dim_red_method option:

      • feature selection: SelectKBest_mutual_info_classif, SelectKBest_f_classif, VarianceThreshold

      • dimensionality reduction: Isomap, LLE, LLE_modified, LLE_Hessian, LLE_LTSA

    • We plan to offer even more choices for feature selection and dimensionality reduction in the near future - let me know your suggestions. That said, the benefit of trying many arbitrary choices of feature selection method seems unclear. The overarching goals of neuropredict may help explain the current choice:

      • to enable novice predictive modeling users to get started easily and quickly,

      • to provide a thorough estimate of the baseline performance of their feature sets, instead of trying to find an arbitrary combination of predictive modeling tools to drive the numerical performance as high as possible.

    • Keep in mind that the Random Forest classifier/regressor largely ignores features without any useful signal.

    • neuropredict is designed such that another model or combination of classifiers could easily be plugged in. We may add an option to integrate one of the following libraries to automatically select the model with the highest performance: scikit-optimize, auto_ml, tpot, etc.
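
    • As a rough scikit-learn sketch of the default selection described above (top k features ranked by Random Forest variable importance); the estimator settings and numbers here are illustrative:

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import SelectFromModel

      # e.g. train_perc = 0.5, n_smallest = 40, n_C = 2  ->  n_train = 40, k = 4
      n_train = 40
      k = max(1, n_train // 10)

      # keep exactly the top-k features ranked by Random Forest importance
      selector = SelectFromModel(RandomForestClassifier(n_estimators=250),
                                 max_features=k, threshold=-np.inf)
      # selector.fit(X_train, y_train); X_reduced = selector.transform(X_train)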

  • Does neuropredict handle covariates?

    • Yes. In fact, this is a unique feature of neuropredict that is simply not possible in scikit-learn by itself, due to some of its design limitations. We are not aware of any other libraries offering this feature.

    • Using this feature requires the pyradigm data structures, which let you attach an arbitrary set of attributes to each subject (a minimal sketch follows below).
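
    • A minimal sketch of the idea, assuming the pyradigm ClassificationDataset API; the class and method names below (ClassificationDataset, add_samplet, add_attr, save) are assumptions based on the pyradigm documentation, so please verify them against your installed version:

      import numpy as np
      from pyradigm import ClassificationDataset   # assumed class name

      ds = ClassificationDataset()
      for sub_id, target in [('sub01', 'patient'), ('sub02', 'control')]:
          features = np.random.rand(100)            # placeholder feature vector
          ds.add_samplet(sub_id, features, target)  # assumed method name
          ds.add_attr('age', sub_id, 42)            # assumed: attach a covariate to this subject

      ds.save('/tmp/example.PyradigmDataset.pkl')   # assumed: serialize for use with neuropredict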

  • Can I get ROC curves?

    • Not at the moment, as the presented results and report are aggregated over a large number of CV iterations, and there is no single ROC curve to represent them all.

    • It is indeed possible to average ROC curves from multiple iterations (see the reference below) and visualize the result; this feature will be added soon. A rough sketch of such averaging appears at the end of this answer.

      • ROC Reference: Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

    • For multi-class classification problems, ROC analysis (a hyper-surface, to be precise) quickly becomes intractable. The author is currently not aware of any easy solutions; if you are aware of any, or could contribute, it would be greatly appreciated.
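
    • For reference, here is a rough sketch of the kind of vertical averaging mentioned above, on synthetic scores (this is not neuropredict code; all names and values are illustrative):

      import numpy as np
      from sklearn.metrics import roc_curve

      rng = np.random.default_rng(0)
      mean_fpr = np.linspace(0, 1, 101)         # common grid of false-positive rates
      interp_tprs = []
      for _ in range(10):                       # stand-in for the CV iterations
          y_true = rng.integers(0, 2, size=200)                 # synthetic labels
          y_score = y_true + rng.normal(scale=0.8, size=200)    # synthetic scores
          fpr, tpr, _ = roc_curve(y_true, y_score)
          interp_tprs.append(np.interp(mean_fpr, fpr, tpr))     # TPR onto the common grid

      mean_tpr = np.mean(interp_tprs, axis=0)   # the averaged ROC curve
      std_tpr = np.std(interp_tprs, axis=0)     # spread across iterations, e.g. for a shaded band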

  • Can I compare an arbitrary set of my own custom-designed features?

    • Yes. That is quite easy when datasets are managed via pyradigm. You can also use the -u option to supply an arbitrary set of paths where your custom features are stored in a loose / decentralized format (with the features for each samplet stored in a separate folder/file), e.g.

      • -y /myproject/awsome-new-idea-v2.0.PyradigmDataset.pkl /project-B/DTI_FA_Method1.PyradigmDataset.pkl

      • -u /sideproject/resting-dynamic-fc /prevproject/resting-dynamic-fc-competingmethod