Usage

To use the confounds library in a project, import the necessary classes or functions, e.g.:

from confounds import Residualize, Augment, DummyDeconfounding

Example usage

Let's say the features from your dataset are contained in X (of size N x p), and the confounds/covariates for those samples are stored in C (of size N x c), with X and C having row-wise correspondence for each sample.
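For concreteness, here is a minimal synthetic setup (the shapes and the random data are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, p, c = 100, 30, 2                 # number of samples, features and confounds
X = rng.standard_normal((N, p))      # features: one row per sample
C = rng.standard_normal((N, c))      # confounds (e.g. age, site): same row order as X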

X is often split into train_X and test_X inside the cross-validation loop, and the corresponding splits of C are train_C and test_C.
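As a sketch (assuming X and C are plain NumPy arrays), splitting them in a single call keeps their rows aligned:

from sklearn.model_selection import train_test_split

# passing X and C together ensures the same rows go to the same split
train_X, test_X, train_C, test_C = train_test_split(X, C, test_size=0.3, random_state=0)

With these splits in hand, the estimator classes (e.g. Residualize()) from the confounds library are used in the following easy manner: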

resid = Residualize()          # instantiation; you could choose a different model
# NOTE: the 2nd argument to the fit method is the confounding variables, not the target values
resid.fit(train_X, train_C)    # fitting on train_X and train_C

# NOTE: the 2nd argument to the transform method is also the confounding variables, not the target values
deconf_train_X = resid.transform(train_X, train_C)   # deconfounding train_X
deconf_test_X  = resid.transform(test_X, test_C)     #      and then test_X

That’s it.

You can also replace Residualize() with Augment() (which simply concatenates the covariate values to X horizontally), or with DummyDeconfounding(), which does nothing but return X unchanged.
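As a quick sanity check of what Augment() produces (continuing the sketch above; the shapes are illustrative):

from confounds import Augment

aug = Augment()
aug.fit(train_X, train_C)
aug_train_X = aug.transform(train_X, train_C)
print(aug_train_X.shape)   # (n_train, p + c): covariate columns appended to the features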

Here is a broader example showing how the deconfounding classes can be used in a full cross-validation loop:

for cv_iter in range(num_CV_iters):    # cv_iter avoids shadowing the builtin iter()

    train_data, test_data, train_targets, original_test_targets = split_ds(dataset, cv_iter)
    # split_ds() could be as simple as train_test_split() from sklearn, when you are
    #  using a simple flat numerical-only numpy-array based feature set.
    #  I discourage ndarray in favour of pyradigm, which is ideal for linked tables of mixed data types
    train_X, train_C, test_X, test_C = get_covariates(train_data, test_data)

    resid = Residualize()
    resid.fit(train_X, train_C)
    # NOTE: the 2nd argument here is the covariates, not y (targets)
    deconf_train_X = resid.transform(train_X, train_C)
    deconf_test_X  = resid.transform(test_X, test_C)
    # make sure you transform both the train and test sets with the same fitted estimator!

    # optimize_pipeline() is a wrapper containing calls to GridSearchCV() etc.;
    # pipeline here is the predictive model, a sequence of sklearn estimators
    best_pipeline = optimize_pipeline(pipeline, deconf_train_X, train_targets)
    predicted_test_targets = best_pipeline.predict(deconf_test_X)
    results[cv_iter] = evaluate_performance(predicted_test_targets, original_test_targets)
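If the helper functions above feel too abstract, here is one fully self-contained way to write the same loop using plain sklearn pieces; the random-forest model, the parameter grid and the targets variable y are illustrative assumptions, not part of the confounds API:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from confounds import Residualize

num_CV_iters = 10
results = np.empty(num_CV_iters)

for cv_iter in range(num_CV_iters):
    # split features, confounds and targets together to keep rows aligned
    train_X, test_X, train_C, test_C, train_y, test_y = train_test_split(
        X, C, y, test_size=0.3, random_state=cv_iter)

    resid = Residualize()
    resid.fit(train_X, train_C)                          # fit on the training split only
    deconf_train_X = resid.transform(train_X, train_C)
    deconf_test_X = resid.transform(test_X, test_C)      # same fitted estimator for both splits

    # GridSearchCV stands in for optimize_pipeline(); any predictive model works here
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={'n_estimators': [100, 300]})
    grid.fit(deconf_train_X, train_y)
    results[cv_iter] = grid.score(deconf_test_X, test_y)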

If the above example is confusing or you disagree with it, please let me know. I appreciate your help in improving the methods and documentation. Thanks!

Note

Only the Residualize(model='linear'), Augment() and DummyDeconfounding() methods are considered usable and stable. The rest are still under development and subject to change without notice.

Warning

Scikit-learn does not offer the ability to pass covariates (or any other variables) besides X and y to its estimators. So, although the classes in this confounds library act as scikit-learn estimators (they pass the estimator checks), they should NOT be used within a scikit-learn Pipeline, e.g. passed to GridSearchCV or similar convenience classes for optimization purposes.
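In practice, this simply means keeping the deconfounding step outside the pipeline: deconfound first, then hand the result to sklearn's machinery. A sketch of the safe pattern (the scaler/SVC pipeline and the parameter grid are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from confounds import Residualize

# deconfound manually, outside any sklearn Pipeline ...
resid = Residualize()
resid.fit(train_X, train_C)
deconf_train_X = resid.transform(train_X, train_C)
deconf_test_X = resid.transform(test_X, test_C)

# ... then run the usual sklearn optimization on the already-deconfounded features
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]})
grid.fit(deconf_train_X, train_targets)
predictions = grid.predict(deconf_test_X)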

Note

I highly recommend using the pyradigm data structure to manage the features and covariates of a given dataset, via its ClassificationDataset and RegressionDataset classes. The latest version of pyradigm also provides the MultiDatasetClassify and MultiDatasetRegress classes, which make this data management even easier when comparing multiple modalities/feature-sets on the same sample (same subjects with the same covariates). Check it out at https://raamana.github.io/pyradigm/.