Usage
To use the confounds library in a project, import the necessary classes or functions, e.g.:

    from confounds import Residualize, Augment, DummyDeconfounding
Example usage
Let's say the features from your dataset are contained in X (of size N x p), and the confounds/covariates for those samples are stored in C (of size N x c), with X and C having row-wise correspondence for each samplet. X is often split into train_X and test_X inside the cross-validation loop, and the corresponding splits of C are train_C and test_C. The estimator classes (e.g. Residualize()) from the confounds library are then used in the following easy manner:
    resid = Residualize()   # instantiation; you could choose a different model here
    # NOTE: the 2nd argument to both fit() and transform() is the set of
    # confounding variables C, not the target values y
    resid.fit(train_X, train_C)                         # training on X and C
    deconf_train_X = resid.transform(train_X, train_C)  # deconfounding train_X
    deconf_test_X = resid.transform(test_X, test_C)     # and then test_X
That’s it.
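For concreteness, here is a minimal self-contained sketch of the same steps on synthetic data; the data shapes, the random-number generation and the use of train_test_split() are illustrative assumptions, not requirements of the confounds library:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from confounds import Residualize

    rng = np.random.default_rng(seed=42)
    N, p, c = 200, 10, 2
    C = rng.standard_normal((N, c))  # confounds, N x c
    # features contaminated by a linear effect of the confounds, N x p
    X = rng.standard_normal((N, p)) + C @ rng.standard_normal((c, p))

    # split X and C together so their row-wise correspondence is preserved
    train_X, test_X, train_C, test_C = train_test_split(X, C, test_size=0.3,
                                                        random_state=42)

    resid = Residualize()
    resid.fit(train_X, train_C)
    deconf_train_X = resid.transform(train_X, train_C)
    deconf_test_X = resid.transform(test_X, test_C)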
You can also replace Residualize() with Augment() (which simply concatenates the covariate values to X horizontally), or with DummyDeconfounding(), which does nothing but return X back.
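As a quick illustration of the difference between the two, continuing with the synthetic train_X and train_C from the sketch above (and assuming both classes share the same fit/transform signature as Residualize):

    from confounds import Augment, DummyDeconfounding

    aug = Augment()
    aug.fit(train_X, train_C)
    aug_train_X = aug.transform(train_X, train_C)
    print(aug_train_X.shape)  # (n_train, p + c): covariate columns appended to X

    dummy = DummyDeconfounding()
    dummy.fit(train_X, train_C)
    dummy_train_X = dummy.transform(train_X, train_C)
    print(np.allclose(dummy_train_X, train_X))  # True: X is returned unchanged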
Here is a broader example showing how the deconfounding classes can be used in a full cross-validation loop:
    for cv_iter in range(num_CV_iters):
        # split_ds() could be as simple as train_test_split() from sklearn, when
        # you are using a simple flat numerical-only numpy-array based feature set.
        # I discourage ndarray in favour of pyradigm, which is ideal for linked
        # tables of mixed data types.
        train_data, test_data, train_targets, original_test_targets = split_ds(dataset, cv_iter)
        train_X, train_C, test_X, test_C = get_covariates(train_data, test_data)

        resid = Residualize()
        resid.fit(train_X, train_C)
        # NOTE: the 2nd argument here is the set of covariates C, not y (targets)
        deconf_train_X = resid.transform(train_X, train_C)
        deconf_test_X = resid.transform(test_X, test_C)
        # make sure you transform both the train and test sets with the same
        # (already fitted) estimator!

        # optimize_pipeline() is a wrapper containing calls to GridSearchCV() etc.;
        # pipeline here is the predictive model, a sequence of sklearn estimators
        best_pipeline = optimize_pipeline(pipeline, deconf_train_X, train_targets)
        predicted_test_targets = best_pipeline.predict(deconf_test_X)

        results[cv_iter] = evaluate_performance(predicted_test_targets, original_test_targets)
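If the placeholder helpers above (split_ds(), get_covariates(), optimize_pipeline()) are too abstract, here is one possible fully concrete realization with plain numpy arrays and sklearn's KFold; the synthetic data, the Ridge model and all helper-free names here are purely illustrative assumptions, not part of the confounds API:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from confounds import Residualize

    rng = np.random.default_rng(seed=0)
    N, p, c = 200, 10, 2
    C = rng.standard_normal((N, c))
    X = rng.standard_normal((N, p)) + C @ rng.standard_normal((c, p))
    y = X[:, 0] + rng.standard_normal(N)    # illustrative continuous target

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        train_X, test_X = X[train_idx], X[test_idx]
        train_C, test_C = C[train_idx], C[test_idx]
        train_y, test_y = y[train_idx], y[test_idx]

        resid = Residualize()
        resid.fit(train_X, train_C)          # fit on the training split only
        deconf_train_X = resid.transform(train_X, train_C)
        deconf_test_X = resid.transform(test_X, test_C)  # same fitted estimator

        model = Ridge().fit(deconf_train_X, train_y)
        scores.append(r2_score(test_y, model.predict(deconf_test_X)))

    print(np.mean(scores))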
If the above example is confusing, or if you disagree with it, please let me know. I appreciate your help in improving the methods and documentation. Thanks much!
Note

Only the Residualize(model='linear'), Augment() and DummyDeconfounding() methods are considered usable and stable. The rest are yet to be developed, and are subject to change without notice.
Warning

Scikit-learn does not offer the ability to pass covariates (or any other variables) besides X and y to its estimators. So, although the classes from this confounds library act as scikit-learn estimators (they pass the estimator checks), they should NOT be used within a scikit-learn pipeline interface, e.g. passed to GridSearchCV or similar convenience classes for optimization purposes.
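The safe pattern, therefore, is to perform the deconfounding outside the sklearn machinery and only then hand the cleaned features over. A minimal sketch; the SVR estimator, its parameter grid, and the train_X/train_C/train_y variables (reusing the splits from the earlier examples) are illustrative assumptions:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR
    from confounds import Residualize

    resid = Residualize()
    resid.fit(train_X, train_C)             # deconfound OUTSIDE any sklearn pipeline
    deconf_train_X = resid.transform(train_X, train_C)

    grid = GridSearchCV(SVR(), param_grid={'C': [0.1, 1.0, 10.0]})
    grid.fit(deconf_train_X, train_y)       # only X and y ever reach sklearn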
Note

I highly recommend using the pyradigm data structure to manage the features and covariates of a given dataset, via its ClassificationDataset and RegressionDataset classes. The latest version of pyradigm also provides the MultiDatasetClassify and MultiDatasetRegress classes, which make this data management even easier when comparing multiple modalities/feature-sets on the same sample (same subjects with same covariates). Check it out at https://raamana.github.io/pyradigm/.