API for Regression dataset¶
A tutorial-like presentation is available at Features and usage, which teaches the overall usage pattern and how to use the following API for the RegressionDataset.
- 
class pyradigm.regress.RegressionDataset(dataset_path=None, in_dataset=None, data=None, targets=None, description='', feature_names=None, dtype=<class 'numpy.float64'>, allow_nan_inf=False)[source]¶
Bases: pyradigm.base.BaseDataset
RegressionDataset where target values are numeric (such as float or int)
- Attributes
 attr – Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
attr_dtype – Direct access to attr dtype
data – Data in its original dict form.
dataset_attr – Returns dataset attributes
description – Text description (header) that can be set by user.
dtype – Returns the data type of the features in the Dataset
feature_names – Returns the feature names as a numpy array of strings.
num_features – Number of features in each samplet.
num_samplets – Number of samplets in the entire dataset.
num_targets – Total number of unique target values in the dataset.
samplet_ids – Samplet identifiers (strings) forming the basis of the Dataset
shape – Returns the pythonic shape of the dataset: num_samplets x num_features.
target_set – Set of unique target values in the dataset.
target_sizes – Returns the sizes of the different targets in a Counter object.
targets – Returns the array of targets for all the samplets.
Methods
add_attr(self, attr_name, samplet_id, attr_value) – Method to add samplet-wise attributes to the dataset.
add_dataset_attr(self, attr_name, attr_value) – Adds dataset-wide attributes (common to all samplets).
add_samplet(self, samplet_id, features, target) – Adds a new samplet to the dataset with its features and target.
attr_summary(self) – Simple summary of the attributes currently stored in the dataset
data_and_targets(self) – Dataset features and targets in a matrix form for learning.
del_attr(self, attr_name[, samplet_ids]) – Method to delete the specified attribute for a list of samplet IDs
del_samplet(self, sample_id) – Method to remove a samplet from the dataset.
extend(self, other) – Method to extend the dataset vertically (add samplets from another dataset).
from_arff(arff_path[, encode_nonnumeric]) – Loads a given dataset saved in Weka’s ARFF format.
get(self, item[, not_found_value]) – Method like dict.get(), which can return a specified value if the key is not found
get_attr(self, attr_name[, samplet_ids]) – Method to retrieve the specified attribute for a list of samplet IDs
get_data_matrix_in_order(self, subset_ids) – Returns a numpy array of features, rows in the same order as subset_ids
get_feature_subset(self, subset_idx) – Returns the subset of features indexed numerically.
get_subset(self, subset_ids) – Returns a smaller dataset identified by their keys/samplet IDs.
get_target(self, target_id) – Returns a smaller dataset of samplets with the requested target.
glance(self[, nitems]) – Quick and partial glance of the data matrix.
random_subset(self[, perc]) – Returns a random sub-dataset of the specified size (by percentage).
random_subset_ids(self[, perc]) – Returns a random subset of samplet ids, of size percentage*num_samplets
random_subset_ids_by_count(self[, count]) – Returns a random subset of samplet ids, of the given count
samplet_ids_with_target(self, target_id) – Returns a list of samplet ids with a given target_id
save(self, file_path[, …]) – Method to save the dataset to disk.
summarize(self) – Summary of targets: unique values and their sizes
train_test_split_ids(self[, train_perc, count]) – Returns two disjoint sets of samplet ids for use in cross-validation.
transform(self, func[, func_description]) – Applies a given function to the features of each subject
- 
add_attr(self, attr_name, samplet_id, attr_value)¶ Method to add samplet-wise attributes to the dataset.
Note: attribute values get overwritten by default, if they already exist for a given samplet
- Parameters
 attr_name (str) – Name of the attribute to be added.
samplet_id (str or list of str) – Identifier(s) for the samplet
attr_value (generic or list of generic) – Value(s) of the attribute. Any data type allowed, although it is strongly recommended to keep the data type the same, or compatible, across all the samplets in this dataset.
- 
add_dataset_attr(self, attr_name, attr_value)¶ Adds dataset-wide attributes (common to all samplets).
This is a great way to add metadata (such as version, processing details, or anything else common to all samplets). This is better than encoding the info into the description field, as this allows programmatic retrieval.
- Parameters
 attr_name (str) – Identifier for the attribute
attr_value (object) – Value of the attribute (any datatype)
- 
add_samplet(self, samplet_id, features, target, overwrite=False, feature_names=None, attr_names=None, attr_values=None)¶ Adds a new samplet to the dataset with its features and target.
This is the preferred way to construct the dataset.
- Parameters
 samplet_id (str) – An identifier that uniquely identifies this samplet.
features (list, ndarray) – The features for this samplet
target (int, float) – The target value for this samplet
overwrite (bool) – If True, allows overwriting the features of an existing samplet ID. Default: False.
feature_names (list) – The names for each feature. Assumed to be in the same order as features
attr_names (str or list of str) – Name of the attribute to be added for this samplet
attr_values (generic or list of generic) – Value of the attribute. Any data type allowed as long as they are compatible across all the samplets in this dataset.
- Raises
 ValueError – If samplet_id is already in the Dataset (and overwrite=False), if the dimensionality of the current samplet does not match that of the existing samplets, or if feature_names do not match the existing names
TypeError – If samplet to be added is of different data type compared to existing samplets.
- 
property 
attr¶ Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
- 
property 
attr_dtype¶ Direct access to attr dtype
- 
attr_summary(self)¶ Simple summary of attributes currently stored in the dataset
- 
property 
data¶ Data in its original dict form.
- 
data_and_targets(self)¶ Dataset features and targets in a matrix form for learning.
Also returns samplet_ids in the same order.
- Returns
 data_matrix (ndarray) – 2D array of shape [num_samplets, num_features] with features corresponding row-wise to samplet_ids
targets (ndarray) – Array of numeric targets for each samplet corresponding row-wise to samplet_ids
samplet_ids (list) – List of samplet ids
- 
property 
dataset_attr¶ Returns dataset attributes
- 
del_attr(self, attr_name, samplet_ids='all')¶ Method to delete the specified attribute for a list of samplet IDs
- Parameters
 attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is to be deleted. Default: ‘all’, in which case all the existing samplet IDs will be used.
- Returns
 - Return type
 None
- Raises
 Warning – If attr_name was never set for this dataset
- 
del_samplet(self, sample_id)¶ Method to remove a samplet from the dataset.
- Parameters
 sample_id (str) – samplet id to be removed.
- Raises
 UserWarning – If samplet id to delete was not found in the dataset.
- 
property 
description¶ Text description (header) that can be set by user.
- 
property 
dtype¶ Returns the data type of the features in the Dataset
- 
extend(self, other)¶ Method to extend the dataset vertically (add samplets from another dataset).
- Parameters
 other (Dataset) – second dataset to be combined with the current (different samplets, but same dimensionality)
- Raises
 TypeError – If input is not a Dataset.
- 
property 
feature_names¶ Returns the feature names as a numpy array of strings.
- 
classmethod 
from_arff(arff_path, encode_nonnumeric=False)¶ Loads a given dataset saved in Weka’s ARFF format.
- Parameters
 arff_path (str) – Path to a dataset saved in Weka’s ARFF file format.
encode_nonnumeric (bool) – Flag to specify whether to encode non-numeric (categorical, nominal or string) features to numeric values. Currently used only when importing ARFF files. It is usually better to encode your data at the source, and then import it. Use with caution!
- 
get(self, item, not_found_value=None)¶ Method like dict.get(), which can return a specified value if the key is not found
- 
get_attr(self, attr_name, samplet_ids='all')¶ Method to retrieve specified attribute for a list of samplet IDs
- Parameters
 attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.
- Returns
 attr_values – Attribute values for the list of samplet IDs
- Return type
 ndarray
- Raises
 KeyError – If attr_name was never set for dataset, or any of the samplets requested
- 
get_data_matrix_in_order(self, subset_ids)¶ Returns a numpy array of features, rows in the same order as subset_ids
- Parameters
 subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
 matrix – Matrix of features, for each id in subset_ids, in order.
- Return type
 ndarray
- 
get_feature_subset(self, subset_idx)¶ Returns the subset of features indexed numerically.
- Parameters
 subset_idx (list, ndarray) – List of indices to features to be returned
- Returns
 Dataset – with subset of features requested.
- Return type
 Dataset
- Raises
 UnboundLocalError – If input indices are out of bounds for the dataset.
- 
get_subset(self, subset_ids)¶ Returns a smaller dataset identified by their keys/samplet IDs.
- Parameters
 subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
 sub-dataset – sub-dataset containing only requested samplet IDs.
- Return type
 Dataset
- 
get_target(self, target_id)[source]¶ Returns a smaller dataset of samplets with the requested target.
- Parameters
 target_id (str or list) – Value(s) of the target(s) to be returned.
- Returns
 With the subset of samplets having the requested target value(s).
- Return type
 RegressionDataset
- Raises
 ValueError – If one or more of the requested targets do not exist in this dataset, or if the specified id is empty or None
- 
glance(self, nitems=5)¶ Quick and partial glance of the data matrix.
- Parameters
 nitems (int) – Number of items to glance from the dataset. Default : 5
- Returns
 - Return type
 dict
- 
property 
num_features¶ number of features in each samplet.
- 
property 
num_samplets¶ number of samplets in the entire dataset.
- 
property 
num_targets¶ Total number of unique target values in the dataset.
- 
random_subset(self, perc=0.5)[source]¶ Returns a random sub-dataset of the specified size (by percentage).
- Parameters
 perc (float) – Fraction of samplets to be taken from the dataset
- Returns
 subdataset – random sub-dataset of the specified size.
- Return type
 RegressionDataset
- 
random_subset_ids(self, perc=0.5)[source]¶ Returns a random subset of sample ids, of size percentage*num_samplets
- Parameters
 perc (float) – Fraction of the dataset
- Returns
 subset – List of samplet ids
- Return type
 list
- Raises
 ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
- 
random_subset_ids_by_count(self, count=5)[source]¶ Returns a random subset of samplet ids, of the given count
- Parameters
 count (int) – Exact number of samplet ids to return
- Returns
 subset – List of samplet ids
- Return type
 list
- Raises
 ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
- 
property 
samplet_ids¶ Samplet identifiers (strings) forming the basis of the Dataset
- 
samplet_ids_with_target(self, target_id)[source]¶ Returns a list of samplet ids with a given target_id
- Parameters
 target_id (float) – Value of the target to query
- Returns
 subset_ids – List of samplet ids belonging to a given target_id.
- Return type
 list
- 
save(self, file_path, allow_constant_features=False, allow_constant_features_across_samplets=False)¶ Method to save the dataset to disk.
- Parameters
 file_path (str) – File path to save the current dataset to
allow_constant_features (bool) – Flag indicating whether to allow all the feature values for a given samplet to be identical (e.g. all zeros). When False, this flag helps catch unusual, and likely incorrect, situations where all the features of a samplet share a single constant value, which is often indicative of a bug. When constant values are intended, pass True for this flag.
allow_constant_features_across_samplets (bool) – While allow_constant_features looks at one samplet at a time (across features, i.e. along the rows of the N x p feature matrix X), this flag checks for constant values across all samplets for a given feature (along the columns). When similar values are expected across all samplets, pass True to this flag.
- Raises
 IOError – If saving to disk is not successful.
- 
property 
shape¶ Returns the pythonic shape of the dataset: num_samplets x num_features.
- 
summarize(self)[source]¶ Summary of targets: unique values and their sizes
- Returns
 target_set (list) – List of unique targets in the dataset
target_sizes (list) – Size of each target (number of samplets)
- 
property 
target_set¶ Set of unique target values in the dataset.
- 
property 
target_sizes¶ Returns the sizes of the different targets as a Counter object.
- 
property 
targets¶ Returns the array of targets for all the samplets.
- 
train_test_split_ids(self, train_perc=None, count=None)[source]¶ Returns two disjoint sets of samplet ids for use in cross-validation.
Offers two ways to specify the sizes: fraction or count. Only one of the two can be used at a time.
- Parameters
 train_perc (float) – fraction of samplets to build the training subset, between 0 and 1
count (int) – exact count of samplets to build the training subset, between 1 and N-1
- Returns
 train_set (list) – List of ids in the training set.
test_set (list) – List of ids in the test set.
- Raises
 ValueError – If the fraction is outside the open interval (0, 1), if the count is out of the valid range, if an unrecognized format is provided for the input args, or if the selection results in an empty subset for either the train or test set.
- 
transform(self, func, func_description=None)¶ Applies a given function to the features of each subject and returns a new dataset with other info unchanged.
- Parameters
 func (callable) –
A callable that takes in a single ndarray and returns a single ndarray. The transformed dimensionality must be the same for all subjects.
If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.
func_description (str, optional) – Human readable description of the given function.
- Returns
 xfm_ds – with features obtained from subject-wise transform
- Return type
 Dataset
- Raises
 TypeError – If given func is not a callable
ValueError – If transformation of any of the subjects’ features raises an exception.
Examples
A simple example below shows the application of transform() to create a new pyradigm dataset:
from pyradigm import ClassificationDataset as ClfDataset

thickness = ClfDataset(dataset_path='ADNI_thickness.csv')
# get_pcg_median is a function that computes the median value in PCG
pcg_thickness = thickness.transform(func=get_pcg_median,
                                    func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median,
                                     func_description='median per subject')
A complex example using a function that takes more than one argument:
from functools import partial
import hiwenet

thickness = ClfDataset(dataset_path='ADNI_thickness.csv')
roi_membership = read_roi_membership()
calc_hist_dist = partial(hiwenet, groups=roi_membership)
thickness_hiwenet = thickness.transform(func=calc_hist_dist,
                                        func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median,
                                                 func_description='median per subject')