API for Regression dataset¶
A tutorial-like presentation is available at Features and usage, which teaches the overall usage pattern and how to use the following API for the RegressionDataset.
-
class
pyradigm.regress.RegressionDataset
(dataset_path=None, in_dataset=None, data=None, targets=None, description='', feature_names=None, dtype=<class 'numpy.float64'>, allow_nan_inf=False)[source]¶ Bases: pyradigm.base.BaseDataset
RegressionDataset where target values are numeric (such as float or int)
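As a quick orientation, here is a minimal sketch of building a small RegressionDataset from scratch, using the empty constructor together with add_samplet() (documented further down this page); the samplet ids, feature dimensionality and target values are made up for illustration:

from pyradigm.regress import RegressionDataset
import numpy as np

rng = np.random.default_rng(0)

ds = RegressionDataset(description='toy regression dataset')
for sid in ('s01', 's02', 's03', 's04'):
    # ten random features per samplet, with a numeric target (e.g. age)
    ds.add_samplet(samplet_id=sid,
                   features=rng.standard_normal(10),
                   target=float(rng.uniform(50, 80)))

print(ds.num_samplets, ds.num_features)   # 4 10

Several of the shorter sketches further down this page continue from this toy ds.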
- Attributes
attr
Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
attr_dtype
Direct access to attr dtype
data
data in its original dict form.
dataset_attr
Returns dataset attributes
description
Text description (header) that can be set by user.
dtype
Returns the data type of the features in the Dataset
feature_names
Returns the feature names as a numpy array of strings.
num_features
number of features in each samplet.
num_samplets
number of samplets in the entire dataset.
num_targets
Total number of unique target values in the dataset.
samplet_ids
Sample identifiers (strings) forming the basis of Dataset
shape
Returns the pythonic shape of the dataset: num_samplets x num_features.
target_set
Set of unique target values in the dataset.
target_sizes
Returns the number of samplets for each unique target value, as a Counter object.
targets
Returns the array of targets for all the samplets.
Methods
add_attr
(self, attr_name, samplet_id, attr_value)Method to add samplet-wise attributes to the dataset.
add_dataset_attr
(self, attr_name, attr_value)Adds dataset-wide attributes (common to all samplets).
add_samplet
(self, samplet_id, features, target)Adds a new samplet to the dataset with its features and target.
attr_summary
(self)Simple summary of attributes currently stored in the dataset
data_and_targets
(self)Dataset features and targets in a matrix form for learning.
del_attr
(self, attr_name[, samplet_ids])Method to delete the specified attribute for a list of samplet IDs
del_samplet
(self, sample_id)Method to remove a samplet from the dataset.
extend
(self, other)Method to extend the dataset vertically (add samplets from another dataset).
from_arff
(arff_path[, encode_nonnumeric])Loads a given dataset saved in Weka’s ARFF format.
get
(self, item[, not_found_value])Method like dict.get(), which can return a specified value if the key is not found
get_attr
(self, attr_name[, samplet_ids])Method to retrieve specified attribute for a list of samplet IDs
get_data_matrix_in_order
(self, subset_ids)Returns a numpy array of features, rows in the same order as subset_ids
get_feature_subset
(self, subset_idx)Returns the subset of features indexed numerically.
get_subset
(self, subset_ids)Returns a smaller dataset identified by their keys/samplet IDs.
get_target
(self, target_id)Returns a smaller dataset of samplets with the requested target.
glance
(self[, nitems])Quick and partial glance of the data matrix.
random_subset
(self[, perc])Returns a random sub-dataset (of specified size by percentage) within each class.
random_subset_ids
(self[, perc])Returns a random subset of sample ids, of size percentage*num_samplets
random_subset_ids_by_count
(self[, count])Returns a random subset of sample ids, of the specified count
samplet_ids_with_target
(self, target_id)Returns a list of samplet ids with a given target_id
save
(self, file_path[, …])Method to save the dataset to disk.
summarize
(self)Summary of targets: unique values and their sizes
train_test_split_ids
(self[, train_perc, count])Returns two disjoint sets of samplet ids for use in cross-validation.
transform
(self, func[, func_description])Applies a given function to the features of each subject
-
add_attr
(self, attr_name, samplet_id, attr_value)¶ Method to add samplet-wise attributes to the dataset.
Note: attribute values get overwritten by default, if they already exist for a given samplet
- Parameters
attr_name (str) – Name of the attribute to be added.
samplet_id (str or list of str) – Identifier(s) for the samplet
attr_value (generic or list of generic) – Value(s) of the attribute. Any data type allowed, although it is strongly recommended to keep the data type the same, or compatible, across all the samplets in this dataset.
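For instance, continuing from the toy ds built in the sketch near the top of this page (the attribute name and values here are made up):

# one samplet at a time
ds.add_attr('scanner_site', 's01', 'site_A')
# or several samplets at once, with one value per id
ds.add_attr('scanner_site', ['s02', 's03', 's04'], ['site_A', 'site_B', 'site_B'])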
-
add_dataset_attr
(self, attr_name, attr_value)¶ Adds dataset-wide attributes (common to all samplets).
This is a great way to add metadata (such as version, processing details, or anything else common to all samplets). It is better than encoding the info into the description field, as it allows programmatic retrieval.
- Parameters
attr_name (str) – Identifier for the attribute
attr_value (object) – Value of the attribute (any datatype)
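For example, continuing with the toy ds above, dataset-wide metadata could be attached as follows (the keys and values are purely illustrative):

ds.add_dataset_attr('preprocessing', 'FreeSurfer v6.0, fsaverage')
ds.add_dataset_attr('atlas_version', '0.1.0')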
-
add_samplet
(self, samplet_id, features, target, overwrite=False, feature_names=None, attr_names=None, attr_values=None)¶ Adds a new samplet to the dataset with its features and target.
This is the preferred way to construct the dataset.
- Parameters
samplet_id (str) – An identifier that uniquely identifies this samplet.
features (list, ndarray) – The features for this samplet
target (int, float) – The target value for this samplet
overwrite (bool) – If True, allows overwriting the features for an existing samplet ID. Default : False.
feature_names (list) – The names for each feature. Assumed to be in the same order as features
attr_names (str or list of str) – Name of the attribute to be added for this samplet
attr_values (generic or list of generic) – Value of the attribute. Any data type allowed as long as they are compatible across all the samplets in this dataset.
- Raises
ValueError – If samplet_id is already in the Dataset (and overwrite=False), or if the dimensionality of the current samplet does not match that of the existing samplets, or if feature_names do not match the existing names
TypeError – If samplet to be added is of different data type compared to existing samplets.
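As a further sketch, feature names can be supplied along with the samplets (reusing the import from the sketch near the top of this page; the names and values below are made up), assuming all samplets keep the same names and dimensionality:

ds2 = RegressionDataset(description='toy dataset with named features')
feat_names = ['roi{}'.format(idx) for idx in range(3)]   # hypothetical feature names

ds2.add_samplet('s01', [1.1, 2.2, 3.3], target=47.5, feature_names=feat_names)
ds2.add_samplet('s02', [0.9, 2.0, 3.8], target=52.0, feature_names=feat_names)
# re-adding an existing id requires overwrite=True, otherwise ValueError is raised
ds2.add_samplet('s01', [1.0, 2.1, 3.5], target=47.5,
                feature_names=feat_names, overwrite=True)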
-
property
attr
¶ Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
-
property
attr_dtype
¶ Direct access to attr dtype
-
attr_summary
(self)¶ Simple summary of attributes currently stored in the dataset
-
property
data
¶ data in its original dict form.
-
data_and_targets
(self)¶ Dataset features and targets in a matrix form for learning.
Also returns samplet_ids in the same order.
- Returns
data_matrix (ndarray) – 2D array of shape [num_samplets, num_features] with features corresponding row-wise to samplet_ids
targets (ndarray) – Array of numeric targets for each samplet corresponding row-wise to samplet_ids
samplet_ids (list) – List of samplet ids
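Continuing with the toy ds above, a short sketch of pulling out arrays suitable for feeding into an estimator:

X, y, ids = ds.data_and_targets()
print(X.shape, y.shape, len(ids))   # e.g. (4, 10) (4,) 4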
-
property
dataset_attr
¶ Returns dataset attributes
-
del_attr
(self, attr_name, samplet_ids='all')¶ Method to delete the specified attribute for a list of samplet IDs
- Parameters
attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.
- Returns
- Return type
None
- Raises
Warning – If attr_name was never set for this dataset
-
del_samplet
(self, sample_id)¶ Method to remove a samplet from the dataset.
- Parameters
sample_id (str) – samplet id to be removed.
- Raises
UserWarning – If samplet id to delete was not found in the dataset.
-
property
description
¶ Text description (header) that can be set by user.
-
property
dtype
¶ Returns the data type of the features in the Dataset
-
extend
(self, other)¶ Method to extend the dataset vertically (add samplets from another dataset).
- Parameters
other (Dataset) – second dataset to be combined with the current (different samplets, but same dimensionality)
- Raises
TypeError – If the input is not a Dataset.
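A brief sketch, assuming extend() modifies the current dataset in place (like list.extend) and that both datasets share the same dimensionality; it continues from the toy ds and rng defined near the top of this page:

more = RegressionDataset(description='a second toy dataset')
more.add_samplet('s10', rng.standard_normal(10), target=71.2)

ds.extend(more)   # ds now also contains samplet 's10'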
-
property
feature_names
¶ Returns the feature names as a numpy array of strings.
-
classmethod
from_arff
(arff_path, encode_nonnumeric=False)¶ Loads a given dataset saved in Weka’s ARFF format.
- Parameters
arff_path (str) – Path to a dataset saved in Weka’s ARFF file format.
encode_nonnumeric (bool) – Flag to specify whether to encode non-numeric (categorical, nominal or string) features to numeric values. Currently used only when importing ARFF files. It is usually better to encode your data at the source and then import it. Use with caution!
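A hedged one-liner, assuming a Weka ARFF file with numeric features exists at the (illustrative) path below:

arff_ds = RegressionDataset.from_arff('/path/to/numeric_dataset.arff')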
-
get
(self, item, not_found_value=None)¶ Method like dict.get(), which can return a specified value if the key is not found
-
get_attr
(self, attr_name, samplet_ids='all')¶ Method to retrieve specified attribute for a list of samplet IDs
- Parameters
attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.
- Returns
attr_values – Attribute values for the list of samplet IDs
- Return type
ndarray
- Raises
KeyError – If attr_name was never set for dataset, or any of the samplets requested
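Continuing from the add_attr() sketch above, attribute values can be read back as follows:

sites = ds.get_attr('scanner_site', ['s01', 's02', 's03', 's04'])
# ds.get_attr('scanner_site') queries all samplet ids by default, and raises
# KeyError if the attribute was never set for any of them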
-
get_data_matrix_in_order
(self, subset_ids)¶ Returns a numpy array of features, rows in the same order as subset_ids
- Parameters
subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
matrix – Matrix of features, for each id in subset_ids, in order.
- Return type
ndarray
-
get_feature_subset
(self, subset_idx)¶ Returns the subset of features indexed numerically.
- Parameters
subset_idx (list, ndarray) – List of indices to features to be returned
- Returns
Dataset – with subset of features requested.
- Return type
Dataset
- Raises
UnboundLocalError – If input indices are out of bounds for the dataset.
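For example, keeping only the first three feature columns of the toy ds above (the indices are arbitrary):

first_three = ds.get_feature_subset([0, 1, 2])
print(first_three.num_features)   # 3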
-
get_subset
(self, subset_ids)¶ Returns a smaller dataset identified by their keys/samplet IDs.
- Parameters
subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
sub-dataset – sub-dataset containing only requested samplet IDs.
- Return type
Dataset
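For example, restricting the toy ds above to two samplets:

pair = ds.get_subset(['s01', 's03'])
print(pair.num_samplets)   # 2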
-
get_target
(self, target_id)[source]¶ Returns a smaller dataset of samplets with the requested target.
- Parameters
target_id (str or list) – Target value(s) of the samplets to be returned.
- Returns
Sub-dataset with the samplets having the given target value(s).
- Return type
- Raises
ValueError – If one or more of the requested target values do not exist in this dataset, or if the specified id is empty or None
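A small sketch on the ds2 built in the add_samplet() example above, assuming exact matching of the numeric target value (beware of floating-point precision when targets are floats):

matched = ds2.get_target(47.5)   # samplets whose target equals 47.5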
-
glance
(self, nitems=5)¶ Quick and partial glance of the data matrix.
- Parameters
nitems (int) – Number of items to glance from the dataset. Default : 5
- Returns
- Return type
dict
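For example, on the toy ds above:

peek = ds.glance(nitems=2)   # a partial view of the data, limited to two samplets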
-
property
num_features
¶ number of features in each samplet.
-
property
num_samplets
¶ number of samplets in the entire dataset.
-
property
num_targets
¶ Total number of unique target values in the dataset.
-
random_subset
(self, perc=0.5)[source]¶ Returns a random sub-dataset (of specified size by percentage) within each class.
- Parameters
perc (float) – Fraction of samples to be taken from the dataset
- Returns
subdataset – random sub-dataset of specified size.
- Return type
-
random_subset_ids
(self, perc=0.5)[source]¶ Returns a random subset of sample ids, of size percentage*num_samplets
- Parameters
perc (float) – Fraction of the dataset
- Returns
subset – List of samplet ids
- Return type
list
- Raises
ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
-
random_subset_ids_by_count
(self, count=5)[source]¶ Returns a random subset of sample ids, of the specified count
- Parameters
count (int) – Number of samplet ids to be returned
- Returns
subset – List of samplet ids
- Return type
list
- Raises
ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
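A short sketch contrasting the two random-subset helpers on the toy ds above:

half_ids = ds.random_subset_ids(perc=0.5)             # roughly half of the samplet ids
three_ids = ds.random_subset_ids_by_count(count=3)    # three samplet ids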
-
property
samplet_ids
¶ Sample identifiers (strings) forming the basis of Dataset
-
samplet_ids_with_target
(self, target_id)[source]¶ Returns a list of samplet ids with a given target_id
- Parameters
target_id (float) – Value of the target to query
- Returns
subset_ids – List of samplet ids belonging to a given target_id.
- Return type
list
-
save
(self, file_path, allow_constant_features=False, allow_constant_features_across_samplets=False)¶ Method to save the dataset to disk.
- Parameters
file_path (str) – File path to save the current dataset to
allow_constant_features (bool) – Flag indicating whether to allow all the feature values for a samplet to be identical (e.g. all zeros). When False, this flag catches the unusual, and likely incorrect, situation where all features for a given samplet are zeros or some other constant value; in real-world data, different features typically have different values, so a constant row usually indicates a bug. When constant values are intended, pass True for this flag.
allow_constant_features_across_samplets (bool) – While the previous flag allow_constant_features looks at one samplet at a time (across features; along rows in feature matrix X: N x p), this flag checks for constant values across all samplets for a given feature (along the columns). When similar values are expected across all samplets, pass True to this flag.
- Raises
IOError – If saving to disk is not successful.
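A sketch of a save/reload round trip with the toy ds above, assuming the dataset_path argument of the constructor reloads a previously saved dataset (the file path is illustrative):

out_path = '/tmp/toy_regr_ds.pkl'
ds.save(out_path)
reloaded = RegressionDataset(dataset_path=out_path)
print(reloaded.num_samplets == ds.num_samplets)   # True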
-
property
shape
¶ Returns the pythonic shape of the dataset: num_samplets x num_features.
-
summarize
(self)[source]¶ Summary of targets: unique values and their sizes
- Returns
target_set (list) – List of unique target values
target_sizes (list) – Number of samplets for each unique target value
-
property
target_set
¶ Set of unique target values in the dataset.
-
property
target_sizes
¶ Returns the number of samplets for each unique target value, as a Counter object.
-
property
targets
¶ Returns the array of targets for all the samplets.
-
train_test_split_ids
(self, train_perc=None, count=None)[source]¶ Returns two disjoint sets of samplet ids for use in cross-validation.
Offers two ways to specify the sizes: fraction or count. Only one access method can be used at a time.
- Parameters
train_perc (float) – fraction of samplets to build the training subset, between 0 and 1
count (int) – exact count of samplets to build the training subset, between 1 and N-1
- Returns
train_set (list) – List of ids in the training set.
test_set (list) – List of ids in the test set.
- Raises
ValueError – If the fraction is outside the open interval (0, 1), or if the count is out of bounds (not between 1 and N-1), or if an unrecognized format is provided for the input args, or if the selection results in empty subsets for either the train or test sets.
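A sketch of a simple hold-out split on the toy ds above, using the fraction-based interface; the resulting id lists can then be passed to get_subset():

train_ids, test_ids = ds.train_test_split_ids(train_perc=0.75)
train_ds = ds.get_subset(train_ids)
test_ds = ds.get_subset(test_ids)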
-
transform
(self, func, func_description=None)¶ Applies a given function to the features of each subject and returns a new dataset with other info unchanged.
- Parameters
func (callable) –
A callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects.
If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.
func_description (str, optional) – Human readable description of the given function.
- Returns
xfm_ds – with features obtained from subject-wise transform
- Return type
Dataset
- Raises
TypeError – If given func is not a callable
ValueError – If transformation of any of the subjects' features raises an exception.
Examples
A simple example below shows the application of .transform() to create a new pyradigm dataset:

from pyradigm import ClassificationDataset as ClfDataset
import numpy as np

thickness = ClfDataset(in_path='ADNI_thickness.csv')

# get_pcg_median is a function that computes the median value in PCG
pcg_thickness = thickness.transform(func=get_pcg_median,
                                    func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median,
                                     func_description='median per subject')
A complex example using a function that takes more than one argument:

from functools import partial

import hiwenet
import numpy as np

thickness = ClfDataset(in_path='ADNI_thickness.csv')
roi_membership = read_roi_membership()
calc_hist_dist = partial(hiwenet, groups=roi_membership)

thickness_hiwenet = thickness.transform(func=calc_hist_dist,
                                        func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median,
                                                 func_description='median per subject')