API for Classification dataset

A tutorial-like presentation is available at Features and usage, which teaches the overall usage pattern and how to use the following API for the ClassificationDataset.

class pyradigm.classify.ClassificationDataset(dataset_path=None, in_dataset=None, data=None, targets=None, description='', feature_names=None, dtype=<class 'numpy.float64'>, allow_nan_inf=False)[source]

Bases: pyradigm.base.BaseDataset

The main user-facing class for classification datasets.

Note: a samplet refers to a single row in the feature matrix X (N x p).

Attributes
attr

Returns attributes dictionary, keyed-in by [attr_name][samplet_id]

attr_dtype

Direct access to attr dtype

data

data in its original dict form.

dataset_attr

Returns dataset attributes

description

Text description (header) that can be set by user.

dtype

Returns the data type of the features in the Dataset

feature_names

Returns the feature names as a numpy array of strings.

num_features

number of features in each samplet.

num_samplets

number of samplets in the entire dataset.

num_targets

Total number of unique classes in the dataset.

samplet_ids

Sample identifiers (strings) forming the basis of Dataset

shape

Returns the pythonic shape of the dataset: num_samplets x num_features.

target_set

Set of unique classes in the dataset.

target_sizes

Returns the sizes of the different classes in a Counter object.

targets

Returns the array of targets for all the samplets.

Methods

add_attr(self, attr_name, samplet_id, attr_value)

Method to add samplet-wise attributes to the dataset.

add_dataset_attr(self, attr_name, attr_value)

Adds dataset-wide attributes (common to all samplets).

add_samplet(self, samplet_id, features, target)

Adds a new samplet to the dataset with its features and target (class label).

attr_summary(self)

Simple summary of attributes currently stored in the dataset

data_and_labels(self)

Deprecated: symbolic link to the .data_and_targets() method for backwards compatibility.

data_and_targets(self)

Dataset features and targets in a matrix form for learning.

del_attr(self, attr_name[, samplet_ids])

Method to delete the specified attribute for a list of samplet IDs

del_samplet(self, sample_id)

Method to remove a samplet from the dataset.

extend(self, other)

Method to extend the dataset vertically (add samplets from another dataset).

from_arff(arff_path[, encode_nonnumeric])

Loads a given dataset saved in Weka’s ARFF format.

get(self, item[, not_found_value])

Method like dict.get(), which returns the specified value if the key is not found

get_attr(self, attr_name[, samplet_ids])

Method to retrieve specified attribute for a list of samplet IDs

get_class(self, target_id)

Returns a smaller dataset belonging to the requested classes.

get_data_matrix_in_order(self, subset_ids)

Returns a numpy array of features, rows in the same order as subset_ids

get_feature_subset(self, subset_idx)

Returns the subset of features indexed numerically.

get_subset(self, subset_ids)

Returns a smaller dataset identified by their keys/samplet IDs.

glance(self[, nitems])

Quick and partial glance of the data matrix.

random_subset(self[, perc_in_class])

Returns a random sub-dataset (of specified size by percentage) within each class.

random_subset_ids(self[, perc_per_class])

Returns a random subset of sample ids (size in percentage) within each class.

random_subset_ids_by_count(self[, …])

Returns a random subset of sample ids of specified size by count, within each class.

rename_targets(self, new_targets)

Helper to rename the targets, given a dict of new targets keyed by samplet IDs

sample_ids_in_class(self, class_id)

Returns a list of sample ids belonging to a given class.

save(self, file_path[, …])

Method to save the dataset to disk.

summarize(self)

Summary of classes: names and sizes

train_test_split_ids(self[, train_perc, …])

Returns two disjoint sets of samplet ids for use in cross-validation.

transform(self, func[, func_description])

Applies a given function to the features of each subject

add_attr(self, attr_name, samplet_id, attr_value)

Method to add samplet-wise attributes to the dataset.

Note: attribute values get overwritten by default, if they already exist for a given samplet

Parameters
  • attr_name (str) – Name of the attribute to be added.

  • samplet_id (str or list of str) – Identifier(s) for the samplet

  • attr_value (generic or list of generic) – Value(s) of the attribute. Any data type allowed, although it is strongly recommended to keep the data type the same, or compatible, across all the samplets in this dataset.

add_dataset_attr(self, attr_name, attr_value)

Adds dataset-wide attributes (common to all samplets).

This is a great way to add metadata (such as version, processing details, or anything else common to all samplets). This is better than encoding the info into the description field, as this allows programmatic retrieval.

Parameters
  • attr_name (str) – Identifier for the attribute

  • attr_value (object) – Value of the attribute (any datatype)

add_samplet(self, samplet_id, features, target, overwrite=False, feature_names=None, attr_names=None, attr_values=None)

Adds a new samplet to the dataset with its features and target (class label).

This is the preferred way to construct the dataset.

Parameters
  • samplet_id (str) – An identifier that uniquely identifies this samplet.

  • features (list, ndarray) – The features for this samplet

  • target (int, str) – The label for this samplet

  • overwrite (bool) – If True, allows the overwrite of features for an existing subject ID. Default: False.

  • feature_names (list) – The names for each feature. Assumed to be in the same order as features

  • attr_names (str or list of str) – Name of the attribute to be added for this samplet

  • attr_values (generic or list of generic) – Value of the attribute. Any data type allowed as long as they are compatible across all the samplets in this dataset.

Raises
  • ValueError – If samplet_id is already in the Dataset (and overwrite=False), or if the dimensionality of the new samplet does not match that of the dataset, or if feature_names do not match the existing names

  • TypeError – If samplet to be added is of different data type compared to existing samplets.

property attr

Returns attributes dictionary, keyed-in by [attr_name][samplet_id]

property attr_dtype

Direct access to attr dtype

attr_summary(self)

Simple summary of attributes currently stored in the dataset

property data

data in its original dict form.

data_and_labels(self)[source]

Deprecated: symbolic link to the .data_and_targets() method for backwards compatibility.

data_and_targets(self)

Dataset features and targets in a matrix form for learning.

Also returns samplet_ids in the same order.

Returns

  • data_matrix (ndarray) – 2D array of shape [num_samplets, num_features] with features corresponding row-wise to samplet_ids

  • targets (ndarray) – Array of numeric targets for each samplet corresponding row-wise to samplet_ids

  • samplet_ids (list) – List of samplet ids

property dataset_attr

Returns dataset attributes

del_attr(self, attr_name, samplet_ids='all')

Method to delete the specified attribute for a list of samplet IDs

Parameters
  • attr_name (str) – Name of the attribute

  • samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.

Returns

Return type

None

Raises

Warning – If attr_name was never set for this dataset

del_samplet(self, sample_id)

Method to remove a samplet from the dataset.

Parameters

sample_id (str) – samplet id to be removed.

Raises

UserWarning – If samplet id to delete was not found in the dataset.

property description

Text description (header) that can be set by user.

property dtype

Returns the data type of the features in the Dataset

extend(self, other)

Method to extend the dataset vertically (add samplets from another dataset).

Parameters

other (Dataset) – second dataset to be combined with the current (different samplets, but same dimensionality)

Raises

TypeError – if input is not an Dataset.

property feature_names

Returns the feature names as a numpy array of strings.

classmethod from_arff(arff_path, encode_nonnumeric=False)

Loads a given dataset saved in Weka’s ARFF format.

Parameters
  • arff_path (str) – Path to a dataset saved in Weka’s ARFF file format.

  • encode_nonnumeric (bool) – Flag to specify whether to encode non-numeric (categorical, nominal or string) features to numeric values. Currently used only when importing ARFF files. It is usually better to encode your data at the source, and then import them. Use with caution!

get(self, item, not_found_value=None)

Method like dict.get(), which returns the specified value if the key is not found

get_attr(self, attr_name, samplet_ids='all')

Method to retrieve specified attribute for a list of samplet IDs

Parameters
  • attr_name (str) – Name of the attribute

  • samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.

Returns

attr_values – Attribute values for the list of samplet IDs

Return type

ndarray

Raises

KeyError – If attr_name was never set for dataset, or any of the samplets requested

get_class(self, target_id)[source]

Returns a smaller dataset belonging to the requested classes.

Parameters

target_id (str or list) – identifier(s) of the class(es) to be returned.

Returns

Sub-dataset with the samplets belonging to the given class(es).

Return type

ClassificationDataset

Raises

ValueError – If one or more of the requested classes do not exist in this dataset, or if the specified id is empty or None

get_data_matrix_in_order(self, subset_ids)

Returns a numpy array of features, rows in the same order as subset_ids

Parameters

subset_ids (list) – List of samplet IDs to be extracted from the dataset.

Returns

matrix – Matrix of features, for each id in subset_ids, in order.

Return type

ndarray

get_feature_subset(self, subset_idx)

Returns the subset of features indexed numerically.

Parameters

subset_idx (list, ndarray) – List of indices to features to be returned

Returns

Dataset – with subset of features requested.

Return type

Dataset

Raises

UnboundLocalError – If input indices are out of bounds for the dataset.

get_subset(self, subset_ids)

Returns a smaller dataset identified by their keys/samplet IDs.

Parameters

subset_ids (list) – List of samplet IDs to be extracted from the dataset.

Returns

sub-dataset – sub-dataset containing only requested samplet IDs.

Return type

Dataset

glance(self, nitems=5)

Quick and partial glance of the data matrix.

Parameters

nitems (int) – Number of items to glance from the dataset. Default : 5

Returns

Return type

dict

property num_features

number of features in each samplet.

property num_samplets

number of samplets in the entire dataset.

property num_targets

Total number of unique classes in the dataset.

random_subset(self, perc_in_class=0.5)[source]

Returns a random sub-dataset (of specified size by percentage) within each class.

Parameters

perc_in_class (float) – Fraction of samples to be taken from each class.

Returns

subdataset – random sub-dataset of specified size.

Return type

ClassificationDataset

random_subset_ids(self, perc_per_class=0.5)[source]

Returns a random subset of sample ids (size in percentage) within each class.

Parameters

perc_per_class (float) – Fraction of samples per class

Returns

subset – Combined list of sample ids from all classes.

Return type

list

Raises
  • ValueError – If no subjects from one or more classes were selected.

  • UserWarning – If an empty or full dataset is requested.

random_subset_ids_by_count(self, count_per_class=1)[source]
Returns a random subset of sample ids of specified size by count, within each class.

Parameters

count_per_class (int) – Exact number of samples per each class.

Returns

subset – Combined list of sample ids from all classes.

Return type

list

rename_targets(self, new_targets)[source]

Helper to rename the targets, given a dict of new targets keyed by samplet IDs

Parameters

new_targets (dict) – Dict of targets keyed in by sample IDs.

Raises
  • TypeError – If targets is not a dict.

  • ValueError – If not all samplets in the dataset are present in the input dict, or if one of the samplets in the input is not recognized.

sample_ids_in_class(self, class_id)[source]

Returns a list of sample ids belonging to a given class.

Parameters

class_id (str) – class id to query.

Returns

subset_ids – List of sample ids belonging to a given class.

Return type

list

property samplet_ids

Sample identifiers (strings) forming the basis of Dataset

save(self, file_path, allow_constant_features=False, allow_constant_features_across_samplets=False)

Method to save the dataset to disk.

Parameters
  • file_path (str) – File path to save the current dataset to

  • allow_constant_features (bool) – Flag indicating whether to allow all the feature values for a samplet to be identical (e.g. all zeros). When False, this flag intends to catch unusual, and likely incorrect, situations where all features for a given samplet are zeros or some other constant value. In normal, real-world scenarios, different features will have different values, so a constant value is indicative of a bug somewhere. When constant values are intended, pass True for this flag.

  • allow_constant_features_across_samplets (bool) – While the previous flag allow_constant_features looks at one samplet at a time (across features; along rows in feature matrix X: N x p), this flag checks for constant values across all samplets for a given feature (along the columns). When similar values are expected across all samplets, pass True to this flag.

Raises

IOError – If saving to disk is not successful.

property shape

Returns the pythonic shape of the dataset: num_samplets x num_features.

summarize(self)[source]

Summary of classes: names and sizes

Returns

  • target_set (list) – List of names of all the classes

  • target_sizes (list) – Size of each class (number of samples)

property target_set

Set of unique classes in the dataset.

property target_sizes

Returns the sizes of the different classes in a Counter object.

property targets

Returns the array of targets for all the samplets.

train_test_split_ids(self, train_perc=None, count_per_class=None)[source]

Returns two disjoint sets of samplet ids for use in cross-validation.

Offers two ways to specify the sizes: fraction or count. Only one of the two can be used at a time.

Parameters
  • train_perc (float) – fraction of samplets from each class to build the training subset.

  • count_per_class (int) – exact count of samplets from each class to build the training subset.

Returns

  • train_set (list) – List of ids in the training set.

  • test_set (list) – List of ids in the test set.

Raises

ValueError – If the fraction is outside the open interval (0, 1), or if the count is larger than the size of the smallest class, or if an unrecognized format is provided for the input args, or if the selection results in empty subsets for either the train or test set.

transform(self, func, func_description=None)
Applies a given function to the features of each subject and returns a new dataset with other info unchanged.

Parameters
  • func (callable) –

    A callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality must be the same for all subjects.

    If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.

  • func_description (str, optional) – Human readable description of the given function.

Returns

xfm_ds – with features obtained from subject-wise transform

Return type

Dataset

Raises
  • TypeError – If given func is not a callable

  • ValueError – If transformation of any of the subjects features raises an exception.

Examples

A simple example below shows the application of the .transform() method to create a new pyradigm dataset:

from pyradigm import ClassificationDataset as ClfDataset
import numpy as np

thickness = ClfDataset(dataset_path='ADNI_thickness.csv')
# get_pcg_median is a user-defined function that computes the median value in PCG
pcg_thickness = thickness.transform(func=get_pcg_median,
                            func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median,
                            func_description='median per subject')

A more complex example using a function that takes more than one argument:

from functools import partial
import numpy as np
import hiwenet

thickness = ClfDataset(dataset_path='ADNI_thickness.csv')
# read_roi_membership is a user-defined helper returning ROI labels per feature
roi_membership = read_roi_membership()
calc_hist_dist = partial(hiwenet.extract, groups=roi_membership)

thickness_hiwenet = thickness.transform(func=calc_hist_dist,
                            func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median,
                            func_description='median per subject')