API for Regression dataset¶
A tutorial-like presentation is available at Features and usage, which teaches the overall usage pattern and how to use the following API for the RegressionDataset.
-
class
pyradigm.regress.RegressionDataset
(dataset_path=None, in_dataset=None, data=None, targets=None, description='', feature_names=None, dtype=<class 'numpy.float64'>, allow_nan_inf=False)[source]¶ Bases: pyradigm.base.BaseDataset
RegressionDataset where target values are numeric (such as float or int)
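As a quick orientation, here is a minimal sketch of building a small RegressionDataset from scratch, using the empty constructor together with add_samplet() (documented further down this page); the samplet ids, feature dimensionality and target values are made up for illustration:

from pyradigm.regress import RegressionDataset
import numpy as np

rng = np.random.default_rng(0)

ds = RegressionDataset(description='toy regression dataset')
for sid in ('s01', 's02', 's03', 's04'):
    # ten random features per samplet, with a numeric target (e.g. age)
    ds.add_samplet(samplet_id=sid,
                   features=rng.standard_normal(10),
                   target=float(rng.uniform(50, 80)))

print(ds.num_samplets, ds.num_features)   # 4 10

Several of the shorter sketches further down this page continue from this toy ds.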
- Attributes
attr
Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
attr_dtype
Direct access to attr dtype
data
data in its original dict form.
dataset_attr
Returns dataset attributes
description
Text description (header) that can be set by user.
dtype
Returns the data type of the features in the Dataset
feature_names
Returns the feature names as a numpy array of strings.
num_features
number of features in each samplet.
num_samplets
number of samplets in the entire dataset.
num_targets
Total number of unique target values in the dataset.
samplet_ids
Sample identifiers (strings) forming the basis of Dataset
shape
Returns the pythonic shape of the dataset: num_samplets x num_features.
target_set
Set of unique target values in the dataset.
target_sizes
Returns the number of samplets for each unique target value, as a Counter object.
targets
Returns the array of targets for all the samplets.
Methods
add_attr
(self, attr_name, samplet_id, attr_value)Method to add samplet-wise attributes to the dataset.
add_dataset_attr
(self, attr_name, attr_value)Adds dataset-wide attributes (common to all samplets).
add_samplet
(self, samplet_id, features, target)Adds a new samplet to the dataset with its features and target.
attr_summary
(self)Simple summary of attributes currently stored in the dataset
data_and_targets
(self)Dataset features and targets in a matrix form for learning.
del_attr
(self, attr_name[, samplet_ids])Method to delete the specified attribute for a list of samplet IDs
del_samplet
(self, sample_id)Method to remove a samplet from the dataset.
extend
(self, other)Method to extend the dataset vertically (add samplets from another dataset).
from_arff
(arff_path[, encode_nonnumeric])Loads a given dataset saved in Weka’s ARFF format.
get
(self, item[, not_found_value])Method like dict.get(), which can return a specified value if the key is not found
get_attr
(self, attr_name[, samplet_ids])Method to retrieve specified attribute for a list of samplet IDs
get_data_matrix_in_order
(self, subset_ids)Returns a numpy array of features, rows in the same order as subset_ids
get_feature_subset
(self, subset_idx)Returns the subset of features indexed numerically.
get_subset
(self, subset_ids)Returns a smaller dataset identified by their keys/samplet IDs.
get_target
(self, target_id)Returns a smaller dataset of samplets with the requested target.
glance
(self[, nitems])Quick and partial glance of the data matrix.
random_subset
(self[, perc])Returns a random sub-dataset (of specified size by percentage) within each class.
random_subset_ids
(self[, perc])Returns a random subset of sample ids, of size percentage*num_samplets
random_subset_ids_by_count
(self[, count])Returns a random subset of sample ids, of the specified count
samplet_ids_with_target
(self, target_id)Returns a list of samplet ids with a given target_id
save
(self, file_path[, …])Method to save the dataset to disk.
summarize
(self)Summary of targets: unique values and their sizes
train_test_split_ids
(self[, train_perc, count])Returns two disjoint sets of samplet ids for use in cross-validation.
transform
(self, func[, func_description])Applies a given function to the features of each subject
-
add_attr
(self, attr_name, samplet_id, attr_value)¶ Method to add samplet-wise attributes to the dataset.
Note: attribute values get overwritten by default, if they already exist for a given samplet
- Parameters
attr_name (str) – Name of the attribute to be added.
samplet_id (str or list of str) – Identifier(s) for the samplet
attr_value (generic or list of generic) – Value(s) of the attribute. Any data type allowed, although it is strongly recommended to keep the data type the same, or compatible, across all the samplets in this dataset.
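For instance, continuing from the toy ds built in the sketch near the top of this page (the attribute name and values here are made up):

# one samplet at a time
ds.add_attr('scanner_site', 's01', 'site_A')
# or several samplets at once, with one value per id
ds.add_attr('scanner_site', ['s02', 's03', 's04'], ['site_A', 'site_B', 'site_B'])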
-
add_dataset_attr
(self, attr_name, attr_value)¶ Adds dataset-wide attributes (common to all samplets).
This is a great way to add metadata (such as version, processing details, or anything else common to all samplets). It is better than encoding the info into the description field, as it allows programmatic retrieval.
- Parameters
attr_name (str) – Identifier for the attribute
attr_value (object) – Value of the attribute (any datatype)
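For example, continuing with the toy ds above, dataset-wide metadata could be attached as follows (the keys and values are purely illustrative):

ds.add_dataset_attr('preprocessing', 'FreeSurfer v6.0, fsaverage')
ds.add_dataset_attr('atlas_version', '0.1.0')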
-
add_samplet
(self, samplet_id, features, target, overwrite=False, feature_names=None, attr_names=None, attr_values=None)¶ Adds a new samplet to the dataset with its features and target.
This is the preferred way to construct the dataset.
- Parameters
samplet_id (str) – An identifier that uniquely identifies this samplet.
features (list, ndarray) – The features for this samplet
target (int, float) – The target value for this samplet
overwrite (bool) – If True, allows overwriting the features for an existing samplet ID. Default : False.
feature_names (list) – The names for each feature. Assumed to be in the same order as features
attr_names (str or list of str) – Name of the attribute to be added for this samplet
attr_values (generic or list of generic) – Value of the attribute. Any data type allowed as long as they are compatible across all the samplets in this dataset.
- Raises
ValueError – If samplet_id is already in the Dataset (and overwrite=False), or if the dimensionality of the current samplet does not match that of the existing samplets, or if feature_names do not match the existing names
TypeError – If samplet to be added is of different data type compared to existing samplets.
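As a further sketch, feature names can be supplied along with the samplets (reusing the import from the sketch near the top of this page; the names and values below are made up), assuming all samplets keep the same names and dimensionality:

ds2 = RegressionDataset(description='toy dataset with named features')
feat_names = ['roi{}'.format(idx) for idx in range(3)]   # hypothetical feature names

ds2.add_samplet('s01', [1.1, 2.2, 3.3], target=47.5, feature_names=feat_names)
ds2.add_samplet('s02', [0.9, 2.0, 3.8], target=52.0, feature_names=feat_names)
# re-adding an existing id requires overwrite=True, otherwise ValueError is raised
ds2.add_samplet('s01', [1.0, 2.1, 3.5], target=47.5,
                feature_names=feat_names, overwrite=True)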
-
property
attr
¶ Returns attributes dictionary, keyed-in by [attr_name][samplet_id]
-
property
attr_dtype
¶ Direct access to attr dtype
-
attr_summary
(self)¶ Simple summary of attributes currently stored in the dataset
-
property
data
¶ data in its original dict form.
-
data_and_targets
(self)¶ Dataset features and targets in a matrix form for learning.
Also returns samplet_ids in the same order.
- Returns
data_matrix (ndarray) – 2D array of shape [num_samplets, num_features] with features corresponding row-wise to samplet_ids
targets (ndarray) – Array of numeric targets for each samplet corresponding row-wise to samplet_ids
samplet_ids (list) – List of samplet ids
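Continuing with the toy ds above, a short sketch of pulling out arrays suitable for feeding into an estimator:

X, y, ids = ds.data_and_targets()
print(X.shape, y.shape, len(ids))   # e.g. (4, 10) (4,) 4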
-
property
dataset_attr
¶ Returns dataset attributes
-
del_attr
(self, attr_name, samplet_ids='all')¶ Method to delete the specified attribute for a list of samplet IDs
- Parameters
attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.
- Returns
- Return type
None
- Raises
Warning – If attr_name was never set for this dataset
-
del_samplet
(self, sample_id)¶ Method to remove a samplet from the dataset.
- Parameters
sample_id (str) – samplet id to be removed.
- Raises
UserWarning – If samplet id to delete was not found in the dataset.
-
property
description
¶ Text description (header) that can be set by user.
-
property
dtype
¶ Returns the data type of the features in the Dataset
-
extend
(self, other)¶ Method to extend the dataset vertically (add samplets from another dataset).
- Parameters
other (Dataset) – second dataset to be combined with the current (different samplets, but same dimensionality)
- Raises
TypeError – If the input is not a Dataset.
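A brief sketch, assuming extend() modifies the current dataset in place (like list.extend) and that both datasets share the same dimensionality; it continues from the toy ds and rng defined near the top of this page:

more = RegressionDataset(description='a second toy dataset')
more.add_samplet('s10', rng.standard_normal(10), target=71.2)

ds.extend(more)   # ds now also contains samplet 's10'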
-
property
feature_names
¶ Returns the feature names as a numpy array of strings.
-
classmethod
from_arff
(arff_path, encode_nonnumeric=False)¶ Loads a given dataset saved in Weka’s ARFF format.
- Parameters
arff_path (str) – Path to a dataset saved in Weka’s ARFF file format.
encode_nonnumeric (bool) – Flag to specify whether to encode non-numeric (categorical, nominal or string) features to numeric values. Currently used only when importing ARFF files. It is usually better to encode your data at the source and then import it. Use with caution!
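A hedged one-liner, assuming a Weka ARFF file with numeric features exists at the (illustrative) path below:

arff_ds = RegressionDataset.from_arff('/path/to/numeric_dataset.arff')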
-
get
(self, item, not_found_value=None)¶ Method like dict.get(), which can return a specified value if the key is not found
-
get_attr
(self, attr_name, samplet_ids='all')¶ Method to retrieve specified attribute for a list of samplet IDs
- Parameters
attr_name (str) – Name of the attribute
samplet_ids (str or list) – One or more samplet IDs whose attribute is being queried. Default: ‘all’, all the existing samplet IDs will be used.
- Returns
attr_values – Attribute values for the list of samplet IDs
- Return type
ndarray
- Raises
KeyError – If attr_name was never set for dataset, or any of the samplets requested
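Continuing from the add_attr() sketch above, attribute values can be read back as follows:

sites = ds.get_attr('scanner_site', ['s01', 's02', 's03', 's04'])
# ds.get_attr('scanner_site') queries all samplet ids by default, and raises
# KeyError if the attribute was never set for any of them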
-
get_data_matrix_in_order
(self, subset_ids)¶ Returns a numpy array of features, rows in the same order as subset_ids
- Parameters
subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
matrix – Matrix of features, for each id in subset_ids, in order.
- Return type
ndarray
-
get_feature_subset
(self, subset_idx)¶ Returns the subset of features indexed numerically.
- Parameters
subset_idx (list, ndarray) – List of indices to features to be returned
- Returns
Dataset – with subset of features requested.
- Return type
Dataset
- Raises
UnboundLocalError – If input indices are out of bounds for the dataset.
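For example, keeping only the first three feature columns of the toy ds above (the indices are arbitrary):

first_three = ds.get_feature_subset([0, 1, 2])
print(first_three.num_features)   # 3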
-
get_subset
(self, subset_ids)¶ Returns a smaller dataset identified by their keys/samplet IDs.
- Parameters
subset_ids (list) – List of samplet IDs to be extracted from the dataset.
- Returns
sub-dataset – sub-dataset containing only requested samplet IDs.
- Return type
Dataset
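For example, restricting the toy ds above to two samplets:

pair = ds.get_subset(['s01', 's03'])
print(pair.num_samplets)   # 2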
-
get_target
(self, target_id)[source]¶ Returns a smaller dataset of samplets with the requested target.
- Parameters
target_id (str or list) – Target value(s) of the samplets to be returned.
- Returns
Sub-dataset with the samplets having the given target value(s).
- Return type
- Raises
ValueError – If one or more of the requested target values do not exist in this dataset, or if the specified id is empty or None
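A small sketch on the ds2 built in the add_samplet() example above, assuming exact matching of the numeric target value (beware of floating-point precision when targets are floats):

matched = ds2.get_target(47.5)   # samplets whose target equals 47.5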
-
glance
(self, nitems=5)¶ Quick and partial glance of the data matrix.
- Parameters
nitems (int) – Number of items to glance from the dataset. Default : 5
- Returns
- Return type
dict
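For example, on the toy ds above:

peek = ds.glance(nitems=2)   # a partial view of the data, limited to two samplets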
-
property
num_features
¶ number of features in each samplet.
-
property
num_samplets
¶ number of samplets in the entire dataset.
-
property
num_targets
¶ Total number of unique target values in the dataset.
-
random_subset
(self, perc=0.5)[source]¶ Returns a random sub-dataset (of specified size by percentage) within each class.
- Parameters
perc (float) – Fraction of samples to be taken from the dataset
- Returns
subdataset – random sub-dataset of specified size.
- Return type
-
random_subset_ids
(self, perc=0.5)[source]¶ Returns a random subset of sample ids, of size percentage*num_samplets
- Parameters
perc (float) – Fraction of the dataset
- Returns
subset – List of samplet ids
- Return type
list
- Raises
ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
-
random_subset_ids_by_count
(self, count=5)[source]¶ Returns a random subset of sample ids, of the specified count
- Parameters
count (int) – Number of samplet ids to be returned
- Returns
subset – List of samplet ids
- Return type
list
- Raises
ValueError – If no subjects from one or more classes were selected.
UserWarning – If an empty or full dataset is requested.
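A short sketch contrasting the two random-subset helpers on the toy ds above:

half_ids = ds.random_subset_ids(perc=0.5)             # roughly half of the samplet ids
three_ids = ds.random_subset_ids_by_count(count=3)    # three samplet ids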
-
property
samplet_ids
¶ Sample identifiers (strings) forming the basis of Dataset
-
samplet_ids_with_target
(self, target_id)[source]¶ Returns a list of samplet ids with a given target_id
- Parameters
target_id (float) – Value of the target to query
- Returns
subset_ids – List of samplet ids belonging to a given target_id.
- Return type
list
-
save
(self, file_path, allow_constant_features=False, allow_constant_features_across_samplets=False)¶ Method to save the dataset to disk.
- Parameters
file_path (str) – File path to save the current dataset to
allow_constant_features (bool) – Flag indicating whether to allow all the feature values for a samplet to be identical (e.g. all zeros). When False, this flag catches the unusual, and likely incorrect, situation where all features for a given samplet are zeros or some other constant value; in real-world data, different features typically have different values, so a constant row usually indicates a bug. When constant values are intended, pass True for this flag.
allow_constant_features_across_samplets (bool) – While the previous flag allow_constant_features looks at one samplet at a time (across features; along rows in feature matrix X: N x p), this flag checks for constant values across all samplets for a given feature (along the columns). When similar values are expected across all samplets, pass True to this flag.
- Raises
IOError – If saving to disk is not successful.
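A sketch of a save/reload round trip with the toy ds above, assuming the dataset_path argument of the constructor reloads a previously saved dataset (the file path is illustrative):

out_path = '/tmp/toy_regr_ds.pkl'
ds.save(out_path)
reloaded = RegressionDataset(dataset_path=out_path)
print(reloaded.num_samplets == ds.num_samplets)   # True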
-
property
shape
¶ Returns the pythonic shape of the dataset: num_samplets x num_features.
-
summarize
(self)[source]¶ Summary of targets: unique values and their sizes
- Returns
target_set (list) – List of unique target values
target_sizes (list) – Number of samplets for each unique target value
-
property
target_set
¶ Set of unique target values in the dataset.
-
property
target_sizes
¶ Returns the number of samplets for each unique target value, as a Counter object.
-
property
targets
¶ Returns the array of targets for all the samplets.
-
train_test_split_ids
(self, train_perc=None, count=None)[source]¶ Returns two disjoint sets of samplet ids for use in cross-validation.
Offers two ways to specify the sizes: fraction or count. Only one access method can be used at a time.
- Parameters
train_perc (float) – fraction of samplets to build the training subset, between 0 and 1
count (int) – exact count of samplets to build the training subset, between 1 and N-1
- Returns
train_set (list) – List of ids in the training set.
test_set (list) – List of ids in the test set.
- Raises
ValueError – If the fraction is outside the open interval (0, 1), or if the count is out of bounds (not between 1 and N-1), or if an unrecognized format is provided for the input args, or if the selection results in empty subsets for either the train or test sets.
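A sketch of a simple hold-out split on the toy ds above, using the fraction-based interface; the resulting id lists can then be passed to get_subset():

train_ids, test_ids = ds.train_test_split_ids(train_perc=0.75)
train_ds = ds.get_subset(train_ids)
test_ds = ds.get_subset(test_ids)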
-
transform
(self, func, func_description=None)¶ Applies a given function to the features of each subject and returns a new dataset with other info unchanged.
- Parameters
func (callable) –
A callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects.
If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.
func_description (str, optional) – Human readable description of the given function.
- Returns
xfm_ds – with features obtained from subject-wise transform
- Return type
Dataset
- Raises
TypeError – If given func is not a callable
ValueError – If transformation of any of the subjects' features raises an exception.
Examples
A simple example below shows the application of .transform() to create a new pyradigm dataset:

from pyradigm import ClassificationDataset as ClfDataset
import numpy as np

thickness = ClfDataset(in_path='ADNI_thickness.csv')

# get_pcg_median is a function that computes the median value in PCG
pcg_thickness = thickness.transform(func=get_pcg_median,
                                    func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median,
                                     func_description='median per subject')
A complex example using a function that takes more than one argument:

from functools import partial

import hiwenet
import numpy as np

thickness = ClfDataset(in_path='ADNI_thickness.csv')
roi_membership = read_roi_membership()
calc_hist_dist = partial(hiwenet, groups=roi_membership)

thickness_hiwenet = thickness.transform(func=calc_hist_dist,
                                        func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median,
                                                 func_description='median per subject')