Transformers¶

DeepChem dc.trans.Transformer objects are another core building block of DeepChem programs. Machine learning systems are often delicate: they need their inputs and outputs to fit within a pre-specified range or to follow a clean mathematical distribution. Real data, of course, is wild and hard to control. What do you do if you have an unruly dataset and need to bring its statistics to heel? Fear not, for you have Transformer objects.

General Transformers ¶

NormalizationTransformer ¶

class NormalizationTransformer(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, dataset: Dataset | None = None, transform_gradients: bool = False, move_mean: bool = True)[source]¶

Normalizes dataset to have zero mean and unit standard deviation

This transformer transforms datasets to have zero mean and unit standard deviation.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises:: ValueError – if transform_X and transform_y are both set.
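
The arithmetic behind this kind of normalization can be sketched in plain NumPy. This is a minimal illustration, assuming the statistics are computed per column (per task); the helper names `normalize` and `denormalize` are hypothetical and are not part of DeepChem, and the real class may treat edge cases such as zero variance differently.

```python
import numpy as np


def normalize(a, mean, std):
    """Shift to zero mean and scale to unit standard deviation, column by column."""
    return (a - mean) / std


def denormalize(z, mean, std):
    """Invert normalize(): scale back up, then shift back."""
    return z * std + mean


rng = np.random.default_rng(0)
y = rng.random((10, 2)) * 50.0 + 7.0  # two "tasks" far from zero mean / unit std
y_mean = y.mean(axis=0)
y_std = y.std(axis=0)

y_norm = normalize(y, y_mean, y_std)          # what a transform would produce
y_back = denormalize(y_norm, y_mean, y_std)   # what an untransform would recover
```

Keeping the mean and standard deviation around is what makes the inverse possible, which is why predictions made in normalized units can be mapped back to the original scale.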

__init__(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, dataset: Dataset | None = None, transform_gradients: bool = False, move_mean: bool = True)[source]¶

Initialize normalization transformation.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
transform_w (bool, optional (default False)) – Whether to transform w
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶

Undo transformation on provided data.

Parameters:: z (np.ndarray) – Array to transform back
Returns:: z_out – Array with normalization undone.
Return type:: np.ndarray

untransform_grad(grad, tasks)[source]¶: DEPRECATED. DO NOT USE.

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

MinMaxTransformer ¶

class MinMaxTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None)[source]¶

Ensure each value rests between 0 and 1 by using the min and max.

MinMaxTransformer transforms the dataset by shifting each axis of X or y (depending on whether transform_X or transform_y is True), except the first one, by the minimum value along that axis and then dividing the result by the range (maximum value - minimum value) along that axis. This ensures each axis lies between 0 and 1. In multi-task learning, it ensures each task is given equal importance.

Given original array A, the transformed array can be written as:

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A_min = np.min(A, axis=0)
>>> A_max = np.max(A, axis=0)
>>> A_t = np.nan_to_num((A - A_min)/(A_max - A_min))

Examples

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.MinMaxTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises:: ValueError – if transform_X and transform_y are both set.

__init__(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None)[source]¶

Initialization of MinMax transformer.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶

Undo transformation on provided data.

Parameters:: z (np.ndarray) – Transformed X or y array
Returns:: Array with min-max scaling undone.
Return type:: np.ndarray
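
Undoing min-max scaling amounts to multiplying by the range and adding the minimum back. The following is a NumPy sketch of that round trip, not DeepChem's own code; it assumes no column is constant, so the range is never zero.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 3)) * 20.0 - 5.0  # values roughly in [-5, 15)

A_min = A.min(axis=0)
A_max = A.max(axis=0)

# Forward: shift by the minimum, divide by the range.
A_t = (A - A_min) / (A_max - A_min)

# Inverse: multiply by the range, shift back by the minimum.
A_back = A_t * (A_max - A_min) + A_min
```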

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

ClippingTransformer ¶

class ClippingTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, x_max: float = 5.0, y_max: float = 500.0)[source]¶

Clip large values in datasets.

Examples

Let’s clip values from a synthetic dataset

>>> import deepchem as dc
>>> import numpy as np
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.ClippingTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)

__init__(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, x_max: float = 5.0, y_max: float = 500.0)[source]¶

Initialize clipping transformation.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional) – Dataset to be transformed
x_max (float, optional) – Maximum absolute value for X
y_max (float, optional) – Maximum absolute value for y

Note

This transformer can transform X and y jointly, but does not transform w.

Raises:: ValueError – if transform_w is set.
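
As a rough illustration of what clipping to a maximum absolute value means, here is a NumPy sketch. It assumes the clipping is symmetric around zero, as the "maximum absolute value" wording of x_max and y_max suggests; it is not taken from the DeepChem source.

```python
import numpy as np

x_max = 5.0  # same value as the documented default for x_max

X = np.array([[-7.5, 0.3, 4.9],
              [12.0, -5.0, 5.1]])

# Values below -x_max become -x_max, values above x_max become x_max.
X_clipped = np.clip(X, -x_max, x_max)
# X_clipped is [[-5.0, 0.3, 4.9], [5.0, -5.0, 5.0]]
```

Clipping discards how far outside the bounds a value was, so the original values cannot be recovered afterwards.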

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of Features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights
ids (np.ndarray) – Array of ids.

Returns:

X (np.ndarray) – Transformed features
y (np.ndarray) – Transformed tasks
w (np.ndarray) – Transformed weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶: Not implemented.

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

LogTransformer ¶

class LogTransformer(transform_X: bool = False, transform_y: bool = False, features: List[int] | None = None, tasks: List[str] | None = None, dataset: Dataset | None = None)[source]¶

Computes a logarithmic transformation

This transformer computes the transformation given by

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A = np.log(A + 1)

This assumes that tasks/features are not specified. If they are specified, the transformation is performed only on the specified tasks/features.

Examples

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.LogTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises:: ValueError – if transform_w is set or transform_X and transform_y are both set.

__init__(transform_X: bool = False, transform_y: bool = False, features: List[int] | None = None, tasks: List[str] | None = None, dataset: Dataset | None = None)[source]¶

Initialize log transformer.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
features (list[int]) – List of feature indices to transform
tasks (list[str]) – List of task names to transform.
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶

Undo transformation on provided data.

Parameters:: z (np.ndarray) – Transformed X or y array
Returns:: Array with a logarithmic transformation undone.
Return type:: np.ndarray
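
The forward transformation log(A + 1) from the class description is inverted by exp(z) - 1. The NumPy sketch below shows the round trip and a column-restricted variant; the `features` index list is a hypothetical example, and the code illustrates the math rather than DeepChem's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))

# Forward transform from the class description: log(A + 1).
X_log = np.log(X + 1)

# Inverse: exp(z) - 1 recovers the original values.
X_back = np.exp(X_log) - 1

# Transforming only selected feature columns (hypothetical index list).
features = [0, 2]
X_partial = X.copy()
X_partial[:, features] = np.log(X_partial[:, features] + 1)
```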

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

CDFTransformer ¶

class CDFTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, bins: int = 2)[source]¶

Histograms the data and assigns values based on sorted list.

Acts like a Cumulative Distribution Function (CDF). If given a dataset of samples from a continuous distribution computes the CDF of this dataset and replaces values with their corresponding CDF values.
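
To make the idea concrete, here is one simple way to build a binned empirical CDF in NumPy. This is a sketch under stated assumptions: the helper `binned_cdf` is hypothetical, it ranks values within a single column and maps each rank to the lower edge of its bin, and DeepChem's exact binning may differ.

```python
import numpy as np


def binned_cdf(column, bins):
    """Replace each value by the lower edge of the quantile bin its rank falls in."""
    n = len(column)
    ranks = np.argsort(np.argsort(column))  # rank of each value, 0 = smallest
    return (ranks * bins // n) / bins


col = np.array([3.0, 1.0, 2.0, 4.0])
cdf_2 = binned_cdf(col, bins=2)  # coarse: only two distinct output levels
cdf_4 = binned_cdf(col, bins=4)  # finer: one level per sample here
```

With more bins the output resolves more distinct levels, which is why the example that follows uses 100 bins rather than the default of 2.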

Examples

Let’s look at an example where we transform only features.

>>> N = 10
>>> n_feat = 5
>>> n_bins = 100

Note that we’re using 100 bins for our CDF histogram

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.randint(2, size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_X=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)

Note that you can apply this transformation to y as well

>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_y=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)

__init__(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, bins: int = 2)[source]¶

Initialize this transformer.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
bins (int, optional (default 2)) – Number of bins to use when computing histogram.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Performs CDF transform on data.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶

Undo transformation on provided data.

Note that this transformation is only undone for y.

Parameters:: z (np.ndarray) – Transformed y array
Returns:: Array with the transformation undone.
Return type:: np.ndarray

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

PowerTransformer ¶

class PowerTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, powers: List[int] = [1])[source]¶

Takes power n transforms of the data based on an input vector.

Computes the specified powers of the dataset. This can be useful if you’re looking to add higher order features of the form x_i^2, x_i^3 etc. to your dataset.
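
A NumPy sketch of how higher-order features of this kind can be built is shown below. It assumes the powered copies are concatenated along the feature axis, so the output has n_features * len(powers) columns; that layout is an assumption made for illustration, not a statement about DeepChem's internals.

```python
import numpy as np

powers = [1, 2, 0.5]  # identity, squares, square roots

X = np.array([[1.0, 4.0],
              [9.0, 16.0]])

# One powered copy of X per entry in `powers`, stacked side by side.
X_pow = np.hstack([np.power(X, p) for p in powers])
```

Here a 2-feature input becomes a 6-column output: the original columns, then their squares, then their square roots.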

Examples

Let’s look at an example where we transform only X.

>>> N = 10
>>> n_feat = 5
>>> powers = [1, 2, 0.5]

So in this example, we’re taking the identity, squares, and square roots. Now let’s construct our matrices

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_X=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)

Let’s now look at an example where we transform y. Note that the y transform expands out the feature dimensions of y the same way it does for X so this transform is only well defined for singletask datasets.

>>> import numpy as np
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.rand(N)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_y=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)

__init__(transform_X: bool = False, transform_y: bool = False, dataset: Dataset | None = None, powers: List[int] = [1])[source]¶

Initialize this transformer

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed. Note that this argument is ignored since PowerTransformer doesn't require it to be specified.
powers (list[int], optional (default [1])) – The list of powers of features/labels to compute.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Performs power transform on data.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶

Undo transformation on provided data.

Parameters:: z (np.ndarray) – Transformed y array
Returns:: Array with the power transformation undone.
Return type:: np.ndarray

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

BalancingTransformer ¶

class BalancingTransformer(dataset: Dataset)[source]¶

Balance positive and negative (or multiclass) example weights.

This class balances the sample weights so that the sum of all example weights from all classes is the same. This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.
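
One way to obtain equal per-class weight sums is to scale each sample's weight by the ratio of the largest class count to its own class count. The NumPy sketch below illustrates that scheme for a single task; it is an assumption about a reasonable weighting, not a copy of what BalancingTransformer computes.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 6 negatives, 2 positives
w = np.ones(len(y))

counts = np.bincount(y)                # samples per class
class_weights = counts.max() / counts  # rarer classes get larger weights
w_balanced = w * class_weights[y]      # look up each sample's class weight

# Total weight carried by each class after balancing.
weight_per_class = np.bincount(y, weights=w_balanced)
```

After rebalancing, the two positive samples together carry the same total weight as the six negative samples.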

Examples

Here’s an example for a binary dataset.

>>> import deepchem as dc
>>> import numpy as np
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

And here’s a multiclass dataset example.

>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

DuplicateBalancingTransformer ¶

class DuplicateBalancingTransformer(dataset: Dataset)[source]¶

Balance binary or multiclass datasets by duplicating rarer class samples.

This class balances a dataset by duplicating samples of the rarer class so that the sum of all example weights from all classes is the same (up to integer rounding). This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.

This class differs from BalancingTransformer in that it actually duplicates rarer class samples rather than just increasing their sample weights. This may be more friendly for models that are numerically fragile and can’t handle imbalanced example weights.
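
Duplication-based balancing can be sketched with `np.repeat`. The example below assumes each class is repeated by the rounded ratio of the largest class count to its own count; this multiplier rule is an illustrative assumption, and the real class also has to account for existing sample weights.

```python
import numpy as np

X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 6 negatives, 2 positives

counts = np.bincount(y)
# Integer number of copies per class, hence "up to integer rounding".
multipliers = np.rint(counts.max() / counts).astype(int)
repeats = multipliers[y]  # copies to make of each individual sample

X_dup = np.repeat(X, repeats, axis=0)
y_dup = np.repeat(y, repeats)
```

The dataset grows from 8 to 12 rows, and every row keeps a weight of one, which is the property that helps models sensitive to uneven example weights.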

Examples

Here’s an example for a binary dataset.

>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> import deepchem as dc
>>> import numpy as np
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

And here’s a multiclass dataset example.

>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

ImageTransformer ¶

class ImageTransformer(size: Tuple[int, int], transform_X: bool = True, transform_y: bool = False)[source]¶

Transforms images to a specified width and/or height.

Images of shape (n_samples, width, height) and (n_samples, width, height, channels) are supported.

Images of shape (n_samples, width, height, channels) can be resized to (n_samples, new_width, new_height, channels).

Note

This class requires Pillow to be installed.

__init__(size: Tuple[int, int], transform_X: bool = True, transform_y: bool = False)[source]¶

Initializes ImageTransformer.

Parameters:

size (Tuple[int, int]) – The image size, a tuple of (width, height).
transform_X (bool, optional (default True)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y

Examples

Let’s transform a small dataset of images and their masks.

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 256, 256, 3)
>>> y = np.random.rand(10, 256, 256, 3)

Let’s now make an ImageDataset.

>>> dataset = dc.data.ImageDataset(X, y)

And let’s apply our transformer with a size of (128, 128).

>>> img_transform = dc.trans.ImageTransformer(size=(128, 128), transform_X=True, transform_y=True)
>>> resized_dataset = dataset.transform(img_transform)

We can see that our dataset has been resized.

>>> resized_X = resized_dataset.X
>>> resized_X.shape
(10, 128, 128, 3)

We can also see that the masks have been resized. If you want to transform only X, you can set transform_y to False, and vice versa.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(transformed: ndarray) → ndarray[source]¶

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:: transformed (np.ndarray) – Array which was previously transformed by this class.

FeaturizationTransformer ¶

class FeaturizationTransformer(dataset: Dataset | None = None, featurizer: Featurizer | None = None)[source]¶

A transformer which runs a featurizer over the X values of a dataset.

Datasets used by this transformer must be compatible with the internal featurizer. The idea of this transformer is that it allows for the application of a featurizer to an existing dataset.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> smiles = ["C", "CC"]
>>> X = np.array(smiles)
>>> y = np.array([1, 0])
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.FeaturizationTransformer(dataset, dc.feat.CircularFingerprint())
>>> dataset = trans.transform(dataset)

__init__(dataset: Dataset | None = None, featurizer: Featurizer | None = None)[source]¶

Initialization of FeaturizationTransformer

Parameters:

dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
featurizer (dc.feat.Featurizer object, optional (default None)) – Featurizer applied to perform transformations.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms arrays of rdkit mols using internal featurizer.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(transformed: ndarray) → ndarray[source]¶

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:: transformed (np.ndarray) – Array which was previously transformed by this class.

Specified Usecase Transformers ¶

CoulombFitTransformer ¶

class CoulombFitTransformer(dataset: Dataset)[source]¶

Performs randomization and binarization operations on batches of Coulomb Matrix features during fit.

Examples

>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> fit_transformers = [dc.trans.CoulombFitTransformer(dataset)]
>>> model = dc.models.MultitaskFitTransformRegressor(n_tasks,
...    [n_features, n_features], batch_size=n_samples, fit_transformers=fit_transformers, n_evals=1)
>>> print(model.n_features)
12

__init__(dataset: Dataset)[source]¶

Initializes CoulombFitTransformer.

Parameters:: dataset (dc.data.Dataset) – Dataset object to be transformed.

realize(X: ndarray) → ndarray[source]¶

Randomize features.

Parameters:: X (np.ndarray) – Features
Returns:: X – Randomized features
Return type:: np.ndarray
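
The docstring does not say how the features are randomized. One scheme often used for Coulomb matrices, assumed here purely for illustration, reorders the atoms by row norm plus random noise and applies the same permutation to rows and columns. `randomize_coulomb` is a hypothetical helper, not the library's implementation.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def randomize_coulomb(M: Matrix, noise: float, rng: random.Random) -> Matrix:
    """Reorder atoms by noisy row norm, permuting rows and columns together."""
    n = len(M)
    norms = [math.sqrt(sum(v * v for v in row)) for row in M]
    noisy = [norms[i] + rng.gauss(0.0, noise) for i in range(n)]
    # Largest noisy norm first; the same order is used for rows and columns,
    # so a symmetric input stays symmetric.
    order = sorted(range(n), key=lambda i: -noisy[i])
    return [[M[i][j] for j in order] for i in order]


# Toy symmetric 3x3 matrix (values are illustrative only).
M = [[36.9, 5.0, 1.0],
     [5.0, 0.5, 0.2],
     [1.0, 0.2, 0.5]]
R = randomize_coulomb(M, noise=1.0, rng=random.Random(0))
print(R)
```

Because rows and columns are permuted together, the result describes the same molecule with a different atom ordering.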

normalize(X: ndarray) → ndarray[source]¶

Normalize features.

Parameters:: X (np.ndarray) – Features
Returns:: X – Normalized features
Return type:: np.ndarray

expand(X: ndarray) → ndarray[source]¶

Binarize features.

Parameters:: X (np.ndarray) – Features
Returns:: X – Binarized features
Return type:: np.ndarray

X_transform(X: ndarray) → ndarray[source]¶

Perform Coulomb Fit transform on features.

Parameters:: X (np.ndarray) – Features
Returns:: X – Transformed features
Return type:: np.ndarray

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Performs randomization and binarization operations on data.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶: Not implemented.

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

IRVTransformer ¶

class IRVTransformer(K: int, n_tasks: int, dataset: Dataset)[source]¶

Transforms ECFP fingerprints into IRV features (K nearest neighbors).

This transformer is required by MultitaskIRVClassifier as a preprocessing step before training.

Examples

Let’s start by defining the parameters of the dataset we’re about to transform.

>>> n_feat = 128
>>> N = 20
>>> n_tasks = 2

Let’s now make our dataset object

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.randint(2, size=(N, n_feat))
>>> y = np.zeros((N, n_tasks))
>>> w = np.ones((N, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w)

And let’s apply our transformer with 10 nearest neighbors.

>>> K = 10
>>> trans = dc.trans.IRVTransformer(K, n_tasks, dataset)
>>> dataset = trans.transform(dataset)

Note

This class requires TensorFlow to be installed.

__init__(K: int, n_tasks: int, dataset: Dataset)[source]¶

Initializes IRVTransformer.

Parameters:

K (int) – Number of nearest neighbours to count
n_tasks (int) – number of tasks
dataset (dc.data.Dataset object) – train_dataset

realize(similarity: ndarray, y: ndarray, w: ndarray) → List[source]¶

Finds the samples with the top K similarity values in the reference dataset.

Parameters:

similarity (np.ndarray) – Similarity values between the target dataset and the reference dataset; should have shape (n_samples_in_target, n_samples_in_reference)
y (np.array) – labels for a single task
w (np.array) – weights for a single task

Returns:

features – List of n_samples arrays, each of shape (2*K,); each array holds K similarity values and the corresponding labels

Return type:

list
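
The selection step for a single target sample can be sketched as follows. This is a simplified illustration: `top_k_features` is a hypothetical helper, the weights that the real method also receives are ignored, and placing the similarities before the labels in the output is an assumption.

```python
from typing import List, Sequence


def top_k_features(similarity_row: Sequence[float],
                   labels: Sequence[float],
                   k: int) -> List[float]:
    """Return the K highest similarities followed by the labels of those neighbours."""
    # Indices of the reference samples, most similar first, truncated to K.
    order = sorted(range(len(similarity_row)),
                   key=lambda j: similarity_row[j],
                   reverse=True)[:k]
    sims = [similarity_row[j] for j in order]
    labs = [labels[j] for j in order]
    return sims + labs  # length 2*K


row = [0.1, 0.9, 0.4, 0.7]      # similarity of one target to four references
labels = [0.0, 1.0, 1.0, 0.0]   # labels of the four references for one task
features = top_k_features(row, labels, k=2)
print(features)  # [0.9, 0.7, 1.0, 0.0]
```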

X_transform(X_target: ndarray) → ndarray[source]¶

Calculates the similarity between the target dataset (X_target) and the reference dataset (X): #(1 in intersection)/#(1 in union)

similarity = (X_target intersect X)/(X_target union X)

Parameters:: X_target (np.ndarray) – Fingerprints of the target dataset; must have the same length as X along the second axis
Returns:: X_target – Features of shape (batch_size, 2*K*n_tasks)
Return type:: np.ndarray
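
The similarity defined above is the Tanimoto (Jaccard) coefficient on binary fingerprints. A minimal sketch in plain Python follows. The function names are hypothetical, and returning 0.0 when both fingerprints are all zeros is an assumption made to avoid division by zero.

```python
from typing import List, Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """#(1 in intersection) / #(1 in union) for two binary fingerprints."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 0.0


def similarity_matrix(X_target: Sequence[Sequence[int]],
                      X_ref: Sequence[Sequence[int]]) -> List[List[float]]:
    """One row per target sample, one column per reference sample."""
    return [[tanimoto(t, r) for r in X_ref] for t in X_target]


X_ref = [[1, 1, 0, 0], [0, 1, 1, 1]]
X_target = [[1, 1, 1, 0]]
print(similarity_matrix(X_target, X_ref))  # [[0.6666666666666666, 0.5]]
```

The resulting matrix has shape (n_samples_in_target, n_samples_in_reference), which is the shape realize expects for its similarity argument.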

static matrix_mul(X1, X2, shard_size=5000)[source]¶: Calculates the matrix product of large matrices. X1 and X2 are sliced into pieces of shard_size rows (columns), multiplied together, and concatenated to the proper size.
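
The sharding idea can be sketched in plain Python: row blocks of X1 are multiplied with column blocks of X2, and the partial products are stitched back together. `matmul` and `sharded_matmul` are hypothetical helpers written for this illustration; the library method operates on arrays and may differ in detail.

```python
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain dense product of A (n x k) and B (k x m)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def sharded_matmul(X1: Matrix, X2: Matrix, shard_size: int = 5000) -> Matrix:
    """Compute X1 @ X2 block by block so only small pieces are multiplied at once."""
    n_cols = len(X2[0])
    result: Matrix = []
    for r in range(0, len(X1), shard_size):
        row_block = X1[r:r + shard_size]
        out_rows: Matrix = [[] for _ in row_block]
        for c in range(0, n_cols, shard_size):
            # Slice shard_size columns out of every row of X2.
            col_block = [row[c:c + shard_size] for row in X2]
            partial = matmul(row_block, col_block)
            # Concatenate partial results along the column axis.
            for out_row, part_row in zip(out_rows, partial):
                out_row.extend(part_row)
        # Concatenate finished row blocks along the row axis.
        result.extend(out_rows)
    return result


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]
print(sharded_matmul(A, B, shard_size=2))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```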

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → DiskDataset | NumpyDataset[source]¶

Transforms a given dataset

Parameters:

dataset (Dataset) – Dataset to transform
parallel (bool, optional, (default False)) – Whether to parallelize this transformation. Currently ignored.
out_dir (str, optional (default None)) – Directory to write resulting dataset.

Returns:

DiskDataset or NumpyDataset
Dataset object that is transformed.

untransform(z: ndarray) → ndarray[source]¶: Not implemented.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

DAGTransformer ¶

class DAGTransformer(max_atoms: int = 50)[source]¶

Performs transform from ConvMol adjacency lists to DAG calculation orders

This transformer is used by DAGModel before training to transform its inputs to the correct shape. This expansion turns a molecule with n atoms into n DAGs, each with root at a different atom in the molecule.

Examples

Let’s transform a small dataset of molecules.

>>> N = 2  # one label per molecule featurized below
>>> n_feat = 5
>>> import numpy as np
>>> feat = dc.feat.ConvMolFeaturizer()
>>> X = feat(["C", "CC"])
>>> y = np.random.rand(N)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.DAGTransformer(max_atoms=5)
>>> dataset = trans.transform(dataset)

__init__(max_atoms: int = 50)[source]¶

Initializes DAGTransformer.

Parameters:: max_atoms (int, optional (Default 50)) – Maximum number of atoms to allow

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z: ndarray) → ndarray[source]¶: Not implemented.

UG_to_DAG(sample: ConvMol) → List[source]¶

This function generates the DAGs for a molecule

Parameters:: sample (ConvMol) – Molecule to transform
Returns:: List of parent adjacency matrices
Return type:: List
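
The idea of turning one molecule with n atoms into n rooted DAGs can be sketched with a breadth-first search from every atom. This is a conceptual illustration only: `rooted_dags` is a hypothetical helper that returns parent lists as dictionaries for a connected graph, whereas the method above returns parent adjacency matrices.

```python
from collections import deque
from typing import Dict, List


def rooted_dags(adjacency: List[List[int]]) -> List[Dict[int, List[int]]]:
    """For every atom, orient the molecular graph towards that atom as root.

    Returns one mapping per root: atom index -> list of its parents, where a
    parent is a neighbour that is one step closer to the root.
    """
    dags = []
    for root in range(len(adjacency)):
        # Breadth-first search gives each atom its distance from the root.
        depth = {root: 0}
        queue = deque([root])
        while queue:
            atom = queue.popleft()
            for nb in adjacency[atom]:
                if nb not in depth:
                    depth[nb] = depth[atom] + 1
                    queue.append(nb)
        parents = {
            atom: [nb for nb in adjacency[atom] if depth[nb] == depth[atom] - 1]
            for atom in depth
        }
        dags.append(parents)
    return dags


# Chain of three heavy atoms: 0 - 1 - 2
chain = [[1], [0, 2], [1]]
dags = rooted_dags(chain)
for root, dag in enumerate(dags):
    print(root, dag)
```

Each root yields a different orientation of the same bonds, which is why a molecule with n atoms produces n DAGs.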

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

RxnSplitTransformer ¶

class RxnSplitTransformer(sep_reagent: bool = True, dataset: Dataset | None = None)[source]¶

Splits the reaction SMILES input into the source and target strings required for machine translation tasks.

The input is expected to be in the form reactant>reagent>product. The source string would be reactants>reagents and the target string would be the products.

The transformer can also merge the reagents into the reactants for a mixed training mode. During mixed training, the source string is transformed from reactants>reagents to reactants.reagents>. This behavior is controlled by sep_reagent: with the default sep_reagent=True the reagents stay separated from the reactants, and with sep_reagent=False they are mixed in.

Examples

>>> # With sep_reagent=True, the reagents are kept separate from the reactants.
>>> import numpy as np
>>> from deepchem.trans.transformers import RxnSplitTransformer
>>> reactions = np.array(["CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]>CC(C)CC(=O)c1ccc(O)nc1","CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>>CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21"], dtype=object)
>>> trans = RxnSplitTransformer(sep_reagent=True)
>>> split_reactions = trans.transform_array(X=reactions, y=np.array([]), w=np.array([]), ids=np.array([]))
>>> split_reactions
(array([['CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]',
        'CC(C)CC(=O)c1ccc(O)nc1'],
       ['CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>',
        'CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21']], dtype='<U51'), array([], dtype=float64), array([], dtype=float64), array([], dtype=float64))

With sep_reagent=False (mixed training), the reagents are merged into the reactants:

>>> trans_disable = RxnSplitTransformer(sep_reagent=False)
>>> split_reactions = trans_disable.transform_array(X=reactions, y=np.array([]), w=np.array([]), ids=np.array([]))
>>> split_reactions
(array([['CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1.C1CCOC1.[Cl-]>',
        'CC(C)CC(=O)c1ccc(O)nc1'],
       ['CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>',
        'CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21']], dtype='<U51'), array([], dtype=float64), array([], dtype=float64), array([], dtype=float64))

Note

This class only transforms the feature field of a reaction dataset like USPTO.
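
The string handling behind the example outputs above can be sketched in plain Python for a single reaction. `split_reaction` is a hypothetical helper that reproduces those outputs under the assumption that the input contains exactly two `>` separators; it is not the library's code.

```python
from typing import Tuple


def split_reaction(rxn: str, sep_reagent: bool = True) -> Tuple[str, str]:
    """Split 'reactants>reagents>products' into (source, target) strings."""
    reactants, reagents, products = rxn.split(">")
    if sep_reagent:
        # Keep reagents in their own field.
        source = f"{reactants}>{reagents}"
    elif reagents:
        # Mixed mode: reagents are appended to the reactants.
        source = f"{reactants}.{reagents}>"
    else:
        # Nothing to mix in when the reagent field is empty.
        source = f"{reactants}>"
    return source, products


rxn = "CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]>CC(C)CC(=O)c1ccc(O)nc1"
print(split_reaction(rxn, sep_reagent=True))
print(split_reaction(rxn, sep_reagent=False))
```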

__init__(sep_reagent: bool = True, dataset: Dataset | None = None)[source]¶

Initializes the Reaction split Transformer.

Parameters:

sep_reagent (bool, optional (default True)) – To separate the reagent and reactants for training.
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed.

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features(the reactions)
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]¶: Not Implemented.

Base Transformer (for developers)¶

The dc.trans.Transformer class is the abstract parent class for all transformers. This class should never be directly initialized, but contains a number of useful method implementations.

class Transformer(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Dataset | None = None)[source]¶

Abstract base class for different data transformation techniques.

A transformer is an object that applies a transformation to a given dataset. Think of a transformation as a mathematical operation which makes the source dataset more amenable to learning. For example, one transformer could normalize the features for a dataset (ensuring they have zero mean and unit standard deviation). Another transformer could for example threshold values in a dataset so that values outside a given range are truncated. Yet another transformer could act as a data augmentation routine, generating multiple different images from each source datapoint (a transformation need not necessarily be one to one).

Transformers are designed to be chained, since data pipelines often apply several different transformations to a dataset in sequence. Transformers are also designed to be scalable and can be applied to large dc.data.Dataset objects. Note that Transformers are not usually thread-safe, so you will have to be careful when processing very large datasets.
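
Chaining can be sketched with toy objects that expose the same transform_array(X, y, w, ids) shape of interface. `ShiftX`, `ClipY`, and `apply_chain` are hypothetical names used only for this illustration, and plain lists stand in for NumPy arrays.

```python
from typing import List, Tuple

Arrays = Tuple[List[float], List[float], List[float], List[str]]


class ShiftX:
    """Hypothetical transformer: subtracts a constant from every feature."""

    def __init__(self, offset: float) -> None:
        self.offset = offset

    def transform_array(self, X, y, w, ids) -> Arrays:
        return [x - self.offset for x in X], y, w, ids


class ClipY:
    """Hypothetical transformer: truncates labels to the range [low, high]."""

    def __init__(self, low: float, high: float) -> None:
        self.low, self.high = low, high

    def transform_array(self, X, y, w, ids) -> Arrays:
        return X, [min(max(v, self.low), self.high) for v in y], w, ids


def apply_chain(transformers, X, y, w, ids) -> Arrays:
    """Feed the output of each transformer into the next one, in order."""
    for t in transformers:
        X, y, w, ids = t.transform_array(X, y, w, ids)
    return X, y, w, ids


X, y, w, ids = apply_chain(
    [ShiftX(1.0), ClipY(0.0, 1.0)],
    X=[1.0, 2.0, 3.0], y=[-0.5, 0.5, 1.5], w=[1.0, 1.0, 1.0], ids=["a", "b", "c"],
)
print(X)  # [0.0, 1.0, 2.0]
print(y)  # [0.0, 0.5, 1.0]
```

Because every step takes and returns the same four arrays, the order of the list is the only thing that defines the pipeline.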

This class is an abstract superclass that isn’t meant to be directly instantiated. Instead, you will want to instantiate one of the subclasses of this class in order to perform concrete transformations.

__init__(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Dataset | None = None)[source]¶

Initializes transformation based on dataset statistics.

Parameters:

transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
transform_w (bool, optional (default False)) – Whether to transform w
transform_ids (bool, optional (default False)) – Whether to transform ids
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform(dataset: Dataset, parallel: bool = False, out_dir: str | None = None, **kwargs) → Dataset[source]¶

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:

dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns:

A newly transformed Dataset object

Return type:

Dataset

transform_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transform the data in a set of (X, y, w, ids) arrays.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

transform_on_array(X: ndarray, y: ndarray, w: ndarray, ids: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray][source]¶

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:

X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.

Returns:

Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids

untransform(transformed: ndarray) → ndarray[source]¶

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:: transformed (np.ndarray) – Array which was previously transformed by this class.
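
Why untransform is only sometimes defined can be seen with two toy transformations. The sketch below is illustrative only: `ZScore` and `clip` are hypothetical, not DeepChem classes. A normalization stores the statistics needed to invert itself, while a truncation maps different inputs to the same output and therefore cannot be reversed.

```python
import math
from statistics import mean, pstdev
from typing import List


class ZScore:
    """Hypothetical invertible transformer: keeps the statistics needed to undo itself."""

    def __init__(self, values: List[float]) -> None:
        self.mu = mean(values)
        self.sigma = pstdev(values)

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mu) / self.sigma for v in values]

    def untransform(self, transformed: List[float]) -> List[float]:
        return [t * self.sigma + self.mu for t in transformed]


def clip(v: float, low: float = 0.0, high: float = 1.0) -> float:
    """Truncation is not one-to-one, so it has no inverse."""
    return min(max(v, low), high)


values = [2.0, 4.0, 6.0]
z = ZScore(values)
restored = z.untransform(z.transform(values))
print(all(math.isclose(a, b) for a, b in zip(restored, values)))  # True

# Two different inputs collapse onto the same output, so the original is lost.
print(clip(-1.0) == clip(0.0))  # True
```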