Transformers

DeepChem dc.trans.Transformer objects are another core building block of DeepChem programs. Machine learning systems are often delicate: they need their inputs and outputs to fit within a pre-specified range or follow a clean mathematical distribution. Real data, of course, is wild and hard to control. What do you do if you have a crazy dataset and need to bring its statistics to heel? Fear not, for you have Transformer objects.

Transformer

The dc.trans.Transformer class is the abstract parent class for all transformers. This class should never be instantiated directly, but it contains a number of useful method implementations.

class Transformer(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]

Abstract base class for different data transformation techniques.

A transformer is an object that applies a transformation to a given dataset. Think of a transformation as a mathematical operation which makes the source dataset more amenable to learning. For example, one transformer could normalize the features for a dataset (ensuring they have zero mean and unit standard deviation). Another transformer could for example threshold values in a dataset so that values outside a given range are truncated. Yet another transformer could act as a data augmentation routine, generating multiple different images from each source datapoint (a transformation need not necessarily be one to one).

Transformers are designed to be chained, since data pipelines often apply multiple different transformations to a dataset. Transformers are also designed to be scalable and can be applied to large dc.data.Dataset objects. Note that transformers are not usually thread-safe, so you will have to be careful when processing very large datasets.
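As a rough illustration of chaining, the sketch below uses two hypothetical stand-in classes (Shift and Scale, which are not part of DeepChem) that mimic the transform_array/untransform interface on plain NumPy arrays. The transformations are applied in order and undone in reverse order.

```python
import numpy as np


class Shift:
    """Hypothetical transformer: subtracts a constant from X."""

    def __init__(self, offset):
        self.offset = offset

    def transform_array(self, X, y, w, ids):
        return X - self.offset, y, w, ids

    def untransform(self, z):
        return z + self.offset


class Scale:
    """Hypothetical transformer: divides X by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def transform_array(self, X, y, w, ids):
        return X / self.factor, y, w, ids

    def untransform(self, z):
        return z * self.factor


X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.zeros((2, 1))
w = np.ones((2, 1))
ids = np.arange(2)

chain = [Shift(1.0), Scale(2.0)]

# Apply each transformation in order.
Xt = X
for t in chain:
    Xt, y, w, ids = t.transform_array(Xt, y, w, ids)

# Undo the chain in reverse order to recover the original features.
X_back = Xt
for t in reversed(chain):
    X_back = t.untransform(X_back)
```

Undoing in reverse order matters because each untransform only knows how to invert its own step.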

This class is an abstract superclass that isn’t meant to be directly instantiated. Instead, you will want to instantiate one of the subclasses of this class in order to perform concrete transformations.

__init__(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]

Initializes transformation based on dataset statistics.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • transform_w (bool, optional (default False)) – Whether to transform w

  • transform_ids (bool, optional (default False)) – Whether to transform ids

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters
  • dataset (dc.data.Dataset) – Dataset object to be transformed.

  • parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.

  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns

a newly constructed Dataset object

transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

transform_on_array(X, y, w, ids)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans – Array with the transformation reversed.

MinMaxTransformer

class MinMaxTransformer(transform_X=False, transform_y=False, dataset=None)[source]

Ensure each value rests between 0 and 1 by using the min and max.

MinMaxTransformer transforms the dataset by shifting each axis of X or y (depending on whether transform_X or transform_y is True), except the first one, by the minimum value along the axis and then dividing the result by the range (maximum value - minimum value) along the axis. This ensures that each axis lies between 0 and 1. In multi-task learning, it ensures that each task is given equal importance.

Given original array A, the transformed array can be written as:

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A_min = np.min(A, axis=0)
>>> A_max = np.max(A, axis=0)
>>> A_t = np.nan_to_num((A - A_min)/(A_max - A_min))

Examples

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.MinMaxTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises

ValueError

__init__(transform_X=False, transform_y=False, dataset=None)[source]

Initialization of MinMax transformer.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform(dataset, parallel=False)[source]

Transforms the dataset.

Parameters
  • dataset (dc.data.Dataset) – Dataset object to be transformed.

  • parallel (bool, optional (default False)) – At present this argument is ignored.

  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns

a newly constructed Dataset object

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of ids.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Undo transformation on provided data.

Parameters

z (np.ndarray) – Transformed X or y array
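Given the formula shown earlier for the forward transform, undoing it amounts to rescaling by the range and adding the minimum back. The NumPy sketch below illustrates that arithmetic on a plain array. It is not DeepChem's implementation, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 3))

# Forward transform: shift by the column minimum, divide by the column range.
A_min = np.min(A, axis=0)
A_max = np.max(A, axis=0)
A_t = np.nan_to_num((A - A_min) / (A_max - A_min))

# Inverse transform: multiply by the range and add the minimum back.
A_back = A_t * (A_max - A_min) + A_min
```

The inverse needs the same minimum and maximum that were used in the forward pass, which is why they have to be stored alongside the transformed data.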

NormalizationTransformer

class NormalizationTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Normalizes dataset to have zero mean and unit standard deviation

This transformer transforms datasets to have zero mean and unit standard deviation.
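The arithmetic behind this kind of normalization can be sketched in plain NumPy as follows. This is an illustration under the assumption of per-column statistics, not DeepChem's implementation, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=3.0, size=(100, 1))

# Per-column statistics computed from the data.
y_mean = y.mean(axis=0)
y_std = y.std(axis=0)

# Normalize: subtract the mean, divide by the standard deviation.
y_t = (y - y_mean) / y_std

# Undo: multiply by the standard deviation, add the mean back.
y_back = y_t * y_std + y_mean
```

Undoing the normalization is what lets predictions made in the normalized space be reported in the original units.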

Examples

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises

ValueError

__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Initialize normalization transformation.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • transform_w (bool, optional (default False)) – Whether to transform w

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters
  • dataset (dc.data.Dataset) – Dataset object to be transformed.

  • parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.

  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.

Returns

a newly constructed Dataset object

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of ids.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Undo transformation on provided data.

Parameters

z (np.ndarray) – Array to transform back

Returns

z_out – Array with normalization undone.

Return type

np.ndarray

untransform_grad(grad, tasks)[source]

DEPRECATED. DO NOT USE.

ClippingTransformer

class ClippingTransformer(transform_X=False, transform_y=False, dataset=None, x_max=5.0, y_max=500.0)[source]

Clip large values in datasets.

Examples

Let’s clip values from a synthetic dataset

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.ClippingTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)

__init__(transform_X=False, transform_y=False, dataset=None, x_max=5.0, y_max=500.0)[source]

Initialize clipping transformation.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • dataset (dc.data.Dataset object, optional) – Dataset to be transformed

  • x_max (float, optional) – Maximum absolute value for X

  • y_max (float, optional) – Maximum absolute value for y

Note

This transformer can transform X and y jointly, but does not transform w.

Raises

ValueError
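Since x_max and y_max are described as maximum absolute values, one reading is that values are clipped to the symmetric interval [-x_max, x_max]. The NumPy sketch below works under that assumption and is not taken from DeepChem's source.

```python
import numpy as np

x_max = 5.0
X = np.array([[-7.5, 0.3, 5.0], [12.0, -4.9, 2.0]])

# Clip every entry so that its absolute value does not exceed x_max.
X_clipped = np.clip(X, -x_max, x_max)
```

Because clipped values cannot be recovered from the result, this operation has no exact inverse.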

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights

  • ids (np.ndarray) – Array of ids.

Returns

  • X (np.ndarray) – Transformed features

  • y (np.ndarray) – Transformed labels

  • w (np.ndarray) – Transformed weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans – Array with the transformation reversed.

LogTransformer

class LogTransformer(transform_X=False, transform_y=False, features=None, tasks=None, dataset=None)[source]

Computes a logarithmic transformation

This transformer computes the transformation given by

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A = np.log(A + 1)

This assumes that tasks/features are not specified. If they are specified, the transformation is applied only to the specified tasks/features.
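To make the restricted case concrete, here is a NumPy sketch that applies log(x + 1) to a chosen subset of feature columns and then undoes it with exp(z) - 1. The column indices are a hypothetical choice, and the code is an illustration rather than DeepChem's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 4))
features = [0, 2]  # hypothetical choice of columns to transform

# Apply log(x + 1) only to the selected columns.
X_t = X.copy()
X_t[:, features] = np.log(X[:, features] + 1)

# Undo with exp(z) - 1 on the same columns.
X_back = X_t.copy()
X_back[:, features] = np.exp(X_t[:, features]) - 1
```

Columns that are not listed pass through unchanged.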

Examples

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.LogTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises

ValueError – if transform_X and transform_y are both set.

__init__(transform_X=False, transform_y=False, features=None, tasks=None, dataset=None)[source]

Initialize log transformer.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

  • features (list[int]) – List of feature indices to transform

  • tasks (list[str]) – List of task names to transform.

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Undo transformation on provided data.

Parameters

z (np.ndarray) – Transformed X or y array

BalancingTransformer

class BalancingTransformer(dataset: deepchem.data.datasets.Dataset)[source]

Balance positive and negative (or multiclass) example weights.

This class balances the sample weights so that the sum of all example weights from all classes is the same. This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.
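One weighting scheme that satisfies this property is to scale each class by the ratio of the largest class count to its own count. The NumPy sketch below shows that scheme for a single task. It is an assumption made for illustration, and the exact weights DeepChem assigns may differ.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives
w = np.ones_like(y, dtype=float)

classes, counts = np.unique(y, return_counts=True)

# Scale each class so its total weight matches that of the largest class.
class_weight = counts.max() / counts
w_balanced = w.copy()
for c, cw in zip(classes, class_weight):
    w_balanced[y == c] *= cw

# Total weight per class after balancing.
totals = [w_balanced[y == c].sum() for c in classes]
```

Only w changes here; X and y are left as they are.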

Examples

Here’s an example for a binary dataset.

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

And here’s a multiclass dataset example.

>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

See also

deepchem.trans.DuplicateBalancingTransformer

Balance by duplicating samples.

Note

This transformer is only meaningful for classification datasets where y takes on a limited set of values. This class can only transform w and does not transform X or y.

Raises
  • ValueError

  • ValueError

__init__(dataset: deepchem.data.datasets.Dataset)[source]

Initializes transformation based on dataset statistics.

Parameters

dataset (dc.data.Dataset object) – Dataset to be transformed

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

DuplicateBalancingTransformer

class DuplicateBalancingTransformer(dataset: deepchem.data.datasets.Dataset)[source]

Balance binary or multiclass datasets by duplicating rarer class samples.

This class balances a dataset by duplicating samples of the rarer class so that the sum of all example weights from all classes is the same (up to integer rounding, of course). This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.

This class differs from BalancingTransformer in that it actually duplicates rarer class samples rather than just increasing their sample weights. This may be more friendly for models that are numerically fragile and can’t handle imbalanced example weights.
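The idea of balancing by duplication can be sketched in NumPy with np.repeat. The rounding rule used below (copies = round(largest class count / class count)) is an assumption made for illustration and is not necessarily the rule DeepChem applies.

```python
import numpy as np

X = np.arange(10).reshape(10, 1).astype(float)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
ids = np.arange(10)

classes, counts = np.unique(y, return_counts=True)

# Integer number of copies per class, relative to the largest class.
copies_per_class = {
    int(c): int(round(counts.max() / n)) for c, n in zip(classes, counts)
}
repeats = np.array([copies_per_class[int(label)] for label in y])

# np.repeat duplicates row i exactly repeats[i] times.
X_dup = np.repeat(X, repeats, axis=0)
y_dup = np.repeat(y, repeats, axis=0)
ids_dup = np.repeat(ids, repeats, axis=0)
```

Unlike reweighting, this changes the number of rows, so X, y, and ids all have to be expanded together.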

Examples

Here’s an example for a binary dataset.

>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> import deepchem as dc
>>> import numpy as np
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

And here’s a multiclass dataset example.

>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)

See also

deepchem.trans.BalancingTransformer

Balance by changing sample weights.

Note

This transformer is only well-defined for singletask datasets. (Since examples are actually duplicated, there’s no meaningful way to duplicate across multiple tasks in a way that preserves the balance.)

This transformer is only meaningful for classification datasets where y takes on a limited set of values. This class transforms all of X, y, w, ids.

Raises

ValueError

__init__(dataset: deepchem.data.datasets.Dataset)[source]

Initializes transformation based on dataset statistics.

Parameters

dataset (dc.data.Dataset object) – Dataset to be transformed

transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idtrans (np.ndarray) – Transformed array of identifiers

CDFTransformer

class CDFTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, bins: int = 2)[source]

Histograms the data and assigns values based on sorted list.

Acts like a Cumulative Distribution Function (CDF). Given a dataset of samples from a continuous distribution, it computes the CDF of this dataset and replaces values with their corresponding CDF values.
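As a simplified illustration of replacing values with CDF values, the NumPy sketch below computes a rank-based empirical CDF per feature column. It deliberately omits the histogram binning controlled by bins, so it should be read as a sketch of the idea rather than DeepChem's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
n_samples = X.shape[0]

# Rank of each value within its column (0 = smallest).
ranks = np.argsort(np.argsort(X, axis=0), axis=0)

# Empirical CDF value: fraction of samples less than or equal to each value.
X_cdf = (ranks + 1) / n_samples
```

After this step every column is spread evenly over (0, 1], whatever the shape of the original distribution.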

Examples

Let’s look at an example where we transform only features.

>>> N = 10
>>> n_feat = 5
>>> n_bins = 100

Note that we’re using 100 bins for our CDF histogram

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.randint(2, size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_X=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)

Note that you can apply this transformation to y as well

>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_y=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)

__init__(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, bins: int = 2)[source]

Initialize this transformer.

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

  • bins (int, optional (default 2)) – Number of bins to use when computing histogram.

transform_array(X, y, w, ids)[source]

Performs CDF transform on data.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Undo transformation on provided data.

Note that this transformation is only undone for y.

Parameters

z (np.ndarray) – Transformed y array

PowerTransformer

class PowerTransformer(transform_X=False, transform_y=False, dataset=None, powers=[1])[source]

Takes power n transforms of the data based on an input vector.

Computes the specified powers of the dataset. This can be useful if you’re looking to add higher order features of the form x_i^2, x_i^3 etc. to your dataset.
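One way to picture the result is as the original features raised to each requested power and stacked side by side. The NumPy sketch below shows that layout. The column ordering is an assumption made for illustration.

```python
import numpy as np

X = np.array([[1.0, 4.0], [9.0, 16.0]])
powers = [1, 2, 0.5]

# Stack X**p for each requested power side by side along the feature axis.
X_t = np.hstack([np.power(X, p) for p in powers])
```

With n_feat input features and len(powers) powers, the output has n_feat * len(powers) columns.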

Examples

Let’s look at an example where we transform only X.

>>> N = 10
>>> n_feat = 5
>>> powers = [1, 2, 0.5]

So in this example, we’re taking the identity, squares, and square roots. Now let’s construct our matrices

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_X=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)

Let’s now look at an example where we transform y. Note that the y transform expands out the feature dimensions of y the same way it does for X so this transform is only well defined for singletask datasets.

>>> import numpy as np
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.rand(N)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_y=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)

__init__(transform_X=False, transform_y=False, dataset=None, powers=[1])[source]

Initialize this transformer

Parameters
  • transform_X (bool, optional (default False)) – Whether to transform X

  • transform_y (bool, optional (default False)) – Whether to transform y

  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed. Note that this argument is ignored since PowerTransformer doesn’t require it to be specified.

  • powers (list[int], optional (default [1])) – The list of powers of features/labels to compute.

transform_array(X, y, w, ids)[source]

Performs power transform on data.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Undo transformation on provided data.

Parameters

z (np.ndarray) – Transformed y array

CoulombFitTransformer

class CoulombFitTransformer(dataset)[source]

Performs randomization and binarization operations on batches of Coulomb Matrix features during fit.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> fit_transformers = [dc.trans.CoulombFitTransformer(dataset)]
>>> model = dc.models.MultitaskFitTransformRegressor(n_tasks,
...    [n_features, n_features], batch_size=n_samples, fit_transformers=fit_transformers, n_evals=1)
>>> print(model.n_features)
12

__init__(dataset)[source]

Initializes CoulombFitTransformer.

Parameters

dataset (dc.data.Dataset object) –

realize(X)[source]

Randomize features.

Parameters

X (np.ndarray) – Features

Returns

X – Randomized features

Return type

np.ndarray

normalize(X)[source]

Normalize features.

Parameters

X (np.ndarray) – Features

Returns

X – Normalized features

Return type

np.ndarray

expand(X)[source]

Binarize features.

Parameters

X (np.ndarray) – Features

Returns

X – Binarized features

Return type

np.ndarray

X_transform(X)[source]

Perform Coulomb Fit transform on features.

Parameters

X (np.ndarray) – Features

Returns

X – Transformed features

Return type

np.ndarray

transform_array(X, y, w, ids)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans – Array with the transformation reversed.

IRVTransformer

class IRVTransformer(K, n_tasks, dataset)[source]

Performs transform from ECFP to IRV features (K nearest neighbors).

This transformer is required by MultitaskIRVClassifier as a preprocessing step before training.

Examples

Let’s start by defining the parameters of the dataset we’re about to transform.

>>> n_feat = 128
>>> N = 20
>>> n_tasks = 2

Let’s now make our dataset object

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.randint(2, size=(N, n_feat))
>>> y = np.zeros((N, n_tasks))
>>> w = np.ones((N, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w)

And let’s apply our transformer with 10 nearest neighbors.

>>> K = 10
>>> trans = dc.trans.IRVTransformer(K, n_tasks, dataset)
>>> dataset = trans.transform(dataset)

Note

This class requires TensorFlow to be installed.

__init__(K, n_tasks, dataset)[source]

Initializes IRVTransformer.

Parameters
  • K (int) – Number of nearest neighbors to count

  • n_tasks (int) – number of tasks

  • dataset (dc.data.Dataset object) – train_dataset

realize(similarity, y, w)[source]

Find the samples with the top K similarity values in the reference dataset.

Parameters
  • similarity (np.ndarray) – Similarity values between the target dataset and the reference dataset. Should have shape (n_samples_in_target, n_samples_in_reference).

  • y (np.array) – labels for a single task

  • w (np.array) – weights for a single task

Returns

features – A list of n_samples arrays of size (2*K,). Each array includes K similarity values and the corresponding labels.

Return type

list

X_transform(X_target)[source]

Calculate the similarity between the target dataset (X_target) and the reference dataset (X): #(1 in intersection)/#(1 in union)

similarity = (X_target intersect X)/(X_target union X)
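For binary fingerprints, the intersection-over-union ratio above can be computed with a dot product. The NumPy sketch below shows one way to do so and then picks the K most similar reference rows for each target. The array names and toy data are hypothetical, and the sketch assumes no pair of fingerprints is all zeros (which would make the union zero).

```python
import numpy as np

# Binary fingerprints: 2 target molecules, 3 reference molecules, 4 bits each.
X_target = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 1]])
X_ref = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1]])

# Bits set in both fingerprints: a dot product counts shared 1s.
intersection = X_target @ X_ref.T

# Bits set in either fingerprint: |A| + |B| - |A and B|.
n_target = X_target.sum(axis=1, keepdims=True)
n_ref = X_ref.sum(axis=1, keepdims=True).T
union = n_target + n_ref - intersection

similarity = intersection / union

# Indices of the K most similar reference molecules for each target.
K = 2
top_k = np.argsort(-similarity, axis=1)[:, :K]
```

The resulting similarity matrix has one row per target sample and one column per reference sample.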

Parameters

X_target (np.ndarray) – Fingerprints of the target dataset. Should have the same length as X along the second axis.

Returns

X_target – features of size (batch_size, 2*K*n_tasks)

Return type

np.ndarray

static matrix_mul(X1, X2, shard_size=5000)[source]

Calculate matrix multiplication for big matrices. X1 and X2 are sliced into pieces with shard_size rows (columns), which are then multiplied together and concatenated to the proper size.
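The general technique of multiplying in blocks can be sketched as follows. The helper name sharded_matmul is hypothetical and the code is a NumPy illustration of the idea, not the body of matrix_mul.

```python
import numpy as np


def sharded_matmul(X1, X2, shard_size=5000):
    """Multiply X1 @ X2 one block at a time (hypothetical helper)."""
    n_rows, n_cols = X1.shape[0], X2.shape[1]
    out = np.zeros((n_rows, n_cols), dtype=np.result_type(X1, X2))
    for i in range(0, n_rows, shard_size):
        for j in range(0, n_cols, shard_size):
            # Each output block needs only a row slice of X1
            # and a column slice of X2.
            out[i:i + shard_size, j:j + shard_size] = (
                X1[i:i + shard_size] @ X2[:, j:j + shard_size])
    return out


rng = np.random.default_rng(0)
A = rng.random((7, 4))
B = rng.random((4, 5))
C = sharded_matmul(A, B, shard_size=3)
```

Working block by block bounds the size of each intermediate product, which can help when the full matrices are large.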

transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms a given dataset

Parameters
  • dataset (Dataset) – Dataset to transform

  • parallel (bool, optional, (default False)) – Whether to parallelize this transformation. Currently ignored.

  • out_dir (str, optional (default None)) – Directory to write resulting dataset.

Returns

Dataset object that is transformed.

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans – Array with the transformation reversed.

DAGTransformer

class DAGTransformer(max_atoms=50)[source]

Performs transform from ConvMol adjacency lists to DAG calculation orders

This transformer is used by DAGModel before training to transform its inputs to the correct shape. This expansion turns a molecule with n atoms into n DAGs, each with root at a different atom in the molecule.

Examples

Let’s transform a small dataset of molecules.

>>> import numpy as np
>>> import deepchem as dc
>>> smiles = ["C", "CC"]
>>> feat = dc.feat.ConvMolFeaturizer()
>>> X = feat(smiles)
>>> y = np.random.rand(len(smiles))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.DAGTransformer(max_atoms=5)
>>> dataset = trans.transform(dataset)

__init__(max_atoms=50)[source]

Initializes DAGTransformer.

Parameters

max_atoms (int, optional (default 50)) – Maximum number of atoms to allow

transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans (np.ndarray) – Array with the transformation reversed.

UG_to_DAG(sample: deepchem.feat.mol_graphs.ConvMol) → List[source]

This function generates the DAGs for a molecule

Parameters

sample (ConvMol) – Molecule to transform

Returns

List of parent adjacency matrices
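The expansion of one molecule into one DAG per atom can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the DeepChem routine: it takes plain adjacency lists instead of a ConvMol, assumes a connected graph, ignores bonds between atoms at equal distance from the root, and returns lists rather than padded matrices. The helper name dags_from_adjacency is hypothetical.

```python
from collections import deque


def dags_from_adjacency(adjacency):
    """Orient an undirected graph into one DAG per root atom.

    Hypothetical helper. For every root, returns a list of (atom, feeders)
    pairs ordered so that atoms far from the root come first and the root
    comes last. `feeders` are the neighbors one step farther from the root.
    """
    dags = []
    for root in range(len(adjacency)):
        # Breadth-first search gives each atom its distance from the root.
        depth = {root: 0}
        queue = deque([root])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in depth:
                    depth[neighbor] = depth[atom] + 1
                    queue.append(neighbor)
        # Calculation order: farthest atoms first, root last.
        order = sorted(depth, key=lambda a: -depth[a])
        dag = [(atom, sorted(n for n in adjacency[atom] if depth[n] > depth[atom]))
               for atom in order]
        dags.append(dag)
    return dags


# A chain of three heavy atoms: 0 - 1 - 2.
chain = [[1], [0, 2], [1]]
dags = dags_from_adjacency(chain)
```

For the three-atom chain this yields three DAGs, one rooted at each atom; in the DAG rooted at atom 1, both end atoms feed into the root.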

ImageTransformer

class ImageTransformer(size)[source]

Converts an image into an array with width, height, and channel dimensions.

__init__(size)[source]

Initializes the transformer with the given image size.

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

ANITransformer

class ANITransformer(max_atoms=23, radial_cutoff=4.6, angular_cutoff=3.1, radial_length=32, angular_length=8, atom_cases=[1, 6, 7, 8, 16], atomic_number_differentiated=True, coordinates_in_bohr=True)[source]

Performs transform from 3D coordinates to ANI symmetry functions

Note

This class requires TensorFlow to be installed.

__init__(max_atoms=23, radial_cutoff=4.6, angular_cutoff=3.1, radial_length=32, angular_length=8, atom_cases=[1, 6, 7, 8, 16], atomic_number_differentiated=True, coordinates_in_bohr=True)[source]

Only X can be transformed

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w, ids) arrays.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters

z (np.ndarray) – Array which was previously transformed by this class.

Returns

ztrans (np.ndarray) – Array with the transformation reversed.

build()[source]

Builds the TensorFlow computation graph for the transform.

distance_matrix(coordinates, flags)[source]

Generate distance matrix

distance_cutoff(d, cutoff, flags)[source]

Generate distance matrix with trainable cutoff

radial_symmetry(d_cutoff, d, atom_numbers)[source]

Radial Symmetry Function

angular_symmetry(d_cutoff, d, atom_numbers, coordinates)[source]

Angular Symmetry Function
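As a rough guide to what the distance, cutoff, and radial symmetry steps compute, the NumPy sketch below uses the cosine cutoff and Gaussian radial term common to Behler-Parrinello-style descriptors. It is an assumption-laden illustration, not the TensorFlow graph built by this class: the values of eta and r_s are made up, and padding flags and atomic-number differentiation are omitted.

```python
import numpy as np


def distance_matrix(coordinates):
    """Pairwise Euclidean distances for an (n_atoms, 3) coordinate array."""
    diff = coordinates[:, None, :] - coordinates[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cosine_cutoff(d, cutoff):
    """Smooth weight: 1 at d = 0, falling to 0 at d = cutoff, and 0 beyond."""
    return np.where(d <= cutoff, 0.5 * (np.cos(np.pi * d / cutoff) + 1.0), 0.0)


def radial_symmetry(coordinates, cutoff=4.6, eta=4.0, r_s=1.0):
    """One radial symmetry value per atom (eta and r_s are illustrative)."""
    d = distance_matrix(coordinates)
    terms = np.exp(-eta * (d - r_s) ** 2) * cosine_cutoff(d, cutoff)
    np.fill_diagonal(terms, 0.0)  # an atom is not its own neighbor
    return terms.sum(axis=1)


# Two atoms one unit apart, plus a third atom far outside the cutoff.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
g = radial_symmetry(coords)
```

The distant atom receives a value of zero because every neighbor lies beyond the cutoff, while the two close atoms receive identical values.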

FeaturizationTransformer

class FeaturizationTransformer(dataset=None, featurizer=None)[source]

A transformer which runs a featurizer over the X values of a dataset.

Datasets used by this transformer must be compatible with the internal featurizer. The idea of this transformer is that it allows for the application of a featurizer to an existing dataset.

Examples

>>> import numpy as np
>>> import deepchem as dc
>>> smiles = ["C", "CC"]
>>> X = np.array(smiles)
>>> y = np.array([1, 0])
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.FeaturizationTransformer(dataset, dc.feat.CircularFingerprint())
>>> dataset = trans.transform(dataset)

__init__(dataset=None, featurizer=None)[source]

Initialization of FeaturizationTransformer

Parameters
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed

  • featurizer (dc.feat.Featurizer object) – Featurizer applied to perform transformations.

transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Transforms arrays of rdkit mols using internal featurizer.

Parameters
  • X (np.ndarray) – Array of features

  • y (np.ndarray) – Array of labels

  • w (np.ndarray) – Array of weights.

  • ids (np.ndarray) – Array of identifiers.

Returns

  • Xtrans (np.ndarray) – Transformed array of features

  • ytrans (np.ndarray) – Transformed array of labels

  • wtrans (np.ndarray) – Transformed array of weights

  • idstrans (np.ndarray) – Transformed array of ids
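Conceptually, this kind of array transformation replaces X with the featurizer's output and passes the other arrays through. The sketch below illustrates that pattern with a toy stand-in; toy_featurizer and featurize_arrays are hypothetical names, and a real dc.feat.Featurizer would compute chemically meaningful features instead.

```python
import numpy as np


def toy_featurizer(smiles_list):
    """Hypothetical stand-in for a featurizer: counts 'C' and 'O' characters."""
    return np.array([[s.count("C"), s.count("O")] for s in smiles_list])


def featurize_arrays(X, y, w, ids, featurizer):
    """Replace X with featurizer(X); y, w and ids pass through unchanged."""
    return featurizer(X), y, w, ids


X = np.array(["C", "CC", "CCO"])
y = np.array([1, 0, 1])
w = np.ones(3)
ids = X.copy()
Xt, yt, wt, idst = featurize_arrays(X, y, w, ids, toy_featurizer)
```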

DataTransforms

class DataTransforms(Image)[source]

Applies different data transforms to images.

This utility class facilitates various image transformations that may be of use for handling image datasets.

Note

This class requires PIL to be installed.

__init__(Image)[source]

Initializes DataTransforms with the image to be transformed.

scale(h, w)[source]

Scales the image

Parameters
  • h (int) – Height of the images

  • w (int) – Width of the images

flip(direction='lr')[source]

Flips the image

Parameters

direction (str) – “lr” denotes left-right flip and “ud” denotes up-down flip.

rotate(angle=0)[source]

Rotates the image

Parameters

angle (float, default 0, i.e. no rotation) – Angle by which the image should be rotated, in degrees.

Returns

The rotated input array

gaussian_blur(sigma=0.2)[source]

Applies a Gaussian blur to the image

Parameters

sigma (float) – Standard deviation of the Gaussian kernel

center_crop(x_crop, y_crop)[source]

Crops the image from the center

Parameters
  • x_crop (int) – the total number of pixels to remove in the horizontal direction, evenly split between the left and right sides

  • y_crop (int) – the total number of pixels to remove in the vertical direction, evenly split between the top and bottom sides

Returns

The center cropped input array
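The meaning of x_crop and y_crop as totals split between opposite sides can be sketched with array slicing. This is a minimal illustration, not the library code; in particular, assigning an odd leftover pixel to the right or bottom side is an assumption.

```python
import numpy as np


def center_crop(image, x_crop, y_crop):
    """Remove x_crop columns and y_crop rows in total, split evenly per side."""
    height, width = image.shape[:2]
    left, top = x_crop // 2, y_crop // 2
    # Any odd leftover pixel goes to the right/bottom side (an assumption).
    right, bottom = x_crop - left, y_crop - top
    return image[top:height - bottom, left:width - right]


image = np.arange(6 * 8).reshape(6, 8)
cropped = center_crop(image, x_crop=4, y_crop=2)
```

With a 6 x 8 input, removing 4 columns and 2 rows leaves a 4 x 4 array taken from the middle of the image.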

crop(left, top, right, bottom)[source]

Crops the image and returns the specified rectangular region from an image

Parameters
  • left (int) – the number of pixels to exclude from the left of the image

  • top (int) – the number of pixels to exclude from the top of the image

  • right (int) – the number of pixels to exclude from the right of the image

  • bottom (int) – the number of pixels to exclude from the bottom of the image

Returns

The cropped input array

convert2gray()[source]

Converts the image to grayscale. The coefficients correspond to the Y’ component of the Y’UV color system.

Returns

The grayscale image.
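The Y' (luma) weighting can be written as a single dot product over the colour axis. The sketch below assumes an (H, W, 3) RGB array and the standard BT.601 luma weights; the function name to_gray is hypothetical.

```python
import numpy as np

# Luma weights for R, G, B (ITU-R BT.601), as used for Y' in Y'UV.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_gray(image):
    """Collapse the RGB axis of an (H, W, 3) array into one luma channel."""
    return image[..., :3] @ LUMA_WEIGHTS


rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [255, 255, 255]  # white
rgb[0, 1] = [255, 0, 0]      # pure red
gray = to_gray(rgb)
```

Because the weights sum to one, white stays at full intensity, while pure red maps to roughly 30% of it.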

shift(width, height, mode='constant', order=3)[source]

Shifts the image

Parameters
  • width (float) – Amount of width shift (positive values shift the image right)

  • height (float) – Amount of height shift (positive values shift the image down)

  • mode (str) – Points outside the boundaries of the input are filled according to the given mode: (‘constant’, ‘nearest’, ‘reflect’ or ‘wrap’). Default is ‘constant’

  • order (int) – The order of the spline interpolation, default is 3. The order has to be in the range 0-5.

gaussian_noise(mean=0, std=25.5)[source]

Adds gaussian noise to the image

Parameters
  • mean (float) – Mean of gaussian.

  • std (float) – Standard deviation of gaussian.

salt_pepper_noise(prob=0.05, salt=255, pepper=0)[source]

Adds salt and pepper noise to the image

Parameters
  • prob (float) – probability of the noise.

  • salt (float) – value of salt noise.

  • pepper (float) – value of pepper noise.
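One common way to realize salt and pepper noise is to draw a uniform random number per pixel and overwrite the pixels at both tails. The sketch below follows that approach; splitting prob evenly between salt and pepper, returning a copy, and the seed argument are all assumptions made for illustration.

```python
import numpy as np


def salt_pepper(image, prob=0.05, salt=255, pepper=0, seed=None):
    """Overwrite roughly a fraction `prob` of pixels, half salt and half pepper."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()  # leave the input untouched
    draws = rng.random(image.shape)
    noisy[draws < prob / 2] = pepper
    noisy[draws > 1 - prob / 2] = salt
    return noisy


image = np.full((50, 50), 128, dtype=np.uint8)
noisy = salt_pepper(image, prob=0.2, seed=0)
```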

median_filter(size)[source]

Calculates a multidimensional median filter

Parameters

size (int) – The kernel size in pixels.

Returns

The median filtered image.
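For intuition, a median filter on a 2D image can be sketched with NumPy alone. This is not the routine the class calls; it assumes an odd kernel size, a single-channel image, and edge replication at the borders, and the name median_filter_2d is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_2d(image, size=3):
    """Median over a size x size window around each pixel of a 2D image."""
    pad = size // 2
    # Replicate edge pixels so the output keeps the input shape (assumption).
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))


image = np.full((5, 5), 10.0)
image[2, 2] = 255.0  # a single outlier pixel
filtered = median_filter_2d(image, size=3)
```

The isolated outlier disappears because it is never the median of any 3 x 3 window, which is why median filtering suits salt and pepper noise.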