Transformers

DeepChem dc.trans.Transformer objects are another core building block of DeepChem programs. Machine learning systems are often delicate: they need their inputs and outputs to fit within a pre-specified range or follow a clean mathematical distribution. Real data, of course, is wild and hard to control. What do you do if you have a crazy dataset and need to bring its statistics to heel? Fear not, for you have Transformer objects.

Transformer

The dc.trans.Transformer class is the abstract parent class for all transformers. This class should never be directly initialized, but contains a number of useful method implementations.

class deepchem.trans.Transformer(transform_X=False, transform_y=False, transform_w=False, dataset=None)

Abstract base class for different data transformation techniques.

Transformer objects are used to transform Dataset objects in ways that are useful to machine learning. Transformations might process the data to make learning easier (say by normalizing), or may implement techniques such as data augmentation.

Note that you can never instantiate a Transformer class directly. You will want to use one of the concrete subclasses.

__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None)[source]

Initializes transformation based on dataset statistics.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters: z (np.ndarray) – Array which was previously transformed by this class.
Returns: ztrans – The array with the stored transformation reversed.
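
The array-level contract described above can be pictured with a small stand-in. This is a minimal sketch, not DeepChem code: ShiftYSketch and its offset argument are hypothetical, and the class deliberately does not subclass dc.trans.Transformer. It only shows a transform_array that returns the full (X, y, w) triple and an untransform that reverses the change.

```python
import numpy as np


class ShiftYSketch:
    """Hypothetical stand-in that mimics the Transformer method names.

    It does NOT subclass dc.trans.Transformer; it only illustrates the
    (X, y, w) in, (X, y, w) out contract and a matching untransform.
    """

    def __init__(self, offset):
        self.offset = offset  # assumed constant shift applied to y

    def transform_array(self, X, y, w):
        # Only y changes; X and w are passed through untouched.
        return X, y + self.offset, w

    def untransform(self, z):
        # Inverse of the shift applied in transform_array.
        return z - self.offset


X = np.zeros((4, 2))
y = np.arange(4, dtype=float).reshape(4, 1)
w = np.ones((4, 1))

sketch = ShiftYSketch(offset=10.0)
X_t, y_t, w_t = sketch.transform_array(X, y, w)
y_back = sketch.untransform(y_t)
```

A shift is invertible, so untransform recovers y exactly; transformations that are not 1-1 have no such inverse.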

MinMaxTransformer

class deepchem.trans.MinMaxTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None)[source]

Ensure each value rests between 0 and 1 by using the min and max.

MinMaxTransformer transforms the dataset by shifting each axis of X or y (depending on whether transform_X or transform_y is True), except the first one (the sample axis), by the minimum value along that axis and dividing the result by the range (maximum value - minimum value) along that axis. This ensures each axis lies between 0 and 1. In multi-task learning, it ensures each task is given equal importance.

Given original array A, the transformed array can be written as:

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A_min = np.min(A, axis=0)
>>> A_max = np.max(A, axis=0)
>>> A_t = np.nan_to_num((A - A_min)/(A_max - A_min))

Example

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.MinMaxTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises: ValueError if transform_w is set or transform_X and transform_y are both set.
__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None)[source]

Initialization of MinMax transformer.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
transform(dataset, parallel=False)[source]

Transforms the dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Undo transformation on provided data.

Parameters: z (np.ndarray) – Transformed X or y array
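
As a sketch of what undoing the scaling might look like, assume the inverse simply multiplies by the range and adds the minimum back. The variable names follow the formula given for this class; the code is not taken from the DeepChem source.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 3))

# Forward transform, as in the formula given for this class.
A_min = np.min(A, axis=0)
A_max = np.max(A, axis=0)
A_t = np.nan_to_num((A - A_min) / (A_max - A_min))

# Assumed inverse: rescale by the range, then shift back by the minimum.
A_back = A_t * (A_max - A_min) + A_min
```

The round trip is exact only when every column has a non-zero range; a constant column is mapped to zeros by nan_to_num.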

NormalizationTransformer

class deepchem.trans.NormalizationTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Normalizes dataset to have zero mean and unit standard deviation

This transformer transforms datasets to have zero mean and unit standard deviation.
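
In NumPy terms the idea can be sketched as follows. This is an illustration under the assumption that statistics are computed per column (one mean and standard deviation per task or feature); it is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=(100, 1))

# Per-column statistics (one mean and standard deviation per task).
y_mean = np.mean(y, axis=0)
y_std = np.std(y, axis=0)

y_t = (y - y_mean) / y_std      # forward: zero mean, unit standard deviation
y_back = y_t * y_std + y_mean   # assumed inverse of the forward step
```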

Example

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises: ValueError if transform_w is set or transform_X and transform_y are both set.
__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, transform_gradients=False, move_mean=True)[source]

Initialize normalization transformation.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
transform(dataset, parallel=False)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Undo transformation on provided data.

untransform_grad(grad, tasks)[source]

Undo transformation on gradient.

ClippingTransformer

class deepchem.trans.ClippingTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, x_max=5.0, y_max=500.0)[source]

Clip large values in datasets.
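
A minimal sketch of the clipping step, assuming x_max bounds the absolute value of X as its parameter description states. It uses plain NumPy and does not call DeepChem.

```python
import numpy as np

x_max = 5.0  # same value as the documented default
X = np.array([[-12.0, 0.5, 7.0],
              [3.0, -5.5, 5.0]])

# Values outside [-x_max, x_max] are pulled back to the nearest bound.
X_clipped = np.clip(X, -x_max, x_max)
```

Clipping discards the original magnitude of out-of-range values, so it cannot be reversed.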

Example

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.ClippingTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)
__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, x_max=5.0, y_max=500.0)[source]

Initialize clipping transformation.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional) – Dataset to be transformed
  • x_max (float, optional) – Maximum absolute value for X
  • y_max (float, optional) – Maximum absolute value for y

Note

This transformer can transform X and y jointly, but does not transform w.

Raises: ValueError if transform_w is set.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Features
  • y (np.ndarray) – Tasks
  • w (np.ndarray) – Weights
Returns:

  • X (np.ndarray) – Transformed features
  • y (np.ndarray) – Transformed tasks
  • w (np.ndarray) – Transformed weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters: z (np.ndarray) – Array which was previously transformed by this class.
Returns: ztrans – The array with the stored transformation reversed.

LogTransformer

class deepchem.trans.LogTransformer(transform_X=False, transform_y=False, transform_w=False, features=None, tasks=None, dataset=None)[source]

Computes a logarithmic transformation

This transformer computes the transformation given by

>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A = np.log(A + 1)

This is the behavior when tasks/features are not specified. If they are specified, only the specified tasks/features are transformed.
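
The selective case can be sketched in NumPy as follows. The features list here is a hypothetical choice of column indices, and the inverse shown (exp(z) - 1) is simply the mathematical inverse of log(x + 1), not a quotation of the library code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))
features = [0, 2]  # hypothetical choice of feature columns to transform

# Transform only the selected columns; column 1 is left unchanged.
X_t = X.copy()
X_t[:, features] = np.log(X[:, features] + 1)

# Assumed inverse of log(x + 1) on the same columns: exp(z) - 1.
X_back = X_t.copy()
X_back[:, features] = np.exp(X_t[:, features]) - 1
```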

Example

>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.LogTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)

Note

This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.

Raises: ValueError if transform_w is set or transform_X and transform_y are both set.
__init__(transform_X=False, transform_y=False, transform_w=False, features=None, tasks=None, dataset=None)[source]

Initialize log transformer.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
  • features (list[int]) – List of feature indices to transform
  • tasks (list[str]) – List of task names to transform.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Undo transformation on provided data.

Parameters: z (np.ndarray) – Transformed X or y array

BalancingTransformer

class deepchem.trans.BalancingTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None)[source]

Balance positive and negative examples for weights.

This class balances the sample weights so that the sum of the example weights is the same for every class. This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.
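
To make the balancing idea concrete, here is a minimal NumPy sketch for a single task. The weighting scheme (largest class count divided by each class's own count) is an assumption chosen for illustration; any scheme that equalizes the per-class totals would fit the description, and DeepChem's exact formula may differ.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives
classes, counts = np.unique(y, return_counts=True)

# Assumed scheme: scale each class by (largest class count / its own count),
# so every class ends up with the same total weight.
class_weights = counts.max() / counts
w = class_weights[np.searchsorted(classes, y)]

# Total weight carried by each class after balancing.
totals = np.array([w[y == c].sum() for c in classes])
```

With these numbers each negative keeps weight 1 and each positive gets weight 4, so both classes carry a total weight of 8.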

Example

Here’s an example for a binary dataset.

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(transform_w=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

And here’s a multiclass dataset example.

>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(transform_w=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)

Note

This transformer is only meaningful for classification datasets where y takes on a limited set of values. This class can only transform w and does not transform X or y.

Raises: ValueError if transform_X or transform_y are set. Also raises ValueError if y or w aren’t of shape (N,) or (N, n_tasks).
__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None)[source]

Initializes transformation based on dataset statistics.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters: z (np.ndarray) – Array which was previously transformed by this class.
Returns: ztrans – The array with the stored transformation reversed.

CDFTransformer

class deepchem.trans.CDFTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, bins=2)[source]

Histograms the data and assigns values based on sorted list.

Acts like a Cumulative Distribution Function (CDF). If given a dataset of samples from a continuous distribution, computes the CDF of this dataset.

TODO: Add an example of this. The current documentation is confusing.
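
One way to picture the behaviour described above, offered only as a sketch: rank the values, split the ranks into equal-sized groups (one per bin), and give every value in a group the same number in [0, 1). The helper binned_cdf_sketch is hypothetical and is not claimed to match DeepChem's implementation detail for detail.

```python
import numpy as np


def binned_cdf_sketch(values, bins):
    """Map each value to the lower edge of its quantile bin, in [0, 1)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Rank of each value within the sorted data (0 = smallest).
    ranks = np.argsort(np.argsort(values))
    # Which of the `bins` equal-sized quantile bins each rank falls into.
    bin_index = (ranks * bins) // n
    return bin_index / bins


y = np.array([0.3, 2.5, 1.1, 9.0, 4.2, 0.7, 3.3, 6.8])
y_cdf = binned_cdf_sketch(y, bins=4)
# y_cdf -> [0.0, 0.25, 0.25, 0.75, 0.5, 0.0, 0.5, 0.75]
```

The output preserves the ordering of the inputs but keeps only bins distinct levels, which is why the original values cannot in general be recovered exactly.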

__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, bins=2)[source]

Initialize this transformer.

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
  • bins (int, optional (default 2)) – Number of bins to use when histogramming the data.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Performs CDF transform on data.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Undo transformation on provided data.

Note that this transformation is only undone for y.

Parameters: z (np.ndarray) – Transformed y array

PowerTransformer

class deepchem.trans.PowerTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, powers=[1])[source]

Raises the data to each of the powers given in an input list.

Computes the specified powers of the dataset. This can be useful if you’re looking to add higher order features of the form x_i^2, x_i^3 etc. to your dataset.
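
As an illustration of adding higher-order features, the sketch below raises X to each power and places the results side by side. The column layout (one block of columns per power) is an assumption made for this example rather than a documented guarantee.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
powers = [1, 2, 3]

# Assumed layout: one copy of X per power, stacked side by side, so the
# feature count grows from n_features to n_features * len(powers).
X_t = np.hstack([np.power(X, p) for p in powers])
```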

__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, powers=[1])[source]

Initialize this transformer

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed. Note that this argument is ignored since PowerTransformer doesn’t require it to be specified.
  • powers (list[int], optional (default [1])) – The list of powers of features/labels to compute.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Performs power transform on data.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Undo transformation on provided data.

Parameters: z (np.ndarray) – Transformed y array

CoulombFitTransformer

class deepchem.trans.CoulombFitTransformer(dataset)[source]

Performs randomization and binarization operations on batches of Coulomb Matrix features during fit.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> fit_transformers = [dc.trans.CoulombFitTransformer(dataset)]
>>> model = dc.models.MultitaskFitTransformRegressor(n_tasks,
...    [n_features, n_features], batch_size=n_samples, fit_transformers=fit_transformers, n_evals=1)
>>> print(model.n_features)
12
X_transform(X)[source]

Perform Coulomb Fit transform on features.

Parameters: X (np.ndarray) – Features
Returns: X – Transformed features
Return type: np.ndarray
__init__(dataset)[source]

Initializes CoulombFitTransformer.

Parameters:dataset (dc.data.Dataset object) –
expand(X)[source]

Binarize features.

Parameters: X (np.ndarray) – Features
Returns: X – Binarized features
Return type: np.ndarray
normalize(X)[source]

Normalize features.

Parameters: X (np.ndarray) – Features
Returns: X – Normalized features
Return type: np.ndarray
realize(X)[source]

Randomize features.

Parameters: X (np.ndarray) – Features
Returns: X – Randomized features
Return type: np.ndarray
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters: z (np.ndarray) – Array which was previously transformed by this class.
Returns: ztrans – The array with the stored transformation reversed.

IRVTransformer

class deepchem.trans.IRVTransformer(K, n_tasks, dataset, transform_y=False, transform_x=False)[source]

Performs transform from ECFP to IRV features (K nearest neighbours).

X_transform(X_target)[source]

Calculate the similarity between the target dataset (X_target) and the reference dataset (X): #(1 in intersection) / #(1 in union)

similarity = (X_target intersect X) / (X_target union X)

Parameters: X_target (np.ndarray) – Fingerprints of the target dataset; must have the same length as X along the second axis.
Returns: X_target – Features of size (batch_size, 2*K*n_tasks)
Return type: np.ndarray
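
The intersection-over-union similarity above can be sketched for binary fingerprints as follows. The arrays X_ref and X_target are made-up examples, and the code shows only the similarity step, not the full X_transform.

```python
import numpy as np

# Binary fingerprints: rows are molecules, columns are fingerprint bits.
X_ref = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 1]])
X_target = np.array([[1, 1, 1, 0]])

# Bits set in both fingerprints, for every (target, reference) pair.
intersection = X_target @ X_ref.T
# Bits set in either fingerprint: |A| + |B| - |A and B|.
union = X_target.sum(axis=1)[:, None] + X_ref.sum(axis=1)[None, :] - intersection

similarity = intersection / union
```

Note that the ratio is undefined when both fingerprints are all zeros (the union is empty); this sketch does not handle that case.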
__init__(K, n_tasks, dataset, transform_y=False, transform_x=False)[source]

Initializes IRVTransformer.

Parameters:
  • dataset (dc.data.Dataset object) – Training dataset
  • K (int) – Number of nearest neighbours to count
  • n_tasks (int) – Number of tasks
static matrix_mul(X1, X2, shard_size=5000)[source]

Calculate matrix multiplication for big matrices. X1 and X2 are sliced into pieces of shard_size rows (columns), multiplied together piece by piece, and concatenated to the proper size.
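
The blockwise idea can be sketched like this. The function sharded_matmul is a hypothetical illustration written for this page, assuming X1 is sliced by rows and X2 by columns; it is not the body of matrix_mul.

```python
import numpy as np


def sharded_matmul(X1, X2, shard_size=5000):
    """Compute X1 @ X2 one block at a time and stitch the blocks together."""
    row_blocks = []
    for i in range(0, X1.shape[0], shard_size):
        col_blocks = []
        for j in range(0, X2.shape[1], shard_size):
            # Multiply one slab of rows of X1 by one slab of columns of X2.
            col_blocks.append(X1[i:i + shard_size] @ X2[:, j:j + shard_size])
        row_blocks.append(np.hstack(col_blocks))
    return np.vstack(row_blocks)


rng = np.random.default_rng(0)
A = rng.random((7, 4))
B = rng.random((4, 5))
C = sharded_matmul(A, B, shard_size=3)
```

Working in shards bounds the size of each intermediate product, which is the point when the full matrices are large.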

realize(similarity, y, w)[source]

Find the samples with the K highest similarity values in the reference dataset.

Parameters:
  • similarity (np.ndarray) – Similarity values between the target dataset and the reference dataset; should have size (n_samples_in_target, n_samples_in_reference)
  • y (np.array) – Labels for a single task
  • w (np.array) – Weights for a single task
Returns: features – n_samples arrays of size (2*K,); each array includes K similarity values and the corresponding labels
Return type: list
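
A simplified sketch of the neighbour selection for one task is shown below. It ignores the weights w, and the ordering inside each output vector (similarity values first, then labels) is an assumption made for illustration.

```python
import numpy as np

K = 2
# similarity[i, j]: similarity of target sample i to reference sample j.
similarity = np.array([[0.9, 0.1, 0.5, 0.7],
                       [0.2, 0.8, 0.3, 0.6]])
y_ref = np.array([1, 0, 0, 1])  # labels of the reference samples, one task

# Indices of the K most similar reference samples, most similar first.
top_idx = np.argsort(-similarity, axis=1)[:, :K]

features = []
for i in range(similarity.shape[0]):
    top_sim = similarity[i, top_idx[i]]
    top_labels = y_ref[top_idx[i]]
    # One vector of size (2*K,): K similarity values, then their K labels.
    features.append(np.concatenate([top_sim, top_labels]))
```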

DAGTransformer

class deepchem.trans.DAGTransformer(max_atoms=50, transform_X=True, transform_y=False, transform_w=False)[source]

Performs transform from ConvMol adjacency lists to DAG calculation orders

UG_to_DAG(sample)[source]

This function generates the DAGs for a molecule

__init__(max_atoms=50, transform_X=True, transform_y=False, transform_w=False)[source]

Initializes DAGTransformer. Only X can be transformed.

transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns: a newly constructed Dataset object

transform_array(X, y, w)[source]

Add calculation orders to ConvMol objects

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:
  • z (np.ndarray) – Array which was previously transformed by this class.
Returns:

ztrans – the untransformed array
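
To make the caveat about transformations that are not 1-1 concrete, here is a minimal NumPy sketch that does not use DeepChem's API. Standardization can be reversed from its stored statistics, whereas clipping cannot, because distinct inputs collapse to the same output. The array values and the choice of clipping as the non-invertible example are illustrative assumptions.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])

# Invertible: standardization can be undone from the stored mean and std.
mean, std = y.mean(), y.std()
z = (y - mean) / std
y_restored = z * std + mean

# Not invertible: clipping maps both 3.0 and 10.0 to 2.5,
# so the original values cannot be recovered from `clipped` alone.
clipped = np.clip(y, 0.0, 2.5)
```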

ImageTransformer

class deepchem.trans.ImageTransformer(size, transform_X=True, transform_y=False, transform_w=False)[source]

Convert an image into width, height, channel

__init__(size, transform_X=True, transform_y=False, transform_w=False)[source]

Initializes transformation based on dataset statistics.

transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns:

a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:
  • z (np.ndarray) – Array which was previously transformed by this class.
Returns:

ztrans – the untransformed array

ANITransformer

class deepchem.trans.ANITransformer(max_atoms=23, radial_cutoff=4.6, angular_cutoff=3.1, radial_length=32, angular_length=8, atom_cases=[1, 6, 7, 8, 16], atomic_number_differentiated=True, coordinates_in_bohr=True, transform_X=True, transform_y=False, transform_w=False)[source]

Performs transform from 3D coordinates to ANI symmetry functions

__init__(max_atoms=23, radial_cutoff=4.6, angular_cutoff=3.1, radial_length=32, angular_length=8, atom_cases=[1, 6, 7, 8, 16], atomic_number_differentiated=True, coordinates_in_bohr=True, transform_X=True, transform_y=False, transform_w=False)[source]

Only X can be transformed

angular_symmetry(d_cutoff, d, atom_numbers, coordinates)[source]

Angular Symmetry Function

build()[source]

Builds the TensorFlow computation graph for the transform

distance_cutoff(d, cutoff, flags)[source]

Generate distance matrix with trainable cutoff

distance_matrix(coordinates, flags)[source]

Generate distance matrix

radial_symmetry(d_cutoff, d, atom_numbers)[source]

Radial Symmetry Function
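
The distance matrix, cutoff, and radial symmetry steps can be sketched together in NumPy. This is a simplified illustration, not the TensorFlow graph built by this class. It assumes the Behler-Parrinello style radial term with a cosine cutoff, a fixed (non-trainable) cutoff, no padding flags, and no differentiation by atomic number. The values of `eta` and `r_s` are hypothetical.

```python
import numpy as np

def distance_matrix(coordinates):
    """Pairwise Euclidean distances for an (n_atoms, 3) coordinate array."""
    diff = coordinates[:, None, :] - coordinates[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cosine_cutoff(d, cutoff):
    """Smooth cutoff: 0.5*cos(pi*d/cutoff) + 0.5 inside the cutoff, 0 outside."""
    return np.where(d <= cutoff, 0.5 * np.cos(np.pi * d / cutoff) + 0.5, 0.0)

def radial_symmetry(coordinates, cutoff, eta, r_s):
    """One radial symmetry value per atom for a single (eta, r_s) pair."""
    d = distance_matrix(coordinates)
    terms = np.exp(-eta * (d - r_s) ** 2) * cosine_cutoff(d, cutoff)
    np.fill_diagonal(terms, 0.0)  # an atom is not its own neighbor
    return terms.sum(axis=1)

# Two atoms 1 unit apart and a third far outside the cutoff radius.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0]])
g = radial_symmetry(coords, cutoff=4.6, eta=1.0, r_s=1.0)
```

The third atom has no neighbors within the cutoff, so its value is zero, while the first two atoms see only each other and therefore share the same value.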

transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns:

a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:
  • z (np.ndarray) – Array which was previously transformed by this class.
Returns:

ztrans – the untransformed array

FeaturizationTransformer

class deepchem.trans.FeaturizationTransformer(transform_X=False, transform_y=False, transform_w=False, dataset=None, featurizer=None)[source]

A transformer which runs a featurizer over the X values of a dataset.

Datasets used by this transformer must be compatible with the internal featurizer.

__init__(transform_X=False, transform_y=False, transform_w=False, dataset=None, featurizer=None)[source]

Initialization of FeaturizationTransformer

Parameters:
  • transform_X (bool, optional (default False)) – Whether to transform X
  • transform_y (bool, optional (default False)) – Whether to transform y
  • transform_w (bool, optional (default False)) – Whether to transform w
  • dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
  • featurizer (dc.feat.Featurizer object) – Featurizer applied to perform transformations.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns:

a newly constructed Dataset object

transform_array(X, y, w)[source]

Transforms arrays of RDKit molecules using the internal featurizer.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:
  • z (np.ndarray) – Array which was previously transformed by this class.
Returns:

ztrans – the untransformed array

DataTransforms

class deepchem.trans.DataTransforms(Image)[source]

Applies different data transforms to images.

__init__(Image)[source]

Initializes the transforms with the image to operate on.

Parameters:
  • Image – the image array to be transformed
center_crop(x_crop, y_crop)[source]

Crops the image from the center

Parameters:
  • x_crop (int) – the total number of pixels to remove in the horizontal direction, evenly split between the left and right sides
  • y_crop (int) – the total number of pixels to remove in the vertical direction, evenly split between the top and bottom sides
Returns:

The center-cropped input array

convert2gray()[source]

Converts the image to grayscale. The coefficients correspond to the Y’ component of the Y’UV color system.

Returns:

The grayscale image
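
A grayscale conversion of this kind can be sketched as a weighted sum of the color channels. The weights below are the standard Y′ (luma) coefficients 0.299, 0.587 and 0.114; the exact constants used by the library may differ in the last digits, and the helper name `to_gray` is hypothetical.

```python
import numpy as np

def to_gray(image):
    """Weighted sum of the R, G, B channels using the Y' (luma) coefficients."""
    return np.dot(image[..., :3], [0.299, 0.587, 0.114])

rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [255, 255, 255]  # white
rgb[0, 1] = [255, 0, 0]      # pure red
gray = to_gray(rgb)
```

The result drops the channel axis, so a (height, width, 3) image becomes a (height, width) array.
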
crop(left, top, right, bottom)[source]

Crops the image and returns the specified rectangular region from an image

Parameters:
  • left (int) – the number of pixels to exclude from the left of the image
  • top (int) – the number of pixels to exclude from the top of the image
  • right (int) – the number of pixels to exclude from the right of the image
  • bottom (int) – the number of pixels to exclude from the bottom of the image
Returns:

The cropped input array
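
Following the parameter descriptions above, where each argument is a number of pixels to exclude from one side, a crop can be sketched as a single NumPy slice. The standalone function below is an illustration that takes the image as an argument; it is not the method itself.

```python
import numpy as np

def crop(image, left, top, right, bottom):
    """Return the region left after excluding the given margins (in pixels)."""
    height, width = image.shape[:2]
    return image[top:height - bottom, left:width - right]

# A 5x6 test image whose pixel values equal their flat index.
image = np.arange(5 * 6).reshape(5, 6)
cropped = crop(image, left=1, top=2, right=2, bottom=1)
```

Here the 5x6 input loses 3 rows and 3 columns in total, leaving a 2x3 region.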

flip(direction='lr')[source]

Flips the image

Parameters:
  • direction (str) – “lr” denotes a left-right flip and “ud” denotes an up-down flip.
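
The two flip directions correspond to reversing the column axis or the row axis of the image array. A minimal NumPy sketch, written as a standalone function for illustration (the error handling for other values is an assumption):

```python
import numpy as np

def flip(image, direction="lr"):
    if direction == "lr":
        return np.fliplr(image)  # reverse the column order
    if direction == "ud":
        return np.flipud(image)  # reverse the row order
    raise ValueError("direction must be 'lr' or 'ud'")

image = np.array([[1, 2, 3],
                  [4, 5, 6]])
flipped_lr = flip(image, "lr")
flipped_ud = flip(image, "ud")
```
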
gaussian_blur(sigma=0.2)[source]

Applies a Gaussian blur to the image

Parameters:
  • sigma (float) – Standard deviation of the Gaussian kernel
gaussian_noise(mean=0, std=25.5)[source]

Adds gaussian noise to the image

Parameters:
  • mean (float) – Mean of gaussian.
  • std (float) – Standard deviation of gaussian.
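
Additive Gaussian noise of this form can be sketched in NumPy as follows. The standalone function and its `seed` argument are assumptions added so the example is reproducible; they are not part of the documented signature.

```python
import numpy as np

def gaussian_noise(image, mean=0.0, std=25.5, seed=None):
    rng = np.random.default_rng(seed)
    # Draw one independent noise sample per pixel and add it to the image.
    noise = rng.normal(mean, std, size=image.shape)
    return image + noise

image = np.full((64, 64), 100.0)
noisy = gaussian_noise(image, seed=0)
```

Note that the result is not clipped, so pixel values may fall outside the original intensity range.
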
median_filter(size)[source]

Calculates a multidimensional median filter

Parameters:
  • size (int) – The kernel size in pixels.
Returns:

The median filtered image
rotate(angle=0)[source]

Rotates the image

Parameters:
  • angle (float, default 0, i.e. no rotation) – Angle by which the image should be rotated, in degrees
Returns:

The rotated input array
salt_pepper_noise(prob=0.05, salt=255, pepper=0)[source]

Adds salt and pepper noise to the image

Parameters:
  • prob (float) – probability of the noise.
  • salt (float) – value of salt noise.
  • pepper (float) – value of pepper noise.
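
One common way to realize salt and pepper noise with these parameters is to draw a uniform random number per pixel and overwrite the lowest and highest `prob / 2` fractions. The sketch below assumes that even split between salt and pepper; the standalone function and its `seed` argument are hypothetical additions for reproducibility.

```python
import numpy as np

def salt_pepper_noise(image, prob=0.05, salt=255, pepper=0, seed=None):
    rng = np.random.default_rng(seed)
    noisy = image.copy()  # leave the input untouched
    draws = rng.random(image.shape)
    noisy[draws < prob / 2] = pepper      # lowest draws become pepper
    noisy[draws > 1 - prob / 2] = salt    # highest draws become salt
    return noisy

image = np.full((100, 100), 128, dtype=np.uint8)
noisy = salt_pepper_noise(image, prob=0.05, seed=0)
```
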
scale(h, w)[source]

Scales the image

Parameters:
  • h (int) – Height of the images
  • w (int) – Width of the images
shift(width, height, mode='constant', order=3)[source]

Shifts the image

Parameters:
  • width (float) – Amount of width shift (positive values shift the image right)
  • height (float) – Amount of height shift (positive values shift the image down)
  • mode (str) – Points outside the boundaries of the input are filled according to the given mode: (‘constant’, ‘nearest’, ‘reflect’ or ‘wrap’). Default is ‘constant’
  • order (int) – The order of the spline interpolation, default is 3. The order has to be in the range 0-5.
transform(dataset, parallel=False, out_dir=None, **kwargs)[source]

Transforms all internally stored data in dataset.

This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object to be transformed.
  • parallel (bool, optional (default False)) – At present this argument is ignored.
  • out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
Returns:

a newly constructed Dataset object

transform_array(X, y, w)[source]

Transform the data in a set of (X, y, w) arrays.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

transform_on_array(X, y, w)[source]

Transforms numpy arrays X, y, and w

DEPRECATED. Use transform_array instead.

Parameters:
  • X (np.ndarray) – Array of features
  • y (np.ndarray) – Array of labels
  • w (np.ndarray) – Array of weights.
Returns:

  • Xtrans (np.ndarray) – Transformed array of features
  • ytrans (np.ndarray) – Transformed array of labels
  • wtrans (np.ndarray) – Transformed array of weights

untransform(z)[source]

Reverses stored transformation on provided data.

Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.

Parameters:
  • z (np.ndarray) – Array which was previously transformed by this class.
Returns:

ztrans – the untransformed array