Transformers¶
DeepChem dc.trans.Transformer objects are another core building block of DeepChem programs. Machine learning systems are often delicate: they need their inputs and outputs to fit within a pre-specified range or follow a clean mathematical distribution. Real data, of course, is messy and hard to control. What do you do if you have an unruly dataset and need to bring its statistics to heel? Fear not, for you have Transformer objects.
General Transformers¶
NormalizationTransformer¶
- class NormalizationTransformer(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, transform_gradients: bool = False, move_mean: bool = True)[source]¶
Normalizes dataset to have zero mean and unit standard deviation
This transformer transforms datasets to have zero mean and unit standard deviation.
Examples
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)
Note
This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.
- Raises
ValueError – if transform_X and transform_y are both set.
- __init__(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, transform_gradients: bool = False, move_mean: bool = True)[source]¶
Initialize normalization transformation.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
transform_w (bool, optional (default False)) – Whether to transform w
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(z: numpy.ndarray) numpy.ndarray [source]¶
Undo transformation on provided data.
- Parameters
z (np.ndarray) – Array to transform back
- Returns
z_out – Array with normalization undone.
- Return type
np.ndarray
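A common pattern is to normalize y for training and then map model predictions back to the original label scale with untransform. The following is a minimal sketch of that round trip (y_pred here is just a stand-in for hypothetical model output on the normalized scale):
import numpy as np
import deepchem as dc

X = np.random.rand(10, 3)
y = np.random.rand(10, 1)
dataset = dc.data.NumpyDataset(X, y)
transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
dataset = transformer.transform(dataset)

# y_pred stands in for model predictions on the normalized scale.
y_pred = dataset.y
# untransform maps normalized values back to the original label scale.
y_orig_scale = transformer.untransform(y_pred)
assert np.allclose(y_orig_scale, y)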
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
MinMaxTransformer¶
- class MinMaxTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Ensures each value lies between 0 and 1 by using the minimum and maximum along each axis.
MinMaxTransformer transforms X or y (depending on whether transform_X or transform_y is True) by subtracting the minimum value along each axis (all axes except the first, which indexes samples) and dividing the result by the range (maximum value - minimum value) along that axis. This ensures each axis lies between 0 and 1. In the case of multitask learning, it ensures each task is given equal importance.
Given original array A, the transformed array can be written as:
>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A_min = np.min(A, axis=0)
>>> A_max = np.max(A, axis=0)
>>> A_t = np.nan_to_num((A - A_min)/(A_max - A_min))
Examples
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.MinMaxTransformer(transform_y=True, dataset=dataset)
>>> dataset = transformer.transform(dataset)
Note
This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.
- Raises
ValueError – if transform_X and transform_y are both set.
- __init__(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Initialization of MinMax transformer.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(z: numpy.ndarray) numpy.ndarray [source]¶
Undo transformation on provided data.
- Parameters
z (np.ndarray) – Transformed X or y array
- Returns
Array with min-max scaling undone.
- Return type
np.ndarray
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
ClippingTransformer¶
- class ClippingTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, x_max: float = 5.0, y_max: float = 500.0)[source]¶
Clip large values in datasets.
Examples
Let’s clip values from a synthetic dataset
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.ClippingTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)
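Conceptually, the effect on X is element-wise clipping to the range [-x_max, x_max]. The snippet below is a sketch of that intended behavior (an illustration, not the library source):
import numpy as np

x_max = 5.0
X = np.array([[-10.0, 0.5, 7.0]])
# Values outside [-x_max, x_max] are truncated to the boundary.
X_clipped = np.clip(X, -x_max, x_max)
print(X_clipped)  # [[-5.   0.5  5. ]]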
- __init__(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, x_max: float = 5.0, y_max: float = 500.0)[source]¶
Initialize clipping transformation.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional) – Dataset to be transformed
x_max (float, optional) – Maximum absolute value for X
y_max (float, optional) – Maximum absolute value for y
Note
This transformer can transform X and y jointly, but does not transform w.
- Raises
ValueError – if transform_w is set.
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w) arrays.
- Parameters
X (np.ndarray) – Array of Features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights
ids (np.ndarray) – Array of ids.
- Returns
X (np.ndarray) – Transformed features
y (np.ndarray) – Transformed labels
w (np.ndarray) – Transformed weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
LogTransformer¶
- class LogTransformer(transform_X: bool = False, transform_y: bool = False, features: Optional[List[int]] = None, tasks: Optional[List[str]] = None, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Computes a logarithmic transformation
This transformer computes the transformation given by
>>> import numpy as np
>>> A = np.random.rand(10, 10)
>>> A = np.log(A + 1)
This assumes that tasks/features are not specified. If they are specified, the transformation is only applied to the specified tasks/features.
Examples
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.LogTransformer(transform_X=True)
>>> dataset = transformer.transform(dataset)
Note
This class can only transform X or y and not w. So only one of transform_X or transform_y can be set.
- Raises
ValueError – if transform_w is set or transform_X and transform_y are both set.
- __init__(transform_X: bool = False, transform_y: bool = False, features: Optional[List[int]] = None, tasks: Optional[List[str]] = None, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Initialize log transformer.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
features (list[int]) – List of feature indices to transform
tasks (list[str]) – List of task names to transform.
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(z: numpy.ndarray) numpy.ndarray [source]¶
Undo transformation on provided data.
- Parameters
z (np.ndarray,) – Transformed X or y array
- Returns
Array with a logarithmic transformation undone.
- Return type
np.ndarray
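Since the forward transform is log(A + 1), untransform presumably applies the inverse, exp(z) - 1. A minimal sketch of the round trip (assuming no tasks/features restriction):
import numpy as np
import deepchem as dc

X = np.random.rand(5, 3)
dataset = dc.data.NumpyDataset(X)
trans = dc.trans.LogTransformer(transform_X=True, dataset=dataset)
logged = trans.transform(dataset)
# The inverse of log(A + 1) is exp(z) - 1, which untransform applies.
recovered = trans.untransform(logged.X)
assert np.allclose(recovered, X)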
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
CDFTransformer¶
- class CDFTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, bins: int = 2)[source]¶
Histograms the data and assigns values based on the sorted list.
Acts like a cumulative distribution function (CDF). Given a dataset of samples from a continuous distribution, it computes the CDF of the dataset and replaces each value with its corresponding CDF value.
Examples
Let’s look at an example where we transform only features.
>>> N = 10
>>> n_feat = 5
>>> n_bins = 100
Note that we’re using 100 bins for our CDF histogram
>>> import numpy as np
>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.randint(2, size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_X=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)
Note that you can apply this transformation to y as well
>>> X = np.random.normal(size=(N, n_feat))
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> cdftrans = dc.trans.CDFTransformer(transform_y=True, dataset=dataset, bins=n_bins)
>>> dataset = cdftrans.transform(dataset)
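Conceptually, each value is replaced by its empirical CDF value computed from a histogram. The helper below is an illustrative sketch of that idea (not the library implementation):
import numpy as np

def empirical_cdf(x, bins=100):
    # Histogram the values, accumulate the counts, and look up each value's
    # cumulative probability.
    counts, edges = np.histogram(x, bins=bins)
    cdf = np.cumsum(counts) / counts.sum()
    idx = np.clip(np.digitize(x, edges[1:]), 0, bins - 1)
    return cdf[idx]

x = np.random.normal(size=1000)
x_cdf = empirical_cdf(x)
assert x_cdf.min() >= 0.0 and x_cdf.max() <= 1.0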
- __init__(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, bins: int = 2)[source]¶
Initialize this transformer.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
bins (int, optional (default 2)) – Number of bins to use when computing histogram.
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Performs CDF transform on data.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(z: numpy.ndarray) numpy.ndarray [source]¶
Undo transformation on provided data.
Note that this transformation is only undone for y.
- Parameters
z (np.ndarray,) – Transformed y array
- Returns
Array with the transformation undone.
- Return type
np.ndarray
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
PowerTransformer¶
- class PowerTransformer(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, powers: List[int] = [1])[source]¶
Takes power-n transforms of the data based on an input list of powers.
Computes the specified powers of the dataset. This can be useful if you’re looking to add higher order features of the form x_i^2, x_i^3 etc. to your dataset.
Examples
Let’s look at an example where we transform only X.
>>> N = 10
>>> n_feat = 5
>>> powers = [1, 2, 0.5]
So in this example, we’re taking the identity, squares, and square roots. Now let’s construct our matrices
>>> import numpy as np
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.normal(size=(N,))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_X=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)
Let’s now look at an example where we transform y. Note that the y transform expands out the feature dimensions of y the same way it does for X so this transform is only well defined for singletask datasets.
>>> import numpy as np
>>> X = np.random.rand(N, n_feat)
>>> y = np.random.rand(N)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.PowerTransformer(transform_y=True, dataset=dataset, powers=powers)
>>> dataset = trans.transform(dataset)
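The transform amounts to concatenating the requested powers along the feature axis; a rough sketch of the idea (an illustration, not the library source):
import numpy as np

X = np.random.rand(10, 5)
powers = [1, 2, 0.5]
# One column block per power: [X^1, X^2, X^0.5], giving 5 * 3 = 15 features.
X_powers = np.hstack([X ** p for p in powers])
print(X_powers.shape)  # (10, 15)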
- __init__(transform_X: bool = False, transform_y: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None, powers: List[int] = [1])[source]¶
Initialize this transformer
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed. Note that this argument is ignored since PowerTransformer doesn't require it to be specified.
powers (list[int], optional (default [1])) – The list of powers of features/labels to compute.
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Performs power transform on data.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(z: numpy.ndarray) numpy.ndarray [source]¶
Undo transformation on provided data.
- Parameters
z (np.ndarray,) – Transformed y array
- Returns
Array with the power transformation undone.
- Return type
np.ndarray
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
BalancingTransformer¶
- class BalancingTransformer(dataset: deepchem.data.datasets.Dataset)[source]¶
Balance positive and negative (or multiclass) example weights.
This class balances the sample weights so that the sum of all example weights from all classes is the same. This can be useful when you’re working on an imbalanced dataset where there are far fewer examples of some classes than others.
Examples
Here’s an example for a binary dataset.
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)
And here’s a multiclass dataset example.
>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.BalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)
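The weighting scheme is simple to picture: each class is up-weighted inversely to its frequency so every class contributes the same total weight. A small sketch of that idea (an illustration, not the exact library computation):
import numpy as np

y = np.array([0, 0, 0, 0, 1])
classes, counts = np.unique(y, return_counts=True)
# Scale each class so that the summed weight per class is equal.
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
w = np.array([class_weight[label] for label in y])
print(w)                                  # [0.625 0.625 0.625 0.625 2.5  ]
print(w[y == 0].sum(), w[y == 1].sum())   # 2.5 2.5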
See also
deepchem.trans.DuplicateBalancingTransformer
Balance by duplicating samples.
Note
This transformer is only meaningful for classification datasets where y takes on a limited set of values. This class can only transform w and does not transform X or y.
- Raises
ValueError – if transform_X or transform_y are set, or if y or w aren't of shape (N,) or (N, n_tasks).
- __init__(dataset: deepchem.data.datasets.Dataset)[source]¶
Initializes transformation based on dataset statistics.
- Parameters
dataset (dc.data.Dataset) – Dataset to be transformed
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(transformed: numpy.ndarray) numpy.ndarray [source]¶
Reverses stored transformation on provided data.
Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.
- Parameters
transformed (np.ndarray) – Array which was previously transformed by this class.
DuplicateBalancingTransformer¶
- class DuplicateBalancingTransformer(dataset: deepchem.data.datasets.Dataset)[source]¶
Balance binary or multiclass datasets by duplicating rarer class samples.
This class balances a dataset by duplicating samples of the rarer class so that the sum of all example weights from all classes is the same (up to integer rounding, of course). This can be useful when you're working on an imbalanced dataset where there are far fewer examples of some classes than others.
This class differs from BalancingTransformer in that it actually duplicates rarer class samples rather than just increasing their sample weights. This may be more friendly for models that are numerically fragile and can’t handle imbalanced example weights.
Examples
Here’s an example for a binary dataset.
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 2
>>> import deepchem as dc
>>> import numpy as np
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)
And here’s a multiclass dataset example.
>>> n_samples = 50
>>> n_features = 3
>>> n_tasks = 1
>>> n_classes = 5
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(n_classes, size=(n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> transformer = dc.trans.DuplicateBalancingTransformer(dataset=dataset)
>>> dataset = transformer.transform(dataset)
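The duplication idea can be pictured as repeating the indices of the rarer classes until the class totals roughly match; the following sketch illustrates that idea only (it is not the library's exact duplication logic):
import numpy as np

y = np.array([0, 0, 0, 0, 1])
classes, counts = np.unique(y, return_counts=True)
max_count = counts.max()
duplicated_indices = []
for c, n in zip(classes, counts):
    class_idx = np.flatnonzero(y == c)
    # Repeat the rarer class's rows (up to integer rounding) to match the majority.
    duplicated_indices.append(np.tile(class_idx, max_count // n))
duplicated_indices = np.concatenate(duplicated_indices)
print(np.bincount(y[duplicated_indices]))  # [4 4]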
See also
deepchem.trans.BalancingTransformer
Balance by changing sample weights.
Note
This transformer is only well-defined for singletask datasets. (Since examples are actually duplicated, there’s no meaningful way to duplicate across multiple tasks in a way that preserves the balance.)
This transformer is only meaningful for classification datasets where y takes on a limited set of values. This class transforms all of X, y, w, ids.
- Raises
ValueError –
- __init__(dataset: deepchem.data.datasets.Dataset)[source]¶
Initializes transformation based on dataset statistics.
- Parameters
dataset (dc.data.Dataset) – Dataset to be transformed
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, id) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idtrans (np.ndarray) – Transformed array of identifiers
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(transformed: numpy.ndarray) numpy.ndarray [source]¶
Reverses stored transformation on provided data.
Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.
- Parameters
transformed (np.ndarray) – Array which was previously transformed by this class.
ImageTransformer¶
- class ImageTransformer(size: Tuple[int, int])[source]¶
Converts an image into a (width, height, channel) array.
Note
This class requires Pillow to be installed.
- __init__(size: Tuple[int, int])[source]¶
Initializes ImageTransformer.
- Parameters
size (Tuple[int, int]) – The image size, a tuple of (width, height).
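The resizing step can be sketched with Pillow directly (an assumption about the intended behavior; image is a hypothetical uint8 array):
import numpy as np
from PIL import Image

size = (64, 64)
image = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)  # hypothetical input image
# Resize with Pillow and convert back to a numpy array.
resized = np.array(Image.fromarray(image).resize(size))
print(resized.shape)  # (64, 64, 3)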
- transform_array(X, y, w)[source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(transformed: numpy.ndarray) numpy.ndarray [source]¶
Reverses stored transformation on provided data.
Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.
- Parameters
transformed (np.ndarray) – Array which was previously transformed by this class.
FeaturizationTransformer¶
- class FeaturizationTransformer(dataset: Optional[deepchem.data.datasets.Dataset] = None, featurizer: Optional[deepchem.feat.base_classes.Featurizer] = None)[source]¶
A transformer which runs a featurizer over the X values of a dataset.
Datasets used by this transformer must be compatible with the internal featurizer. The idea of this transformer is that it allows for the application of a featurizer to an existing dataset.
Examples
>>> smiles = ["C", "CC"]
>>> X = np.array(smiles)
>>> y = np.array([1, 0])
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.FeaturizationTransformer(dataset, dc.feat.CircularFingerprint())
>>> dataset = trans.transform(dataset)
- __init__(dataset: Optional[deepchem.data.datasets.Dataset] = None, featurizer: Optional[deepchem.feat.base_classes.Featurizer] = None)[source]¶
Initialization of FeaturizationTransformer
- Parameters
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
featurizer (dc.feat.Featurizer object, optional (default None)) – Featurizer applied to perform transformations.
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms arrays of rdkit mols using internal featurizer.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(transformed: numpy.ndarray) numpy.ndarray [source]¶
Reverses stored transformation on provided data.
Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.
- Parameters
transformed (np.ndarray) – Array which was previously transformed by this class.
Specified Usecase Transformers¶
CoulombFitTransformer¶
- class CoulombFitTransformer(dataset: deepchem.data.datasets.Dataset)[source]¶
Performs randomization and binarization operations on batches of Coulomb Matrix features during fit.
Examples
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> fit_transformers = [dc.trans.CoulombFitTransformer(dataset)]
>>> model = dc.models.MultitaskFitTransformRegressor(n_tasks,
...     [n_features, n_features], batch_size=n_samples, fit_transformers=fit_transformers, n_evals=1)
>>> print(model.n_features)
12
- __init__(dataset: deepchem.data.datasets.Dataset)[source]¶
Initializes CoulombFitTransformer.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
- realize(X: numpy.ndarray) numpy.ndarray [source]¶
Randomize features.
- Parameters
X (np.ndarray) – Features
- Returns
X – Randomized features
- Return type
np.ndarray
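One common way to randomize a Coulomb matrix is to apply the same random permutation to its rows and columns, which leaves the represented molecule unchanged while augmenting the data. The sketch below illustrates that general idea only; it is not the library's randomization scheme:
import numpy as np

def permute_coulomb_matrix(cm, rng=None):
    # Apply one random permutation to both rows and columns of a symmetric
    # Coulomb matrix (illustrative augmentation, not CoulombFitTransformer.realize).
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(cm.shape[0])
    return cm[np.ix_(perm, perm)]

cm = np.random.rand(5, 5)
cm = (cm + cm.T) / 2  # make it symmetric like a Coulomb matrix
print(permute_coulomb_matrix(cm).shape)  # (5, 5)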
- normalize(X: numpy.ndarray) numpy.ndarray [source]¶
Normalize features.
- Parameters
X (np.ndarray) – Features
- Returns
X – Normalized features
- Return type
np.ndarray
- expand(X: numpy.ndarray) numpy.ndarray [source]¶
Binarize features.
- Parameters
X (np.ndarray) – Features
- Returns
X – Binarized features
- Return type
np.ndarray
- X_transform(X: numpy.ndarray) numpy.ndarray [source]¶
Perform Coulomb Fit transform on features.
- Parameters
X (np.ndarray) – Features
- Returns
X – Transformed features
- Return type
np.ndarray
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Performs randomization and binarization operations on data.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
IRVTransformer¶
- class IRVTransformer(K: int, n_tasks: int, dataset: deepchem.data.datasets.Dataset)[source]¶
Performs a transform from ECFP features to IRV features (K nearest neighbors).
This transformer is required by MultitaskIRVClassifier as a preprocessing step before training.
Examples
Let’s start by defining the parameters of the dataset we’re about to transform.
>>> n_feat = 128
>>> N = 20
>>> n_tasks = 2
Let’s now make our dataset object
>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.randint(2, size=(N, n_feat))
>>> y = np.zeros((N, n_tasks))
>>> w = np.ones((N, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w)
And let’s apply our transformer with 10 nearest neighbors.
>>> K = 10
>>> trans = dc.trans.IRVTransformer(K, n_tasks, dataset)
>>> dataset = trans.transform(dataset)
Note
This class requires TensorFlow to be installed.
- __init__(K: int, n_tasks: int, dataset: deepchem.data.datasets.Dataset)[source]¶
Initializes IRVTransformer.
- Parameters
K (int) – Number of nearest neighbours to count
n_tasks (int) – Number of tasks
dataset (dc.data.Dataset object) – The training dataset
- realize(similarity: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray) List [source]¶
Find the samples with the top K similarity values in the reference dataset.
- Parameters
similarity (np.ndarray) – Similarity values between the target dataset and the reference dataset; should have shape (n_samples_in_target, n_samples_in_reference)
y (np.array) – Labels for a single task
w (np.array) – Weights for a single task
- Returns
features – A list of length n_samples of np.arrays of size (2*K,); each array includes the K top similarity values and the corresponding labels
- Return type
list
- X_transform(X_target: numpy.ndarray) numpy.ndarray [source]¶
Calculates the similarity between the target dataset (X_target) and the reference dataset (X) as #(1s in intersection) / #(1s in union), i.e. similarity = (X_target intersect X) / (X_target union X).
- Parameters
X_target (np.ndarray) – Fingerprints of the target dataset; should have the same length as X along the second axis
- Returns
X_target – Features of shape (batch_size, 2*K*n_tasks)
- Return type
np.ndarray
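The similarity measure described above is essentially a Jaccard (Tanimoto) index over binary fingerprints; a vectorized sketch of that measure (an illustration, not the library implementation):
import numpy as np

def tanimoto_similarity(X_target, X):
    # Count of shared on-bits over count of on-bits in the union, for every
    # (target row, reference row) pair.
    intersection = X_target @ X.T
    union = X_target.sum(axis=1, keepdims=True) + X.sum(axis=1) - intersection
    return intersection / np.maximum(union, 1)

X = np.random.randint(2, size=(20, 128))
X_target = np.random.randint(2, size=(4, 128))
print(tanimoto_similarity(X_target, X).shape)  # (4, 20)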
- static matrix_mul(X1, X2, shard_size=5000)[source]¶
Calculates a matrix multiplication for large matrices. X1 and X2 are sliced into pieces of shard_size rows (columns), multiplied piece by piece, and the results are concatenated to the proper size.
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) Union[deepchem.data.datasets.DiskDataset, deepchem.data.datasets.NumpyDataset] [source]¶
Transforms a given dataset
- Parameters
dataset (Dataset) – Dataset to transform
parallel (bool, optional, (default False)) – Whether to parallelize this transformation. Currently ignored.
out_dir (str, optional (default None)) – Directory to write resulting dataset.
- Returns
Dataset object that is transformed.
- Return type
DiskDataset or NumpyDataset
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
DAGTransformer¶
- class DAGTransformer(max_atoms: int = 50)[source]¶
Performs transform from ConvMol adjacency lists to DAG calculation orders
This transformer is used by DAGModel before training to transform its inputs to the correct shape. This expansion turns a molecule with n atoms into n DAGs, each with root at a different atom in the molecule.
Examples
Let’s transform a small dataset of molecules.
>>> import numpy as np
>>> feat = dc.feat.ConvMolFeaturizer()
>>> X = feat(["C", "CC"])
>>> y = np.random.rand(len(X))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> trans = dc.trans.DAGTransformer(max_atoms=5)
>>> dataset = trans.transform(dataset)
- __init__(max_atoms: int = 50)[source]¶
Initializes DAGTransformer.
- Parameters
max_atoms (int, optional (Default 50)) – Maximum number of atoms to allow
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- UG_to_DAG(sample: deepchem.feat.mol_graphs.ConvMol) List [source]¶
This function generates the DAGs for a molecule
- Parameters
sample (ConvMol) – Molecule to transform
- Returns
List of parent adjacency matrices
- Return type
List
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
RxnSplitTransformer¶
- class RxnSplitTransformer(sep_reagent: bool = True, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Splits the reaction SMILES input into the source and target strings required for machine translation tasks.
The input is expected to be in the form reactant>reagent>product. The source string would be reactants>reagents and the target string would be the products.
The transformer can also separate the reagents from the reactants for a mixed training mode. During mixed training, the source string is transformed from reactants>reagent to reactants.reagent> . This can be toggled (default True) by setting the value of sep_reagent while calling the transformer.
Examples
>>> # When mixed training is toggled.
>>> import numpy as np
>>> from deepchem.trans.transformers import RxnSplitTransformer
>>> reactions = np.array(["CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]>CC(C)CC(=O)c1ccc(O)nc1","CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>>CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21"], dtype=object)
>>> trans = RxnSplitTransformer(sep_reagent=True)
>>> split_reactions = trans.transform_array(X=reactions, y=np.array([]), w=np.array([]), ids=np.array([]))
>>> split_reactions
(array([['CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]',
        'CC(C)CC(=O)c1ccc(O)nc1'],
       ['CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>',
        'CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21']], dtype='<U51'), array([], dtype=float64), array([], dtype=float64), array([], dtype=float64))
When mixed training is disabled, you get the following outputs:
>>> trans_disable = RxnSplitTransformer(sep_reagent=False)
>>> split_reactions = trans_disable.transform_array(X=reactions, y=np.array([]), w=np.array([]), ids=np.array([]))
>>> split_reactions
(array([['CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1.C1CCOC1.[Cl-]>',
        'CC(C)CC(=O)c1ccc(O)nc1'],
       ['CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>',
        'CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21']], dtype='<U51'), array([], dtype=float64), array([], dtype=float64), array([], dtype=float64))
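The split itself can be sketched in a few lines of plain string handling (an assumption about the behavior shown above, not the library source):
reaction = "CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]>CC(C)CC(=O)c1ccc(O)nc1"
reactants, reagents, product = reaction.split(">")
sep_reagent = True
if sep_reagent:
    # Keep reagents separated from the reactants: "reactants>reagents".
    source = reactants + ">" + reagents
else:
    # Mix the reagents into the reactants: "reactants.reagents>".
    source = (reactants + "." + reagents if reagents else reactants) + ">"
target = product
print(source)  # CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>C1CCOC1.[Cl-]
print(target)  # CC(C)CC(=O)c1ccc(O)nc1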
Note
This class only transforms the feature field of a reaction dataset like USPTO.
- __init__(sep_reagent: bool = True, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Initializes the Reaction split Transformer.
- Parameters
sep_reagent (bool, optional (default True)) – To separate the reagent and reactants for training.
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed.
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features (the reactions)
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of ids.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
Base Transformer (for develop)¶
The dc.trans.Transformer
class is the abstract parent class
for all transformers. This class should never be directly initialized,
but contains a number of useful method implementations.
- class Transformer(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Abstract base class for different data transformation techniques.
A transformer is an object that applies a transformation to a given dataset. Think of a transformation as a mathematical operation which makes the source dataset more amenable to learning. For example, one transformer could normalize the features for a dataset (ensuring they have zero mean and unit standard deviation). Another transformer could for example threshold values in a dataset so that values outside a given range are truncated. Yet another transformer could act as a data augmentation routine, generating multiple different images from each source datapoint (a transformation need not necessarily be one to one).
Transformers are designed to be chained, since data pipelines often chain multiple different transformations to a dataset. Transformers are also designed to be scalable and can be applied to large dc.data.Dataset objects. Note that Transformers are not usually thread-safe, so you will have to be careful when processing very large datasets.
This class is an abstract superclass that isn't meant to be directly instantiated. Instead, you will want to instantiate one of its subclasses in order to perform concrete transformations.
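For developers, the typical pattern is to subclass Transformer, set the appropriate transform_* flags in __init__, and override transform_array (and untransform when the operation is invertible). Below is a minimal illustrative subclass; it is hypothetical and not part of the library:
import numpy as np
import deepchem as dc

class ScaleFeaturesTransformer(dc.trans.Transformer):
    """Hypothetical transformer that multiplies every feature by a constant."""

    def __init__(self, scale=2.0, dataset=None):
        self.scale = scale
        super(ScaleFeaturesTransformer, self).__init__(transform_X=True, dataset=dataset)

    def transform_array(self, X, y, w, ids):
        # Scale features; leave labels, weights, and ids untouched.
        return self.scale * X, y, w, ids

    def untransform(self, z):
        return z / self.scale

X = np.random.rand(4, 3)
dataset = dc.data.NumpyDataset(X)
scaled = ScaleFeaturesTransformer(scale=2.0).transform(dataset)
assert np.allclose(scaled.X, 2.0 * X)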
- __init__(transform_X: bool = False, transform_y: bool = False, transform_w: bool = False, transform_ids: bool = False, dataset: Optional[deepchem.data.datasets.Dataset] = None)[source]¶
Initializes transformation based on dataset statistics.
- Parameters
transform_X (bool, optional (default False)) – Whether to transform X
transform_y (bool, optional (default False)) – Whether to transform y
transform_w (bool, optional (default False)) – Whether to transform w
transform_ids (bool, optional (default False)) – Whether to transform ids
dataset (dc.data.Dataset object, optional (default None)) – Dataset to be transformed
- transform(dataset: deepchem.data.datasets.Dataset, parallel: bool = False, out_dir: Optional[str] = None, **kwargs) deepchem.data.datasets.Dataset [source]¶
Transforms all internally stored data in dataset.
This method transforms all internal data in the provided dataset by using the Dataset.transform method. Note that this method adds X-transform, y-transform columns to metadata. Specified keyword arguments are passed on to Dataset.transform.
- Parameters
dataset (dc.data.Dataset) – Dataset object to be transformed.
parallel (bool, optional (default False)) – if True, use multiple processes to transform the dataset in parallel. For large datasets, this might be faster.
out_dir (str, optional) – If out_dir is specified in kwargs and dataset is a DiskDataset, the output dataset will be written to the specified directory.
- Returns
A newly transformed Dataset object
- Return type
Dataset
- transform_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transform the data in a set of (X, y, w, ids) arrays.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- transform_on_array(X: numpy.ndarray, y: numpy.ndarray, w: numpy.ndarray, ids: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Transforms numpy arrays X, y, and w
DEPRECATED. Use transform_array instead.
- Parameters
X (np.ndarray) – Array of features
y (np.ndarray) – Array of labels
w (np.ndarray) – Array of weights.
ids (np.ndarray) – Array of identifiers.
- Returns
Xtrans (np.ndarray) – Transformed array of features
ytrans (np.ndarray) – Transformed array of labels
wtrans (np.ndarray) – Transformed array of weights
idstrans (np.ndarray) – Transformed array of ids
- untransform(transformed: numpy.ndarray) numpy.ndarray [source]¶
Reverses stored transformation on provided data.
Depending on whether transform_X or transform_y or transform_w was set, this will perform different un-transformations. Note that this method may not always be defined since some transformations aren’t 1-1.
- Parameters
transformed (np.ndarray) – Array which was previously transformed by this class.