DeepChem Tutorial

If you’re new to DeepChem, you probably want to know the basics. What is DeepChem? Why should you care about using it? The short answer is that DeepChem is a scientific machine learning library. (The “Chem” reflects the fact that DeepChem initially focused on chemical applications, but we aim to support scientific applications more broadly.)

Why would you want to use DeepChem instead of another machine learning library? Simply put, DeepChem maintains an extensive collection of utilities for scientific deep learning, including classes for loading scientific datasets, processing them, transforming them, splitting them up, and learning from them. Behind the scenes, DeepChem uses a variety of other machine learning frameworks such as sklearn, tensorflow, and xgboost. We are also experimenting with models implemented in pytorch and jax. Our focus is to facilitate scientific experimentation using whatever tools are at hand.

In the rest of this tutorial, we’ll provide a rapid-fire overview of DeepChem’s API. DeepChem is a big library, so we won’t cover everything, but we should give you enough to get started.

Quickstart

If you’re new, you can install DeepChem on a new machine with the following commands:

pip install tensorflow==2.2.0
pip install --pre deepchem

DeepChem is under very active development at present, so we recommend using the nightly build until the next major release. Note that to use DeepChem for chemistry applications, you will also have to install RDKit using conda:

conda install -y -c conda-forge rdkit

Datasets

The dc.data module contains utilities for handling Dataset objects, which are the heart of DeepChem. A Dataset is an abstraction of a machine learning dataset: a collection of features, labels, and weights, alongside associated identifiers. Rather than explaining further, we’ll just show you.

>>> import deepchem as dc
>>> import numpy as np
>>> N_samples = 50
>>> n_features = 10
>>> X = np.random.rand(N_samples, n_features)
>>> y = np.random.rand(N_samples)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> dataset.X.shape
(50, 10)
>>> dataset.y.shape
(50,)

Here we’ve used the NumpyDataset class, which stores datasets in memory. This works fine for smaller datasets and is very convenient for experimentation, but it is less practical for larger datasets. For those we have the DiskDataset class.

>>> dataset = dc.data.DiskDataset.from_numpy(X, y)
>>> dataset.X.shape
(50, 10)
>>> dataset.y.shape
(50,)

In this example we haven’t specified a data directory, so this DiskDataset is written to a temporary folder. Note that dataset.X and dataset.y load data from disk under the hood, so accessing them can get very expensive for larger datasets.

More Tutorials

DeepChem maintains an extensive collection of additional tutorials that are meant to be run on Google Colab, an online platform that allows you to execute Jupyter notebooks. Once you’ve finished this introductory tutorial, we recommend working through these more involved tutorials.