Session 19: RDKit#
In this session, we will be looking at RDKit, an open-source cheminformatics toolkit. We will learn how to approach using RDKit in Python to visualize molecules and compute basic chemical properties
Part 1: Basics#
Before we begin, make sure you have installed RDKit with pip install rdkit
in your terminal.
Task 1#
In this task you will create a simple molecule (ethanol) from its SMILES and InChI representations and render its structure as an image.
Create a new file
rdkit_intro.py
in your session folder.Copy and paste the following code:
from rdkit import Chem from rdkit.Chem import Draw # Create an ethanol molecule object from a SMILES string smiles = "CCO" molecule = Chem.MolFromSmiles(smiles) # Generate an image of the molecule and display it img = Draw.MolToImage(molecule, size=(300, 300)) img.show() # Alternatively, use InChI and save the image to a file instead inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3" molecule = Chem.MolFromInchi(inchi) img = Draw.MolToImage(molecule, size=(300, 300)) img.save("ethanol.png")
Review the code and see if you can figure out what it is doing.
Run the code and compare to your thoughts in the previous step.
Analysis#
There’s actually very “little” going on here in terms of code, and it demonstrates the power of using RDKit while just scratching the surface of what it can do.
Importing RDKit: Since it is broken down into several modules, we actually need to import a couple of things that we will be using.
rdkit.Chem
provides the majority of the basic functionality we need. Of particular importance is therdchem.Mol
class, which combines all of the atoms, bonds, and chemical properties of the molecules we want to investigate. It also provides a whole host of useful functions to manipulate the molecule and calculate various quantities; you can read about these in the RDKit documentation.rdkit.Chem.Draw
needs to be imported separately to allow us to create an image object to display and/or save.
SMILES and InChI: In the script, we use two different string representations of ethanol, which are commonly used to share chemical formulas in computing.
It is beyond the scope of this module to go into detail of these, but below is a very brief explanation of the SMILES version of ethanol.
The string
CCO
represents ethanol in SMILES notation, where eachC
represents a carbon atom andO
represents an oxygen atom.The hydrogens are implicitly understood by RDKit (and other software), which knows carbon typically forms 4 bonds and oxygen forms 2 bonds.
Creating Molecule Objects: The next step is to take our string representation of our molecule and convert it into an RDKit molecule object.
Chem.MolFromSmiles()
andChem.MolFromInchi()
are the two function that do this, taking the SMILES or InChI strings, respectively.While you don’t need to understand what RDKit is doing under the hood, it is essentially creating an object with all of the atoms, bonds, and implicit features needed.
This molecule object is very powerful and will allow us to do all sorts of things, some of which we are about to look at.
Rendering (drawing) the Molecule: The final bit(s) of code essentially take the molecule object we have created and use the
Chem.Draw
module to render it into an image.In the first instance we are just using the
show()
function to display the image as we have done in a previous session.The second time we are actually saving the image as a file instead, using much the same method as earlier.
You should expect the two images to be the same (as the SMILES and InChI given in this example represent the same structure!), and indeed they should be.
Programming Challenge
Create a little program to accept a SMILES string as an input from the user, as well as a filename. Then, save the image of the molecule according to the given filename and print a confirmation message with the full image filepath.
Task 2#
Create a new file called
rdkit_properties.py
.Past in the following code:
from rdkit import Chem from rdkit.Chem import Descriptors # Create a benzene molecule object from a SMILES string smiles = "c1ccccc1" molecule = Chem.MolFromSmiles(smiles) # Calculate molecular weight and LogP mol_weight = Descriptors.MolWt(molecule) logp = Descriptors.MolLogP(molecule) print(f"Molecular Weight of benzene: {mol_weight:.2f} g/mol") print(f"LogP of benzene: {logp:.2f}")
Run the script and check the output in your terminal.
Analysis#
Think back to your own molecular weight calculator - hopefully you can see the power in using existing software packages to allow you to focus on actual problems you want to solve.
SMILES Representation of Benzene The SMILES string “c1ccccc1” represents benzene:
Lowercase
c
indicates aromatic carbon atoms (as opposed to uppercaseC
for aliphatic carbon)The numbers
1
serve as “ring closures” - they indicate where the ring connects back to itselfThis creates a six-membered aromatic carbon ring, which is benzene
RDKit Descriptors Module: The
rdkit.Chem.Descriptors
module provides access to a wide range of molecular property calculationsThere are over 200 different descriptors allowing users to skip writing complex algorithms themselves.
Molecular Calculations: Here we are calculating two different example properties:
Descriptors.MolWt()
allows us to calculate the molecular weight of the molecule. It does basically the same thing you have already written yourself, using an internal periodic table for reference.Descriptors.MolLogP()
estimates the octanol-water partition coefficient. This is a measure of a molecule’s lipophilicity and is a significant property used in drug discovery.Doing these for a single molecule seems a bit trivial, but we can leverage Python to do this for us at scale.
Programming Challenge
Recreate your molecular weight calculator program, this time using RDKit to simplify your code.
Part 2: More advanced usage#
In this part, we will explore some more complex uses of RDKit. You will learn how to generate and optimize 3D conformers as well as compute molecular similarities using fingerprints. It is only beginning to scratch the surface of what RDKit can help you create!
Task 3#
Before starting this task, make sure to install py3Dmol by entering pip install py3dmol
in your terminal.
Open the
rdkit_3d.ipynb
Jupyter Notebook.Run the cell to see a 3D visualization of a molecule appear.
Change some settings to see how the visualization changes.
Analysis#
Here we are leveraging the py3Dmol package to visualize the molecule we have imported using RDKit. Since it generates a JavaScript viewer, it will only work in a Jupyter notebook; a great example of why you might want to use the latter as opposed to an IDE in certain situations.
Importing our molecule: After importing our packages, we import the SMILES for caffeine using the
MolFromSmiles()
method we have seen before.The main difference here is that we will be optimizing our structure and so need to make sure to add hydrogens using the
AddHs()
function.RDKit will automatically work out how many hydrogens to add, and where.
Try commenting this step out and see what changes about the output!
Optimize the structure: The next step is to take our molecule object and actually put each of its constituent atoms in 3D space, determining their positions.
The first step here is to use
EmbedMolecule()
to assign coordinates to the atom.The next step uses
UFFOptimizeMolecule()
to apply the Universal Force Field (UFF) to determine a more accurate structure for the molecule.The UFF is a general-purpose computational model that we can apply to a molecule to give a “good” first estimate of the optimized positions of the atoms in a molecule. It is lightweight enough to run on our local computer.
Force fields are generally used to get a good starting position for more accurate calculations (e.g., DFT) which take a much longer time and require large compute clusters to run.
Finally, we create a variable containing the x-, y-, and z-positions of each atom in XYZ format which py3Dmol can read, using the
MolToXYZBlock()
function.
Display the molecule: The final step is to actually display our molecule, which we do by creating a
py3Dmol.view
object.We then need to add caffeine to our view using
addModel()
and set the style usingsetStyle()
Note that py3Dmol uses 3Dmol.js for its actual visualizer, so you may need to use its documentation for certain aspects.
Finally,
zoomTo()
ensures we are focussed on our molecule andshow()
creates the visual display.
Task 4#
Create a new file named
rdkit_similarity.py
in your session folder.Paste in the following code:
from rdkit import Chem from rdkit.Chem import AllChem, DataStructs # Define two molecules using SMILES strings smiles_ethanol = "CCO" # ethanol smiles_propanol = "CCCO" # propanol ethanol = Chem.MolFromSmiles(smiles_ethanol) propanol = Chem.MolFromSmiles(smiles_propanol) # Generate fingerprints for both molecules fpgen = AllChem.GetRDKitFPGenerator() fp_ethanol = fpgen.GetFingerprint(ethanol) fp_propanol = fpgen.GetFingerprint(propanol) # Compute the Tanimoto similarity between the two fingerprints similarity_tanimoto = DataStructs.TanimotoSimilarity(fp_ethanol, fp_propanol) similarity_dice = DataStructs.DiceSimilarity(fp_ethanol, fp_propanol) print(f"Tanimoto similarity between ethanol and propanol: {similarity_tanimoto:.2f}") print(f"Dice similarity between ethanol and propanol: {similarity_dice:.2f}")
Run the script.
Find the definitions for Tanimoto and Dice similarities of molecules, and see if you can calculate the same values.
Analysis#
In this example, we’re exploring how RDKit can be used to calculate molecular similarity - a key concept in computational drug discovery and cheminformatics.
Creating Molecular Objects: Similar to previous tasks, we create the necessary objects using
Chem.MolFromSmiles()
Generating Molecular Fingerprints: Next we want to create some “fingerprints” of our two molecules.
First, we need a fingerprint generator object, which is created with
AllChem.GetRDKitFPGenerator()
.This function essentially creates a vectorized representation (fingerprint) of each of the molecules, with each structural pattern being represented as part of the fingerprint.
Calculating Similarity Metrics: Once we have fingerprints, we can apply mathematical metrics to compare them.
DataStructs.TanimotoSimilarity()
calculates similarity using the Tanimoto coefficient.DataStructs.DiceSimilarity()
uses the Dice coefficient instead.Both return values between 0 (completely different) and 1 (identical). Other similarity coefficients exist, but these are all beyond the scope of this module.
Programming Challenge
Create a program that analyzes the file compounds.txt
in your session folder to:
Determine which 3 pairs of compounds are most similar based on their Tanimoto similarity, in descending order.
Generate and visualize 2D structures of the most similar pair in a single image, including appropriate labels.
Summary#
By the end of this session, you will have:
Created and visualized molecular structures using RDKit, imported via SMILES/InChI representations
Calculated some properties such as molecular weight and LogP, replacing manual calculations
Generated and viewed 3D molecular structures with py3Dmol
Used more advanced data structures within RDKit to determine the molecular similarity between two molecules