Session 19: RDKit#

In this session, we will be looking at RDKit, an open-source cheminformatics toolkit. We will learn how to approach using RDKit in Python to visualize molecules and compute basic chemical properties


Part 1: Basics#

Before we begin, make sure you have installed RDKit with pip install rdkit in your terminal.

Task 1#

In this task you will create a simple molecule (ethanol) from its SMILES and InChI representations and render its structure as an image.

  1. Create a new file rdkit_intro.py in your session folder.

  2. Copy and paste the following code:

    from rdkit import Chem
    from rdkit.Chem import Draw
    
    # Create an ethanol molecule object from a SMILES string
    smiles = "CCO"
    molecule = Chem.MolFromSmiles(smiles)
    
    # Generate an image of the molecule and display it
    img = Draw.MolToImage(molecule, size=(300, 300))
    img.show()
    
    # Alternatively, use InChI and save the image to a file instead
    inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    molecule = Chem.MolFromInchi(inchi)
    img = Draw.MolToImage(molecule, size=(300, 300))
    img.save("ethanol.png")
    
  3. Review the code and see if you can figure out what it is doing.

  4. Run the code and compare to your thoughts in the previous step.

Analysis#

There’s actually very “little” going on here in terms of code, and it demonstrates the power of using RDKit while just scratching the surface of what it can do.

  • Importing RDKit: Since it is broken down into several modules, we actually need to import a couple of things that we will be using.

    • rdkit.Chem provides the majority of the basic functionality we need. Of particular importance is the rdchem.Mol class, which combines all of the atoms, bonds, and chemical properties of the molecules we want to investigate. It also provides a whole host of useful functions to manipulate the molecule and calculate various quantities; you can read about these in the RDKit documentation.

    • rdkit.Chem.Draw needs to be imported separately to allow us to create an image object to display and/or save.

  • SMILES and InChI: In the script, we use two different string representations of ethanol, which are commonly used to share chemical formulas in computing.

    • It is beyond the scope of this module to go into detail of these, but below is a very brief explanation of the SMILES version of ethanol.

      • The string CCO represents ethanol in SMILES notation, where each C represents a carbon atom and O represents an oxygen atom.

      • The hydrogens are implicitly understood by RDKit (and other software), which knows carbon typically forms 4 bonds and oxygen forms 2 bonds.

  • Creating Molecule Objects: The next step is to take our string representation of our molecule and convert it into an RDKit molecule object.

    • Chem.MolFromSmiles() and Chem.MolFromInchi() are the two function that do this, taking the SMILES or InChI strings, respectively.

    • While you don’t need to understand what RDKit is doing under the hood, it is essentially creating an object with all of the atoms, bonds, and implicit features needed.

    • This molecule object is very powerful and will allow us to do all sorts of things, some of which we are about to look at.

  • Rendering (drawing) the Molecule: The final bit(s) of code essentially take the molecule object we have created and use the Chem.Draw module to render it into an image.

    • In the first instance we are just using the show() function to display the image as we have done in a previous session.

    • The second time we are actually saving the image as a file instead, using much the same method as earlier.

    • You should expect the two images to be the same (as the SMILES and InChI given in this example represent the same structure!), and indeed they should be.

Programming Challenge

Create a little program to accept a SMILES string as an input from the user, as well as a filename. Then, save the image of the molecule according to the given filename and print a confirmation message with the full image filepath.

Task 2#

  1. Create a new file called rdkit_properties.py.

  2. Past in the following code:

    from rdkit import Chem
    from rdkit.Chem import Descriptors
    
    # Create a benzene molecule object from a SMILES string
    smiles = "c1ccccc1"
    molecule = Chem.MolFromSmiles(smiles)
    
    # Calculate molecular weight and LogP
    mol_weight = Descriptors.MolWt(molecule)
    logp = Descriptors.MolLogP(molecule)
    
    print(f"Molecular Weight of benzene: {mol_weight:.2f} g/mol")
    print(f"LogP of benzene: {logp:.2f}")
    
  3. Run the script and check the output in your terminal.

Analysis#

Think back to your own molecular weight calculator - hopefully you can see the power in using existing software packages to allow you to focus on actual problems you want to solve.

  • SMILES Representation of Benzene The SMILES string “c1ccccc1” represents benzene:

    • Lowercase c indicates aromatic carbon atoms (as opposed to uppercase C for aliphatic carbon)

    • The numbers 1 serve as “ring closures” - they indicate where the ring connects back to itself

    • This creates a six-membered aromatic carbon ring, which is benzene

  • RDKit Descriptors Module: The rdkit.Chem.Descriptors module provides access to a wide range of molecular property calculations

    • There are over 200 different descriptors allowing users to skip writing complex algorithms themselves.

  • Molecular Calculations: Here we are calculating two different example properties:

    • Descriptors.MolWt() allows us to calculate the molecular weight of the molecule. It does basically the same thing you have already written yourself, using an internal periodic table for reference.

    • Descriptors.MolLogP() estimates the octanol-water partition coefficient. This is a measure of a molecule’s lipophilicity and is a significant property used in drug discovery.

    • Doing these for a single molecule seems a bit trivial, but we can leverage Python to do this for us at scale.

Programming Challenge

Recreate your molecular weight calculator program, this time using RDKit to simplify your code.

Part 2: More advanced usage#

In this part, we will explore some more complex uses of RDKit. You will learn how to generate and optimize 3D conformers as well as compute molecular similarities using fingerprints. It is only beginning to scratch the surface of what RDKit can help you create!

Task 3#

Before starting this task, make sure to install py3Dmol by entering pip install py3dmol in your terminal.

  1. Open the rdkit_3d.ipynb Jupyter Notebook.

  2. Run the cell to see a 3D visualization of a molecule appear.

  3. Change some settings to see how the visualization changes.

Analysis#

Here we are leveraging the py3Dmol package to visualize the molecule we have imported using RDKit. Since it generates a JavaScript viewer, it will only work in a Jupyter notebook; a great example of why you might want to use the latter as opposed to an IDE in certain situations.

  • Importing our molecule: After importing our packages, we import the SMILES for caffeine using the MolFromSmiles() method we have seen before.

    • The main difference here is that we will be optimizing our structure and so need to make sure to add hydrogens using the AddHs() function.

    • RDKit will automatically work out how many hydrogens to add, and where.

      • Try commenting this step out and see what changes about the output!

  • Optimize the structure: The next step is to take our molecule object and actually put each of its constituent atoms in 3D space, determining their positions.

    • The first step here is to use EmbedMolecule() to assign coordinates to the atom.

    • The next step uses UFFOptimizeMolecule() to apply the Universal Force Field (UFF) to determine a more accurate structure for the molecule.

      • The UFF is a general-purpose computational model that we can apply to a molecule to give a “good” first estimate of the optimized positions of the atoms in a molecule. It is lightweight enough to run on our local computer.

      • Force fields are generally used to get a good starting position for more accurate calculations (e.g., DFT) which take a much longer time and require large compute clusters to run.

    • Finally, we create a variable containing the x-, y-, and z-positions of each atom in XYZ format which py3Dmol can read, using the MolToXYZBlock() function.

  • Display the molecule: The final step is to actually display our molecule, which we do by creating a py3Dmol.view object.

    • We then need to add caffeine to our view using addModel() and set the style using setStyle()

      • Note that py3Dmol uses 3Dmol.js for its actual visualizer, so you may need to use its documentation for certain aspects.

    • Finally, zoomTo() ensures we are focussed on our molecule and show() creates the visual display.

Task 4#

  1. Create a new file named rdkit_similarity.py in your session folder.

  2. Paste in the following code:

    from rdkit import Chem
    from rdkit.Chem import AllChem, DataStructs
    
    # Define two molecules using SMILES strings
    smiles_ethanol = "CCO"  # ethanol
    smiles_propanol = "CCCO"  # propanol
    
    ethanol = Chem.MolFromSmiles(smiles_ethanol)
    propanol = Chem.MolFromSmiles(smiles_propanol)
    
    # Generate fingerprints for both molecules
    fpgen = AllChem.GetRDKitFPGenerator()
    fp_ethanol = fpgen.GetFingerprint(ethanol)
    fp_propanol = fpgen.GetFingerprint(propanol)
    
    # Compute the Tanimoto similarity between the two fingerprints
    similarity_tanimoto = DataStructs.TanimotoSimilarity(fp_ethanol, fp_propanol)
    similarity_dice = DataStructs.DiceSimilarity(fp_ethanol, fp_propanol)
    print(f"Tanimoto similarity between ethanol and propanol: {similarity_tanimoto:.2f}")
    print(f"Dice similarity between ethanol and propanol: {similarity_dice:.2f}")
    
  3. Run the script.

  4. Find the definitions for Tanimoto and Dice similarities of molecules, and see if you can calculate the same values.

Analysis#

In this example, we’re exploring how RDKit can be used to calculate molecular similarity - a key concept in computational drug discovery and cheminformatics.

  • Creating Molecular Objects: Similar to previous tasks, we create the necessary objects using Chem.MolFromSmiles()

  • Generating Molecular Fingerprints: Next we want to create some “fingerprints” of our two molecules.

    • First, we need a fingerprint generator object, which is created with AllChem.GetRDKitFPGenerator().

    • This function essentially creates a vectorized representation (fingerprint) of each of the molecules, with each structural pattern being represented as part of the fingerprint.

  • Calculating Similarity Metrics: Once we have fingerprints, we can apply mathematical metrics to compare them.

    • DataStructs.TanimotoSimilarity() calculates similarity using the Tanimoto coefficient.

    • DataStructs.DiceSimilarity() uses the Dice coefficient instead.

    • Both return values between 0 (completely different) and 1 (identical). Other similarity coefficients exist, but these are all beyond the scope of this module.

Programming Challenge

Create a program that analyzes the file compounds.txt in your session folder to:

  1. Determine which 3 pairs of compounds are most similar based on their Tanimoto similarity, in descending order.

  2. Generate and visualize 2D structures of the most similar pair in a single image, including appropriate labels.

Summary#

By the end of this session, you will have:

  • Created and visualized molecular structures using RDKit, imported via SMILES/InChI representations

  • Calculated some properties such as molecular weight and LogP, replacing manual calculations

  • Generated and viewed 3D molecular structures with py3Dmol

  • Used more advanced data structures within RDKit to determine the molecular similarity between two molecules