Appendix 2: Molecular Weight Calculator - Walkthrough

This notebook documents the process of developing some code for the Molecular Weight Calculator mid-session mini project.

It is just one way in which this could be done; there are an almost infinite number of alternative approaches that could be just as good (or better). The aim is to illustrate the approach of dividing the problem into smaller problems and iteratively building the code to solve each of them.

First steps - analyse the problem, break it down into logical sub-problems.

1. We need some logic to parse the input string into manageable chunks.

A two-stage process of chopping up the input molecular formula could be a way to go, e.g.:

Create a list (formula_list) from the string (formula_string), e.g.:

"C10H12O" --> ["C10", "H12", "O"]

Create a dictionary (formula_dict) from formula_list, e.g.:

["C10", "H12", "O"] --> {"C": 10, "H": 12, "O": 1}

2. We need a database of elements and their masses

A dictionary with element symbols and element masses could be very useful. Actually this has pretty much been covered in previous sessions - e.g. Session 7 included loading data from a file called all_elements.txt - just what we need. We will write code to import this into a dictionary, so, for example:

element_mass['C'] = 12.0
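
To make the end goal concrete, here is a minimal sketch of how the two dictionaries are meant to combine. The masses are approximate values typed in by hand purely for illustration (they are an assumption, not the contents of all_elements.txt), and the variable name total is hypothetical:

```python
# Illustrative values only: approximate atomic masses typed in by hand,
# not loaded from all_elements.txt
element_mass = {"C": 12.011, "H": 1.008, "O": 15.999}

# The dictionary we hope to build from the formula "C10H12O"
formula_dict = {"C": 10, "H": 12, "O": 1}

total = 0.0
for symbol, count in formula_dict.items():
    # mass contribution of one element = mass per atom * number of atoms
    total += element_mass[symbol] * count

print(round(total, 3))
```

Everything that follows is about producing these two dictionaries automatically rather than by hand.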

Subproblem 1: from formula_string to formula_list

Eventually the code will be in a standalone Python script run from the command line, but developing in a Jupyter environment makes it easier to play with ideas.

As discussed above, the formula_list object will be chopped-up bits of formula_string. This sounds like a string slicing task. We need code that will identify where each chunk should start and end.

The start-points for each chunk are easy - they are the indices of the upper-case characters in formula_string.

The end-points are quite easy too - remember that in Python formula_string[3:7] means “the characters in formula_string starting at index 3 and finishing just before index 7”. The “just before” bit is the key - the end-point of each chunk is the same as the start-point of the next.

But what about the last chunk? The end-point for that will be the length of formula_string.
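
As a quick illustration of the slicing rule, the sketch below slices the example formula by hand. The indices are worked out by eye here, and the names first_chunk, second_chunk and last_chunk are just for this demonstration:

```python
formula_string = "C10H12O"

# Upper-case letters sit at indices 0, 3 and 6, so those are the start-points.
first_chunk = formula_string[0:3]    # indices 0, 1, 2
second_chunk = formula_string[3:6]   # indices 3, 4, 5
# The last chunk ends at the length of the string
last_chunk = formula_string[6:len(formula_string)]

print(first_chunk, second_chunk, last_chunk)  # prints: C10 H12 O
```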

OK, time to start writing some code:

formula_string = "C10H12O" # an example to play with...
start_points = []
n_characters = len(formula_string) # number of characters in the string
for i in range(n_characters): # loop over all the characters
    if formula_string[i].isupper():
        start_points.append(i) # add the index to start_points
        
print(start_points) # check it's done what we want...

[0, 3, 6]

OK - that looks like a fair start - the numbers are the ones we expect. So now let’s get the corresponding end-points for each chunk:

end_points = start_points[1:] # end point of each chunk is start point of the next
end_points.append(n_characters) # special case for the last chunk

print(end_points)

[3, 6, 7]

Again this looks good - the numbers are what we are expecting. Now we can solve the first sub-problem and generate formula_list:

formula_list = [] # start with an empty list
n_chunks = len(start_points) # the number of chunks
for i in range(n_chunks):
    formula_list.append(formula_string[start_points[i]:end_points[i]])
    
print(formula_list) # print it out to check

['C10', 'H12', 'O']

It works!

OK, let’s take the snippets of code above and re-write them in the cell below as a single chunk of code. At the same time we will test them on a different molecular formula to check they still work:

formula_string = "C6H10SO2F"
start_points = []
n_characters = len(formula_string) # number of characters in the string
for i in range(n_characters): # loop over all the characters
    if formula_string[i].isupper():
        start_points.append(i) # add the index to start_points
        
end_points = start_points[1:] # end point of each chunk is start point of the next
end_points.append(n_characters) # special case for the last chunk

formula_list = []
n_chunks = len(start_points) # the number of chunks
for i in range(n_chunks):
    formula_list.append(formula_string[start_points[i]:end_points[i]])
    
print(formula_list) # print it out to check

['C6', 'H10', 'S', 'O2', 'F']

Looking good.

What happens if we try molecular formulas that include 2-letter element symbols?

formula_string = "CaSO4"
start_points = []
n_characters = len(formula_string) # number of characters in the string
for i in range(n_characters): # loop over all the characters
    if formula_string[i].isupper():
        start_points.append(i) # add the index to start_points

end_points = start_points[1:] # end point of each chunk is start point of the next
end_points.append(n_characters) # special case for the last chunk

formula_list = []
n_chunks = len(start_points) # the number of chunks
for i in range(n_chunks):
    formula_list.append(formula_string[start_points[i]:end_points[i]])
    
print(formula_list) # print it out to check

['Ca', 'S', 'O4']

It still works!

As it turns out, the logic used here works fine for 2-letter element symbols too, because a chunk only ever starts at an upper-case letter.
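
For comparison, one of the many alternative approaches is to let a regular expression do the chopping. This is only a sketch of a different technique, not the method used in the rest of this walkthrough; the function name split_formula is hypothetical, and the pattern assumes element symbols have at most one lower-case letter:

```python
import re

def split_formula(formula_string):
    """Split a molecular formula into chunks, e.g. 'CaSO4' -> ['Ca', 'S', 'O4']."""
    # One upper-case letter, an optional lower-case letter, then optional digits
    return re.findall(r"[A-Z][a-z]?\d*", formula_string)

print(split_formula("C10H12O"))    # ['C10', 'H12', 'O']
print(split_formula("C6H10SO2F"))  # ['C6', 'H10', 'S', 'O2', 'F']
print(split_formula("CaSO4"))      # ['Ca', 'S', 'O4']
```

The loop-based version is longer, but it uses only string methods and slicing, so every step can be inspected and printed while developing.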

Subproblem 2: From formula_list to formula_dict

OK, onto the next sub-problem, splitting up the chunks into element symbols and element counts.

Thinking about it, each chunk will be one of four types:

A single upper-case letter - e.g. “F”
A single upper-case letter then a number - e.g. “C10”
A two-letter string - e.g. “Ca”
A two-letter string then a number - e.g. “Br2”

Some logic to check if we have a one- or two-letter symbol in each chunk could go something like this:

Find the length (number of characters) of the chunk.
If it’s 1 then we have a one-letter element symbol.
If it’s greater than 1, then look at the second character.
If it’s a lower-case letter, we have a two-letter element symbol.
Otherwise it’s a number, so we have a one-letter element symbol.

Time to turn this into some code:

chunk = "C10" # something to play with...
n_characters = len(chunk) # number of characters in the chunk
if n_characters == 1:
    one_letter_symbol = True
else:
    if chunk[1].islower():
        one_letter_symbol = False
    else:
        one_letter_symbol = True

print(one_letter_symbol)

True

Edit the cell above, replacing “C10” with other test chunks, and convince yourself that it works reliably.

OK, so with this boolean variable one_letter_symbol, it’s straightforward to work out how the chunk should be divided up into symbol plus element count:

If one_letter_symbol is True and the chunk length is 1, the element symbol is the whole of the chunk and the count is 1.
If one_letter_symbol is True and the chunk length is >1, the element symbol is the first character in the chunk and the remainder is the count.
If one_letter_symbol is False and the chunk length is 2, the element symbol is the whole of the chunk and the count is 1.
If one_letter_symbol is False and the chunk length is >2, the element symbol is the first two characters of the chunk and the remainder is the count.

Let’s put that into some code:

chunk = "Cl2"
n_characters = len(chunk)
one_letter_symbol = False # for now, avoid repeating the code block above
formula_dict = {} # start with an empty dictionary
if one_letter_symbol:
    if n_characters == 1:
        formula_dict[chunk] = 1
    else:
        formula_dict[chunk[0]] = int(chunk[1:]) # convert string to integer
else:
    if n_characters == 2:
        formula_dict[chunk] = 1
    else:
        formula_dict[chunk[:2]] = int(chunk[2:])
        
print(formula_dict)

{'Cl': 2}

Again, experiment with changing the value of chunk in the cell above (remembering to set one_letter_symbol to True or False as required), to check it works no matter what the format of the chunk.

OK - now we can put the code snippets from the cells above together into a complete code block to convert a formula_list into a formula_dict:

formula_list = ['C6', 'H10', 'S', 'O2', 'F']
formula_dict = {} # start with an empty dictionary
for chunk in formula_list:
    # Code to find out if it's a 1- or 2-letter symbol:
    n_characters = len(chunk) # number of characters in the chunk
    if n_characters == 1:
        one_letter_symbol = True
    else:
        if chunk[1].islower():
            one_letter_symbol = False
        else:
            one_letter_symbol = True
            
    # Now we know if there is a 1- or 2-letter symbol, process the chunk:
    if one_letter_symbol:
        if n_characters == 1:
            formula_dict[chunk] = 1
        else:
            formula_dict[chunk[0]] = int(chunk[1:]) # convert string to integer
    else:
        if n_characters == 2:
            formula_dict[chunk] = 1
        else:
            formula_dict[chunk[:2]] = int(chunk[2:])

print(formula_dict) # printout to check

{'C': 6, 'H': 10, 'S': 1, 'O': 2, 'F': 1}

Great!
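
One limitation worth knowing about: because each chunk is stored with formula_dict[symbol] = count, a formula that repeats an element (e.g. "CH3COOH") would keep only the last count for that element. The sketch below is one possible way to add the counts up instead. It is an optional extension rather than part of the walkthrough, and the function names chunk_to_symbol_and_count and build_formula_dict are hypothetical:

```python
def chunk_to_symbol_and_count(chunk):
    """Split a chunk such as 'Cl2' into ('Cl', 2); a missing count means 1."""
    symbol = chunk.rstrip("0123456789")  # drop trailing digits, leaving the symbol
    digits = chunk[len(symbol):]         # whatever followed the symbol
    count = int(digits) if digits else 1
    return symbol, count

def build_formula_dict(formula_list):
    """Convert a list of chunks into a dictionary of element counts."""
    formula_dict = {}
    for chunk in formula_list:
        symbol, count = chunk_to_symbol_and_count(chunk)
        # Add to any existing count rather than overwriting it
        formula_dict[symbol] = formula_dict.get(symbol, 0) + count
    return formula_dict

print(build_formula_dict(["C6", "H10", "S", "O2", "F"]))
# {'C': 6, 'H': 10, 'S': 1, 'O': 2, 'F': 1}
print(build_formula_dict(["C", "H3", "C", "O", "O", "H"]))  # chunks of "CH3COOH"
# {'C': 2, 'H': 4, 'O': 2}
```

Stripping the trailing digits also removes the need to decide in advance whether the symbol has one or two letters, at the cost of being a little less explicit than the step-by-step logic above.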

Now we need that database of element symbols and element masses. As discussed above, code and data from Session 7 will get us what we need:

Subproblem 3: Create the element_mass dictionary

Take a copy of the file all_elements.txt from Session 7 and put it in the same folder as this Jupyter notebook. Then load it:

elements_file_name = "all_elements.txt" # the copy placed in the same folder as this notebook
element_mass = {} # create an empty dictionary
with open(elements_file_name) as elements_file: # Use a context block so the file is automatically closed at the end
    for line in elements_file:
        fields = line.split()
        symbol = fields[3] # the symbol is the last field
        mass = float(fields[1]) # mass is the second field - convert to a float
        element_mass[symbol] = mass

print(element_mass["Cl"]) # Quick check - do we get the expected mass?

OK, now we can use element_mass and formula_dict to calculate the molecular weight, by iterating over the keys (which are element symbols) in formula_dict:

molecular_mass = 0.0 # start with a mass of zero
for element_symbol in formula_dict: # using formula_dict from cell above
    molecular_mass = molecular_mass + element_mass[element_symbol] * formula_dict[element_symbol]

print(molecular_mass)

165.2024031636
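
If formula_dict contains a symbol that is not in element_mass (a typo such as "Xx", say), the loop above stops with a KeyError. The sketch below shows one possible way to give a clearer message. The function name molecular_mass_of and the small demo_masses dictionary (approximate values typed in by hand) are assumptions for illustration only:

```python
def molecular_mass_of(formula_dict, element_mass):
    """Sum mass * count over all elements, with a readable error for unknown symbols."""
    total = 0.0
    for symbol, count in formula_dict.items():
        if symbol not in element_mass:
            raise ValueError(f"Unknown element symbol: {symbol!r}")
        total += element_mass[symbol] * count
    return total

# Hand-typed approximate masses, just for this demonstration
demo_masses = {"H": 1.008, "O": 15.999}

print(molecular_mass_of({"H": 2, "O": 1}, demo_masses))  # water

try:
    molecular_mass_of({"Xx": 1}, demo_masses)
except ValueError as error:
    print(error)  # Unknown element symbol: 'Xx'
```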

This seems to be working OK - but to test it all out properly, let’s put all the code snippets from the cells above together into what we hope may be the near-final Python program.

Putting it all together

# Part 1: from formula_string to formula_list:

formula_string = "C6H10SO2F"
start_points = []
n_characters = len(formula_string) # number of characters in the string

# 1a: get start and end indices for each chunk
for i in range(n_characters): # loop over all the characters
    if formula_string[i].isupper():
        start_points.append(i) # append the index to start_points
        
end_points = start_points[1:] # end point of each chunk is start point of the next
end_points.append(n_characters) # special case for the last chunk

# 1b: divide formula_string up:
formula_list = []
n_chunks = len(start_points) # the number of chunks
for i in range(n_chunks):
    formula_list.append(formula_string[start_points[i]:end_points[i]])


# Part 2: from formula_list to formula_dict:

formula_dict = {} # start with an empty dictionary
for chunk in formula_list:
    # 2a: code to find out if it's a 1- or 2-letter symbol:
    n_characters = len(chunk) # number of characters in the chunk
    if n_characters == 1:
        one_letter_symbol = True
    else:
        if chunk[1].islower():
            one_letter_symbol = False
        else:
            one_letter_symbol = True
            
    # 2b: now we know if there is a 1- or 2-letter symbol, process the chunk:
    if one_letter_symbol:
        if n_characters == 1:
            formula_dict[chunk] = 1
        else:
            formula_dict[chunk[0]] = int(chunk[1:]) # convert string to integer
    else:
        if n_characters == 2:
            formula_dict[chunk] = 1
        else:
            formula_dict[chunk[:2]] = int(chunk[2:])


# Part 3: Load the database of elements and masses:

elements_file_name = "all_elements.txt" # assumes the file is in the same folder as this code
element_mass = {} # create an empty dictionary
with open(elements_file_name) as elements_file: # Use a context block so the file is automatically closed at the end
    for line in elements_file:
        fields = line.split()
        symbol = fields[3] # the symbol is the last field
        mass = float(fields[1]) # mass is the second field - convert to a float
        element_mass[symbol] = mass

# Part 4: Calculate the molecular mass:

molecular_mass = 0.0 # start with a mass of zero
for element_symbol in formula_dict:
    molecular_mass = molecular_mass + element_mass[element_symbol] * formula_dict[element_symbol]

print(f'The molecular mass of {formula_string} is {molecular_mass} amu.')

    

The molecular mass of C6H10SO2F is 165.2024031636 amu.

Convert from Python notebook to stand-alone script.

Although it’s possible to export Python notebooks as *.py scripts, this is seldom the best way. A generally better approach is to get all the relevant code into a single cell (as we have done above), and then copy and paste it into a new *.py file using a text editor.

In the current case, the only adaptation that will be required is to replace:

formula_string = "C6H10SO2F"

with:

formula_string = input('Enter your molecular formula:')

so that the script asks the user for a formula each time it is run.
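
Once the formula comes from a user, it may not be as tidy as our test strings. For example, a formula typed entirely in lower case contains no upper-case start-points, so the code above would quietly report a mass of 0.0. The sketch below shows one possible way to check the text first; the function name clean_formula is hypothetical, and the checks are deliberately simple:

```python
def clean_formula(raw_text):
    """Tidy up and sanity-check a formula typed in by a user."""
    formula = raw_text.strip()  # remove stray spaces and newlines
    if not formula:
        raise ValueError("No formula entered.")
    if not formula[0].isupper():
        raise ValueError("A formula must start with an upper-case element symbol.")
    if not formula.isalnum():
        raise ValueError("Only letters and digits are supported.")
    return formula

print(clean_formula("  C6H10SO2F \n"))  # C6H10SO2F
```

In the script, this could be used by wrapping the input call, i.e. formula_string = clean_formula(input('Enter your molecular formula:')).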

Give it a go!