Session 7: More on Strings#
Introduction#
Why more on strings? Isn’t data science mainly about numbers?
Well, not really: while the scientific data this course is mainly concerned with are indeed numbers, the science of data is applicable to all fields, and many of them are highly textual - think of all those Large Language Models!
More relevantly for us, while the data we want to manipulate and analyse with the help of Python will most commonly be numeric, we very often have to read it into our program in text form, and the reports we want our programs to produce will probably feature a lot of text too.
So the focus here is on these two applications:
input - analysing textual data sources as part of turning them into numbers, and
output - creating nice-looking text messages.
Task 1#
In a terminal window, navigate to your PHAR2062/Session 7 - More on Strings folder.
Create a text file in this folder with the name elements.txt, and the contents exactly:
 1 H   1.0078 Hydrogen
 2 He  4.0026 Helium
 3 Li  6.9410 Lithium
15 P  30.9740 Phosphorus
Now write a Python program called parse_data.py that reads in this data file, and just prints it out again:
elements_file = open('elements.txt')
for line in elements_file.readlines():
    print(line)
elements_file.close()
(If there is anything here you feel unsure about, look over the previous workshop about Lists.)
Run your Python program and make sure it produces the expected output.
Analysis#
There should be no surprises here - you have done exactly this sort of thing before.
But to turn this into something more useful, we want to be able to separate out the different columns of data in this file - the atomic numbers, element symbols, atomic masses, and element names. That’s what you will now work towards.
Task 2#
Edit parse_data.py so it reads:
elements_file = open('elements.txt')
for line in elements_file.readlines():
    words = line.split()
    print(words)
elements_file.close()
And then re-run it.
Analysis#
In the second line of the program, the string variable line is set in turn to each line of text in the file elements.txt.
In line 3 we see a method of the string object - .split() - being used. As you can probably work out from the output, what this does is return a list of strings, each being one of the sections of the original that are separated by whitespace.
The .split() method of a string is very useful when reading in and analysing (“parsing”) data. Strings have a variety of other methods too, for example:
.upper() returns an all upper-case copy of the string.
.lower() returns an all lower-case copy of the string.
.isupper() returns True if all the letters in the string are upper-case letters, otherwise it returns False.
.islower() is the lower-case equivalent: it returns True if all the letters in the string are lower-case letters, otherwise it returns False.
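As a quick illustration of these methods, here is a minimal sketch that applies them to one line of the kind found in elements.txt. The variable names are chosen just for this example.

```python
# One line of text, as it might be read from elements.txt
line = '15 P  30.9740 Phosphorus\n'

# .split() breaks the line wherever there is whitespace (spaces, tabs, newlines)
words = line.split()
print(words)                   # ['15', 'P', '30.9740', 'Phosphorus']

name = words[3]
print(name.upper())            # PHOSPHORUS
print(name.lower())            # phosphorus
print(name.isupper())          # False
print(name.upper().isupper())  # True
print(name.lower().islower())  # True
```

Notice that every item in the list is still a string, including the ones that look like numbers.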
Now we know how to separate out the different data items in each line of the file, we can begin to create some more useful data objects for future use.
Task 3#
Edit parse_data.py so it reads:
atomic_numbers = []
element_symbols = []
atomic_masses = []
element_names = []
elements_file = open('elements.txt')
for line in elements_file.readlines():
    words = line.split()
    atomic_numbers.append(words[0])
    element_symbols.append(words[1])
    atomic_masses.append(words[2])
    element_names.append(words[3])
elements_file.close()
print(atomic_numbers)
print(element_symbols)
print(atomic_masses)
print(element_names)
And then re-run it.
Analysis#
We begin by creating four empty lists, then as we iterate through the lines in the file, the appropriate items in each line are appended to the corresponding lists. At the end the output should confirm that this has indeed been achieved!
Programming Challenge 1
You will notice that the four lists we have created are all lists of strings. Can you edit the program so that atomic_numbers becomes a list of integers, and atomic_masses becomes a list of floats? Look over material from previous workshops if you need a hint!
Task 4#
Edit your new, improved parse_data.py so the last lines look like this:
elements_file.close()
for i in [0, 1, 2, 3]:
    print(atomic_numbers[i], element_symbols[i], atomic_masses[i], element_names[i])
And then re-run it.
Analysis#
If everything goes according to plan, you should see the following output:
1 H 1.0078 Hydrogen
2 He 4.0026 Helium
3 Li 6.941 Lithium
15 P 30.974 Phosphorus
If you compare this with the input file you can see it’s very nearly the same - but not quite. In the output, the columns of data are not so nicely formatted: on lines 1 and 4 there is only one space after the element symbol, and on lines 3 and 4 the atomic masses of Li and P have only been printed out to 3 decimal places. The first column of atomic numbers doesn’t line up nicely either.
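The missing trailing zeros come from the way Python prints a float by default: it shows only as many digits as it needs. A minimal sketch, assuming the atomic mass has already been converted from text to a float (the variable names are just for illustration):

```python
mass_text = '6.9410'      # the text exactly as it appears in the file
mass = float(mass_text)   # the same value converted to a number

print(mass_text)   # 6.9410
print(mass)        # 6.941
```

The number itself is unchanged; only its default printed form is shorter.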
To fix this, we need to start learning about string formatting in Python.
First a warning: how to get nicely formatted strings has been tackled by the developers of Python in a number of different ways over the history of the language. If you look at other examples of Python code, you will quickly see things that look rather different from what you will learn here. To keep things simple, we are only going to talk for now about the most modern of Python’s approaches to string formatting, which is called f-strings.
Task 5#
Edit the last few lines of parse_data.py so they look like this:
elements_file.close()
for i in [0, 1, 2, 3]:
    formatted_string = f'{atomic_numbers[i]} {element_symbols[i]} {atomic_masses[i]} {element_names[i]}'
    print(formatted_string)
And then re-run it.
Analysis#
Firstly you may be rather disappointed to see that the output looks just like it did last time. But let’s just analyse the penultimate line of the code.
Here we see the string variable formatted_string being assigned using an f-string. An f-string starts with f before the opening quote (which doesn’t have to be a single quote - a double quote (") is fine too, as long as the end of the string is marked by a matching double quote, of course).
Inside an f-string, any variable or expression that is found between a pair of braces ({}) is converted into its string representation. Anything else in the f-string stays as it is (in this case, the single space characters between each item).
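A minimal sketch of the basic idea, using two made-up variables rather than the lists from parse_data.py:

```python
symbol = 'He'
atomic_number = 2

# Each {...} is replaced by the string form of the value inside it;
# everything outside the braces is copied unchanged
message = f'Element {atomic_number} is {symbol}'
print(message)                   # Element 2 is He

# Double quotes work in exactly the same way
same_message = f"Element {atomic_number} is {symbol}"
print(message == same_message)   # True
```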
The power of f-strings comes from the fact that within each pair of braces you can specify not only what variable is going to be represented there, but also how it is formatted.
Task 6#
Edit the penultimate line of parse_data.py so it looks like this:
    formatted_string = f'{atomic_numbers[i]:2d} {element_symbols[i]:2s} {atomic_masses[i]:7.4f} {element_names[i]}'
And then re-run it.
Analysis#
All the columns of data are now nicely lined up - how come?
The entry in the f-string for the atomic_numbers now reads: {atomic_numbers[i]:2d}. The : separates the name of the variable from a code that controls how the item is printed. The code 2d means “print out as an integer number that is right-justified in a field that is 2 characters wide”.
Likewise the entry in the f-string for the element symbol now reads: {element_symbols[i]:2s}. In this case the code 2s means “print out as a left-justified string in a field that is 2 characters wide”.
Finally the entry for the atomic mass now reads: {atomic_masses[i]:7.4f}. In this case the code 7.4f means “print out as a floating point number, in a field that is 7 characters wide and includes 4 figures after the decimal point”.
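To see the effect of each code on its own, here is a small sketch using stand-alone example values. The square brackets are not part of the formatting; they are added only so the edges of each field are visible.

```python
number = 3
symbol = 'H'
mass = 6.941

print(f'[{number:2d}]')   # [ 3]       integer, right-justified, 2 wide
print(f'[{symbol:2s}]')   # [H ]       string, left-justified, 2 wide
print(f'[{mass:7.4f}]')   # [ 6.9410]  float, 7 wide, 4 decimal places
```

Note that the d and f codes only work on numbers, which is why the conversion to integers and floats from Programming Challenge 1 needs to be in place first.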
The codes (or “mini-language”) used in the formatting of f-strings are very flexible, but inevitably can also get quite complicated. We won’t go into any more detail for now, but there are pointers to further reading/study below.
Programming Challenge 2
Edit the f-string in parse_data.py so it reads:
formatted_string = f'{atomic_numbers[i]:2d}: "{element_symbols[i]:2s}" mass = {atomic_masses[i]:7.4f} ({element_names[i]})'
(look carefully - the changes are quite small!) and then re-run it to see how characters that are outside the braces contribute to what ends up being printed out.
Now: can you edit that line in the file so that the output looks like this (only first line shown):
Hydrogen (atomic number 1) has the symbol "H" and an atomic mass of 1.01 (to two decimal places)
Programming Challenge 3
In the folder for this session you will see a file all_elements.txt. Can you adapt the code you have written so it can read this file?
(You may find this useful later on in the course!)
Bonus material: Putting it in Context#
All the examples above illustrate the “proper” way to deal with reading data from files, which is that you start by “opening” the file, then you access the data in it, and finally you don’t forget to “close” it again.
Forgetting to “close” a file is easy to do, but Python provides a way to help you.
Task 7#
Create a stripped-down copy of your parse_data.py file that looks like this:
with open('elements.txt') as elements_file:
    for line in elements_file.readlines():
        print(line)
print('all done!')
Save it with the name parse_data_revised.py, and then run it.
Analysis#
You should see it works fine, even though there is no elements_file.close() line.
This is because the with X as Y idiom of the code in the first line starts something called a context block (note how the lines below are indented).
The line creates a variable (Y) that refers to the value returned by X (here, the open file object).
Y is only meant to be used within the context block (any lines of code below that are indented). At the end of the context block (where the indentation stops), Python safely ‘cleans up’ Y, which in the case of Y being an open file object, means implicitly close()ing it. The name Y still exists after the block, but the file it refers to is closed and can no longer be read from.
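You can check the automatic close for yourself. The sketch below is only an illustration: so that it runs anywhere, it first writes a small throwaway file (demo.txt in a temporary folder, both names invented for this example) and then reads it back. File objects have a .closed attribute that reports whether the file has been closed.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.txt')
with open(path, 'w') as demo_file:
    demo_file.write('1 H 1.0078 Hydrogen\n')

with open(path) as demo_file:
    first_line = demo_file.readline()
    print(demo_file.closed)   # False: still inside the context block

print(demo_file.closed)       # True: closed automatically when the block ended
print(first_line.split())     # ['1', 'H', '1.0078', 'Hydrogen']
```

The name demo_file is still available after the block, but the file it refers to has been closed, so any data you need later (here, first_line) should be read while still inside the block.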
Programming Challenge 4
The context block idiom is very commonly used for reading (and writing) files
and you will see it a lot - so it’s a good idea to get used to using it too.
Edit your original parse_data.py code to use it.
Later on in the course you may see context blocks being used for purposes other than reading and writing files - they are useful wherever you need to set something up for a certain amount of time, but ‘safely’ clean it up later when it is not needed any more.
Summary#
In this session you have been introduced to Python tools that should help you a lot when working out how to deal with files of data. You have seen how the .split() method of strings can help to separate different data values within a line of text, and how to convert them into the correct data type, and then create suitable data structures to organise them (here, lists). Then you have seen how to create nicely-formatted output too, using the power of f-strings.
You can learn more about f-strings here