Session 18: Visualizing Data

Session 18: Visualizing Data#

Visualizing data is something we commonly do as scientists, and it makes perfect sense for us to leverage Python where we can to automate the process of generating graphs and figures. In this session, we will explore how to use the popular plotting library Matplotlib to achieve this goal.

Note: you can choose to use Jupyter Notebooks for all of these tasks if you prefer! It is a common usecase to use such notebooks for data visualization.

Part 1: Matplotlib Basics#

Before we begin, we need to make sure we have Matplotlib installed, so ensure you successfully run pip install matplotlib in your terminal.

Task 1#

Create a new file in the session folder called matplotlib_intro.py

Paste in the following code:

import os
import numpy as np
import matplotlib.pyplot as plt

# Get the directory of this script
script_path = os.path.abspath(__file__)
script_dir = os.path.dirname(script_path)

# Construct the full path to test_data.txt
data_path = os.path.join(script_dir, "test_data.txt")

# Load data with numpy
data = np.loadtxt(data_path)

# Assume the first column is x, second column is y
x = data[:, 0]
y = data[:, 1]

# Create a basic plot
plt.plot(x, y, label="Sample Data")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Plot from test_data.txt")
plt.legend()
plt.show()

Run the script, and a window should appear displaying a plot of the data.

Analysis#

Import Matplotlib: For plotting, we want to use the matplotlib.pyplot package. As we did with NumPy, we use an alias (plt) to make our life easier:
```
import matplotlib.pyplot as plt
```
Import data: NumPy is used to read in the same dataset we have used in a previous session, much in the same way as seen before.
- The code assumes two columns in test_data.txt, which we assign to two variables:
  - x is the first column (index 0).
  - y is the second column (index 1).
Creating a plot: To create a “plot” object we use the plt.plot() function, which takes in data and some parameters, and then allows us to draw the plot.
- Using plt.xlabel() and plt.ylabel() we can set the appropriate x- and y- axis labels, and using plt.title() we set a title.
- Check out the image below (taken from the official documentation) to see the “anatomy” of a Matplotlib figure.
- You should have a look through the documentation for plt.plot() to see what else you can do with it.
Displaying the plot: Finally, using plt.show() will display the figure by opening a window on your computer.
- Note: we have not saved this as an image anywhere!

Task 2#

Modify your matplotlib_twodata.py file by pasting the following block of code to replace the block of code at the end.

import os

import matplotlib.pyplot as plt
import numpy as np

# Get the directory of this script
script_path = os.path.abspath(__file__)
script_dir = os.path.dirname(script_path)

# Construct the full path to test_data.txt
data_path = os.path.join(script_dir, "test_data.txt")

# Load data with numpy
data = np.loadtxt(data_path)

# Assume the first column is x, second column is y
x = data[:, 0]
y = data[:, 1]

# Create a new set of data by arbitrarily scaling the original y-axis data
y2 = y * 1.5

plt.plot(x, y, label="Sample Data")
plt.plot(x, y2, label="Scaled Data", linestyle="--")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Comparing Two Data Series")
plt.legend()
plt.show()

Run the script and see if you spot anything different.
Take a closer look at the code and identify what has changed (and why?).

Analysis#

Let’s focus on the two new lines of code here to see what we have changed this time around. The labels and titles have also been changed, but the fundamental code is the same.

y2 = y * 1.5

Very simply, we are taking the NumPy array y and multiplying each element by 1.5 and saving it to a new array y2.
- Remember vectorization? It allows us to do this very easily!

plt.plot(x, y2, label="Scaled Data", linestyle="--")

Here we are calling the plt.plot() function a second time, this time with the same x-axis but with our new y2 array as the y-axis data
We are also using the linestyle keyword argument to apply a different style ("--" will give us dashed lines) to our second set of data in this plot.
- Take a look at the relevant documentation page for other available linestyles. Feel free to have a play with different styles for either (or both) of the data sets.

Programming Challenge

Write a script that will read the spec_data.csv file found in the session folder and create a plot with appropriate labels and no legend.

You will need to take into account that the file has a “header” row which gives each column a label, and that the data is in a comma-separated values format. The np.loadtxt() documentation might be useful to you in addressing these.

Raw data from: https://data.hpc.imperial.ac.uk/resolve/?doi=8763

Part 2: Curve Fitting#

Task 3#

Create a file called matplotlib_curvefit.py and paste in the following:

import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

# Generate a made-up dataset that follows y = 2*x + 5
x = np.linspace(0, 10, 50)
a, b = 2.0, 5.0
# Generate noise (random numbers) with mean 0 and standard deviation 0.5
noise = np.random.normal(0, 0.5, size=len(x))
y = a * x + b + noise


def func(x, a, b):
    return a * x + b


# Fit the linear model
optimal_parameters, covariance = curve_fit(func, x, y)
a_fit, b_fit = optimal_parameters
y_fit = func(x, a_fit, b_fit)

plt.plot(x, y, marker="o", linestyle="", label="Synthetic Data")
plt.plot(x, y_fit, linestyle="--", color="r", label=f"Fitted: a={a_fit:.2f}, b={b_fit:.2f}")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Curve Fitting Linear Data")
plt.legend()
plt.show()

Run the script a couple of times and make a note of the values of the fitted parameters (in the legend).

Analysis#

There are 2 main things happening in this script; generating a random linear data set, and fitting a linear regression and plotting it.

Let’s took a look at each section, line-by-line where needed.

Generating random linear data#

x = np.linspace(0, 10, 50)

We use the np.linspace() function of NumPy to generate an evenly-spaced linear array of 50 numbers from 0 to 10.
This corresponds to our x-coordinate data, and so we call this array x.

a, b = 2.0, 5.0

As we will be using the \(y=ax+b\) format of a straight line, we need to set some arbitrary values for our gradient (a) and y-intercept (b).
Note: we are assigning both variables on a single line by using commas on either side of the =!

noise = np.random.normal(0, 0.5, size=len(x))

Since we don’t want a perfect straight line, we will generate some noise by creating an array equal in size to our x array filled with random numbers with a mean of 0 and a standard deviation of 0.5.
To achieve a mean of 0, some values will be positive, and some will be negative.
Feel free to use print() to see the contents of this array, or even plot them with plt.plot()!

y = a * x + b + noise

Finally, we combine everything to determine our y values.
Rather than being exactly a straight line, our y-coordinates will be ever-so-slightly shifted by the noise we have added.

Fitting a linear regression#

With our random “data”, we use SciPy to fit a function of our choosing to the model. We could use NumPy’s polynomial fitting, but it is much more powerful to be able to get parameters from our own functions as data will not always follow a simple trend.

def func(x, a, b):
    return a * x + b

Here we are defining a function which we will feed into SciPy’s curve_fit() function.
We are expecting a linear relationship, so our function will be in the form \(f(x)=ax+b\), where we are hoping to find the values of \(a\) and \(b\).
- In a more realistic scenario, we would not know the values of these already.
If we were fitting an exponential or logarithmic function, we would modify func() accordingly.
- For example, for an exponential fit we may use the following:
```
def func(x, a, b, c):
    return a * np.exp(b*x) + c
```

optimal_parameters, covariance = curve_fit(func, x, y)

We now use SciPy’s curve_fit() function, which takes in a model function (here, func()) as well as our input data and outputs a couple of things for us.
- First, it ouputs some optimal_parameters that contains the optimal parameters for our model function to fit the selected data through a least squares analysis.
- The second output gives the approximate covariance of our optimal parameters, which we can use for a more detailed error analysis (not in this task).

a_fit, b_fit = optimal_parameters

In this step we are simply splitting up the optimal_parameters array into individual parameter variables.

y_fit = func(x, a_fit, b_fit)

Here we are generating a series of points for our straight line based on the fitted parameters a_fit and b_fit.
This is made very easy by just using the function for the equation of a straight line we have already provided.
- Note: we are continuing to make sure of NumPy’s vectorization here by applying the function to each element in the array x in a single line!

plt.plot(x, y_fit, linestyle="--", color="r", label=f"Fitted: a={a_fit:.2f}, b={b_fit:.2f}")

We want to add the fitted line to our plot, so we just use plt.plot() again with the fitted y-values.
By using appropriate keyword arguments, we can format this to our liking. More details on how to do this are in the Matplotlib documentation, which you are encouraged to look through.

Our final figure should look a bit like this, but with different values, of course

Programming Challenge

See if you can calculate the standard deviation error in the parameters \(a\) and \(b\), and display these in the figure legend.

Task 4#

We have made figures, but how can we actually save them to use them in a report? By saving it as a file, of course.

Create a new file called matplotlib_savefig.py.

In your new file, paste the following code:

import os

import matplotlib.pyplot as plt
import numpy as np

# Generate a simple dataset
x = np.linspace(0, 10, 50)
y = np.sin(x)

# Create the plot
plt.plot(x, y)

# Define the output file path (saving in the same directory as the script)
script_dir = os.path.dirname(os.path.abspath(__file__))
output_path = os.path.join(script_dir, "sine_wave_plot.png")

# Save the figure
plt.savefig(output_path)

Run the script.
Check your session folder to locate the saved file and open it.

Analysis#

Creating data and a plot: We create a simple data set of 50 data points along \(f(x)=\sin x\) using np.linspace() for our x-values and np.sin() for our y-values.
Determine an output filename: Just as we had to give the full filepaths before when reading data, it is best practice to create full paths when writing data as well.
- You can change the file format by specifying a different file extension (e.g., .jpg), but a vector-based format such as .png is generally preferred for diagrams.
Save the figure: Using the plt.savefig() function, we save our plot to the chosen output path.
- You will get no visual confirmation that this has happened, other than the file being created.
- Running the script again will overwrite this file, so be mindful if you make changes to it using other software!

Summary#

By the end of this session, you will have:

Learned to create and and customize plots with Matplotlib, including multiple datasets and modifying labels and linestyles.
Demonstrated using SciPy’s curve_fit for a simple linear fit, including fitting parameters and displaying them in the plot.
Saved a figure to a file for later use.

Session 18: Visualizing Data

Contents

Session 18: Visualizing Data#

Part 1: Matplotlib Basics#

Task 1#

Analysis#

Task 2#

Analysis#

Part 2: Curve Fitting#

Task 3#

Analysis#

Generating random linear data#

Fitting a linear regression#

Task 4#

Analysis#

Summary#