Working with Numpy Arrays#

Overview

Questions:

What are the differences between NumPy arrays and lists?
How can I use NumPy to do calculations?

Objectives:

Be able to name the differences between Python lists and numpy arrays.
Understand the idea of broadcasting.

NumPy is a widely used Python library for scientific computing. It has a number of useful features, including a data structure called an array. Compared to the built-in list data type we used previously, NumPy arrays have many features that can help you in your data analysis. Using these features properly can make your code much faster and more efficient.

Broadcasting#

Another special thing about NumPy is something called broadcasting. Broadcasting occurs when you attempt mathematical operations on arrays that have different shapes. For this type of operation to work, the arrays do not have to be exactly the same size; they just have to have compatible shapes. If possible, the smaller array is “broadcast” across the larger array.

Let’s think about what would happen if we wanted to move every atom in our coordinate set by our translation vector.

If you were working with Python lists, or you didn’t know about the features of numpy arrays, you might try to do this with a for loop.

PYTHON

num_atoms = len(coordinates)
new_coordinates = []

for n in range(num_atoms):
    
    atom_coord = coordinates[n]
    updated_coord = []
    for i in range(3):
        translated_coordinate = atom_coord[i] + translation_vector[i]
        updated_coord.append(translated_coordinate)
    
    new_coordinates.append(updated_coord)
    
print(new_coordinates)

OUTPUT

[[0.1, -0.1, 0], [0.1, -0.1, 1.122462048309373], [0.1, -0.1, 2.244924096618746]]

If we think about the indices, we were adding the x values together, the y values together, and the z values together for the two arrays. You might write something that looks like this to express the addition:

\[\begin{split} c_{translated} = \left[\begin{array}{rrr} c_{11} + t_1 & c_{12} + t_2 & c_{13} + t_3 \\ c_{21} + t_1 & c_{22} + t_2 & c_{23} + t_3 \\ c_{31} + t_1 & c_{32} + t_2 & c_{33} + t_3 \\ \end{array}\right] \end{split}\]

Broadcasting in NumPy allows us to achieve that with one command, rather than with multiple for loops. NumPy is able to compare the shapes of two arrays to see if they are compatible. When we examine the shapes here, we see that both contain a dimension of size 3, so the shapes are compatible.

PYTHON

new_coordinates_np = coordinates_np + translation_vector_np

print(new_coordinates_np)
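As a quick check that the loop and the broadcast addition do the same thing, here is a minimal, self-contained sketch. The coordinate and translation values are made up for illustration (they are not the lesson's data), and the names `looped` and `broadcast` are hypothetical.

```python
import numpy as np

# Illustrative values only (assumed, not the lesson's data)
coordinates = [[0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 2.0]]
translation_vector = [0.5, -0.5, 0.25]

# Loop version: add the translation to each atom, one component at a time
looped = []
for atom in coordinates:
    looped.append([atom[i] + translation_vector[i] for i in range(3)])

# Broadcasting version: one expression, no explicit loop
coordinates_np = np.array(coordinates)
translation_vector_np = np.array(translation_vector)
broadcast = coordinates_np + translation_vector_np

# Both approaches should give the same numbers
print(np.allclose(looped, broadcast))
```

If the two approaches agree, the final line prints True.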

Note that only one of the two variables has to be a NumPy array for this to work; the other can be array-like, such as a list. However, the two variables do have to have compatible shapes. You can see the shape of an array using the function np.shape.

PYTHON

np.shape(coordinates)

OUTPUT

(3, 3)

PYTHON

np.shape(translation_vector_np)

OUTPUT

(3,)

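When you typed coordinates_np + translation_vector_np, NumPy looked at the shapes of both arrays to figure out if they were compatible. Here the last dimension of (3, 3) matches the only dimension of (3,), so the translation vector is applied to every row.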
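To make the compatibility check concrete, the sketch below tries one compatible and one incompatible addition. NumPy compares shapes starting from the last dimension; two dimensions are compatible when they are equal or when one of them is 1. The arrays `coords`, `shift`, and `bad_shift` are hypothetical examples, not variables from this lesson.

```python
import numpy as np

coords = np.zeros((3, 3))               # 3 atoms, 3 coordinates each
shift = np.array([1.0, 2.0, 3.0])       # shape (3,): matches the last dimension

# Compatible: the result has the shape of the larger array
print((coords + shift).shape)

bad_shift = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): does not match 3

try:
    coords + bad_shift
    compatible = True
except ValueError as error:
    # NumPy raises ValueError when shapes cannot be broadcast together
    compatible = False
    print("Incompatible shapes:", error)
```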

If we are using a NumPy array for operations, this check will always be performed. You can also multiply arrays by scalars:

PYTHON

10 * coordinates_np
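Scalar multiplication is one place where lists and arrays behave quite differently. The short sketch below, using made-up values and hypothetical names, shows that multiplying a list by an integer repeats the list, while multiplying an array scales each element.

```python
import numpy as np

values_list = [1.0, 2.0, 3.0]
values_np = np.array(values_list)

repeated = 3 * values_list   # list repetition: the three items appear three times
scaled = 3 * values_np       # element-wise multiplication

print(len(repeated))   # 9
print(scaled)          # [3. 6. 9.]
```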

Logical comparisons#

We can also do logical comparisons on whole arrays. For example, to find out if values in the array are greater than 0, we can write

PYTHON

coordinates = np.array(coordinates)
print(coordinates > 0)

This will print either True or False for each array element depending on whether the value of that element is greater than 0 or not.

OUTPUT

array([[False,  True, False],
       [False, False,  True],
       [False,  True,  True],
       ...,
       [False,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

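To get every value in the array that is greater than 0, we can use this array of True and False values as an index (often called a boolean mask). Only the elements where the mask is True are returned, as a one-dimensional array.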

PYTHON

greater_than_0_values = coordinates[coordinates>0]
print(greater_than_0_values)

OUTPUT

[1.38509308 0.83913362 1.65106295 ... 3.49745584 0.37549254 4.39339869]
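The same idea on a small array makes it easier to see what is selected. This is a minimal sketch with made-up numbers; `example` and `mask` are hypothetical names.

```python
import numpy as np

example = np.array([[-1.0, 2.0, 0.0],
                    [3.0, -4.0, 5.0]])

# Element-wise comparison gives a boolean array with the same shape
mask = example > 0
print(mask)

# Indexing with the mask keeps only the elements where mask is True
print(example[mask])   # [2. 3. 5.]

# True counts as 1, so summing the mask counts the matching elements
print(mask.sum())      # 3
```

Note that the selected values come back as a flat, one-dimensional array, so the original row and column positions are not preserved.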

Array Axes#

Imagine we wanted to calculate the geometric center of our atoms. To do this, we would need to get the average x coordinate, the average y coordinate, and the average z coordinate.

Previously, we would have done this with a for loop and looped over the columns. However, the numpy.mean function will let us do that without a for loop.

To calculate the mean, or average, of all the values in the array, we can use the mean method:

PYTHON

coordinates_np.mean()

OUTPUT

0.3741540161031243

When we use the mean function on an array without any other arguments, it gives us the average of all of the values. However, sometimes we prefer to have the average of each row or each column instead. This ties into the idea of a NumPy array axis, a very important concept when working with NumPy.

Our NumPy array is an “n-dimensional” array. The “n” indicates that we can have any number of dimensions, or axes. We can see the number of axes or dimensions for our array using .ndim.

PYTHON

coordinates_np.ndim

OUTPUT

2

Our NumPy array has two dimensions, or two axes. This should make sense because we can use two numbers (row index and column index) to get the value of any cell in our NumPy array.

These axes are called “Axis 0” and “Axis 1”. Axis 0 is in the first position when indexing into an array, and Axis 1 is in the second. The figure below shows an illustration of Axis 0 and Axis 1. Axis 0 runs down the rows, while Axis 1 runs across the columns.

For most NumPy array operations, we can indicate which axis to apply the operation to. NumPy functions usually allow us to specify this using the axis argument. When we specify axis for the mean, we give the array axis along which the average is computed. In our two-dimensional array, we choose between axis=0 and axis=1. A key thing to realize is that when we use axis=n, the operation is applied along that axis, so that axis is collapsed in the result.

PYTHON

print(coordinates_np.mean(axis=0))
print(coordinates_np.mean(axis=1))

Notice that when we calculate the mean along axis=0, we get an output array with 3 values (one per column), while when we calculate the mean along axis=1, we get 4 values (one per row).

OUTPUT

[0.28061551 0.         0.84184654]
[0.         0.37415402 0.74830803 0.37415402]
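A smaller array can help build intuition for which axis is collapsed. The sketch below uses a made-up 2 x 3 array called `small` (a hypothetical name) and shows the values and shapes returned for each axis.

```python
import numpy as np

small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (2, 3)

# axis=0 collapses the rows: one mean per column
column_means = small.mean(axis=0)
print(column_means)         # [2.5 3.5 4.5]
print(column_means.shape)   # (3,)

# axis=1 collapses the columns: one mean per row
row_means = small.mean(axis=1)
print(row_means)            # [2. 5.]
print(row_means.shape)      # (2,)
```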

Check the shape of our coordinates again:

PYTHON

coordinates_np.shape

OUTPUT

(4, 3)

If we want the column average (the average over all the rows for each column), we apply the operation down Axis 0. When we examine the shape, we see that we have 4 values for each column (this is the size at shape index 0). We expect to get three values: the average x, the average y, and the average z.

PYTHON

center = np.mean(coordinates_np, axis=0)
print(center)
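The geometric center combines naturally with broadcasting: once we have it, we can subtract it from every atom in one step. The following is a sketch under assumed, made-up coordinates for four atoms; `coords` and `centered` are hypothetical names.

```python
import numpy as np

# Illustrative coordinates for 4 atoms (assumed values, not the lesson's data)
coords = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 2.0],
                   [1.0, 0.0, 0.0]])

# Average over the rows (axis=0) gives one value per column: x, y, z
center = coords.mean(axis=0)
print(center)   # [0.25 0.   0.75]

# Broadcasting: the (3,) center is subtracted from each row of the (4, 3) array
centered = coords - center

# After centering, the geometric center should sit at the origin
print(np.allclose(centered.mean(axis=0), 0.0))
```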

There are many ways to think about array axes, and it can be helpful to search online for another explanation if this one does not click.

Additional Reading#

If you are still working to understand NumPy, try reading the Beginner’s Tutorial.

Key Points

NumPy arrays that are the same size use element-wise operations when added or subtracted.
NumPy uses something called broadcasting to allow arrays with different but compatible shapes to be added or multiplied.
NumPy has extensive documentation online, which is worth checking whenever you need to do a computation.
