Scientific and Engineering Libraries, Part 2¶

Weeks 12 and 13 are about using external scientific and engineering libraries that are not inside core Python, but are popular enough that we should learn them too.

Pandas¶

Pandas is a Python library for easy and efficient numerical computation of table-like data. You can install it via either pip or conda with, for example the following command:

pip install pandas

We generally import pandas as "pd".

Assume that you have the following data:

In [2]:

columns = ['Name', 'Grade', 'Age']
data = [
    ['Jack', 40.2, 20],
    ['Amanda', 30.0, 25],
    ['Mary', 60.2, 19],
    ['John', 85.0, 30],
    ['Susan', 70.0, 28],
    ['Bill', 58.0, 28],
    ['Jill', 90.0, 27],
    ['Tom', 90.0, 24],
    ['Jerry', 72.0, 26],
    ['George', 79.0, 22],
    ['Elaine', 82.0, 23]
]

We can read this using pandas by converting it to a pandas.DataFrame:

Here's how it looks:

In [4]:

data

Out[4]:

	Name	Grade	Age
0	Jack	40.2	20
1	Amanda	30.0	25
2	Mary	60.2	19
3	John	85.0	30
4	Susan	70.0	28
5	Bill	58.0	28
6	Jill	90.0	27
7	Tom	90.0	24
8	Jerry	72.0	26
9	George	79.0	22
10	Elaine	82.0	23

Pandas has a multitude of different reader and writer functions to read/write data from a different file format.

We can quickly save our data as a CSV file with .to_csv():

And read it back using .read_csv():

Our data has three columns, and ten students. We can access a column like this:

In [7]:

data['Grade']

Out[7]:

0     40.2
1     30.0
2     60.2
3     85.0
4     70.0
5     58.0
6     90.0
7     90.0
8     72.0
9     79.0
10    82.0
Name: Grade, dtype: float64

If we want, we can also create "horizontal columns" (i.e, an index), using the names of the students:

In [9]:

data

Out[9]:

	Name	Grade	Age
Name
Jack	Jack	40.2	20
Amanda	Amanda	30.0	25
Mary	Mary	60.2	19
John	John	85.0	30
Susan	Susan	70.0	28
Bill	Bill	58.0	28
Jill	Jill	90.0	27
Tom	Tom	90.0	24
Jerry	Jerry	72.0	26
George	George	79.0	22
Elaine	Elaine	82.0	23

This way, we can access a student's information by providing their name as an index to the .loc[...] attribute:

In [10]:

data.loc['Amanda']

Out[10]:

Name     Amanda
Grade      30.0
Age          25
Name: Amanda, dtype: object

If we didn't have an index, we could access a "row" using the .iloc[...] attribute with a numeric index:

In [11]:

data.iloc[1]

Out[11]:

Name     Amanda
Grade      30.0
Age          25
Name: Amanda, dtype: object

In [12]:

data

Out[12]:

	Name	Grade	Age
Name
Jack	Jack	40.2	20
Amanda	Amanda	30.0	25
Mary	Mary	60.2	19
John	John	85.0	30
Susan	Susan	70.0	28
Bill	Bill	58.0	28
Jill	Jill	90.0	27
Tom	Tom	90.0	24
Jerry	Jerry	72.0	26
George	George	79.0	22
Elaine	Elaine	82.0	23

We can access a specific grade, using the following syntax:

In [13]:

data.loc['Jack']['Grade']

Out[13]:

40.2

In [15]:

data.loc['Jack', 'Grade']

Out[15]:

40.2

Chained indexing¶

One trick to watch out for: when we want to change a value inside of a DataFrame, you should use chained indexing.

If we try to change Jack's grade using this syntax, we get a warning:

In [16]:

data['Grade'].loc['Jack'] = 73

/Users/cagri/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py:1637: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)

Pandas will still update the value for you, but technically you're working on a copy of the data (as a result of the first indexing).

The best practice is to do this instead, which will not produce a warning:

In [18]:

data

Out[18]:

	Name	Grade	Age
Name
Jack	Jack	83.0	20
Amanda	Amanda	30.0	25
Mary	Mary	60.2	19
John	John	85.0	30
Susan	Susan	70.0	28
Bill	Bill	58.0	28
Jill	Jill	90.0	27
Tom	Tom	90.0	24
Jerry	Jerry	72.0	26
George	George	79.0	22
Elaine	Elaine	82.0	23

Pandas has some nice functions that are useful for getting quick statistics about your data:

For numeric data, .describe() will produce some general stats about the given columns:

In [19]:

data.describe()

Out[19]:

	Grade	Age
count	11.000000	11.000000
mean	72.654545	24.727273
std	17.848045	3.495452
min	30.000000	19.000000
25%	65.100000	22.500000
50%	79.000000	25.000000
75%	84.000000	27.500000
max	90.000000	30.000000

For example, we see that the mean age is 24.72.

.sort_values(col) will reorder the data such that the rows are sorted from lowest-to-highest for that given category:

In [20]:

data.sort_values('Grade')

Out[20]:

	Name	Grade	Age
Name
Amanda	Amanda	30.0	25
Bill	Bill	58.0	28
Mary	Mary	60.2	19
Susan	Susan	70.0	28
Jerry	Jerry	72.0	26
George	George	79.0	22
Elaine	Elaine	82.0	23
Jack	Jack	83.0	20
John	John	85.0	30
Jill	Jill	90.0	27
Tom	Tom	90.0	24

Using ascending=False will sort the values from largest to smallest:

In [21]:

data.sort_values('Age', ascending=False)

Out[21]:

	Name	Grade	Age
Name
John	John	85.0	30
Susan	Susan	70.0	28
Bill	Bill	58.0	28
Jill	Jill	90.0	27
Jerry	Jerry	72.0	26
Amanda	Amanda	30.0	25
Tom	Tom	90.0	24
Elaine	Elaine	82.0	23
George	George	79.0	22
Jack	Jack	83.0	20
Mary	Mary	60.2	19

.value_counts(col) can help with counting the number of different values for a given field:

In [22]:

data.value_counts('Grade')

Out[22]:

Grade
90.0    2
30.0    1
58.0    1
60.2    1
70.0    1
72.0    1
79.0    1
82.0    1
83.0    1
85.0    1
dtype: int64

.nlargest(num, col) will give you the rows with the largest value for the given column:

	Name	Grade	Age
Jill	Jill	90.0	27
Tom	Tom	90.0	24
John	John	85.0	30

Finally, .plot() will produce graphs of line plots of your data:

In [24]:

data.plot()

Out[24]:

<AxesSubplot:xlabel='Name'>

In [25]:

data['Grade'].plot()

Out[25]:

<AxesSubplot:xlabel='Name'>

As a final remark, if you ever wonder the internal datatype used to represent data in Pandas data, they are NumPy arrays:

In [26]:

data['Grade'].values

Out[26]:

array([83. , 30. , 60.2, 85. , 70. , 58. , 90. , 90. , 72. , 79. , 82. ])

In [27]:

data.values

Out[27]:

array([['Jack', 83.0, 20],
       ['Amanda', 30.0, 25],
       ['Mary', 60.2, 19],
       ['John', 85.0, 30],
       ['Susan', 70.0, 28],
       ['Bill', 58.0, 28],
       ['Jill', 90.0, 27],
       ['Tom', 90.0, 24],
       ['Jerry', 72.0, 26],
       ['George', 79.0, 22],
       ['Elaine', 82.0, 23]], dtype=object)

Matplotlib¶

Matplotlib is a widely-used library to produce plots and graphs of various forms in Python.

You can install it via

pip install matplotlib

Let's go over the anatomy of a matplotlib graph, from the example in the textbook:

The figure above has several key features:

It has a main title (title), x-axis title (xtitle) and a y-axis title (ytitle).
It has two lines which have different colors and two labels that are shown in the little box (legend).
The numeric values in the table are have their values shown using the numbers on the axes (the ticks).

Every part of the anatomical graph components are customizable in matplotlib.

Let's draw the simplest graph, using the pyplot API.

plt.plot(x_values, y_values) is for drawing a lineplot that passes through each pair of points given by x_values and y_values, which are either lists or NumPy arrays:

In [29]:

xs = list(range(10))
ys = [x**2 for x in xs]
plt.plot(xs, ys)

# Customizing the graph
plt.xlabel('x')
plt.ylabel('y')
plt.title('Graph of x ** 2')

Out[29]:

Text(0.5, 1.0, 'Graph of x ** 2')

If we need to plot multiple lines into the same plot, we can just call plt.plot() multiple times.

The following example makes use of the OOP approach of using matplotlib, which is calling plt.subplots() to get fig and ax objects, and using the ax object to draw the graph.

Notice that most function calls using ax are the same, but some have .set_ as a prefix to the function call.

In [30]:

fig, ax = plt.subplots()

ax.plot([x**2 for x in range(10)], label='$f(x) = x^2$')

# Display this with a gray color
ax.plot([x**3 for x in range(10)], label='$f(x) = x^3$', c='tab:gray')


ax.set_xlabel('x')

# Show 10 ticks
ax.set_xticks(range(10))

# Display the legend
ax.legend()

# Add a grid
ax.grid()

# Display the y values in logarithmic scale
ax.set_yscale('log')

ax.set_title('Graph of polynomials')

Out[30]:

Text(0.5, 1.0, 'Graph of polynomials')

We frequently need to draw multiple subplots inside a single plot, which is possible in the OOP style by giving row and column numbers to the plt.subplots() call to get multiple subplots, and then drawing each plot in its own plot:

In [31]:

# Plot x ** 2
fig, ax = plt.subplots(1, 2)  # 1 row, 2 columns of plots
ax[0].plot([x**2 for x in range(10)])
ax[0].set_xlabel('x')
ax[0].set_title('y = x ** 2')

# Plot x ** 3 on the second canvas
ax[1].plot([x**3 for x in range(10)])
ax[1].set_xlabel('x')
ax[1].set_title('y = x ** 3')

Out[31]:

Text(0.5, 1.0, 'y = x ** 3')

This is also possible using the pyplot style. To make plt switch to one specific subplot, the plt.subplot(...) call should end with the currently-switched plot number.

In [32]:

# Draw x ** 2
plt.subplot(1, 2, 1) # 1 row, 2 columns of plots, switch to first plot
plt.plot([x**2 for x in range(10)])
plt.xlabel('x')
plt.title('y = x ** 2')

plt.subplot(1, 2, 2) # Same thing, but switch to second plot
plt.plot([x**3 for x in range(10)])
plt.xlabel('x')
plt.title('y = x ** 3')

Out[32]:

Text(0.5, 1.0, 'y = x ** 3')