Scientific and Engineering Libraries, Part 2

Weeks 12 and 13 cover external scientific and engineering libraries that are not part of core Python but are popular enough that we should learn them too.

Pandas

Pandas is a Python library for easy and efficient computation on table-like data. You can install it with either pip or conda, for example with the following command:

pip install pandas

We generally import pandas as "pd".

Assume that you have the following data:

We can read this using pandas by converting it to a pandas.DataFrame:

Here's how it looks:
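A minimal sketch of the steps above. The records here are made up for illustration (they are not the course data set), and the column names name, age and grade are assumptions based on how the data is described later in these notes.

```python
import pandas as pd

# Made-up records for illustration only; not the course data set.
data = {
    "name": ["Jack", "Mary", "Omar", "Lena"],
    "age": [21, 25, 23, 30],
    "grade": [70, 85, 90, 78],
}

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame(data)

# Printing a DataFrame shows it as a table with a default 0..n-1 row index.
print(df)
```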

Pandas has many reader and writer functions for reading and writing data in different file formats.

We can quickly save our data as a CSV file with .to_csv():

And read it back using pd.read_csv():
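A small round-trip sketch, using made-up data and a hypothetical file name (students.csv) inside a temporary directory so that nothing is left on disk. Note that to_csv() is a method of the DataFrame, while read_csv() is a function of the pandas module.

```python
import os
import tempfile

import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame({"name": ["Jack", "Mary"], "age": [21, 25], "grade": [70, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "students.csv")  # hypothetical file name

    # index=False leaves the 0..n-1 row labels out of the file.
    df.to_csv(csv_path, index=False)

    # read_csv parses the file back into a new DataFrame.
    df_back = pd.read_csv(csv_path)

print(df_back)
```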

Our data has three columns, and ten students. We can access a column like this:
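For instance, with a small made-up table (the column names are assumptions), indexing with a column name returns that column as a pandas Series:

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar"], "age": [21, 25, 23], "grade": [70, 85, 90]}
)

# Square brackets with a column name select a single column.
ages = df["age"]

print(ages)
print(type(ages))  # a single column is a pandas Series
```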

If we want, we can also create "horizontal columns" (i.e., an index, which gives each row a label), using the names of the students:

This way, we can access a student's information by providing their name as an index to the .loc[...] attribute:
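A short sketch of both steps, assuming a made-up table with a name column: set_index() turns that column into the row labels, and .loc[...] then looks rows up by label.

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar"], "age": [21, 25, 23], "grade": [70, 85, 90]}
)

# set_index returns a new DataFrame whose row labels are the student names.
df = df.set_index("name")

# .loc selects by label, so this returns Jack's row as a Series.
jack = df.loc["Jack"]
print(jack)
```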

If we didn't have an index, we could access a "row" using the .iloc[...] attribute with a numeric index:
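As an illustration with made-up data, .iloc[...] selects rows by position (starting from 0), regardless of what the row labels are:

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar"], "age": [21, 25, 23], "grade": [70, 85, 90]}
)

# .iloc selects by integer position: 0 is the first row.
first_row = df.iloc[0]
print(first_row)

# Slices work too: the first two rows, as a DataFrame.
first_two = df.iloc[0:2]
print(first_two)
```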

We can access a specific grade, using the following syntax:
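One way to do this (a sketch with made-up data; the exact syntax used in the original notebook is not shown here) is to pass both the row label and the column name to a single .loc[...] call:

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar"], "age": [21, 25, 23], "grade": [70, 85, 90]}
).set_index("name")

# .loc[row_label, column_name] returns a single cell.
jack_grade = df.loc["Jack", "grade"]
print(jack_grade)
```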

Chained indexing

One pitfall to watch out for: when you want to change a value inside a DataFrame, you should avoid chained indexing, that is, two indexing operations applied back to back.

If we try to change Jack's grade using this syntax, we get a warning:

Depending on the pandas version, the value may or may not actually be updated, because the first indexing operation can return a copy of the data, and the assignment then modifies that copy rather than the original DataFrame.

The best practice is to do this instead, which will not produce a warning:
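A sketch contrasting the two styles on made-up data. The chained form is left commented out because its behavior depends on the pandas version (it may warn, and it may leave the DataFrame unchanged); the single .loc[row, column] assignment works directly on the original.

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar"], "age": [21, 25, 23], "grade": [70, 85, 90]}
).set_index("name")

# Discouraged: chained indexing. df.loc["Jack"] may produce a temporary copy,
# so the assignment below can be lost and pandas may emit a warning.
# df.loc["Jack"]["grade"] = 95

# Preferred: one indexing operation with both the row label and the column.
df.loc["Jack", "grade"] = 95

print(df)
```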

Pandas has some nice functions that are useful for getting quick statistics about your data:

For numeric data, .describe() will produce some general stats about the given columns:

For example, we see that the mean age is 24.72.
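A quick sketch of .describe() on a small made-up table. Because the data is invented, the numbers differ from the ones quoted above; the point is only the shape of the result, which is itself a DataFrame that can be indexed.

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar", "Lena"],
     "age": [21, 25, 23, 30],
     "grade": [70, 85, 90, 78]}
)

# By default, describe() summarizes only the numeric columns.
stats = df.describe()
print(stats)

# Individual statistics can be picked out with .loc[statistic, column].
mean_age = stats.loc["mean", "age"]
print(mean_age)
```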

.sort_values(col) will reorder the data so that the rows are sorted from lowest to highest by the given column:

Using ascending=False will sort the values from largest to smallest:
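Both sort directions, sketched on made-up data. sort_values() returns a new, sorted DataFrame and leaves the original untouched.

```python
import pandas as pd

# Made-up records for illustration only (grades are all distinct).
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar", "Lena"],
     "age": [21, 25, 23, 30],
     "grade": [70, 85, 90, 78]}
)

# Lowest grade first (the default).
by_grade = df.sort_values("grade")
print(by_grade)

# Highest grade first.
by_grade_desc = df.sort_values("grade", ascending=False)
print(by_grade_desc)
```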

.value_counts(col) counts how many times each distinct value occurs in the given column:
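A sketch on made-up data where one grade occurs twice. It shows the call on the whole DataFrame with a column name, and the equivalent call on a single column.

```python
import pandas as pd

# Made-up records for illustration only; the grade 85 appears twice.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar", "Lena"],
     "grade": [70, 85, 90, 85]}
)

# Count occurrences of each distinct grade, most frequent first.
df_counts = df.value_counts("grade")
print(df_counts)

# The same count, starting from the column itself.
counts = df["grade"].value_counts()
print(counts)
```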

.nlargest(num, col) will give you the num rows with the largest values in the given column:
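For example, on a made-up table, the two students with the highest grades can be selected like this:

```python
import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame(
    {"name": ["Jack", "Mary", "Omar", "Lena"],
     "grade": [70, 85, 90, 78]}
)

# The 2 rows with the largest values in the "grade" column, largest first.
top_two = df.nlargest(2, "grade")
print(top_two)
```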

Finally, .plot() will draw line plots of your data:
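A minimal sketch with made-up numeric data. It assumes matplotlib is installed, since pandas uses it as its default plotting backend. Each numeric column becomes one line, plotted against the row index.

```python
import pandas as pd

# Made-up numeric data for illustration only.
df = pd.DataFrame({"age": [21, 25, 23, 30], "grade": [70, 85, 90, 78]})

# df.plot() draws one line per numeric column and returns a matplotlib Axes.
ax = df.plot()
ax.set_xlabel("row index")

# In a script, call matplotlib.pyplot.show() to open the window;
# notebooks render the figure automatically.
```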

As a final remark, if you ever wonder what pandas uses internally to store its data: for numeric columns, the default is NumPy arrays:
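This can be checked with .to_numpy(), sketched here on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up numeric data for illustration only.
df = pd.DataFrame({"age": [21, 25, 23], "grade": [70, 85, 90]})

# A single column as a one-dimensional NumPy array.
ages = df["age"].to_numpy()
print(type(ages), ages.shape)

# The whole table as a two-dimensional NumPy array (rows x columns).
table = df.to_numpy()
print(type(table), table.shape)
```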

Matplotlib

Matplotlib is a widely-used library to produce plots and graphs of various forms in Python.

You can install it via

pip install matplotlib

Let's go over the anatomy of a matplotlib graph, from the example in the textbook:

The figure above has several key features:

Every one of these components is customizable in matplotlib.

Let's draw the simplest graph, using the pyplot API.

plt.plot(x_values, y_values) draws a line plot that passes through each (x, y) pair of points given by x_values and y_values, which can be either lists or NumPy arrays:
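A minimal sketch, using a sine curve as made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np

# 50 evenly spaced x values between 0 and 2*pi, and their sines.
x_values = np.linspace(0, 2 * np.pi, 50)
y_values = np.sin(x_values)

plt.figure()  # start a fresh figure
plt.plot(x_values, y_values)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("A simple line plot")

# In a script, call plt.show() to open the window;
# notebooks render the figure automatically.
```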

If we need to plot multiple lines into the same plot, we can just call plt.plot() multiple times.
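For example, two curves on the same axes (made-up data). Giving each line a label lets plt.legend() tell them apart.

```python
import matplotlib.pyplot as plt
import numpy as np

x_values = np.linspace(0, 2 * np.pi, 50)

plt.figure()  # start a fresh figure
# Each plt.plot() call adds another line to the current axes.
plt.plot(x_values, np.sin(x_values), label="sin(x)")
plt.plot(x_values, np.cos(x_values), label="cos(x)")
plt.legend()

# In a script, call plt.show() to open the window.
```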

The following example uses the object-oriented (OOP) approach to matplotlib: call plt.subplots() to get fig and ax objects, and use the ax object to draw the graph.

Notice that most method names on ax match their plt counterparts, but some gain a set_ prefix (for example, plt.title() becomes ax.set_title()).
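A sketch of the OOP style with made-up data. Drawing calls such as plot() and legend() keep their names, while the label and title calls become set_xlabel(), set_ylabel() and set_title().

```python
import matplotlib.pyplot as plt
import numpy as np

x_values = np.linspace(0, 2 * np.pi, 50)

# With no arguments, plt.subplots() returns one Figure and one Axes.
fig, ax = plt.subplots()

ax.plot(x_values, np.sin(x_values), label="sin(x)")  # same name as plt.plot
ax.set_xlabel("x")                # plt.xlabel -> ax.set_xlabel
ax.set_ylabel("y")                # plt.ylabel -> ax.set_ylabel
ax.set_title("OOP-style plot")    # plt.title  -> ax.set_title
ax.legend()                       # same name as plt.legend

# In a script, call plt.show() to open the window.
```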

We frequently need to draw multiple subplots inside a single figure. In the OOP style, pass the number of rows and columns to plt.subplots() to get multiple subplots, and then draw each plot on its own Axes object:
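A sketch with one row and two columns of subplots (made-up data). With more than one subplot, plt.subplots() returns the Axes objects as a NumPy array.

```python
import matplotlib.pyplot as plt
import numpy as np

x_values = np.linspace(0, 2 * np.pi, 50)

# 1 row, 2 columns: axes is an array holding two Axes objects.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot(x_values, np.sin(x_values))
axes[0].set_title("sin(x)")

axes[1].plot(x_values, np.cos(x_values))
axes[1].set_title("cos(x)")

fig.tight_layout()  # avoid overlapping labels between subplots

# In a script, call plt.show() to open the window.
```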

This is also possible using the pyplot style. To make plt switch to a specific subplot, call plt.subplot(...) with the number of that subplot as the last argument; numbering starts at 1.
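A pyplot-style sketch of the same two-panel layout, with made-up data. Each plt.subplot(rows, columns, index) call makes that subplot the current one, and the plt calls that follow draw into it.

```python
import matplotlib.pyplot as plt
import numpy as np

x_values = np.linspace(0, 2 * np.pi, 50)

plt.figure(figsize=(8, 3))  # start a fresh figure

# 1 row, 2 columns, switch to subplot 1 (the left one).
plt.subplot(1, 2, 1)
plt.plot(x_values, np.sin(x_values))
plt.title("sin(x)")

# Same grid, switch to subplot 2 (the right one).
plt.subplot(1, 2, 2)
plt.plot(x_values, np.cos(x_values))
plt.title("cos(x)")

plt.tight_layout()

# In a script, call plt.show() to open the window.
```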