In this notebook lecture, we will cover:
- Loading, filtering, and manipulating tabular data with the pandas package
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
0. Tabular Data Handling with numpy¶
Let’s say we have data in a file called “GaiaM92.csv” and we want to open it in Python. One option is to use numpy:
data = np.genfromtxt("./GaiaM92.csv", delimiter=",", skip_header=1)
data
array([[ 2.59278972e+02, 4.31089818e+01, nan, ...,
nan, 1.95672800e+01, nan],
[ 2.59289978e+02, 4.31050111e+01, nan, ...,
nan, 1.84345000e+01, 6.32349000e-01],
[ 2.59289054e+02, 4.31133782e+01, nan, ...,
nan, 1.92420200e+01, nan],
...,
[ 2.59257393e+02, 4.31632966e+01, -3.17254579e+00, ...,
1.17721990e+00, 1.90242880e+01, 6.21408460e-01],
[ 2.59272586e+02, 4.31676046e+01, -3.12183086e-01, ...,
4.06385234e+00, 1.78788530e+01, 6.74488070e-01],
[ 2.59270560e+02, 4.31673887e+01, -2.00777306e-01, ...,
1.26197385e+00, 1.91982980e+01, 7.78442400e-01]])
If we wanted to access rows, we could do the following:
print(data[0])
[259.27897188 43.10898178 nan nan nan
19.56728 nan]
Or if we wanted to access specific columns:
print(data[:, 0])
[259.27897188 259.28997796 259.28905426 ... 259.25739277 259.2725861
259.27056011]
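As an aside, numpy itself can attach column names if you pass names=True to genfromtxt, which returns a structured array. Here is a minimal sketch using io.StringIO as an in-memory stand-in for the Gaia file (the two-column data below is made up for illustration):

```python
import io
import numpy as np

# A tiny in-memory stand-in for a csv file (two columns, two rows)
text = "ra,dec\n259.28,43.11\n259.29,43.12"

# names=True reads the header row and yields a structured array,
# so columns can be accessed by name rather than integer position
data = np.genfromtxt(io.StringIO(text), delimiter=",", names=True)
print(data["ra"])  # [259.28 259.29]
```

Structured arrays work, but the syntax for filtering and extending them is clunkier than pandas, which is the point of the rest of this notebook.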
This works, but isn’t very intuitive or user-friendly. Fortunately, there is a much nicer way to do the same operations using the package pandas!
1. What is Pandas, and why should I use it?¶
Officially, pandas is an open-source data analysis and manipulation library for Python.
In practice, it’s a very user-friendly package for working with tabular data with Python.
When to use pandas¶
Pandas offers a convenient, easy-to-visualize interface to tabular data, including the reading/writing steps. This makes it ideal for:
(1) working with small or medium-sized datasets
(2) manipulating data within Jupyter notebooks, and
(3) exporting Python data to files.
This notebook will explore each of these cases in some form or another.
When not to use pandas¶
(1) The first thing to note is that pandas is never strictly necessary, and this is a pretty important point. Other packages can do everything that pandas can do, and most of its functionality can be reproduced with numpy alone. That said, pandas is often the most convenient, user-friendly option.
(2) This user-friendliness comes with some overhead: pandas code runs a bit slower than the equivalent numpy. For this reason, pandas is not a good option if you are trying to write maximally fast code, nor is it a good option for huge datasets.
(3) If you need to be very careful about datatypes, pandas is probably not the best tool to use. I almost never run into such situations, but if exporting data with explicit datatypes is important, then csv files (which pandas excels at) are not an option, and you should consider FITS files instead.
2. “DataFrames”: pandas’ key tool¶
The key tool that pandas introduces is the DataFrame. Fundamentally, it’s just a 2D datastructure with rows and columns (and sometimes an index) -- i.e., it’s basically a table.
2a. Creating a DataFrame from a file¶
Pandas can be a very flexible tool for loading in datafiles in many different formats. The one you might most often run into is using data contained in .csv files. Here’s how you could load that with pandas:
filepath = "./GaiaM92.csv"
df = pd.read_csv(
filepath
) ## df is short for dataframe; this is fairly typical notation for pandas
That’s it! That’s all it takes to generate a pandas dataframe from a file. This assumes that the file has a header row containing just the column names, but that’s pretty much it on the input formatting side. Let’s crack it open and see what it contains:
df
The table above looks very nice! It’s clearly formatted, shows you the column names without lots of extraneous info (e.g. datatypes), etc. It even tells you the dimensions of the table!
As you can also see, some entries in this table show “NaN” (“not a number”). And that’s OK! Pandas gracefully handles missing entries in tables (filling them in with NaN on read), which is one reason to use the package in the first place.
Now, the output of the above is obviously truncated. If you want to preview just the first few rows (here, the first 10), you can do:
df.head(10)
Have a long list of columns? You can print them easily:
print(df.columns)
Index(['ra', 'dec', 'parallax', 'pmra', 'pmdec', 'phot_g_mean_mag', 'bp_rp'], dtype='object')
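You can also grab several columns at once by passing a list of names, which returns a new, smaller DataFrame. A sketch (using a small stand-in dataframe with some of the Gaia column names, since the file itself isn’t bundled here):

```python
import pandas as pd

# A small stand-in dataframe with a few of the Gaia column names
df = pd.DataFrame({
    "ra": [259.28, 259.29],
    "dec": [43.11, 43.12],
    "phot_g_mean_mag": [19.57, 18.43],
})

# Passing a *list* of names returns a new DataFrame with just those columns
subset = df[["ra", "dec"]]
print(subset.columns.tolist())  # ['ra', 'dec']
```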
2b. Isolating Columns!¶
Pandas makes it trivial to access certain columns in your data. This is in contrast to raw numpy arrays. To do so with pandas, we can simply use the following notation:
df["ra"]
0 259.278972
1 259.289978
2 259.289054
3 259.290390
4 259.306256
...
8049 259.256567
8050 259.265129
8051 259.257393
8052 259.272586
8053 259.270560
Name: ra, Length: 8054, dtype: float64
The above is what we call a Series in pandas. It functions basically the same as a numpy array, but carries with it an index. If you just want the values, you can add “.values” to the end:
df["ra"].values
array([259.27897188, 259.28997796, 259.28905426, ..., 259.25739277,
259.2725861 , 259.27056011])
Importantly, the above is a numpy array already! This makes it somewhat clearer what is going on “under the hood”: pandas dataframes are very similar to organized collections of numpy arrays.
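Relatedly, newer pandas versions recommend .to_numpy() over .values; both return the underlying numpy array. A quick sketch with a stand-in Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([259.28, 259.29], name="ra")

# .to_numpy() is the modern spelling of .values; both give a numpy array
arr = s.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```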
If you want a single row, you will need the index of that row. For the case of index 0, this looks like:
df.iloc[0].values
array([259.27897188, 43.10898178, nan, nan,
nan, 19.56728 , nan])
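Note that .iloc is positional; its cousin .loc selects by index label instead. A minimal sketch of the difference, using a deliberately non-default index:

```python
import pandas as pd

# Index labels here are deliberately not 0, 1, 2
df = pd.DataFrame({"ra": [259.28, 259.29, 259.30]}, index=[100, 200, 300])

print(df.iloc[0]["ra"])   # 259.28 -- first row by *position*
print(df.loc[100]["ra"])  # 259.28 -- row whose index *label* is 100
```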
2c. Creating new columns in an existing dataframe¶
df
Want to use a column (or more than one!) to create a new column? It’s super simple with pandas. Let’s see how that works:
df["NewColumn"] = np.sqrt(df["pmra"] ** 2 + df["pmdec"] ** 2)
Now if we re-show the dataframe, our new column will be there
df.head()
What if you want an entirely new column that does not depend on other columns (unlike the case above)? Let’s see the case of adding a new column of random numbers.
df["NewColumn2"] = np.random.uniform(size=len(df))
df.head()
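The reverse operation, removing a column, is just as simple via .drop(). A sketch with a small stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({"pmra": [1.0, 2.0], "pmdec": [3.0, 4.0], "junk": [0, 0]})

# drop(columns=...) returns a copy without the named column(s);
# the original dataframe is left untouched
trimmed = df.drop(columns=["junk"])
print(trimmed.columns.tolist())  # ['pmra', 'pmdec']
```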
2d. Other ways to load in tables:¶
As noted before, there are a number of other ways you can load tabular data from files. For example,
df2 = pd.read_csv(
"https://support.staffbase.com/hc/en-us/article_attachments/360009197011/username-password-recovery-code.csv"
)
df2
Oh no! It looks mangled. The problem here is that the delimiter is wrong. In other words, the character specifying the separation between entries is not correct. Before it was a comma (hence the “c” in csv file), but here it’s a semicolon.
Pandas can handle that. Just use the delimiter= argument as below:
df2 = pd.read_csv(
"https://support.staffbase.com/hc/en-us/article_attachments/360009197011/username-password-recovery-code.csv",
delimiter=";",
)
df2
and voila! It’s fixed. The same can even be done if the delimiter is whitespace, in which case you can pass delim_whitespace=True (or, in newer pandas versions, sep=r"\s+") to read_csv(). In short: pandas can handle data that is not simply a generic .csv file.
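A self-contained sketch of both cases, using io.StringIO as an in-memory stand-in for a file (so it runs without the URL above; the toy column names are made up):

```python
import io
import pandas as pd

# Semicolon-separated "file" (two data rows, two columns)
semi = "col_a;col_b\n1;2\n3;4"
df2 = pd.read_csv(io.StringIO(semi), delimiter=";")
print(df2.shape)  # (2, 2)

# Whitespace-separated "file"; sep=r"\s+" is the modern spelling
ws = "col_a col_b\n1 2"
df3 = pd.read_csv(io.StringIO(ws), sep=r"\s+")
print(df3.shape)  # (1, 2)
```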
2e. Saving Dataframes¶
To save a dataframe (e.g., after you have made edits), one need only call .to_csv():
df.to_csv("Modified_data.csv")
which will save your file.
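One gotcha worth knowing: by default, to_csv() also writes the index as an extra first column. Passing index=False keeps the file to just your data columns. A round-trip sketch (the filename is just an example):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# index=False suppresses the extra index column in the output file
df.to_csv("Modified_data.csv", index=False)

# Reading it back gives exactly the columns we started with
df_back = pd.read_csv("Modified_data.csv")
print(df_back.columns.tolist())  # ['x', 'y']
```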
3. Creating DataFrames “From Scratch”: a great way to save results¶
Oftentimes, we don’t begin with data in a file, but we want to save our data into one. For example, one situation that often comes up in research is that you have run trials or a Monte Carlo simulation and you want a way to export that information (e.g., for exact reproducibility of your plots).
Let’s generate some mock data to demonstrate this:
array_1 = np.random.normal(size=1000)
array_2 = 10 * array_1
and here’s how to make that into a dataframe assuming each array above is its own column:
combined = np.column_stack([array_1, array_2])
new_dataframe = pd.DataFrame(combined)
new_dataframe
This obviously doesn’t have column names, but we can add them easily:
new_dataframe.columns = ["col1", "col2"]
new_dataframe.head()
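The two steps above can also be collapsed into one, since pd.DataFrame accepts a columns= keyword at construction time:

```python
import numpy as np
import pandas as pd

array_1 = np.random.normal(size=1000)
array_2 = 10 * array_1
combined = np.column_stack([array_1, array_2])

# Names supplied up front; no separate .columns assignment needed
new_dataframe = pd.DataFrame(combined, columns=["col1", "col2"])
print(new_dataframe.shape)  # (1000, 2)
```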
Using Dictionaries
The same can also be achieved using dictionaries and the from_dict() method. For example,
data_dictionary = {"col1": array_1, "col2": array_2}
new_dataframe = pd.DataFrame.from_dict(data_dictionary)
new_dataframe.head()
4. Filtering Data¶
What if we want to access subsets of the data? It works just like boolean masking with numpy arrays: you simply build the mask from columns. For example,
filter1 = df["pmra"] > 1
filter1
0 False
1 False
2 False
3 False
4 False
...
8049 False
8050 False
8051 False
8052 False
8053 False
Name: pmra, Length: 8054, dtype: bool
This generates a Series full of boolean (True/False) values. This might not be very useful on its own, but if we want to fully isolate the rows, we can use these booleans to create a new dataframe that meets our condition. For example,
filtered_dataframe = df[filter1]
The same works with multiple conditions (and can be done in one line - just watch parentheses placement):
filtered_dataframe = df[(df["pmra"] > 4) & (df["pmdec"] > 4)]
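As a self-contained sketch of how the boolean operators combine (using a tiny stand-in dataframe): & is AND, | is OR, ~ is NOT, and each comparison needs its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({"pmra": [0.5, 2.0, 5.0], "pmdec": [6.0, 1.0, 5.0]})

both = df[(df["pmra"] > 4) & (df["pmdec"] > 4)]        # AND
either = df[(df["pmra"] > 4) | (df["pmdec"] > 4)]      # OR
neither = df[~((df["pmra"] > 4) | (df["pmdec"] > 4))]  # NOT
print(len(both), len(either), len(neither))  # 1 2 1
```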
5. Plotting Data¶
Pandas dataframe columns can be passed to matplotlib just like any other numpy array. Let’s see how that works:
plt.scatter(new_dataframe["col1"], new_dataframe["col2"])
An entirely equivalent method is to do as follows:
new_dataframe.plot.scatter("col1", "col2")
This one comes with free labels!
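To see what “free labels” means: df.plot.scatter names the axes after the columns automatically, whereas with plt.scatter you would call plt.xlabel/plt.ylabel yourself. A sketch (the Agg backend is used here only so it runs without a display; in a notebook you can omit those two lines):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe outside notebooks
import numpy as np
import pandas as pd

new_dataframe = pd.DataFrame({"col1": np.random.normal(size=100)})
new_dataframe["col2"] = 10 * new_dataframe["col1"]

# pandas' plot.scatter sets the x/y axis labels from the column names
ax = new_dataframe.plot.scatter("col1", "col2")
print(ax.get_xlabel(), ax.get_ylabel())  # col1 col2
```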