5.2 Pandas#
Many kinds of real-world data are stored in a tabular format. This means two-dimensional tables structured into rows and columns, with each observation typically taking up a row and each column representing a single variable.
The Pandas library is a popular Python library for dealing with tabular data. Compared to NumPy, Pandas limits us to two-dimensional tables, but in exchange we gain flexibility, e.g. the ability to store variables of different types.
import pandas as pd
DataFrames#
A DataFrame is the core data structure of the Pandas library. It is ideal for working with tabular data, making it easy to manipulate, filter, and analyse data sets. We can create and print a simple example as follows:
my_df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3']})
print(my_df)
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
We can also read data from other sources such as Excel or CSV (comma-separated values) files. CSV files are a good way of storing tabular data as they are plain text, meaning you can open and read them with any kind of text editor or programming language. Usually, CSV files have a header row containing the labels (column names).
As an example, let us load some data that was collected as part of a study of life span changes in brain tissue properties [YWM14]. We can simply load this file from the internet by using the pd.read_csv()
function. The only argument required is the data we want to load. In this case, we provide a URL, as the data is stored on the internet; usually, you would provide a path pointing to a file on your computer. All other arguments are optional, but they can provide additional instructions for what to do with the data. Here, we use usecols=[1,2,3,4,5,6,7]
to select only some specific columns, na_values="NaN"
to specify that the string "NaN" ("not a number") marks a missing value, and index_col=0
to use the first column as our index:
yeatman_data = pd.read_csv("https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
usecols=[1,2,3,4,5,6,7],
na_values="NaN",
index_col=0)
print(yeatman_data.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab
subjectID
subject_000 20 Male NaN 139.0 65.0 77.0
subject_001 31 Male NaN 129.0 58.0 74.0
subject_002 18 Female NaN 130.0 63.0 70.0
subject_003 28 Male Right NaN NaN NaN
subject_004 29 Male NaN NaN NaN NaN
The variable yeatman_data
is now a Pandas DataFrame containing our data, and we can use the .head()
method to look at its first few rows. The leftmost column, subjectID
, is the index column, while the other columns contain the data for different variables such as age, gender, or IQ. You can also see that the DataFrame is, in contrast to the previously introduced NumPy arrays, heterogeneously typed: it can contain variables of different types (e.g. strings or floats). Some values are already visibly missing; for example, subject_003
and subject_004
are missing values for the IQ-related columns.
Summarizing#
Pandas also contains useful methods to e.g. summarize the data. We can use .info()
to get a closer look at our data:
print(yeatman_data.info())
<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, subject_000 to subject_076
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 77 non-null int64
1 Gender 76 non-null object
2 Handedness 66 non-null object
3 IQ 63 non-null float64
4 IQ_Matrix 63 non-null float64
5 IQ_Vocab 63 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 4.2+ KB
None
Most of this information should already make sense. One observation worth making is that Gender and Handedness are stored as objects. This is because these columns contain a mixture of strings and NaNs, and NaN is technically a floating-point value, so no more specific data type fits.
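To see why such columns end up with the object dtype, consider this small sketch (the values are invented, not taken from the study data):

```python
import pandas as pd
import numpy as np

# A column mixing strings and NaN gets the generic "object" dtype,
# because NaN is a float and cannot live in a pure string column.
handedness = pd.Series(["Right", "Left", np.nan])
print(handedness.dtype)  # object
```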
We can also get a first statistical summary of the numerical columns by using the .describe()
method. NaN values are ignored in these calculations, but the count
row tells you how many values were used for each column.
print(yeatman_data.describe())
Age IQ IQ_Matrix IQ_Vocab
count 77.000000 63.000000 63.000000 63.000000
mean 18.961039 122.142857 60.539683 64.015873
std 12.246849 12.717599 7.448372 8.125015
min 6.000000 92.000000 41.000000 36.000000
25% 9.000000 114.000000 57.000000 60.000000
50% 14.000000 122.000000 61.000000 64.000000
75% 28.000000 130.000000 64.500000 70.000000
max 50.000000 151.000000 76.000000 80.000000
Indexing#
In previous sessions we already learned about indexing and slicing as a way of accessing individual elements of e.g. lists. Pandas DataFrames also support a variety of indexing and slicing operations. For example, we can use label-based indexing to select rows through the .loc
attribute, indexing with square brackets:
print(yeatman_data.loc["subject_000"])
Age 20
Gender Male
Handedness NaN
IQ 139.0
IQ_Matrix 65.0
IQ_Vocab 77.0
Name: subject_000, dtype: object
In case we do not know the exact label of the subject but only its position (e.g. we want to access the first subject), we can use the .iloc
attribute instead:
print(yeatman_data.iloc[0])
Age 20
Gender Male
Handedness NaN
IQ 139.0
IQ_Matrix 65.0
IQ_Vocab 77.0
Name: subject_000, dtype: object
This returns the same information: instead of looking for the label subject_000
, we are asking for the first row, at position 0. You might now ask yourself how we can index a two-dimensional table with just one index. The answer is that it is simply a shorthand for the full expression:
print(yeatman_data.iloc[0, :])
Age 20
Gender Male
Handedness NaN
IQ 139.0
IQ_Matrix 65.0
IQ_Vocab 77.0
Name: subject_000, dtype: object
Remember from previous sections that :
stands for “all values”. This means we can also apply slicing to extract a subset of columns:
print(yeatman_data.iloc[0, 2:5])
Handedness NaN
IQ 139.0
IQ_Matrix 65.0
Name: subject_000, dtype: object
Similarly, we can also access a single column:
print(yeatman_data.iloc[:, 0])
subjectID
subject_000 20
subject_001 31
subject_002 18
subject_003 28
subject_004 29
..
subject_072 40
subject_073 50
subject_074 40
subject_075 17
subject_076 17
Name: Age, Length: 77, dtype: int64
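One detail worth knowing: unlike positional slicing with .iloc, label-based slicing with .loc includes the end label. A minimal sketch with a made-up mini DataFrame (the labels and values are invented):

```python
import pandas as pd

# Hypothetical miniature data set, not from the Yeatman study
df = pd.DataFrame({"Age": [20, 31, 18], "IQ": [139.0, 129.0, 130.0]},
                  index=["s0", "s1", "s2"])

# .loc slices INCLUDE the end label, so this returns rows s0 AND s1
print(df.loc["s0":"s1", "IQ"])
```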
However, while .loc
and .iloc
are powerful attributes, we can also simply address columns directly by their name:
print(yeatman_data["Age"])
subjectID
subject_000 20
subject_001 31
subject_002 18
subject_003 28
subject_004 29
..
subject_072 40
subject_073 50
subject_074 40
subject_075 17
subject_076 17
Name: Age, Length: 77, dtype: int64
If we assign this column to a new variable, it will result in a Pandas Series
, which is a one-dimensional series of values. Series are pretty similar to DataFrames (essentially, a DataFrame is a collection of Series):
age = yeatman_data["Age"]
print(age['subject_072'])
print(age.iloc[72])
40
40
Similarly, we can create a new subset of the DataFrame containing only the variables Age
and IQ
. This can be done by indexing with a list of column names and assigning the resulting subset to a new variable:
yeatman_subset = yeatman_data[["Age", "IQ"]]
print(yeatman_subset.head())
Age IQ
subjectID
subject_000 20 139.0
subject_001 31 129.0
subject_002 18 130.0
subject_003 28 NaN
subject_004 29 NaN
Computations#
Like NumPy arrays, Pandas DataFrames have many methods that allow for computations. However, as we only deal with tabular data, the dimensions are always the same: the columns are the variables and the rows are the observations. We can simply calculate the means of all variables in the DataFrame:
yeatman_means = yeatman_data.mean(numeric_only=True)
print(yeatman_means)
print(yeatman_means["Age"])
Age 18.961039
IQ 122.142857
IQ_Matrix 60.539683
IQ_Vocab 64.015873
dtype: float64
18.961038961038962
Since not all variables are numeric, we pass numeric_only=True
as an argument to the mean method. We can also directly calculate the mean for an individual Series:
yeatman_data["IQ"].mean()
122.14285714285714
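Note that, as with .describe(), NaN values are skipped by default when computing such statistics; this behaviour can be controlled with the skipna argument. A small sketch with made-up scores:

```python
import pandas as pd
import numpy as np

scores = pd.Series([100.0, 120.0, np.nan, 140.0])

# NaNs are skipped by default, so the mean is computed over 3 values
print(scores.mean())  # 120.0

# Forcing NaNs into the calculation propagates them to the result
print(scores.mean(skipna=False))  # nan
```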
We can further perform arithmetic on DataFrames. For example, we can calculate a standardized z-score for the age of each subject:
age_mean = yeatman_data["Age"].mean()
age_std = yeatman_data["Age"].std()
print((yeatman_data["Age"] - age_mean ) / age_std)
subjectID
subject_000 0.084835
subject_001 0.983025
subject_002 -0.078472
subject_003 0.738064
subject_004 0.819718
...
subject_072 1.717908
subject_073 2.534445
subject_074 1.717908
subject_075 -0.160126
subject_076 -0.160126
Name: Age, Length: 77, dtype: float64
It is often useful to then save the result as a new variable in our DataFrame. For example, we can create a new column called Age_zscore
and assign our results to it:
yeatman_data["Age_zscore"] = (yeatman_data["Age"] - age_mean ) / age_std
print(yeatman_data.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab Age_zscore
subjectID
subject_000 20 Male NaN 139.0 65.0 77.0 0.084835
subject_001 31 Male NaN 129.0 58.0 74.0 0.983025
subject_002 18 Female NaN 130.0 63.0 70.0 -0.078472
subject_003 28 Male Right NaN NaN NaN 0.738064
subject_004 29 Male NaN NaN NaN NaN 0.819718
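As a quick sanity check on the z-score formula: by construction, the standardized values have mean 0 and (sample) standard deviation 1. A small sketch with invented ages:

```python
import pandas as pd

ages = pd.Series([20.0, 31.0, 18.0, 28.0])
z = (ages - ages.mean()) / ages.std()

# Mean ~0 and standard deviation ~1, up to floating-point error
print(z.mean(), z.std())
```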
Filtering#
Similar to logical indexing in NumPy, we can also filter our data set based on some properties. For example, let's assume we want to be able to filter out subjects below the age of 18 in our analysis. We can simply create a new Boolean variable in the DataFrame which codes for this condition:
yeatman_data["Age_below_18"] = yeatman_data["Age"] < 18
print(yeatman_data.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab Age_zscore \
subjectID
subject_000 20 Male NaN 139.0 65.0 77.0 0.084835
subject_001 31 Male NaN 129.0 58.0 74.0 0.983025
subject_002 18 Female NaN 130.0 63.0 70.0 -0.078472
subject_003 28 Male Right NaN NaN NaN 0.738064
subject_004 29 Male NaN NaN NaN NaN 0.819718
Age_below_18
subjectID
subject_000 False
subject_001 False
subject_002 False
subject_003 False
subject_004 False
As you can see, we have now extended our original DataFrame by another column which tells us whether the corresponding subject is younger than 18.
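Such a Boolean Series can also be used directly to select rows. A minimal sketch with a made-up mini data set (the labels and ages are invented):

```python
import pandas as pd

# Hypothetical miniature data set
df = pd.DataFrame({"Age": [20, 10, 15, 31]},
                  index=["s0", "s1", "s2", "s3"])

# Indexing with a Boolean Series keeps only the rows where it is True
minors = df[df["Age"] < 18]
print(minors)
```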
MultiIndex#
Sometimes we want to select groups defined by combinations of variables. For example, we might want to analyze the data based on both gender and age. One way of doing this is to change the index of the DataFrame to be made up of more than one column, a so-called MultiIndex DataFrame. We can create one by applying the set_index()
method of the DataFrame:
multi_index = yeatman_data.set_index(["Gender", "Age_below_18"])
print(multi_index.head())
Age Handedness IQ IQ_Matrix IQ_Vocab Age_zscore
Gender Age_below_18
Male False 20 NaN 139.0 65.0 77.0 0.084835
False 31 NaN 129.0 58.0 74.0 0.983025
Female False 18 NaN 130.0 63.0 70.0 -0.078472
Male False 28 Right NaN NaN NaN 0.738064
False 29 NaN NaN NaN NaN 0.819718
You can now see that we have two indices. This means we can use the .loc
indexer to select rows based on both of them:
male_below_18 = multi_index.loc["Male", True]
print(male_below_18.describe())
Age IQ IQ_Matrix IQ_Vocab Age_zscore
count 24.000000 21.000000 21.000000 21.000000 24.000000
mean 10.166667 121.952381 60.428571 63.857143 -0.718093
std 2.823299 14.589298 8.930685 7.856753 0.230533
min 6.000000 92.000000 41.000000 49.000000 -1.058316
25% 8.750000 112.000000 56.000000 60.000000 -0.833769
50% 9.000000 121.000000 60.000000 64.000000 -0.813355
75% 12.000000 132.000000 67.000000 70.000000 -0.568394
max 17.000000 146.000000 76.000000 79.000000 -0.160126
/tmp/ipykernel_3407/2795514678.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
male_below_18 = multi_index.loc["Male", True]
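The PerformanceWarning indicates that the MultiIndex is not sorted. Sorting it with sort_index() typically avoids the warning and speeds up lookups; a minimal sketch with an invented MultiIndex:

```python
import pandas as pd

# Hypothetical MultiIndex data, not from the study
idx = pd.MultiIndex.from_tuples(
    [("Male", True), ("Female", False), ("Male", False), ("Female", True)],
    names=["Gender", "Age_below_18"])
df = pd.DataFrame({"IQ": [120.0, 124.0, 131.0, 118.0]}, index=idx)

# Sorting the MultiIndex enables fast, warning-free label lookups
df = df.sort_index()
print(df.loc[("Male", True)])
```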
This might already seem useful, but it can become quite cumbersome if you want to repeat this for many kinds of combinations. Because grouping data into subgroups is such a common pattern in data analysis, Pandas offers a built-in way of doing so, which we will explore in the following subsection.
Split-Apply-Combine#
A problem we commonly face in data analysis is the following: we (1) want to take a data set and split it into subsets, (2) independently apply some operation to each subset, and (3) combine the results of all independent operations into a new data set. This pattern is called split-apply-combine.
For example, let’s start with splitting the data by the Gender
column:
gender_groups = yeatman_data.groupby("Gender")
The new gender_groups
variable is a DataFrameGroupBy
object, which is pretty similar to a normal DataFrame, with the additional feature of having distinct groups within it. This means we can perform many operations just as if we were working with a normal DataFrame, the only difference being that each operation is applied independently to each subset.
For example, we can calculate the mean for each group:
print(gender_groups.mean(numeric_only=True))
Age IQ IQ_Matrix IQ_Vocab Age_zscore Age_below_18
Gender
Female 18.351351 120.612903 59.419355 63.516129 -0.049783 0.567568
Male 18.743590 123.625000 61.625000 64.500000 -0.017756 0.615385
The output of this operation is a DataFrame containing the summary, with the original DataFrame's Gender
variable as the index. This means we can apply standard indexing operations to it as well, e.g. to get the mean age of female subjects:
print(gender_groups.mean(numeric_only=True).loc["Female", "Age"])
18.35135135135135
We can further group by multiple columns at once:
gender_age_groups = yeatman_data.groupby(["Gender", "Age_below_18"])
print(gender_age_groups.mean(numeric_only=True))
Age IQ IQ_Matrix IQ_Vocab Age_zscore
Gender Age_below_18
Female False 28.625000 123.636364 61.000000 65.454545 0.789098
True 10.523810 118.950000 58.550000 62.450000 -0.688931
Male False 32.466667 126.818182 63.909091 65.727273 1.102784
True 10.166667 121.952381 60.428571 63.857143 -0.718093
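Beyond .mean(), grouped objects can also compute several statistics at once via the .agg() method. A small sketch with invented data:

```python
import pandas as pd

# Hypothetical miniature data set
df = pd.DataFrame({"Gender": ["Female", "Female", "Male", "Male"],
                   "IQ": [130.0, 120.0, 139.0, 129.0]})

# .agg applies each listed aggregation independently to every group
summary = df.groupby("Gender")["IQ"].agg(["mean", "max"])
print(summary)
```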
Joining Tables#
Another useful feature of Pandas is its ability to join data. For example, let's assume we have three DataFrames with the same columns but different indices. This could happen if you measured the same variables for multiple subjects on three different measurement days. The index would then be the individual subject, and the three DataFrames would hold the data you acquired on e.g. Monday, Tuesday, and Wednesday:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
Here it might be intuitive to just concatenate them into one big DataFrame:
combined_df = pd.concat([df1, df2, df3])
print(combined_df)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
In this case, we see that the concatenation is quite straightforward and successful. But what if the DataFrames do not have an identical structure? Let's assume we have df4
, which has the index values \(2\) and \(3\) as well as the columns B
and D
in common with df1
, but which also has the additional indices \(6\) and \(7\) as well as a new column F
:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D4', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
combined_df = pd.concat([df1, df4])
print(combined_df)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
2 NaN B2 NaN D2 F2
3 NaN B3 NaN D4 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
The results look a bit more interesting. Here, Pandas tried to preserve as much information as possible, meaning the new DataFrame contains all columns/variables present in the two original DataFrames. Wherever possible, Pandas will merge the columns into one. This is true for columns B
and D
, as they exist in both original DataFrames. For all other columns, Pandas preserves the input values and inserts NaN
s for the missing values.
There are also other, more complicated scenarios which we will not cover here. For example, you might want to concatenate along the second axis (columns) instead of the first (rows). Don't be afraid to try things out if you ever need something more specialized. Getting used to working with data sets takes time, but no matter your specific goal, it will more likely than not be achievable with just a few lines of code.
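As a brief sketch of the column-wise case (with invented toy tables): pd.concat with axis=1 aligns rows on the index and glues the tables side by side, again filling gaps with NaN:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
right = pd.DataFrame({"B": ["B1", "B2"]}, index=[1, 2])

# axis=1 concatenates column-wise; only index 1 exists in both inputs,
# so the other rows are padded with NaN
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```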
Errors#
Before closing this section, I would like to emphasize a few error patterns that are specific to Pandas and which you will most likely encounter at some point in your own projects.
One common pattern of errors comes from confusing Series and DataFrame objects. While we previously learned that they are indeed pretty similar, they still differ in some respects. For example, Series objects have a useful .value_counts()
method that creates a table with the number of observations for every unique value. Methods and accessors that exist only on Series will cause a Python AttributeError
when called on a DataFrame. (Note that recent Pandas versions also provide a DataFrame.value_counts() method, so the exact behaviour depends on your Pandas version.)
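As an illustration of such Series-only functionality (with an invented column): the .str accessor for string methods exists on Series but not on DataFrames:

```python
import pandas as pd

df = pd.DataFrame({"Handedness": ["Right", "Right", "Left"]})

# value_counts() tabulates the unique values of a Series
print(df["Handedness"].value_counts())

df["Handedness"].str.upper()  # .str works on a Series...
try:
    df.str.upper()            # ...but not on the DataFrame itself
except AttributeError as err:
    print("AttributeError:", err)
```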
Another common error stems from the fact that many operations return a new DataFrame as output instead of changing the current one in place. For example, you might expect that:
yeatman_data.dropna()
print(yeatman_data.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab Age_zscore \
subjectID
subject_000 20 Male NaN 139.0 65.0 77.0 0.084835
subject_001 31 Male NaN 129.0 58.0 74.0 0.983025
subject_002 18 Female NaN 130.0 63.0 70.0 -0.078472
subject_003 28 Male Right NaN NaN NaN 0.738064
subject_004 29 Male NaN NaN NaN NaN 0.819718
Age_below_18
subjectID
subject_000 False
subject_001 False
subject_002 False
subject_003 False
subject_004 False
will remove the NaN
values from the DataFrame. However, it does not do so on the yeatman_data
DataFrame itself; you need to assign the result to a new variable if you want to keep it:
yeatman_without_nan = yeatman_data.dropna()
print(yeatman_without_nan.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab Age_zscore \
subjectID
subject_007 34 Male Right 115.0 66.0 51.0 1.227986
subject_009 21 Male Right 124.0 63.0 64.0 0.166489
subject_010 29 Female Right 130.0 60.0 73.0 0.819718
subject_011 31 Male Right 128.0 64.0 67.0 0.983025
subject_012 28 Female Left 129.0 68.0 64.0 0.738064
Age_below_18
subjectID
subject_007 False
subject_009 False
subject_010 False
subject_011 False
subject_012 False
or alternatively, explicitly specify that the existing yeatman_data
DataFrame should be modified in place:
yeatman_data.dropna(inplace=True)
These kinds of errors are especially dangerous, as you could unknowingly continue working with an unchanged DataFrame, leading to erroneous results later in your script. It therefore makes sense to at least periodically check the intermediate results in your calculations to spot errors early.
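One cheap way to perform such checks is an explicit assertion on an intermediate result; a small sketch with invented data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"IQ": [139.0, np.nan, 130.0]})

# dropna() returns a NEW DataFrame; df itself is unchanged
cleaned = df.dropna()

# A cheap sanity check: did the operation actually have an effect?
assert len(cleaned) < len(df), "dropna had no effect?"
print(len(df), "->", len(cleaned))
```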
Finally, indexing errors are also common. Don't be discouraged by them, as indexing can indeed be confusing in the beginning, especially given the different ways of doing it: indexing in NumPy, indexing by rows and columns, and indexing with .loc
or .iloc
.