5.2 Pandas#

Many kinds of real-world data are stored in a tabular format. This means two-dimensional tables structured into rows and columns, with each observation typically taking up a row and each column representing a single variable.

The Pandas library is a popular Python library for dealing with tabular data. In comparison to NumPy, Pandas limits us to two-dimensional tables, but in return we gain flexibility, e.g. the ability to have variables of different types.

import pandas as pd

DataFrames#

A DataFrame is the core data structure in the Pandas library. It is ideal for working with tabular data, making it easy to manipulate, filter, and analyse data sets. We can create and print a simple example as follows:

my_df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3'],
                      'C': ['C0', 'C1', 'C2', 'C3']})
print(my_df)
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3

We can also read data from other sources such as Excel or CSV (comma-separated values) files. CSV files are a good way of storing tabular data as they are plain text, meaning you can open and read them with any kind of text editor or programming language. Usually, CSV files have a header row containing the labels (column names).
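As a small sketch of this round trip (using a hypothetical, made-up table), a DataFrame can be written to and read back from a CSV file:

```python
import pandas as pd

# a small, made-up table for illustration
df = pd.DataFrame({'name': ['Ann', 'Ben'], 'age': [23, 31]})

# write to a CSV file; index=False omits the row index column
df.to_csv('example.csv', index=False)

# read it back in; the header row becomes the column labels
df_back = pd.read_csv('example.csv')
print(df_back)
```

Opening `example.csv` in a text editor shows the plain-text structure: a header line `name,age` followed by one comma-separated line per row.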

As an example, let us load some data that was collected as part of a study of life span changes in brain tissue properties [YWM14]. We can load this file directly from the internet by using the pd.read_csv() function. The only required argument is the data we want to load. In this case, we provide a URL as the data is stored on the internet; usually, you would provide a path pointing to a file on your computer. All other arguments are optional, but they can be useful for giving additional instructions on what to do with the data. Here, we use usecols=[1,2,3,4,5,6,7] to load only a subset of the columns, na_values="NaN" to specify that the string "NaN" should be treated as a missing value ("not a number"), and index_col=0 to use the first of these columns as our index:

yeatman_data = pd.read_csv("https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
                      usecols=[1,2,3,4,5,6,7],
                      na_values="NaN",
                      index_col=0)
print(yeatman_data.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID                                                      
subject_000   20    Male        NaN  139.0       65.0      77.0
subject_001   31    Male        NaN  129.0       58.0      74.0
subject_002   18  Female        NaN  130.0       63.0      70.0
subject_003   28    Male      Right    NaN        NaN       NaN
subject_004   29    Male        NaN    NaN        NaN       NaN

The variable yeatman_data is now a Pandas DataFrame containing our data, and we can use the .head() method to look at the first few rows. The leftmost column, subjectID, is the index column, while the other columns contain the data for different variables such as age, gender, or IQ. You can also see that the DataFrame is, in contrast to the previously introduced NumPy arrays, heterogeneously typed: it can contain variables of different types (e.g. strings or floats). Some values are already visibly missing; for example, subject_003 and subject_004 are missing values for the IQ-related columns.

Summarizing#

Pandas further contains useful methods to e.g. summarize the data. We can use .info() to get a closer look at our data:

print(yeatman_data.info())
<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, subject_000 to subject_076
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         77 non-null     int64  
 1   Gender      76 non-null     object 
 2   Handedness  66 non-null     object 
 3   IQ          63 non-null     float64
 4   IQ_Matrix   63 non-null     float64
 5   IQ_Vocab    63 non-null     float64
dtypes: float64(3), int64(1), object(2)
memory usage: 4.2+ KB
None

Most of this information should already make sense. One observation is that Gender and Handedness are stored as objects. This is because these columns contain a mixture of strings and NaNs, and NaN is a floating-point value, so Pandas falls back to the generic object dtype.
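We can sketch this behaviour with a small, made-up Series: mixing strings and NaNs forces the generic object dtype, because no single specialized dtype fits both.

```python
import pandas as pd

# a column mixing strings and missing values cannot use a
# single specialized dtype, so Pandas stores it as 'object'
s = pd.Series(['Right', float('nan'), 'Left'])
print(s.dtype)
```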

We can also get a first statistical summary of the numerical columns by using the .describe() method. NaN values are ignored in these calculations, but the count row tells you how many values were used for the calculations in each column.

print(yeatman_data.describe())
             Age          IQ  IQ_Matrix   IQ_Vocab
count  77.000000   63.000000  63.000000  63.000000
mean   18.961039  122.142857  60.539683  64.015873
std    12.246849   12.717599   7.448372   8.125015
min     6.000000   92.000000  41.000000  36.000000
25%     9.000000  114.000000  57.000000  60.000000
50%    14.000000  122.000000  61.000000  64.000000
75%    28.000000  130.000000  64.500000  70.000000
max    50.000000  151.000000  76.000000  80.000000

Indexing#

In previous sessions we already learned about indexing and slicing as ways of accessing individual elements of e.g. lists. Pandas DataFrames also support a variety of indexing and slicing operations. For example, we can use label-based indexing to select rows through the .loc attribute and indexing in square brackets:

print(yeatman_data.loc["subject_000"])
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

If we do not know the exact label of the subject but only its position (e.g. we want to access the first subject), we can use the .iloc attribute for that purpose:

print(yeatman_data.iloc[0])
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

This returns the same information, as instead of looking for subject_000 we are asking for the first row at position 0. You now might ask yourself how we can index a two-dimensional table with just one index. The answer is that it is just a shorthand form of the full expression:

print(yeatman_data.iloc[0, :])
Age              20
Gender         Male
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
IQ_Vocab       77.0
Name: subject_000, dtype: object

Remember from previous sections that : stands for “all values”. This means we can also apply slicing to extract a subset of columns:

print(yeatman_data.iloc[0, 2:5])
Handedness      NaN
IQ            139.0
IQ_Matrix      65.0
Name: subject_000, dtype: object

Similarly, we can also access a single column:

print(yeatman_data.iloc[:, 0])
subjectID
subject_000    20
subject_001    31
subject_002    18
subject_003    28
subject_004    29
               ..
subject_072    40
subject_073    50
subject_074    40
subject_075    17
subject_076    17
Name: Age, Length: 77, dtype: int64

However, while .loc and .iloc are powerful attributes, we can also simply address columns directly by their name:

print(yeatman_data["Age"])
subjectID
subject_000    20
subject_001    31
subject_002    18
subject_003    28
subject_004    29
               ..
subject_072    40
subject_073    50
subject_074    40
subject_075    17
subject_076    17
Name: Age, Length: 77, dtype: int64

If we assign this column to a new variable, the result is a Pandas Series, which is a one-dimensional sequence of values. Series are pretty similar to DataFrames (essentially, a DataFrame is just a collection of Series):

age = yeatman_data["Age"]
print(age['subject_072'])
print(age.iloc[72])
40
40

We can also create a new subset of a DataFrame containing only the variables Age and IQ. This can be done by indexing with a list of column names and assigning the resulting subset to a new variable:

yeatman_subset = yeatman_data[["Age", "IQ"]]
print(yeatman_subset.head())
             Age     IQ
subjectID              
subject_000   20  139.0
subject_001   31  129.0
subject_002   18  130.0
subject_003   28    NaN
subject_004   29    NaN

Computations#

Like NumPy arrays, Pandas DataFrames have many methods that allow for computations. However, as we only deal with tabular data, the dimensions are always the same: the columns are the variables and the rows are the observations. For example, we can calculate the means of all numeric variables in the DataFrame:

yeatman_means = yeatman_data.mean(numeric_only=True)
print(yeatman_means)
print(yeatman_means["Age"])
Age           18.961039
IQ           122.142857
IQ_Matrix     60.539683
IQ_Vocab      64.015873
dtype: float64
18.961038961038962

Since not all variables are numeric, we pass numeric_only=True as an argument to the mean method. We can also calculate the mean of an individual Series directly:

yeatman_data["IQ"].mean()
122.14285714285714

We can further perform arithmetic on DataFrames. For example, we can calculate a z-score for the age of each subject:

age_mean = yeatman_data["Age"].mean()
age_std = yeatman_data["Age"].std()
print((yeatman_data["Age"] - age_mean ) / age_std)
subjectID
subject_000    0.084835
subject_001    0.983025
subject_002   -0.078472
subject_003    0.738064
subject_004    0.819718
                 ...   
subject_072    1.717908
subject_073    2.534445
subject_074    1.717908
subject_075   -0.160126
subject_076   -0.160126
Name: Age, Length: 77, dtype: float64

It is often useful to save such a result as a new variable in our DataFrame. For example, we can create a new column called Age_zscore and assign our result to it:

yeatman_data["Age_zscore"] = (yeatman_data["Age"] - age_mean ) / age_std
print(yeatman_data.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_zscore
subjectID                                                                  
subject_000   20    Male        NaN  139.0       65.0      77.0    0.084835
subject_001   31    Male        NaN  129.0       58.0      74.0    0.983025
subject_002   18  Female        NaN  130.0       63.0      70.0   -0.078472
subject_003   28    Male      Right    NaN        NaN       NaN    0.738064
subject_004   29    Male        NaN    NaN        NaN       NaN    0.819718

Filtering#

Similar to logical indexing in NumPy, we can also filter our data set based on some properties. For example, let's assume we want to identify subjects below the age of 18 in our analysis. We can simply create a new Boolean variable in the DataFrame that codes for this condition:

yeatman_data["Age_below_18"] = yeatman_data["Age"] < 18
print(yeatman_data.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_zscore  \
subjectID                                                                     
subject_000   20    Male        NaN  139.0       65.0      77.0    0.084835   
subject_001   31    Male        NaN  129.0       58.0      74.0    0.983025   
subject_002   18  Female        NaN  130.0       63.0      70.0   -0.078472   
subject_003   28    Male      Right    NaN        NaN       NaN    0.738064   
subject_004   29    Male        NaN    NaN        NaN       NaN    0.819718   

             Age_below_18  
subjectID                  
subject_000         False  
subject_001         False  
subject_002         False  
subject_003         False  
subject_004         False  

As you can see, we have now extended our original DataFrame by another column that tells us whether the corresponding subject is younger than 18.
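Such a Boolean condition can also be used directly to subset the rows of a DataFrame. A minimal sketch with a small, made-up table mirroring the Age column:

```python
import pandas as pd

# a small, hypothetical table with subject IDs as the index
df = pd.DataFrame({'Age': [20, 31, 9, 12]},
                  index=['s0', 's1', 's2', 's3'])

# indexing with a Boolean Series keeps only the rows
# where the condition evaluates to True
minors = df[df['Age'] < 18]
print(minors)
```

Only the rows for `s2` and `s3` (ages 9 and 12) remain in `minors`.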

MultiIndex#

Sometimes we want to select groups made up of combinations of variables. For example, we might want to analyze the data based on both gender and age. One way of doing this is to change the index of the DataFrame to be made up of more than one column. This is called a MultiIndex DataFrame. We can do so by applying the set_index() method of a DataFrame to create a new kind of index:

multi_index = yeatman_data.set_index(["Gender", "Age_below_18"])
print(multi_index.head())
                     Age Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_zscore
Gender Age_below_18                                                        
Male   False          20        NaN  139.0       65.0      77.0    0.084835
       False          31        NaN  129.0       58.0      74.0    0.983025
Female False          18        NaN  130.0       63.0      70.0   -0.078472
Male   False          28      Right    NaN        NaN       NaN    0.738064
       False          29        NaN    NaN        NaN       NaN    0.819718

You can now see that we have two indices. This means we can use the .loc attribute to select rows based on both indices:

male_below_18 = multi_index.loc["Male", True]
print(male_below_18.describe())
             Age          IQ  IQ_Matrix   IQ_Vocab  Age_zscore
count  24.000000   21.000000  21.000000  21.000000   24.000000
mean   10.166667  121.952381  60.428571  63.857143   -0.718093
std     2.823299   14.589298   8.930685   7.856753    0.230533
min     6.000000   92.000000  41.000000  49.000000   -1.058316
25%     8.750000  112.000000  56.000000  60.000000   -0.833769
50%     9.000000  121.000000  60.000000  64.000000   -0.813355
75%    12.000000  132.000000  67.000000  70.000000   -0.568394
max    17.000000  146.000000  76.000000  79.000000   -0.160126
/tmp/ipykernel_3407/2795514678.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  male_below_18 = multi_index.loc["Male", True]
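The PerformanceWarning printed above can usually be avoided by sorting the MultiIndex first with .sort_index(). A minimal sketch on a small, made-up MultiIndex DataFrame:

```python
import pandas as pd

# a small, hypothetical DataFrame with a two-level index
df = pd.DataFrame({'x': [1, 2, 3, 4]},
                  index=pd.MultiIndex.from_tuples(
                      [('Male', True), ('Female', False),
                       ('Male', False), ('Female', True)],
                      names=['Gender', 'Age_below_18']))

# sorting the index enables fast, warning-free .loc lookups
df_sorted = df.sort_index()
print(df_sorted.loc[('Male', True)])
```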

This might already seem useful, but it can become quite cumbersome if you want to repeat it for many kinds of combinations. Because grouping data into subgroups is such a common pattern in data analysis, Pandas offers a built-in way of doing so, which we will explore in the following subsection.

Split-Apply-Combine#

A common problem we are faced with in data analysis is the following: we (1) want to take a data set and split it into subsets, (2) independently apply some operation to each subset, and (3) combine the results of all independent operations into a new data set. This pattern is called split-apply-combine.

For example, let’s start with splitting the data by the Gender column:

gender_groups = yeatman_data.groupby("Gender")

The new gender_groups variable is a DataFrameGroupBy object, which is pretty similar to a normal DataFrame, with the additional feature of having distinct groups within it. This means we can perform many operations just as if we were working with a normal DataFrame, the only difference being that each operation is applied independently to each subset.

For example, we can calculate the mean for each group:

print(gender_groups.mean(numeric_only=True))
              Age          IQ  IQ_Matrix   IQ_Vocab  Age_zscore  Age_below_18
Gender                                                                       
Female  18.351351  120.612903  59.419355  63.516129   -0.049783      0.567568
Male    18.743590  123.625000  61.625000  64.500000   -0.017756      0.615385

The output of this operation is a DataFrame that contains the summary, with the original DataFrame's Gender variable as the index. This means we can apply standard indexing operations on it as well, e.g. to get the mean age of female subjects:

print(gender_groups.mean(numeric_only=True).loc["Female", "Age"])
18.35135135135135

We can further group by multiple indices:

gender_age_groups = yeatman_data.groupby(["Gender", "Age_below_18"])
print(gender_age_groups.mean(numeric_only=True))
                           Age          IQ  IQ_Matrix   IQ_Vocab  Age_zscore
Gender Age_below_18                                                         
Female False         28.625000  123.636364  61.000000  65.454545    0.789098
       True          10.523810  118.950000  58.550000  62.450000   -0.688931
Male   False         32.466667  126.818182  63.909091  65.727273    1.102784
       True          10.166667  121.952381  60.428571  63.857143   -0.718093

Joining Tables#

Another useful feature of Pandas is its ability to join data. For example, let's assume we have three DataFrames with the same columns but different indices. This could happen if you measured the same variables for multiple subjects over three different measurement days. The index would then be the individual subject, and the three DataFrames would contain the data you acquired on e.g. Monday, Tuesday, and Wednesday:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

Here it might be intuitive to just concatenate them into one big DataFrame:

combined_df = pd.concat([df1, df2, df3])
print(combined_df)
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

In this case, we see that the concatenation is quite straightforward and successful. But what if the DataFrames do not have an identical structure? Let's assume we have df4, which has the index values \(2\) and \(3\) as well as the columns B and D in common with df1, but which also has the additional indices \(6\) and \(7\) as well as a new column F:

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D4', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])

combined_df = pd.concat([df1, df4])
print(combined_df)
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
2  NaN  B2  NaN  D2   F2
3  NaN  B3  NaN  D4   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

The results look a bit more interesting. Here, Pandas tried to preserve as much information as possible, meaning the new DataFrame contains all columns/variables present in the two original DataFrames. Wherever possible, Pandas merges matching columns into one; this is true for columns B and D, as they exist in both original DataFrames. For all other columns, Pandas preserves the input values and inserts NaNs for the missing values.

There are also other, more complicated scenarios that we will not cover here. For example, you might want to concatenate along the second axis (columns) instead of the first one (rows). Don't be afraid to try things out if you ever need something more specialized. Getting used to working with data sets takes time, but no matter your specific goal, it will more likely than not be achievable with just a few lines of code.
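As a brief sketch of concatenating along the second axis (with small, made-up frames), passing axis=1 stacks DataFrames side by side, aligning rows on the index:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'B': ['B0', 'B1']}, index=[0, 1])

# axis=1 concatenates horizontally instead of vertically
wide = pd.concat([left, right], axis=1)
print(wide)
```

Rows that exist in only one of the inputs would, as before, be filled with NaNs in the columns coming from the other input.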

Errors#

Before closing this section, I would like to emphasize a few error patterns that are specific to Pandas and that you will most likely encounter at some point in your own projects.

One common pattern of errors comes from confusing Series and DataFrame objects. While we previously learned that they are indeed pretty similar, they still differ in some respects. For example, Series objects have a useful .value_counts() method that creates a table with the number of observations for every unique value in the Series. DataFrames historically did not implement this method (recent Pandas versions add a variant that counts unique rows), and calling a method an object does not implement raises a Python AttributeError.
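A minimal sketch of .value_counts() on a Series (with made-up handedness values):

```python
import pandas as pd

handedness = pd.Series(['Right', 'Right', 'Left', 'Right'])

# count how often each unique value occurs in the Series
counts = handedness.value_counts()
print(counts)
```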

Another common source of errors is the fact that many operations return a new DataFrame as output instead of changing the current one in place. For example, you might expect that:

yeatman_data.dropna()
print(yeatman_data.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_zscore  \
subjectID                                                                     
subject_000   20    Male        NaN  139.0       65.0      77.0    0.084835   
subject_001   31    Male        NaN  129.0       58.0      74.0    0.983025   
subject_002   18  Female        NaN  130.0       63.0      70.0   -0.078472   
subject_003   28    Male      Right    NaN        NaN       NaN    0.738064   
subject_004   29    Male        NaN    NaN        NaN       NaN    0.819718   

             Age_below_18  
subjectID                  
subject_000         False  
subject_001         False  
subject_002         False  
subject_003         False  
subject_004         False  

will remove the rows containing NaN values from the DataFrame. However, it does not modify the yeatman_data DataFrame itself; you need to assign the result to a new variable if you want to keep it:

yeatman_without_nan = yeatman_data.dropna()
print(yeatman_without_nan.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab  Age_zscore  \
subjectID                                                                     
subject_007   34    Male      Right  115.0       66.0      51.0    1.227986   
subject_009   21    Male      Right  124.0       63.0      64.0    0.166489   
subject_010   29  Female      Right  130.0       60.0      73.0    0.819718   
subject_011   31    Male      Right  128.0       64.0      67.0    0.983025   
subject_012   28  Female       Left  129.0       68.0      64.0    0.738064   

             Age_below_18  
subjectID                  
subject_007         False  
subject_009         False  
subject_010         False  
subject_011         False  
subject_012         False  

or, alternatively, explicitly specify that the existing yeatman_data DataFrame should be modified in place:

yeatman_data.dropna(inplace=True)

These kinds of errors are especially dangerous, as you could unknowingly continue working with an unchanged DataFrame, leading to erroneous results later in your script. It therefore makes sense to periodically check the intermediate results of your calculations to spot errors early.

Finally, indexing errors are also common. Don't be discouraged by them, as indexing can indeed be confusing in the beginning, especially given the different conventions, such as indexing in NumPy, indexing by rows and columns, and indexing with .loc or .iloc.