When doing data analysis **there is no better help than pandas library**. Actually pandas is one of the reasons python became extremely popular in the data science field. It leverages one of the pillars of the field, numpy (a library for working with matrices), adding not only indexes and columns, but a wide functionality too. You can almost do magic with your data with pandas!

One of the most **common uses with pandas is grouping data**. You can make a group with the function *groupby()* and then apply some common action to that group, like *mean()*, *count()*, *median()*, etc.

The function *groupby()* sounds like SQL’s GROUP BY, and while **it’s similar to its SQL cosin, it comes with extended powers**. Let’s see some basic use before showing the tricks!

Let’s suppose **we have a dataset with the results of an exam**. We have 6 students that spent almost 2 hours (120 minutes) solving the problems from the exam, that took part in 2 different rooms (labeled 1 and 2).

1 2 3 4 5 6 7 8 |
import pandas as pd df = pd.DataFrame({ 'name': ['John', 'Alex', 'Mark', 'Anne', 'George', 'Claire'], 'room': [1, 2, 1, 2, 1, 2], 'time': [100, 89, 109, 120, 94, 112], 'result': ['A', 'B', 'A', 'C', 'B', 'C'] }) df |

1 2 3 4 5 6 7 |
name result room time 0 John A 1 100 1 Alex B 2 89 2 Mark A 1 109 3 Anne C 2 120 4 George B 1 94 5 Claire C 2 112 |

**INTRODUCTION: basic grouping**

The **most basic way to do grouping** is by a column (or ‘feature’, in data analysis’ slang). In this case we age going to group by *room*, then choose only the *time* feature, and get the mean of time spent in each room.

1 |
df.groupby('room').time.mean() |

1 2 3 4 |
room 1 101 2 107 Name: time, dtype: int64 |

Notice that **the functions that are used with groups can also be used without grouping**, as it’s showed in the next case. Here we are also showing here **how to use square brackets** to choose only some columns, in this case result and time, so later further operations are done only on them.

1 |
df[['result', 'time']].mean() |

1 2 |
time 104.0 dtype: float64 |

In the following case we first chose 2 columns, result and time, and then we group by result, looking for the maximum values.

1 |
df[['result', 'time']].groupby('result').max() |

1 2 3 4 5 |
time result A 109 B 94 C 120 |

Given these examples, we can get an idea of basic use… but let’s see now **my 3 favorite tricks when grouping**.

**TRICK 1: list grouping**

You can do **grouping with more than one column**, and the result will be a multi-index dataframe, nice!

1 |
df[['result', 'room', 'time']].groupby(['room', 'result']).max() |

1 2 3 4 5 6 |
time room result 1 A 109 1 B 94 2 B 89 2 C 120 |

Ok, ok, I hear you ‘this can be done with SQL too’. That’s right, but later you can use the multi-index for further exploration.

**TRICK 2: grouping by function**

You can **pass a function as parameter to pandas’ groupby()** to create groups. The function will get an index as parameter, so you can use pandas’ *loc* to locate the data. For instance, let’s suppose we want to group by the number of ‘e’ letter that each student name have.

1 2 |
count_e = lambda index: df.loc[index]['name'].count('e') df.groupby(count_e).mean() |

1 2 3 4 |
room time 0 1.0 104.5 1 2.0 107.0 2 1.0 94.0 |

So people with zero ‘e’s in his name spent 104.5 minutes as mean, while people with 2 ‘e’s finished the exam in just 94 minutes.

Isn’t it amazing? Of course this is a stupid example, but you can do things like, for instance, group ages by decades (like I did in a notebook on kaggle).

**TRICK 3: Group and rank**

pandas has** a function called rank() that gets the order/rank of a column**. For example it can sort

*time*column and show a

**1**for the quickest student, then

**2**for the second one, etc.

1 2 |
df['global_order'] = df['time'].rank() df |

1 2 3 4 5 6 7 |
name result room time global_order 0 John A 1 100 3.0 1 Alex B 2 89 1.0 2 Mark A 1 109 4.0 3 Anne C 2 120 6.0 4 George B 1 94 2.0 5 Claire C 2 112 5.0 |

But **how could we get the rank per room?** We want to know which student was the quickest for room 1, and which one for room 2…

1 2 |
df['room_order'] = df[['room', 'time']].groupby('room').rank() df |

1 2 3 4 5 6 7 |
name result room time global_order room_order 0 John A 1 100 3.0 2.0 1 Alex B 2 89 1.0 1.0 2 Mark A 1 109 4.0 3.0 3 Anne C 2 120 6.0 3.0 4 George B 1 94 2.0 1.0 5 Claire C 2 112 5.0 2.0 |

So Alex was the number 1 globally and also the quickest in room 1. But George was the number 1 in room 2. Isn’t that magic?

Funny enough, rank() return the order as floats, but you can change it’s type with .astype(int) later.

**TRICK 4: Group and process with agg() magic**

**With the agg() function you can describe several grouped processing in a compressed form**. Let’s see the example to understand it better:

1 2 3 4 5 |
aggregations = { 'time': ['min', 'max', lambda x: max(x) - min(x)], 'global_order': ['mean'] } df.groupby('room').agg(aggregations) |

1 2 3 4 5 |
time global_order min max <lambda> mean room 1 94 109 15 3.0 2 89 120 31 4.0 |

Using a dictionary we have defined first which columns we will work on. Then we define the operations we want to perform on this column; you can even write a lambda function there!

**I hope you liked these tricks!**