My 3 favorite grouping tricks with pandas

When doing data analysis, there is no better help than the pandas library. In fact, pandas is one of the reasons Python became extremely popular in the data science field. It builds on one of the pillars of the field, NumPy (a library for working with arrays and matrices), adding not only indexes and columns but a wide range of functionality too. You can almost do magic with your data with pandas!

One of the most common uses of pandas is grouping data. You can make groups with the groupby() method and then apply some common aggregation to each group, like mean(), count(), median(), etc.

The groupby() method sounds like SQL’s GROUP BY, and while it’s similar to its SQL cousin, it comes with extended powers. Let’s see some basic use before showing the tricks!

Let’s suppose we have a dataset with the results of an exam. We have 6 students who spent almost 2 hours (120 minutes) solving the problems in the exam, which took place in 2 different rooms (labeled 1 and 2).

Now let’s start to do basic grouping.
The most basic way to do grouping is by a column (or ‘feature’, in data analysis slang). In case [4] we group by room, then choose only the time feature, and get the mean of the time spent in each room.

Notice that the functions that are used with groups can also be used without grouping, as shown in case [5]. We are also showing here how to use square brackets to choose only some columns, in this case result and time, so that further operations are done only on them.

In case [6] we first choose 2 columns, result and time, and then we group by result, looking for the maximum values.
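The notebook cells referenced above are not reproduced here, so what follows is only a sketch of those three operations. The dataset is invented for illustration (names, times and grades are assumptions, not the original exam data), and the variable names are mine.

```python
import pandas as pd

# Hypothetical exam data: every value here is made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Alex", "Maria", "Steve", "George", "Tom", "Irene"],
        "room": [1, 1, 1, 2, 2, 2],
        "time": [85, 110, 100, 90, 115, 105],  # minutes spent on the exam
        "result": [9, 6, 9, 8, 6, 8],          # assumed numeric grade
    }
)

# Group by room, keep only the time column, and average it per room.
mean_time_per_room = df.groupby("room")["time"].mean()

# The same aggregation works without grouping, on a subset of columns.
overall_mean = df[["result", "time"]].mean()

# Select two columns first, then group by one of them and take the maximum.
max_time_per_result = df[["result", "time"]].groupby("result").max()

print(mean_time_per_room)
print(overall_mean)
print(max_time_per_result)
```

Selecting the columns before aggregating also keeps the text column (name) out of numeric operations such as mean().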

Given these examples, we can get an idea of basic use… but let’s see now my 3 favorite tricks when grouping.

TRICK 1: list grouping

You can group by more than one column, and the result will be a multi-index dataframe, nice!
OK, OK, I hear you: ‘this can be done with SQL too’. That’s right, but later you can use the multi-index for further exploration.
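As a rough sketch of list grouping and of what “further exploration” with the multi-index can look like, here is an example on invented data (the values and variable names are assumptions, not the original dataset):

```python
import pandas as pd

# Hypothetical exam data: every value here is made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Alex", "Maria", "Steve", "George", "Tom", "Irene"],
        "room": [1, 1, 1, 2, 2, 2],
        "time": [85, 110, 100, 90, 115, 105],
        "result": [9, 6, 9, 8, 6, 8],
    }
)

# Passing a list of columns produces a (room, result) MultiIndex.
by_room_and_result = df.groupby(["room", "result"])[["time"]].mean()

# Use the first index level: all rows for room 2, indexed by result.
room_2 = by_room_and_result.loc[2]

# Use the second index level: grade 6 across all rooms.
grade_6 = by_room_and_result.xs(6, level="result")

# Address a single cell with a tuple of index labels.
one_cell = by_room_and_result.loc[(1, 9), "time"]

print(by_room_and_result)
print(room_2)
print(grade_6)
print(one_cell)
```

The double brackets around "time" keep the result a dataframe; with single brackets you would get a Series carrying the same multi-index.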

TRICK 2: grouping by function

You can pass a function as a parameter to pandas’ groupby() to create groups. The function receives each index label as its parameter, so you can use pandas’ loc to locate the data. For instance, let’s suppose we want to group by the number of ‘e’ letters in each student’s name.
So people with zero ‘e’s in their names spent 104.5 minutes on average, while people with 2 ‘e’s finished the exam in just 94 minutes.

Isn’t it amazing? Of course this is a stupid example, but you can do things like, for instance, group ages by decades (like I did in a notebook on kaggle).
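A minimal sketch of grouping by function, assuming a default integer index and a name column. The data is invented, so the averages differ from the 104.5 and 94 minutes quoted above, and count_e is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical exam data: every value here is made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Alex", "Maria", "Steve", "George", "Tom", "Irene"],
        "room": [1, 1, 1, 2, 2, 2],
        "time": [85, 110, 100, 90, 115, 105],
    }
)


def count_e(idx):
    """Receive one index label and return the group key for that row."""
    return df.loc[idx, "name"].count("e")


# groupby() calls count_e once per index label and groups rows by its result.
mean_time_by_e = df.groupby(count_e)["time"].mean()

# The same idea for decades, on a made-up series of ages.
ages = pd.Series([23, 27, 31, 38, 45])
people_per_decade = ages.groupby(lambda idx: ages.loc[idx] // 10 * 10).count()

print(mean_time_by_e)
print(people_per_decade)
```

For the decades case, passing the computed keys directly, as in ages.groupby(ages // 10 * 10), gives the same groups without a function; the function form pays off when the key depends on something harder to vectorize.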

TRICK 3: Group and rank
pandas has a function called rank() that gets the order/rank of the values in a column. For example, it can rank the time column and show a 1 for the quickest student, then 2 for the second one, etc.

But how could we get the rank per room? We want to know which student was the quickest for room 1, and which one for room 2…
So Alex was the number 1 globally and also the quickest in room 1. But George was the number 1 in room 2. Isn’t that magic?

Funnily enough, rank() returns the order as floats, but you can change its type with .astype(int) later.
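Here is a sketch of the global rank next to the per-room rank. The data is invented (I only kept Alex as the quickest overall and George as the quickest in room 2, as in the text), and the column names global_rank and room_rank are assumptions.

```python
import pandas as pd

# Hypothetical exam data: every value here is made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Alex", "Maria", "Steve", "George", "Tom", "Irene"],
        "room": [1, 1, 1, 2, 2, 2],
        "time": [85, 110, 100, 90, 115, 105],
    }
)

ranked = df.assign(
    # Rank over the whole column: 1.0 is the shortest time.
    global_rank=df["time"].rank(),
    # Rank inside each room: the ranking restarts at 1.0 for every group.
    room_rank=df.groupby("room")["time"].rank(),
)

# rank() returns floats; cast once no missing values are involved.
ranked["room_rank"] = ranked["room_rank"].astype(int)

print(ranked.sort_values(["room", "room_rank"]))
```

The grouped rank comes back aligned with the original rows, which is why it can be assigned straight into a new column.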


I hope you liked these tricks! And I hope I’ll find a better way to share code with context coloring but copy-paste-able… ideas?

2017 focus: ML

At the end of 2016 I was still amazed by the result of the AlphaGo vs. Lee Sedol match in March (the first time a machine beat a top professional Go player), and at the same time I was looking for a subject to focus on in 2017, so I chose Machine Learning. During my university years I tried out some related tools (genetic algorithms, basic neural networks, etc.), but I hadn’t looked at the field again for 10 years.

The first stop was the famous Machine Learning course by Andrew Ng on Coursera, as everybody points you there. Although it explains a lot of complex stuff in an intuitive way, you soon get tired of so much maths and of using Octave/Matlab when you should be using Python.

After one year learning about Machine Learning, I think I have quite a list of recommendations on how to start exploring the field. Disclaimer: this may be related to my preferred way of learning, that is, with text instead of videos. This could be a good way to start if you have no previous experience:

  • Do not watch Coursera’s ML course; just read the notes somebody took on it instead.
  • Learn Python, and especially the libraries NumPy, pandas and scikit-learn. Also learn how to run a Jupyter notebook. The best way to install them all is via the Anaconda distribution.
  • Buy a copy (paper or ebook) of the book “Python Machine Learning” by Sebastian Raschka.
  • Join Kaggle and have a look at the Titanic tutorials and its new Learn section. They also have a video course on Udacity in case you like watching videos.
  • Don’t be in a rush to learn deep-learning (aka neural networks), because you’ll first have to learn about classic ML models, but also a lot of related processes: data cleaning, feature engineering and data visualization.

My first real-world input was in May, when I attended the PyData conference in Barcelona, which was a turning point: I found lots of ideas to apply, but above all I felt the industry’s pulse.

During the summer I challenged myself to apply it at work and to give a conference talk. The subject was customer segmentation using unsupervised algorithms, with a dataset I prepared myself from our company’s data. In the end the talk became a 2-hour workshop.

It was the first time I gave a presentation about Machine Learning in English. Although the audience was satisfied with the workshop and some people had interesting conversations afterwards, I felt that I should have worked harder while preparing it.

Now that 2017 has finished and 2018 has started, I’ll continue focusing on ML, but with a more practical approach. In my day job we have developed a recommendation system that will evolve with several ML models working together, and after work I’ll try to play more with Kaggle, taking part in some competitions.

In 2018, I’ll try deep learning too, with Andrew Ng’s course using TensorFlow, a creative apps course and some video tutorials on PyTorch. I’ll try to improve my engineering approach to ML, as things like version control, testing and deployment are rarely seen in a world with more university people than industry ones. Finally, I plan to complete a nice course on data visualization with D3.js.

I hope all these links help somebody too!

Teaching students about real industry work

Some months ago I had the chance to teach University students about how we develop in the real world, as part of a “companies’ seminars” event.

There is an ongoing discussion in our industry: do you need a degree in Computer Science to become a successful developer? Some people say that the subjects taught at university become outdated quickly, basically due to the lightning speed of technology, and that nowadays taking a course on JavaScript is enough to learn to program. Other people say that you must spend 4 to 5 years at university.

I’m on the side of formal university education. Students need foundations to fully understand how things really work. But it’s true that they also need to know how the industry really works. Virtualization, code versioning, code quality (“clean” code), tradeoffs, etc. are subjects that unfortunately are not taught at university.

During the seminar I taught students about general subjects like the tradeoffs we have to make in our company, but also about the latest trending technologies like docker. However, their favorite subject was my introduction to clean code, which opened their eyes. Let’s hope this will inspire them.

Here are the links to the slides I used:
Professional development
Clean Code
OOP and SOLID principles
Introduction to docker
Seminar conclusion

The best advice I gave them: Find a job in a company where you can learn.

PyData conference in Barcelona

I was lucky to attend the PyData conference in Barcelona this year, hosted at ESADE.

Although I’m basically a PHP developer, I’ve been playing with data science tools lately using Python’s stack. I have no real experience in data science, apart from coding a couple of predictions using linear regression, but I was curious.

With a novice spirit, I set some clear objectives: find out whether data science is like teenage sex or whether companies are really using it; get a feeling for the community; and try to learn as much as I could.

First of all, the community is vibrant, actually far more than the PHP one in Barcelona. The organization was smooth too, and all the people I talked with were really nice. Everybody had things to learn, so they came with an open mind.

It was funny to see that I was on the “data owners” side, while most people were on the “looking for datasets” side. This led to several conversations in which I was asked how we use the data in our company.

Regarding the talks, there were quite a lot about tools. The Python science stack has a wide range of evolving tools, and this somehow reminds me of PHP circa 2008, when basic tools (PHPUnit, for example) were becoming popular. It’s good to polish your tools and master them, so I welcomed those talks.

There were also some talks on theory, which surprised me, as I had never seen university professors at software conferences. Mathematical and computer science concepts were explained, for instance on optimization. This contrasts with the common industry solution: if some code is slow, just use more machine instances, which is far cheaper than spending time trying to optimize things (at least 99% of the time). I don’t mean I didn’t like those talks (actually one was really mind-blowing), but I would love to see more professors at other conferences, getting a real feel for some industry practices.

I was looking for talks showing “real fire”, real examples from companies. We heard about hotels trying to predict cancellations (in order to overbook); we saw IBM’s Watson analyzing the personality of customers; we heard about predicting which employees will leave a big company, ideas for reacting when bad weather is forecast, and the best weekday to publish job offers and set up interviews; and some other extremely interesting stuff… but I do want more!

My overall feeling is that I learned a lot. Python is not really used as a language but more as an interface for some amazing libraries. It looks like I have no option but to start exploring the data in ulabox!

I’d like to thank ulabox (my employer), which paid for the ticket, and all the people in the organization, who did a great job!

I published some of my (unedited) notes too.

Remote working effectively

Some months ago my coworkers asked me to share my experience of remote working. We work in a normal office, but I had worked from home for 5 years, from Barcelona, Seoul and Mexico DF. So I prepared a simple presentation about the main issues to consider if you want to try working from home.

After the presentation an interesting discussion followed. Some of my colleagues had worked as freelancers in the past, and had arrived at similar ways of managing their working time. This is the key point when working from home: controlling for yourself how, and for how long, you are productive.

Virtual disk design kata

In my current job (ulabox) we hold an internal training session every Thursday, usually prepared by one member of our department. Some months ago I prepared a code kata on design patterns, with 5 steps with instructions. The idea was to push the team to debate different approaches to a common problem, and to show them some classical design patterns, as a way to polish our weapons. The result was good, but the discussion only really happened at the end, when I showed them those patterns.

Some weeks later I heard about a code conference in Barcelona, organized by the Barcelona Software Craftsmanship group, so I took the chance to polish my kata and proposed running it at the event. It was rejected for the main event.

Later I heard about Monday’s katas: this group organizes a code kata every Monday with up to 20 developers. I offered my kata and our office to host it, and on December 12th we did it! All participants agreed: the kata runs smoothly and leads you to think about the subjects it later reveals.

I published my kata on github. Have fun!

Do you test your tests?

The first time I read about serious testing was in The Pragmatic Programmer. The book explains the usual (boring) benefits of testing, but one twist caught my eye: also test your tests. Testing is a net that helps you change the code without breaking the logic and, like a real-life net, you should verify that it works as expected. Tests should be tightly related to the code.

When is a test good? Finding the differences between a good test and a bad one is not obvious, however. Looking for gaps or anti-patterns in our tests is a good way to improve them.

Thanks to PHPUnit and Xdebug, the PHP community started to care about testing years ago. Since then, the easiest way to show the quality of a test suite has been code coverage, that is, the percentage of the code that the tests exercise. That worked until programmers started to focus on 100% coverage, creating artificial tests that don’t stress the logic correctly but instead achieve a fake 100% line coverage. If a line is executed once, even by the test of a different unit that merely uses that class, the line “is tested”.

Are you really testing each class? Following the logic? Even if you use proper unit tests, you may be missing things.

Let’s start with a stupid example, a function that does an “AND”, and a test that gets 100% coverage:

class MyOperator
{
    public function doAnd($param1, $param2)
    {
        if ($param1) {
            if ($param2) {
                return true;
            }
        }

        return false;
    }
}

class MyOperatorTest extends \PHPUnit_Framework_TestCase
{
    public function testDoAnd()
    {
        $operator = new MyOperator();
        $this->assertEquals(true,  $operator->doAnd(true, true));
        $this->assertEquals(false, $operator->doAnd(true, false));
    }
}

The test only exercises 2 cases! An “AND” actually has 4 possible cases, so the 2 missing cases (false-true, false-false) were totally ignored, even though you get 100% line coverage.

This was a basic example to show the difference between line coverage and path coverage (in this case, 3 possible paths through the code, of which the test follows only 2). The good news is that Derick Rethans is working on it. I wonder how many programmers will be surprised to see that their code’s path coverage is low.
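To make the gap concrete, the four input combinations of the “AND” example can be enumerated and compared with the ones the test uses. This is a Python sketch rather than PHP, and do_and is only an illustrative translation of doAnd.

```python
from itertools import product


def do_and(param1, param2):
    """Python translation of the doAnd() example."""
    if param1:
        if param2:
            return True
    return False


# The two calls made by the test in the post.
tested_in_post = {(True, True), (True, False)}

# Every possible pair of boolean inputs.
all_cases = set(product([True, False], repeat=2))

# Input combinations that the test never tries.
untested = sorted(all_cases - tested_in_post)

# An exhaustive check is cheap here: compare with Python's own "and".
all_match = all(do_and(a, b) == (a and b) for a, b in all_cases)

print(untested)
print(all_match)
```

Both untested inputs have a false first parameter, so they share the one path through the code that the test never takes.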

Another way to test your tests is to change the source code and see if the test fails (it should!) or not. This is called Mutation Testing, and helps to detect when a test is not working perfectly. For instance:

    public function biggerThan5($number)
    {
        if ($number > 5) {
            return true;
        }

        return false;
    }
/* ... */
    public function testBiggerThan5()
    {
        $operator = new MyOperator();
        $this->assertEquals(true, $operator->biggerThan5(8));
        $this->assertEquals(false, $operator->biggerThan5(3));
    }

This test looks complete, but there is no test for the boundary case, biggerThan5(5). So the test is not really accurate.

In the PHP ecosystem there are only 2 Mutation Testing tools available: Humbug and Mutatesting. The second one seems abandoned, even though its author is also the creator of the excellent PHP-metrics.

So the only real option is Humbug, developed by the author of Mockery. Unfortunately it only works with PHPUnit for the moment. It basically finds places where the code can be easily changed, like a true for a false, or a number N for N+1, and runs the tests to see whether that mutation is killed (that is, whether a test fails). For instance, in the previous example it changes 5 to 6, and the tests still pass, so the mutation was not killed.
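The idea can be sketched by hand, without any tool. This is a Python illustration of the concept and not how Humbug is implemented; the mutants and the two small “suites” are hypothetical.

```python
def bigger_than_5(number):
    """Original logic from the example."""
    return number > 5


def mutant_constant(number):
    # Mutation: the constant 5 was replaced by 6.
    return number > 6


def mutant_operator(number):
    # Mutation: the operator > was replaced by >=.
    return number >= 5


def weak_suite(func):
    """Mirror of the test in the post: checks only 8 and 3."""
    return func(8) is True and func(3) is False


def strict_suite(func):
    """Same checks plus the values on both sides of the boundary."""
    return weak_suite(func) and func(5) is False and func(6) is True


mutants = [mutant_constant, mutant_operator]

# A mutant "survives" when the suite still passes against it.
survivors_weak = [m.__name__ for m in mutants if weak_suite(m)]
survivors_strict = [m.__name__ for m in mutants if strict_suite(m)]

print(survivors_weak)
print(survivors_strict)
```

Note that checking 5 alone kills the operator mutant but not the 5-to-6 one, since both versions return false for 5; the check on 6 is what kills the constant mutation.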

I just hope these tools become more popular, in order to improve the quality of our industry. And let’s hope Humbug will soon work with PHPspec too, as many companies are moving from PHPUnit to Behat-PHPspec.

The code for this post can be found in its github repo.

The most required PHP packages

Composer has been the most used package dependency manager in the PHP ecosystem for a few years. It manages complex dependencies with an easy syntax, and you can easily search among all available packages in its public repository, where you can publish your own too.

It was interesting to find a list of the most required packages. Some of them are part of a framework, while others are lonely gems. If you value professionalism, you should have some experience with, or at least know about, the most popular packages. Here I’m having a look at some of them.

The absolute #1 is PHPUnit. If you are programming in PHP and have never used it, go and get a position as a consultant, please. The PHP unit testing framework has been a popular choice for years (I even posted about it in 2008). A must-know.

The list contains other testing-related packages too. #4 Mockery is a widely used object mocking package. Personally, I use Prophecy for mocking, which comes with #21 phpspec, a test framework that can complement or replace PHPUnit. The list also includes #44 behat, a tool for scenario-oriented BDD that is rising in popularity.

Regarding complete frameworks, the list includes lots of Symfony framework packages: #3 symfony/framework-bundle, #5 symfony/symfony, #8 symfony/console, #10 symfony/yaml, etc. Some of them are core to the Symfony framework, while others are so independent that they are widely used even without the framework. A clear example is the yaml package, which (obviously) processes yaml files.

Of course not only Symfony appears in the list, but also other popular frameworks like: Laravel (#2 illuminate/support, #47 laravel/framework), Zend (#13 zendframework/zendframework), Yii (#17 yiisoft/yii2) and Silex (#22 silex/silex).

Apart from complete frameworks, the list comes with some other must-know packages. The DB-related Doctrine is the first to appear (#6 doctrine/orm, #31 doctrine/common, #38 doctrine/dbal). There are some code-quality related packages, like #7 PHP Coveralls, #11 PHP_CodeSniffer and #33 PHP Mess Detector. Finally some usual suspects, like #15 Guzzle (HTTP client), #19 Monolog (the standard way of logging, already included everywhere).

Finally, some packages in the list were a surprise to me, as I had no idea about them. For instance, #9 composer/installers (a multi-framework installer), #16 silverstripe/framework (a CMS) or #29 nette (another framework).

It’s definitely a great collection of packages, and if you combine them with the ones from The League of Extraordinary Packages, you will get great material to learn from: 1st, know about them; 2nd, try to use them; and 3rd, read their source code.

General programming principles

This is just a list of programming principles that I’m making for myself. These should be instinctive to any developer.

How I looked for a new tech job

Back in Barcelona after 3 years working remotely, I decided to look for a new in-office job, but in an uncommon way.

First I had a look at job websites, but only to get an idea of what technologies are popular in Barcelona. Symfony was the most remarkable one. But I didn’t apply to any of the job offers I saw there.

I don’t want to work in a company just because they opened a job offer. I want to work in a company with great developers to learn from, and a product that I’m passionate about. Actually I discovered that sometimes good companies need more developers but have no time to publish openings.

So I looked for a list of local companies. For Barcelona, I found (¹), and started to make a list of the interesting ones.

I also started to go to programming events like conferences and talks, meeting people there. The idea was to find companies whose technical level is a bit above my skills(²). If you have the chance to join one of them, you’ll improve greatly.

Send them your CV with a well-prepared cover letter (email). Some of them will contact you back. And the real fun begins: TECH interviews! Usually the process starts with a tech test, where you have to program something in a short time (between 1 and 4 hours). By the way, if the company does not ask you to do a test, run away (read the reasons in Joel’s test); I once joined a company that didn’t ask me to, and 3 weeks later I quit because their code was not good to learn from.

True fact: you will do your first interview horribly.

However, you will learn a lot while doing interviews and tech tests. Especially if you ask for FEEDBACK! In my experience, only half of them will send you some feedback (following Sergey Brin’s style, “make the candidate learn something”). Feedback is pure gold. It’s the best way to learn from other developers working in the industry.

In my case, while having interviews I learned some new code design ideas. Moreover I ended up reading about DDD and BDD. For instance, in the first interview (3 months ago) they asked me about the meaning of BDD, and I had no idea; but in the last tech test, totally based on behat, I was able to code comfortably.

I can only say thank you to the few companies that gave me valuable feedback, even if they didn’t hire me. Now I’m better thanks to them. Somehow they helped me to sharpen my skills!

Summing up, the “always learn” mantra should be applied to the process of looking for a new job too.


(¹) For other cities in Europe, you may want to have a look at
(²) Actually this is an idea borrowed from my Korean classes in Seoul. The teacher always speaks using a few more words and expressions than the students know, so the students keep fighting all the time to reach that level. However, students can get exhausted by that drowning feeling.