Category: Data

4 things I would’ve liked to know about Cloud Functions

We had an important project to do: build and keep updated some transformed tables in BigQuery with data that comes from a transactional system. We needed 3 pieces: the code that builds the tables, a place to deploy this code and a scheduler to call it. Since the code was written in Python, we had to evaluate different platforms to deploy it.

The platform we chose was Google Cloud Functions. There is actually a nice diagram that helps you choose among Google's services, and it helped us (although for some reason it was not easy to find). We could have deployed on our own hardware, but a side goal of this project was to test the cloud.

Cloud Functions are perhaps the easiest way to deploy your code, provided it's written in Python, Go or Node.js; for other languages you could try the new Cloud Run, which is basically a Docker container that is executed on an external event. A Cloud Function can be triggered by an HTTP call, a Google Pub/Sub message, or some internal platform events (like a file being uploaded). In our case, we use Google Cloud Scheduler (a basic cron) to send a Pub/Sub message.
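To make the Pub/Sub trigger more concrete, here is a minimal sketch of what the entry point of such a function can look like in Python. The function name `refresh_tables` and the message text are hypothetical, and the table-building code is left out; the only thing assumed about the platform is that the message body arrives base64-encoded under the `data` key of the event.

```python
import base64


def refresh_tables(event, context):
    """Hypothetical entry point for a Pub/Sub-triggered function."""
    # The Pub/Sub message body arrives base64-encoded under the "data" key
    raw = event.get("data")
    message = base64.b64decode(raw).decode("utf-8") if raw else ""
    print(f"Triggered by scheduler message: {message!r}")
    # ... the code that builds the BigQuery tables would go here ...
    # The value is returned only to make this sketch easy to test locally
    return message


# Local simulation of the event the scheduler would publish
fake_event = {"data": base64.b64encode(b"refresh").decode("ascii")}
refresh_tables(fake_event, None)
```

Calling the function locally with a fake event like this is a cheap way to check the decoding logic before deploying.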

Up to here everything sounds perfect, but we ran into 4 minor problems because we didn't know the platform well enough:

When you deploy and run the Function in the cloud, the results show up with a delay. This is not a real problem, but if you’re used to working with “tail” or other console scripts to explore the logs, you have to relax and wait a couple of minutes before concluding that your code is not working.

The Cloud Functions documentation says that this product is intended for short-running scripts, so it comes with a 1-minute timeout. We missed this detail at first, and actual use misled us into thinking there was no timeout: if you launch the Function manually, it apparently has no timeout and runs for several minutes. But later, looking at the logs, we found the timeout error (which is actually logged as INFO instead of ERROR!). It seems that when our script runs for more than 1 minute, it continues until somebody else asks for resources, so if the pool is not busy, we can be lucky and run for longer.

However, at deploy time you can raise the timeout limit up to 9 minutes. Unfortunately our code runs for 11 minutes, so we had to split it.
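There are many ways to split a long job. One possible approach (a sketch, not necessarily how we did it) is to cut the work into independent steps and stop before the time budget runs out, handing the remaining steps to the next invocation. The names `run_steps` and `TIME_BUDGET_SECONDS` are made up for this example.

```python
import time

# Hypothetical budget: stay safely below the 9-minute maximum
TIME_BUDGET_SECONDS = 8 * 60


def run_steps(steps, budget=TIME_BUDGET_SECONDS, clock=time.monotonic):
    """Run callables in order until the budget is spent; return the ones left."""
    started = clock()
    for position, step in enumerate(steps):
        if clock() - started >= budget:
            # Out of time: the caller can schedule these in a new invocation
            return steps[position:]
        step()
    return []


done = []
pending = run_steps([lambda: done.append("table_a"), lambda: done.append("table_b")])
print(done, pending)
```

The check happens before each step starts, so every individual step still has to be short enough to finish within the margin that is left.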

The library we use does some disk-writing internally, and we didn’t realize it until we saw the problem in the logs. The good thing is that you can write to /tmp, so we only had to reconfigure the library to write there. The weird thing is that anything your code writes into /tmp is also written to the function log, so the logs can become difficult to follow.
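As a small illustration of the /tmp point, this sketch writes a scratch file into the system temporary directory and removes it afterwards. The file name is hypothetical, and how you point a particular library at this directory depends on that library, so that part is not shown.

```python
import os
import tempfile

# On Linux this is normally /tmp, the writable location mentioned above
scratch_dir = tempfile.gettempdir()
cache_path = os.path.join(scratch_dir, "library_cache.txt")  # hypothetical name

with open(cache_path, "w", encoding="utf-8") as handle:
    handle.write("intermediate data")

with open(cache_path, encoding="utf-8") as handle:
    content = handle.read()

# Remove scratch files once they are no longer needed
os.remove(cache_path)
print(content)
```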

This was the trickiest one! We are using a Python library that, by default, creates 4 threads of execution. For some reason this doesn’t work well on Cloud Functions, and sometimes the connection with BigQuery is closed before all the threads have finished. So we had to use an undocumented feature of the library to run with a single thread.

Summing up, Google Cloud Functions is a lightweight way to execute your Python scripts, and it is really easy to deploy and use. But sometimes things go wrong under the hood, so you should READ THE LOGS to find out if all is ok. Checking that the final results match what you expect will help too (for instance, with automated test queries that look for non-matching numbers).

Disclaimer: we chose Google Cloud Functions because of the particular background of the company (team, knowledge, etc). Depending on the task, you might want to have a look at more specific products that could be a better fit for ETLs or data processing (for instance, Dataflow/Beam on the Google platform).

My 2018 in review

My main objective in 2018 was to go deeper into Machine Learning (as a way to continue 2017’s focus). When the year started, I decided to organize my free time in small 1-month projects. The original idea was to start with Deep Learning too, but I ended up exploring other fields, like data engineering.

In February I tried different approaches to develop an ML model to solve the famous Titanic Kaggle competition, where you have to predict the survival of different passengers given some data. It was really fun, because I explored different ideas, but I ended up with a quite over-engineered notebook. Later I realized how important it is to find the noisy features that you have to ignore.

In March I decided to improve my Python skills, so I set myself a challenge based on an intuition: try to group products that are bought together, using real data from work (Ulabox, an online supermarket). I enjoyed creating sparse matrices with scipy and doing matrix operations with numpy, which was a good maths refresher. The result was a nice dendrogram that showed that some vegetables are bought together, as well as some types of yogurt.

In May I created a simple notebook to solve the Titanic competition, but with one idea in mind: help my workmates join a competition and get excited about ML. So I wrote the simplest code that worked, while trying to show an eye-catching result. I tried plotting a simple decision tree, with a great result: both coworkers joined the session, and other Kaggle users upvoted my notebook.

In June I bought a new computer with a GTX1080, getting ready to jump into Deep Learning. I tried some tutorials (TensorFlow and PyTorch), but I didn’t like starting from level 0, that is, creating my own neurons from scratch. I had actually learned about neural networks years ago, at university. Later, almost at the end of 2018, I finally found a book at the level I was looking for: Advanced Deep Learning with Keras.

Regarding conferences: in July I attended PyData Berlin thanks to my employer (who paid for the tickets). Later, in September, I also attended DataEngConf in Barcelona, which really matched the needs of my company: making a data engineering plan. In October I took a train to Paris and then another to Karlsruhe to attend PyConDE; this conference was really well organized, with talks on a wide range of topics and an incredible venue: a digital art museum with thought-provoking exhibitions about the future we are building.

The most interesting books I’ve read this year came as suggestions from conference sessions: one is Lean Analytics and the other is Data Engineering Teams. During 2018 I read some non-work-related books too, most of them sci-fi novels (like The Expanse books 3 and 4).

During the summer I continued improving my knowledge of Python, using libraries to create images and videos. I also joined a MOOC about Google Cloud Platform (needed for work).

I submitted 3 proposals to different calls for papers during the year, and was lucky to get selected to give a workshop in November in Barcelona, during the unforgettable PyDay. I prepared a practical introduction to NLP, using classic and modern methods to classify texts. I chose Spanish jokes as the corpus to work on, and the result was amazing: both the audience and I enjoyed the workshop a lot.

Finally in December I took a rest regarding tech stuff… and got married 🙂

My 4 favorite grouping tricks with pandas

When doing data analysis there is no better help than the pandas library. Actually pandas is one of the reasons Python became extremely popular in the data science field. It builds on one of the pillars of the field, numpy (a library for working with matrices), adding not only indexes and columns, but a wide range of functionality too. You can almost do magic with your data with pandas!

One of the most common uses of pandas is grouping data. You can make a group with the function groupby() and then apply some common action to that group, like mean(), count(), median(), etc.

The function groupby() sounds like SQL’s GROUP BY, and while it’s similar to its SQL cousin, it comes with extended powers. Let’s see some basic use before showing the tricks!

Let’s suppose we have a dataset with the results of an exam. We have 6 students who spent almost 2 hours (120 minutes) solving the problems from the exam, which took place in 2 different rooms (labeled 1 and 2).
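A dataset like that could be built as follows. Note that the names, scores and times here are made-up values for illustration (picked only to be consistent with the figures mentioned in this post), and the column names name, room, result and time are my assumption of a reasonable layout.

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],             # exam score
    "time": [90, 109, 115, 94, 100, 118],     # minutes spent
})

print(df)
```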

INTRODUCTION: basic grouping

The most basic way to do grouping is by a column (or ‘feature’, in data analysis slang). In this case we are going to group by room, then choose only the time feature, and get the mean of the time spent in each room.
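A minimal sketch of that basic grouping, using made-up exam data (the values are invented for illustration):

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Group by room, keep only the time column, then average it per room
mean_time = df.groupby("room")["time"].mean()
print(mean_time)
```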

Notice that the functions that are used with groups can also be used without grouping, as shown in the next case. Here we are also showing how to use square brackets to choose only some columns, in this case result and time, so that further operations are done only on them.
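For instance, with the same kind of made-up exam data, selecting two columns and applying mean() directly, without any grouping, could look like this:

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Double square brackets select a subset of columns as a new dataframe
overall = df[["result", "time"]].mean()
print(overall)
```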

In the following case we first choose 2 columns, result and time, and then we group by result, looking for the maximum values.

Given these examples, we can get an idea of basic use… but let’s see now my 4 favorite tricks when grouping.
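A sketch of that selection followed by a grouping, again on made-up exam data:

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Keep two columns, group by result, and take the longest time per result
max_time = df[["result", "time"]].groupby("result").max()
print(max_time)
```

Two students share a result of 7 in this invented data, so that row shows the larger of their two times.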

TRICK 1: list grouping

You can do grouping with more than one column, and the result will be a multi-index dataframe, nice!

Ok, ok, I hear you: ‘this can be done with SQL too’. That’s right, but afterwards you can use the multi-index for further exploration.
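Here is a small sketch of grouping by a list of columns and then using the resulting multi-index, with made-up exam data:

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Passing a list of columns produces a two-level (multi) index
by_room_result = df.groupby(["room", "result"])["time"].mean()
print(by_room_result)

# The first index level can be used to drill down into one room
room_1 = by_room_result.loc[1]
print(room_1)
```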

TRICK 2: grouping by function

You can pass a function as a parameter to pandas’ groupby() to create groups. The function will receive an index value as its parameter, so you can use pandas’ loc to locate the data. For instance, let’s suppose we want to group by the number of ‘e’ letters in each student’s name.
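A sketch of grouping by a function, on made-up exam data whose names and times were invented for illustration. The helper name `count_e` is hypothetical.

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})


def count_e(idx):
    # groupby() calls this once per index value; loc fetches that row's name
    return df.loc[idx, "name"].count("e")


mean_by_e = df.groupby(count_e)["time"].mean()
print(mean_by_e)
```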

So people with zero ‘e’s in their names spent 104.5 minutes on average, while people with 2 ‘e’s finished the exam in just 94 minutes.

Isn’t it amazing? Of course this is a silly example, but you can do things like, for instance, grouping ages by decades (like I did in a notebook on Kaggle).

TRICK 3: Group and rank

pandas has a function called rank() that gets the order/rank of the values in a column. For example, it can rank the time column and show a 1 for the quickest student, then a 2 for the second one, etc.
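A minimal sketch of a global ranking by time, on made-up exam data (the column name `global_rank` is just a label I chose):

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Rank 1 goes to the smallest time, i.e. the quickest student
df["global_rank"] = df["time"].rank()
print(df.sort_values("global_rank")[["name", "time", "global_rank"]])
```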

But how could we get the rank per room? We want to know which student was the quickest for room 1, and which one for room 2…
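One way to get a rank per room is to call rank() on the grouped column. This sketch uses made-up exam data, and `room_rank` is a column name I chose for the example:

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# The ranking restarts at 1 inside each room
df["room_rank"] = df.groupby("room")["time"].rank()
print(df[["name", "room", "time", "room_rank"]])
```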

So Alex was the number 1 globally and also the quickest in room 1. But George was the number 1 in room 2. Isn’t that magic?

Funnily enough, rank() returns the order as floats, but you can change its type with .astype(int) later.

TRICK 4: Group and process with agg() magic

With the agg() function you can describe several grouped operations in a compact form. Let’s see the example to understand it better:

Using a dictionary, we first define which columns we will work on. Then we define the operations we want to perform on each column; you can even write a lambda function there!
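A sketch of agg() with a dictionary, on made-up exam data. The choice of operations, including the lambda that computes the spread between the best and worst result, is only an illustration.

```python
import pandas as pd

# Made-up exam data, for illustration only
df = pd.DataFrame({
    "name": ["Alex", "Maria", "Xavier", "George", "John", "Carles"],
    "room": [1, 1, 1, 2, 2, 2],
    "result": [9, 6, 7, 8, 5, 7],
    "time": [90, 109, 115, 94, 100, 118],
})

# Keys are the columns to work on, values are the operations for each one
summary = df.groupby("room").agg({
    "time": ["mean", "min"],
    "result": ["max", lambda s: s.max() - s.min()],
})
print(summary)
```

The result has two-level column names (column, operation), so a single value can be read with something like summary.loc[1, ("time", "min")].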

I hope you liked these tricks!

2017 focus: ML

At the end of 2016 I was still amazed by the result of the AlphaGo vs. Lee Sedol match in March (the first time a machine beat a top professional Go player), and at the same time I was looking for a subject to focus on in 2017, so I chose Machine Learning. During my university years I tried out some related tools (genetic algorithms, basic neural networks, etc), but I hadn’t looked at the field again for 10 years.

The first stop was the famous Machine Learning course by Andrew Ng on Coursera, as everybody points you there. Although it explains a lot of complex stuff in an intuitive way, you soon get tired of so much maths and of using Octave/Matlab, when you should be using Python.

After one year learning about Machine Learning, I think I have quite a list of recommendations on how to start exploring the field. Disclaimer: this could be related to my preferred way of learning, that is, with text instead of videos. This could be a good way to start if you have no previous experience:

  • Do not watch that Coursera ML course; just read the notes somebody took on it instead.
  • Learn Python, but especially the libraries Numpy, Pandas and scikit-learn, and also how to run a Jupyter notebook. The best way to install them all is via the Anaconda distribution.
  • Buy a copy (paper or ebook) of the book “Python Machine Learning” by Sebastian Raschka.
  • Join Kaggle and have a look at the Titanic tutorials and its new Learn section. They also have a video course on Udacity in case you like watching videos.
  • Don’t be in a rush to learn deep learning (aka neural networks), because you’ll first have to learn about classic ML models, and also about a lot of related processes: data cleaning, feature engineering and data visualization.

My first real-world input was in May, when I attended the PyData conference in Barcelona, which was a turning point: I found lots of ideas to apply, but above all I felt the industry’s pulse.

During the summer I challenged myself to apply it at work and to give a conference talk. The subject was customer segmentation using unsupervised algorithms, with a dataset I prepared myself from our company’s data. In the end the talk became a 2-hour workshop.

It was the first time I gave a presentation about Machine Learning in English. Although the audience was satisfied with the workshop and some people had interesting conversations with me afterwards, I felt that I should’ve worked harder while preparing it.

Now that 2017 has finished and 2018 has started, I’ll continue focusing on ML, but with a more practical approach. In my day job we have developed a recommendation system that will evolve with several ML models working together, and after work I’ll try to play more with Kaggle, taking part in some competitions.

In 2018, I’ll try deep learning too, with Andrew Ng’s course using TensorFlow, a creative apps course and some video tutorials on PyTorch. I’ll try to improve my engineering approach to ML, as things like version control, testing and deployment are rarely seen in a world with more university people than industry ones. Finally, I plan to complete a nice course on data visualization with D3.js.

I hope all these links help somebody too!

PyData conference in Barcelona

I was lucky to attend the PyData conference in Barcelona this year, hosted at ESADE.

Although I’m basically a PHP developer, I’ve been playing with data science tools lately, using Python’s stack. I have no real experience in data science, apart from coding a couple of predictions using linear regression, but I was curious.

With a novice spirit, I set some clear objectives: find out if data science is like teenage sex, or if companies are really using it; get a feeling for the community; and try to learn as much as I could.

First of all, the community is vibrant, actually far more than the PHP one in Barcelona. The organization was smooth too, and everyone I talked with was really nice. Everybody had things to learn, so they came with an open mind.

It was funny to see that I was on the “data owners” side, while most people were on the “looking for datasets” side. This led to several conversations in which I was asked how we use the data in our company.

Regarding the talks, there were quite a lot about tools. The Python science stack has a wide range of evolving tools, and this somehow reminds me of PHP circa 2008, when basic tools (PHPUnit, for example) were becoming popular. It’s good to polish your tools and master them, so I welcomed those talks.

There were also some talks on theory, which surprised me, as I had never seen university professors at software conferences. Mathematical and computer science concepts were explained, for instance on optimization. This contrasts with the common industry solution: if some code is slow, just use more machine instances, which is far cheaper than spending time trying to optimize things (at least 99% of the time). I don’t mean I didn’t like those talks (actually one was really mind-blowing), but I would love to see more professors at other conferences, getting a real feel for some industry practices.

I was looking for talks showing “real fire”, real examples in companies. We heard about hotels trying to predict cancellations (in order to overbook); we saw IBM’s Watson analyzing the personality of customers; we heard about predicting which employees will leave a big company, ideas for reacting when bad weather is on its way, the best weekday to publish job offers and set interviews, and some other extremely interesting stuff… but I do want more!

My overall feeling is that I learned a lot. Python is not really used as a language but more as an interface to some amazing libraries. It looks like I have no option but to start exploring the data at Ulabox!

I’d like to thank Ulabox (my employer), which paid for the ticket, and all the people in the organization, who did a great job!

I published some of my (unedited) notes too.