We had an important project to do: build, and keep updated, a set of transformed tables in BigQuery with data coming from a transactional system. We needed three pieces: the code that builds the tables, a place to deploy that code, and a scheduler to call it. Since the code was in Python, we had to evaluate different platforms to deploy it on.
The platform chosen was Google Cloud Functions. There is actually a nice diagram that helps you choose among Google services, and it helped us out (though for some reason it was not easy to find). We could have deployed on our own hardware server, but a secondary goal of this project was testing the cloud.
Cloud Functions are perhaps the easiest way to deploy your code, as long as it's Python, Go or Node.js; for other languages you can try the newer Cloud Run, which is basically a Docker container executed on an external event. A Cloud Function can be triggered by an HTTP call, a Google Pub/Sub message, or some platform-internal events (like a file being uploaded). In our case, we use Cloud Scheduler (a basic cron) that sends a Pub/Sub message.
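A minimal sketch of what this looks like in Python (the function, dataset and table names below are made up for illustration): a Pub/Sub-triggered function is just a module with an entry point that receives the event and a context object.

    import base64
    from google.cloud import bigquery

    def rebuild_tables(event, context):
        """Entry point for a Pub/Sub-triggered Cloud Function.

        `event` carries the Pub/Sub message (its data field is
        base64-encoded); `context` carries metadata such as the
        event ID and timestamp.
        """
        message = base64.b64decode(event.get("data", b"")).decode("utf-8")
        print(f"Triggered by message {message!r} (event_id={context.event_id})")

        # Rebuild one derived table from the transactional data.
        client = bigquery.Client()
        client.query(
            "CREATE OR REPLACE TABLE reporting.daily_sales AS "
            "SELECT day, SUM(amount) AS total FROM raw.sales GROUP BY day"
        ).result()  # block until the job finishes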
Up to here everything sounds perfect, but we ran into four minor problems due to our lack of knowledge of the platform:
DEPLOY AND LOGS WITH DELAY
Once you deploy and use the Function on the cloud, the logs show up with some delay. This is not a real problem, but if you're used to working with "tail" or other console scripts to explore logs, you have to relax and wait a couple of minutes before concluding your code is not working. (You can read them from the console with gcloud functions logs read, but the delay is still there.)
TIMEOUT
Cloud Functions' documentation says the product is intended for short-running scripts, so it comes with a 1-minute default timeout. We missed this detail at first, and actual usage misled us into thinking there was no timeout: if you launch the Function manually, it apparently runs for several minutes without complaint. But later, looking at the logs, we found the timeout error (which is actually logged as INFO instead of ERROR!). It seems that when our script exceeds the 1-minute limit, it keeps running until somebody else asks for resources: if we are in a relaxed pool, we can get lucky and run longer.
However, you can raise the timeout limit at deploy time, up to 9 minutes. Unluckily our code runs for 11 minutes, so we had to split it.
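For reference, the limit is set with the --timeout flag when deploying (the function and topic names here are placeholders):

    gcloud functions deploy rebuild_tables \
        --runtime python37 \
        --trigger-topic rebuild-tables \
        --timeout 540s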
NO WRITE PERMISSIONS
The library we use does some disk writing internally, and we didn't realize it until we saw the problem in the logs. The good news is that you can write to /tmp, so we only had to reconfigure the library to write there. The weird thing is that anything your code writes into /tmp is also written to the function log, so the logs can become difficult to follow.
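How you point a library at /tmp depends on the library itself; as a hypothetical sketch, the idea is simply to redirect any scratch files to the one writable path (some_library and its setting are made up):

    import os
    import tempfile

    # /tmp is the only writable location inside a Cloud Function
    # (it is an in-memory filesystem, so big files count against RAM).
    scratch_dir = tempfile.mkdtemp(dir="/tmp")

    # Hypothetical library setting: send its working files to the
    # writable path instead of its package-relative default.
    # some_library.configure(work_dir=scratch_dir)

    with open(os.path.join(scratch_dir, "intermediate.csv"), "w") as f:
        f.write("day,total\n")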
SINGLE THREAD
This was the trickiest one! We are using a Python library that, by default, creates 4 threads of execution. For some reason this doesn't work well on Cloud Functions, and sometimes the connection with BigQuery is closed before all threads have finished. So we had to use an undocumented feature of the library to run with a single thread.
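We can't reproduce the library's undocumented flag here, but as a generic illustration of the fix, imagine a loader built on a thread pool: forcing a single worker serializes the work that was previously racing against the BigQuery connection being closed.

    from concurrent.futures import ThreadPoolExecutor

    def upload_chunk(chunk):
        # Placeholder for the real per-chunk upload to BigQuery.
        print(f"uploading {len(chunk)} rows")

    chunks = [[1, 2], [3, 4], [5, 6]]

    # With the default 4 workers, some uploads could still be running
    # when the connection was torn down; max_workers=1 runs them one
    # by one, which is what the library's single-thread mode achieves.
    with ThreadPoolExecutor(max_workers=1) as pool:
        list(pool.map(upload_chunk, chunks))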
Summing up, Google Cloud Functions is a lightweight way to execute your Python scripts, really easy to deploy and use. But sometimes things go wrong under the hood, so you should READ THE LOGS to find out if all is OK. Checking that the final results match what you expect helps too (for instance, running automated test queries that look for non-matching numbers).
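A minimal sketch of such a check, reusing the made-up table names from before: compare an aggregate in the source table against the transformed one and fail loudly on any mismatch.

    from google.cloud import bigquery

    client = bigquery.Client()

    # The raw total and the derived total must agree; any difference
    # means the transformation dropped or duplicated data somewhere.
    row = list(client.query("""
        SELECT
          (SELECT SUM(amount) FROM raw.sales)             AS source_total,
          (SELECT SUM(total)  FROM reporting.daily_sales) AS derived_total
    """).result())[0]

    if row.source_total != row.derived_total:
        raise RuntimeError(
            f"Mismatch: source={row.source_total}, derived={row.derived_total}"
        )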
Disclaimer: we chose Google Cloud Functions due to the particular background of the company (team, knowledge, etc.). Depending on the task at hand, you might want to have a look at other, more specific products that could serve you better for ETL or data processing (for instance, Dataflow/Beam on the Google platform).