Are Jupyter notebooks ready for production?

In recent years, the use of Jupyter notebooks has grown rapidly; see, for example, Octoverse: Growth of Jupyter notebooks, 2016-2019. Jupyter is a Mathematica-inspired application that combines text, visualisation, and code in a single document. Our customers use Jupyter notebooks widely for prototyping, research, analysis, and machine learning. This growing popularity has also pushed notebooks into other areas of data analysis, and additional tools are now used to run extensive calculations with them.

However, Jupyter notebooks tend to be inappropriate for creating scalable, maintainable, and long-lasting production code. With a few tricks, notebooks can be meaningfully versioned and automated tests can be run, but in complex projects the mixture of code, comments, and tests becomes an obstacle: Jupyter notebooks cannot be sufficiently modularised. Although notebooks can be imported as modules, this support is extremely limited: the notebook must first be fully loaded into memory, and a new module must be created so that each cell can be executed in it.
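To illustrate this limitation, here is a minimal sketch of importing a notebook as a module with the importnb package; the notebook name analysis.ipynb and the attribute load_data are hypothetical placeholders for this example:

    # Minimal sketch, assuming importnb is installed and a notebook
    # "analysis.ipynb" (hypothetical name) lies on the import path.
    from importnb import Notebook

    with Notebook():
        import analysis  # loads the whole file and executes every cell

    # Names defined in cells become module attributes afterwards.
    print(analysis.load_data)  # "load_data" is a hypothetical attribute

Note that the entire notebook is executed on import; there is no way to pull in a single function without running all other cells as well.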

As a result, the first notebook war broke out, which was essentially a conflict between data scientists and software engineers.

How To Bridge The Gap?

Notebooks are rapidly gaining popularity among data scientists and becoming the de facto standard for rapid prototyping and exploratory analysis. Above all, Netflix has built an extensive ecosystem of additional tools and services around them, such as Genie and Metacat. These tools reduce complexity and serve a broader audience of analysts, scientists, and especially computer scientists. In general, each of these roles relies on different tools and languages. On the surface, the workflows seem distinct, if not complementary. At a more abstract level, however, they share several overlapping tasks:

  • Data exploration
      ◦ occurs early in a project
      ◦ may include displaying sample data, statistical profiling, and data visualisation (a minimal sketch follows this list)
  • Data preparation
      ◦ an iterative task
      ◦ may include cleaning, standardising, transforming, denormalising, and aggregating data
  • Data validation
      ◦ a recurring task
      ◦ may include displaying sample data, performing statistical profiling and aggregated analysis queries, and visualising data
  • Product creation
      ◦ occurs late in a project
      ◦ may include providing code for production, training models, and scheduling workflows
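As a minimal sketch of the exploration step, the following pandas snippet shows sample display, statistical profiling, and a quick visualisation; the file name orders.csv and the column amount are hypothetical:

    import pandas as pd

    # "orders.csv" and the "amount" column are hypothetical placeholders.
    df = pd.read_csv("orders.csv")

    print(df.sample(5))           # display sample data
    print(df.describe())          # statistical profiling of numeric columns
    df["amount"].plot.hist()      # quick visualisation of a single column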

A JupyterHub can already do a good job here of making these tasks as simple and manageable as possible. It is scalable and significantly reduces the number of tools required.

To understand why Jupyter notebooks are so compelling for us, we highlight their core functionalities:

  • A messaging protocol for inspecting and executing code, independent of the programming language
  • An editable file format for writing and recording code, code output, and markdown notes (see the nbformat sketch below)
  • A web-based user interface for writing and executing code interactively and for visualising data
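The file format can also be processed programmatically, for example with the nbformat library; the following is a small sketch in which the file name example.ipynb is an assumption:

    import nbformat

    # Read a notebook file; "example.ipynb" is a hypothetical name.
    nb = nbformat.read("example.ipynb", as_version=4)

    # Cells are simple dictionary-like records: type, source, outputs.
    for cell in nb.cells:
        print(cell.cell_type, cell.source[:40])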

Use Cases

Among our many applications, notebooks are most commonly used today for data access, parameterization, and workflow planning.

Data access

We first introduced notebooks to support data science workflows. As adoption grew, we saw an opportunity to leverage the versatility and architecture of Jupyter notebooks for general data access. In mid-2018, we began expanding our notebooks from a niche product into a universal data platform.

From the user’s point of view, notebooks provide a convenient interface for iteratively running code, exploring and visualising data – all on a single development platform. Thanks to this combination of versatility, performance, and ease of use, we have seen rapid adoption across many user groups of the platform.

Parameterization

As adoption increased, we introduced additional features for other use cases. Out of this work, notebooks became easily parameterizable. This gave our users a simple mechanism for defining notebooks as reusable templates.
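One common way to execute such templates is the papermill library, which injects values into a cell tagged "parameters"; the notebook names and parameter values below are assumptions for illustration:

    import papermill as pm

    # Execute a template notebook with injected parameters.
    # File names and parameters are hypothetical examples; the template
    # must contain a cell tagged "parameters" for the injection to work.
    pm.execute_notebook(
        "report_template.ipynb",
        "report_2024-06.ipynb",
        parameters={"start_date": "2024-06-01", "region": "EU"},
    )

Each run produces a new output notebook, so one template can serve many parameter combinations without being modified.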

Workflow planning

As a further area of application for notebooks, we have discovered the planning of workflows. Among other things, they offer the following advantages:

  • On the one hand, notebooks allow interactive work and rapid prototyping; on the other hand, they can be moved into production almost seamlessly once they are modularised and marked as trusted (see the sketch after this list).
  • Another advantage of notebooks is the choice of different kernels, so that users can select the right execution environment.
  • In addition, errors in notebooks are easier to trace because they are assigned to specific cells and the outputs can be stored.
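As a sketch of how a scheduled workflow might execute a notebook headlessly, the nbclient library can run a notebook and persist the executed copy; the file names here are assumptions:

    import nbformat
    from nbclient import NotebookClient

    # Execute a notebook headlessly; "pipeline_step.ipynb" is hypothetical.
    nb = nbformat.read("pipeline_step.ipynb", as_version=4)
    client = NotebookClient(nb, timeout=600, kernel_name="python3")
    client.execute()

    # Store the executed notebook, including all cell outputs, for inspection.
    nbformat.write(nb, "pipeline_step_run.ipynb")

Because the executed copy keeps every cell output, a failed run can be opened afterwards and the error located in the exact cell that raised it.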

Logging

In order to use notebooks not only for rapid prototyping but also for long-term production, certain process events must be logged so that, for example, errors can be diagnosed more easily and the entire process can be monitored. IPython notebooks can use the logging module of the Python standard library or loguru, see also Jupyter-Tutorial: Logging.
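A minimal sketch with the standard logging module might look like this; the log file name and the failing processing step are placeholders:

    import logging

    # Write process events to a file so they survive the notebook session.
    logging.basicConfig(
        filename="notebook.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logger = logging.getLogger(__name__)

    logger.info("Loading input data")
    try:
        result = 1 / 0  # placeholder for a real processing step
    except ZeroDivisionError:
        logger.exception("Processing step failed")

logger.exception records the full traceback, so the failure can be diagnosed later even if the notebook session is gone.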

Testing

There have been a number of approaches to automating the testing of notebooks, such as nbval, but ipytest has made writing notebook tests much easier, see also Jupyter Tutorial: ipytest.
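A minimal sketch of a notebook test with ipytest, where the function under test is a placeholder:

    import ipytest

    ipytest.autoconfig()  # configure pytest for the running notebook

    def is_even(n):
        """Placeholder function under test."""
        return n % 2 == 0

    def test_is_even():
        assert is_even(2)
        assert not is_even(3)

    ipytest.run()  # discover and run the tests defined in this notebook

Alternatively, tests can be placed in a cell that starts with the %%ipytest cell magic, which runs them automatically when the cell is executed.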

Summary

Over the last few years, we have promoted close collaboration between software engineers and data scientists in order to achieve scalable, maintainable, and production-ready code. Together, we have found solutions that can also provide production-ready models for machine learning projects.