Data science & Data Engineering

Data science is so versatile that almost every company can benefit from its proper use – be it in customer management, forecasting or logistics.

Customer management

Successful customer management is comprehensive, and companies have often neglected it because of the effort involved. Customer satisfaction is a fundamental component of a company’s success and should therefore be given appropriate attention. Working efficiently with one’s own customers requires transparent and trustworthy cooperation. Data science can automate processes and make them significantly easier without customers having to disclose their business secrets. Contrary to Web 2.0, where customers are expected to voluntarily disclose their data, cusy recommends leaving the data with the customer and evaluating it there. cusy supports companies in finding out how to consolidate and merge customer data from a wide range of sources, combine product types and optimally evaluate the collected knowledge with text mining and sentiment analysis.

Forecasts

However, data science offers advanced solutions not only in customer management but also in order forecasting: fluctuating incoming orders and the heterogeneous delivery behaviour of individual suppliers make planning enormously difficult. Machine-learning models can recognise patterns in supplier behaviour in order to predict future incoming goods quantities. Better forecasts allow not only better planning of staff and storage but also more intelligent control of suppliers, so that peaks in capacity utilisation can be anticipated when placing orders.

Logistics

When time is money, reducing the workload is very tempting. cusy can support you in this with route optimisation, both dynamic route optimisation and last-mile scheduling.

Get in touch

I will be happy to answer your questions and create a tailor-made offer for your data science project.

Veit Schiele
E-mail: info@cusy.io

I will also be happy to call you!


Find data and its origin with DataHub

One of the big difficulties for data scientists is finding the data they need, understanding it and assessing its trustworthiness. Without the necessary metadata on the available data sources and without adequate search functions, finding needed data remains a major challenge.

Traditionally, this function has been provided by bloated data cataloguing solutions. In recent years, a number of open source projects have emerged that improve the developer experience (DX) of both providing and consuming data, e.g. Netflix’s Metacat, LinkedIn’s WhereHows, LF AI & Data Foundation’s Amundsen and WeWork’s Marquez. At the same time, the behaviour of data providers also changed, moving away from bloated data cataloguing solutions towards tools that can derive partial metadata information from different sources.

LinkedIn has also developed WhereHows into DataHub. This platform enables data to be found via an extensible metadata system. Instead of crawling and polling metadata, DataHub uses a push model where the individual components of the data ecosystem publish metadata to the central platform via a REST API or a Kafka stream. This push-based integration shifts responsibility from the central entity to the individual teams, who are thus responsible for their metadata. As more and more companies seek to become data-driven, a system that helps with data discovery and understanding data quality and provenance is critical.
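
The push model described above can be pictured with a minimal sketch in plain Python. It deliberately does not use the DataHub client or its REST API; `MetadataEvent`, `CentralCatalog` and the dataset names are hypothetical and only illustrate that each team publishes the metadata it is responsible for, instead of a central crawler collecting it.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEvent:
    """Hypothetical metadata change event published by a team."""

    dataset: str
    owner: str
    schema: dict


@dataclass
class CentralCatalog:
    """Hypothetical central platform that only receives pushed events."""

    entries: dict = field(default_factory=dict)

    def publish(self, event: MetadataEvent) -> None:
        # The latest event for a dataset replaces the previous entry.
        self.entries[event.dataset] = event

    def search(self, term: str) -> list:
        # Naive search over dataset names.
        return sorted(name for name in self.entries if term in name)


catalog = CentralCatalog()
# Each team pushes metadata for the data it owns.
catalog.publish(MetadataEvent("sales.orders", "team-sales", {"order_id": "int"}))
catalog.publish(MetadataEvent("logistics.routes", "team-logistics", {"route_id": "int"}))
print(catalog.search("orders"))
```

In a real setup, the `publish` call would be replaced by a request to the platform's REST endpoint or by a message on a Kafka topic.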

The Generalised Metadata Architecture (GMA) of DataHub supports different storage technologies that can be queried with

  • document-based CRUD (Create, Read, Update, Delete)
  • complex queries even of nested tables
  • graph traversal
  • full text search incl. auto-completion

Plugins are available for files, BigQuery, dbt, Hive, Kafka, LDAP, MongoDB, MySQL, PostgreSQL, SQLAlchemy, and Snowflake, among others. You can also transfer metadata to the DataHub via console, REST API and files.

With the GMA, each team can also provide its own metadata services, so-called GMS, to make their data accessible via graphs and search indexes.

To also make the software that was used to create the data discoverable, we currently use Git2PROV and then import the W3C PROV data with the file sink.

Finally, DataHub uses Apache Gobblin to make the data lifecycle discoverable.

Choosing the right NoSQL database

Relational databases dominated the software industry for a long time and are very mature with mechanisms such as redundancy, transaction control and standard interfaces. However, they were initially only able to react moderately to higher demands on scalability and performance. Thus, from the beginning of 2010, the term NoSQL was increasingly used to describe new types of databases that better met these requirements.

NoSQL databases should solve the following problems:

  • Bridging the internal data structure of the application and the relational data structure of the database.
  • Moving away from the integration of a wide variety of data structures into a uniform data model.
  • Storing the growing amount of data, which increasingly requires clusters.

Aggregated data models

Relational database modelling is very different from the types of data structures that application developers use. The use of data structures modelled by developers to solve different problem domains has led to a move away from relational modelling towards aggregate models. Most of this is inspired by Domain-Driven Design. An aggregate is a collection of data that we interact with as a unit. These aggregates form the boundaries for ACID operations, where key-value, document and column-family stores can be seen as forms of aggregate-oriented databases.

Aggregates make it easier for the database to manage data storage on a cluster, as the data unit can now be on any computer. Aggregate-oriented databases work best when most data interactions are performed with the same aggregate, e.g. when a profile needs to be retrieved with all its details. In that case it is better to store the profile as a single aggregate and to retrieve the profile details from it.
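
As a minimal sketch of this idea, assuming a hypothetical user profile, the aggregate can be pictured as one nested structure that is stored and retrieved under a single key. The field names and the in-memory `aggregates` dictionary are illustrative only and do not correspond to a particular database.

```python
import json

# One aggregate: the profile with all its details, handled as a unit.
profile = {
    "id": "user-42",
    "name": "Ada",
    "addresses": [{"city": "Berlin", "zip": "10115"}],
    "preferences": {"newsletter": True, "language": "en"},
}

# Hypothetical aggregate store: one key maps to one serialised aggregate.
aggregates = {}
aggregates[profile["id"]] = json.dumps(profile)

# A single lookup returns the whole aggregate; no joins are needed.
loaded = json.loads(aggregates["user-42"])
print(loaded["addresses"][0]["city"])
```

In a relational model, the same profile would typically be spread over several tables (users, addresses, preferences) and reassembled with joins at query time.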

Distribution models

Aggregate-oriented databases facilitate the distribution of data because the distribution mechanism only has to move the aggregate and doesn’t have to worry about related data, since all related data is contained in the aggregate itself. There are two main types of data distribution:

Sharding
Sharding distributes different data across multiple servers so that each server acts as a single source for a subset of data.
Replication
Replication copies data across multiple servers so that the same data can be found in multiple locations. Replication takes two forms:

Master-slave replication makes one node the authoritative copy, processing writes, while slaves are synchronised with the master and may process reads.

Peer-to-peer replication allows writes to any node. Nodes coordinate to synchronise their copies of the data.

Master-slave replication reduces the likelihood of update conflicts, but peer-to-peer replication avoids writing all operations to a single server, thus avoiding a single point of failure. A system can use one or both techniques.
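
The two distribution models can be sketched in a few lines of Python. This is a simplified illustration, not how a particular database implements it: the server names are hypothetical, and the simple modulo hashing used here is an assumption (production systems usually use more elaborate schemes such as consistent hashing).

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical node names


def shard_for(key: str, servers: list) -> str:
    """Sharding: map each key to exactly one server via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def replicas_for(key: str, servers: list, copies: int) -> list:
    """Replication: store the key on its shard and on the following servers."""
    start = servers.index(shard_for(key, servers))
    return [servers[(start + i) % len(servers)] for i in range(copies)]


print(shard_for("user-42", SERVERS))
print(replicas_for("user-42", SERVERS, copies=2))
```

With `copies=1` the sketch is pure sharding; with a higher value, every aggregate is additionally replicated, which combines both techniques.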

CAP Theorem

In distributed systems, the following three aspects are important:

  • Consistency
  • Availability
  • Partition tolerance

Eric Brewer formulated the CAP theorem, which states that a distributed system can guarantee at most two of these three properties at the same time. Many NoSQL databases offer settings with which the database can be tuned to your requirements. For example, Riak, a distributed key-value database, essentially has three variables:

r
Number of nodes to respond to a read request before it is considered successful
w
Number of nodes to respond to a write request before it is considered successful
n
Number of nodes on which the data is replicated, also called replication factor

In a Riak cluster with 5 nodes, we can adjust the values for r, w and n so that the system is very consistent by setting r = 5 and w = 5. However, this makes the cluster vulnerable to network partitions, as no write is possible if even a single node is not responding. We can make the same cluster highly available for writes or reads by setting r = 1 and w = 1. However, consistency may then be affected, as some nodes may not have the latest copy of the data. The CAP theorem states that when a network partition occurs, you have to balance the availability of data against its consistency. Durability can also be weighed against latency, especially if you want to survive failures with replicated data.
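
The trade-off can be made concrete with a small calculation. The sketch below assumes n = 5, i.e. the data is replicated on all five nodes (the text does not state n explicitly), and uses the common quorum rule of thumb r + w > n, which is an addition to the text above. It ignores Riak-specific behaviour and only models the arithmetic; the function names are hypothetical.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """True if every read set must overlap every write set (r + w > n)."""
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """Number of replicas that may be down while the quorum is still reachable."""
    return n - quorum


n = 5
for r, w in [(5, 5), (3, 3), (1, 1)]:
    print(
        f"r={r} w={w}: reads see latest write={overlaps(n, r, w)}, "
        f"writes survive {tolerated_failures(n, w)} failed node(s)"
    )
```

The middle setting r = 3, w = 3 shows a compromise between the two extremes in the text: reads and writes still overlap, but two nodes may fail before writes are rejected.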

With relational databases you often needed little understanding of these requirements; now they become important again. For example, you may be used to using transactions in relational databases. In many NoSQL databases, however, these are not available to you in the same form, and you have to think about how they should be implemented. Does writing have to be transaction-safe? Or is it acceptable for data to be lost from time to time? Finally, an external transaction manager like ZooKeeper can sometimes be helpful.

Different types of NoSQL databases

NoSQL databases can be roughly divided into four types:

Key-value databases

Key-value databases are the simplest NoSQL data stores from an API perspective. The client can either retrieve the value for the key, enter a value for a key or delete a key from the data store. The value is a blob that the datastore just stores without caring or knowing what is inside. It is solely the responsibility of the application to understand what has been stored. Because key-value databases always use primary key access, they generally have high performance and can be easily scaled.
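
The three operations can be sketched with a hypothetical in-memory class. `KeyValueStore` and the key `session:abc` are illustrative and do not represent the API of a specific product; the point is that the store treats the value as an opaque blob and the application alone interprets it.

```python
import json
from typing import Optional


class KeyValueStore:
    """Hypothetical in-memory store with the three basic operations."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store keeps the blob as-is and never inspects it.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


kv = KeyValueStore()
# The application, not the store, decides how the value is encoded.
kv.put("session:abc", json.dumps({"user": "ada", "cart": [3, 7]}).encode("utf-8"))
session = json.loads(kv.get("session:abc"))
print(session["cart"])
kv.delete("session:abc")
print(kv.get("session:abc"))
```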

Some of the most popular key-value databases are

  • Riak KV
  • Redis
  • Memcached
  • Berkeley DB
  • Upscaledb

You need to choose them carefully as there are big differences between them. For example, while Riak stores data persistently, Memcached usually does not.

Document databases

These databases store and retrieve documents, which may be XML, JSON, BSON, etc. These documents are hierarchical tree data structures that can consist of maps, collections and scalar values. Document databases provide rich query languages and constructs such as databases, indexes, etc. that allow for an easier transition from relational databases.
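
To illustrate what querying hierarchical documents means, here is a minimal sketch in plain Python. The `find` helper, the dotted path syntax and the sample customers are assumptions for illustration; real document databases provide their own, far richer query languages and indexes.

```python
def get_path(document: dict, path: str):
    """Follow a dotted path such as 'address.city' into a nested document."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: list, path: str, value) -> list:
    """Return all documents whose field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]


# Documents in one collection may have different fields.
customers = [
    {"_id": 1, "name": "Ada", "address": {"city": "Berlin"}, "tags": ["b2b"]},
    {"_id": 2, "name": "Bob", "address": {"city": "Hamburg"}},
]
print([doc["name"] for doc in find(customers, "address.city", "Berlin")])
```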

Some of the most popular document databases are

  • MongoDB
  • CouchDB
  • RavenDB
  • Elasticsearch
  • eXist

Column Family Stores

These databases store data in column families as rows assigned to a row key. They are excellent for groups of related data that are frequently accessed together. For example, this could be all of a person’s profile information, but not their activities.

While each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns, in column family stores the different rows do not have to have the same columns.
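
A minimal sketch of this data model, using nested dictionaries: the column family name `profiles`, the row keys and the column names are hypothetical, and the sketch leaves out everything a real column family store adds on top (timestamps, sorted storage, distribution).

```python
# Hypothetical column family "profiles": row key -> {column name: value}.
profiles = {
    "user-1": {"name": "Ada", "email": "ada@example.org"},
    "user-2": {"name": "Bob", "city": "Hamburg", "phone": "555-0100"},
}


def read_columns(column_family: dict, row_key: str, columns: list) -> dict:
    """Read only the requested columns of one row; missing columns are skipped."""
    row = column_family.get(row_key, {})
    return {column: row[column] for column in columns if column in row}


# Both rows live in the same column family but have different columns.
print(read_columns(profiles, "user-1", ["name", "city"]))
print(read_columns(profiles, "user-2", ["name", "city"]))
```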

Some of the most popular Column Family Stores are

  • Cassandra
  • HBase
  • Hypertable

Cassandra can be described as fast and easily scalable because writes are distributed across the cluster. The cluster does not have a master node, so reads and writes can be performed by any node in the cluster.

Graph database

In graph databases you can store entities with certain properties and relationships between these entities. Entities are also called nodes. Think of a node as an instance of an object in an application; relationships can then be called edges, which can also have properties and are directed.

Graph models
Labelled Property Graph
In a labelled property graph, both nodes and edges can have properties.
Resource Description Framework (RDF)
In RDF, graphs are represented using triples. A triple consists of three elements in the form node-edge-node (subject --predicate-> object), which are defined as resources in the form of a globally unique URI or as an anonymous resource. In order to be able to manage different graphs within a database, these are stored as quads, whereby a quad extends each triple by a reference to the associated graph. Building on RDF, RDF Schema provides a vocabulary for formalising weak ontologies, and the Web Ontology Language (OWL) additionally makes it possible to describe fully decidable ontologies.
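
The two graph models can be compared in a small sketch. The data structures below are plain Python stand-ins, not the API of a graph database or RDF library; the namespace `http://example.org/`, the node names and the `objects` helper are hypothetical.

```python
# Labelled property graph: nodes and edges both carry properties.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
}
edges = [
    {"source": "alice", "target": "bob", "label": "KNOWS", "since": 2019},
]

# RDF: the same relationship as a (subject, predicate, object) triple of URIs.
EX = "http://example.org/"  # hypothetical namespace
triple = (EX + "alice", EX + "knows", EX + "bob")

# A quad adds a reference to the graph that the triple belongs to.
quad = triple + (EX + "graphs/friends",)


def objects(quads: list, subject: str, predicate: str, graph: str) -> list:
    """Return all objects for a subject/predicate pair within one graph."""
    return [o for s, p, o, g in quads if (s, p, g) == (subject, predicate, graph)]


print(objects([quad], EX + "alice", EX + "knows", EX + "graphs/friends"))
```

Note that the edge property `since` has no direct counterpart in the single triple; expressing it in RDF requires additional statements.
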
Algorithms

Important algorithms for querying nodes and edges are:

Breadth-first search, depth-first search
Breadth-first search (BFS) is a method for traversing the nodes of a graph. In contrast to depth-first search (DFS), all nodes that can be reached directly from the initial node are traversed first. Only then are subsequent nodes traversed.
Shortest path
Path between two different nodes of a graph, which has minimum length with respect to an edge weight function.
Eigenvector
In linear algebra, a vector different from the zero vector, whose direction is not changed by the mapping. An eigenvector is therefore only scaled and the scaling factor is called the eigenvalue of the mapping.
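
Breadth-first search and the shortest path can be sketched as follows. The example graph and its weights are hypothetical, and Dijkstra's algorithm is used here as one common way to compute shortest paths for non-negative edge weights; eigenvector computations are not covered by this sketch.

```python
import heapq
from collections import deque

# Hypothetical weighted, directed graph: node -> {neighbour: edge weight}.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}


def bfs_order(graph: dict, start: str) -> list:
    """Visit nodes level by level, starting from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


def shortest_path_length(graph: dict, start: str, goal: str):
    """Dijkstra's algorithm; returns None if `goal` is not reachable."""
    distances = {start: 0}
    heap = [(0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            return dist
        if dist > distances.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return None


print(bfs_order(graph, "A"))
print(shortest_path_length(graph, "A", "D"))
```

Replacing the `deque` with a stack (taking the most recently added node first) turns the traversal into a depth-first search.
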
Query languages
Blueprints
a Java API for property graphs that can be used together with various graph databases.
Cypher
a query language developed by Neo4j.
GraphQL
a query language for APIs, originally developed by Facebook, that is also offered as an interface by some graph databases.
Gremlin
an open source graph programming language that can be used with various graph databases (Neo4j, OrientDB).
SPARQL
query language specified by the W3C for RDF data models.
Distinction from relational databases

When we want to store graphs in relational databases, this is usually only done for specific conditions, e.g. for relationships between people. Adding more types of relationships then usually involves many schema changes.

In graph databases, traversing the links or relationships is very fast because the relationship between nodes doesn’t have to be calculated at query time.

Some of the most popular graph databases are

  • Neo4j
  • InfiniteGraph

Selecting the NoSQL database

What all NoSQL databases have in common is that they don’t enforce a particular schema. Unlike with strong-schema relational databases, schema changes do not have to be rolled out together with the source code that relies on them. Schema-less databases can tolerate changes in the implied schema, so they do not require downtime to migrate; they are therefore especially popular for systems that need to be available 24/7.

But how do we choose the right NoSQL database from so many? In the following we can only give you some general criteria:

Key-value databases
are generally useful for storing sessions, user profiles and settings. However, if relationships between the stored data are to be queried or multiple keys are to be edited simultaneously, we would avoid key-value databases.
Document databases
are generally useful for content management systems and e-commerce applications. However, we would avoid using document databases if complex transactions are required or multiple operations or queries are to be made for different aggregate structures.
Column Family Stores
are generally useful for content management systems and high-volume writes such as log aggregation. We would avoid using column family stores for systems that are in early development and whose query patterns may still change.
Graph databases
are well suited for problem areas where we need to connect data, such as social networks, geospatial data, routing information and recommender systems.

Conclusion

The rise of NoSQL databases did not lead to the demise of relational databases. They can coexist well. Often, different data storage technologies are used side by side so that each data set is stored in a way that matches its structure and the required queries.

Beuth University: Prototype for a medication app

For Beuth University, we are developing a prototype for a medication app.

The app is intended to improve medication safety, in particular by monitoring the intake schedule and improving knowledge of side effects and interactions.

Not only patients themselves but also relatives and caregivers should be able to use this app.

In fact, there are already many apps that promise to meet these requirements. On closer examination, however, they have significant shortcomings.

Professional quality

The professional quality of other apps is rarely discernible and, if the few reviews are taken as a basis, is usually very low. This is all the more problematic when apps promise to point out interactions and double prescriptions for medications with similar effects. Customers who rely on their app to warn them of dangers, for example when self-medicating, are likely to be at serious risk.

User groups

The apps also very rarely provide information about their intended user groups, in particular about

  • Suitability for specific diseases/conditions
  • Suitability for gender, special age groups (or areas) etc.
  • Suitability for certain health professions and settings: clinical, outpatient, at home, …
  • Suitability for physiological and physical impairments, including support for TalkBack on Android and VoiceOver on iPhone
  • Support for country-specific drugs and pack sizes

Privacy

The handling of user data is usually poor. The privacy policies usually leave customers unclear about what happens to their information. This is all the more problematic since over 80% of the apps transfer data to infrastructure providers such as Google, Facebook etc. Not even encrypted transmission of user data was always guaranteed, especially when data was transmitted by email. The few independent test procedures are unlikely to provide clarity, since they mostly rely on self-assessment.

Are Jupyter notebooks ready for production?

In recent years, there has been a rapid increase in the use of Jupyter notebooks, see also Octoverse: Growth of Jupyter notebooks, 2016-2019. Jupyter is a Mathematica-inspired application that combines text, visualisation, and code in one document. Jupyter notebooks are widely used by our customers for prototyping, research analysis and machine learning. However, we have also seen that their growing popularity has led to Jupyter notebooks being used in other areas of data analysis, with additional tools being used to run extensive calculations with them.

However, Jupyter notebooks tend to be unsuitable for creating scalable, maintainable, and long-lasting production code. Notebooks can be meaningfully versioned with a few tricks, and automated tests can also be run, but in complex projects, mixing code, comments and tests becomes an obstacle: Jupyter notebooks cannot be sufficiently modularized. Although notebooks can be imported as modules, these options are extremely limited: the notebook must first be fully loaded into memory and a new module must be created before each cell can be run in it.

This led to the first notebook war, which was essentially a conflict between data scientists and software engineers.

How To Bridge The Gap?

Notebooks are rapidly gaining popularity among data scientists and becoming the de facto standard for rapid prototyping and exploratory analysis. Above all, however, Netflix has created an extensive ecosystem of additional tools and services, such as Genie and Metacat. These tools simplify complexity and support a broader audience of analysts, scientists and especially computer scientists. In general, each of these roles depends on different tools and languages. Superficially, the workflows seem different, if not complementary. However, at a more abstract level, these workflows have several overlapping tasks:

Data exploration
occurs early in a project and may include displaying sample data, statistical profiling, and data visualization
Data preparation
is an iterative task and may include cleaning up, standardising, transforming, denormalising, and aggregating data
Data validation
is a recurring task and may include displaying sample data, performing statistical profiling and aggregated analysis queries, and visualising data
Product creation
occurs late in a project and may include providing code for production, training models, and scheduling workflows

JupyterHub already does a good job of making these tasks as simple and manageable as possible. It is scalable and significantly reduces the number of tools required.

To understand why Jupyter notebooks are so compelling for us, we highlight their core functionalities:

  • A language-agnostic messaging protocol for introspecting and executing code
  • An editable file format for writing and capturing code, code output, and markdown notes
  • A web-based user interface for interactively writing and executing code and for visualising data

Use Cases

Among our many use cases, notebooks are today most commonly used for data access, parameterization, and workflow planning.

Data access

We first introduced notebooks to support data science workflows. As acceptance grew, we saw an opportunity to leverage the versatility and architecture of Jupyter notebooks for general data access. In mid-2018, we started to expand our notebooks from a niche product to a universal data platform.

From the user’s point of view, notebooks provide a convenient interface for iteratively executing code, searching and visualizing data – all on a single development platform. Because of this combination of versatility, performance, and ease of use, we have seen rapid adoption across many user groups of the platform.

Parameterization

Along with increasing acceptance, we have introduced additional features for other use cases. As a result of this work, notebooks can now be parameterized easily. This provides our users with a simple mechanism to define notebooks as reusable templates.
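
How a notebook can act as a reusable template is sketched below with the standard library only. The sketch is modelled on the convention used by tools such as papermill, where a cell tagged `parameters` holds defaults and a new cell with overriding values is inserted after it; it is not the API of such a tool, and `inject_parameters`, the cell contents and the parameter names are hypothetical.

```python
import copy

# Minimal notebook in the shape of the nbformat 4 JSON structure.
template = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "region = 'EU'\nyear = 2020",
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print(region, year)",
            "outputs": [],
            "execution_count": None,
        },
    ],
}


def inject_parameters(notebook: dict, **parameters) -> dict:
    """Return a copy with an extra cell that overrides the default values."""
    result = copy.deepcopy(notebook)
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    for index, cell in enumerate(result["cells"]):
        if "parameters" in cell["metadata"].get("tags", []):
            # Place the overrides directly after the defaults.
            result["cells"].insert(index + 1, new_cell)
            break
    else:
        # No tagged cell found: put the overrides at the top.
        result["cells"].insert(0, new_cell)
    return result


report = inject_parameters(template, region="US", year=2021)
print(report["cells"][1]["source"])
```

The template itself stays unchanged, so the same notebook can be executed repeatedly with different parameter sets.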

Workflow planning

As a further use case for notebooks, we have discovered the planning of workflows. Notebooks have the following advantages here, among others:

  • On the one hand, notebooks allow interactive work and rapid prototyping and on the other hand they can be put into production almost without any problems. For this the notebooks are modularized and marked as trustworthy.
  • Another advantage of notebooks is the availability of different kernels, so that users can choose the right execution environment.
  • In addition, errors in notebooks are easier to understand because they are assigned to specific cells and the outputs can be stored.

Logging

In order to be able to use notebooks not only for rapid prototyping but also for long-term productivity, certain process events must be logged so that, for example, errors can be diagnosed more easily and the entire process can be monitored. IPython notebooks can use the logging module of the Python standard library or loguru, see also Jupyter Tutorial: Logging.
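
A minimal sketch of such logging with the `logging` module of the standard library: the logger name `notebook.etl` and the `load_rows` step are hypothetical, and the in-memory stream only stands in for a log file or another handler that you would configure in practice.

```python
import io
import logging

# In a notebook, configure a named logger once, for example in the first cell.
log_stream = io.StringIO()  # stands in for a log file in this sketch
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("notebook.etl")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output via the root logger


def load_rows(rows: list) -> list:
    """Hypothetical processing step that logs progress and bad input."""
    valid = [row for row in rows if row is not None]
    logger.info("loaded %d rows", len(valid))
    if len(valid) < len(rows):
        logger.warning("skipped %d empty rows", len(rows) - len(valid))
    return valid


load_rows([1, None, 3])
print(log_stream.getvalue())
```

Because cells are often re-run, adding the handler in a cell that is executed only once avoids duplicated log lines.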

Testing

There have been a number of approaches to automate the testing of notebooks, such as nbval, but with ipytest writing notebook tests became much easier, see also Jupyter Tutorial: ipytest.

Summary

Over the last few years, we have been promoting close collaboration between software engineers and data scientists to achieve scalable, maintainable and production-ready code. Together, we have found solutions that can provide production-ready models for machine learning projects as well.

elena international: Web-based planning tool for microgrids

For elena international, we are building a web-based planning tool for microgrids, using Jupyter notebooks and voila-vuetify to develop presentation logic and user interactions quickly, simply and robustly.

elena international is a startup company that provides customised solutions for various stages of power system planning.

As a first step, we are implementing a walking skeleton of a web-based planning tool for microgrids.

The chosen software architecture allows elena to continue developing its Julia libraries. In the web tool, new features are adopted in IPython notebooks with PyJulia. These notebooks can also be used to define the voila-vuetify widgets for interacting with users. Finally, with Voilà, these notebooks are converted into an interactive dashboard:

Voilà Dashboard

To harden the notebooks for production, on the one hand we write tests for each method, which run regularly with GitLab CI/CD, and on the other hand we activate logging for fault diagnosis and monitoring.

So we are not only answering the question Are Jupyter notebooks ready for production?, we are also extending the possibilities of notebooks: they serve as an editor system in which not only can texts be written and media added, but the presentation logic of scientific calculations and interactive widgets can also be defined.