Find data and its origin with DataHub

by Veit Schiele last modified 2021-08-20T22:40:37+02:00
One of the biggest difficulties for data scientists is finding the data they need, understanding it and assessing its trustworthiness. Without adequate metadata on the available data sources and without suitable search functions, finding the right data remains a major challenge.

Traditionally, this function has been provided by bloated data cataloguing solutions. In recent years, a number of open-source projects have emerged that improve the developer experience (DX) of both providing and consuming data, e.g. Netflix’s Metacat, LinkedIn’s WhereHows, the LF AI & Data Foundation’s Amundsen and WeWork’s Marquez. At the same time, the behaviour of data providers has also changed, moving away from bloated data cataloguing solutions towards tools that can derive partial metadata information from different sources.

LinkedIn has since developed WhereHows further into DataHub. This platform makes data discoverable via an extensible metadata system. Instead of crawling and polling metadata, DataHub uses a push model in which the individual components of the data ecosystem publish their metadata to the central platform via a REST API or a Kafka stream. This push-based integration shifts responsibility from the central entity to the individual teams, who thus become responsible for their own metadata. As more and more companies seek to become data-driven, a system that helps with data discovery and with understanding data quality and provenance is critical.
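
To illustrate the push model, here is a minimal sketch of the kind of metadata change proposal a producing team might send to DataHub's central service. The payload shape is simplified and illustrative, the dataset name and the local endpoint URL are assumptions; in practice you would use DataHub's Python emitter or REST API rather than hand-built JSON:

```python
import json

def make_dataset_mcp(platform: str, name: str, description: str, env: str = "PROD") -> dict:
    """Build a simplified metadata change proposal for a dataset,
    roughly in the shape DataHub's REST API expects (illustrative only)."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
    return {
        "entityType": "dataset",
        "entityUrn": urn,
        "aspectName": "datasetProperties",
        "aspect": {"description": description},
        "changeType": "UPSERT",
    }

# Hypothetical dataset owned by a producing team.
mcp = make_dataset_mcp("postgres", "analytics.public.orders", "Order fact table")
print(json.dumps(mcp, indent=2))
# The team's pipeline would then POST this to the central metadata service
# (or publish it to the Kafka topic) instead of waiting to be crawled.
```

The point of the push model is visible here: the team that owns `analytics.public.orders` constructs and publishes the metadata itself, so the central platform never has to poll the source system.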

DataHub’s Generalised Metadata Architecture (GMA) supports different storage technologies, which can be queried via

  • document-based CRUD (Create, Read, Update, Delete)
  • complex queries even of nested tables
  • graph traversal
  • full-text search including auto-completion

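As an example of the full-text search capability, a client can query DataHub's GraphQL endpoint. The following sketch only builds the HTTP request; the endpoint URL is the usual local default and the exact GraphQL field names follow DataHub's schema but should be treated as illustrative:

```python
import json
from urllib import request

# Assumed local GMS endpoint; adjust to your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Full-text search for datasets whose metadata mentions "orders".
SEARCH_QUERY = """
query {
  search(input: {type: DATASET, query: "orders", start: 0, count: 10}) {
    total
    searchResults { entity { urn type } }
  }
}
"""

def build_search_request(url: str, query: str) -> request.Request:
    """Wrap a GraphQL query in a POST request; sending it is left to the caller."""
    body = json.dumps({"query": query}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = build_search_request(GRAPHQL_URL, SEARCH_QUERY)
print(req.get_method(), req.full_url)
```

Sending the request with `request.urlopen(req)` against a running DataHub instance would return matching dataset URNs, including auto-complete-style partial matches.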
Plugins are available for files, BigQuery, dbt, Hive, Kafka, LDAP, MongoDB, MySQL, PostgreSQL, SQLAlchemy and Snowflake, among others. You can also transfer metadata to DataHub via the console, the REST API or files.
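
Such a plugin is typically configured through an ingestion recipe. A minimal sketch for a PostgreSQL source pushing into DataHub's REST sink might look like this (host, database and credentials are placeholders):

```yaml
source:
  type: postgres
  config:
    host_port: localhost:5432
    database: analytics
    username: datahub_reader
    password: example

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

A recipe like this is then executed with the `datahub` command-line tool, which extracts the metadata from the source and pushes it to the configured sink.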

With the GMA, each team can also provide its own metadata service, a so-called Generalised Metadata Service (GMS), to make its data accessible via graphs and search indexes.

To also capture the software that was used to create the data, we currently use Git2PROV and then import the resulting W3C PROV data with the file sink.

Finally, DataHub uses Apache Gobblin to manage the data lifecycle.