Open Source

Choosing the right NoSQL database

Relational databases dominated the software industry for a long time and are very mature with mechanisms such as redundancy, transaction control and standard interfaces. However, they were initially only able to react moderately to higher demands on scalability and performance. Thus, from the beginning of 2010, the term NoSQL was increasingly used to describe new types of databases that better met these requirements.

NoSQL databases should solve the following problems:

  • Bridging the internal data structure of the application and the relational data structure of the database.
  • Moving away from the integration of a wide variety of data structures into a uniform data model.
  • The growing amount of data increasingly required clusters for data storage

Aggregated data models

Relational database modelling is very different from the types of data structures that application developers use. The use of data structures modelled by developers to solve different problem domains has led to a move away from relational modelling towards aggregate models. Most of this is inspired by Domain Driven Design. An aggregate is a collection of data that we interact with as a unit. These aggregates form the boundaries for ACID operations, where Key Values, Documents and Column Family can be seen as forms of an aggregator-oriented database.

Aggregates make it easier for the database to manage data storage on a cluster, as the data unit can now be on any computer. Aggregator-oriented databases work best when most data interactions are performed with the same aggregate, e.g. when a profile needs to be retrieved with all its details. It is better to store the profile as an aggregation object and use these aggregates to retrieve profile details.

Distribution models

Aggregator-oriented databases facilitate the distribution of data because the distribution mechanism only has to move the aggregate and doesn’t have to worry about related data, since all related data is contained in the aggregate itself. There are two main types of data distribution:

Sharding
Sharding distributes different data across multiple servers so that each server acts as a single source for a subset of data.
Replication

Replication copies data across multiple servers so that the same data can be found in multiple locations. Replication takes two forms:

Master-slave replication makes one node the authoritative copy, processing writes, while slaves are synchronised with the master and may process reads.

Peer-to-peer replication allows writes to any node. Nodes coordinate to synchronise their copies of the data.

Master-slave replication reduces the likelihood of update conflicts, but peer-to-peer replication avoids writing all operations to a single server, thus avoiding a single point of failure. A system can use one or both techniques.

CAP Theorem

In distributed systems, the following three aspects are important:

  • Consistency
  • Availability
  • Partition tolerance

Eric Brewer has established the CAP theorem, which states that in any distributed system we can only choose two of the three options. Many NoSQL databases try to provide options where a setup can be chosen to set up the database according to your requirements. For example, if you consider Riak as a distributed key-value database, there are essentially the three variables

r
Number of nodes to respond to a read request before it is considered successful
w
number of nodes to respond to a write request before it is considered successful
n
Number of nodes on which the data is replicated, also called replication factor

In a Riak cluster with 5 nodes, we can adjust the values for r, w and n so that the system is very consistent by setting r = 5 and w = 5. However, by doing this we have made the cluster vulnerable to network partitions, as no write is possible if only one node is not responding. We can make the same cluster highly available for writes or reads by setting r = 1 and w = 1. However, now consistency may be affected as some nodes may not have the latest copy of the data. The CAP theorem states that when you get a network partition, you have to balance the availability of data against the consistency of data. Durability can also be weighed against latency, especially if you want to survive failures with replicated data.

Often with relational databases you needed little understanding of these requirements; now they become important again. So you may have been used to using transactions in relational databases. In NoSQL databases, however, these are no longer available to you and you have to think about how they should be implemented. Does the writing have to be transaction-safe? Or is it acceptable for data to be lost from time to time? Finally, sometimes an external transaction manager like ZooKeeper can be helpful.

Different types of NoSQL databases

NoSQL databases can be roughly divided into four types:

Key-value databases

Key-value databases are the simplest NoSQL data stores from an API perspective. The client can either retrieve the value for the key, enter a value for a key or delete a key from the data store. The value is a blob that the datastore just stores without caring or knowing what is inside. It is solely the responsibility of the application to understand what has been stored. Because key-value databases always use primary key access, they generally have high performance and can be easily scaled.

Some of the most popular key-value databases are

Riak KV
Home | GitHub | Docs
Redis
Home | GitHub | Docs
Memcached
Home | GitHub | Docs
Berkeley DB
Home | GitHub | Docs
Upscaledb
Home | GitHub | C API Docs

You need to choose them carefully as there are big differences between them. For example, while Riak stores data persistently, Memcached usually does not.

Document databases

These databases store and retrieve documents, which may be XML, JSON, BSON, etc. These documents are hierarchical tree data structures that can consist of maps, collections and scalar values. Document databases provide rich query languages and constructs such as databases, indexes, etc. that allow for an easier transition from relational databases.

Some of the most popular document databases are

MongoDB
Home | GitHub | Docs
CouchDB
Home | GitHub | Docs
RavenDB
Home | GitHub | Docs
Elasticsearch
Home | GitHub | Docs
eXist
Home | GitHub | Docs

Column Family Stores

These databases store data in column families as rows assigned to a row key. They are excellent for groups of related data that are frequently accessed together. For example, this could be all of a person’s profile information, but not their activities.

While each Column Family can be compared to the row in an RDBMS table where the key identifies the row and the row consists of multiple columns, in Column Family Stores the different rows do not have to have the same columns.

Some of the most popular Column Family Stores are

Cassandra
Home | GitHub | Docs
HBase
Home | GitHub | Docs
Hypertable
Home | GitHub | Docs

Cassandra can be described as fast and easily scalable because writes are distributed across the cluster. The cluster does not have a master node, so reads and writes can be performed by any node in the cluster.

Graph database

In graph databases you can store entities with certain properties and relationships between these entities. Entities are also called nodes. Think of a node as an instance of an object in an application; relationships can then be called edges, which can also have properties and are directed.

Graph models
Labeled Property Graph
In a labelled property graph, both nodes and edges can have properties.
Resource Description Framework (RDF)
In RDF, graphs are represented using triples. A triple consists of three elements in the form node-edge-node subject --predicate-> object, which are defined as resources in the form of a globally unique URI or as an anonymous resource. In order to be able to manage different graphs within a database, these are stored as quads, whereby a quad extends each triple by a reference to the associated graph. Building on RDF, a vocabulary has been developed with RDF Schema to formalise weak ontologies and furthermore to describe fully decidable ontologies with the Web Ontology Language.
Algorithms

Important algorithms for querying nodes and edges are:

Breadth-first search, depth-first search
Breadth-first search (BFS) is a method for traversing the nodes of a graph. In contrast to depth-first search (DFS), all nodes that can be reached directly from the initial node are traversed first. Only then are subsequent nodes traversed.
Shortest path
Path between two different nodes of a graph, which has minimum length with respect to an edge weight function.
Eigenvector
In linear algebra, a vector different from the zero vector, whose direction is not changed by the mapping. An eigenvector is therefore only scaled and the scaling factor is called the eigenvalue of the mapping.
Query languages
Blueprints
a Java API for property graphs that can be used together with various graph databases.
Cypher
a query language developed by Neo4j.
GraphQL
an SQL-like query language
Gremlin
an open source graph programming language that can be used with various graph databases (Neo4j, OrientDB).
SPARQL
query language specified by the W3C for RDF data models.
Distinction from relational databases

When we want to store graphs in relational databases, this is usually only done for specific conditions, e.g. for relationships between people. Adding more types of relationships then usually involves many schema changes.

In graph databases, traversing the links or relationships is very fast because the relationship between nodes doesn’t have to be calculated at query time.

Some of the most popular graph databases are

Neo4j
Home | GitHub | Docs
InfiniteGraph
Home

Selecting the NoSQL database

What all NoSQL databases have in common is that they don’t enforce a particular schema. Unlike strong-schema relational databases, schema changes do not need to be stored along with the source code that accesses those changes. Schema-less databases can tolerate changes in the implied schema, so they do not require downtime to migrate; they are therefore especially popular for systems that need to be available 24/7.

But how do we choose the right NoSQL database from so many? In the following we can only give you some general criteria:

Key-value databases
are generally useful for storing sessions, user profiles and settings. However, if relationships between the stored data are to be queried or multiple keys are to be edited simultaneously, we would avoid key-value databases.
Document databases
are generally useful for content management systems and e-commerce applications. However, we would avoid using document databases if complex transactions are required or multiple operations or queries are to be made for different aggregate structures.
Column Family Stores
are generally useful for content management systems, and high volume writes such as log aggregation. We would avoid using Column Family Stores databases that are in early development and whose query patterns may still change.
Graph databases
are well suited for problem areas where we need to connect data such as social networks, geospatial data, routing information as well as recommender system.

Conclusion

The rise of NoSQL databases did not lead to the demise of relational databases. They can coexist well. Often, different data storage technologies are used to store the data to match your structure and required query.

Criteria for safe and sustainable software

Open Source
The best way to check how secure your data is against unauthorised access is to use open source software.
Virtual Private Network
This is usually the basis for accessing a company network from outside. However, do not blindly trust the often false promises of VPN providers, but use open source programmes such as OpenVPN or WireGuard.
Remote desktop software
Remotely is a good open source alternative to TeamViewer or AnyDesk.
Configuration

Even with open-source software, check whether the default settings are really privacy-friendly:

For example, Jitsi Meet creates external connections to gravatar.com and logs far too much information with the INFO logging level. Previous Jitsi apps also tied in the trackers Google CrashLytics, Google Firebase Analytics and Amplitude. Run your own STUN servers if possible, otherwise meet-jit-si-turnrelay.jitsi.net is used.

Encryption methods

Here you should distinguish between transport encryption – ideally end-to-end – and encryption of stored data.

The synchronisation software Syncthing, for example, uses both TLS and Perfect Forward Secrecy to protect communication.

You should be informed if the fingerprint of a key changes.

Metadata
Make sure that communication software avoids or at least protects metadata; it can tell a lot about users’ lives.
Audits
Even the security risks of open source software can only be detected by experts. Use software that has successfully passed a security audit.
Tracker

Smartphone apps often integrate a lot of trackers that pass on data to third parties such as Google or Facebook without the user’s knowledge. εxodus Privacy is a website that analyses Android apps and shows which trackers are included in an app.

It also checks whether the permissions requested by an app fit the intended use. For example, it is incomprehensible why messengers such as Signal, Telegram and WhatsApp compulsorily require the entry of one’s own telephone number.

Malvertising

Avoid apps that embed advertising and thus pose the risk of malicious code advertising. Furthermore, tracking companies can evaluate and market the activities of users via embedded advertising.

There are numerous tools such as uBlock Origin for Firefox, Blokada for Android and iOS or AdGuard Pro for iOS that prevent the delivery of advertising and the leakage of personal data. With HttpCanary for Android apps and Charles Proxy for iOS apps, users can investigate for themselves how apps behave unless the app developers resort to certificate pinning. Burp Suite intercepts much more than just data packets and can also bypass certificate pinning.

Decentralised data storage
It is safest if data is stored decentrally. If this is not possible, federated systems, such as email infrastructure, are preferable to centralised ones.
Financial transparency
If there are companies behind open source software, they should be transparent about their finances and financial interests in the software. A good example in this respect is Delta Chat.
Availability
If an Android app is available, for example, only via Google's Play Store or also via the more privacy-friendly F-Droid Store.
Data economy
When selecting software, check not only whether it meets all functional requirements, but also whether it stores only the necessary data.
Data synchronisation
Data from a software should be able to be synchronised between multiple devices without the need for a central server to mediate it. For example, we sync our KeePass database directly between our devices using Syncthing and not via WebDAV or Nextcloud. This means that password data is not cached anywhere, but only stored where it is needed.
Backup
To ensure that all relevant data is securely available for the entire period of use, backup copies should be made. These should be stored in a safe place that is also legally permissible. The backup should also be automatic and the backups should be encrypted.

Microsoft alternatives – migration to open source technologies

We are developing a migration strategy from Microsoft 365 to open source technologies for a large German research company. On the one hand, it’s a question of regaining cyber souvereignty and, on the other hand, of meeting increased security requirements. This is to be achieved by using free and open source software (FLOSS). Overall, the project is similar to the Microsoft Alternatives Project (MAlt) at CERN and the Project Phoenix of Dataport.

The principles of the project are:

  • The same service should be offered to all employees
  • Vendor lock-ins should be avoided in order to reduce the risk of dependency
  • Most of the data should be owned by the research company

We evaluate alternative solutions for many services, implement prototypes and pilot projects.

Product group Service Product to evaluate Status
Identity and access management LDAP OpenLDAP 🏭 Production
Personal Information Management eMail Zimbra 🚦 Evaluation
Calendar Zimbra 🚦 Evaluation
Contacts Zimbra 🚦 Evaluation
Collaboration File sharing NextCloud 🏭 Production
Office integration NextCloud 🚦 Evaluation
Direct mails/chat Mattermost 🏭 Production
Video conferences Jitsi Meet 🏭 Production
Search Search engine OpenSearch 🚦 Evaluation
Frontend/ Visualisierung OpenSearch Dashboards 🚦 Evaluation
Authentication/ Access control Open Distro Security 🚦 Evaluation
k-nearest neighbors Open Distro KNN 🚦 Evaluation
Project management Issues/Milestones GitLab 🚦 Evaluation
Time tracking gitlabtime 🚦 Evaluation
Documentation GitLab Wiki 🚦 Evaluation
Research software [1], [2] Package manager Spack 🏭 Production
IDE JupyterHub 🏭 Production
Development environments Jupyter Kernels 🏭 Production
Software versioning Git 🏭 Production
Data versioning DVC 🏭 Production
Gathering and storing data Intake 🛫 Pilot
Spreadsheet ipysheet 🏭 Production
Geospatial data PostGIS 🏭 Production
Map creation OpenStreetMap 🏭 Production
DevOps GitLab CI/CD Pipeline 🛫 Pilot
Documentation Sphinx 🏭 Production

Hosting strategy

There are essentially three different hosting variants within the research company:

Society-wide infrastructure
Infrastructure, which is used by most of the research projects and administrations across the institutes, is to be provided by the research company’s central IT.
Institute-wide infrastructure
Infrastructure that is required for the special research areas of one institute or that needs IT support on site should be provided by the IT of the respective institute.
Operational and geo-redundancy
These are mainly produced through institute cooperation or through cooperation between individual institutes and the IT of the research society. In terms of technology, floating IPs and the Virtual Router Redundancy Protocol (VRRP) are used for this, with decisions being made on the basis of BGP announcements.

[1]

There is extensive German documentation for the infrastructure on which the research software is developed, which is published under the BSD 3 Clause license:

[2]The planned uniform API represents a significant simplification here; see also Announcing the Consortium for Python Data API Standards.

Security gaps and other incalculable risks at Microsoft – time to switch to open source

Tens of thousands of Microsoft Exchange servers in Germany are vulnerable to attack via the internet and are very likely already infected with malware. But Microsoft 365 and the Azure Cloud are also struggling with massive problems. Suitable open source alternatives could offer a remedy here.

Multiple critical vulnerabilities in Microsoft Exchange servers

The German Federal Office for Information Security (BSI) rates the current vulnerabilities in MS Exchange as a business-critical IT threat situation with massive impairments of regular operations [1]. But that is not enough:

«In many infrastructures, Exchange servers have a large number of permissions in the Active Directory by default (sometimes unjustified). It is conceivable that further attacks with the rights of a taken-over Exchange server could potentially compromise the entire domain with little effort. It must also be taken into account that the vulnerability could also be exploited from the local network if, for example, an attacker gains access to Outlook Web Access via an infected client.»

– BSI: Mehrere Schwachstellen in MS Exchange, 17.03.2021

The Computer Emergency Response Team of the Federal Administration (CERT-Bund) considers a compromise of an Outlook Web Access server (OWA) accessible from the Internet to be probable as of 26 February [2]. If admins now apply the security patches, the known security gaps will be closed, but a possible malware will not be eliminated. In addition, it is necessary to check whether servers in the domain have already been the target of such an attack and, for example, whether backdoors have already been opened.

Even today, about 12,000 of 56,000 Exchange servers with open OWA in Germany are said to be vulnerable to ProxyLogon [3] and another 9,000 Exchange servers have been taken offline or prevented from accessing OWA from the Internet in the last two weeks [4]:

Exchange server with open OWA in Germany. Vulnerability for ProxyLogon (CVE-2021-26855 et al.). Status: 16.03.2021

Source: CERT-Bund

Only a month ago, the CERT-Bund warned that even one year after the release of the security update, 31-63% of Exchange servers in Germany with OWA openly accessible from the internet were still vulnerable to the critical vulnerability CVE-2020-0688 [5].

Problems also with Microsoft 365 and Azure Cloud

Several Microsoft services failed on the evening of 15 March 2021. The problems were caused by an automatically deleted key in the Azure Active Directory (AAD). This prevented users from logging in for hours. This also affected other services that are based on the Azure Cloud, including Microsoft Teams, XBox Streams and Microsoft Dynamics. In September 2020 and February 2019, there were already outages of Microsoft cloud services lasting several hours, which were also due to errors in the Azure Active Directory Service (AAD) [6].

However, these were by no means the only problems with Microsoft 365 and the Azure Cloud: according to a study by Sapio Research on behalf of Vectra AI [7] of 1,112 IT security decision-makers in companies that use Office 365 and have more than a thousand employees, attackers can take over Office 365 accounts in most companies.

The reason for the study was that remote working has become normal as a result of the global Corona pandemic and companies have quickly moved to the cloud. Office 365 has been a common choice. However, with 258 million users, they have also become a tempting target for cyber attacks. According to the survey, in the last 12 months

  • 82% have experienced an increased security risk to their organisation
  • 70% have seen the takeover of accounts of authorised users, with an average of seven accounts hijacked
  • 58% have seen a widening gap between the capabilities of attackers and defenders.

Most respondents see a growing threat in moving data to the cloud. Above all, the effort required to secure the infrastructure was initially underestimated.

Microsoft vs. privacy

Finally, the State Commissioner for Data Protection in Mecklenburg-Western Pomerania and the State Audit Office are now demanding that the state government stop using Microsoft products with immediate effect:

«Among others, products of the company Microsoft are affected. However, a legally compliant use of these products solely on the basis of standard data protection clauses is not possible due to the principles established by the European Court of Justice. Without further security measures, personal data would be transmitted to servers located in the USA.»

– Press release of the State Commissioner for Data Protection and Freedom of Information Mecklenburg-Western Pomerania, 17.03.2021 [8].

In fact, however, this demand does not come as a surprise: The Conference of the Independent Data Protection Authorities of the Federation and the States already pointed out these dangers in 2015 [9].

Consequently, the State Audit Office of Mecklenburg-Western Pomerania concludes that software that cannot be used in compliance with the law cannot be operated economically.

Outlook

The news of the last few days make it clear that Microsoft can be operated neither securely nor reliably even if the operating costs were significantly increased. Finally, Microsoft services are not likely to be used in a legally compliant manner in Europe. With open source alternatives, more transparency, security and digital sovereignty could be achieved again. For example, last year we began advising a large German research institution on alternatives to Microsoft 365: Microsoft alternatives – migration to open source technologies. In the process, we are gradually replacing Microsoft products with OpenSource products over several years.

Update: 13 April 2021

There are again vulnerabilities in Microsoft Exchange Server, for which Microsoft has published updates for Exchange Server as part of its patch day [10]. This should be installed immediately. The BSI considers this to be a security vulnerability with increased monitoring of anomalies with temporary impairment of regular operation [11].

Update: 23. Juli 2021

In the press release, the State Commissioner for Data Protection of Lower Saxony, Barbara Thiel, continues to be critical of the use of Microsoft 365:

«We have not yet issued a corresponding order or prohibition, but it is true that we regard the use of these products as very critical. … Due to the overall situation described I can still only strongly advise against the use of Office 365 from the data protection perspective.»

– Press release of the State Commissioner for Data Protection of Lower Saxony, Barbara Thiel, 22 July 2021 [12]