Open Source

Choosing the right NoSQL database

Relational databases dominated the software industry for a long time and are very mature, with mechanisms such as redundancy, transaction control and standard interfaces. However, they were initially able to respond only moderately well to growing demands on scalability and performance. From about 2010 onwards, the term NoSQL was therefore increasingly used to describe new types of databases that met these requirements better.

NoSQL databases were intended to solve the following problems:

Bridging the internal data structure of the application and the relational data structure of the database.
Moving away from the integration of a wide variety of data structures into a uniform data model.
Handling the growing amount of data, which increasingly required clusters for data storage.

Aggregated data models

Relational database modelling is very different from the data structures that application developers use. Modelling data structures around the problem domains that developers work on has led to a move away from relational modelling towards aggregate models, much of it inspired by Domain-Driven Design. An aggregate is a collection of data that we interact with as a unit. These aggregates form the boundaries for ACID operations; key-value, document and column family databases can all be seen as forms of aggregate-oriented database.

Aggregates make it easier for the database to manage data storage on a cluster, as each unit of data can now reside on any machine. Aggregate-oriented databases work best when most data interactions involve the same aggregate, e.g. when a profile needs to be retrieved with all its details. In that case it is better to store the profile as a single aggregate and use that aggregate to retrieve the profile details.
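As a minimal sketch of what such an aggregate could look like, the following Python snippet stores a user profile together with its details as one unit and reads it back with a single lookup. The field names and the in-memory `store` dictionary are assumptions for illustration and do not correspond to any particular database.

```python
import json

# Hypothetical profile aggregate: everything that is read together
# is stored together under a single key.
profile = {
    "id": "user:42",
    "name": "Jane Doe",
    "addresses": [
        {"type": "home", "city": "Berlin"},
        {"type": "work", "city": "Potsdam"},
    ],
    "settings": {"language": "en", "newsletter": False},
}

# A plain dict stands in for an aggregate-oriented database here.
store = {}
store[profile["id"]] = json.dumps(profile)

# One lookup returns the whole aggregate; no joins are required.
loaded = json.loads(store["user:42"])
cities = [address["city"] for address in loaded["addresses"]]
print(cities)
```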

Distribution models

Aggregate-oriented databases facilitate the distribution of data because the distribution mechanism only has to move the aggregate and doesn’t have to worry about related data, since all related data is contained in the aggregate itself. There are two main types of data distribution:

Sharding

Sharding distributes different data across multiple servers so that each server acts as a single source for a subset of data.
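One simple way to decide which server owns which data is to hash the key. The sketch below assumes three hypothetical servers and uses a stable hash from `hashlib`, because Python's built-in `hash()` for strings differs between processes. It is an illustration of the idea, not how any specific database assigns shards.

```python
import hashlib

def shard_for(key: str, servers: list[str]) -> str:
    """Pick the server responsible for a key via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["db-0", "db-1", "db-2"]  # hypothetical server names
placement = {key: shard_for(key, servers) for key in ("user:1", "user:2", "user:3")}
print(placement)
```

Note that plain modulo hashing moves most keys to a different server when the number of servers changes; schemes such as consistent hashing are commonly used to limit that reshuffling.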

Replication

Replication copies data across multiple servers so that the same data can be found in multiple locations. Replication takes two forms:

Master-slave replication makes one node the authoritative copy, processing writes, while slaves are synchronised with the master and may process reads.

Peer-to-peer replication allows writes to any node. Nodes coordinate to synchronise their copies of the data.

Master-slave replication reduces the likelihood of update conflicts, but peer-to-peer replication avoids writing all operations to a single server, thus avoiding a single point of failure. A system can use one or both techniques.

CAP Theorem

In distributed systems, the following three aspects are important:

Consistency
Availability
Partition tolerance

Eric Brewer formulated the CAP theorem, which states that a distributed system can guarantee at most two of these three properties at the same time. Many NoSQL databases offer settings that let you tune this trade-off to your requirements. For example, Riak, a distributed key-value database, essentially has three variables:

r: Number of nodes that must respond to a read request before it is considered successful
w: Number of nodes that must respond to a write request before it is considered successful
n: Number of nodes on which the data is replicated, also called the replication factor

In a Riak cluster with 5 nodes, we can adjust the values for r, w and n so that the system is very consistent by setting r = 5 and w = 5. However, by doing this we have made the cluster vulnerable to network partitions, as no write is possible if only one node is not responding. We can make the same cluster highly available for writes or reads by setting r = 1 and w = 1. However, now consistency may be affected as some nodes may not have the latest copy of the data. The CAP theorem states that when you get a network partition, you have to balance the availability of data against the consistency of data. Durability can also be weighed against latency, especially if you want to survive failures with replicated data.
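The trade-off between these settings can be made concrete with the usual quorum-overlap rule: if r + w > n, every read set shares at least one replica with every write set. The following sketch only evaluates that rule for a few combinations; the function names are hypothetical, and how Riak itself resolves conflicting versions is not modelled here.

```python
def overlapping_quorums(r: int, w: int, n: int) -> bool:
    """True if every read set shares at least one replica with every write set."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n

def tolerated_failures(quorum: int, n: int) -> int:
    """Number of replicas that may be unreachable while the quorum can still be met."""
    return n - quorum

n = 5
for r, w in [(5, 5), (1, 1), (3, 3)]:
    print(r, w, overlapping_quorums(r, w, n), tolerated_failures(w, n))
```

With r = w = 5 the quorums overlap but no replica may fail; with r = w = 1 four replicas may fail but reads can miss the latest write; r = w = 3 is a common middle ground.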

With relational databases you often needed little understanding of these trade-offs; now they become important again. You may have been used to relying on transactions in relational databases. In many NoSQL databases, however, transactions are not available in the same form, and you have to think about how they should be implemented. Do writes have to be transaction-safe? Or is it acceptable for data to be lost from time to time? Finally, an external coordination service such as ZooKeeper can sometimes be helpful.

Different types of NoSQL databases

NoSQL databases can be roughly divided into four types:

Key-value databases

Key-value databases are the simplest NoSQL data stores from an API perspective. The client can either retrieve the value for the key, enter a value for a key or delete a key from the data store. The value is a blob that the datastore just stores without caring or knowing what is inside. It is solely the responsibility of the application to understand what has been stored. Because key-value databases always use primary key access, they generally have high performance and can be easily scaled.
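The three operations described above can be sketched with a tiny in-memory class. This is only a stand-in to show the shape of the API; the class and method names are assumptions and do not mirror a specific product.

```python
from typing import Optional

class KeyValueStore:
    """Minimal in-memory stand-in for a key-value database."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store treats the value as an opaque blob.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:abc", b'{"user": 42}')
value = store.get("session:abc")  # the application must know this is JSON
store.delete("session:abc")
```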

Some of the most popular key-value databases are

Riak KV
Redis
Memcached
Berkeley DB
Upscaledb

You need to choose them carefully as there are big differences between them. For example, while Riak stores data persistently, Memcached usually does not.

Document databases

These databases store and retrieve documents, which may be XML, JSON, BSON, etc. These documents are hierarchical tree data structures that can consist of maps, collections and scalar values. Document databases provide rich query languages and constructs such as databases, indexes, etc. that allow for an easier transition from relational databases.

Some of the most popular document databases are

MongoDB
CouchDB
RavenDB
Elasticsearch
eXist

Column Family Stores

These databases store data in column families as rows assigned to a row key. They are excellent for groups of related data that are frequently accessed together. For example, this could be all of a person’s profile information, but not their activities.

Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. In Column Family Stores, however, the different rows do not have to have the same columns.
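A rough mental model is a nested mapping from row key to columns, where each row carries only the columns it actually has. The sketch below uses hypothetical data and a hypothetical helper function; real column family stores add sorting, timestamps and on-disk layout on top of this.

```python
# Hypothetical column family "profiles": row key -> {column name: value}.
profiles = {
    "user:1": {"name": "Jane", "city": "Berlin", "email": "jane@example.org"},
    "user:2": {"name": "Ali", "phone": "+49 30 000000"},  # different columns
}

def read_columns(column_family, row_key, columns):
    """Return only the requested columns that exist in the given row."""
    row = column_family.get(row_key, {})
    return {name: row[name] for name in columns if name in row}

result = read_columns(profiles, "user:2", ["name", "city"])
print(result)
```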

Some of the most popular Column Family Stores are

Cassandra
HBase
Hypertable

Cassandra can be described as fast and easily scalable because writes are distributed across the cluster. The cluster does not have a master node, so reads and writes can be performed by any node in the cluster.

Graph databases

In graph databases you can store entities with certain properties and relationships between these entities. Entities are also called nodes. Think of a node as an instance of an object in an application; relationships can then be called edges, which can also have properties and are directed.
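To make the terms concrete, here is a small sketch of nodes and directed edges that both carry properties. The class names, labels and sample data are assumptions for illustration; a graph database would store and index these structures for you.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: str  # key of the start node
    target: str  # key of the end node
    label: str   # relationship type
    properties: dict = field(default_factory=dict)

nodes = {
    node.key: node
    for node in (
        Node("alice", {"name": "Alice"}),
        Node("bob", {"name": "Bob"}),
        Node("acme", {"name": "ACME"}),
    )
}
edges = [
    Edge("alice", "bob", "KNOWS", {"since": 2019}),
    Edge("alice", "acme", "WORKS_AT"),
]

def outgoing(node_key, label):
    """Follow all edges of one type that start at the given node."""
    return [nodes[e.target] for e in edges if e.source == node_key and e.label == label]

friends = [node.properties["name"] for node in outgoing("alice", "KNOWS")]
print(friends)
```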

Graph models

Labelled Property Graph: In a labelled property graph, both nodes and edges can have properties.
Resource Description Framework (RDF): In RDF, graphs are represented using triples. A triple consists of three elements in the form node-edge-node (subject --predicate-> object), which are defined as resources in the form of a globally unique URI or as an anonymous resource. In order to manage different graphs within one database, triples are stored as quads, where a quad extends each triple by a reference to the graph it belongs to. Building on RDF, RDF Schema provides a vocabulary for formalising lightweight ontologies, and the Web Ontology Language (OWL) can additionally describe fully decidable ontologies.
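The triple and quad structure can be sketched with plain tuples. The URIs below are hypothetical, and the `match` helper is only a toy pattern matcher in the spirit of a triple pattern; it is not an RDF library or a SPARQL implementation.

```python
# Each quad is (subject, predicate, object, graph); all URIs are hypothetical.
EX = "http://example.org/"
quads = [
    (EX + "alice", EX + "knows", EX + "bob", EX + "graph/social"),
    (EX + "bob", EX + "knows", EX + "carol", EX + "graph/social"),
    (EX + "alice", EX + "worksAt", EX + "acme", EX + "graph/work"),
]

def match(quads, s=None, p=None, o=None, g=None):
    """Return all quads matching the pattern; None acts as a wildcard."""
    pattern = (s, p, o, g)
    return [
        quad
        for quad in quads
        if all(want is None or want == got for want, got in zip(pattern, quad))
    ]

known_by_alice = [o for (_, _, o, _) in match(quads, s=EX + "alice", p=EX + "knows")]
print(known_by_alice)
```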

Algorithms

Important algorithms for querying nodes and edges are:

Breadth-first search, depth-first search: Breadth-first search (BFS) is a method for traversing the nodes of a graph. In contrast to depth-first search (DFS), all nodes that can be reached directly from the initial node are traversed first. Only then are subsequent nodes traversed.
Shortest path: Path between two different nodes of a graph, which has minimum length with respect to an edge weight function.
Eigenvector: In linear algebra, a vector different from the zero vector, whose direction is not changed by the mapping. An eigenvector is therefore only scaled and the scaling factor is called the eigenvalue of the mapping.
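The traversal and shortest-path ideas above can be sketched on a small adjacency mapping. The graph and its weights are made up for illustration, and the shortest-path function uses Dijkstra's algorithm, which assumes non-negative edge weights.

```python
import heapq
from collections import deque

# Hypothetical weighted, directed graph: node -> {neighbour: edge weight}.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"D": 5},
    "C": {"D": 1},
    "D": {},
}

def bfs_order(graph, start):
    """Visit all direct neighbours first, then their neighbours."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order

def dfs_order(graph, start, seen=None):
    """Follow one branch as deep as possible before backtracking."""
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for neighbour in graph[start]:
        if neighbour not in seen:
            order.extend(dfs_order(graph, neighbour, seen))
    return order

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (total weight, path) or None if unreachable."""
    heap = [(0, start, [start])]
    done = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == goal:
            return dist, path
        if node in done:
            continue
        done.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in done:
                heapq.heappush(heap, (dist + weight, neighbour, path + [neighbour]))
    return None

print(bfs_order(graph, "A"))
print(dfs_order(graph, "A"))
print(shortest_path(graph, "A", "D"))
```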

Query languages

Blueprints: a Java API for property graphs that can be used together with various graph databases.
Cypher: a query language developed by Neo4j.
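GraphQL: a query language for APIs; despite its name, it is neither SQL-like nor specific to graph databases, although some graph databases offer GraphQL integrations.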
Gremlin: an open source graph programming language that can be used with various graph databases (Neo4j, OrientDB).
SPARQL: query language specified by the W3C for RDF data models.

Distinction from relational databases

When we want to store graphs in relational databases, this is usually only done for specific conditions, e.g. for relationships between people. Adding more types of relationships then usually involves many schema changes.

In graph databases, traversing the links or relationships is very fast because the relationship between nodes doesn’t have to be calculated at query time.

Some of the most popular graph databases are

Neo4j
InfiniteGraph

Selecting the NoSQL database

What all NoSQL databases have in common is that they don’t enforce a particular schema. Unlike with strict-schema relational databases, schema changes do not have to be rolled out together with the source code that depends on them. Schema-less databases can tolerate changes in the implied schema, so they do not require downtime for migrations; they are therefore especially popular for systems that need to be available 24/7.

But how do we choose the right NoSQL database from so many? In the following we can only give you some general criteria:

Key-value databases: are generally useful for storing sessions, user profiles and settings. However, if relationships between the stored data are to be queried or multiple keys are to be modified at the same time, we would avoid key-value databases.
Document databases: are generally useful for content management systems and e-commerce applications. However, we would avoid document databases if complex transactions spanning multiple operations are required or if queries have to run against varying aggregate structures.
Column Family Stores: are generally useful for content management systems and for high-volume writes such as log aggregation. We would avoid Column Family Stores for systems that are in early development and whose query patterns may still change.
Graph databases: are well suited for problem areas where we need to connect data, such as social networks, geospatial data, routing information and recommender systems.

Conclusion

The rise of NoSQL databases did not lead to the demise of relational databases. They can coexist well. Often, different data storage technologies are used side by side, each chosen to match the structure of the data and the queries required.

Criteria for safe and sustainable software

Open Source

The best way to check how secure your data is against unauthorised access is to use open source software.

Virtual Private Network

This is usually the basis for accessing a company network from outside. However, do not blindly trust the often false promises of VPN providers, but use open source programmes such as OpenVPN or WireGuard.

Remote desktop software

Remotely is a good open source alternative to TeamViewer or AnyDesk.

Configuration

Even with open-source software, check whether the default settings are really privacy-friendly:

For example, Jitsi Meet creates external connections to gravatar.com and logs far too much information with the INFO logging level. Previous Jitsi apps also tied in the trackers Google CrashLytics, Google Firebase Analytics and Amplitude. Run your own STUN servers if possible, otherwise meet-jit-si-turnrelay.jitsi.net is used.

Encryption methods

Here you should distinguish between transport encryption – ideally end-to-end – and encryption of stored data.

The synchronisation software Syncthing, for example, uses both TLS and Perfect Forward Secrecy to protect communication.

You should be informed if the fingerprint of a key changes.
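What such a fingerprint check amounts to can be sketched in a few lines: store the fingerprint seen at first contact and compare it on every later connection. The key bytes below are placeholders, and SHA-256 over the raw key bytes is an assumption; real tools define their own fingerprint formats.

```python
import hashlib
import hmac

def fingerprint(public_key: bytes) -> str:
    """Hex-encoded SHA-256 digest of the key material."""
    return hashlib.sha256(public_key).hexdigest()

def key_changed(pinned_fingerprint: str, presented_key: bytes) -> bool:
    """True if the presented key no longer matches the pinned fingerprint."""
    return not hmac.compare_digest(pinned_fingerprint, fingerprint(presented_key))

# Placeholder key bytes, pinned at first contact.
pinned = fingerprint(b"example public key bytes")

if key_changed(pinned, b"another key"):
    print("Warning: the key fingerprint has changed")
```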

Metadata

Make sure that communication software avoids or at least protects metadata; it can tell a lot about users’ lives.

Audits

Even the security risks of open source software can only be detected by experts. Use software that has successfully passed a security audit.

Tracker

Smartphone apps often integrate a lot of trackers that pass on data to third parties such as Google or Facebook without the user’s knowledge. εxodus Privacy is a website that analyses Android apps and shows which trackers are included in an app.

It also checks whether the permissions requested by an app fit the intended use. For example, it is incomprehensible why messengers such as Signal, Telegram and WhatsApp compulsorily require the entry of one’s own telephone number.

Malvertising

Avoid apps that embed advertising and thus pose the risk of malicious code advertising. Furthermore, tracking companies can evaluate and market the activities of users via embedded advertising.

There are numerous tools such as uBlock Origin for Firefox, Blokada for Android and iOS or AdGuard Pro for iOS that prevent the delivery of advertising and the leakage of personal data. With HttpCanary for Android apps and Charles Proxy for iOS apps, users can investigate for themselves how apps behave unless the app developers resort to certificate pinning. Burp Suite intercepts much more than just data packets and can also bypass certificate pinning.

Decentralised data storage

It is safest if data is stored decentrally. If this is not possible, federated systems, such as email infrastructure, are preferable to centralised ones.

Financial transparency

If there are companies behind open source software, they should be transparent about their finances and financial interests in the software. A good example in this respect is Delta Chat.

Availability

Check, for example, whether an Android app is available only via Google's Play Store or also via the more privacy-friendly F-Droid store.

Data economy

When selecting software, check not only whether it meets all functional requirements, but also whether it stores only the necessary data.

Data synchronisation

It should be possible to synchronise an application’s data between multiple devices without a central server having to mediate. For example, we sync our KeePass database directly between our devices using Syncthing and not via WebDAV or Nextcloud. This means that password data is not cached anywhere, but only stored where it is needed.

Backup

To ensure that all relevant data is securely available for the entire period of use, backup copies should be made. These should be stored in a safe place that is also legally permissible. The backup should also be automatic and the backups should be encrypted.

Microsoft alternatives – migration to open source technologies

We are developing a migration strategy from Microsoft 365 to open source technologies for a large German research company. On the one hand, it is a question of regaining digital sovereignty and, on the other, of meeting increased security requirements. This is to be achieved by using free/libre and open source software (FLOSS). Overall, the project is similar to the Microsoft Alternatives project (MAlt) at CERN and Dataport’s Project Phoenix.

The principles of the project are:

The same service should be offered to all employees
Vendor lock-ins should be avoided in order to reduce the risk of dependency
Most of the data should be owned by the research company

We evaluate alternative solutions for many services, implement prototypes and pilot projects.

Product group	Service	Product to evaluate	Status
Identity and access management	LDAP	OpenLDAP	🏭 Production
Personal Information Management	Email	Zimbra	🚦 Evaluation
	Calendar	Zimbra	🚦 Evaluation
	Contacts	Zimbra	🚦 Evaluation
Collaboration	File sharing	Nextcloud	🏭 Production
	Office integration	Nextcloud	🚦 Evaluation
	Direct messages/chat	Mattermost	🏭 Production
	Video conferences	Jitsi Meet	🏭 Production
Search	Search engine	OpenSearch	🚦 Evaluation
	Frontend/visualisation	OpenSearch Dashboards	🚦 Evaluation
	Authentication/ Access control	Open Distro Security	🚦 Evaluation
	k-nearest neighbors	Open Distro KNN	🚦 Evaluation
Project management	Issues/Milestones	GitLab	🚦 Evaluation
	Time tracking	gitlabtime	🚦 Evaluation
	Documentation	GitLab Wiki	🚦 Evaluation
Research software [1], [2]	Package manager	Spack	🏭 Production
	IDE	JupyterHub	🏭 Production
	Development environments	Jupyter Kernels	🏭 Production
	Software versioning	Git	🏭 Production
	Data versioning	DVC	🏭 Production
	Gathering and storing data	Intake	🛫 Pilot
	Spreadsheet	ipysheet	🏭 Production
	Geospatial data	PostGIS	🏭 Production
	Map creation	OpenStreetMap	🏭 Production
	DevOps	GitLab CI/CD Pipeline	🛫 Pilot
	Documentation	Sphinx	🏭 Production

Hosting strategy

There are essentially three different hosting variants within the research company:

Company-wide infrastructure: Infrastructure that is used by most of the research projects and administrations across the institutes is to be provided by the research company’s central IT.
Institute-wide infrastructure: Infrastructure that is required for the special research areas of one institute or that needs IT support on site should be provided by the IT of the respective institute.
Operational and geo-redundancy: These are mainly achieved through cooperation between institutes or between individual institutes and the research company’s central IT. Technically, floating IPs and the Virtual Router Redundancy Protocol (VRRP) are used for this, with decisions being made on the basis of BGP announcements.

[1]	There is extensive German-language documentation for the infrastructure on which the research software is developed, published under the BSD 3-Clause license: Jupyter Tutorial and PyViz Tutorial.

[2]	The planned uniform API represents a significant simplification here; see also Announcing the Consortium for Python Data API Standards.

Security gaps and other incalculable risks at Microsoft – time to switch to open source

Tens of thousands of Microsoft Exchange servers in Germany are vulnerable to attack via the internet and are very likely already infected with malware. But Microsoft 365 and the Azure Cloud are also struggling with massive problems. Suitable open source alternatives could offer a remedy here.

Multiple critical vulnerabilities in Microsoft Exchange servers

The German Federal Office for Information Security (BSI) rates the current vulnerabilities in MS Exchange as a business-critical IT threat situation with massive impairment of regular operations [1]. And that is not all:

«In many infrastructures, Exchange servers have a large number of permissions in the Active Directory by default (sometimes unjustified). It is conceivable that further attacks with the rights of a taken-over Exchange server could potentially compromise the entire domain with little effort. It must also be taken into account that the vulnerability could also be exploited from the local network if, for example, an attacker gains access to Outlook Web Access via an infected client.»

– BSI: Mehrere Schwachstellen in MS Exchange, 17.03.2021

The Computer Emergency Response Team of the Federal Administration (CERT-Bund) considers it likely that Outlook Web Access (OWA) servers accessible from the internet have been compromised from 26 February onwards [2]. If admins apply the security patches now, the known security gaps will be closed, but any malware that has already been installed will not be removed. In addition, it is necessary to check whether servers in the domain have already been the target of such an attack and whether, for example, backdoors have already been opened.

Even today, about 12,000 of 56,000 Exchange servers with open OWA in Germany are said to be vulnerable to ProxyLogon [3] and another 9,000 Exchange servers have been taken offline or prevented from accessing OWA from the Internet in the last two weeks [4]:

Figure: Exchange servers with open OWA in Germany and their vulnerability to ProxyLogon (CVE-2021-26855 et al.). Status: 16.03.2021

Source: CERT-Bund

Only a month ago, the CERT-Bund warned that even one year after the release of the security update, 31-63% of Exchange servers in Germany with OWA openly accessible from the internet were still vulnerable to the critical vulnerability CVE-2020-0688 [5].

Problems also with Microsoft 365 and Azure Cloud

Several Microsoft services failed on the evening of 15 March 2021. The problems were caused by an automatically deleted key in the Azure Active Directory (AAD). This prevented users from logging in for hours. This also affected other services that are based on the Azure Cloud, including Microsoft Teams, XBox Streams and Microsoft Dynamics. In September 2020 and February 2019, there were already outages of Microsoft cloud services lasting several hours, which were also due to errors in the Azure Active Directory Service (AAD) [6].

However, these were by no means the only problems with Microsoft 365 and the Azure Cloud: according to a study by Sapio Research on behalf of Vectra AI [7] of 1,112 IT security decision-makers in companies that use Office 365 and have more than a thousand employees, attackers can take over Office 365 accounts in most companies.

The reason for the study was that remote working has become normal as a result of the global coronavirus pandemic and companies have moved to the cloud quickly. Office 365 has been a common choice. However, with 258 million users, it has also become a tempting target for cyber attacks. According to the survey, in the last 12 months

82% have experienced an increased security risk to their organisation
70% have seen the takeover of accounts of authorised users, with an average of seven accounts hijacked
58% have seen a widening gap between the capabilities of attackers and defenders.

Most respondents see a growing threat in moving data to the cloud. Above all, the effort required to secure the infrastructure was initially underestimated.

Microsoft vs. privacy

Finally, the State Commissioner for Data Protection in Mecklenburg-Western Pomerania and the State Audit Office are now demanding that the state government stop using Microsoft products with immediate effect:

«Among others, products of the company Microsoft are affected. However, a legally compliant use of these products solely on the basis of standard data protection clauses is not possible due to the principles established by the European Court of Justice. Without further security measures, personal data would be transmitted to servers located in the USA.»

– Press release of the State Commissioner for Data Protection and Freedom of Information Mecklenburg-Western Pomerania, 17.03.2021 [8].

In fact, however, this demand does not come as a surprise: The Conference of the Independent Data Protection Authorities of the Federation and the States already pointed out these dangers in 2015 [9].

Consequently, the State Audit Office of Mecklenburg-Western Pomerania concludes that software that cannot be used in compliance with the law cannot be operated economically.

Outlook

The news of the last few days makes it clear that Microsoft products can be operated neither securely nor reliably, even if operating costs were increased significantly. Moreover, it is unlikely that Microsoft services can be used in a legally compliant manner in Europe. With open source alternatives, more transparency, security and digital sovereignty could be regained. For example, last year we began advising a large German research institution on alternatives to Microsoft 365: Microsoft alternatives – migration to open source technologies. In the process, we are gradually replacing Microsoft products with open source products over several years.

Update: 13 April 2021

There are new vulnerabilities in Microsoft Exchange Server, for which Microsoft has published updates as part of its patch day [10]. These should be installed immediately. The BSI rates this as an IT threat situation that requires increased monitoring of anomalies and temporarily impairs regular operations [11].

Update: 23 July 2021

In a press release, the State Commissioner for Data Protection of Lower Saxony, Barbara Thiel, remains critical of the use of Microsoft 365:

«We have not yet issued a corresponding order or prohibition, but it is true that we regard the use of these products as very critical. … Due to the overall situation described I can still only strongly advise against the use of Office 365 from the data protection perspective.»

– Press release of the State Commissioner for Data Protection of Lower Saxony, Barbara Thiel, 22 July 2021 [12]

Get in touch

I will be happy to answer your questions and create a customised offer for the migration from Microsoft 365 to open source alternatives.

Veit Schiele
Phone: +49 30 22430082

[1]	BSI: Mehrere Schwachstellen in MS Exchange

[2]	@certbund, 6 Mar. 2021

[3]	ProxyLogon

[4]	@certbund, 17 Mar. 2021

[5]	@certbund, 9 Feb. 2021

[6]	Microsoft-Dienste fielen wegen Authentifizierungs-Fehler aus

[7]	IT Security Changes Amidst the Pandemic. Office 365 research amongst 1,112 enterprises worldwide

[8]	Pressemitteilung des Landesbeauftragten für Datenschutz und Informationsfreiheit Mecklenburg-Vorpommern, 17.03.2021

[9]	90. Konferenz der Datenschutzbeauftragten des Bundes und der Länder: Cloud-unterstützte Betriebssysteme bergen Datenschutzrisiken, 01.10.2015

[10]	Microsoft Security Response Center: April 2021 Update Tuesday packages now available

[11]	BSI: Neue Schwachstellen in Microsoft Exchange Server

[12]	Thiel: Einsatz von Office 365 weiter kritisch – Verantwortliche müssen datenschutzkonforme Kommunikationsstrukturen etablieren

The democratisation of digital maps: How Protomaps is changing the game

In today’s digital landscape, maps have become essential components of countless applications and services, from navigation and logistics to social platforms and data visualisation. But for too long, the field has been dominated by a handful of companies whose services, while powerful, come with significant drawbacks: Usage quotas, tracking requirements, styling limitations, and recurring costs that can quickly skyrocket as applications grow.

Protomaps is an innovative open source map technology that is fundamentally changing the way digital maps are created, distributed and used. At its core, Protomaps utilises the ground-breaking PMTiles format – a single-file approach to vector tiles that eliminates the need for a complex tile server infrastructure while increasing performance and reducing bandwidth consumption.

Technical innovation

Unlike traditional solutions that rely on thousands of individual tile files served from a complex infrastructure, PMTiles bundles vector map data into a single, efficiently indexed file that can be hosted anywhere from traditional web servers to object stores with no special configuration required.

This approach enables progressive loading so that maps can be rendered quickly at variable zoom levels while maintaining the rich detail and interactive features that users expect from modern map solutions.
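Tiled maps address their data by zoom level and tile column and row. As a minimal sketch, and assuming the widely used Web Mercator XYZ scheme (which the text above does not spell out), the following function computes which tile contains a given coordinate; the function name is hypothetical.

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int, int]:
    """Return (zoom, x, y) of the XYZ tile containing a WGS84 coordinate."""
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid range, e.g. for lon == 180.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return zoom, x, y

print(lonlat_to_tile(13.4, 52.52, 10))
```

Each additional zoom level quadruples the number of tiles, which is why a client only requests the few tiles covering the current viewport.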

Democratisation

How is Protomaps democratising digital cartography in practice?

Economic accessibility

By eliminating recurring API costs and usage-based pricing models, Protomaps opens up map functionality to projects of all sizes, from hobby developers to non-profit organisations and educational institutions with limited budgets.

Technical accessibility

Protomaps can be integrated with Leaflet, MapLibre and OpenLayers with just a few lines of code and minimal configuration.

Freedom of customisation

Without the styling restrictions imposed by commercial vendors, Protomaps allows complete creative control over the appearance of maps. Maps can be customised in ways that would be difficult or impossible to achieve with traditional services.

Privacy by design

As Protomaps enables fully self-hosted map solutions, there is no need to share user location data or map activity with third parties – a crucial aspect for privacy-conscious applications and those operating under strict regulatory frameworks.

Real-World Applications

Let’s take a look at various applications where Protomaps can prove to be transformative:

Small municipalities: Cities can replace their commercial mapping system with a Protomaps-based solution to display area information, infrastructure projects and municipal resources. The self-hosted implementation not only eliminates recurring licence costs, but also makes it possible to add local landmarks and municipal boundaries that were previously difficult to highlight with commercial services.
Offline maps: Protomaps makes it possible to create offline-capable mapping tools for areas with spotty or no internet connectivity. By distributing PMTiles files containing detailed local and regional maps, interactive mapping tools can be used without constant internet access.
Privacy-focused applications: For example, healthcare provider networks can use Protomaps to create facility location tools, with user location data stored only on the device.
Specialised tools for businesses: PMTiles enables companies to create highly specialised maps with industry-specific symbology and data visualisation that could not be offered by commercial map providers, while ensuring that the maps are accessible on mobile devices even in remote areas without mobile coverage. For example, in forestry, maps with special vegetation and topography layers can be developed for field workers.

Limitations

PMTiles is intended for the display of large, mostly static data sets in applications

that are based on a web platform and not on a local desktop application.
where the information to be explored totals more than a few megabytes – more than can be loaded at once for a pleasant website experience.
whose data set changes at most daily or never.

If your application does not have these three features, there are simpler alternatives to PMTiles:

GeoJSON

If you are creating a web-based map with static information, but your data is small, you should provide it as a single GeoJSON file.

With MapLibre, it’s as simple as adding a GeoJSON source.

This saves you the hassle of converting your data into tiles, and you can use the same map design and interaction techniques as you would with tiled data.
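Producing such a file needs nothing more than the standard library. The sketch below builds a GeoJSON FeatureCollection from a few made-up points; the place names and coordinates are hypothetical. Note that GeoJSON stores coordinates in longitude, latitude order.

```python
import json

def feature_collection(points):
    """Build a GeoJSON FeatureCollection from (name, lon, lat) tuples."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": name},
            }
            for name, lon, lat in points
        ],
    }

# Hypothetical sample data.
places = [("Town hall", 13.4050, 52.5200), ("Library", 13.3777, 52.5163)]
geojson_text = json.dumps(feature_collection(places), indent=2)
print(geojson_text)
```

The resulting text can be saved as a single file and served from any static web host.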

PostGIS

If you are creating a web-based map for a large dataset that is dynamic and frequently updated by users, you should store your features in a transactional database.

While it is possible to update a PMTiles file regularly, this requires the whole file to be uploaded again each time it changes. While this may be fine for daily updates, any higher frequency requires an inefficient amount of data transfer.

PostGIS is the industry standard for transactional geographic feature databases. Popular methods for retrieving tile data from PostGIS are pg_tileserv, martin and the raw function ST_AsMVT.
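For orientation, a tile query built directly on ST_AsMVT might look roughly like the one generated below. This is a sketch under several assumptions: PostGIS 3.0 or later (for ST_TileEnvelope), geometries already stored in EPSG:3857, psycopg-style named placeholders for z, x and y, and hypothetical table, column and layer names. The function only builds the SQL text; executing it requires a database connection.

```python
def mvt_query(table: str, geom_column: str = "geom", layer: str = "default") -> str:
    """Build a SQL query that returns one vector tile for the tile z/x/y."""
    # Identifiers cannot be passed as query parameters, so validate them strictly.
    for identifier in (table, geom_column, layer):
        if not identifier.isidentifier():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    return f"""
        WITH bounds AS (
            SELECT ST_TileEnvelope(%(z)s, %(x)s, %(y)s) AS geom
        ),
        mvtgeom AS (
            SELECT ST_AsMVTGeom(t.{geom_column}, bounds.geom) AS geom
            FROM {table} AS t, bounds
            WHERE t.{geom_column} && bounds.geom
        )
        SELECT ST_AsMVT(mvtgeom.*, '{layer}') FROM mvtgeom
    """

sql = mvt_query("roads")  # "roads" is a hypothetical table name
print(sql)
```

With a database driver, the parameters would then be supplied separately, for example as `{"z": 10, "x": 550, "y": 335}`.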

A major challenge for web maps, including PostGIS-based maps, is the generalisation of data for tiles at lower zoom levels. One method is to selectively omit data at lower zoom levels, based on attributes, so that overview tiles remain lightweight.

GeoParquet

If you are exploring large, static datasets but do not require publication on the internet, you can avoid tiling and visualise the files directly with desktop software.

Tiling with tools like tippecanoe requires the prior calculation of general overview tiles and is designed for retrieving small, optimised pieces of data over the internet. If the network is not the bottleneck and you have your dataset locally, QGIS is an excellent open source solution for visualisation and map creation.

Together with GeoJSON and FlatGeobuf, GeoParquet is a new format that can efficiently store large datasets and is interoperable with open source data tools. The GeoParquet 1.0.0 specification has been supported since GDAL 3.8.0, the GeoParquet 1.1.0 specification since GDAL 3.9.0.

Lonboard

Tools such as Lonboard enable the visualisation of GeoParquet in Jupyter notebooks. It is possible to publish these on the web using hosted notebooks, although transferring tens or hundreds of megabytes results in more latency than tiled maps. For local data, however, GeoParquet and Lonboard are a great solution for exploratory data analysis, saving you the trouble of converting to a network-optimised, tiled format.

Future directions

The standard libraries continue to evolve:

PMTiles

The main library for processing the PMTiles format

Client-side implementations for JavaScript, Python, Dart, Rust and Go.
Server-side implementations and the pmtiles CLI.

Integrations with common mapping libraries.

Protomaps Basemaps

creates a cartographic ‘base map’ from OpenStreetMap and other data sources as well as MapLibre styles for display in a browser.

basemaps-flavors

Map themes and styles.

However, the ecosystem around Protomaps is also growing:

tippecanoe: Tool for creating vector tiles from GeoJSON and other geodata formats.
PMTiles tile inspector: Tool for analysing and troubleshooting PMTiles files.
osmextract: Tool for extracting regional OpenStreetMap data for use with Protomaps.
Maputnik: An open source visual editor for the MapLibre Style Specification and PMTiles sources.

The focus of Protomaps is exclusively on tile-based cartography and interactive visualisation. However, there are also extensions for geocoding and routing:

Geocoding
- Nominatim
- Photon
- Pelias
Routing
- Valhalla
- OpenTripPlanner