Project Metamorphosis: Unveiling the next-gen event streaming platform.Learn More

How I Learned to Stop Worrying and Love the Schema

The rise in schema-free and document-oriented databases has led some to question the value and necessity of schemas. Schemas, in particular those following the relational model, can seem too restrictive, and the case has been made that software development can be faster and more agile without them. However, just because it’s possible to go without schemas doesn’t mean it’s wise to do so – this sort of local optimization can cause huge headaches within even a small organization.

The Hazards of Many Languages

Imagine for a moment that you work at a company where all employees are required to speak only in their native language. Intercommunication can work, but either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken in the company. Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.

Furthermore, even if your company phases out, say Latin, unless you are willing to discard all Latin records, you’re either stuck employing your Latin translators for the rest of eternity, or with the work of converting all Latin records into a new language.

Compare this to a company which standardized on a single language from the start. Every single form of communication is easier, and every message can be consumed many times at zero extra cost. Although there is an up-front cost in the sense that new employees must already know the language or be trained in it, the payoff is huge and permanent.

Having no standardized way of defining data across an organization presents a similar problem. It may be fine in the short term, but it quickly causes unnecessary difficulties, and it just doesn’t scale when you imagine multiplying by potentially thousands of different categories of data.

Temp-o-meter – A Tale of Woe

Let’s illustrate with a simple example. Say that your company has built a smartphone app called temp-o-meter which collects temperature data and sends it back to HQ. Version 0.1 was produced in a hurry, and produces simple comma separated data points with the format

“device_id, temp_celsius, timestamp, latitude, longitude”

A typical data point might look like this:


Pretty reasonable, but time passes, and the team decides JSON is easier to work with, and temp-o-meter v0.2 produces data like this:

    “device_id”: 123,
    “temperature”: 212,
    “latitude”: 37.386052,
    “longitude”: -122.083851

The problem is, some stage(s) in the downstream pipeline must now have logic to differentiate between CSV and JSON, and this logic must be aware that in the CSV format, temperature readings are in Celsius, but temperature stored in JSON is in Fahrenheit. What’s more, there may be some users who never upgrade their app, so the different versions of this data will continue to be published indefinitely.

Granted, this example is a bit contrived – clearly, for a given type of data, it’s not great to represent it with a mix of formats such as CSV, JSON, or XML, etc. However, just standardizing on a format such as JSON without schemas is not enough. Standardizing on JSON without using schemas is a little like standardizing on the Roman alphabet without standardizing on a language – everyone can easily read and write individual letters, but that still doesn’t guarantee they can read the messages!

Let’s go a little further with the temp-o-meter example and pretend that we live in a science-fiction world where not only phones, but even things like watches can produce streams of data. temp-o-meter needs to be ported, but lucky for the watch team, JSON is now the standard, and the format of temperature data was loosely documented on an obscure wiki page.

Here’s what the temp-o-meter watch team came up with:

    “device_id”: “watch_345”,
    “timestamp”: “Tue 05-17-2015 6:00”,
    “temperature”: 212,
    “latitude”: 37.386052,
    “longitude”: -122.083851

Not so bad on its own. However, although the format is now JSON, and although the field names are identical, the watch team used slightly different data formats in a few of the fields. “device_id” is no longer parseable as an integer, and “timestamp” is a completely different format.

Why is this a problem? Suppose there is an application which consumes data produced by the temp-o-meter v0.2 – it might have a chunk of code like this:

# Consumer parses a chunk of data from temp-o-meter v0.2
data = json.loads(data_chunk)

device_id = int(data[‘device_id’])
timestamp = float(data[‘timestamp’])
temperature = float(data[‘temperature’])
latitude = float(data[‘latitude’])
longitude = float(data[‘longitude’])

This consumer must be upgraded before the watch team releases, otherwise it will be completely broken when it encounters the (unintentionally) new data format. Despite the fact that the watch team and phone teams now both use JSON, different components of the system are now tightly coupled because the data’s ‘schema’ is embedded in both producers and consumers of this data. The ability to safely and independently evolve different components of the system has been hamstrung.

Had this company been using schemas, the watch and phone teams could have simply reused the same schema, avoiding the need for the watch team to reinvent the wheel, and preventing subtle incompatibilities which ultimately break a bunch of downstream consumers. By sharing the schema between watch and phone apps, this unintentional data evolution would have easily been avoided.

DRY Your Data Definitions With Schemas

At this stage in our little story, the ‘definition’ of temp-o-meter’s temperature data is decidedly un-DRY: it is encoded informally in temp-o-meter v0.1, temp-o-meter v0.2, some wiki pages, the watch app, and in all the various consumers which later parse and analyze this data.

On the other hand, by using schemas, the data definition for a particular kind of data exists in a single place. What’s more, schemas serve as self-contained and automatically enforceable contracts between writers and readers of data. Though they don’t remove the need for testing, schemas make testing data compatibility significantly simpler and can nip an entire class of problems in the bud by preventing corrupt, malformed or incompatible data from ever being published in the first place.

For additional compelling reasons to use schemas, it’s worth revisiting this post on Stream Data Platforms. In the next post on schemas, I’ll talk more about how schemas can provide a powerful tool to help evolve data formats in a sane and compatible way.

Did you like this blog post? Share it now

Subscribe to the Confluent blog

More Articles Like This

Getting Started with Protobuf in Confluent Cloud

Confluent Cloud supports Schema Registry as a fully managed service that allows you to easily manage schemas used across topics, with Apache Kafka® as a central nervous system that connects […]

Broadcom Modernizes Machine Learning and Anomaly Detection with ksqlDB

Mainframes are still ubiquitous, used for almost every financial transaction around the world—credit card transactions, billing, payroll, etc. You might think that working on mainframe software would be dull, requiring […]

Confluent Platform Now Supports Protobuf, JSON Schema, and Custom Formats

When Confluent Schema Registry was first introduced, Apache Avro™ was initially chosen as the default format. While Avro has worked well for many users, over the years, we’ve received many […]

Jetzt registrieren

Erhalten Sie in den ersten drei Monaten einen Rabatt von bis zu 50 USD pro Kalendermonat auf Ihre Rechnung.

Nur neue Registrierungen.

By clicking “sign up” above you understand we will process your personal information in accordance with our und bin damit einverstanden.

Indem Sie oben auf „Registrieren“ klicken, akzeptieren Sie die Nutzungsbedingungen und den gelegentlichen Erhalt von Marketing-E-Mails von Confluent. Zudem ist Ihnen bekannt, dass wir Ihre personenbezogenen Daten gemäß unserer und bin damit einverstanden.

Auf einem einzigen Kafka Broker unbegrenzt kostenlos verfügbar

Die Software ermöglicht die unbegrenzte Nutzung der kommerziellen Funktionen auf einem einzelnen Kafka Broker. Nach dem Hinzufügen eines zweiten Brokers startet automatisch ein 30-tägiger Timer für die kommerziellen Funktionen, der auch durch ein erneutes Herunterstufen auf einen einzigen Broker nicht zurückgesetzt werden kann.

Wählen Sie den Implementierungstyp aus
Manual Deployment
  • tar
  • zip
  • deb
  • rpm
  • docker
Automatische Implementierung
  • kubernetes
  • ansible

By clicking "download free" above you understand we will process your personal information in accordance with our Datenschutzerklärung zu.

Indem Sie oben auf „kostenlos herunterladen“ klicken, akzeptieren Sie die Confluent-Lizenzvertrag und den gelegentlichen Erhalt von Marketing-E-Mails von Confluent. Zudem erklären Sie sich damit einverstanden, dass wir Ihre personenbezogenen Daten gemäß unserer Datenschutzerklärung zu.

Diese Website verwendet Cookies zwecks Verbesserung der Benutzererfahrung sowie zur Analyse der Leistung und des Datenverkehrs auf unserer Website. Des Weiteren teilen wir Informationen über Ihre Nutzung unserer Website mit unseren Social-Media-, Werbe- und Analytics-Partnern.