Avro FAQ

Why does Avro have multiple notions of schema?

All Avro objects are encoded with reference to a schema, called the writer schema. Simple enough so far! Decoding an Avro object is a bit more complicated, and always done with reference to two schemas, the writer schema the object was written with, and the reader schema.

The purpose of the reader schema is to define the type of the decoded object as expected by the consuming application, which may in general differ from the type of the object from the producing application's perspective. For example, consider two applications, which would like to exchange information about a business's customers. This data might be encoded with the following schema:

{
    "type": "record",
    "name": "CustomerInfo",
    "fields": [
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"}
    ]
}

So far, both the producer and consumer can use the same schema, and everything will work out fine. Now imagine that the producing application is updated, to encode some more data. It might now use the following schema:

{
    "type": "record",
    "name": "CustomerInfo",
    "fields": [
        {"name": "Age", "type": "int"},
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"}
    ]
}

Notice that the Age field has been added. For the consuming application to understand this new field, it would have to be updated with the new schema. However, until this is done, it can happily continue interpreting the new data and processing FirstName, LastName records; the new Age field will be ignored, being present in the writer schema but not the reader schema.

Why must the writer schema be available during decoding?

Having read the above discussion, one might reasonably wonder why the decoder needs access to the writer schema at all. Couldn't it simply interpret everything according to its reader schema, skipping unknown fields and filling in null for missing ones, as needed?

The answer is no. Avro objects do not contain any data type tags or other signposts for decoding; they simply write out the fields of each record in order. In the second schema given above, including the Age field, each object would be laid out as a (variable-length-encoded) integer, followed by two length-prefixed strings. If the consuming application attempted to decode this with reference only to the first schema given above, it would incorrectly interpret the Age integer as the length of FirstName, and go off the rails.

Caveat: There are certain special cases in which the object could in theory be correctly interpreted with reference only to the reader schema. The most obvious example is when the two schemas are identical, but there are others; (exercise for the reader: what would have happened had I added Age to the end of the CustomerInfo record, rather than the beginning?)

This idea is not explored at all by the Avro spec, which makes it clear that a writer schema should always be available during decoding. My personal stance is that attempting to enumerate, detect, and properly handle these special cases is more trouble than it's worth, and so we don't do so in Materialize. However, honesty requires me to admit that that stance is disagreed with by some engineers whose competence I respect.

How does the consumer gain access to the writer schema?

It depends on the overall container for the Avro objects. In Avro files (also called Object Container Files, or OCFs), the writer schema is written once, at the beginning of the file, and all objects in the file are assumed to have been encoded with that writer schema.

In message queues like Kafka or Redpanda, each Avro object is typically encoded as its own message, and users expect to be able to evolve the schema over time, rather than use one global schema per topic. Thus each message must somehow carry its writer schema. But encoding a schema along with each message would be prohibitively wasteful, as schemas are often much larger than messages themselves. Thus, usually a schema registry is used -- each message is prepended with a schema ID, which the consuming application uses to look up the corresponding writer schema, for example over a REST endpoint. The consuming application is expected to cache these ID to schema mappings, so that traffic is only necessary when the writer schema actually changes.

What sorts of schemas can be used together?

A detailed description of schema resolution is available in the Avro spec. A few basic guidelines will be illustrative:

Is it always possible to tell, given two schemas, whether they are compatible?

Not quite. It's possible for success or failure to be data-dependent; that is, only raised when particular actual records are decoded. For example, consider a writer schema defining one field whose type may be either string or int, and a reader schema defining the same field, but expecting only an int. The writer schema is as follows:

{
    "type": "record",
    "name": "MyRecord",
    "fields": [
        {"name": "MyField", "type": ["string", "int"]}
    ]
}

The reader schema is as follows:

{
    "type": "record",
    "name": "MyRecord",
    "fields": [
        {"name": "MyField", "type": "int"}
    ]
}

In general, these schemas are not guaranteed to be compatible; the producing application might write MyField with type string, which the consuming application will not know how to interpret. But if it happens to be the case that the producing application only ever writes records whose value in that field is of type int, the consuming application can interpret them fine!

In Materialize, we put the entire source into an errored state if an uninterpretable message is discovered, but it would also not be unreasonable, depending on the use case, to skip uninterpretable messages and continue processing the remaining ones.

Is the reader schema always one of the (present, past, or future) writer schemas?

In practice, usually, but that's not logically required. Imagine that two writer schemas are used over the lifetime of an application:

{
    "type": "record",
    "name": "MyRecord",
    "fields": [
        {"name": "Foo", "type": "int"},
        {"name": "Bar", "type": "int"}
    ]
}

and:

{
    "type": "record",
    "name": "MyRecord",
    "fields": [
        {"name": "Bar", "type": "int"},
        {"name": "Quux", "type": "int"}
    ]
}

Neither of these schemas is compatible with the other; if the first one is used as a reader and the second as a writer schema, the consuming application will not know what to fill in as the value of Foo (remember that fields in Avro are not nullable by default), and similarly for Quux in the reverse case.

However, each of these (as writer schema) would be successfully interpreted by the following reader schema, which only cares about Bar:

{
    "type": "record",
    "name": "MyRecord",
    "fields": [
        {"name": "Bar", "type": "int"}
    ]
}

What's currently flawed in Materialize's support for Avro schemas?

This section is all my personal opinion, and doesn't necessarily reflect anything about Materialize's roadmap!

In Materialize, when an Avro-formatted source is created from Kafka or Redpanda, we simply choose the latest writer schema from the schema registry and use that as the reader schema. We then record that schema in the catalog so that it will be used again for the same source on restart. However, as should be clear from the above sections, it is not necessarily the case that the application is interested in the most recently-used schema. For example, a field may have been deleted in a later schema, but the user might still wish to see the values of that field from older records (and a default value for newer ones). Or, as explained in the preceding section, a valid reader schema might not have been one of the writer schemas at all!

We should instead give the user three choices for schema selection:

  1. The current behavior, which is a sane default.
  2. Specifying a particular schema ID; the reader schema will be taken as the entry in the schema registry with this ID.
  3. Specifying an arbitrary reader schema inline.

Of course, in all three cases, we must not confuse our selection of reader schema with the writer schema of each object, and must continue fetching those from the schema registry.

How does one evolve an Avro reader schema in Materialize?

Drop the source and re-create it with whatever reader schema you like (assuming the options in the above section have been implemented).

What we will probably not support anytime soon is on-the-fly modification of an existing source's schema. This would require somehow rehydrating all the computation downstream of it with the new values, re-planning all queries, migrating the persisted version of the source, etc.; supporting something like this is clearly a major project and it's not even clear whether it's possible.

Why do you keep writing "schemas" ? Shouldn't it be schemata ?

I have no opinion on this subject.