course: Schema Registry 101

Working with Schema Formats

5 min
Danica Fine

Senior Developer Advocate (Presenter)

Protobuf

sr101-m5-01

This is the Protobuf file we looked at in the previous module. Let’s dive in to discuss its format.

Protobuf defines all fields of a message in type-name format. In this example, the types are the scalar values string and double. Other scalar types supported by Protobuf include float, int32 (int in Java), int64 (long in Java), bool, and bytes.

Each field in the message definition has a unique number, which Protobuf uses to identify the field in the binary message format. Because of this, once a schema is in use you should not change the number of an existing field.
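
To see why the field number matters, it helps to look at how it ends up on the wire. In the Protobuf encoding, each value is preceded by a tag that combines the field number with a wire type. The sketch below only computes that tag; the field layout (item = 1, total_cost = 2) is an assumption made for illustration, since the schema itself is not reproduced here.

```java
// Sketch: how a field number becomes part of the Protobuf binary encoding.
final class FieldTagSketch {
    // Wire types defined by the Protobuf encoding.
    static final int WIRE_VARINT = 0;  // int32, int64, bool, enum
    static final int WIRE_FIXED64 = 1; // double, fixed64
    static final int WIRE_LEN = 2;     // string, bytes, nested messages

    /** Combines a field number and a wire type into the tag that precedes a value. */
    static int tag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    public static void main(String[] args) {
        // Assumed layout: string item = 1; double total_cost = 2;
        System.out.printf("item       -> 0x%02X%n", tag(1, WIRE_LEN));     // 0x0A
        System.out.printf("total_cost -> 0x%02X%n", tag(2, WIRE_FIXED64)); // 0x11
        // Renumbering total_cost to 3 yields a different tag, so a reader built
        // against the old schema would see an unknown field instead of total_cost.
        System.out.printf("renumbered -> 0x%02X%n", tag(3, WIRE_FIXED64)); // 0x19
    }
}
```

The field name never appears in the encoded bytes, only the number does, which is why renaming a field is safe but renumbering it is not.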

Protobuf supports a number of more complex field types. Let’s take a look at some of them now.

Protobuf Collections

sr101-m5-02

Protobuf supports collection types like a list or a map. Here you see both being used to add fields to the Purchase message. For a list you use the repeated keyword. In Java, this will translate to a List. For a map you use the map keyword. Note that the key can only be a string or an integer type (int32 or int64, for example), while the value can be any type except another map.

Protobuf Enumerations

sr101-m5-03

Protobuf also supports an enumeration type. Note that the first element must always map to the value 0 so that 0 can serve as the default value.

Importing Protobuf

sr101-m5-04

You can use definitions from other .proto files by adding an import statement at the top of your file.

Consider this schema that tracks the online events generated by a customer. The purchase and page_view event types are already defined in separate .proto files so you can import them and then list the fields in your .proto file.

Alternate Values for a Field

sr101-m5-05

Protocol buffers also support having a field that could be one of an arbitrary number of values. Let’s take a look at the CustomerEvent schema again. If we know that only one of the page_view or purchase fields will be populated, we can use the oneof keyword to define a single customer_action field that holds one of these possible events.

To help determine which object fills the field, Protobuf generates an enum named <field name>Case. With this example, the enum would be CustomerActionCase and the value of it would be either PURCHASE or PAGE_VIEW depending on what value actually fills the field.
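
The usual way to consume a oneof field in Java is to switch on that case enum. The sketch below is a hand-written stand-in, not real generated code: the CustomerEvent class, its factory methods, and the string payloads are all hypothetical. Only the enum name CustomerActionCase, its PURCHASE and PAGE_VIEW values, the CUSTOMERACTION_NOT_SET value used when nothing is set, and the getCustomerActionCase() accessor follow the naming that Protobuf's Java generator uses.

```java
// Sketch: dispatching on the case enum of a oneof field.
final class OneofCaseSketch {
    // Assumed schema (not shown in the text):
    //   oneof customer_action { Purchase purchase = 1; PageView page_view = 2; }

    /** Mirrors the shape of the enum generated for the customer_action oneof. */
    enum CustomerActionCase { PURCHASE, PAGE_VIEW, CUSTOMERACTION_NOT_SET }

    /** Hypothetical stand-in for the generated CustomerEvent class. */
    static final class CustomerEvent {
        private final CustomerActionCase actionCase;
        private final Object action;

        private CustomerEvent(CustomerActionCase actionCase, Object action) {
            this.actionCase = actionCase;
            this.action = action;
        }

        static CustomerEvent ofPurchase(String item) {
            return new CustomerEvent(CustomerActionCase.PURCHASE, item);
        }

        static CustomerEvent ofPageView(String url) {
            return new CustomerEvent(CustomerActionCase.PAGE_VIEW, url);
        }

        static CustomerEvent empty() {
            return new CustomerEvent(CustomerActionCase.CUSTOMERACTION_NOT_SET, null);
        }

        CustomerActionCase getCustomerActionCase() {
            return actionCase;
        }
    }

    /** Decides how to handle an event based on which oneof member is populated. */
    static String describe(CustomerEvent event) {
        switch (event.getCustomerActionCase()) {
            case PURCHASE:
                return "purchase";
            case PAGE_VIEW:
                return "page view";
            default:
                return "no action set";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(CustomerEvent.ofPurchase("book")));
        System.out.println(describe(CustomerEvent.empty()));
    }
}
```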

Default Values

sr101-m5-06

If a field is not present when it’s serialized, Protobuf assigns a default value based on the type of the field. The default for a string field is an empty string, a number field is 0, and a boolean is false.

It’s also worth noting that if a field is set to its default value, for example when the total_cost of the sale is zero, it’s not serialized or sent across the wire.
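
The effect on message size can be sketched with a tiny hand-rolled encoder. This is not the Protobuf library: it is a simplified illustration that assumes proto3-style fields, a hypothetical layout of string item = 1 and double total_cost = 2, and strings shorter than 128 bytes so the length fits in a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch: fields that hold their default value produce no bytes at all.
final class DefaultSkippingSketch {
    /** Encodes the two assumed fields, omitting any field that holds its default. */
    static byte[] encode(String item, double totalCost) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!item.isEmpty()) {
            byte[] utf8 = item.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("sketch only handles short strings");
            }
            out.write(0x0A);        // tag: field 1, length-delimited
            out.write(utf8.length); // length prefix (one byte in this sketch)
            out.write(utf8, 0, utf8.length);
        }
        if (Double.doubleToRawLongBits(totalCost) != 0L) {
            byte[] bits = ByteBuffer.allocate(8)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .putDouble(totalCost)
                    .array();
            out.write(0x11);        // tag: field 2, 64-bit
            out.write(bits, 0, bits.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode("book", 19.99).length); // both fields present
        System.out.println(encode("book", 0.0).length);   // total_cost omitted
        System.out.println(encode("", 0.0).length);       // nothing to send
    }
}
```

One consequence is that a reader cannot tell "explicitly set to zero" apart from "never set": both arrive as the default.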

Now let’s move on to Avro schemas.

Avro

sr101-m5-07

Here’s the Avro definition of the Purchase schema we’ve seen so far. Avro schemas are written in JSON.

  • You will always use a type of record when defining a schema.
  • The namespace field is a way to prevent name collisions with other generated Avro objects. When using Avro with Java, the namespace becomes the package name.
  • You define the fields for an Avro object as a JSON array, and each field is defined as a JSON object with the name of the field and its type. Avro supports the usual scalar types for fields: string, int, long, double, boolean, float, and bytes. Avro also supports more complex types, which we will look at next, starting with collection types.

Avro Collections – Arrays

sr101-m5-08

Avro supports array types. Notice here that you nest another JSON object when declaring the array. The coupon_codes field could also be a complex type instead of the string shown here.

Avro Collections – Maps

sr101-m5-09

In Avro, maps are also defined using a nested type. The keys of a map in Avro are assumed to be strings. But you can also have complex types for the values of a map.

Avro Enumerations

sr101-m5-10

Avro supports enumeration types as well.

Avro Records in a Schema

sr101-m5-11

Avro permits having another record as a field type. You can either have the full JSON definition in the schema or use the fully qualified name of the record as shown here. It is recommended that if you have a schema that references other record types you use the name of the record so that when you make changes, other Avro files that reference it won’t have to be updated.

Avro Unions

sr101-m5-12

Similar to Protobuf, Avro also supports having a field that could contain one of multiple values. This is represented in an Avro schema by using array notation for the type, listing the different types that could be in the field. Note that unlike Protobuf, Avro does not provide any support for determining which type is present. The generated code will have a type of Object for the action field, and you have to determine the type by using the instanceof operator in Java.
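
A minimal sketch of that instanceof check follows. The Purchase, PageView, and CustomerEvent classes here are hand-written stand-ins with hypothetical fields, not Avro-generated code; the only detail taken from the text is that the union field surfaces as a plain Object.

```java
// Sketch: working out which member of an Avro union is present.
final class UnionDispatchSketch {
    /** Hypothetical stand-in for a generated Purchase record. */
    static final class Purchase {
        final String item;
        Purchase(String item) { this.item = item; }
    }

    /** Hypothetical stand-in for a generated PageView record. */
    static final class PageView {
        final String url;
        PageView(String url) { this.url = url; }
    }

    /** Stand-in for the record holding the union; the field is typed as Object. */
    static final class CustomerEvent {
        private final Object action;
        CustomerEvent(Object action) { this.action = action; }
        Object getAction() { return action; }
    }

    /** Checks each possible union member in turn and casts once the type is known. */
    static String describe(CustomerEvent event) {
        Object action = event.getAction();
        if (action instanceof Purchase) {
            return "purchase of " + ((Purchase) action).item;
        }
        if (action instanceof PageView) {
            return "view of " + ((PageView) action).url;
        }
        return "unknown action";
    }

    public static void main(String[] args) {
        System.out.println(describe(new CustomerEvent(new Purchase("book"))));
        System.out.println(describe(new CustomerEvent(new PageView("/home"))));
    }
}
```

Because the compiler cannot verify that every union member is handled, keeping a fallback branch like the last return is a sensible habit.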

Avro Default Values

sr101-m5-13

Avro has default values like Protobuf but in Avro you need to explicitly provide them in the schema.

Working with Generated Objects

sr101-m5-14

To work with the generated object from either Avro or Protobuf, you need to follow the builder pattern. You first create a builder instance, then set the desired fields, and finally call build to get the concrete object type.
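
The three steps look roughly like this. The Purchase class below is a hand-written stand-in so the example runs without any schema tooling; its item and total cost fields are assumptions. The newBuilder(), setter, and build() method names follow the convention used by the classes both Avro and Protobuf generate for Java.

```java
// Sketch: the builder pattern as exposed by generated classes.
final class BuilderSketch {
    /** Hypothetical stand-in for a generated Purchase class. */
    static final class Purchase {
        private final String item;
        private final double totalCost;

        private Purchase(Builder builder) {
            this.item = builder.item;
            this.totalCost = builder.totalCost;
        }

        /** Step 1: obtain a builder. */
        static Builder newBuilder() {
            return new Builder();
        }

        String getItem() { return item; }
        double getTotalCost() { return totalCost; }

        static final class Builder {
            private String item = "";
            private double totalCost;

            /** Step 2: set fields; each setter returns the builder for chaining. */
            Builder setItem(String item) { this.item = item; return this; }
            Builder setTotalCost(double totalCost) { this.totalCost = totalCost; return this; }

            /** Step 3: produce the concrete object. */
            Purchase build() { return new Purchase(this); }
        }
    }

    public static void main(String[] args) {
        Purchase purchase = Purchase.newBuilder()
                .setItem("book")
                .setTotalCost(19.99)
                .build();
        System.out.println(purchase.getItem() + " costs " + purchase.getTotalCost());
    }
}
```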

Changing State of Generated Objects

sr101-m5-15

Avro provides setter methods on the generated objects that allow you to directly change their state. With Protobuf, the objects returned by the builder are immutable. To update the value of a field with Protobuf, you need to pass the object into a builder, update the field(s) you want to change, and then call build again, resulting in a brand-new object.

Avro builders also have an overloaded constructor that accepts an object of the same type that the builder returns.
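
The difference between the two update styles can be sketched side by side. Both classes below are hypothetical hand-written stand-ins, reduced to the minimum needed to show the contrast. The toBuilder() method name matches what Protobuf's generated Java messages provide for copying an existing object into a builder; everything else is an assumption for illustration.

```java
// Sketch: in-place mutation (Avro style) versus copy-and-rebuild (Protobuf style).
final class UpdateSketch {
    /** Immutable, Protobuf-style stand-in: the built object has no setters. */
    static final class ImmutablePurchase {
        private final String item;
        private final double totalCost;

        private ImmutablePurchase(String item, double totalCost) {
            this.item = item;
            this.totalCost = totalCost;
        }

        static Builder newBuilder() { return new Builder("", 0.0); }

        /** Copies the current state into a fresh builder. */
        Builder toBuilder() { return new Builder(item, totalCost); }

        String getItem() { return item; }
        double getTotalCost() { return totalCost; }

        static final class Builder {
            private String item;
            private double totalCost;

            private Builder(String item, double totalCost) {
                this.item = item;
                this.totalCost = totalCost;
            }

            Builder setItem(String item) { this.item = item; return this; }
            Builder setTotalCost(double totalCost) { this.totalCost = totalCost; return this; }
            ImmutablePurchase build() { return new ImmutablePurchase(item, totalCost); }
        }
    }

    /** Mutable, Avro-style stand-in: the object itself exposes a setter. */
    static final class MutablePurchase {
        private double totalCost;

        MutablePurchase(double totalCost) { this.totalCost = totalCost; }

        void setTotalCost(double totalCost) { this.totalCost = totalCost; }
        double getTotalCost() { return totalCost; }
    }

    public static void main(String[] args) {
        // Avro style: the same object changes state.
        MutablePurchase avroStyle = new MutablePurchase(19.99);
        avroStyle.setTotalCost(9.99);

        // Protobuf style: the original is untouched and a new object is produced.
        ImmutablePurchase original = ImmutablePurchase.newBuilder()
                .setItem("book").setTotalCost(19.99).build();
        ImmutablePurchase updated = original.toBuilder().setTotalCost(9.99).build();

        System.out.println(avroStyle.getTotalCost());
        System.out.println(original.getTotalCost() + " -> " + updated.getTotalCost());
    }
}
```

Immutability makes the Protobuf-style objects safe to share between threads, at the cost of allocating a new object for every change.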

Working with Schema Formats

Hi, I'm Danica Fine, welcome back to the Schema Registry Course. In this module, we're going to take a look at Protocol Buffers and Avro in more detail.

Here's the Protobuf file we referenced in the previous module, it should look familiar. Let's dive in to discuss its format. Protobuf defines all fields of a message in type-name format. Here the types are scalar values of string and double. Other scalar types supported by Protobuf include float, int32, int64, boolean, and bytes. Each field in the message definition contains a unique number that Protobuf uses to identify fields in the message binary format. Because of this, once you've defined your schema, you should not change the number of the field. Protobuf also supports more complex field types, so let's take a look.

In Protobuf, you may use collection types, like a list or a map. Let's use these to add a couple of fields to the Purchase message. To use a list, you use the "repeated" keyword. And in Java, this will just translate to a List. For a map, you use the keyword "map". But before you get too wild, note that the key of this map can only be a string or integer type. The value can be any type, just not another map type.

Protobuf also supports an enumeration type. When you set this up, the first element must always map to 0 so that the default value of 0 can be used.

You can also have another Protobuf schema as a field in your definition by using the import statement. Consider this schema that tracks online events generated by a customer. The purchase and page_view event types are already defined in separate .proto files, so all we do is add the import statements and then list them as fields in the new .proto file.

Protocol buffers also support having a field that could be one of any arbitrary values. Let's take a look at the CustomerEvent schema again. If we know that only one of the page_view or purchase fields will be populated, we can define a single customer_action field that could be one of these possible events. To help determine which object fills the field, Protobuf generates an enum named "field name" with case appended to it. So with our example here, the enum in the generated Java file would be CustomerActionCase, and the value of it would be either PURCHASE or PAGE_VIEW depending on what value actually fills the field.

If a field is not present when it's serialized, Protobuf assigns a default value based on the type of the field. The default for a string field is an empty string, a number field is 0, and a boolean is false. It's also worth noting that if a field is set to the default value, for example, the total cost of the sale is zero, it's not serialized or sent across the wire.

All right, now let's move on to Avro schemas. Here's the Avro definition of the Purchase schema we've seen so far. Recall that Avro schemas use JSON to define the schema. You'll always use a type of record when defining a schema. The namespace field that you see here is a way to prevent name collisions with other generated Avro objects. When using Avro with Java, the namespace becomes the package name. You define the fields for an Avro object as a JSON array, and each field is defined as a JSON object with the name of the field and its type. Avro supports the usual scalar types for fields: string, int, long, double, boolean, float, and bytes.

Avro also supports more complex types; let's look into these a bit more starting with collection types. We'll explore the Array type first. Notice here that you nest another JSON object when declaring the array. The coupon_codes field could also be a complex type instead of the string shown here. In Avro, maps are also defined using a nested type. The keys of a map in Avro are assumed to be strings. But you're free to use complex types for the values of a map.

Avro supports enumeration types as well. And Avro also permits having another record as a field type. You can either have the full JSON definition in the schema or use the fully qualified name of the record as shown here. But it's recommended that you use the name of the record so that when you make changes, all the other Avro files that reference it won't have to be updated.

Similar to Protobuf, Avro supports having a field that could contain one of multiple values. This is represented in an Avro schema by using array notation for the type, and it will contain the different types that could be in the field. Note that unlike Protobuf, Avro does not provide any support for determining which type is present. In this case, the generated code will have a type of Object for the action field, and you'd have to determine the type by using the instanceof operator in Java.

Avro has default values like Protobuf, but in Avro, you need to explicitly provide them in the schema.

To work with the generated object from either Avro or Protobuf, you'll need to follow the builder pattern. You create a builder instance, then set the desired fields, and call build to get the concrete object type. Avro provides setter methods on the generated objects, allowing you to directly change their state. On the other hand, with Protobuf, the objects returned by the builder are immutable. To update the value of a field, you'll need to pass the object into a builder, update whichever fields you want to change, and then call build again. This results in a brand new object. Avro builders also have an overloaded constructor that accepts an object of the same type that the builder returns.

And with that, you now know a bit more about Avro and Protobuf schemas, and how to build them up. See you in the next module where we'll learn how to manage schemas successfully.