I found the encoding scheme that protocol buffers use particularly interesting. I also delivered a talk at the FOSS United Bangalore meetup on 19th October!
Protobufs are a message exchange format (find more at https://protobuf.dev/). They allow a type-safe contract to be shared between multiple languages: you define your “message” in a “.proto” file, and the protoc compiler generates the corresponding code for your target language.
Consider the following message, Person, which consists of 2 fields, age and name, with field_numbers 1 and 2 respectively. Think of the field_number as a way to uniquely identify the keys; this becomes important when we discuss protobuf encoding.
message Person {
  int32 age = 1;   // the field_number is 1
  string name = 2; // used to uniquely identify the key
}
In this section, let us assume that the values of name and age are set, and that the encoded protobuf message is 080212046a616e65 (represented in hexadecimal, since the expansion to 0s and 1s would be too long). What does this mean?
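It is important to remember that the message has only a string and an int32. The string corresponds to wire type ID 2, named LEN, while the int32 corresponds to ID 0, named VARINT. These are two of the six wire types that protobuf defines: VARINT (0), I64 (1), LEN (2), SGROUP (3), EGROUP (4) and I32 (5).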
Before we get into the specifics, 2 important pointers to keep in mind:
- tag = field_number + wire_type, where “+” means the two are packed side by side into one number, not added arithmetically. When a protobuf message is encoded, it does not package the field names; in our example, “name” and “age” are not sent. So how do we decode the message? The tag uniquely identifies the field and its wire type, and the decoding end needs the proto message definition to map each field_number back to a name and a data type.
- we are using hexadecimal for representation purposes. We have all probably learned about it at some point; if not, here is a refresher: each hexadecimal digit stands for 4 bits, so two digits make up one byte (for example, 6a is 0110 1010).
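To make the hexadecimal refresher concrete, here is a small sketch that expands the example message into its individual bytes and bits. The names encoded_hex and to_bits are made up for this illustration.

```python
# The encoded message from the example, written as a hex string.
encoded_hex = "080212046a616e65"

# bytes.fromhex turns every pair of hex digits into one byte.
encoded = bytes.fromhex(encoded_hex)


def to_bits(data):
    """Render each byte as an 8-character string of 0s and 1s."""
    return [format(byte, "08b") for byte in data]


bits = to_bits(encoded)

# Print each hex pair next to its binary expansion.
for hex_pair, bit_string in zip(encoded.hex(" ").split(), bits):
    print(hex_pair, "->", bit_string)
```

The 16 hex digits become 8 bytes, which is the size we will compare against JSON later.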
Tag
Tags follow the variable-width integer (varint) encoding. To see why, consider a 64-bit integer: to represent the number 1, which requires only one set bit, do we really need 63 other unset bits? This is where variable-width encoding comes into the picture! (https://www.youtube.com/watch?v=9b2e_iRVJ0k will probably help you understand it better.)
Essentially, the MSB of each byte is a “continuation bit”: if it is set, there are more bytes to come; if it is unset (zero), this is the last byte of the sequence. The remaining 7 bits of each byte carry the value, least significant group first.
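The continuation-bit idea can be sketched in a few lines. This is a minimal illustration for non-negative integers only, not the code that protoc generates; encode_varint and decode_varint are names chosen for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits as payload
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # last byte: continuation bit stays 0
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # payload groups arrive least significant first
        if not byte & 0x80:               # continuation bit unset: this was the last byte
            return result, pos
        shift += 7


print(encode_varint(1).hex())    # one byte is enough for small numbers
print(encode_varint(150).hex())  # larger numbers spill into a second byte
```

The number 1 costs a single byte instead of eight, which is exactly the saving the paragraph above describes.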
Now, assuming you have understood variable-width integers: in the encoding of the tag, the field_number and wire_type are laid out one after the other, i.e. tag = (field_number << 3) | wire_type.
The last 3 bits of the number represent the wire_type. Why only the last 3, you may ask? There are 6 types, 0–5, and the maximum value, 5, takes 3 bits to represent. The field_number is represented by the remaining bits to the left of the wire_type.
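As a sketch of that bit layout, the helpers below pack a field_number and wire_type into a tag and split them apart again. make_tag and split_tag are hypothetical names for this example; the wire type names follow the protobuf documentation.

```python
# Wire type IDs and their names.
WIRE_TYPES = {0: "VARINT", 1: "I64", 2: "LEN", 3: "SGROUP", 4: "EGROUP", 5: "I32"}


def make_tag(field_number, wire_type):
    """Shift the field_number left by 3 bits and put the wire_type in the gap."""
    return (field_number << 3) | wire_type


def split_tag(tag):
    """Return (field_number, wire_type) for a tag value."""
    return tag >> 3, tag & 0b111


age_tag = make_tag(1, 0)   # int32 age = 1  -> VARINT
name_tag = make_tag(2, 2)  # string name = 2 -> LEN

print(format(age_tag, "02x"), format(name_tag, "02x"))
print(split_tag(name_tag), WIRE_TYPES[split_tag(name_tag)[1]])
```

The resulting tag is itself written to the wire as a varint; for the small field_numbers in our example it fits in one byte.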
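Thus, the lower the value of the field_number, the fewer bits it takes, which is why the documentation also recommends field_numbers ≤ 15 for frequently used fields. Here is why: one byte offers 7 bits once the continuation bit is set aside, 3 of them go to the wire_type, and the 4 that remain can hold field_numbers 1 to 15. A field_number of 16 or more pushes the tag into a second byte.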
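The effect of the field_number on the tag size can be checked with a short sketch. tag_size is a name invented for this illustration; it simply counts how many varint bytes the tag would occupy.

```python
def tag_size(field_number, wire_type=0):
    """Count the bytes a tag occupies once it is varint-encoded."""
    tag = (field_number << 3) | wire_type
    size = 1
    while tag >= 0x80:  # anything above 7 bits needs another byte
        tag >>= 7
        size += 1
    return size


for number in (1, 15, 16, 2047, 2048):
    print(number, "->", tag_size(number), "byte(s)")
```

Field_numbers 1 to 15 stay within one byte, 16 to 2047 need two, and the cost keeps growing from there.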
Now, let’s try to understand what 0802 represents, shall we? As stated earlier, the first field in the message is an integer that stores the age of a person, and its field_number is 1. So can you guess what 08 stands for?
We have figured out what 08 stands for: in binary it is 0000 1000, where the last 3 bits (000) are the wire_type 0, i.e. VARINT, and the remaining bits (00001) are the field_number 1. If you guessed correctly, then 02 represents the value, since this field is an integer, and 02 does indeed represent the number 2! Find more about integer encoding here: https://wiki.ubc.ca/images/b/bf/FSS-UTF_1992_UTF-8_1993.png
Essentially, 0802 says: this is a VARINT with field_number 1 and value 2.
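The reading of 0802 can be reproduced with a few lines. This sketch assumes that both the tag and the value fit in a single byte each (continuation bit unset), which holds for this example but not in general.

```python
chunk = bytes.fromhex("0802")

tag = chunk[0]
# Both bytes are below 0x80, so each one is a complete single-byte varint.
assert tag < 0x80 and chunk[1] < 0x80

field_number = tag >> 3  # everything to the left of the last 3 bits
wire_type = tag & 0b111  # the last 3 bits
value = chunk[1]         # the varint payload

print("field_number:", field_number)
print("wire_type:", wire_type)
print("value:", value)
```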
Let’s go to the next bit, 12046a616e65. We know that this portion represents the string name and is of type LEN, as noted earlier. In binary, 12 is 0001 0010: the last 3 bits (010) are the wire_type 2, i.e. LEN, and the remaining bits (00010) are the field_number 2. LEN fields use the tag-length-value method of representation: since this is a string, we also need to know how long it is, so the byte after the tag, 04, gives the length of the value in bytes.
Strings follow the UTF-8 encoding. If you are wondering what UTF-8 is, it is a variable-width encoding format that can represent every Unicode character, which covers practically every writing system in use today. It uses 1–4 bytes per character, depending on the character's code point: 1 byte up to U+007F, 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes up to U+10FFFF.
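A quick sketch shows the 1–4 byte widths of UTF-8 in practice, and why the length prefix of a LEN field counts bytes rather than characters. The sample characters are arbitrary picks for this illustration.

```python
samples = [
    "j",           # U+006A, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji
]

# Number of bytes each character needs once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# Four characters, but the accented one takes two bytes,
# so a LEN field holding this string would carry a length of 5.
accented_name = "jan\u00e9"
print(len(accented_name), len(accented_name.encode("utf-8")))
```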
ASCII uses 7 bits and represents 128 different characters. It fits perfectly into UTF-8: an ASCII character is a single byte whose MSB is 0, which makes UTF-8 backward compatible with ASCII. For our example, since it is a name, we can assume ASCII characters only.
Thus, since the length byte is 04 and each ASCII character takes one byte, we know this is a string of 4 characters. Now what are those characters? Looking up the ASCII values of 6a 61 6e 65 gives j, a, n and e.
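The tag-length-value reading of 12046a616e65 can be sketched as follows, again assuming the tag and the length each fit in one byte, as they do here.

```python
payload = bytes.fromhex("12046a616e65")

tag, length = payload[0], payload[1]
field_number, wire_type = tag >> 3, tag & 0b111

# The value is the next `length` bytes after the tag and the length byte.
raw = payload[2:2 + length]
characters = [chr(byte) for byte in raw]  # valid because every byte is ASCII
name = raw.decode("utf-8")

print(field_number, wire_type, length)
print(characters)
print(name)
```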
Thus, our message represents a person whose name is jane and age is 2.
How do we verify this? We can do this using a tool called protoscope. Providing our encoding as input, protoscope prints the decoded fields.
We have verified that our message corresponds to what we stated and decoded earlier.
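As an independent cross-check, the whole walk-through fits into a small hand-rolled decoder. This is a sketch that only understands the VARINT and LEN wire types used in our example, and it is not how protoscope works internally; read_varint and decode_message are names chosen for this illustration.

```python
def read_varint(data, pos):
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_message(data):
    """Return a list of (field_number, wire_type, value) tuples."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0b111
        if wire_type == 0:    # VARINT: the value is itself a varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # LEN: a varint length, then that many bytes
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError("wire type %d is not handled in this sketch" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


decoded = decode_message(bytes.fromhex("080212046a616e65"))
print(decoded)
```

Without the proto definition the decoder only knows field_numbers and wire types; mapping 1 to age and 2 to name still requires the Person message.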
JSON
Without any sort of compression, how many bytes do you think a JSON object with the same content would require? With protobufs it took us 8 bytes. JSON takes about 26 bytes (23 with all optional whitespace removed), which is roughly 3 times the size. This is one of the reasons protobufs are often preferred over text-based encoding formats.
To conclude, this was a brief look at protobuf encoding. The documentation covers the different data types in detail and is definitely an interesting read.
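The size comparison is easy to reproduce with Python's standard json module. The dictionary below is an assumed JSON equivalent of the Person message; the exact byte count depends on how much whitespace the serializer emits.

```python
import json

person = {"name": "jane", "age": 2}

# Default separators add a space after each comma and colon.
spaced = json.dumps(person)
# Compact separators strip all optional whitespace.
compact = json.dumps(person, separators=(",", ":"))

protobuf_bytes = bytes.fromhex("080212046a616e65")

print(len(spaced.encode("utf-8")), spaced)
print(len(compact.encode("utf-8")), compact)
print(len(protobuf_bytes), protobuf_bytes.hex())
```

Most of the difference comes from JSON repeating the field names in every message, whereas protobuf sends only the one-byte tags.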