RFC 002: Zero-Copy Encoding
Changelog
- 2022-03-08: Initial draft
Background
When the SDK originally migrated to [protobuf state encoding](../../build/architecture/adr-019-protobuf-state-encoding.md), zero-copy encodings such as Cap'n Proto and FlatBuffers were considered. We considered how a zero-copy encoding could be beneficial for interoperability with modules and scripts in other languages and VMs. However, protobuf was still chosen because the maturity of its ecosystem and tooling was much higher and the client experience and performance were considered the highest priorities.
In [ADR 033: Protobuf-based Inter-Module Communication](../../build/architecture/adr-033-protobuf-inter-module-comm.md), the idea of cross-language/VM inter-module communication was considered again. And in the discussions surrounding [ADR 054: Semver Compatible SDK Modules](../../build/architecture/adr-054-semver-compatible-modules.md), it was determined that multi-language/VM support in the SDK is a near term priority.
While we could do cross-language/VM inter-module communication with protobuf binary or even JSON, the performance overhead is deemed to be too high because:
- we are proposing replacing keeper calls with inter-module message calls, and some SDK users have already questioned the overhead of the inter-module routing checks alone, without even considering the possible overhead of encoding. Effectively we would be replacing function calls with encoding. One of the SDK's primary objectives currently is improving performance, and we want to prevent inter-module calls from becoming a big step backward.
- we want Rust code to be able to operate in highly resource constrained virtual machines so whatever we can do to reduce performance overhead as well as the size of generated code will make it easier and more feasible to deploy first-class integrations with these virtual machines.
Thus, the agreement when the [ADR 054](../../build/architecture/adr-054-semver-compatible-modules.md) working group concluded was to pursue a performant zero-copy encoding which is suitable for usage in highly resource constrained environments.
Proposal
This RFC proposes a zero-copy encoding that is derived from the schema definitions defined in .proto files in the SDK and all app chains. This would result in a new code generator that supports this zero-copy encoding, the existing protobuf binary and JSON encodings, and the google.golang.org/protobuf API. To make this zero-copy encoding work, a number of changes are needed to how we manage the versioning of protobuf messages that should address other concerns raised in [ADR 054](../../build/architecture/adr-054-semver-compatible-modules.md). The API for using protobuf in golang would also change and this will be described in the code generation section along with a proposed Rust code generator.
An alternative approach to building a zero-copy encoding based on protobuf schemas would be to switch to FlatBuffers or Cap'n Proto directly. However, this would require a complete rewrite of the SDK and all app chains. Placing this burden on the ecosystem would not be a wise choice when creating a zero-copy encoding compatible with all our existing types and schemas is feasible. In the future, we may consider a native schema language for this encoding that is more natural and succinct for its rules, but for now we are assuming that it is best to continue supporting the existing protobuf based workflow.
Also, we are not proposing a new encoding for transactions or gRPC query servers. From a client API perspective nothing would change. The SDK would be capable of marshaling any message to and from protobuf binary and this zero-copy encoding as needed.
Furthermore, migrating to the new golang generated code would be 100% opt-in because the inter-module router will simply marshal existing gogo proto generated types to/from the zero-copy encoding when needed. So migrating to the new code generator would provide a performance benefit, but would not be required.
In addition to supporting first-class Cosmos SDK modules defined in other languages and VMs, this encoding is intended to be useful for user-defined code executing in a VM. To satisfy this, this encoding is designed to enable proper bounds checking on all memory access at the expense of introducing some error return values in generated code.
New Protobuf Linting and Breaking Change Rules
This zero-copy encoding places some additional requirements on the definition and maintenance of protobuf schemas.
No New Fields Can Be Added To Existing Messages
The biggest change is that it will be invalid to add a new field to an existing message and a breaking change detector will need to be created which augments buf breaking to detect this.
The reasons for this are two-fold:
1) from an API compatibility perspective, adding a new field to an existing message is actually a state machine breaking change which in ADR 020 required us to add an unknown field detector. Furthermore, in ADR 054 this "feature" of protobuf poses one of the biggest problems for correct forward compatibility between different versions of the same module.
2) not allowing new fields in existing messages makes the generated code in languages like Rust (which is currently our highest priority target) much simpler and more performant because we can assume a fixed size struct gets allocated. If new fields can be added to existing messages, we need to encode the number of fields into the message and then do runtime checks. This both increases memory usage and requires another layer of indirection. With the encoding proposed below, "plain old Rust structs" (used with some special field types) can be used.
Instead of adding new fields to existing messages, APIs can add new messages to existing packages or create new packages with new versions of the messages. Also, we are not restricting the addition of cases to oneofs or values to enums. All of these cases are easier to detect at runtime with standard switch statements than the addition of new fields.
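As a rough illustration of why new enum values are easy to detect at runtime, the sketch below handles the values a given code version knows about and reports anything else in the default branch. The `ABC` type and the `describe` helper are hypothetical names chosen for this example and are not part of any proposed API.

```go
package main

import "fmt"

// ABC mirrors a protobuf enum whose values are contiguous starting from 0.
// It is a hypothetical stand-in for generated code.
type ABC uint8

const (
	ABC_A ABC = 0
	ABC_B ABC = 1
	ABC_C ABC = 2
)

// describe handles the enum values this version of the code knows about and
// reports anything else as unknown instead of silently ignoring it.
func describe(v ABC) (string, error) {
	switch v {
	case ABC_A:
		return "A", nil
	case ABC_B:
		return "B", nil
	case ABC_C:
		return "C", nil
	default:
		return "", fmt.Errorf("unknown ABC value %d", uint8(v))
	}
}

func main() {
	fmt.Println(describe(ABC_B))
	// A value added by a newer schema version is rejected explicitly.
	fmt.Println(describe(ABC(3)))
}
```

The same pattern would apply to a oneof discriminant: a newer case simply falls into the default branch.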
Additional Linting Rules
The following additional rules will be enforced by a linter that complements buf lint:
- all message fields must be specified in continuous ascending order starting from 1
- all enums must be specified in continuous ascending order starting from 0; otherwise it is too complex to check at runtime whether an enum value is unknown. An alternative would be to make adding new values to existing enums breaking
- all enum values must be <= 255. Any enum in a blockchain application which needs more than 256 values is probably doing something very wrong.
- all oneof's must be the only element in their containing message and must start at field number 1 and be added in continuous ascending order; this makes it possible to quickly check for unknown values
- all oneof field numbers must be <= 255. Any oneof which needs more field cases is probably doing something very wrong.
These requirements make the encoding and generated code simpler.
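The numbering rules above are mechanical enough to check with a few lines of code. The following is a minimal sketch, assuming the linter has already extracted field numbers and enum values in declaration order. The function names `checkFieldNumbers` and `checkEnumValues` are hypothetical, and a real linter would work on parsed descriptors rather than plain slices.

```go
package main

import (
	"errors"
	"fmt"
)

// checkFieldNumbers verifies that field numbers, in declaration order,
// are exactly 1, 2, 3, ...
func checkFieldNumbers(numbers []int) error {
	for i, n := range numbers {
		if n != i+1 {
			return fmt.Errorf("field at position %d has number %d, want %d", i, n, i+1)
		}
	}
	return nil
}

// checkEnumValues verifies that enum values, in declaration order, are
// exactly 0, 1, 2, ... and that all of them fit in one byte (<= 255).
func checkEnumValues(values []int) error {
	if len(values) > 256 {
		return errors.New("enum has more than 256 values")
	}
	for i, v := range values {
		if v != i {
			return fmt.Errorf("enum value at position %d is %d, want %d", i, v, i)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkFieldNumbers([]int{1, 2, 3})) // contiguous from 1
	fmt.Println(checkFieldNumbers([]int{1, 3}))    // gap at position 1
	fmt.Println(checkEnumValues([]int{0, 1, 2}))   // contiguous from 0
}
```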
Encoding
Buffers and Memory Management
By default, this encoding attempts to use a single fixed size encoding buffer of 64kb. This imposes a limit on the maximum size of a message that can be encoded. In the context of a message passing protocol for blockchains, this is generally a reasonable limit and the only known valid use case for exceeding it is to store user-uploaded byte code for execution in VMs. To accommodate this, large string and bytes values can be encoded in additional standalone buffers if needed. Still, the body of a message, including all scalar and message fields, must fit inside the 64kb buffer.
While this design decision greatly simplifies the encoding and decoding logic, as well as the complexity of generated code, it does mean that APIs will need to do proper bounds checking when writing data that is not fixed size and return errors.
The term Root is used to refer to the main 64kb buffer plus any additional large string/bytes buffers that are allocated.
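To make the bounds-checking requirement concrete, here is a minimal sketch of a fixed-size buffer that hands out space linearly and returns an error once a request no longer fits. The `buffer` type, `alloc` method and `errBufferFull` value are assumptions made for illustration. The RFC does not prescribe how a Root is implemented, and the additional large string/bytes buffers are left out.

```go
package main

import (
	"errors"
	"fmt"
)

const mainBufferSize = 64 * 1024 // size of the main buffer described above

var errBufferFull = errors.New("zero-copy buffer: out of space")

// buffer is a hypothetical fixed-size bump allocator over the main buffer.
type buffer struct {
	data []byte
	next int // offset of the first unused byte
}

func newBuffer() *buffer {
	return &buffer{data: make([]byte, mainBufferSize)}
}

// alloc reserves n bytes and returns their offset, or an error when the
// request does not fit in the remaining space.
func (b *buffer) alloc(n int) (int, error) {
	if n < 0 || n > len(b.data)-b.next {
		return 0, errBufferFull
	}
	off := b.next
	b.next += n
	return off, nil
}

func main() {
	b := newBuffer()
	off, err := b.alloc(100)
	fmt.Println(off, err)
	// A request larger than the remaining space fails instead of growing.
	_, err = b.alloc(mainBufferSize)
	fmt.Println(err)
}
```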
Scalar Encoding
- bools are encoded as 1 byte - 0 or 1
- uint32, int32, sint32, fixed32, sfixed32 are encoded as 4 bytes by default
- uint64, int64, sint64, fixed64, sfixed64 are encoded as 8 bytes by default
- enums are encoded as 1 byte and values MUST be in the range of 0 to 255.
- all scalars declared as optional are prefixed with 1 additional byte whose value is 0 or 1 to indicate presence
All multibyte integers are encoded as little-endian which is by far the most common native byte order for modern CPUs. Signed integers always use two's complement encoding.
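The scalar rules can be sketched with the standard `encoding/binary` package. The helper names `putInt32` and `putOptionalUint32` below are hypothetical. The example only assumes the layout stated above: 4 little-endian bytes for a 32-bit integer, and one leading presence byte for an optional scalar.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putInt32 writes v as 4 little-endian bytes in two's complement form.
func putInt32(dst []byte, v int32) {
	binary.LittleEndian.PutUint32(dst, uint32(v))
}

// putOptionalUint32 writes a presence byte (0 or 1) followed by the 4-byte
// value. The value bytes are zeroed when the field is not set.
func putOptionalUint32(dst []byte, v uint32, set bool) {
	if !set {
		dst[0] = 0
		binary.LittleEndian.PutUint32(dst[1:], 0)
		return
	}
	dst[0] = 1
	binary.LittleEndian.PutUint32(dst[1:], v)
}

func main() {
	buf := make([]byte, 9)
	putInt32(buf[0:4], -2)
	putOptionalUint32(buf[4:9], 7, true)
	fmt.Printf("% x\n", buf) // prints "fe ff ff ff 01 07 00 00 00"
}
```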
Message Encoding
By default, message fields are encoded inline as structs, meaning that if a message struct takes 8 bytes then its inline field in another struct will add 8 bytes to that struct's size.
optional message fields will be prefixed by 1 byte to indicate presence. (Alternatively, we could encode optional message fields as pointers (see below) if the desire is to save memory when they are rarely used.)
Oneof’s
oneofs are encoded as a combination of a uint8 discriminant field and memory that is as large as the largest member field. oneof field numbers MUST be between 1 and 255.
message Foo {
  oneof sum {
    bool x = 1;
    int32 y = 2;
  }
}
A discriminant of 0 indicates that the field is not set.
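For the Foo message above, this layout works out to 5 bytes: 1 discriminant byte plus 4 bytes for the largest member (the int32). The sketch below assumes that the discriminant is the field number of the active case and that the payload directly follows it. The names `fooSum`, `setX`, `setY` and `readFoo` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const fooSize = 1 + 4 // discriminant byte + largest member (int32)

// fooSum is a decoded view of the oneof: Case is 0 (unset), 1 (x) or 2 (y).
type fooSum struct {
	Case uint8
	X    bool
	Y    int32
}

// setX stores the bool case and clears the unused payload bytes.
func setX(dst []byte, x bool) {
	dst[0] = 1
	binary.LittleEndian.PutUint32(dst[1:5], 0)
	if x {
		dst[1] = 1
	}
}

// setY stores the int32 case as 4 little-endian bytes.
func setY(dst []byte, y int32) {
	dst[0] = 2
	binary.LittleEndian.PutUint32(dst[1:5], uint32(y))
}

// readFoo decodes the oneof and rejects discriminants it does not know.
func readFoo(src []byte) (fooSum, error) {
	if len(src) < fooSize {
		return fooSum{}, errors.New("buffer too short for Foo")
	}
	switch src[0] {
	case 0:
		return fooSum{}, nil
	case 1:
		return fooSum{Case: 1, X: src[1] != 0}, nil
	case 2:
		return fooSum{Case: 2, Y: int32(binary.LittleEndian.Uint32(src[1:5]))}, nil
	default:
		return fooSum{}, fmt.Errorf("unknown oneof case %d", src[0])
	}
}

func main() {
	buf := make([]byte, fooSize)
	setY(buf, -5)
	v, err := readFoo(buf)
	fmt.Println(v.Case, v.Y, err)
}
```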
Pointer Types: Bytes and Strings and Repeated fields
A pointer is a 16-bit unsigned integer that points to an offset in the current memory buffer or to another memory buffer. If the bit mask 0xFF00 on the pointer is unset, then the pointer points to an offset in the main 64kb memory buffer. If that bit mask is set, then the pointer points to a large string or bytes buffer. Up to 256 such buffers can be referenced in a single Root. The pointer 0 indicates that a field is not defined.
bytes, strings and repeated fields are encoded as pointers to a memory location that is prefixed with the length of the bytes, string or repeated field value. If the referenced memory location is in the main 64kb memory buffer, then this length prefix will be a 16-bit unsigned integer. If the referenced memory location is a large string or bytes buffer, then this length prefix will be a 32-bit unsigned integer.
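The following sketch shows what bounds-checked access to a length-prefixed string in the main buffer could look like. It deliberately covers only pointers into the main buffer and treats the pointer as a plain offset. The large-buffer case is left out because the text above does not spell out how a buffer index is derived from the pointer. `readString`, `writeString` and `errOutOfBounds` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errOutOfBounds = errors.New("pointer or length out of bounds")

// writeString stores a 16-bit little-endian length prefix followed by the
// string bytes at offset off. Offset 0 is rejected because pointer 0 means
// "not defined".
func writeString(buf []byte, off int, s string) error {
	if off <= 0 || len(s) > 0xFFFF || off+2+len(s) > len(buf) {
		return errOutOfBounds
	}
	binary.LittleEndian.PutUint16(buf[off:], uint16(len(s)))
	copy(buf[off+2:], s)
	return nil
}

// readString follows ptr into buf and validates both the length prefix and
// the payload range before touching memory.
func readString(buf []byte, ptr uint16) (s string, defined bool, err error) {
	if ptr == 0 {
		return "", false, nil // field is not defined
	}
	start := int(ptr) + 2
	if start > len(buf) {
		return "", false, errOutOfBounds
	}
	n := int(binary.LittleEndian.Uint16(buf[ptr:]))
	if start+n > len(buf) {
		return "", false, errOutOfBounds
	}
	// Converting to a Go string copies the bytes. A real zero-copy API
	// would more likely hand out a view into the buffer.
	return string(buf[start : start+n]), true, nil
}

func main() {
	buf := make([]byte, 32)
	if err := writeString(buf, 4, "hello"); err != nil {
		panic(err)
	}
	fmt.Println(readString(buf, 4))
}
```

Returning an error from the getter, as in this sketch, matches the error-returning getters for pointer fields in the generated Go code described later in this document.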
Anys
Anys are encoded as a pointer to the type URL string and a pointer to the start of the message specified by the type URL.
Maps
Maps are not supported.
Extended Encoding Options
We may choose to allow customizing the encoding of fields so that they take up less space.
For example, we could allow 8-bit or 16-bit integers:
int32 x = 1 [(cosmos_proto.int16) = true] would indicate that the field only needs 2 bytes.
Or we could allow string, bytes or repeated fields to have a fixed size rather than being encoded as pointers to a variable-length value:
string y = 2 [(cosmos_proto.fixed_size) = 3] could indicate that this is a fixed width 3 byte string.
If we choose to enable these encoding options, changing these options would be a breaking change that needs to be prevented by the breaking change detector.
Generated Code
We will describe the generated Go and Rust code using this example protobuf file:
message Foo {
  int32 x = 1;
  optional uint32 y = 2;
  string z = 3;
  Bar bar = 4;
  repeated Bar bars = 5;
}

message Bar {
  ABC abc = 1;
  Baz baz = 2;
  repeated uint32 xs = 3;
}

message Baz {
  oneof sum {
    uint32 x = 1;
    string y = 2;
  }
}

enum ABC {
  A = 0;
  B = 1;
  C = 2;
  D = 3;
}
Go
In golang, the generated code would not expose any exported struct fields, but rather getters and setters as an interface or struct methods, ex:
type Foo interface {
	X() int32
	SetX(int32)
	Y() zpb.Option[uint32]
	SetY(zpb.Option[uint32])
	Z() (string, error)
	SetZ(string) error
	Bar() Bar
	Bars() (zpb.Array[Bar], error)
}

type Bar interface {
	Abc() ABC
	SetAbc(ABC) Bar
	Baz() Baz
	Xs() (zpb.ScalarArray[uint32], error)
}

type Baz interface {
	Case() Baz_case
	GetX() uint32
	SetX(uint32)
	GetY() (string, error)
	SetY(string)
}

type Baz_case int32

const (
	Baz_X Baz_case = 0
	Baz_Y Baz_case = 1
)

type ABC int32

const (
	ABC_A ABC = 0
	ABC_B ABC = 1
	ABC_C ABC = 2
	ABC_D ABC = 3
)
Special types zpb.Option, zpb.Array and zpb.ScalarArray are used to represent optional and repeated fields respectively. These types would be included in the runtime library (called zpb here for zero-copy protobuf) and would have an API like this:
// Go requires a constraint on every type parameter, hence "T any".
type Option[T any] interface {
	IsSet() bool
	Value() T
}

type Array[T any] interface {
	InitWithLength(int) error
	Len() int
	Get(int) T
}

type ScalarArray[T any] interface {
	Array[T]
	Set(int, T)
}
Arrays in particular would not be resizable, but would be initialized with a fixed length. This is to ensure that arrays can be written to the underlying buffer in a linear way.
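As an illustration of the fixed-length behaviour, here is a minimal sketch of a uint32 array backed by a byte region. The type `uint32Array` and its `mem` field are assumptions made for this example. It mirrors the shape of the ScalarArray API above but is not the proposed runtime implementation.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// uint32Array is a hypothetical fixed-length array stored directly in a
// region of the underlying buffer as little-endian 4-byte elements.
type uint32Array struct {
	mem    []byte // region of the buffer this array may occupy
	length int
	ready  bool
}

// InitWithLength fixes the array length once. It fails if the array was
// already initialized or if n elements do not fit in the region.
func (a *uint32Array) InitWithLength(n int) error {
	if a.ready {
		return errors.New("array already initialized")
	}
	if n < 0 || n > len(a.mem)/4 {
		return errors.New("array does not fit in buffer")
	}
	a.length = n
	a.ready = true
	return nil
}

func (a *uint32Array) Len() int { return a.length }

// Get panics on an out-of-range index, like indexing a Go slice.
func (a *uint32Array) Get(i int) uint32 {
	if i < 0 || i >= a.length {
		panic("index out of range")
	}
	return binary.LittleEndian.Uint32(a.mem[i*4:])
}

// Set writes the element in place; no reallocation ever happens.
func (a *uint32Array) Set(i int, v uint32) {
	if i < 0 || i >= a.length {
		panic("index out of range")
	}
	binary.LittleEndian.PutUint32(a.mem[i*4:], v)
}

func main() {
	xs := &uint32Array{mem: make([]byte, 8)}
	if err := xs.InitWithLength(2); err != nil {
		panic(err)
	}
	xs.Set(0, 0)
	xs.Set(1, 2)
	fmt.Println(xs.Len(), xs.Get(0), xs.Get(1))
}
```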
In golang, buffers would be managed transparently under the hood by the first message initialized, and usage of this generated code might look like this:
foo := NewFoo()
foo.SetX(1)
foo.SetY(zpb.NewOption[uint32](2))
err := foo.SetZ("hello")
if err != nil {
	panic(err)
}

bar := foo.Bar()
bar.Baz().SetX(3)
xs, err := bar.Xs() // xs is new here, so := is required
if err != nil {
	panic(err)
}
// InitWithLength returns an error if the array does not fit in the buffer
if err := xs.InitWithLength(2); err != nil {
	panic(err)
}
xs.Set(0, 0)
xs.Set(1, 2)

bars, err := foo.Bars()
if err != nil {
	panic(err)
}
if err := bars.InitWithLength(3); err != nil {
	panic(err)
}
bars.Get(0).Baz().SetY("hello")
bars.Get(1).SetAbc(ABC_B)
bars.Get(2).Baz().SetX(4)
Under the hood the generated code would manage memory buffers on its own. The usage of oneofs is a bit easier than the existing go generated code (as with bar.Baz() above). And rather than using setters on embedded messages, we simply get the field (already allocated) and set its fields (as in the case of foo.Bar() above or the repeated field foo.Bars()). Whenever a field is stored with a pointer (string, bytes, and repeated fields), there is always an error returned on the getter to do proper bounds checking on the buffer.
Rust
This encoding should allow generating native structs in Rust that are annotated with #[repr(C, align(1))]. It should be fairly natural to use from Rust with a key difference that memory buffers (called Roots) must be manually allocated and passed into any pointer type.
Here is some example code that uses library types Option, Enum, String, OneOf and Repeated as well as little-endian integer types from rend:
#[repr(C, align(1))]
struct Foo {
    x: rend::i32_le,
    y: cosmos_proto::Option<rend::u32_le>,
    z: cosmos_proto::String, // String wraps a pointer to a string
    bar: Bar,
    bars: cosmos_proto::Repeated<Bar>, // field 5 of Foo in the example protobuf file
}

#[repr(C, align(1))]
struct Bar {
    abc: cosmos_proto::Enum<ABC, 3>, // the Enum wrapper allows us to distinguish undefined and defined values of ABC at runtime. 3 is specified as the max value of ABC.
    baz: cosmos_proto::OneOf<Baz, 2>, // the OneOf wrapper allows us to distinguish undefined values of Baz at runtime. 2 is specified as the max field value of Baz.
    xs: cosmos_proto::Repeated<rend::u32_le>, // Repeated wraps a pointer to repeated fields
}

#[repr(u8)]
enum ABC {
    A = 0,
    B = 1,
    C = 2,
    D = 3,
}

#[repr(C, u8)]
enum Baz {
    Empty, // all oneof's have a case for Empty if they are unset
    X(rend::u32_le),
    Y(cosmos_proto::String),
}
Example usage (which does the exact same thing as the go example above) would be:
let mut root = Root::<Foo>::new();
let mut foo = root.get_mut();
foo.x = 1.into();
foo.y = Some(2.into());
foo.z.set(root.new_string("hello")?); // could return an allocation error
foo.bar.baz = Baz::X(3.into());
foo.bar.xs.init_with_size(&mut root, 2)?; // could return an allocation error
foo.bar.xs[0] = 0.into();
foo.bar.xs[1] = 2.into();
foo.bars.init_with_size(&mut root, 3)?; // could return an allocation error
foo.bars[0].baz = Baz::Y(root.new_string("hello")?); // could return an allocation error
foo.bars[1].abc = ABC::B;
foo.bars[2].baz = Baz::X(4.into());