digitalmars.D - [GSoC] Improved FlatBuffers and/or Protobuf Support ~ Binary

Ahmet Sait <nightmarex1337 hotmail.com> writes:
Hi,
I've been thinking about working on binary serialization as my 
potential GSoC
project. It's originally one of the older entries in GSoC ideas 
page [0]. I
think D is pretty much cut out for this kind of task and 
serialization is a
topic I'm rather interested in so hopefully it will be a great 
candidate.

# About me:
My name is Ahmet Sait Koçak, currently studying Computer Science in Turkey. I
first met D way back in high school, which is an interesting story in itself.

I was introduced to programming in my first year of high school, with C, for
some sort of competition. I then proceeded to learn C# myself and continued
coding as a hobby, having lots of fun. One of my biggest projects was LF2 IDE
[1] - a modding tool for the game LF2. It didn't take long for me to fall in
love with open source, since that project made use of several OSS libraries
itself. That led me to using version control, embracing git and open sourcing
nearly anything I code on GitHub from there on.

Fast forward 4 years: I was hitting walls trying to do low-level stuff in C#,
and the fact that bytecode-compiled languages are too easy to reverse
engineer was hindering my motivation to do anything commercial with them. I
never liked C++ but gave it another try, telling myself "come on, it's not
that bad", and failed miserably. There had to be a better way. Besides, I was
already crafting my dream language in my head.

One day, I sat in front of my computer and thought "I bet there is a language
called D". It was a once-in-a-lifetime magical moment, reading through the
home page and seeing that it was the same language I would have created
myself (static reflection, native compilation, GC...). My first project in D
was IDL [2]. It made it possible for LF2 IDE to hot reload modded data files
into the game's memory, and it was amazing working with slices for the first
time. I've been a D user and evangelist ever since.

# Overview:
- Improving & updating D implementation of flatbuffers and/or 
protobuf
- Contributing the D support to the upstream repositories
- Better documentation & samples
- Benchmarking and making sure D rocks

# Key Points:
- Meta-programming (DbI, CTFE, mixin...)
I plan to make D meta features shine in this library.

   - It should be possible to parse schema and output mixable D 
code at
     compile time
   const schema = `message Person
   {
       required string name = 1;
       required int32 id = 2;
   }`;
   mixin(fromProtoSchema(schema));

   - There should be no need for a schema definition, a custom 
type annotated
     with UDAs should be enough
   struct Person
   {
        @protoID(1) string name;
        @protoID(2) int age;
   }
   serialize(Person("Walter", 42), stdout);

- Simple things should be simple
It should be dead simple to do basic stuff:
   auto obj = deserialize!SomeType(stdin);
   serialize(obj, stdout);

- Complex things should be possible
The library should be flexible and extensible without modification

- Support for library and tool based usage
It should be usable as a library without any additional setup but 
also usable
as a schema compiler.

- Support for common Phobos types
Nullable, tuples, std.datetime, std.complex, std.bigint, 
containers...
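
To make the UDA point above a bit more concrete, here is a minimal sketch of how field numbers could be read back from annotations with static introspection. The `protoID` struct and the `fieldNumbers` helper are hypothetical names made up for this example, not the API of any existing library.

```d
import std.traits : getUDAs, hasUDA;

/// Hypothetical UDA carrying a field number.
struct protoID { uint id; }

struct Person
{
    @protoID(1) string name;
    @protoID(2) int age;
}

/// Collects the numbers attached via @protoID, in declaration order.
/// Fields without the annotation are skipped.
uint[] fieldNumbers(T)()
{
    uint[] ids;
    static foreach (i; 0 .. T.tupleof.length)
    {
        static if (hasUDA!(T.tupleof[i], protoID))
            ids ~= getUDAs!(T.tupleof[i], protoID)[0].id;
    }
    return ids;
}

unittest
{
    assert(fieldNumbers!Person() == [1u, 2u]);
}
```

A real serializer would walk the fields the same way and emit each value under its number instead of just collecting the numbers.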

Existing work:
https://github.com/huntlabs/flatbuffers
https://github.com/dcarp/protobuf-d
https://github.com/msoucy/dproto

I'm personally not happy with any of the existing libraries but 
they will
likely be a valuable resource regardless.

Questions:
- How much work would be ideal for GSoC? Should I be working on 
flatbuffers
   only or protobuf too? (Seems like flatbuffers need more love)
- Should I tackle the std.serialization [3] idea?
- Any other serialization related suggestions?
- Anything I'm missing?

I'm still not entirely sure about my project (probably gonna 
write a few
proposals) so if you have other suggestions do not hesitate. All 
kinds of
constructive feedback is welcome!


[0] 
https://wiki.dlang.org/GSOC_2018_Ideas#FlatBuffers_Support_and.2For_Improved_Protocol_Buffer_Support
[1] https://github.com/ahmetsait/LF2.IDE
[2] https://github.com/ahmetsait/IDL
[3] https://wiki.dlang.org/GSOC_2019_Ideas#std.serialization
Mar 28
Dragos Carp <dragoscarp gmail.com> writes:
Hi Ahmet,

welcome to the D forum.

As the author of protobuf-d, I'll try to give you some feedback on 
the points you made. I couldn't find the time to also do the 
FlatBuffers implementation, so my comments relate just to 
protobuf. If you are interested in doing the FlatBuffers work, I'll 
be more than happy to play the mentor role for you - I have some 
ideas there. But let's get to the existing, real stuff.

On Friday, 29 March 2019 at 00:18:40 UTC, Ahmet Sait wrote:
   - It should be possible to parse schema and output mixable D 
 code at
     compile time
   const schema = `message Person
   {
       required string name = 1;
       required int32 id = 2;
   }`;
   mixin(fromProtoSchema(schema));
I don't think that it is worth the effort.
1. A complete implementation of .proto file parsing is complicated (https://developers.google.com/protocol-buffers/docs/reference/proto3-spec).
2. Theoretically, protobuf definitions do not change often, and considering that compile-time parsing is somewhat slow, parsing them at every compilation is a drawback rather than a benefit.
3. A protoc plugin is the Protobuf-recommended way of parsing .proto definitions: https://developers.google.com/protocol-buffers/docs/proto3#generating
   - There should be no need for a schema definition, a custom 
 type annotated
     with UDAs should be enough
   struct Person
   {
        @protoID(1) string name;
        @protoID(2) int age;
   }
   serialize(Person("Walter", 42), stdout);
protobuf-d does that already, see the unittest for toProtobuf: https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/protobuf/encoding.d#L193
 - Simple things should be simple
 It should be dead simple to do basic stuff:
   auto obj = deserialize!SomeType(stdin);
   serialize(obj, stdout);
Again, protobuf-d has that: https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/protobuf/decoding.d#L214
 - Complex things should be possible
 The library should be flexible and extensible without 
 modification
toProtobuf, fromProtobuf, toJSONValue, fromJSONValue methods are protobuf customization points in protobuf-d. For an example see https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/protobuf/wrappers.d#L27-L54
 - Support for library and tool based usage
 It should be usable as a library without any additional setup 
 but also usable
 as a schema compiler.
protobuf-d is usable as a library, see https://github.com/huntlabs/grpc-dlang/blob/57c8fe9808f8e860c4b0668a83cdabd78b296ce5/dub.json#L9 Regarding the usage as a schema compiler, see my first comment.
 - Support for common Phobos types
 Nullable, tuples, std.datetime, std.complex, std.bigint, 
 containers...
Protobuf is a language-agnostic serialization format. Having .protobuf definitions for common Phobos types will just shift the problem somewhere else (i.e. to other programming languages). Nevertheless, Protobuf probably addresses the same problem by defining the "well-known" types (https://developers.google.com/protocol-buffers/docs/reference/google.protobuf). protobuf-d also supports those, so that std.datetime.SysTime is mapped to google.protobuf.Timestamp and std.datetime.Duration to google.protobuf.Duration.
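As a rough illustration of what such a mapping involves (not protobuf-d's actual code): google.protobuf.Timestamp carries an int64 `seconds` and an int32 `nanos` field, so a SysTime can be split into those two parts. The `Timestamp` struct and `toTimestamp` function below are stand-ins written for this sketch, and it assumes times at or after the Unix epoch.

```d
import core.time : msecs;
import std.datetime : DateTime, SysTime, UTC;

/// Stand-in for google.protobuf.Timestamp (hypothetical, for illustration).
struct Timestamp
{
    long seconds; // whole seconds since 1970-01-01T00:00:00Z
    int nanos;    // fractional part in nanoseconds, 0 .. 999_999_999
}

/// Splits a SysTime into whole seconds and nanoseconds.
/// Assumes `t` is not before the Unix epoch.
Timestamp toTimestamp(SysTime t)
{
    return Timestamp(t.toUnixTime!long(),
                     cast(int) t.fracSecs.total!"nsecs");
}

unittest
{
    auto t = SysTime(DateTime(1970, 1, 1, 0, 0, 1), msecs(500), UTC());
    assert(toTimestamp(t) == Timestamp(1, 500_000_000));
}
```

SysTime has a resolution of 100 ns, so `nanos` produced this way is always a multiple of 100.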
 I'm personally not happy with any of the existing libraries but 
 they will
 likely be a valuable resource regardless.
The existing protobuf libraries are quite mature and probably improving those will be time better spent than starting once again from scratch.
 Questions:
 - How much work would be ideal for GSoC? Should I be working on 
 flatbuffers
   only or protobuf too? (Seems like flatbuffers need more love)
I'm quite satisfied with the protobuf-d implementation: it is small (approx. 4k LOC), clean and quite feature complete - 26 failing conformance tests vs. 27 and 41 for the official C++ and Java counterparts, respectively. Of course there is still room for improvement, but in the case of protobuf-d not enough for a GSoC application. On the other hand, FlatBuffers is a very good candidate: it has its own specialties, but is also somewhat similar to protobuf. This would reduce the planning risks considerably.
 - Should I tackle the std.serialization [3] idea?
I see std.serialization as a high-level API. It will probably live for a long time as std.experimental.serialization, and it will take quite some time until multiple serialization formats implement it. Only after that, if it ever happens, can we remove the "experimental" part. I don't see this as a suitable GSoC project.
 - Any other serialization related suggestions?
https://arrow.apache.org/

Cheers,
Dragos
Mar 29
Ahmet Sait <nightmarex1337 hotmail.com> writes:
On Friday, 29 March 2019 at 23:19:10 UTC, Dragos Carp wrote:
 Hi Ahmet,

 welcome to the D forum.

 As the author of protobuf-d, I'll try to give you some feedback 
 on the points you made. I couldn't find the time to also do the 
 FlatBuffers implementation, so my comments relate just to 
 protobuf. If you are interested in doing the FlatBuffers work, 
 I'll be more than happy to play the mentor role for you - I 
 have some ideas there. But let's get to the existing, real 
 stuff.
Glad to hear, thanks!
 On Friday, 29 March 2019 at 00:18:40 UTC, Ahmet Sait wrote:
   - It should be possible to parse schema and output mixable D 
 code at
     compile time
   const schema = `message Person
   {
       required string name = 1;
       required int32 id = 2;
   }`;
   mixin(fromProtoSchema(schema));
I don't think that it is worth the effort.
1. A complete implementation of .proto file parsing is complicated (https://developers.google.com/protocol-buffers/docs/reference/proto3-spec).
2. Theoretically, protobuf definitions do not change often, and considering that compile-time parsing is somewhat slow, parsing them at every compilation is a drawback rather than a benefit.
3. A protoc plugin is the Protobuf-recommended way of parsing .proto definitions: https://developers.google.com/protocol-buffers/docs/proto3#generating
It doesn't immediately strike me as complicated and https://github.com/msoucy/dproto apparently has this feature so I'm guessing it can be used as a reference. Compile times are of course not expected to be good with this approach but it's promising if Stefan's New CTFE gets completed in the future. Then again you likely have more experience about this so I should probably defer this to when New CTFE is ready.
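For reference, the compile-time approach under discussion can be sketched with a toy generator. This is only an illustration: `fromToySchema` is a made-up helper that understands nothing beyond `message Name`, braces, and `required string|int32 field = N;` lines, which is far from a real .proto parser.

```d
import std.array : split;
import std.string : splitLines, strip;

/// Toy generator (hypothetical): turns a tiny subset of .proto syntax
/// into D source text. Field numbers are parsed past but ignored here.
string fromToySchema(string schema)
{
    string code;
    foreach (line; schema.splitLines)
    {
        auto words = line.strip.split; // split on whitespace
        if (words.length == 2 && words[0] == "message")
            code ~= "struct " ~ words[1] ~ "\n{\n";
        else if (words.length == 5 && words[0] == "required")
        {
            // Map the one non-D type name this sketch knows about.
            const dType = words[1] == "int32" ? "int" : words[1];
            code ~= "    " ~ dType ~ " " ~ words[2] ~ ";\n";
        }
        else if (words.length == 1 && words[0] == "}")
            code ~= "}\n";
    }
    return code;
}

enum schema = `message Person
{
    required string name = 1;
    required int32 id = 2;
}`;

// Runs fromToySchema through CTFE and declares `struct Person`.
mixin(fromToySchema(schema));

unittest
{
    auto p = Person("Walter", 42);
    assert(p.name == "Walter" && p.id == 42);
}
```

The generator runs again on every compilation, which is the cost weighed above against a protoc plugin that generates the D code once.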
   - There should be no need for a schema definition, a custom 
 type annotated
     with UDAs should be enough
   struct Person
   {
        @protoID(1) string name;
        @protoID(2) int age;
   }
   serialize(Person("Walter", 42), stdout);
protobuf-d does that already, see the unittest for toProtobuf: https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/protobuf/encoding.d#L193
 - Simple things should be simple
 It should be dead simple to do basic stuff:
   auto obj = deserialize!SomeType(stdin);
   serialize(obj, stdout);
Again, protobuf-d has that: https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/prot
I assumed it wasn't the case since examples folder didn't have such code, thanks for pointing out.
 - Complex things should be possible
 The library should be flexible and extensible without 
 modification
toProtobuf, fromProtobuf, toJSONValue, fromJSONValue methods are protobuf customization points in protobuf-d. For an example see https://github.com/dcarp/protobuf-d/blob/3f8a1a5129c98920e1652e965004ac77e9bb8ef1/src/google/protobuf/wrappers.d#L27-L54
 - Support for library and tool based usage
 It should be usable as a library without any additional setup 
 but also usable
 as a schema compiler.
protobuf-d is usable as a library, see https://github.com/huntlabs/grpc-dlang/blob/57c8fe9808f8e860c4b0668a83cdabd78b296ce5/dub.json#L9 Regarding the usage as a schema compiler, see my first comment.
These are basically a checklist I want to fill, whether or not the features already exist. Say, if I were to write flatbuffers-d, I would want to implement them.
 - Support for common Phobos types
 Nullable, tuples, std.datetime, std.complex, std.bigint, 
 containers...
Protobuf is a language-agnostic serialization format. Having .protobuf definitions for common Phobos types will just shift the problem somewhere else (i.e. to other programming languages). Nevertheless, Protobuf probably addresses the same problem by defining the "well-known" types (https://developers.google.com/protocol-buffers/docs/reference/google.protobuf). protobuf-d also supports those, so that std.datetime.SysTime is mapped to google.protobuf.Timestamp and std.datetime.Duration to google.protobuf.Duration.
Makes sense. I'm of the opinion that the API should support common types if there is a direct correspondence or a well-established convention for the type in question.
 I'm personally not happy with any of the existing libraries 
 but they will
 likely be a valuable resource regardless.
The existing protobuf libraries are quite mature and probably improving those will be time better spent than starting once again from scratch.
I feel like there is some lack of documentation since none of those things you mentioned are obvious looking at the repo. Nevertheless, I'm happy to hear that protobuf-d is mature & feature complete.
 Questions:
 - How much work would be ideal for GSoC? Should I be working 
 on flatbuffers
   only or protobuf too? (Seems like flatbuffers need more love)
I'm quite satisfied with the protobuf-d implementation: it is small (approx. 4k LOC), clean and quite feature complete - 26 failing conformance tests vs. 27 and 41 for the official C++ and Java counterparts, respectively. Of course there is still room for improvement, but in the case of protobuf-d not enough for a GSoC application. On the other hand, FlatBuffers is a very good candidate: it has its own specialties, but is also somewhat similar to protobuf. This would reduce the planning risks considerably.
Agreed, I'm going to focus on flatbuffers in my proposal then.
 - Should I tackle the std.serialization [3] idea?
I see std.serialization as a high-level API. It will probably live for a long time as std.experimental.serialization, and it will take quite some time until multiple serialization formats implement it. Only after that, if it ever happens, can we remove the "experimental" part. I don't see this as a suitable GSoC project.
I see, thanks for the feedback.
 - Any other serialization related suggestions?
https://arrow.apache.org/
Thanks, I'll take a look.
Apr 01
Ahmet Sait <nightmarex1337 hotmail.com> writes:
https://docs.google.com/document/d/1kFXDbs-LLsIW5nTIt8EZkNBq3N6vayaEw25IcuoGxgI/edit?usp=sharing

Seeking some feedback, thanks in advance..!
Apr 04
Jacob Carlborg <doob me.com> writes:
On 2019-04-04 18:43, Ahmet Sait wrote:
 https://docs.google.com/document/d/1kFXDbs-LLsIW5nTIt8EZkNBq3N6vayaEw25IcuoGxgI/edit?usp=sharing
 
 Seeking some feedback, thanks in advance..!
I think "Contributing D support to the upstream repositories" might be a hurdle. You never know how much time someone else will have to review pull requests.

"Using D traits, UDAs and static introspection, it is possible to generate flatbuffer accessors without a schema file"

I don't know how flatbuffer works, but are accessors necessary?

It might be interesting to specify whether you have any requirements that it should work with any of the attributes "nothrow", "@safe", "pure", "@nogc" and the betterC subset.

-- 
/Jacob Carlborg
Apr 04
Ahmet Sait <nightmarex1337 hotmail.com> writes:
On Thursday, 4 April 2019 at 18:27:05 UTC, Jacob Carlborg wrote:
 On 2019-04-04 18:43, Ahmet Sait wrote:
 https://docs.google.com/document/d/1kFXDbs-LLsIW5nTIt8EZkNBq3N6vayaEw25IcuoGxgI/edit?usp=sharing
 
 Seeking some feedback, thanks in advance..!
I think "Contributing D support to the upstream repositories" might be a hurdle. You never know how much time someone else will have to review pull requests.
That's what I thought too, but at least I want the project to be in a state where I can make a PR to upstream, which admittedly is not a clear/measurable criterion.
 "Using D traits, UDAs and static introspection, it is possible 
 to generate flatbuffer accessors without a schema file"

 I don't know how flatbuffer works, but are accessors necessary?
AFAIU, accessors make vector (array) fields and backward/forward compatibility possible. I'm still learning so don't count on me.
 It might be interesting to specify whether you have any 
 requirements that it should work with any of the attributes 
 "nothrow", "@safe", "pure", "@nogc" and the betterC subset.
This is something that came to my mind after the fact (since I don't bother with attributes much), and I haven't decided yet. It makes a lot of sense to provide @nogc functionality for potential RPC protocol usage (not high priority right now); I'm not sure about the others.
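To make the attribute question concrete, here is a small sketch of what an attribute-clean building block could look like. FlatBuffers stores scalars in little-endian order; the function name `writeUint` and its signature are made up for this example and are not taken from any existing library.

```d
/// Writes `value` as 4 little-endian bytes into a caller-provided buffer
/// starting at `pos` and returns the position just past the written bytes.
/// No allocation and no exceptions; an out-of-range `pos` trips the usual
/// array bounds check (an Error, which nothrow permits).
size_t writeUint(ubyte[] buf, size_t pos, uint value) @safe pure nothrow @nogc
{
    foreach (i; 0 .. 4)
        buf[pos + i] = cast(ubyte)(value >> (8 * i));
    return pos + 4;
}

unittest
{
    ubyte[8] storage;
    const next = writeUint(storage[], 0, 0x01020304);
    assert(next == 4);
    assert(storage[0 .. 4] == [0x04, 0x03, 0x02, 0x01]);
}
```

Writing into a caller-owned buffer is what keeps the function @nogc; anything that grows a buffer would need an allocator strategy decided up front.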
Apr 04
Dragos Carp <dragoscarp gmail.com> writes:
On Thursday, 4 April 2019 at 16:43:44 UTC, Ahmet Sait wrote:
 https://docs.google.com/document/d/1kFXDbs-LLsIW5nTIt8EZkNBq3N6vayaEw25IcuoGxgI/edit?usp=sharing

 Seeking some feedback, thanks in advance..!
I added some comments directly in the document.
Apr 04
Ahmet Sait <nightmarex1337 hotmail.com> writes:
On Thursday, 4 April 2019 at 19:54:03 UTC, Dragos Carp wrote:
 On Thursday, 4 April 2019 at 16:43:44 UTC, Ahmet Sait wrote:
 https://docs.google.com/document/d/1kFXDbs-LLsIW5nTIt8EZkNBq3N6vayaEw25IcuoGxgI/edit?usp=sharing

 Seeking some feedback, thanks in advance..!
I added some comments directly in the document.
Unless some final touches are necessary, I'm about to submit my final proposal.
Apr 09
Jacob Carlborg <doob me.com> writes:
On 2019-03-29 01:18, Ahmet Sait wrote:

 - Should I tackle the std.serialization [3] idea?
I would love that. FlatBuffers or Protobuf could be one of the backends. Although you might need to implement more than one backend to make sure the frontend API actually is general enough to support multiple backends. Ideally two completely different kinds of backends, like FlatBuffers and JSON, for example.

-- 
/Jacob Carlborg
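
As a rough sketch of the frontend/backend split: the frontend walks the fields of a struct and hands each one to whatever backend it is given. All names here (`serialize`, `put`, `TextArchive`, `CountingArchive`) are hypothetical and not taken from orange or any std.serialization draft.

```d
import std.array : Appender;
import std.conv : to;

/// Hypothetical backend 1: renders fields as "name=value;" text.
struct TextArchive
{
    Appender!string sink;

    void put(T)(string name, T value)
    {
        sink.put(name);
        sink.put("=");
        sink.put(value.to!string);
        sink.put(";");
    }
}

/// Hypothetical backend 2: only counts the fields it is handed.
struct CountingArchive
{
    size_t fields;

    void put(T)(string name, T value) { ++fields; }
}

/// Frontend: knows how to walk a struct, nothing about the output format.
void serialize(Archive, T)(ref Archive archive, T value)
if (is(T == struct))
{
    foreach (i, member; value.tupleof)
        archive.put(__traits(identifier, T.tupleof[i]), member);
}

struct Point { int x; int y; }

unittest
{
    TextArchive text;
    serialize(text, Point(1, 2));
    assert(text.sink.data == "x=1;y=2;");

    CountingArchive counter;
    serialize(counter, Point(1, 2));
    assert(counter.fields == 2);
}
```

Two backends this similar prove very little, of course; a binary offset-based format next to a text format would stress the `put` interface much harder.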
Apr 01
Ahmet Sait <nightmarex1337 hotmail.com> writes:
On Monday, 1 April 2019 at 09:57:08 UTC, Jacob Carlborg wrote:
 On 2019-03-29 01:18, Ahmet Sait wrote:

 - Should I tackle the std.serialization [3] idea?
I would love that. FlatBuffers or Protobuf could be one of the backends. Although you might need to implement more than one backend to make sure the frontend API actually is general enough to support multiple backends. Ideally two completely different kinds of backends, like FlatBuffers and JSON, for example.
Thanks for the feedback! I decided I should first gather some experience building a serialization library before thinking about designing std.serialization. Also, may I ask you questions while working on my project (since you're the author of the orange lib and have experience)?
Apr 01
Jacob Carlborg <doob me.com> writes:
On 2019-04-02 02:05, Ahmet Sait wrote:

 Also, I want to know if I can ask you questions when working on my 
 project (since you're the author of orange lib and have experience) ?
Sure, please do.

-- 
/Jacob Carlborg
Apr 02