digitalmars.D - [GSoC] 'Independency of D from the C Standard Library' progress and

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
I'm moving forward with the D implementations of the C parts that the D Runtime
uses. Here is a not-so-small description of the project. I hope it will be
descriptive enough to clear things up for people, as I probably did not do a
good job in the previous public discussions about this project.

     The goal of this project is to remove the dependency of the D Runtime
     on the C Standard Library. Currently, the D Runtime uses a small part
     of the C Standard Library, namely:
         a) The string.h family of functions: memcpy(), memmove(), etc.
         b) The standard allocator functions: malloc(), free(), etc.

     Those don't justify the dependency on the C Standard Library, as only a very
     small part of it is utilized. Moreover, the dependency comes with problems:
         1) C’s implementations are not type-safe and memory-safe.
         2) C’s implementations have accumulated a lot of cruft over the years.
         3) Cross-compiling is more difficult, as one now needs to have a C runtime
         and toolchain available and configured in addition to the D runtime. This
         makes it difficult to create freestanding software in D.

     So, this project will provide alternative implementations of these functions,
     dependent only on the D Runtime. We hope that in the process, we will
     leverage D features that C doesn't have:
         1) Type-safety and memory safety (bounds-checking etc.)
         2) Templates to branch to an optimal implementation at compile-time.
         3) Inlining, as the branching in C happens at runtime.
         4) Compile-Time Function Execution (CTFE) and introspection (type info).
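
To make point 2 a bit more concrete, here is a minimal sketch of compile-time branching on size. It is only an illustration under my own assumptions: the names `copyBytes` and `copyObject` are hypothetical, the threshold of 16 bytes is arbitrary, and this is not the memcpyD code.

```d
// Hypothetical helper names; a sketch only, not the memcpyD implementation.

/// Copies exactly `n` bytes. `n` is a template argument, so the branch
/// below is resolved at compile time and each instantiation contains
/// only the code for its own size.
void copyBytes(size_t n)(ubyte* dst, const(ubyte)* src)
{
    static if (n <= 16)
    {
        // Small sizes: fully unrolled, no loop and no runtime size check.
        static foreach (i; 0 .. n)
            dst[i] = src[i];
    }
    else
    {
        // Larger sizes: a plain byte loop (a tuned implementation would
        // use wider loads and stores here).
        foreach (i; 0 .. n)
            dst[i] = src[i];
    }
}

/// Typed front end: the size comes from the type, not from a runtime argument.
void copyObject(T)(ref T dst, ref const T src) @trusted
if (__traits(isPOD, T))
{
    copyBytes!(T.sizeof)(cast(ubyte*) &dst, cast(const(ubyte)*) &src);
}

void main()
{
    static struct Pair { int a; double b; }
    static struct Big { ubyte[64] data; }

    Pair p = Pair(1, 2.5), q;
    copyObject(q, p);   // instantiates the unrolled branch
    assert(q == p);

    Big x, y;
    x.data[] = 42;
    copyObject(y, x);   // instantiates the loop branch
    assert(y == x);
}
```

Because the size is part of the template instantiation, the caller never pays for a runtime size dispatch, which is the property points 2 and 3 rely on.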

     Important clarifications:
         1) It will not use the C Standard Library.
         2) The C Standard Library will still be available.
         3) We target the D Runtime and not the user (although, of course, users
         will be able to use it).
         4) We will provide a different interface from the C implementations,
         with the aim of being more idiomatic D.
         5) Matching or beating libc's performance is not a hard constraint. We
         might manage to reach it, but we might not. The hard constraint is that
         it will at least be close.

This month
==========

Implementation of string.h family:
     -- Week 1-2: Handling mis-alignment in memcpy().
     -- Week 3: memmove()
     -- Week 4: memset()

As a starting point, I matched the performance of libc memcpy for (big) aligned data (hopefully, the implementation will become part of memcpyD in time).
Now, along with Mike Franklin's previous work, memcpyD() is faster than libc memcpy for small data (less than 32768) and as fast for big data.

Next month
==========
Mike Franklin initially proposed the idea that we may be able to do
something better than implementing malloc() and free() again, just in D.
So, we decided that the best option is to integrate std.experimental.allocator
into the D Runtime. That involves creating an allocator out of its building blocks
and removing the dependency on Phobos.

Week 1: Create a basic allocator using the std.experimental.allocator interface
Week 2: Replace malloc(), free(), realloc() in the D Runtime
Week 3: Iterate until we have good benchmarks.
Week 4: Remove Phobos dependencies from the allocator.
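
As a rough idea of what "an allocator out of building blocks" can mean, here is a small sketch that composes existing Phobos pieces. The composition, the 64-byte threshold and the alias names are my own assumptions for illustration; it also still uses `Mallocator` (libc malloc) as the backing store, which the final design would presumably replace with something like an mmap-based source.

```d
import std.experimental.allocator.building_blocks.free_list : FreeList;
import std.experimental.allocator.building_blocks.segregator : Segregator;
import std.experimental.allocator.mallocator : Mallocator;

// Hypothetical composition: requests of up to 64 bytes are served from a
// free list that recycles blocks, anything larger goes straight to the
// backing allocator.
alias SmallBlocks = FreeList!(Mallocator, 0, 64);
alias Composite = Segregator!(64, SmallBlocks, Mallocator);

void main()
{
    Composite a;

    void[] small = a.allocate(32);    // routed to the free list
    void[] large = a.allocate(4096);  // routed to Mallocator
    assert(small.length == 32);
    assert(large.length == 4096);

    a.deallocate(small);              // block is kept on the free list for reuse
    a.deallocate(large);
}
```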

Blockers
========
No major ones.
May 31
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     Important clarifications:
I forgot to mention that it targets x86_64.
May 31
Jacob Carlborg <doob me.com> writes:
On Friday, 31 May 2019 at 21:40:11 UTC, Stefanos Baziotis wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     Important clarifications:
 Forgot that it targets x86_64.
And which OS?
--
/Jacob Carlborg
Jun 02
sarn <sarn theartofmachinery.com> writes:
On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 I'm moving forward with the D implementations of the C parts 
 that the D Runtime
 uses.
Hi Stefanos, good project :)

Here's something to consider if you're replacing malloc() et al: it's popular (especially with large server deployments) to tune application memory allocation performance by replacing libc malloc() with alternatives such as tcmalloc and jemalloc. That works because they use the same libc malloc() API but with a different implementation, injected at link or load time (using LD_PRELOAD or something).

It would be great if D code can still take advantage of alternative allocators developed by third-parties who may or may not be writing for D.
May 31
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 I'm moving forward with the D implementations of the C parts 
 that the D Runtime
 uses.
 Hi Stefanos, good project :)
Thank you!
 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.  
 That works because they use the same libc malloc() API but with 
 a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.
As you can see in the "Next month" section above, we're planning to replace malloc() et al. but with a different interface. The reason is that we believe this is more idiomatic D (I personally also believe that malloc(), free() etc. have a bad interface for allocation). We even hope that in the end (probably after GSoC) the allocator will be typed. But the allocators you proposed might be an inspiration for the allocator I will build using the std.experimental.allocator interface. Moreover, let me stress that malloc(), free() etc. will be available as well.
Jun 01
sarn <sarn theartofmachinery.com> writes:
On Saturday, 1 June 2019 at 14:18:25 UTC, Stefanos Baziotis wrote:
 Moreover, let me stress that malloc(), free().. will be 
 available as well.
Do you mean you're planning to allow the stdlib's allocation backend to be switched completely to libc-style malloc() and free(), or just that developers can always import core.stdc.stdlib and call malloc() if they like? (The second option won't be enough.)

One option is to design D's allocation so that users can link with wrapped versions of tcmalloc, etc. However, it's important that this be designed properly so that it doesn't require a custom compiler toolchain, otherwise it'll just be a theoretical thing that no one actually does. Preferably it would work with LD_PRELOAD.

I like the idea of moving beyond libc's API, but please consider and test this use case. A lot of smart people outside D are working on allocators, and it would be a major disadvantage if D can't use them.
Jun 01
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 1 June 2019 at 22:45:40 UTC, sarn wrote:
 Do you mean you're planning to allow the stdlib's allocation 
 backend to be switched completely to libc-style malloc() and 
 free()
Currently, it is using malloc() and free(). Maybe you mean moving away from them?
 or just that developers can always import core.stdc.stdlib and 
 call malloc() if they like?  (The second option won't be 
 enough.)
They will be able to, because libc is not going anywhere. The purpose is to create an allocator _for the D Runtime_. Of course this allocator will be available for users to use as well. It's just that the focus will be there.

Our initial plan was to make a D version of malloc() and free(). But, as Mike first suggested, we have the chance to create a more D-style allocator. And fortunately, the foundation has already been built in std.experimental.allocator. And as a personal opinion, the interface of malloc() and free() is not ideal for an allocator. From what I know, a lot of people working on allocators seem to have the same opinion.

Just to disambiguate again, the purpose is that the D Runtime won't depend on libc.
 One option is to design D's allocation so that users can link 
 with wrapped versions of tcmalloc, etc.  However, it's 
 important that this be designed properly so that it doesn't 
 require a custom compiler toolchain, otherwise it'll just be a 
 theoretical thing that no one actually does.  Preferably it 
 would work with LD_PRELOAD.
Well, the thing is that to wrap an allocator, you first have to either write the allocator in D, or create a dependency on that allocator. Our choice is not exactly the first, but close to it. Meaning, I won't port any existing allocator, but the allocator I will write will of course be inspired by the work of others.

Now, the important thing here is that I only have so much time. It's only a summer, which is not even completely devoted to the allocator (it's about half the time). So, hopefully, either I or other people will continue the work post-GSoC.
 I like the idea of moving beyond libc's API, but please 
 consider and test this use case.  A lot of smart people outside 
 D are working on allocators, and it would be a major 
 disadvantage if D can't use them.
As I said, it will be able to use them. The purpose is not to replace them in general, but specifically in the D Runtime. Be sure to check again the starting post in this thread for why we're doing this, and if there are any questions, please ask.

- Stefanos
Jun 01
Mike Franklin <slavo5150 yahoo.com> writes:
On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:

 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.  
 That works because they use the same libc malloc() API but with 
 a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.
std.experimental.allocator (https://dlang.org/phobos/std_experimental_allocator.html) supports an `IAllocator` interface (https://dlang.org/phobos/std_experimental_allocator.html#IAllocator). The way I envision this playing out is that when std.experimental.allocator is ported to druntime, callers would use the `IAllocator` interface. Therefore, any allocator conforming to that interface could potentially serve as druntime's allocator. In order to swap the allocator, one would only have to implement the `IAllocator` interface, potentially even using the `Mallocator` (https://dlang.org/phobos/std_experimental_allocator_mallocator.html), and make the swap.

Providing the machinery to make that convenient (compiler switches, runtime configuration, etc.) should probably not be in the scope of the GSoC project as it is already pressed for time, but that should only be a PR away for anyone who considers it a priority.

That being said, we recognize that change needs to happen gradually to not rock the boat. Therefore, even when this project is complete, it should probably still default to libc, with a `-preview` switch or something like that to allow users to opt in to the D allocator. Once there is sufficient experience in the real world with the D allocator, the defaults can potentially be swapped.

This GSoC project will attempt to remove libc as a hard, intrinsic dependency in druntime, and reduce it to a platform implementation detail. In other words, druntime will not depend on libc, but a specific platform's port of druntime might.

Mike
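
For readers unfamiliar with the interface-based swap described in this post, here is a small sketch of how it looks today in Phobos (not in druntime, where no such hook exists yet). It uses the process-wide `theAllocator` handle from std.experimental.allocator as a stand-in for whatever druntime would eventually expose; that mapping is an assumption made purely for illustration.

```d
import std.experimental.allocator : allocatorObject, dispose, makeArray, theAllocator;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Remember the current handle and restore it when we are done.
    auto previous = theAllocator;
    scope (exit) theAllocator = previous;

    // Wrap a concrete allocator behind the dynamic interface and install it.
    // Any allocator exposing the same primitives could be passed here,
    // e.g. a wrapper over a third-party malloc replacement.
    theAllocator = allocatorObject(Mallocator.instance);

    // Callers only see the interface, not the concrete allocator.
    int[] xs = theAllocator.makeArray!int(4, 7);
    assert(xs == [7, 7, 7, 7]);
    theAllocator.dispose(xs);
}
```

The price of this flexibility is one indirect call per allocation, since every request goes through the interface.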
Jun 01
sarn <sarn theartofmachinery.com> writes:
On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:

 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.
  That works because they use the same libc malloc() API but 
 with a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.
 std.experimental.allocator (https://dlang.org/phobos/std_experimental_allocator.html) supports an `IAllocator` interface (https://dlang.org/phobos/std_experimental_allocator.html#IAllocator). The way I envision this playing out is that when std.experimental.allocator is ported to druntime, callers would use the `IAllocator` interface. Therefore, any allocator conforming to that interface could potentially serve as druntime's allocator. In order to swap the allocator, one would only have to implement the `IAllocator` interface, potentially even using the `Mallocator` (https://dlang.org/phobos/std_experimental_allocator_mallocator.html), and make the swap.
Thanks, that makes sense. It sounds like a version spec that switches to Mallocator (or whatever) could do it, as long as it doesn't force a recompilation of the whole runtime library. (Even more convenient would be a runtime flag like --DRT-gcopt, but I'm guessing you'd want to make it happen at compile time.)
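
A compile-time switch of the kind mentioned here could look roughly like the sketch below. The version identifier `UseMallocator` and the alias `RuntimeAllocator` are made up for illustration; nothing like this exists in druntime today. Because the choice is an alias rather than an interface, there is no indirect call, but changing it does require recompiling the code that uses it.

```d
import std.experimental.allocator.gc_allocator : GCAllocator;
import std.experimental.allocator.mallocator : Mallocator;

// Hypothetical identifier, enabled with `-version=UseMallocator` on DMD
// (`--d-version=UseMallocator` on LDC).
version (UseMallocator)
    alias RuntimeAllocator = Mallocator;
else
    alias RuntimeAllocator = GCAllocator;

void main()
{
    // The call is resolved statically to whichever allocator was selected.
    void[] block = RuntimeAllocator.instance.allocate(64);
    assert(block.length == 64);
    RuntimeAllocator.instance.deallocate(block);
}
```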
Jun 02
Sebastiaan Koppe <mail skoppe.eu> writes:
On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 The way I envision this playing out is that when 
 std.experimental.allocator is ported to druntime
You probably don't need or want to port the whole of std.experimental.allocator to druntime. I recently looked at the GC in druntime and it has its own pools etc. If it didn't, then the mark phase would be a lot harder and slower. (according to my understanding...)

Therefore, for normal D programs, the only thing that makes sense is to implement the allocator that underlies the GC (an mmap or sbrk allocator). And be sure to make it pluggable.

What I am trying to say is that you can avoid porting the whole thing.
 use the `IAllocator` interface.  Therefore, any allocator 
 conforming to that interface could potentially serve as 
 druntime's allocator.
I am not a big fan of the IAllocator interface since it introduces a layer of indirection. There is no simple solution to avoid the indirection and get a pluggable allocator. Well, maybe a combination of ldc's weak and LTO. Dunno...

https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.weak.29
http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html
Jun 02
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Sunday, 2 June 2019 at 11:19:20 UTC, Sebastiaan Koppe wrote:
 On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 [...]
 You probably don't need or want to port the whole of std.experimental.allocator to druntime. I recently looked at the GC in druntime and it has its own pools etc. If it didn't, then the mark phase would be a lot harder and slower. (according to my understanding...) [...]
Sebastiaan, I don't have a good answer for you right now. std.experimental.allocator is quite new to me. I hope Mike can give you more insight until I start working on this part.
Jun 03
Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 2 June 2019 at 11:19:20 UTC, Sebastiaan Koppe wrote:

 What I am trying to say is that you can avoid porting the whole 
 thing.
Yes, that is understood. Only what is required to implement a malloc replacement is within the scope of the project.
 use the `IAllocator` interface.  Therefore, any allocator 
 conforming to that interface could potentially serve as 
 druntime's allocator.
 I am not a big fan of the IAllocator interface since it introduces a layer of indirection. There is no simple solution to avoid the indirection and get a pluggable allocator. Well, maybe a combination of ldc's weak and LTO. Dunno...

 https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.weak.29
 http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html
The project is pressed for time, so I'd like to stick with something known and well-documented. Perhaps IAllocator is not the right solution in the end, but still, implementing it and seeing how it fits into druntime should inform future directions, and perhaps even elicit some new ideas. Mike
Jun 03
Walter Bright <newshound2 digitalmars.com> writes:
Note that DMC++'s standard library is now Boost licensed, and so can be used.

https://github.com/DigitalMars/dmc/tree/master/src
May 31
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src
Thanks, I'll take a look. Do we have any benchmarks?
Jun 01
Walter Bright <newshound2 digitalmars.com> writes:
On 6/1/2019 7:24 AM, Stefanos Baziotis wrote:
 Do we have any benchmarks?
No.
Jun 02
Aurélien Plazzotta <here gmail.com> writes:
On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src
Do you think it is planned to remove all C++ dependencies from D as well?
Jun 01
Seb <seb wilzba.ch> writes:
On Saturday, 1 June 2019 at 19:50:08 UTC, Aurélien Plazzotta 
wrote:
 On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src
 Do you think it is planned to remove all C++ dependencies from D as well?
If I understand you correctly, this has already happened. DMD can be built today without the need for any C++ as it's written in 100% D.
Jun 01
Thomas Mader <thomas.mader gmail.com> writes:
On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     The goal of this project is to remove the dependency of the 
 D Runtime
     from the C Standard Library.
Cool project! Is it possible to follow the project somewhere on github?
Jun 01
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 1 June 2019 at 09:29:55 UTC, Thomas Mader wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     The goal of this project is to remove the dependency of 
 the D Runtime
     from the C Standard Library.
 Cool project!
Thanks!
 Is it possible to follow the project somewhere on github?
For now, you can follow this repo: https://github.com/JinShil/memcpyD related to memcpy. My mistake: I didn't fork it and make my changes through PRs, so you will see a commit "... from Stefanos". I'll post an update about where my experimentation will be visible.
Jun 01
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 1 June 2019 at 14:29:03 UTC, Stefanos Baziotis wrote:
 I'll post an update about where my experimentation will be 
 visible.
https://github.com/baziotis/Dmemcpy You can follow this repo for memcpy. In the future, probably I will merge all the string.h functions in one repo, but in the development stage I think it's better to have them on their own. Any feedback is greatly appreciated!
Jun 03
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/19 10:17 AM, Stefanos Baziotis wrote:
 On Saturday, 1 June 2019 at 14:29:03 UTC, Stefanos Baziotis wrote:
 I'll post an update about where my experimentation will be visible.
 https://github.com/baziotis/Dmemcpy You can follow this repo for memcpy. In the future, probably I will merge all the string.h functions in one repo, but in the development stage I think it's better to have them on their own. Any feedback is greatly appreciated!
At 512 lines including tests, it seems on the involved side. The benchmarks ought to show a hefty improvement to match. Are there benchmark results available?

Quoting the rationale from the motivation in another thread:

1) C’s implementations are not type-safe and memory-safe.
2) C’s implementations have accumulated a lot of cruft over the years.
3) Cross-compiling is more difficult as now one should have available and configured a C runtime and toolchain apart from the D runtime. This makes it difficult for D to create freestanding software.

And then the listed advantages of using D for implementation (renumbered):

4) Type-safety and memory safety (bounds-checking etc.)
5) Templates to branch to an optimal implementation at compile-time.
6) Inlining, as the branching in C happens at runtime.
7) Compile-Time Function Execution (CTFE) and introspection (type info).

My view on formulating motivation is simple: do it like a scientist. Argue the facts. If facts are not available, argue fundaments and universal principles. If such are not available, the motivation is too weak.

(1) checks the "facts" box but has the obvious comeback "then how about a 2-line trusted wrapper over memcpy?" that needs to be explained. Related, obviously people who reach for memcpy() are often not looking for a safe primitive. a[] = b[] is safe, syntactically simple, and could lower to anything including memcpy.

(2) is quite specious and really needs some evidence. Is cruft in memcpy really an issue? I looked at memcpy() implementations a while ago but didn't save bookmarks. Did a google search just now and found https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, which is very far from cruft-ridden. I do remember elaborate implementations of memcpy but so are (somewhat ironically) the 512 lines of the proposed implementation. I found one here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

No idea of its level of cruftiness, where it's used etc. The right way to argue (2) is to provide links to implementations that people can look at and decide without doubt, "yep, crufty".

(3) is... odd. Doesn't every machine ever come with a C implementation including a ready-to-link standard library? If not, isn't that a rarity? Again, that should be argued preemptively by the motivation section.

(4) brings again the wrapper argument
(5) is nice if and only if confirmed by benchmarks
(6) is also nice under the same conditions as (5)
(7) again... what's wrong with a wrapper that does if (__ctfe)

These considerations are built with memcpy() in mind. With malloc() we're looking at a completely different ballgame. Implementing malloc() from scratch is a very serious project that needs almost overwhelming motivation. The goal of std.experimental.allocator was to offer a flexible framework for implementing general and specialized allocators, but simply replacing malloc() is more difficult to argue. Also, achieving comparable performance will be difficult.
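
For concreteness, the wrapper idea raised in points (1), (4) and (7) might look roughly like the following sketch. `copyElements` is a hypothetical name, not an existing druntime or Phobos function, and the restriction to arithmetic element types is an assumption made here to keep the `@trusted` claim easy to justify.

```d
import core.stdc.string : memcpy;

/// Hypothetical wrapper: a length-checked, CTFE-capable front end that
/// still forwards to libc's memcpy at run time.
void copyElements(T)(T[] dst, const(T)[] src) @trusted
if (__traits(isArithmetic, T))
{
    // assert(0) is kept even in release builds, so the check cannot be compiled out.
    if (dst.length != src.length)
        assert(0, "copyElements: length mismatch");

    if (__ctfe)
    {
        // The C function has no body the compile-time interpreter can run.
        foreach (i, e; src)
            dst[i] = e;
    }
    else
    {
        immutable bytes = src.length * T.sizeof;
        immutable d = cast(size_t) dst.ptr;
        immutable s = cast(size_t) src.ptr;
        // memcpy is undefined for overlapping ranges, so reject them.
        if (d < s + bytes && s < d + bytes)
            assert(0, "copyElements: overlapping slices");
        memcpy(dst.ptr, src.ptr, bytes);
    }
}

// Evaluated at compile time through the __ctfe branch.
enum int[3] atCompileTime = () {
    int[3] d;
    copyElements(d[], [1, 2, 3]);
    return d;
}();
static assert(atCompileTime == [1, 2, 3]);

void main()
{
    int[4] src = [1, 2, 3, 4];
    int[4] dst;
    copyElements(dst[], src[]);   // run time: forwards to memcpy
    assert(dst == src);
}
```

Such a wrapper addresses safety and CTFE but keeps the libc dependency, so it does not speak to point (3).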
Jun 03
Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 3 June 2019 at 22:45:28 UTC, Andrei Alexandrescu wrote:

 At 512 lines including tests, it seems on the involved side. 
 The benchmarks ought to show a hefty improvement to match. Are 
 there benchmark results available?

 Quoting the rationale from the motivation in another thread:

 1) C’s implementations are not type-safe and memory-safe.
 2) C’s implementations have accumulated a lot of cruft over the 
 years.
 3) Cross-compiling is more difficult as now one should have 
 available and configured a C runtime and toolchain apart from 
 the D runtime. This makes it difficult for D to create 
 freestanding software.

 And then the listed advantages of using D for implementation 
 (renumbered):

 4) Type-safety and memory safety (bounds-checking etc.)
 5) Templates to branch to an optimal implementation at 
 compile-time.
 6) Inlining, as the branching in C happens at runtime.
 7) Compile-Time Function Execution (CTFE) and introspection 
 (type info).

 My view on formulating motivation is simple: do it like a 
 scientist. Argue the facts. If facts are not available, argue 
 fundaments and universal principles. If such are not available, 
 the motivation is too weak.

 (1) checks the "facts" box but has the obvious comeback "then 
 how about a 2-line trusted wrapper over memcpy?" that needs to 
 be explained. Related, obviously people who reach for memcpy() 
 are often not looking for a safe primitive. a[] = b[] is safe, 
 syntactically simple, and could lower to anything including 
 memcpy.

 (2) is quite specious and really needs some evidence. Is cruft 
 in memcpy really an issue? I looked at memcpy() implementations a 
 while ago but didn't save bookmarks. Did a google search just 
 now and found 
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, 
 which is very far from cruft-ridden. I do remember elaborate 
 implementations of memcpy but so are (somewhat ironically) the 
 512 lines of the proposed implementation. I found one here:

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

 No idea of its level of cruftiness, where it's used etc. The 
 right way to argue (2) is to provide links to implementations 
 that people can look at and decide without doubt, "yep, crufty".

 (3) is... odd. Doesn't every machine ever come with a C 
 implementation including a ready-to-link standard library? If 
 not, isn't that a rarity? Again, that should be argued 
 preemptively by the motivation section.

 (4) brings again the wrapper argument
 (5) is nice if and only if confirmed by benchmarks
 (6) is also nice under the same conditions as (5)
 (7) again... what's wrong with a wrapper that does if (__ctfe)

 These considerations are built with memcpy() in mind. With 
 malloc() we're looking at a completely different ballgame. 
 Implementing malloc() from scratch is a very serious project 
 that needs almost overwhelming motivation. The goal of 
 std.experimental.allocator was to offer a flexible framework 
 for implementing general and specialized allocators, but simply 
 replacing malloc() is more difficult to argue. Also, achieving 
 comparable performance will be difficult.
Stefanos, everything Andrei has said here is correct, but it is missing some perspective and does not consider everything we've discussed. Please STAY THE COURSE! Do not let this post discourage you. The time for questioning the merits of this proposal was 2 months ago; not now. Now that it is a full-fledged GSoC project you are tasked to do the best you can.

Andrei, I agree with everything you've said, but there's more to take into consideration. I have a response to some of the items you've mentioned, and maybe I'll post that later.

For now allow me to express that I'm quite disappointed that you are questioning the merit of this proposal when the time to do so was 2 months ago when the GSoC projects were being reviewed, and you were supposed to participate. The GSoC project is well underway and Stefanos now needs to see the project through to completion regardless of what anyone thinks about it. Please don't undermine this project or diminish the morale of our students with such posts.

At the moment we need feedback on the actual memcpy implementation, not whether you think this project is a good idea or not.

Stefanos, please don't let this post discourage you. Please STAY ON TASK.

Mike
Jun 03
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/19 7:35 PM, Mike Franklin wrote:
 Andrei, I agree with everything you've said, but there's more to take 
 into consideration.  I have a response to some of the items you've 
 mentioned, and maybe I'll post that later.
My point was not to cast doubt and debate away in the forum (I think we really should avoid the "weak writeup/strong forum argument" pattern like the plague), but instead to help improve the writeup of the motivation. Ideally the pertinent parts should be used to improve that section in the project. Essentially someone reading the motivation should gather a good understanding and have most relevant questions preempted.
Jun 03
Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Monday, 3 June 2019 at 23:35:21 UTC, Mike Franklin wrote:
 Stefanos, everything Andrei has said here is correct, but it is 
 missing some perspective and does not consider everything we've 
 discussed.  Please STAY THE COURSE!  Do not let this post 
 discourage you.  The time for questioning the merits of this 
 proposal was 2 months ago; not now.  Now that it is a 
 full-fledged GSoC project you are tasked to do the best you can.

 Andrei, I agree with everything you've said, but there's more 
 to take into consideration.  I have a response to some of the 
 items you've mentioned, and maybe I'll post that later.

 For now allow me to express that I'm quite disappointed that 
 you are questioning the merit of this proposal when the time to 
 do so was 2 months ago when the GSoC projects were being 
 reviewed, and you were supposed to participate.  The GSoC 
 project is well underway and Stefanos now needs to see the 
 project through to completion regardless of what anyone thinks 
 about it.  Please don't undermine this project or diminish the 
 morale of our students with such posts.

 At the moment we need feedback on the actual memcpy 
 implementation, not whether you think this project is a good 
 idea or not.

 Stefanos, please don't let this post discourage you.  Please 
 STAY ON TASK.
Thank you very much Mike! Andrei, I hear you as well and thank you for the feedback!

I want to say this: until about 5 days ago, even _I_ was unsure about the goals. And in the last 5 days of writing code, I'm getting more and more unsure. However, Mike is so ridiculously helpful that the last thing I want is to:
- Sound disheartening.
- Sound like the guy that was picked and doesn't believe in the project.
- Sound like the guy who has been writing D for 5 months and came to question Mike, jpf and anyone else involved in the project.

Please, for any constructive feedback, questions or anything that you're unsure about, direct the message to the most relevant person. The facts mentioned are not my responsibility (hopefully, for the better), so you probably want to ask the mentors. But my opinion was asked.

First of all, for memcpy et al.:
- To match memcpy's performance, you have to write assembly, not D. In the end the code will be bigger than memcpy because we will have the D improvement (what Mike has done), plus a memcpy-size implementation. (The two implementations that you posted are not the version that gets called. You can check (the horror) if you step in with a debugger. Mike had a link to what seems to be the actual implementation, but I can't find it.)
- Because of the above, that code will not be D, it will be assembly, which brings one to the question "Why not use the already made asm versions for the assembly parts (like the libc version) rather than re-write them yourself?".
- Matching memcpy, although I'm getting good benchmarks, is next to impossible in half a summer. Yet, this is what is expected.
- Personally, I don't use dynamic arrays. My D is mostly betterC. For me, if people use the D features, they would probably never use memcpy. And if they don't, then they would probably use a low-level (unsafer?) memcpy with pointers. However, this is targeted at the D Runtime, with which I don't have any experience. So, I trust the mentors.
- In my opinion, the best way to go about this is to get only the memcpy implementation linked (so remove the dependency on libc), create wrappers around it, something like Jonathan Marler's code, _and_ use D for the small sizes, where it shines (as Mike's work has already shown). That way you leverage the work that has been put into memcpy, write idiomatic D, remove the libc dependency and make a (way) faster memcpy for small sizes.

But, who am I to question things? And I don't claim in any way that, because my opinion is different from what we planned to do, I believe that I am correct "but hey, they decide.. right?". No. I just trust that they know better than me.

For malloc:
- The initial plan was that a malloc() would be written. Having tried to write my own malloc(), I say that I was pretty naive to think I would do a replacement from scratch in half a summer. Thankfully, something else was decided. The decision to use std.experimental.allocator was not mine. I learned about it probably less than a month ago. I can't say whether it's a good or bad decision, because I know too little to have any meaningful input. To me, it sounded good though. And again, I trust that the mentors know better (and it's not a final decision yet).

I don't want to sound rude. I'm grateful to the D community for giving me the chance to work on something so challenging. But the project, the goals, the motivation and the approach are not my responsibility. I, of course, have opinions about those, but my opinion is also that, for something that I'm not in charge of, it is better to try to help than to contradict, unless I think there's something _very_ wrong. And I already feel I have contradicted a lot.

In the end, if the motivation is too weak, if the approach is wrong and if the goals are not that desired, then why was the project picked? And why are those things questioned after 2 months?
Last but not least, while this may not be the best place for "famous last words", I want again to thank the mentors and especially Mike(!). Seb as well. This project, well.. let's just say it didn't have exactly the warmest feedback and their support is important.
Jun 03
prev sibling next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 3 June 2019 at 22:45:28 UTC, Andrei Alexandrescu wrote:

 At 512 lines including tests, it seems on the involved side. 
 The benchmarks ought to show a hefty improvement to match. Are 
 there benchmark results available?
I did some initial benchmarks at https://github.com/JinShil/memcpyD when I made the first feasibility study to see if this project was worth pursuing. The initial results were encouraging, which is why we're taking it further in this project. I'll work with Stefanos to get a more polished implementation that users can download and run for themselves.
 Quoting the rationale from the motivation in another thread:

 1) C’s implementations are not type-safe and memory-safe.
 2) C’s implementations have accumulated a lot of cruft over the 
 years.
 3) Cross-compiling is more difficult as now one should have 
 available and configured a C runtime and toolchain apart from 
 the D runtime. This makes it difficult for D to create 
 freestanding software.
 4) Type-safety and memory safety (bounds-checking etc.)
 5) Templates to branch to an optimal implementation at 
 compile-time.
 6) Inlining, as the branching in C happens at runtime.
 7) Compile-Time Function Execution (CTFE) and introspection 
 (type info).

 My view on formulating motivation is simple: do it like a 
 scientist. Argue the facts. If facts are not available, argue 
 fundaments and universal principles. If such are not available, 
 the motivation is too weak.
Yes, the motivation could be improved, but the time for motivating this project was 2 months ago, not now. Now the project is underway, and we need to see it to completion. The focus now should be on providing feedback on the implementations not the rationale/motivation.
 (1) checks the "facts" box but has the obvious comeback "then 
 how about a 2-line trusted wrapper over memcpy?" that needs to 
 be explained. Related, obviously people who reach for memcpy() 
 are often not looking for a safe primitive. a[] = b[] is safe, 
 syntactically simple, and could lower to anything including 
 memcpy.
Part of the motivation is so druntime no longer has a hard intrinsic dependency on libc. If you just wrap the libc function you're not achieving that goal. Now, that being said, it is way out of the scope of this project to provide a D implementation of memcpy for all platforms, architectures and microarchitectures that D supports. So, we need to deal with that.

Before I elaborate further, it's important to understand that druntime is currently a monolith that is not architected or structured properly. druntime is supposed to be the language implementation, not libc bindings, libc++ bindings, windows bindings, linux bindings, low-level code (whatever that means), etc. The language implementation *will* require certain features of the underlying operating system and hardware. Some of those features may be provided by libc, but that decision should be made on a platform-by-platform basis.

So what we hope to achieve with this project is an idiomatic-D memory copy/compare interface. That interface may simply forward to libc for those features that don't have an optimized D implementation. Other platforms may choose to implement a highly optimized implementation in D. Other platforms may choose to mix the two (e.g. an optimized D implementation for small copies, and forward to libc for large copies). Others may choose to just implement a simple while-loop because they either don't want to obtain a C toolchain (those cross-compiling to embedded targets) or because there isn't a C implementation available (new platforms like WASM). This project aims to remove druntime's dependency on libc, but the platform port of druntime may still choose to depend on it.

That being said, you might be wondering why we are bothering to implement an entire memcpy in D for the x86_64 architecture:
1) because DMD's implementation is suboptimal,
2) to help motivate the entire project,
3) to demonstrate D as a first-class systems programming language,
4) to set an example and precedent for other platforms to potentially follow.

Please keep in mind we're trying to expand D to more platforms, including resource-constrained embedded systems, OS programming, bare-metal applications, and new platforms such as WASM. We want D to be more easily portable, and that is partially achieved by making a platform abstraction, independent of libc. libc is a platform implementation detail.
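To make the "mix the two" option concrete, here is a minimal sketch, not the project's actual code. The name `copyObject` and the 64-byte cut-off `smallCopyLimit` are assumptions made for illustration; the only real API used is `memcpy` from `core.stdc.string`. The size test is a `static if`, so the choice between the D loop and libc is made at compile time for each type.

```d
import core.stdc.string : memcpy; // libc fallback; a platform port could replace it

enum smallCopyLimit = 64; // assumed cut-off, not a measured value

/// Copies one T from src to dst (hypothetical interface).
void copyObject(T)(ref T dst, ref const T src) @trusted
{
    static if (T.sizeof <= smallCopyLimit)
    {
        // The size is known at compile time, so this loop has a constant
        // trip count and needs no runtime dispatch on the size.
        auto d = cast(ubyte*) &dst;
        auto s = cast(const(ubyte)*) &src;
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
    else
    {
        // Large objects: forward to the platform's memcpy.
        memcpy(&dst, &src, T.sizeof);
    }
}

unittest
{
    int a = 0, b = 42;
    copyObject(a, b);
    assert(a == 42);

    static struct Big { ubyte[256] data; }
    Big x, y;
    y.data[] = 7;
    copyObject(x, y);
    assert(x.data == y.data);
}
```

A port without a C toolchain could drop the import and keep only the loop branch; callers of `copyObject` would not change.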
 (2) is quite specious and really needs some evidence. Is cruft 
 in memcpy really an issue? I looked memcpy() implementations a 
 while ago but didn't save bookmarks. Did a google search just 
 now and found 
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, 
 which is very far from cruft-ridden.
That is not the memcpy that is actually on your machine. You can find the more elaborate implementations here: https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=14ec2285c0f82b570bf872c5b9ff0a7f25724dfd;hb=HEAD Another from intel: https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
 I do remember elaborate implementations of memcpy but so are 
 (somewhat ironically) the 512 lines of the proposed 
 implementation. I found one here:

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

 No idea of its level of cruftiness, where it's used etc. The 
 right way to argue (2) is to provide links to implementations 
 that people can look at and decide without doubt, "yep, crufty".
The more elaborate C implementations are typically written in assembly. They are difficult to follow due to all of the various techniques to handle misalignment and the cleverness typically required to achieve the best performance. It is my hope that this project will explore how D can improve such implementations by reducing the cleverness to small isolated inline assembly blocks surrounded by D to make it easier to see the flow control. I think D can do that.
 (3) is... odd. Doesn't every machine ever come with a C 
 implementation including a ready-to-link standard library? If 
 not, isn't that a rarity? Again, that should be argued 
 preemptively by the motivation section.
Yes, it's a rarity, but nevertheless an artificial dependency for druntime. druntime does not sufficiently utilize libc to justify the hard dependency. It just needs a few memory utilities and an allocator. I think it's worthwhile to see if D can do just as well without libc. In fact, if I had my druthers, I'd remove libc's malloc altogether today and just add jemalloc to the druntime repository. Maybe it could even be mechanically translated to D.
 (4) brings again the wrapper argument
For some platforms, it may just be a wrapper.
 (5) is nice if and only if confirmed by benchmarks
We've already demonstrated this with benchmarks, I'll work with Stefanos to get them made available, but https://github.com/JinShil/memcpyD already shows the benefit.
 (6) is also nice under the same conditions as (5)
Yep, see my response to (5)
 (7) again... what's wrong with a wrapper that does if (__ctfe)
I think Stefanos is probably arguing in general about the design-by-introspection features of D which include CTFE and other metaprogramming features which is more-or-less the same as (5). Those benefits have been demonstrated, and we'll work to make those more apparent in the near future. That being said, there's nothing ruling out an `if (__ctfe)` block in the implementation if that's what is determined to be best.
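For readers unfamiliar with the idiom, an `if (__ctfe)` wrapper could look roughly like the sketch below. `copyBytes` and `roundTrip` are hypothetical names used only for illustration; `__ctfe` and `core.stdc.string.memcpy` are the real language feature and binding. At compile time the plain loop runs; at run time the call forwards to libc.

```d
import core.stdc.string : memcpy;

/// Hypothetical wrapper: same result at compile time and at run time.
void copyBytes(ubyte[] dst, const(ubyte)[] src) @trusted
{
    assert(dst.length == src.length, "length mismatch");
    if (__ctfe)
    {
        // CTFE cannot call into libc, so copy element by element.
        foreach (i; 0 .. src.length)
            dst[i] = src[i];
    }
    else
    {
        memcpy(dst.ptr, src.ptr, src.length);
    }
}

/// Small helper so the same code can be checked in both modes.
ubyte[4] roundTrip()
{
    ubyte[4] src = [1, 2, 3, 4];
    ubyte[4] dst;
    copyBytes(dst[], src[]);
    return dst;
}

static immutable ubyte[4] expected = [1, 2, 3, 4];
static assert(roundTrip() == expected); // evaluated by CTFE

unittest
{
    assert(roundTrip() == expected); // run-time path through memcpy
}
```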
 With malloc() we're looking at a completely different ballgame. 
 Implementing malloc() from scratch is a very serious project 
 that needs almost overwhelming motivation. The goal of 
 std.experimental.allocator was to offer a flexible framework 
 for implementing general and specialized allocators, but simply 
 replacing malloc() is more difficult to argue. Also, achieving 
 comparable performance will be difficult.
I agree with all of that, but we're going to try it anyway and see how it does. If all we achieve in the end is just a wrapper that forwards to libc's malloc and friends, it will still be better than what we have now, because libc will then be simply an implementation detail. Mike
Jun 03
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for motivating this 
 project was 2 months ago, not now.  Now the project is underway, and we 
 need to see it to completion.  The focus now should be on providing 
 feedback on the implementations not the rationale/motivation.
Mike, you must understand this is a terrible argument. It should never be made again. It is in fact the only part about your response that makes me genuinely worried. Benchmarks look good. This is not a pregnancy. Any time is good for asking about the motivation, and the difference 2 months make goes in favor of the student and mentor in that the answers to questions about the motivation are only stronger, clearer, and more convincing. I've seen PhD candidates roasted over their motivation (literally their "thesis", which means "proposition") on their DEFENSE day after years of hard work. It is not the fault of the person asking.
Jun 03
next sibling parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Tuesday, 4 June 2019 at 01:32:49 UTC, Andrei Alexandrescu 
wrote:
 On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for 
 motivating this project was 2 months ago, not now.
 Mike, you must understand this is a terrible argument. It should never be made again. It is in fact the only part about your response that makes me genuinely worried. Benchmarks look good. This is not a pregnancy. Any time is good for asking about the motivation, and the difference 2 months make goes in favor of the student and mentor in that the answers to questions about the motivation are only stronger, clearer, and more convincing.
I agree that a good time to ask is simply always. And that we should always argue if something doesn't seem right. But it's true that this was to be decided months ago. Moreover, for me it's important to consider the other side, and I think that's what Mike meant. Imagine that you're a GSoC student and you open the forum, 4 a.m.: "Aah... Here's a 1-page post about what you might be doing wrong with your project, already 1+ month working on it. Have a good night.." :p This is a personal opinion, but to me: Of course contradict the bad, but also help / motivate the good. These people (meaning any mentors and any GSoC student) chances are they're doing _something_ beneficial. For example..
 Benchmarks look good.
They're better now. ;)
Jun 03
prev sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 4 June 2019 at 01:32:49 UTC, Andrei Alexandrescu 
wrote:
 On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for 
 motivating this project was 2 months ago, not now.  Now the 
 project is underway, and we need to see it to completion.  The 
 focus now should be on providing feedback on the 
 implementations not the rationale/motivation.
 Mike, you must understand this is a terrible argument. It should never be made again. It is in fact the only part about your response that makes me genuinely worried. Benchmarks look good.
The point I'm trying to make is that we are in the coding stage of this project. Right now, students should be focused on getting the assignment done, not justifying the project to everyone. The project already went through a vetting process and was approved.
 This is not a pregnancy.
Thank you. I'm glad you were able to fulfill your sarcasm quota.
 Any time is good for asking about the motivation
Asking for more information about the motivation is one thing. Publicly doubting the motivation of a project that the D Language Foundation approved (a process you even participated in) is another.
Jun 03
parent reply KnightMare <black80 bk.ru> writes:
TL;DR
Should we pay attention to WASM, where there are no system things (mmap, 
allocators) and memory is an array of ints?
Jun 04
next sibling parent KnightMare <black80 bk.ru> writes:
On Tuesday, 4 June 2019 at 08:31:54 UTC, KnightMare wrote:
 TL;DR
 Should we attn to WASM where there are no system things (mmap, 
 allocators), where memory is an array of ints?
LDC can compile code to WASM already
Jun 04
prev sibling parent Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 4 June 2019 at 08:31:54 UTC, KnightMare wrote:
 TL;DR
 Should we attn to WASM where there are no system things (mmap, 
 allocators), where memory is an array of ints?
If any of the work from this project gets merged into druntime (which, it appears, will be an uphill battle), it should be easier to port druntime to new platforms like WASM. That is one of the motivations: to reduce the dependency on libc to a platform implementation detail that any platform can override/reimplement/supplement as needed without impacting any other platform. Mike
Jun 04
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2019 3:45 PM, Andrei Alexandrescu wrote:
 (2) is quite specious and really needs some evidence. Is cruft in memcpy 
 really an issue? I looked memcpy() implementations a while ago but didn't 
 save bookmarks. Did a google search just now and found 
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, which is 
 very far from cruft-ridden. I do remember elaborate implementations of 
 memcpy but so are (somewhat ironically) the 512 lines of the proposed 
 implementation. I found one here:
 
 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD
And here: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Jun 04
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Tuesday, 4 June 2019 at 22:18:20 UTC, Walter Bright wrote:
 And here:

 https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Please, consider again that this is not the version we're trying to compete with. Mike posted this link: https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=14ec2285c0f82b570bf872c5b9ff0a7f25724dfd;hb=HEAD This looks like the one that is called. Consider also the other link Mike posted: https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h Although I did not have time to benchmark, my guess is that this, which is one from Intel, is not at all enough against libc.
Jun 04
parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis 
wrote:

 This looks like the one that is called. Consider also the other 
 link Mike posted:
 https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h

 Although I did not have time to benchmark, my guess is that 
 this, which is one from Intel, is not at all enough against 
 libc.
I benchmarked the older rte_memcpy here (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) on a Linux virtual machine and rte_memcpy was quite a bit faster than libc. It's worth a deeper look. Mike
Jun 04
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
 On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis

 I benchmarked the older rte_memcpy here 
 (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) 
 on a Linux virtual machine and rte_memcpy was quite a bit faster than 
 libc.  It's worth a deeper look.

 Mike
Our Dmemcpy is faster than libc on a Linux virtual machine too. :p But yes, again, take what I said with a grain of salt, it's just an assumption. Indeed it deserves greater analysis.
Jun 04
parent reply Exil <Exil gmall.com> writes:
On Wednesday, 5 June 2019 at 01:21:20 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
 On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis

 I benchmarked the older rte_memcpy here 
 (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h) 
 on a Linux virtual machine and rte_memcpy was quite a bit faster than 
 libc.  It's worth a deeper look.

 Mike
 Our Dmemcpy is faster than libc on a Linux virtual machine too. :p But yes, again, take what I said with a grain of salt, it's just an assumption. Indeed it deserves greater analysis.
How did you compile the code? GCC and Clang both target baseline x64, to use features like AVX2 you have to enable them, that of course means that not all CPUs will be able to run the code, though it will run faster on those that do. I'd say this should include ARM as well, but there's one D compiler that doesn't support it so...
Jun 04
parent Mike Franklin <slavo5150 yahoo.com> writes:
On Wednesday, 5 June 2019 at 03:00:13 UTC, Exil wrote:

 How did you compile the code? GCC and Clang both target 
 baseline x64, to use features like AVX2 you have to enable 
 them, that of course means that not all CPUs will be able to 
 run the code, though it will run faster on those that do.
If you're referring to the rte_memcpy file, I compiled it with -march=native.
Jun 04
prev sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 The goal of this project is to remove the dependency of the D 
 Runtime from the C Standard Library.
An update regarding the project. There was a lot of turbulence in this project, so I'm sorry I did not post earlier.

Previous month
==============

During this month the goals were replacements for memcpy(), memmove() and memset(), named Dmemcpy, Dmemmove and Dmemset. Dmemcpy and Dmemmove are merged in one repo [1], and Dmemset is in [2].

The goal was to create fast versions of those, targeted at x86_64 and DMD. Because of that and because of the Blockers (more on those later), there is some inline ASM in those implementations. There was an effort to minimize it (currently it's only in Dmemcpy), because I was informed that pure D should be the first priority.

In the last week there was an effort to create a test suite and a benchmark suite for these repos. Quoting Mike and Johannes:

# Make sure the implementation works for all kinds of D types (basic types, structs, classes, static arrays, and dynamic arrays)
* Add naive implementations for now to fill the gaps.
// NOTE(stefanos): Meaning, when x86 is not available or in any case where my code cannot
// be compiled for the target, there should be a minimal pure D fall-back implementation.
// NOTE(stefanos): Classes are not tested, more on that in the Blockers.

# Separate benchmarks from tests
Anyone visiting the repository should be able to clone it and do something like `run tests` and `run benchmarks`.
2. Create a `run.d` file, a `tests.d` file and a `benchmarks.d` file
3. When the user executes `rdmd run.d tests` it should compile the `tests.d` file and execute it producing a test report.
4. When the user executes `rdmd run.d benchmarks` it should compile `benchmarks.d`, execute it producing a benchmark report.
// NOTE(stefanos): I'm relatively satisfied with Dmemset. Dmemmove got better in the last 3
// days but it probably still needs review / more work.

# Use the `tests.d` file to implement a thorough test suite for each repository including edge cases.
* It should test each kind of type (basic types, structs, classes, static arrays, and dynamic arrays).
// NOTE(stefanos): Again, for the classes refer to the Blockers.
* Where relevant it should include a test of all interesting sizes.
* Where relevant, it should test all variations of alignment up to 32. This includes aligned-src & aligned-dst, unaligned-src & unaligned-dst, aligned-src & unaligned-dst, and unaligned-src and aligned-dst. A nested foreach loop (e.g. `foreach (srcOffset; alignments) { foreach (dstOffset; alignments) { ... } }`) should cover it.
// NOTE(stefanos): This is not done as proposed here. I had my own variation
// for alignment testing and this alternative was to be considered. My own, and this,
// still need review.
* For memmove it should test all variations of overlap: no overlap, exact overlap, source leading destination, destination leading source, etc...
* Make sure each repository passes the test suite
* Make sure the tests are easily comprehensible. Keep them simple so any visitor to the repository can easily verify that the test suite is thorough.
* Be sure the tests cover all implementations.

# Use the `benchmarks.d` file to create a benchmark suite for each repository
* Benchmark all sizes from at least 0~512 (preferably up to 1024). After 1024 exponentially increasing sizes up to at least 65536. They do not need to be powers of 2; consider even powers of 10 so it is easy to graph on a logarithmic scale. An average of alignments is good for an overview, but the user should also be able to pick a single size and see how it performs for all variations of alignments.
// NOTE(stefanos): I don't test that many sizes in the experimental branch since the compile
// time explodes, to the point that it freezes Visual Studio.
// But I should have added a logarithmic scale, that was an oversight.
* Be sure the benchmark is thorough enough to cover all implementations.

There is of course a lot to be said about the actual implementations and the decisions taken but I guess the post would be very big, so I decided to focus on the final goals and on the blockers. Please feel free to ask more specific questions on the implementations.

[1] https://github.com/baziotis/Dmemmove/tree/experimental - experimental branch
[2] https://github.com/baziotis/Dmemset
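As an illustration of the nested-foreach alignment test quoted above, here is a minimal sketch; it is not the test code in the repositories. `CopyFn`, `naiveCopy` and `checkAlignments` are hypothetical names, and `naiveCopy` only stands in for the implementation under test. The buffers are offset relative to their allocation base, and 32 consecutive offsets hit every address residue modulo 32 whatever the base alignment is.

```d
alias CopyFn = void function(ubyte* dst, const(ubyte)* src, size_t n);

/// Naive reference copy, standing in for the implementation under test.
void naiveCopy(ubyte* dst, const(ubyte)* src, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = src[i];
}

/// Returns true if `copy` is correct for every src/dst offset pair.
bool checkAlignments(CopyFn copy, size_t n)
{
    enum maxOffset = 32;
    auto src = new ubyte[n + maxOffset];
    auto dst = new ubyte[n + maxOffset];
    foreach (i, ref b; src)
        b = cast(ubyte) i; // recognizable pattern

    foreach (srcOffset; 0 .. maxOffset)
    {
        foreach (dstOffset; 0 .. maxOffset)
        {
            dst[] = 0xAA; // guard value
            copy(dst.ptr + dstOffset, src.ptr + srcOffset, n);

            // The copied window must match the source window.
            if (dst[dstOffset .. dstOffset + n] != src[srcOffset .. srcOffset + n])
                return false;
            // Bytes outside the destination window must be untouched.
            foreach (b; dst[0 .. dstOffset])
                if (b != 0xAA) return false;
            foreach (b; dst[dstOffset + n .. $])
                if (b != 0xAA) return false;
        }
    }
    return true;
}

unittest
{
    assert(checkAlignments(&naiveCopy, 100));
}
```

Calling `checkAlignments` in a loop over the interesting sizes would combine the size and alignment requirements in one place.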
Jun 28
next sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
 // NOTE(stefanos): Classes are not tested, more on that on the 
 Blockers.
=== Blockers ===

-- Blocker 1 - DMD --

The main blocker was that the project was targeted at DMD. The main problems are:
- The optimizer is limited.
- The code generated is a lot of times unpredictable (at least to me), both as far as performance is concerned and as far as comprehensibility is concerned.
- Inline ASM can't be interleaved with pure D.

I want to stress that when writing such performance-sensitive utilities, the language is used as the tool to generate the ASM more efficiently (and with fewer errors) instead of writing it yourself. This is a subjective opinion, but I guess that most people who have worked on such utilities will agree. This is why these utilities are either written in ASM or in a language that is low-level enough and has a good enough optimizer to let them write in this more high-level language.

Now, I picked inline ASM as my preference because with pure D and DMD there was:
- Poor debuggability. When the ASM is not hand-written, it is not as easily comprehensible. To give that up, the ASM generated by the compiler has to be predictable, which for me it wasn't.
- Poor tuning. One should not fight the optimizer. If I expect an optimization to be done and it's not, then that's a problem.
- Poor scalability. If a person comes after me and tries to optimize it further, I might have potentially created more problems with pure D than I would have solved. For example, if I was that person and I did a compile and there was an unexpected load inside a loop that I can't get around by transforming the code, then that would be a problem.

Basically, if we go the pure-whatever-language-we-choose way, we must never, in the future, say "Better to have written it in ASM from the start". And my prediction was that that would be the case. I can be a lot more specific on the reasons behind the pick of inline ASM, so feel free to ask. Don't get me wrong, DMD is pretty good but I, at least, could not get it to the point of hand-written ASM.

I want to say that this inline ASM I'm talking about is being minimized / removed and is replaced with pure D for various reasons.

-- Blocker 2 - Test suite --

In this month, I was working with a test suite that I had not examined carefully. That was certainly my biggest mistake up until now. And that test suite was not good. When I got advised to make a new test suite, that new suite revealed serious bugs in the code. That was both good and bad. The good thing was that I now had the chance to think hard about the test suite and that of course the bugs were revealed. But the bad part was that Dmemcpy and Dmemmove had to be almost completely remade in 3 days. It was done, but it was a serious blocker.

In that time, problems with Windows were revealed (specifically, the calling convention), which were also solved, but that took a lot of time as well.

-- Blocker 3 - Classes --

The problem with classes is that it is mentioned that the compiler can change the layout of the fields in a class / struct. Even if that means that the two hidden fields (vptr and monitor) are still at the start, it still seems hacky to take the class pointer, move forward 16 bytes and start the operations there (and the 16 bytes is not standard because the pointer size varies by platform). So, we decided to leave it for now. My guess is that classes probably will never be used directly in such low-level code.

-- Blocker 4 - SIMD intrinsics --

When I started writing Dmemset, I decided to go pure-D first. In that effort, there were 2 ASM instructions that I was trying to get to work for about 4 hours. The ASM instructions are:
        movd    XMM0, ESI;
        pshufd  XMM0, XMM0, 0;

I don't know if more details on what I tried matter, but if anyone has an idea, please inform me.
Jun 28
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 28 June 2019 at 12:14:13 UTC, Stefanos Baziotis wrote:
 On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis 
 wrote:
 // NOTE(stefanos): Classes are not tested, more on that on the 
 Blockers.
 === Blockers ===

 -- Blocker 1 - DMD --

 The main blocker was that the project was targeted at DMD. The main problems are:
 - The optimizer is limited.
 - The code generated is a lot of times unpredictable (at least to me), both as far as performance is concerned and as far as comprehensibility is concerned.
 - Inline ASM can't be interleaved with pure D.

 I want to stress that when writing such performance-sensitive utilities, the language is used as the tool to generate the ASM more efficiently (and with fewer errors) instead of writing it yourself. This is a subjective opinion, but I guess that most people who have worked on such utilities will agree. This is why these utilities are either written in ASM or in a language that is low-level enough and has a good enough optimizer to let them write in this more high-level language.

 Now, I picked inline ASM as my preference because with pure D and DMD there was:
 - Poor debuggability. When the ASM is not hand-written, it is not as easily comprehensible. To give that up, the ASM generated by the compiler has to be predictable, which for me it wasn't.
 - Poor tuning. One should not fight the optimizer. If I expect an optimization to be done and it's not, then that's a problem.
 - Poor scalability. If a person comes after me and tries to optimize it further, I might have potentially created more problems with pure D than I would have solved. For example, if I was that person and I did a compile and there was an unexpected load inside a loop that I can't get around by transforming the code, then that would be a problem.

 Basically, if we go the pure-whatever-language-we-choose way, we must never, in the future, say "Better to have written it in ASM from the start". And my prediction was that that would be the case. I can be a lot more specific on the reasons behind the pick of inline ASM, so feel free to ask. Don't get me wrong, DMD is pretty good but I, at least, could not get it to the point of hand-written ASM.

 I want to say that this inline ASM I'm talking about is being minimized / removed and is replaced with pure D for various reasons.
inline asm is generally very bad for the optimiser because it can have any side-effects and is completely opaque. It is possible to generate the asm with string mixins, see e.g. the BigInt routines in phobos. You should test your work with LDC at some point which has an optimiser worth using, but note the bit about opaque inline ASM hurting performance.
 -- Blocker 2 - Test suite --

 In this month, I was working with a test suite that I had not 
 examined carefully.
 That was certainly my biggest mistake up until now. And that 
 test suite was not good.
 When I got advised to make a new test suite, that new suite 
 revealed serious bugs in the code. That was both good and bad. 
 The good thing was that I now had the chance to think
 hard on the test suite and that of course the bug were revealed.
 But the bad part was that Dmemcpy and Dmemmove had to almost be 
 complete remade in 3 days.
 It was done, but it was a serious blocker.

 In that time, problems with Windows were revealed 
 (specifically, the calling convention),
 which were also solved, but that was a lot of spent time as 
 well.

 -- Blocker 3 - Classes --

 The problem with classes is that it is mentioned that the 
 compiler can change the layout
 of the fields in a class / struct. Even if that means that the 
 two hidden fields
 (vptr and monitor) are still on the start, it still seems hacky 
 to take the class
 pointer, move forward 16 bytes and start the operations there 
 (and the 16 bytes is not standard because the pointer size 
 changes by the operating system). So, we decided
 to leave it for now.
 My guess is that classes probably will never be used directly 
 in such low-level code.
You should be able to get the offset of the first member with

int foo()
{
    static class A { int a; }
    return A.init.a.offsetof;
}

which will apply to any other non-nested class.
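A minimal sketch of how the suggestion above could be used to reach a class's fields without hard-coding 16 bytes. The class `Point` and the helpers `firstFieldOffset` and `firstFieldPtr` are hypothetical names made up for illustration; the sketch assumes the usual D class layout where the hidden vptr and monitor fields come first.

```d
// Hypothetical class, used only for illustration.
class Point
{
    int x;
    int y;
}

/// Offset in bytes of the first declared field from the start of the object.
size_t firstFieldOffset()
{
    // Same form as in the suggestion: .offsetof applied through an instance expression.
    return Point.init.x.offsetof;
}

/// Pointer to the first declared field of `obj`, computed from the offset
/// instead of assuming a fixed header size.
int* firstFieldPtr(Point obj)
{
    return cast(int*) (cast(void*) obj + firstFieldOffset());
}
```

With this, the offset follows the pointer size of the target (two pointer-sized hidden fields), so nothing in the calling code depends on the literal value 16.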
 -- Blocker 4 - SIMD intrinsics --

 When I started writing Dmemset, I decided to go pure-D first. 
 In that effort, there
 were 2 ASM instructions that I was trying to get to work for 
 about 4 hours. The ASM
 instructions are:
         movd    XMM0, ESI;
         pshufd  XMM0, XMM0, 0;

 I don't know if more details on what I tried matter, but if anyone 
 has an idea, please inform me.
Take a look at https://github.com/AuburnSounds/intel-intrinsics

Keep up the good work!
Jun 28
parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 28 June 2019 at 12:33:16 UTC, Nicholas Wilson wrote:
 inline asm is generally very bad for the optimiser because is 
 can have any side-effects and is completely opaque.
Exactly, that's the primary reason I mentioned that inline asm can't be interleaved with D. For performance reasons. The compiler has to be very conservative (more than one would expect). Which means that the only way to go is either pure D or full ASM and in fact, `_naked` ASM.
 It is possible to generate the asm with string mixins, see e.g. 
 the BigInt routines in phobos.
I suppose you mean this: https://github.com/dlang/phobos/blob/master/std/bigint.d With a quick look I'm not sure I understand the reason to do string mixins. I understand that it is for convenience (i.e. construct the ASM appropriately and not write a million different versions) and not performance reasons.
 You should test your work with LDC at some point which has an 
 optimiser worth using, but note the bit about opaque inline ASM 
 hurting performance.
It is tested with LDC but LDC was not a target for this project. Yes, inline ASM is risky as is pure D for the reasons I said above. Maybe I should note explicitly the risk of using only ASM as well, since I did for pure D. It's a matter of compromise.
 You should be able to get the offset of the first member with

 int foo()
 {
     static class A { int a; }
     return A.init.a.offsetof;
 }

 which will apply to any other non-nested class.
Thanks, I had not considered that. I think I should do an explicit post where I ask the opinion of the community about whether they would like the support of classes and how so.
 Take a look at https://github.com/AuburnSounds/intel-intrinsics
Just a little bit more detail: from my research, these two ASM instructions should correspond somehow to these 2 statements:

simd_stof!(XMM.LODD, void16)(v, XMM0);
simd!(XMM.PSHUFD, 0, void16, void16)(XMM0, XMM0);

But I could not get them to work for the life of me. I had not considered the intel intrinsics, which is dumb if you consider that there is a whole talk I watched on this topic. It is this: https://www.youtube.com/watch?v=cmswsx1_BUQ for anyone interested.
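For readers unfamiliar with the two instructions: `movd` places a 32-bit value in the low lane of an XMM register and `pshufd` with immediate 0 replicates that lane into all four lanes. The sketch below only models that result in portable D with a plain static array. It is not the author's code, it makes no claim about what any compiler emits for it, and `broadcast4` is a hypothetical name.

```d
/// Returns four copies of `value`, which is the register content that the
/// movd + pshufd(…, 0) pair produces.
int[4] broadcast4(int value)
{
    int[4] lanes;
    lanes[] = value; // array-wise assignment: every element becomes `value`
    return lanes;
}
```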
 Keep up the good work!
Thank you! - Stefanos
Jun 28
prev sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
 An update regarding the project. There was a lot of turbulance 
 in this project, so I'm sorry I did not post earlier.
I'm now moving to weekly updates. Before the updates of what I did, let me update you on the state of the project. The focus of the project has changed in the following ways:
- No assembly
- Generic and portable code
- Focus on LDC and GDC
- PRs to core.experimental

This week
==========

- Because of the above, this week I started with the replacement of all the ASM with SSE intrinsics and providing a simple implementation for when SIMD is not available. The goal was not only the replacement but also the optimization for LDC. Eventually (either as part of this summer or of future work), the simple implementation should not be so "simple" and be one that helps LDC and GDC optimize it without the need to be explicitly in `version (D_SIMD)`.
- I moved the functions to a common repository: https://github.com/baziotis/Dmemutils
- I made a draft PR in the D runtime: https://github.com/dlang/druntime/pull/2662 (Thanks to lesderid and wilzbach for their help).

* A note on intel-intrinsics:
I first tried intel-intrinsics for the use of intrinsics. That worked great in LDC (I think it's focused on LDC), not so good in DMD and not at all in GDC. Firstly, in DMD it didn't work, meaning it generated "wrong" code. The problem is that doing a load/store with intel-intrinsics and doing a load/store with load/storeUnaligned of core.simd does not generate the same code. This is important for Dmemmove because it is very sensitive to the order of instructions because of the overlap (e.g. here [1]). So, I made my own intrinsics that are different depending on whether we use DMD or LDC. Regarding GDC, I just couldn't get it to compile. My purpose is not to disparage intel-intrinsics, it's a great library. This was just my experience, in which maybe I did something wrong. I also tried to contact the creator, because maybe he has some insight.

* A note on GDC intrinsics:
GDC now compiles to the naive version, because I don't know of load/storeUnaligned respective functions for GDC. Iain told me that I could use the i386 intrinsics (which as far as I know is this [2]), but I could not use them in GDC.

Blockers
========

Only what I said above regarding GDC intrinsics.

[1] https://github.com/baziotis/Dmemutils/blob/master/Dmemmove/Dmemmove.d#L267
[2] https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html

Next week
==========

Sadly, I don't know. According to my schedule, the work on the allocator should have started. But there were a couple of problems in the project, which changed its focus, and so there were things that had to be done that were not initially planned. That means that the allocator, which should have started by now, hasn't.

Other than that, the plans for the allocator changed when the project started to things that I'm not fully experienced with (changed from malloc(), free() etc. to using the std.experimental.allocator). So, how the project will continue is currently an open discussion. If std.experimental.allocator is interesting to the community, I'm happy to discuss it and learn how to continue.

If we fall back to classic malloc(), free() implementations, this is something that can't be fully done in the time available. To make a complete replacement of malloc() et al, one has to make a serious attempt on multi-threading and optimization.

_HOWEVER_, one possible alternative is to provide minimalistic versions of those functions for "baremetal" systems. That means either embedded systems or WASM. I think that this is interesting, meaning, to not have a dependency on the libc there and have minimal (regarding resources and code) implementations.
Jul 05
next sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
 - Because of the above, this week I started with the 
 replacement of all the ASM
 with SSE intrinsics and providing simple implementation for 
 when SIMD is not available.
 The goal was not only the replacement but also the optimization 
 for LDC.
 Eventually (either as part of this summer or of future work), 
 the simple implementation
 should not be so "simple" and be one that helps LDC and GDC 
 optimize it without
 the need to be explicitly in `version (D_SIMD)`.
An important omission is that GDC and LDC optimize the simple version of Dmemset for my AMD Ryzen in such a way that it reaches total parity with libc memset. The difference is amazing when working with the LLVM / GCC back-ends. Unfortunately, I don't have an Intel to test. It would be really good to have benchmarks from Intel users.
Jul 05
next sibling parent reply Piotrek <dummy dummy.gov> writes:
On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
 Unfortunately, I don't have an Intel to test. It would be 
 really good to have benchmarks from Intel users.
Hi Stefanos,

This is great work. I hope Phobos will move away from clib some day.

As for the benchmarks. I think you can post your results somewhere. Or you did. Unfortunately I cannot find them.

I tested Dmemset with dmd (ldc and gdc didn't compile) on i3-3220 3.30GHz (Ubuntu).

The strange thing is I get different results when I change the following line in benchmarks.d

//note the upper bound
static foreach(i; 1..256)

to

static foreach(i; 1..257)

(1..256)
127 24.1439 20.7726
128 24.333  20.8421
129 24.3768 20.9648

(1..257)
127 24.4276 25.8072
128 24.679  26.2316
129 24.8052 26.0236

So D version becomes better. Maybe this is related to different binary file after compilation.

Some other results for "(1..257)" variant:

size(bytes) Cmemmove(GB/s) Dmemmove(GB/s)
1           0.269991       0.180151
2           0.438143       0.386652
3           0.657527       0.543067
4           1.00408        0.767028
5           1.26435        0.96617
6           1.51675        1.09579
7           1.76942        1.2771
8           2.02263        1.54563
9           2.27596        1.6421
10          2.52917        1.82534
11          2.78175        2.00729
12          3.03507        2.1897
13          3.28674        2.37267
14          3.53581        2.54155
15          3.79338        2.59328
16          5.25561        2.91728
17          5.58319        5.07972
18          5.91207        5.37934
19          6.24159        5.67784
20          6.56863        5.97583
21          6.84187        6.26141
22          7.22644        6.57598
23          7.55238        6.81922
24          7.88487        7.17182
...
39          9.85228        9.48541
40          10.1054        9.72436
41          10.3587        10.0661
42          10.5787        10.3286
43          10.862         10.661
44          11.1155        10.9688
45          11.3691        11.2042
46          11.6228        11.5771
47          11.8245        11.6284
48          12.1258        12.1853
49          12.3849        12.4931
...
59          14.7853        15.7441
60          15.165         16.1076
61          15.4095        16.4647
62          15.6639        16.803
63          15.9273        17.0932
64          16.1733        17.4991
65          11.862         17.671
66          12.0373        17.8678
67          12.2148        17.8533
68          12.4066        18.2475
69          12.5497        18.2762
...
124         23.6536        25.3192
125         23.9933        25.5515
126         24.2049        26.0169
127         24.4276        25.8072
128         24.679         26.2316
129         24.8052        26.0236
130         25.0353        26.446
131         24.8123        26.2339
132         25.2592        26.176
133         25.3562        26.6108
134         25.8571        26.8894
...
252         33.7209        33.9282
253         33.7367        34.1942
254         33.8958        34.59
255         33.412         33.6378
256         33.6542        34.661
500         39.5868        39.6527
700         43.7852        43.3711
3434        34.2489        45.8683
7128        35.2755        49.4049
13908       35.5447        51.2273
16343       35.0748        51.4501
27897       35.5615        51.0826
32344       35.1398        48.1469
46830       32.8887        34.9705
64349       33.2305        34.9398

Are they meaningful for you?

If you want I can run additional benchmarks for you.

For details, maybe we can continue on github. On forum we can discuss some fundamental points.

Cheers,
Piotrek
Jul 05
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 5 July 2019 at 20:22:30 UTC, Piotrek wrote:
 Hi Stefanos,

 This is great work. I hope Phobos will move away from clib some 
 day.
Hello, thank you! Yes, I hope too. If the D runtime moves away, that will be easier for the rest of D.
 As for the benchmarks.
 I think you can post your results somewhere. Or you did. 
 Unfortunately I cannot find them.
You're right, my mistake, there are no recent benchmarks. I'll try to post today. They're similar to yours.
 I tested Dmemset with dmd (lcd and gdc didn't compile) on 
 i3-3220 3.30GHz (Ubuntu).
That's weird. Could you give some more info on how did you compile? Did you use the procedure described in the README? Meaning, `rdmd run benchmarks gdc` and `rdmd run benchmarks ldc`. Now I checked and there was a regression which is now fixed. But with this regression, I could compile benchmarks for gdc but not ldc or dmd.
 The strange thing is I get different results when I change the 
 following line in benchamrks.d
 So D version becomes better. Maybe this is related to different 
 binary file after compilation.
That is indeed strange but not too unexpected. A compiler (more possible in the DMD back-end) might decide to do strange things for reasons I don't know. I'll try to re-create similar behavior in mine.
 Some other results for "(1..257)" variant:

 Are they meaningful for you?
They are, thank you! The benchmarks are good.

Just some more info for anyone interested:

Regarding sizes 1-16. With GDC / LDC, in my benchmarks (and by reading the ASM, I assume in all the benchmarks), it reaches parity with libc (note that for sizes 1-16 the naive version is used, meaning, a simple for loop). Now, for such small sizes, the standard way to go is a fall-through switch (I can give more info on that if someone is interested). The problem with that is that it's difficult to be optimized along with the rest of the code. Meaning, by the compiler. Or at least, I didn't find a way to do it. And so, I use the naive version which is only slightly slower but doesn't affect bigger sizes.

Another important thing is that +/- 1 GB/s should not be considered. The reason is that at some point I benchmarked libc memset() against libc memset() and there were +/- 1 GB/s differences.
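To make the two approaches for small sizes concrete, here is a minimal sketch, not the code from Dmemutils. `naiveMemset` is the plain loop; `switchMemset` is a fall-through switch, shown only up to 4 bytes to keep it short (a real one would cover 1-16). Both function names are hypothetical. D has no implicit fall-through, so each case ends with an explicit `goto case;`.

```d
/// Naive version: a simple loop over the bytes.
void naiveMemset(ubyte* dst, ubyte value, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = value;
}

/// Fall-through switch: jump to the case for `n` and let control run
/// down through the smaller cases, storing one byte in each.
void switchMemset(ubyte* dst, ubyte value, size_t n)
{
    switch (n)
    {
        case 4: dst[3] = value; goto case;
        case 3: dst[2] = value; goto case;
        case 2: dst[1] = value; goto case;
        case 1: dst[0] = value; goto case;
        case 0: break;
        // Sizes not covered by the switch use the loop.
        default: naiveMemset(dst, value, n); break;
    }
}
```

The switch avoids the loop counter and loop branch for tiny sizes, at the cost of a jump table that the compiler has to fit in with the rest of the function.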
 If you want I can run additional benchmarks for you.
Thanks, I don't want to pressure you. If you have time, I'm interested in some feedback on GDC / LDC (whether they compile and / or benchmarks).

My guess is that especially with GDC / LDC (and DMD, but I'm not yet sure for DMD across different hardware), Dmemset can actually replace libc memset(). With Dmemmove / Dmemcpy it is harder to have a clear winner.
 For details, mabe we can continue on github. On forum we can 
 discuss some fundamentals points.
I'm available to you or anyone to give additional info / explanations etc. on every line of code, decision, alternative implementations, possible improvements etc. You can post here, contact me on Slack or email. Some of these things will be added on the READMEs in the end, but we can go in more detail. Best regards, Stefanos
Jul 06
parent reply Piotrek <dummy dummy.gov> writes:
On Saturday, 6 July 2019 at 11:07:41 UTC, Stefanos Baziotis wrote:
 As for the benchmarks.
 I think you can post your results somewhere. Or you did. 
 Unfortunately I cannot find them.
You're right, my mistake, there are no recent benchmarks. I'll try to post today. They're similar to yours.
 I tested Dmemset with dmd (lcd and gdc didn't compile) on 
 i3-3220 3.30GHz (Ubuntu).
That's weird. Could you give some more info on how did you compile?
I used the old repo for Dmemset. With Dmemutils it works now. I removed static foreach from benchmark.d in order to run gdc. Text results: https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output
 The strange thing is I get different results when I change the 
 following line in benchamrks.d
 So D version becomes better. Maybe this is related to 
 different binary file after compilation.
That is indeed strange but not too unexpected. A compiler (more possible in the DMD back-end) might decide to do strange things for reasons I don't know. I'll try to re-create similar behavior in mine.
It seems it wasn't related to this change. Looks like heisen optimization.
 Just some more info for anyone interested:
 Regarding sizes 1-16. With GDC / LDC, in my benchmarks
 (and by reading the ASM, I assume in all the benchmarks), it 
 reaches parity
 with libc (note that for sizes 1-16 the naive version is used,
 meaning, a simple for loop). Now, for such small sizes, the 
 standard way to go
 is a fall-through switch (I can give more info on that if 
 someone is interested).
 The problem with that is that it's difficult to be optimized 
 along with the rest
 of the code. Meaning, by the compiler. Or at least, I didn't 
 find a way
 to do it. And so, I use the naive version which is only 
 slightly slower but
 doesn't affect bigger sizes.
Funnily enough, DMD (with Dmemset) holds the speed record, over 50 GB/s, copying some big block sizes. However, aren't smaller sizes more important?
 My guess is that especially with GDC / LDC (and DMD, but I'm 
 not yet sure
 for DMD across different hardware), Dmemset can actually 
 replace libc memset().
One issue is it should be tested on all variation of HW and OS. At least it can be placed in experimental module. Cheers, Piotrek
Jul 06
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Saturday, 6 July 2019 at 15:33:44 UTC, Piotrek wrote:
 I used the old repo for Dmemset. With Dmemutils it works now. I 
 removed static foreach from benchmark.d in order to run gdc.
 Text results:
 https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output
Great, earlier today I realized that there were problems with static foreach, so now it's only using mixin in the main repo. Basically, I should have been able to do:

version (GNU)
{
     // mixin
}
else
{
     static foreach
}

but that didn't work, meaning GDC tried to compile static foreach.

Anyway, the benchmarks look good. In DMD, small sizes are not so good but the big ones are better. But DMD is not the focus, since it now changed to GDC, LDC. If you're interested, there are a lot of things to say regarding optimization for DMD. Some have been said in this thread as initially the project was focused on DMD. I'm actually thinking of writing an article so that maybe I can help the next guy that tries to optimize for DMD. I don't think it's a good decision to care at all about optimization in DMD, but one might do. And it's a hard road. A tl;dr is that, for me at least, the only way to reach parity with libc is using (inline) ASM.

But the important benchmarks are for GDC, LDC, which agree with my benchmarks on AMD and the result is that Dmemset reaches total parity with libc memset(). That's great to have from an Intel user as well, thanks for your time!
 It seems it wasn't related to this change. Looks like heisen 
 optimization.
Again, DMD. Quite an unexpected compiler.
 Funnily enough, DMD (with Dmemset) holds the speed record, over 
 50 GB/s, copying some big block sizes.
DMD might have been able to get these results due to inlining that was unrelated to the actual function (i.e. the benchmark code got inlined).
 However, aren't smaller sizes more important?
Again, fortunately DMD is not the focus, but I guess one way to somewhat answer this question is to do a report of the sizes used in the D runtime, since this is targeted at the D runtime. Something like this: https://forum.dlang.org/post/jdfiqpronazgglrkmwfq forum.dlang.org

But this is not enough. A big part of optimization is to know the most common cases (which could be the data format, size, hardware etc.) and optimize for those first. And this is not adequate to show us the most common cases.
- For one, eventually different sizes might be added or removed and so the common cases might change.
- Someone might want to use this function outside of the D runtime.

So, Dmemset() should be on par with or better than libc, which is (currently) achieved.

Note something interesting. GDC gets these results with the naive version. This version is literally an 8-line for loop.
 One issue is it should be tested on all variation of HW and OS.
 At least it can be placed in experimental module.
Right, it's currently PR'd to the D runtime: https://github.com/dlang/druntime/pull/2662 Just like you said, in an experimental module. :P Best regards, Stefanos
Jul 06
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 06.07.19 18:10, Stefanos Baziotis wrote:
 
 Basically, I should have been able to do:
 version (GNU)
 {
      // mixin
 }
 else
 {
      static foreach
 }
 
 but that didn't work, meaning GDC tried to compile static foreach
It won't compile it, but it will attempt to parse it. You should be able to do:

version(GNU){ /+mixin+/ }
else mixin(q{ /+static foreach+/ });
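A filled-in sketch of the pattern above, under the assumption that the goal is to hide `static foreach` from a compiler front-end that cannot parse it. The function `sumOfSizes` and its body are hypothetical placeholders for the real benchmark code. The token string `q{...}` is only lexed, not parsed, until the `mixin` in the chosen branch is actually compiled.

```d
version (GNU)
{
    // Hand-written fallback for a front-end without `static foreach`.
    int sumOfSizes()
    {
        return 1 + 2 + 3;
    }
}
else
{
    // The body is a token string, so it is parsed only when this branch is taken.
    mixin(q{
        int sumOfSizes()
        {
            int total = 0;
            static foreach (size; 1 .. 4)
            {
                total += size; // unrolled at compile time for size = 1, 2, 3
            }
            return total;
        }
    });
}
```

Either branch defines the same function, so callers do not need their own `version` blocks.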
Jul 23
prev sibling parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
 Unfortunately, I don't have an Intel to test. It would be 
 really good to have benchmarks from Intel users.
A kind request to anyone interested in helping, please for the time being, put a priority in Dmemset (as Piotrek did). It is in a somewhat final and polished state and so we can have a more fruitful discussion, without it undergoing big changes. Dmemcpy / Dmemmove will undergo some changes (not fundamental I hope, but certainly layout, naming etc.) before it can be PR'd to D runtime too.
Jul 06
prev sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
 I'm now moving to weekly updates. Before the updates of what I 
 did, let me update
 you on the state of the project.
Last 2 Weeks
============

I could not do weekly updates because unfortunately, there are a lot of things out of schedule in the project. So basically, the last 2 weeks I improved memcpy() / memmove() so they can be PR'd to the druntime. This [1] was the first PR. It had to be split into separate PRs for memcpy() and memmove(). Yesterday, an important question was answered, which let me do a new PR for memcpy() [2].

Along with that, I created a memcmp() replacement [3]. I'm relatively satisfied with how the code looks, but this can't be PR'd yet to the druntime due to performance problems (more on that in the blockers).

Blockers
========

-- On memcmp:
That was my post on Slack:
There are 3 major problems:
1) The performance is really really bad. Assuming that I have not done something stupid, it's just really bad. And actually, the asm generated (from `LDC`) is really weird too.
2) I made a version of Agner Fog's `memcmp` (which incidentally is similar to mine, it just goes in reverse and does some smart things in subtractions). The thing is:
   a) Mine and Agner's should be about the same but they're not (Agner's is way better).
   b) Agner's is still quite low compared to `libc`.
3) The `LDC` version gives some very weird results for `libc memcmp`. Meaning, in benchmarks. And actually, the -O3 ASM generated by LDC seems bad as well.

-- The state of the project
Right now, there is no specific roadmap nor any specific goals moving forward. The project was divided in 2 parts. One was the memcpy() et al. part, which included memcpy(), memmove() and memcmp(), and the second was the allocator. The first part is mostly done. After discussions with Seb, we decided that the second part is not really needed after the mimalloc() of Microsoft: https://forum.dlang.org/thread/krsnngbaudausabfsqkn forum.dlang.org

So, currently I don't know how to move forward. I asked on the druntime whether I can help with anything and zombinedev and Nicholas Wilson proposed refactorings on core.thread. Nicholas helped me to start with that, so this is going to be the next thing I will do. But this is supposed to be quick.

If anyone has any proposal on what to do next, I'm glad to discuss.

[1] https://github.com/dlang/druntime/pull/2671
[2] https://github.com/dlang/druntime/pull/2687
[3] https://github.com/baziotis/Dmemutils/tree/master/Dmemcmp
Jul 20
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
As it is mentioned in a previous post, this project has got 
hardly any attention. And since there was nothing to do, I did 
not post weekly.

=== Current State ===

--- Dmem* utilities ---

Fortunately, Nicholas Wilson has been helping me the last week 
get the 2 Dmem*
PRs I had done merged [1], [2]

I don't know of anything that these PRs need, although possibly I 
have done something wrong in the documentation.

I don't know if / when they will get merged since they're 
awaiting review.
I hope to have enough reviews to merge at least memset in the 
next 2.5 weeks.
And again, thanks a lot Nicholas for your time.

--- core.thread ---

Since there was nothing to do, I asked if there was anything that 
I could
do in the time. It was proposed that I could refactor core.thread.
With some help from Nicholas, I made a PR [3].
I'm glad that people seem to care about this change. It's going 
good I think.

=== Final 2.5 weeks ===

I honestly have no idea. Ideally, I would PR memmove() as well 
but I think it's
better to try to get at least one of the other 2 PRs merged first.
Other than that, if the core.thread gets merged, I will finish it.

One thing I proposed for the time remaining is a cross-compiler 
SIMD module.
I will write in a separate thread about that, but the idea came 
from the fact that when writing Dmem* utils, I could not find a 
way to use SIMD intrinsics
across compilers. So, I created something like a small SIMD 
library [4].
That is of course not really general, but it shows the idea.

[1] memset: https://github.com/dlang/druntime/pull/2662
[2] memcpy: https://github.com/dlang/druntime/pull/2687
[3] core.thread: https://github.com/dlang/druntime/pull/2689
[4] Mini SIMD module: 
https://github.com/dlang/druntime/pull/2687/files#diff-c2fcd73761ae6659ef91245ce1195b6d
Aug 02
parent reply 12345swordy <alexanderheistermann gmail.com> writes:
On Friday, 2 August 2019 at 14:51:25 UTC, Stefanos Baziotis wrote:
 As it is mentioned in a previous post, this project has got 
 hardly any attention. And since there was nothing to do, I did 
 not post weekly.

 === Current State ===

 --- Dmem* utilities ---

 Fortunately, Nicholas Wilson has been helping me the last week 
 get the 2 Dmem*
 PRs I had done merged [1], [2]

 I don't know of anything that these PRs need, although possibly 
 I have done something wrong in the documentation.

 I don't know if / when they will get merged since they're 
 awaiting review.
 I hope to have enough reviews to merge at least memset in the 
 next 2.5 weeks.
 And again, thanks a lot Nicholas for your time.

 --- core.thread ---

 Since there was nothing to do, I asked if there was anything 
 that I could
 do in the time. It was proposed that I could refactor 
 core.thread.
 With some help from Nicholas, I made a PR [3].
 I'm glad that people seem to care about this change. It's going 
 good I think.

 === Final 2.5 weeks ===

 I honestly have no idea. Ideally, I would PR memmove() as well 
 but I think it's
 better to try to get at least one of the other 2 PRs merged 
 first.
 Other than that, if the core.thread gets merged, I will finish 
 it.

 One thing I proposed for the time remaining is a cross-compiler 
 SIMD module.
 I will write in a separate thread about that, but the idea came 
 from the fact that when writing Dmem* utils, I could not find a 
 way to use SIMD intrinsics
 across compilers. So, I created something like a small SIMD 
 library [4].
 That is of course not really general, but it shows the idea.

 [1] memset: https://github.com/dlang/druntime/pull/2662
 [2] memcpy: https://github.com/dlang/druntime/pull/2687
 [3] core.thread: https://github.com/dlang/druntime/pull/2689
 [4] Mini SIMD module: 
 https://github.com/dlang/druntime/pull/2687/files#diff-c2fcd73761ae6659ef91245ce1195b6d
Is this project dead in the water? Great, another dead project in the graveyard of dead projects.
Sep 05
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Thursday, 5 September 2019 at 15:53:20 UTC, 12345swordy wrote:
 Is this project dead in the water? Great, another dead project 
 in the graveyard of dead projects.
A dead project is a project that hasn't achieved its goals. This project did twice, but both times the goals were not useful. I explained that in the other thread [1]. Let's please only concern ourselves with constructive discussions from now on. - Stefanos [1] https://forum.dlang.org/post/triweshixkzzyxnaldlj forum.dlang.org
Sep 05
parent reply 12345swordy <alexanderheistermann gmail.com> writes:
On Thursday, 5 September 2019 at 16:26:08 UTC, Stefanos Baziotis 
wrote:
 On Thursday, 5 September 2019 at 15:53:20 UTC, 12345swordy 
 wrote:
 Is this project dead in the water? Great, another dead project 
 in the graveyard of dead projects.
A dead project is a project that hasn't achieved its goals. This project did twice, but both times the goals were not useful. I explained that in the other thread [1]. Let's please only concern ourselves with constructive discussions from now on. - Stefanos [1] https://forum.dlang.org/post/triweshixkzzyxnaldlj forum.dlang.org
Is the implementation of memory allocation of the C standard library ever going to be achieved? - Alex
Sep 05
next sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy wrote:
 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex
It depends on what you mean by "achieved". Let me state some questions:
- Why do you want that memory allocator?
- What should this allocator be able to achieve?
- Why is the libc one not appropriate for the job?
- Why is no other allocator appropriate for the job?
- Can we create and maintain this allocator?

These questions are presented humbly. And they are important. The fact that I did not set and answer such questions firstly _to myself_ for the first part of the project meant that I did the project twice, yet all this work was just thrown away as far as the D community is concerned.

- Stefanos
Sep 05
parent reply 12345swordy <alexanderheistermann gmail.com> writes:
On Thursday, 5 September 2019 at 17:56:07 UTC, Stefanos Baziotis 
wrote:
 On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy 
 wrote:
 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex
It depends on what you mean "achieved". Let me state some questions: - Why do you want that memory allocator ? - What this allocator should be able to achieve ? - Why the libc one is not appropriate for the job ? - Why no other allocator is appropriate for the job ? - Can we create and maintain this allocator ? These questions are presented humbly. And they are important. The fact that I did not set and answer such questions firstly _to myself_ for the first part of the project, meant that I did the project twice, yet all this work was just thrown away as far as the D community is concerned. - Stefanos
- It is easier to debug and read in the D language than in the C language.
- I was shown faster memory allocation speed compared to libc.
- Other memory allocators are not part of the D language standard library.

Most importantly, this is yet another disappointing development I have seen in regards to the development of the D language.

- Alex
Sep 05
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 05, 2019 at 08:16:24PM +0000, 12345swordy via Digitalmars-d wrote:
[...]
 - It is easier to debug and read in the d langauge then in the c language.
 - I was shown faster memory allocation speed compared to libc.
 - other memory allocator are not part of d langauge standard library.
 
 Most importantly a yet another disappointed development I seen in
 regards to the development of the d language.
[...]
Read the discussion that Stefanos referred to. Here are some of the key blocking issues:

- C library APIs like memcpy, memset, etc., are not only in the C library, but are often implemented as *intrinsics* in compilers. One of the most important effects of this is that optimizers recognize them and understand their semantics, and can sometimes produce better code because of that. For example:

    int x, y=5;
    memcpy(&x, &y, int.sizeof); // C version
    ... // optimizer knows that now x==5.

  Using a D version of memcpy in the above code can mean that the optimizer does *not* recognize that x==5, which can lead to poorer performance.

- Even if the previous point isn't an issue, there's still the problem of maintenance: the D version of mem* needs to be continuously updated because hardware is constantly evolving, and it takes significant manpower to (1) port the implementation to every supported architecture, (2) make sure they take maximum advantage of the quirks of the targeted platform, and (3) checking that they are actually faster than the C implementations (which is available on basically any new platform anyway).

- D already has syntax for abstractly representing a memcpy operation: a[] = b[]. This syntax is type-safe, memory-safe, and the compiler can lower it to whatever it likes, including memcpy, or a custom implementation specialized for the target platform. That's where such primitives really belong, actually. (Historically they went into the C library, but these days compilers are more and more building them into intrinsics that can drive various codegen strategies (inlining, arch-specific optimizations, etc). They're gradually becoming more like low-level compiler primitives than your average C library functions.)

The current work Stefanos has produced has a big performance impact mainly only in DMD, which is known to have a weak optimizer, and anyone who cares about runtime performance ought to be using GDC or LDC anyway. In GDC/LDC using these custom D implementations wind up being worse because they defeat the respective optimizers (they no longer recognize memcpy/etc. semantics from these functions, so can't optimize based on that).

So lot of the effort ended up being directed towards working around flaws in DMD's optimizer rather than producing *actual* improvement over C's mem* primitives. This is really the wrong way to go about things IMO; we should rather be fixing DMD's optimizer instead. But once that's done there's even less reason to implement mem* ourselves.

Note that this does not preclude the D compiler from, e.g., translating statements like `a[] = b[]` into target-optimized instructions instead of calling a function named 'memcpy'. I'd argue that it's the compiler's job (more specifically, the optimizer's job) to do the best translation of a[] = b[] into machine code, not the standard library's problem to account for N versions of M platforms in a gigantic unmaintainable block of static if'd (or version'd) custom implementation, whose only real value is to be able to pat ourselves in the back that yes, we have our own memcpy/memset/etc., implementation that we wrote in D, just because we can.

Porting the D compiler to a new architecture already requires codegen work anyway, and work on memory-copying/moving primitives really should be included under that umbrella, rather than poorly reinvented in the runtime library.

T

--
Curiosity kills the cat. Moral: don't be the cat.
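
As a minimal sketch of the `a[] = b[]` point made in this post, the two hypothetical helpers below (`copyWithSlices` and `copyWithMemcpy` are names invented for illustration, not druntime functions) copy the same data once through the slice syntax and once through C's memcpy. How the compiler lowers the slice assignment is left to the implementation; the sketch only shows that both forms yield the same result from the caller's side.

```d
import core.stdc.string : memcpy;

// Hypothetical helper: copy through D's slice-assignment syntax.
// Element types must be compatible, and a length mismatch is reported
// at run time when array bounds checks are enabled.
int[] copyWithSlices(const int[] src)
{
    auto dst = new int[](src.length);
    dst[] = src[];
    return dst;
}

// Hypothetical helper: the same copy through C's memcpy, which works on
// untyped pointers and a byte count that the caller must get right.
int[] copyWithMemcpy(const int[] src)
{
    auto dst = new int[](src.length);
    memcpy(dst.ptr, src.ptr, src.length * int.sizeof);
    return dst;
}

void main()
{
    int[] b = [1, 2, 3, 4];
    assert(copyWithSlices(b) == b);
    assert(copyWithMemcpy(b) == b);
}
```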
Sep 05
parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Thursday, 5 September 2019 at 21:17:07 UTC, H. S. Teoh wrote:

Thanks for the descriptive comment! Some comments from me:

 Read the discussion that Stefanos referred to. Here are some of 
 the key blocking issues:

 - C library APIs like memcpy, memset, etc., are not only in the 
 C
   library, but are often implemented as *intrinsics* in 
 compilers. One
   of the most important effects of this is that optimizers 
 recognize
   them and understand their semantics, and can sometimes 
 produce better
   code because of that. For example:

 	int x, y=5;
 	memcpy(&x, &y, int.sizeof); // C version
 	... // optimizer knows that now x==5.

   Using a D version of memcpy in the above code can mean that 
 the
   optimizer does *not* recognize that x==5, which can lead to 
 poorer
   performance.

 - Even if the previous point isn't an issue, there's still the 
 problem
   of maintenance: the D version of mem* needs to be 
 continuously updated
   because hardware is constantly evolving, and it takes 
 significant
   manpower to (1) port the implementation to every supported
   architecture, (2) make sure they take maximum advantage of 
 the quirks
   of the targeted platform, and (3) checking that they are 
 actually
   faster than the C implementations (which is available on 
 basically any
   new platform anyway).
- For the first 2, let me again thank Manu and Johan, who helped me realize them! Note also that we don't currently know of a way to inform LLVM or GCC about the semantics and thus get this optimization. The closest thing we have is LLVM recognizing by name that a function does what e.g. memcpy() does, which is a bad assumption to build upon.
 - D already has syntax for abstractly representing a memcpy 
 operation:
   a[] = b[]. This syntax is type-safe, memory-safe, and the 
 compiler can
   lower it to whatever it likes, including memcpy, or a custom
   implementation specialized for the target platform. That's 
 where such
   primitives really belong, actually. (Historically they went 
 into the C
   library, but these days compilers are more and more building 
 them into
   intrinsics that can drive various codegen strategies 
 (inlining,
   arch-specific optimizations, etc). They're gradually becoming 
 more
   like low-level compiler primitives than your average C library
   functions.)
AFAIK, this is implemented in the druntime, and the druntime calls memcpy(). Essentially, the goal of this project was to create versions that would be used by the druntime, not by the user. Other than that, I agree!
 The current work Stefanos has produced has a big performance 
 impact mainly only in DMD, which is known to have a weak 
 optimizer,
Actually, when I was optimizing for DMD, I used assembly mainly because I had to match libc in performance, and with DMD the only way to do that is assembly. A more useful goal would be to not try to match libc (certainly not on x86_64), but rather to create optimized versions using generic D: that is, to optimize purely based on algorithms, with very few assumptions about the hardware, much like musl.
 and anyone who cares about runtime performance ought to be 
 using GDC or LDC anyway. In GDC/LDC using these custom D 
 implementations wind up being worse because they defeat the 
 respective optimizers (they no longer recognize memcpy/etc. 
 semantics from these functions, so can't optimize based on 
 that).
Actually, this project matched libc on LDC and GDC in 1-1 benchmarks using D and SIMD functions (but not ASM). The problem arises when the functions are used in context, exactly for the reasons you described.
 So lot of the effort ended up being directed towards working 
 around flaws in DMD's optimizer rather than producing *actual* 
 improvement over C's mem* primitives.
Yes, essentially that was one of my first objections. To counteract the DMD flaws, you have to write ASM (if you want parity), which in turn raises the question: then why do it? This is what libc already does.
 This is really the wrong way to go about things IMO; we should 
 rather be fixing DMD's optimizer instead. But once that's done 
 there's even less reason to implement mem* ourselves.
IMHO, I don't think that fixing the DMD optimizer is a good way to go. Rather, as I said above, aim for a generic D implementation, _without_ SIMD, based purely on algorithms. This can be useful for systems that don't have libc, and since the DMD optimizer does not use intrinsics the way LLVM / GCC do, the aforementioned problems are not problems. Essentially, it's a win-win situation. - Stefanos
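
To make the idea of a generic, SIMD-free implementation concrete, here is a minimal sketch. It is not the code from the GSoC pull requests: `simpleMemcpy` is a hypothetical name, and the strategy (byte copies until aligned, then `size_t`-sized words, then trailing bytes) is just one simple algorithmic choice that assumes nothing about the hardware beyond natural word alignment.

```d
// Hypothetical generic copy: no SIMD, no assembly, no libc.
void simpleMemcpy(void* dst, const(void)* src, size_t n)
{
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;
    enum W = size_t.sizeof;

    // Word-sized copies are only attempted when both pointers share the
    // same misalignment, so that they become word-aligned together.
    if ((cast(size_t) d) % W == (cast(size_t) s) % W)
    {
        // Copy single bytes until the pointers are word-aligned.
        while (n > 0 && (cast(size_t) d) % W != 0)
        {
            *d++ = *s++;
            --n;
        }
        // Copy one machine word at a time.
        while (n >= W)
        {
            *cast(size_t*) d = *cast(const(size_t)*) s;
            d += W;
            s += W;
            n -= W;
        }
    }
    // Copy the remaining bytes (or everything, if the alignments differ).
    while (n > 0)
    {
        *d++ = *s++;
        --n;
    }
}

void main()
{
    ubyte[37] a;
    ubyte[37] b;
    foreach (i, ref x; a)
        x = cast(ubyte) i;

    simpleMemcpy(b.ptr, a.ptr, a.length);
    assert(a == b);

    // Differently misaligned source and destination take the byte path.
    simpleMemcpy(b.ptr + 1, a.ptr + 3, 20);
    assert(b[1 .. 21] == a[3 .. 23]);
}
```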
Sep 05
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 05, 2019 at 09:50:04PM +0000, Stefanos Baziotis via Digitalmars-d
wrote:
[...]
 - For the first 2, let me again thank Manu and Johan, who helped me
   realize them! Note also that we don't currently know of a way of
   informing LLVM or GCC about the semantics and thus get this
   optimization. The closest thing we have is LLVM  recognizing that a
   function does what e.g. memcpy() does by name. Which is a bad
   assumption to build upon.
That's pretty scary that LLVM does that. It shakes my confidence in LLVM a little. OTOH, the identifier "memcpy" is pretty unique and practically universally understood to mean C's implementation of it, so it's a reasonably safe assumption. Of course, if you ever wish to override memcpy() with something that does something *other* than memcpy, you could potentially have a vector for Thompson-style backdoors (function does one thing when called, does something else when optimizer picks it up). [...]
 This is really the wrong way to go about things IMO; we should
 rather be fixing DMD's optimizer instead. But once that's done
 there's even less reason to implement mem* ourselves.
IMHO, I don't think that fixing the DMD optimizer is a good way to go. Rather, as I said above, aim for a generic D implementation, _without_ SIMD, based purely on algorithms. This can be useful for systems that don't have libc, and since the DMD optimizer does not use intrinsics the way LLVM / GCC do, the aforementioned problems are not problems. Essentially, it's a win-win situation.
[...]
But that seems to me to be quite backwards. If DMD were to target systems that don't have libc, which AFAIK it currently doesn't, we'd already have to do porting work in the form of how codegen is done. Then whatever implementation of memcpy & co you end up with, will simply become a part of this codegen implementation. It could be instructions directly produced by the backend, it could be calling a druntime function version'd by that specific platform, etc. But it'd be a platform-specific, dmd-specific thing, not something generic that applies across all platforms that D might target, and not something that, e.g., GDC or LDC would use.

My point is that something like this seems to be more appropriate as part of the support infrastructure for targeting libc-less platforms, rather than a generic library function that can be used by everyone. So any such implementation would be nested inside a version(platform_XYZ) block, ostensibly something like:

    version(noLibC)
    {
        void _d_memcpy(...) { ... }
    }

where the compiler targeting any platform with no libc would define version=noLibC, and emit references to _d_memcpy as part of the codegen for copying memory. It just wouldn't be something you could use in general from any platform.

T

--
People tell me I'm stubborn, but I refuse to accept it!
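
A fleshed-out version of the version(noLibC) idea might look like the sketch below. The signature and body of `_d_memcpy` are assumptions made purely for illustration (the post elides them), and the `else` branch that forwards to libc is likewise an assumption about how the two configurations could coexist. With DMD, the libc-free branch would be selected by passing `-version=noLibC` on the command line.

```d
version (noLibC)
{
    // Assumed libc-free fallback: a plain byte loop.
    void* _d_memcpy(void* dst, const(void)* src, size_t n)
    {
        auto d = cast(ubyte*) dst;
        auto s = cast(const(ubyte)*) src;
        foreach (i; 0 .. n)
            d[i] = s[i];
        return dst;
    }
}
else
{
    import core.stdc.string : memcpy;

    // On platforms that do have a libc, simply forward to it.
    void* _d_memcpy(void* dst, const(void)* src, size_t n)
    {
        return memcpy(dst, src, n);
    }
}

void main()
{
    int x, y = 5;
    _d_memcpy(&x, &y, int.sizeof);
    assert(x == 5);
}
```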
Sep 05
next sibling parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Thursday, 5 September 2019 at 22:56:30 UTC, H. S. Teoh wrote:
 That's pretty scary that LLVM does that. It shakes my 
 confidence in LLVM a little. OTOH, the identifier "memcpy" is 
 pretty unique and practically universally understood to mean 
 C's implementation of it, so it's a reasonably safe assumption. 
 Of course, if you ever wish to override memcpy() with something 
 that does something *other* than memcpy, you could potentially 
 have a vector for Thompson-style backdoors (function does one 
 thing when called, does something else when optimizer picks it 
 up).
I don't like it either. Although, I _think_ you can specifically turn this off, or that it is only done under specific flags. I'd have to check.
 But that seems to me to be quite backwards.  If DMD were to 
 target systems that don't have libc, which AFAIK it currently 
 doesn't, we'd already have to do porting work in the form of 
 how codegen is done. Then whatever implementation of memcpy & 
 co you end up with, will simply become a part of this codegen 
 implementation.  It could be instructions directly produced by 
 the backend, it could be calling a druntime function version'd 
 by that specific platform, etc..  But it'd be a 
 platform-specific, dmd-specific thing, not something generic 
 that applies across all platforms that D might target, and not 
 something that, e.g., GDC or LDC would use.
I don't know if I understood this correctly. For memcpy() et al. to become part of the compiler codegen, they have to be recognized as intrinsics, like LLVM does. Is this what you are referring to? Because that's another (interesting) discussion.

I was talking under the assumption that they're handled as just functions (as now), and things like a[] = b[] just call memcpy(). In that case, it doesn't pay to write an arch-specific implementation (meaning one written by the function implementor, not the compiler), because that can't be leveraged across architectures (or you have to write a specific one for each, which is not a good goal because of maintenance).

Even if there were an LLVM-like facility where you can e.g. call vector extension intrinsics that are lowered to whatever arch-specific thing is available, even when the arch does not have the concept of vectorization, it would still be better to focus on the algorithmic part, as the compiler's translation would be relatively basic.

I hope the above made _some_ sense. I feel I didn't articulate my thoughts perfectly.

- Stefanos
Sep 05
prev sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 5 September 2019 at 22:56:30 UTC, H. S. Teoh wrote:
 On Thu, Sep 05, 2019 at 09:50:04PM +0000, Stefanos Baziotis via 
 Digitalmars-d wrote: [...]
 - For the first 2, let me again thank Manu and Johan, who 
 helped me
   realize them! Note also that we don't currently know of a 
 way of
   informing LLVM or GCC about the semantics and thus get this
   optimization. The closest thing we have is LLVM  recognizing 
 that a
   function does what e.g. memcpy() does by name. Which is a bad
   assumption to build upon.
That's pretty scary that LLVM does that. It shakes my confidence in LLVM a little. OTOH, the identifier "memcpy" is pretty unique and practically universally understood to mean C's implementation of it, so it's a reasonably safe assumption.
FYI, GCC does the same. (the opposite too: converting a copying pattern into memcpy is an optimization performed by LLVM, GCC, and MSVC) -Johan
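
For illustration, the kind of copying pattern referred to here is an ordinary element-by-element loop such as the hypothetical `copyLoop` below. Whether a particular compiler actually replaces the loop with a call to memcpy (or memmove) depends on the compiler, its version, the optimization flags, and what it can prove about aliasing, so the sketch only shows the source-level shape; the generated assembly would have to be inspected to confirm the transformation.

```d
// Hypothetical example of a copy loop that optimizers may recognize.
void copyLoop(int[] dst, const(int)[] src)
{
    assert(dst.length == src.length);
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}

void main()
{
    auto a = [10, 20, 30, 40, 50];
    auto b = new int[](a.length);
    copyLoop(b, a);
    assert(b == a);
}
```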
Sep 05
prev sibling parent reply Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Thursday, 5 September 2019 at 20:16:24 UTC, 12345swordy wrote:
 - It is easier to debug and read in the d language than in the 
 c language.
 - I was shown faster memory allocation speed compared to libc.
 - other memory allocators are not part of the d language standard 
 library.

 Most importantly, yet another disappointing development I have seen 
 in regards to the development of the d language.

 - Alex
Sorry, but IMHO, these reasons are not enough for me to start an allocator project. You may want to consider that these reasons are not enough for you and / or the D community either.

The first one is subjective. Considering that we're part of the D community, most of us would agree. But what is not subjective is how many people know D vs e.g. C, meaning how many people can actually contribute.

For the second, I guess you mean "if you were shown". It's really very difficult to create _and_ maintain an all-around libc equivalent in performance (for all archs etc.). And even then, it probably is not a useful goal. Most people will have libc available if they care so much about performance. Maybe a more useful goal would be to create a minimalistic allocator, which is very different. And then you have to think about whether we actually need it. I had asked a person who was working on WASM, which would be one target if this moved forward, and he told me that he could do his job using std.experimental.allocator.

For the third question, I'll reply with a question: So? :)

- Stefanos
Sep 05
parent reply 12345swordy <alexanderheistermann gmail.com> writes:
On Thursday, 5 September 2019 at 21:33:37 UTC, Stefanos Baziotis 
wrote:

 For the second, I guess you mean "if you were shown".
No, I *had* been shown. Regardless, this is a major disappointment. Good work has gone to waste. I cannot believe it was accepted in the first place if it was going to turn out to be pointless. This speaks very poorly of the D Language Foundation, IMO. You had better close those PRs, as it is quite clear that they are never going to be accepted. - Alex
Sep 05
parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Friday, 6 September 2019 at 01:50:59 UTC, 12345swordy wrote:
 On Thursday, 5 September 2019 at 21:33:37 UTC, Stefanos 
 Baziotis wrote:

 No, I *had* been shown.
OK, I wasn't aware of that.
 Regardless, this is a major disappointment. Good work has gone to 
 waste. I cannot believe it was accepted in the first place if 
 it was going to turn out to be pointless.
 This speaks very poorly of the D Language Foundation, IMO. You 
 had better close those PRs, as it is quite clear that they are 
 never going to be accepted.

 - Alex
They will be closed, yes. I hope that the work will not go completely to waste. When I have time, I will gather the useful code into one repo with the D, C and assembly versions. I think there are some important things to take away from it. - Stefanos
Sep 06
prev sibling parent a11e99z <black80 bk.ru> writes:
On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy wrote:
 On Thursday, 5 September 2019 at 16:26:08 UTC, Stefanos

 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex
 ... mimalloc() of Microsoft:
 https://forum.dlang.org/thread/krsnngbaudausabfsqkn@forum.dlang.org
Sep 05