digitalmars.D - [GSoC] 'Independency of D from the C Standard Library' progress and

Stefanos Baziotis (84/84) May 31 2019 I'm moving forward with the D implementations of the C parts that

Stefanos Baziotis (2/3) May 31 2019 Forgot that it targets x86_64.

Jacob Carlborg (4/7) Jun 02 2019 And which OS?

sarn (12/15) May 31 2019 Hi Stefanos, good project :)

Stefanos Baziotis (15/30) Jun 01 2019 As you can see in the "Next Month" above, we're planning to

sarn (16/18) Jun 01 2019 Do you mean you're planning to allow the stdlib's allocation

Stefanos Baziotis (42/58) Jun 01 2019 Currently, it is using malloc() and free(). Maybe you mean move

Mike Franklin (30/40) Jun 01 2019 std.experimental.allocator

sarn (6/30) Jun 02 2019 Thanks, that makes sense. It sounds like a version spec that
Sebastiaan Koppe (17/22) Jun 02 2019 You probably don't need or want to port the whole of

Stefanos Baziotis (6/14) Jun 03 2019 Sebastiaan I don't have a good answer for you right now.
Mike Franklin (9/20) Jun 03 2019 Yes, that is understood. Only what is required to implement a

Walter Bright (2/2) May 31 2019 Note that DMC++'s standard library is now Boost licensed, and so can be ...

Stefanos Baziotis (2/5) Jun 01 2019 Thanks, I'll take a look. Do we have any benchmarks?

Walter Bright (2/3) Jun 02 2019 No.

Aurélien Plazzotta (3/6) Jun 01 2019 Do you think it is planned to remove all C++ dependencies from D

Seb (5/12) Jun 01 2019 If I understand you correctly, this has already happened. DMD can

Thomas Mader (3/6) Jun 01 2019 Cool project!

Stefanos Baziotis (9/15) Jun 01 2019 For now, you can follow this repo:

Stefanos Baziotis (7/9) Jun 03 2019 https://github.com/baziotis/Dmemcpy

Andrei Alexandrescu (48/58) Jun 03 2019 At 512 lines including tests, it seems on the involved side. The

Mike Franklin (24/81) Jun 03 2019 Stefanos, everything Andrei has said here is correct, but it is

Andrei Alexandrescu (8/11) Jun 03 2019 My point was not to cast doubt and debate away in the forum (I think we
Stefanos Baziotis (103/125) Jun 03 2019 Thank you very much Mike! Andrei, I hear you as well and thank

Mike Franklin (96/151) Jun 03 2019 I did some initial benchmarks at

Andrei Alexandrescu (11/15) Jun 03 2019 Mike, you must understand this is a terrible argument. It should never

Stefanos Baziotis (18/30) Jun 03 2019 I agree that a good time to ask is simply always. And that we
Mike Franklin (12/24) Jun 03 2019 The point I'm trying to make is that we are in the coding stage

KnightMare (3/3) Jun 04 2019 TL;DR

KnightMare (2/5) Jun 04 2019 LDC can compile code to WASM already
Mike Franklin (9/12) Jun 04 2019 If any of the work from this project gets merged into druntime

Walter Bright (3/12) Jun 04 2019 And here:

Stefanos Baziotis (9/11) Jun 04 2019 Please, consider again that this is not the version with which

Mike Franklin (5/11) Jun 04 2019 I benchmarked the older rte_memcpy here

Stefanos Baziotis (4/8) Jun 04 2019 Our Dmemcpy is faster than libc on a Linux virtual machine too. :p

Exil (8/19) Jun 04 2019 How did you compile the code? GCC and Clang both target baseline

Mike Franklin (3/7) Jun 04 2019 If you're referring to the rte_memcpy file, I compiled it with

Stefanos Baziotis (92/94) Jun 28 2019 An update regarding the project. There was a lot of turbulence in

Stefanos Baziotis (86/88) Jun 28 2019 === Blockers ===

Nicholas Wilson (17/109) Jun 28 2019 inline asm is generally very bad for the optimiser because it can

Stefanos Baziotis (39/55) Jun 28 2019 Exactly, that's the primary reason I mentioned that inline asm

Stefanos Baziotis (91/93) Jul 05 2019 I'm now moving to weekly updates. Before the updates of what I

Stefanos Baziotis (9/20) Jul 05 2019 An important omission is that GDC and LDC optimize the simple

Piotrek (109/111) Jul 05 2019 Hi Stefanos,

Stefanos Baziotis (59/76) Jul 06 2019 Hello, thank you! Yes, I hope too. If the D runtime moves away,

Piotrek (14/53) Jul 06 2019 I used the old repo for Dmemset. With Dmemutils it works now. I

Stefanos Baziotis (63/74) Jul 06 2019 Great, earlier today I realized that there were problems with

Timon Gehr (5/17) Jul 23 2019 It won't compile it, but it will attempt to parse it.

Stefanos Baziotis (10/12) Jul 06 2019 A kind request to anyone interested in helping, please for the

Stefanos Baziotis (55/58) Jul 20 2019 Last 2 Weeks

Stefanos Baziotis (40/40) Aug 02 2019 As it is mentioned in a previous post, this project has got

12345swordy (3/46) Sep 05 2019 Is this project dead in the water? Great, another dead project in

Stefanos Baziotis (10/12) Sep 05 2019 A dead project is a project that hasn't achieved its goals. This

12345swordy (5/18) Sep 05 2019 Is the implementation of memory allocation of the C standard

Stefanos Baziotis (16/19) Sep 05 2019 It depends on what you mean "achieved". Let me state some

12345swordy (10/32) Sep 05 2019 - It is easier to debug and read in the D language than in the C

H. S. Teoh (62/68) Sep 05 2019 [...]

Stefanos Baziotis (48/113) Sep 05 2019 - For the first 2, let me thank again Manu and Johan helped who

H. S. Teoh (38/54) Sep 05 2019 That's pretty scary that LLVM does that. It shakes my confidence in LLVM

Stefanos Baziotis (34/54) Sep 05 2019 I don't like it either. Although, I _think_ that you can
Johan Engelen (5/20) Sep 05 2019 FYI, GCC does the same.

Stefanos Baziotis (28/36) Sep 05 2019 Sorry, but IMHO, these reasons are not enough for me to start an

12345swordy (10/11) Sep 05 2019 No, I *had* been showed.

Stefanos Baziotis (9/19) Sep 06 2019 It will be closed yes. I hope that it will not go completely

a11e99z (1/7) Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

I'm moving forward with the D implementations of the C parts that the D
Runtime uses. Below is a not-so-small description of the project. I hope it
is descriptive enough to clear things up for people, as I probably did not
do a good job in the previous public discussions about this project.

     The goal of this project is to remove the dependency of the D Runtime
     on the C Standard Library. Currently, the D Runtime uses a small part
     of the C Standard Library. That is:
         a) The string.h family of functions: memcpy(), memmove(), etc.
         b) The standard allocator functions: malloc(), free(), etc.

     Such a small part does not justify the dependency on the C Standard
     Library, and the dependency comes with problems:
         1) C’s implementations are not type-safe and memory-safe.
         2) C’s implementations have accumulated a lot of cruft over the years.
         3) Cross-compiling is more difficult, as one must have a C runtime
         and toolchain available and configured in addition to the D runtime.
         This makes it difficult for D to create freestanding software.

     So, this project will provide alternative implementations of these
     functions, dependent only on the D Runtime. We hope that in the process
     we will leverage D features that C doesn't have:
         1) Type-safety and memory safety (bounds-checking etc.)
         2) Templates to branch to an optimal implementation at compile-time.
         3) Inlining, as the branching in C happens at runtime.
         4) Compile-Time Function Execution (CTFE) and introspection (type info).
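
     Points 2) and 3) can be illustrated with a minimal sketch. It is not
     the project's code: the names wordCopyable and copyTyped are
     hypothetical, and the two strategies are deliberately simplistic. The
     only point is that the size and alignment of T are known at compile
     time, so `static if` selects one branch and the other is never
     compiled into that instantiation.

```d
// Hypothetical helper: true when T can be copied one 8-byte word at a time.
enum bool wordCopyable(T) =
    T.sizeof % ulong.sizeof == 0 && T.alignof >= ulong.alignof;

// Hypothetical typed copy. The branch is chosen per instantiation at
// compile time, so there is no run-time size check.
void copyTyped(T)(ref T dst, ref const T src) @trusted
{
    auto d = cast(ubyte*) &dst;
    auto s = cast(const(ubyte)*) &src;

    static if (wordCopyable!T)
    {
        // Size is a multiple of 8 and T is 8-byte aligned: copy whole words.
        foreach (i; 0 .. T.sizeof / ulong.sizeof)
            (cast(ulong*) d)[i] = (cast(const(ulong)*) s)[i];
    }
    else
    {
        // Fallback: copy byte by byte.
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
}

unittest
{
    static struct Pair { long a; long b; }
    static assert(wordCopyable!Pair);
    static assert(!wordCopyable!(ubyte[3]));

    Pair from = Pair(1, 2), to;
    copyTyped(to, from);
    assert(to == from);

    ubyte[3] small = [7, 8, 9], copy;
    copyTyped(copy, small);
    assert(copy == small);
}
```

     A C memcpy(dst, src, n) receives n as a run-time value and has to
     branch on it on every call, unless the compiler can see a constant n
     and inline the call.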

     Important clarifications:
         1) It will not use the C Standard Library.
         2) The C Standard Library will still be available.
         3) We target the D Runtime and not the user (although, of course,
         users will be able to use it).
         4) We will provide a different interface from the C implementations,
         with the prospect of being more idiomatic D.
         5) Performance equal to or better than libc is not a hard constraint.
         We might reach it, but we might not. The hard constraint is that
         it will at least be close.

This month
==========

Implementation of string.h family:
     -- Week 1-2: Handling mis-alignment in memcpy().
     -- Week 3: memmove()
     -- Week 4: memset()

As a starting point, I matched the performance of libc memcpy for (big)
aligned data (hopefully, the implementation will become part of memcpyD in
some time).
Now, building on Mike Franklin's previous work, memcpyD() is faster than
libc memcpy for small data (less than 32768) and as fast for big data.

Next month
==========
Mike Franklin initially proposed the idea that we may be able to do
something better than implementing malloc() and free() again, just in D.
So, we decided that the best option is to integrate
std.experimental.allocator into the D Runtime. That involves creating an
allocator out of its building blocks and removing the dependency on Phobos.

Week 1: Create a basic allocator using the std.experimental.allocator interface
Week 2: Replace malloc(), free(), realloc() in the D runtime
Week 3: Iterate until we have good benchmarks.
Week 4: Remove Phobos dependencies from the allocator.

Blockers
========
No major ones.

May 31 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     Important clarifications:

Forgot that it targets x86_64.

May 31 2019

Jacob Carlborg <doob me.com> writes:

On Friday, 31 May 2019 at 21:40:11 UTC, Stefanos Baziotis wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     Important clarifications:

 Forgot that it targets x86_64.

And which OS?

--
/Jacob Carlborg

Jun 02 2019

sarn <sarn theartofmachinery.com> writes:

On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 I'm moving forward with the D implementations of the C parts 
 that the D Runtime
 uses.

Hi Stefanos, good project :)

Here's something to consider if you're replacing malloc() et al: 
it's popular (especially with large server deployments) to tune 
application memory allocation performance by replacing libc 
malloc() with alternatives such as tcmalloc and jemalloc.  That 
works because they use the same libc malloc() API but with a 
different implementation, injected at link or load time (using 
LD_PRELOAD or something).

It would be great if D code can still take advantage of 
alternative allocators developed by third-parties who may or may 
not be writing for D.

May 31 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 I'm moving forward with the D implementations of the C parts 
 that the D Runtime
 uses.

 Hi Stefanos, good project :)

Thank you!

 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.  
 That works because they use the same libc malloc() API but with 
 a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.

As you can see in the "Next Month" above, we're planning to replace
malloc() et al, but with a different interface. The reason is that we
believe it is more idiomatic D this way (I personally also believe that
malloc(), free() etc. have a bad interface for allocation). We even hope
that in the end (probably after GSoC) the allocator will be typed.

The allocators you mentioned might be an inspiration for the allocator I
will build using the std.experimental.allocator interface.
Moreover, let me stress that malloc(), free(), etc. will remain available
as well.

Jun 01 2019

sarn <sarn theartofmachinery.com> writes:

On Saturday, 1 June 2019 at 14:18:25 UTC, Stefanos Baziotis wrote:
 Moreover, let me stress that malloc(), free().. will be 
 available as well.

Do you mean you're planning to allow the stdlib's allocation 
backend to be switched completely to libc-style malloc() and 
free(), or just that developers can always import 
core.stdc.stdlib and call malloc() if they like?  (The second 
option won't be enough.)

One option is to design D's allocation so that users can link 
with wrapped versions of tcmalloc, etc.  However, it's important 
that this be designed properly so that it doesn't require a 
custom compiler toolchain, otherwise it'll just be a theoretical 
thing that no one actually does.  Preferably it would work with 
LD_PRELOAD.

I like the idea of moving beyond libc's API, but please consider 
and test this use case.  A lot of smart people outside D are 
working on allocators, and it would be a major disadvantage if D 
can't use them.

Jun 01 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 1 June 2019 at 22:45:40 UTC, sarn wrote:
 Do you mean you're planning to allow the stdlib's allocation 
 backend to be switched completely to libc-style malloc() and 
 free()

Currently, it is using malloc() and free(). Maybe you mean move 
away?

 or just that developers can always import core.stdc.stdlib and 
 call malloc() if they like?  (The second option won't be 
 enough.)

They will be able to, because libc is not going anywhere. The purpose is to
create an allocator _for the D Runtime_. Of course this allocator will be
available for users as well. It's just that the focus will be on the
runtime.
Our initial plan was to make a D version of malloc() and free().
But, as Mike first suggested, we have the chance to create a more D-style
allocator. And fortunately, the foundation has already been built in
std.experimental.allocator.
As a personal opinion, the interface of malloc() and free() is not ideal
for an allocator. From what I know, a lot of people working on allocators
seem to share that opinion.
Just to disambiguate again, the purpose is that the D Runtime won't depend
on libc.

 One option is to design D's allocation so that users can link 
 with wrapped versions of tcmalloc, etc.  However, it's 
 important that this be designed properly so that it doesn't 
 require a custom compiler toolchain, otherwise it'll just be a 
 theoretical thing that no one actually does.  Preferably it 
 would work with LD_PRELOAD.

Well, the thing is that to wrap an allocator, you first have to either
write the allocator in D or create a dependency on that allocator.
Our choice is close to the first: I won't port any existing allocator, but
the allocator I write will of course be inspired by the work of others.
Now, the important thing here is that I only have so much time.
It's only a summer, which is not even completely devoted to the allocator
(it's about half the time). So, hopefully, either I or other people will
continue the work post-GSoC.

 I like the idea of moving beyond libc's API, but please 
 consider and test this use case.  A lot of smart people outside 
 D are working on allocators, and it would be a major 
 disadvantage if D can't use them.

As I said, it will be able to use them. The purpose is not to replace them
in general, but specifically in the D Runtime.
Be sure to check the starting post in this thread again for why we're doing
this, and if there are any questions, please ask.

- Stefanos

Jun 01 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:

 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.  
 That works because they use the same libc malloc() API but with 
 a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.

std.experimental.allocator 
(https://dlang.org/phobos/std_experimental_allocator.html) 
supports an `IAllocator` interface 
(https://dlang.org/phobos/std_experimental_allocator.html#IAllocator).

The way I envision this playing out is that when 
std.experimental.allocator is ported to druntime, callers would 
use the `IAllocator` interface.  Therefore, any allocator 
conforming to that interface could potentially serve as 
druntime's allocator.  In order to swap the allocator, one would 
only have to implement the `IAllocator` interface, potentially 
even using the `Mallocator` 
(https://dlang.org/phobos/std_experimental_allocator_mallocator.html), 
and make the swap.
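
A rough sketch of that swap, under stated assumptions: `RuntimeAllocator` 
is a hypothetical, heavily reduced stand-in for `IAllocator` (the real 
Phobos interface has many more members), and `rtAllocator`, `LibcBackend` 
and `CountingBackend` are illustrative names, not druntime or Phobos API.

```d
import core.stdc.stdlib : malloc, free;

// Hypothetical, reduced stand-in for Phobos' IAllocator.
interface RuntimeAllocator
{
    void[] allocate(size_t bytes);
    bool deallocate(void[] block);
}

// Backend that forwards to libc, similar in spirit to Mallocator.
final class LibcBackend : RuntimeAllocator
{
    void[] allocate(size_t bytes)
    {
        if (bytes == 0)
            return null;
        auto p = malloc(bytes);
        return p is null ? null : p[0 .. bytes];
    }

    bool deallocate(void[] block)
    {
        free(block.ptr);
        return true;
    }
}

// Wrapper that counts live blocks; any conforming object can be plugged in.
final class CountingBackend : RuntimeAllocator
{
    private RuntimeAllocator inner;
    size_t live;

    this(RuntimeAllocator inner) { this.inner = inner; }

    void[] allocate(size_t bytes)
    {
        auto block = inner.allocate(bytes);
        if (block !is null)
            ++live;
        return block;
    }

    bool deallocate(void[] block)
    {
        if (block !is null)
            --live;
        return inner.deallocate(block);
    }
}

// Hypothetical hook the runtime would allocate through (thread-local here).
RuntimeAllocator rtAllocator;

unittest
{
    auto counting = new CountingBackend(new LibcBackend);
    rtAllocator = counting;              // "make the swap"

    auto block = rtAllocator.allocate(64);
    assert(block.length == 64 && counting.live == 1);
    rtAllocator.deallocate(block);
    assert(counting.live == 0);
}
```

Every call through the interface is a virtual call, which is the cost of 
this kind of run-time pluggability.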

Providing the machinery to make that convenient (compiler 
switches, runtime configuration, etc.) should probably not be in 
the scope of the GSoC project as it is already pressed for time, 
but that should only be a PR away for anyone who considers it a 
priority.

That being said, we recognize that change needs to happen 
gradually to not rock the boat.  Therefore, even when this 
project is complete, it should probably still default to libc 
with a `-preview` switch or something like that to allow users to 
opt in to the D allocator.  Once there is sufficient experience 
in the real world with the D allocator, the defaults can 
potentially be swapped.

This GSoC project will attempt to remove libc as a hard, 
intrinsic dependency in druntime, and reduce it to a platform 
implementation detail.  In other words, druntime will not depend 
on libc, but a specific platform's port of druntime might.

Mike

Jun 01 2019

sarn <sarn theartofmachinery.com> writes:

On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 On Saturday, 1 June 2019 at 02:40:10 UTC, sarn wrote:

 Here's something to consider if you're replacing malloc() et 
 al: it's popular (especially with large server deployments) to 
 tune application memory allocation performance by replacing 
 libc malloc() with alternatives such as tcmalloc and jemalloc.
  That works because they use the same libc malloc() API but 
 with a different implementation, injected at link or load time 
 (using LD_PRELOAD or something).

 It would be great if D code can still take advantage of 
 alternative allocators developed by third-parties who may or 
 may not be writing for D.

 std.experimental.allocator 
 (https://dlang.org/phobos/std_experimental_allocator.html) 
 supports an `IAllocator` interface 
 (https://dlang.org/phobos/std_experimental_allocator.html#IAllocator).

 The way I envision this playing out is that when 
 std.experimental.allocator is ported to druntime, callers would 
 use the `IAllocator` interface.  Therefore, any allocator 
 conforming to that interface could potentially serve as 
 druntime's allocator.  In order to swap the allocator, one 
 would only have to implement the `IAllocator` interface, 
 potentially even using the `Mallocator` 
 (https://dlang.org/phobos/std_experimental_allocator_mallocator.html), and
make the swap.

Thanks, that makes sense.  It sounds like a version spec that 
switches to Mallocator (or whatever) could do it, as long as it 
doesn't force a recompilation of the whole runtime library.  
(Even more convenient would be a runtime flag like --DRT-gcopt, 
but I'm guessing you'd want to make it happen at compile time.)
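
A compile-time switch of that kind could look roughly like the sketch 
below. Everything in it is an assumption made for illustration: the 
version identifier `UseLibcBackend`, the `Backend` alias, and both backend 
structs are hypothetical, and the bump allocator is a toy.

```d
import core.stdc.stdlib : malloc, free;

// Forwards to libc, so replacements injected at link or load time
// (tcmalloc, jemalloc, ...) would still take effect.
struct LibcBackend
{
    enum name = "libc";
    static void* get(size_t bytes) @nogc nothrow { return malloc(bytes); }
    static void put(void* p) @nogc nothrow { free(p); }
}

// Toy stand-in for a D-native backend: hands out pieces of a fixed
// buffer and never reuses them.
struct BumpBackend
{
    enum name = "bump";
    private static ubyte[4096] buffer;   // thread-local storage
    private static size_t used;

    static void* get(size_t bytes) @nogc nothrow
    {
        if (bytes > buffer.length)
            return null;
        // Round the request up to a multiple of 16 bytes.
        immutable rounded = (bytes + 15) & ~cast(size_t) 15;
        if (rounded > buffer.length - used)
            return null;
        auto p = buffer.ptr + used;
        used += rounded;
        return p;
    }

    static void put(void* p) @nogc nothrow {}
}

// Selected when compiling with -version=UseLibcBackend (hypothetical).
version (UseLibcBackend)
    alias Backend = LibcBackend;
else
    alias Backend = BumpBackend;

unittest
{
    auto p = Backend.get(32);
    assert(p !is null);
    Backend.put(p);
}
```

Because `version` is resolved at compile time, every module that sees 
`Backend` has to be rebuilt to change it; a link-time or run-time hook 
would be needed to avoid recompiling the runtime library.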

Jun 02 2019

Sebastiaan Koppe <mail skoppe.eu> writes:

On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 The way I envision this playing out is that when 
 std.experimental.allocator is ported to druntime

You probably don't need or want to port the whole of 
std.experimental.allocator to druntime. I recently looked at the 
GC in druntime and it has its own pools etc. If it didn't, then 
the mark phase would be a lot harder and slower (according to my 
understanding...).

Therefore, for normal D programs, the only thing that makes sense 
is to implement the allocator that underlies the GC (an mmap or 
sbrk allocator). And be sure to make it pluggable.

What I am trying to say is that you can avoid porting the whole 
thing.

 use the `IAllocator` interface.  Therefore, any allocator 
 conforming to that interface could potentially serve as 
 druntime's allocator.

I am not a big fan of the IAllocator interface since it 
introduces a layer of indirection. There is no simple solution to 
avoid the indirection and get a pluggable allocator. Well, maybe 
a combination of ldc's @weak and LTO. Dunno...

https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.weak.29
http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html

Jun 02 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Sunday, 2 June 2019 at 11:19:20 UTC, Sebastiaan Koppe wrote:
 On Sunday, 2 June 2019 at 00:10:51 UTC, Mike Franklin wrote:
 [...]

 You probably don't need or want to port the whole of 
 std.experimental.allocator to druntime. I recently looked at 
 the GC in druntime and it has its own pools etc. If it didn't, 
 then the mark phase would be a lot harder and slower. 
 (according to my understanding...)

 [...]

Sebastiaan, I don't have a good answer for you right now.
std.experimental.allocator is quite new to me. I hope Mike can give you
more insight until I start working on this part.

Jun 03 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Sunday, 2 June 2019 at 11:19:20 UTC, Sebastiaan Koppe wrote:

 What I am trying to say is that you can avoid porting the whole 
 thing.

Yes, that is understood.  Only what is required to implement a 
malloc replacement is within the scope of the project.

 use the `IAllocator` interface.  Therefore, any allocator 
 conforming to that interface could potentially serve as 
 druntime's allocator.

 I am not a big fan of the IAllocator interface since it 
 introduces a layer of indirection. There is no simple solution 
 to avoid the indirection and get a pluggable allocator. Well, 
 maybe a combination of ldc's @weak and LTO. Dunno...

 https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.weak.29
 http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html

The project is pressed for time, so I'd like to stick with 
something known and well-documented.  Perhaps IAllocator is not 
the right solution in the end, but still, implementing it and 
seeing how it fits into druntime should inform future directions, 
and perhaps even elicit some new ideas.

Mike

Jun 03 2019

Walter Bright <newshound2 digitalmars.com> writes:

Note that DMC++'s standard library is now Boost licensed, and so can be used.

https://github.com/DigitalMars/dmc/tree/master/src

May 31 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src

Thanks, I'll take a look. Do we have any benchmarks?

Jun 01 2019

Walter Bright <newshound2 digitalmars.com> writes:

On 6/1/2019 7:24 AM, Stefanos Baziotis wrote:
 Do we have any benchmarks?

No.

Jun 02 2019

Aurélien Plazzotta <here gmail.com> writes:

On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src

Do you think it is planned to remove all C++ dependencies from D 
as well?

Jun 01 2019

Seb <seb wilzba.ch> writes:

On Saturday, 1 June 2019 at 19:50:08 UTC, Aurélien Plazzotta 
wrote:
 On Saturday, 1 June 2019 at 05:59:05 UTC, Walter Bright wrote:
 Note that DMC++'s standard library is now Boost licensed, and 
 so can be used.

 https://github.com/DigitalMars/dmc/tree/master/src

 Do you think it is planned to remove all C++ dependencies from 
 D as well?

If I understand you correctly, this has already happened. DMD can 
be built today without the need for any C++ as it's written in 
100% D.

Jun 01 2019

Thomas Mader <thomas.mader gmail.com> writes:

On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     The goal of this project is to remove the dependency of the 
 D Runtime
     from the C Standard Library.

Cool project!
Is it possible to follow the project somewhere on github?

Jun 01 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 1 June 2019 at 09:29:55 UTC, Thomas Mader wrote:
 On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
     The goal of this project is to remove the dependency of 
 the D Runtime
     from the C Standard Library.

 Cool project!

Thanks!

 Is it possible to follow the project somewhere on github?

For now, you can follow this repo: 
https://github.com/JinShil/memcpyD
which is related to memcpy. It's my mistake that I didn't fork it and make
the changes through PRs, which is why you can see a commit "... from 
Stefanos".
I'll post an update about where my experimentation will be 
visible.

Jun 01 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 1 June 2019 at 14:29:03 UTC, Stefanos Baziotis wrote:
 I'll post an update about where my experimentation will be 
 visible.

https://github.com/baziotis/Dmemcpy

You can follow this repo for memcpy. In the future, probably
I will merge all the string.h functions in one repo, but in the 
development
stage I think it's better to have them on their own.

Any feedback is greatly appreciated!

Jun 03 2019

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/19 10:17 AM, Stefanos Baziotis wrote:
 On Saturday, 1 June 2019 at 14:29:03 UTC, Stefanos Baziotis wrote:
 I'll post an update about where my experimentation will be visible.

 
 https://github.com/baziotis/Dmemcpy
 
 You can follow this repo for memcpy. In the future, probably
 I will merge all the string.h functions in one repo, but in the development
 stage I think it's better to have them on their own.
 
 Any feedback is greatly appreciated!

At 512 lines including tests, it seems on the involved side. The 
benchmarks ought to show a hefty improvement to match. Are there 
benchmark results available?

Quoting the rationale from the motivation in another thread:

1) C’s implementations are not type-safe and memory-safe.
2) C’s implementations have accumulated a lot of cruft over the years.
3) Cross-compiling is more difficult as now one should have available 
and configured a C runtime and toolchain apart from the D runtime. This 
makes it difficult for D to create freestanding software.

And then the listed advantages of using D for implementation (renumbered):

4) Type-safety and memory safety (bounds-checking etc.)
5) Templates to branch to an optimal implementation at compile-time.
6) Inlining, as the branching in C happens at runtime.
7) Compile-Time Function Execution (CTFE) and introspection (type info).

My view on formulating motivation is simple: do it like a scientist. 
Argue the facts. If facts are not available, argue fundaments and 
universal principles. If such are not available, the motivation is too weak.

(1) checks the "facts" box but has the obvious comeback "then how about 
a 2-line trusted wrapper over memcpy?" that needs to be explained. 
Related, obviously people who reach for memcpy() are often not looking 
for a safe primitive. a[] = b[] is safe, syntactically simple, and could 
lower to anything including memcpy.

(2) is quite specious and really needs some evidence. Is cruft in memcpy 
really an issue? I looked at memcpy() implementations a while ago but 
didn't save bookmarks. Did a google search just now and found 
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, which is 
very far from cruft-ridden. I do remember elaborate implementations of 
memcpy but so are (somewhat ironically) the 512 lines of the proposed 
implementation. I found one here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

No idea of its level of cruftiness, where it's used etc. The right way 
to argue (2) is to provide links to implementations that people can look 
at and decide without doubt, "yep, crufty".

(3) is... odd. Doesn't every machine ever come with a C implementation 
including a ready-to-link standard library? If not, isn't that a rarity? 
Again, that should be argued preemptively by the motivation section.

(4) brings again the wrapper argument
(5) is nice if and only if confirmed by benchmarks
(6) is also nice under the same conditions as (5)
(7) again... what's wrong with a wrapper that does if (__ctfe)
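
The wrapper idea behind (1), (4) and (7) might look roughly like this 
sketch. The name copyTo is hypothetical. Two choices here are assumptions 
rather than part of the argument above: memmove is swapped in for memcpy 
so that overlapping slices stay well-defined at run time, and the template 
constraint limits the wrapper to plain data so that a bitwise copy is valid.

```d
import core.stdc.string : memmove;

// Hypothetical trusted wrapper over the libc copy routine.
void copyTo(T)(T[] dst, const(T)[] src) @trusted
if (__traits(isPOD, T) && is(const(T) : T))
{
    // assert(0) is kept in release builds, so the check cannot be compiled out.
    if (dst.length != src.length)
        assert(0, "copyTo: length mismatch");

    if (__ctfe)
    {
        // libc is not callable during CTFE, so use a plain loop
        // (this loop does not handle overlapping slices).
        foreach (i; 0 .. src.length)
            dst[i] = src[i];
    }
    else
    {
        memmove(dst.ptr, src.ptr, src.length * T.sizeof);
    }
}

@safe unittest
{
    int[] a = [1, 2, 3, 4];
    auto b = new int[4];

    copyTo(b, a);            // callable from @safe code
    assert(b == a);

    b[] = 0;
    b[] = a[];               // the built-in slice copy mentioned in (1)
    assert(b == a);

    // The same function also runs at compile time via the __ctfe branch.
    static int[] atCompileTime()
    {
        int[] s = [5, 6, 7];
        auto d = new int[3];
        copyTo(d, s);
        return d;
    }
    static assert(atCompileTime() == [5, 6, 7]);
}
```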

These considerations are built with memcpy() in mind. With malloc() 
we're looking at a completely different ballgame. Implementing malloc() 
from scratch is a very serious project that needs almost overwhelming 
motivation. The goal of std.experimental.allocator was to offer a 
flexible framework for implementing general and specialized allocators, 
but simply replacing malloc() is more difficult to argue. Also, 
achieving comparable performance will be difficult.

Jun 03 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Monday, 3 June 2019 at 22:45:28 UTC, Andrei Alexandrescu wrote:

At 512 lines including tests, it seems on the involved side.
The benchmarks ought to show a hefty improvement to match. Are
there benchmark results available?

Quoting the rationale from the motivation in another thread:

1) C’s implementations are not type-safe and memory-safe.
2) C’s implementations have accumulated a lot of cruft over the
years.
3) Cross-compiling is more difficult as now one should have
available and configured a C runtime and toolchain apart from
the D runtime. This makes it difficult for D to create
freestanding software.

And then the listed advantages of using D for implementation
(renumbered):

4) Type-safety and memory safety (bounds-checking etc.)
5) Templates to branch to an optimal implementation at
compile-time.
6) Inlining, as the branching in C happens at runtime.
7) Compile-Time Function Execution (CTFE) and introspection
(type info).

My view on formulating motivation is simple: do it like a
scientist. Argue the facts. If facts are not available, argue
fundaments and universal principles. If such are not available,
the motivation is too weak.

(1) checks the "facts" box but has the obvious comeback "then
how about a 2-line trusted wrapper over memcpy?" that needs to
be explained. Related, obviously people who reach for memcpy()
are often not looking for a safe primitive. a[] = b[] is safe,
syntactically simple, and could lower to anything including
memcpy.

(2) is quite specious and really needs some evidence. Is cruft
in memcpy really an issue? I looked at memcpy() implementations a
while ago but didn't save bookmarks. Did a google search just
now and found
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c,
which is very far from cruft-ridden. I do remember elaborate
implementations of memcpy but so are (somewhat ironically) the
512 lines of the proposed implementation. I found one here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

No idea of its level of cruftiness, where it's used etc. The
right way to argue (2) is to provide links to implementations
that people can look at and decide without doubt, "yep, crufty".

(3) is... odd. Doesn't every machine ever come with a C
implementation including a ready-to-link standard library? If
not, isn't that a rarity? Again, that should be argued
preemptively by the motivation section.

(4) brings again the wrapper argument
(5) is nice if and only if confirmed by benchmarks
(6) is also nice under the same conditions as (5)
(7) again... what's wrong with a wrapper that does if (__ctfe)

These considerations are built with memcpy() in mind. With
malloc() we're looking at a completely different ballgame.
Implementing malloc() from scratch is a very serious project
that needs almost overwhelming motivation. The goal of
std.experimental.allocator was to offer a flexible framework
for implementing general and specialized allocators, but simply
replacing malloc() is more difficult to argue. Also, achieving
comparable performance will be difficult.

Stefanos, everything Andrei has said here is correct, but it is
missing some perspective and does not consider everything we've
discussed. Please STAY THE COURSE! Do not let this post
discourage you. The time for questioning the merits of this
proposal was 2 months ago; not now. Now that it is a
full-fledged GSoC project you are tasked to do the best you can.

Andrei, I agree with everything you've said, but there's more to
take into consideration. I have a response to some of the items
you've mentioned, and maybe I'll post that later.

For now allow me to express that I'm quite disappointed that you
are questioning the merit of this proposal when the time to do so
was 2 months ago when the GSoC projects were being reviewed, and
you were supposed to participate. The GSoC project is well
underway and Stefanos now needs to see the project through to
completion regardless of what anyone thinks about it. Please
don't undermine this project or diminish the morale of our
students with such posts.

At the moment we need feedback on the actual memcpy
implementation, not whether you think this project is a good idea
or not.

Stefanos, please don't let this post discourage you. Please STAY
ON TASK.

Mike

Jun 03 2019

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/19 7:35 PM, Mike Franklin wrote:
 Andrei, I agree with everything you've said, but there's more to take 
 into consideration.  I have a response to some of the items you've 
 mentioned, and maybe I'll post that later.

My point was not to cast doubt and debate away in the forum (I think we 
really should avoid the "weak writeup/strong forum argument" pattern 
like the plague), but instead to help improve the writeup of the 
motivation. Ideally the pertinent parts should be used to improve that 
section in the project. Essentially someone reading the motivation 
should gather a good understanding and have most relevant questions 
preempted.

Jun 03 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Monday, 3 June 2019 at 23:35:21 UTC, Mike Franklin wrote:
 Stefanos, everything Andrei has said here is correct, but it is 
 missing some perspective and does not consider everything we've 
 discussed.  Please STAY THE COURSE!  Do not let this post 
 discourage you.  The time for questioning the merits of this 
 proposal was 2 months ago; not now.  Now that it is a 
 full-fledged GSoC project you are tasked to do the best you can.

 Andrei, I agree with everything you've said, but there's more 
 to take into consideration.  I have a response to some of the 
 items you've mentioned, and maybe I'll post that later.

 For now allow me to express that I'm quite disappointed that 
 you are questioning the merit of this proposal when the time to 
 do so was 2 months ago when the GSoC projects were being 
 reviewed, and you were supposed to participate.  The GSoC 
 project is well underway and Stefanos now needs to see the 
 project through to completion regardless of what anyone thinks 
 about it.  Please don't undermine this project or diminish the 
 morale of our students with such posts.

 At the moment we need feedback on the actual memcpy 
 implementation, not whether you think this project is a good 
 idea or not.

 Stefanos, please don't let this post discourage you.  Please 
 STAY ON TASK.

Thank you very much Mike! Andrei, I hear you as well and thank 
you for the feedback!

I want to say this. Until about 5 days ago, even _I_ was unsure
about the goals.
And in the last 5 days of writing code, I'm getting more and more
unsure.
However, Mike is so ridiculously helpful that the last thing I
want is to:
- Sound disheartening.
- Sound like the guy that was picked and doesn't believe in the
project.
- Sound like the guy who has written D for 5 months and came to
question Mike, jpf and anyone else involved in the project.

Please, for any constructive feedback, questions or anything that
you're unsure about, direct the message to the most relevant
person. The facts mentioned are not my responsibility (hopefully,
for the better), so you probably want to ask the mentors.

But, my opinion was asked.
First of all, for memcpy et al.:
- To reach memcpy, you have to write assembly, not D. In the end
   the code will be bigger than memcpy because we will have the D
   improvement (what Mike has done), plus a memcpy-size
   implementation. (The two implementations that you posted are
   not the version that gets called. You can check (the horror) if
   you step in a debugger. Mike had a link about what seems to be
   the actual implementation, but I can't find it.)
- Because of the above, that code will not be D, it will be
   assembly, which brings one to the question "Why not use the
   already made asm versions for the assembly parts (like the libc
   version) rather than re-write it yourself?".
- To reach memcpy, although I'm getting good benchmarks, is next
   to impossible in half a summer. Yet, this is what is expected.
- Personally, I don't use dynamic arrays. My D is mostly betterC.
   For me, if people use the D features, they would probably never
   use memcpy. And if they don't, then they would probably use a
   low-level (unsafer?) memcpy with pointers. However, this is
   targeted to the D Runtime, with which I don't have any
   experience. So, I trust the mentors.

- In my opinion, the best way to go about this is to get only the
   memcpy implementation linked (so remove the dependency on
   libc), create wrappers around it, something like Jonathan
   Marler's code, _and_ use D for the small sizes, where it shines
   (as Mike's work has already shown). That way you leverage the
   work that has been put into memcpy, write idiomatic D, remove
   the libc dependency and make a (way) faster memcpy for small
   sizes.

But, who am I to question things? And I don't claim in any way
that, because my opinion is different from what we planned to do,
I believe that I am correct "but hey, they decide.. right?". No. I
just trust that they know better than me.

For malloc:
- The initial plan was that a malloc() would be written. Having
   tried to write my own malloc(), I say that I was pretty naive to
   think I would do a replacement from scratch in half a summer.
   Thankfully, something else was decided. The decision to use
   std.experimental.allocator was not mine. I learned about it
   probably less than a month ago.
   I can't say whether it's a bad or good decision, because I know
   too little to have any meaningful input. To me, it sounded good
   though. And again, I trust that the mentors know better (and
   it's not a final decision yet).

I don't want to sound rude. I'm grateful to the D community for
giving me the chance to work on something so challenging. But the
project, the goals, the motivation and the approach are not my
responsibility.
I, of course, have opinions about those, but my opinion is also
that, for something that I'm not in charge of, it's better to try
to help than to contradict, unless I think there's something
_very_ wrong. And I already feel I have contradicted a lot.
In the end, if the motivation is too weak, if the approach is
wrong and if the goals are not that desired, then why was the
project picked?
And why are those things questioned after 2 months?

Last but not least, while this may not be the best place for 
"famous last words", I want again to thank the mentors and 
especially Mike(!). Seb as well. This project, well.. let's just 
say it didn't have exactly the warmest feedback and their
support is important.

Jun 03 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Monday, 3 June 2019 at 22:45:28 UTC, Andrei Alexandrescu wrote:

At 512 lines including tests, it seems on the involved side.
The benchmarks ought to show a hefty improvement to match. Are
there benchmark results available?

I did some initial benchmarks at
https://github.com/JinShil/memcpyD when I made the first
feasibility study to see if this project was worth pursuing. The
initial results were encouraging, which is why we're taking it
further in this project.

I'll work with Stefanos to get a more polished implementation
that users can download and run for themselves.

Quoting the rationale from the motivation in another thread:

Yes, the motivation could be improved, but the time for
motivating this project was 2 months ago, not now. Now the
project is underway, and we need to see it to completion. The
focus now should be on providing feedback on the implementations
not the rationale/motivation.

Part of the motivation is so druntime no longer has a hard
intrinsic dependency on libc. If you just wrap the libc function
you're not achieving that goal.

Now, that being said, it is way out of the scope of this project
to provide a D implementation of memcpy for all platforms,
architectures and microarchitectures that D supports. So, we
need to deal with that.

Before I elaborate further, it's important to understand that
druntime is currently a monolith that is not architected or
structured properly. druntime is supposed to be the language
implementation, not libc bindings, libc++ bindings, windows
bindings, linux bindings, low-level code (whatever that means),
etc.

The language implementation *will* require certain features of
the underlying operating system and hardware. Some of those
features may be provided by libc, but that decision should be
made on a platform-by-platform basis. So what we hope to achieve
with this project is an idiomatic-D memory copy/compare
interface. That interface may simply forward to libc for those
features that don't have an optimized D implementation. Other
platforms may choose to implement a highly optimized
implementation in D. Other platforms may choose to mix the two
(e.g. an optimized D implementation for small copies, and forward
to libc for large copies). Others may choose to just implement a
simple while-loop because they either don't want to obtain a C
toolchain (those cross-compiling to embedded targets) or because
there isn't a C implementation available (new platforms like WASM).
This project aims to remove druntime's dependency on libc, but
the platform port of druntime may still choose to depend on it.
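
As a rough illustration of the "mix the two" option, a port could
expose one D entry point and decide behind it whether to stay in D
or forward to libc. This is only a sketch: `memoryCopy`, the
`NoLibc` version identifier and the 64-byte threshold are all
invented for the example and do not come from the project.

```d
// Hypothetical threshold; a real port would pick it from benchmarks.
enum size_t smallCopyLimit = 64;

// Plain D loop: needs no C toolchain and works on any target.
private void copyLoop(ubyte* dst, const(ubyte)* src, size_t n) nothrow @nogc
{
    foreach (i; 0 .. n)
        dst[i] = src[i];
}

// Hypothetical platform-facing interface. `NoLibc` is an invented version
// identifier standing in for however a port would opt out of libc.
void memoryCopy(ubyte[] dst, const(ubyte)[] src) nothrow @nogc
{
    if (dst.length != src.length)
        assert(0, "length mismatch");

    version (NoLibc)
    {
        copyLoop(dst.ptr, src.ptr, src.length);
    }
    else
    {
        import core.stdc.string : memcpy;

        if (src.length <= smallCopyLimit)
            copyLoop(dst.ptr, src.ptr, src.length); // small: stay in D
        else
            memcpy(dst.ptr, src.ptr, src.length);   // large: forward to libc
    }
}

unittest
{
    foreach (n; [0, 1, 63, 64, 65, 4096])
    {
        auto src = new ubyte[](n);
        foreach (i, ref b; src)
            b = cast(ubyte) i;
        auto dst = new ubyte[](n);
        memoryCopy(dst, src);
        assert(dst == src);
    }
}
```

Callers only ever see `memoryCopy`, so whether libc is involved
stays a per-platform implementation detail.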

That being said you might be wondering why we are bothering to
implement an entire memcpy in D for the x86_64 architecture.
1) because DMD's implementation is suboptimal,
2) to help motivate the entire project
3) to demonstrate D as a first-class systems programming language
4) to set an example and precedent for other platforms to
potentially follow

Please keep in mind we're trying to expand D to more platforms,
including resource-constrained embedded systems, OS programming,
bare-metal applications, and new platforms such as WASM. We want
D to be more easily portable, and that is partially achieved by
making a platform abstraction, independent of libc. libc is a
platform implementation detail.

That is not the memcpy that is actually on your machine. You can
find the more elaborate implementations here:
https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=14ec2285c0f82b570bf872c5b9ff0a7f25724dfd;hb=HEAD

Another from intel:
https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h

I do remember elaborate implementations of memcpy but so are
(somewhat ironically) the 512 lines of the proposed
implementation. I found one here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

No idea of its level of cruftiness, where it's used etc. The
right way to argue (2) is to provide links to implementations
that people can look at and decide without doubt, "yep, crufty".

The more elaborate C implementations are typically written in
assembly. They are difficult to follow due to all of the various
techniques to handle misalignment and the cleverness typically
required to achieve the best performance.

It is my hope that this project will explore how D can improve
such implementations by reducing the cleverness to small isolated
inline assembly blocks surrounded by D to make it easier to see
the flow control. I think D can do that.

Yes, it's a rarity, but nevertheless an artificial dependency for
druntime.

druntime does not sufficiently utilize libc to justify the hard
dependency. It just needs a few memory utilities and an
allocator. I think it's worthwhile to see if D can do just as
well without libc. In fact, if I had my druthers, I'd remove
libc's malloc altogether today and just add jemalloc to the
druntime repository. Maybe it could even be mechanically
translated to D.

(4) brings again the wrapper argument

For some platforms, it may just be a wrapper.

(5) is nice if and only if confirmed by benchmarks

We've already demonstrated this with benchmarks, I'll work with
Stefanos to get them made available, but
https://github.com/JinShil/memcpyD already shows the benefit.

(6) is also nice under the same conditions as (5)

Yep, see my response to (5)

(7) again... what's wrong with a wrapper that does if (__ctfe)

I think Stefanos is probably arguing in general about the
design-by-introspection features of D which include CTFE and
other metaprogramming features which is more-or-less the same as
(5). Those benefits have been demonstrated, and we'll work to
make those more apparent in the near future.

That being said, there's nothing ruling out an `if (__ctfe)`
block in the implementation if that's what is determined to be
best.

With malloc() we're looking at a completely different ballgame.
Implementing malloc() from scratch is a very serious project
that needs almost overwhelming motivation. The goal of
std.experimental.allocator was to offer a flexible framework
for implementing general and specialized allocators, but simply
replacing malloc() is more difficult to argue. Also, achieving
comparable performance will be difficult.

I agree to all of that, but we're going to try it anyway and see
how it does. If all we achieve in the end is just a wrapper that
forwards to libc's malloc and friends, it will still be better
than what we have now, because libc will then be simply an
implementation detail.

Mike

Jun 03 2019

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for motivating this 
 project was 2 months ago, not now.  Now the project is underway, and we 
 need to see it to completion.  The focus now should be on providing 
 feedback on the implementations not the rationale/motivation.

Mike, you must understand this is a terrible argument. It should never 
be made again. It is in fact the only part about your response that 
makes me genuinely worried. Benchmarks look good.

This is not a pregnancy. Any time is good for asking about the 
motivation, and the difference 2 months make goes in favor of the 
student and mentor, in that the answers to questions about the motivation 
are only stronger, clearer, and more convincing. I've seen PhD 
candidates roasted over their motivation (literally their "thesis", 
which means "proposition") on their DEFENSE day after years of hard 
work. It is not the fault of the person asking.

Jun 03 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Tuesday, 4 June 2019 at 01:32:49 UTC, Andrei Alexandrescu 
wrote:
 On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for 
 motivating this project was 2 months ago, not now.

 Mike, you must understand this is a terrible argument. It 
 should never be made again. It is in fact the only part about 
 your response that makes me genuinely worried. Benchmarks look 
 good.

 This is not a pregnancy. Any time is good for asking about the 
 motivation, and the difference 2 months make goes in favor of 
 the student and mentor is that the answers to questions about 
 the motivation are only stronger, clearer, and more convincing.

I agree that a good time to ask is simply always. And that we 
should
always argue if something doesn't seem right. But it's true
that this was to be decided months ago.

Moreover, for me it's important to consider the other side, and I 
think that's what Mike meant.
Imagine that you're a GSoC student and you open the forum, 4 
a.m.: "Aah... Here's a 1-page post about what you might be doing 
wrong with your project, already 1+ month working on it. Have a 
good night.." :p
This is a personal opinion, but to me: Of course contradict the 
bad, but also help / motivate the good.
These people (meaning any mentors and any GSoC student) chances 
are they're doing
_something_ beneficial. For example..

 Benchmarks look good.

They're better now. ;)

Jun 03 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Tuesday, 4 June 2019 at 01:32:49 UTC, Andrei Alexandrescu 
wrote:
 On 6/3/19 9:11 PM, Mike Franklin wrote:
 Yes, the motivation could be improved, but the time for 
 motivating this project was 2 months ago, not now.  Now the 
 project is underway, and we need to see it to completion.  The 
 focus now should be on providing feedback on the 
 implementations not the rationale/motivation.

 Mike, you must understand this is a terrible argument. It 
 should never be made again. It is in fact the only part about 
 your response that makes me genuinely worried. Benchmarks look 
 good.

The point I'm trying to make is that we are in the coding stage 
of this project.  Right now, students should be focused on 
getting the assignment done, not justifying the project to 
everyone.  The project already went through a vetting process and 
was approved.

 This is not a pregnancy.

Thank you.  I'm glad you were able to fulfill your sarcasm quota.

 Any time is good for asking about the motivation

Asking for more information about the motivation is one thing.  
Publicly doubting the motivation of a project that the D Language 
Foundation approved (a process you even participated in) is 
another.

Jun 03 2019

KnightMare <black80 bk.ru> writes:

TL;DR
Should we pay attention to WASM, where there are no system things
(mmap, allocators) and memory is an array of ints?

Jun 04 2019

KnightMare <black80 bk.ru> writes:

On Tuesday, 4 June 2019 at 08:31:54 UTC, KnightMare wrote:
 TL;DR
 Should we attn to WASM where there are no system things (mmap, 
 allocators), where memory is an array of ints?

LDC can compile code to WASM already

Jun 04 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Tuesday, 4 June 2019 at 08:31:54 UTC, KnightMare wrote:
 TL;DR
 Should we attn to WASM where there are no system things (mmap, 
 allocators), where memory is an array of ints?

If any of the work from this project gets merged into druntime 
(which appears will be an uphill battle) it should be easier to 
port druntime to new platforms like WASM.  That is one of the 
motivations:  To reduce the dependency on libc to platform 
implementation detail that any platform can 
override/reimplement/supplement as needed without impacting any 
other platform.

Mike

Jun 04 2019

Walter Bright <newshound2 digitalmars.com> writes:

On 6/3/2019 3:45 PM, Andrei Alexandrescu wrote:
 (2) is quite specious and really needs some evidence. Is cruft in
 memcpy really an issue? I looked at memcpy() implementations a while
 ago but didn't save bookmarks. Did a google search just now and found
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c, which
 is very far from cruft-ridden. I do remember elaborate implementations
 of memcpy but so are (somewhat ironically) the 512 lines of the
 proposed implementation. I found one here:
 
 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

And here:

https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM

Jun 04 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Tuesday, 4 June 2019 at 22:18:20 UTC, Walter Bright wrote:
 And here:

 https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM

Please consider again that this is not the version we're trying
to compete with. Mike posted this link:
https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=14ec2285c0f82b570bf872c5b9ff0a7f25724dfd;hb=HEAD

This looks like the one that is called. Consider also the other 
link Mike posted:
https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h

Although I did not have time to benchmark, my guess is that this,
which is one from Intel, is not nearly enough to beat libc.

Jun 04 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis 
wrote:

 This looks like the one that is called. Consider also the other 
 link Mike posted:
 https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h

 Although I did not have time to benchmark, my guess is that 
 this, which is one from Intel, is not at all enough against 
 libc.

I benchmarked the older rte_memcpy here
(https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h)
on a Linux virtual machine and rte_memcpy was quite a bit faster
than libc.  It's worth a deeper look.

Mike

Jun 04 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
 On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis

 I benchmarked the older rte_memcpy here
 (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h)
 on a Linux virtual machine and rte_memcpy was quite a bit faster
 than libc.  It's worth a deeper look.

 Mike

Our Dmemcpy is faster than libc on a Linux virtual machine too. :p

But yes, again, take what I said with a grain of salt, it's just 
an assumption. Indeed it deserves greater analysis.

Jun 04 2019

Exil <Exil gmall.com> writes:

On Wednesday, 5 June 2019 at 01:21:20 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 5 June 2019 at 01:14:26 UTC, Mike Franklin wrote:
 On Wednesday, 5 June 2019 at 00:22:08 UTC, Stefanos Baziotis

 I benchmarked the older rte_memcpy here
 (https://github.com/DPDK/dpdk/blob/60a3df650d523bd2e4bb4f77f9278f25f7f1a65c/lib/librte_eal/common/include/arch/x86/rte_memcpy.h)
 on a Linux virtual machine and rte_memcpy was quite a bit faster
 than libc.  It's worth a deeper look.

 Mike

 Our Dmemcpy is faster than libc on a Linux virtual machine too. 
 :p

 But yes, again, take what I said with a grain of salt, it's 
 just an assumption. Indeed it deserves greater analysis.

How did you compile the code? GCC and Clang both target baseline 
x64, to use features like AVX2 you have to enable them, that of 
course means that not all CPUs will be able to run the code, 
though it will run faster on those that do.

I'd say this should include ARM as well, but there's one D 
compiler that doesn't support it so...

Jun 04 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Wednesday, 5 June 2019 at 03:00:13 UTC, Exil wrote:

 How did you compile the code? GCC and Clang both target 
 baseline x64, to use features like AVX2 you have to enable 
 them, that of course means that not all CPUs will be able to 
 run the code, though it will run faster on those that do.

If you're referring to the rte_memcpy file, I compiled it with 
-march=native.

Jun 04 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 31 May 2019 at 21:01:01 UTC, Stefanos Baziotis wrote:
 The goal of this project is to remove the dependency of the D 
 Runtime from the C Standard Library.

An update regarding the project. There was a lot of turbulence in
this project, so I'm sorry I did not post earlier.

Previous month
==============

In this month the goals were replacements for memcpy(), memmove()
and memset(), named Dmemcpy, Dmemmove and Dmemset. Dmemcpy and
Dmemmove are merged in one repo [1] and Dmemset is in another [2].

The goal was to create fast versions of those, targeted to x86_64
and DMD. Because of that and because of the Blockers (more on those
later), there is some inline ASM in those implementations. There
was an effort to minimize it (currently it's only in Dmemcpy),
because I was informed that pure D should be the first priority.

In the last week there was an effort to create a test suite and a 
benchmark suite
for these repos. Quoting Mike and Johannes:


(basic types, structs, classes, static arrays, and dynamic arrays)
   * Add naive implementations for now to fill the gaps.

// NOTE(stefanos): Meaning, when x86 is not available or in any
// case that my code is not able to be compiled for the target,
// there should be a minimal pure D fall-back implementation.
// NOTE(stefanos): Classes are not tested, more on that in the
// Blockers.


Anyone visiting the repository should be able to clone it and do 
something like `run tests` and `run benchmarks`.
   2.  Create a `run.d` file, a `tests.d` file and a 
`benchmarks.d` file
   3.  When the user executes `rdmd run.d tests` it should compile 
the `tests.d` file and execute it producing a test report.
   4.  When the user executes `rdmd run.d benchmarks` it should 
compile `benchmarks.d`, execute it producing a benchmark report.

// NOTE(stefanos): I'm relatively satisfied with Dmemset. 
Dmemmove got better the last 3
// days but it probably still needs review / more work.


each repository including edge cases.
   * It should test each kind of type (basic types, structs, 
classes, static arrays, and dynamic arrays). // NOTE(stefanos): 
Again, for the classes refer to the Blockers.
   * Where relevant it should include a test of all interesting 
sizes.
   * Where relevant, it should test all variations of alignment up 
to 32.  This includes aligned-src & aligned-dst, unaligned-src & 
unaligned-dst, aligned-src & unaligned-dst, and unaligned-src & 
aligned-dst.  A nested foreach loop (e.g.  `foreach (srcOffset; 
alignments) { foreach (dstOffset; alignments) { ... } }`) should 
cover it.

// NOTE(stefanos): This is not done as proposed here. I had my 
own variation
// for alignment testing and this alternative was to be 
considered. My own, and this,
// still need review.

   * For memmove it should test all variations of overlap:  no 
overlap, exact overlap, source leading destination, destination 
leading source, etc...
   * Make sure each repository passes the test suite
   * Make sure the tests are easily comprehensible.  Keep them 
simple so any visitor to the repository can easily verify that 
the test suite is thorough.
   * Be sure the tests cover all implementations.
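
The nested alignment loop suggested in the list above could be
sketched as follows. `copyUnderTest` is a stand-in with an assumed
signature (the real Dmemcpy interface may differ), and
`correctAtAllAlignments` is a hypothetical helper name.

```d
// Stand-in for the routine under test; Dmemcpy's real signature may differ.
void copyUnderTest(ubyte* dst, const(ubyte)* src, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = src[i];
}

// Returns true if a copy of `size` bytes is correct for every pairing of
// source and destination offsets in 0 .. 32 from a 32-byte-aligned base.
bool correctAtAllAlignments(size_t size)
{
    enum size_t A = 32;
    auto srcBuf = new ubyte[](size + 2 * A);
    auto dstBuf = new ubyte[](size + 2 * A);
    foreach (i, ref b; srcBuf)
        b = cast(ubyte) (i * 7 + 1);

    // Index of the first 32-byte-aligned byte in each buffer.
    immutable srcBase = (A - (cast(size_t) srcBuf.ptr) % A) % A;
    immutable dstBase = (A - (cast(size_t) dstBuf.ptr) % A) % A;

    foreach (srcOffset; 0 .. A)
    {
        foreach (dstOffset; 0 .. A)
        {
            immutable s = srcBase + srcOffset;
            immutable d = dstBase + dstOffset;
            dstBuf[] = 0;
            copyUnderTest(dstBuf.ptr + d, srcBuf.ptr + s, size);

            if (dstBuf[d .. d + size] != srcBuf[s .. s + size])
                return false;
            // Nothing outside the destination window may be written.
            foreach (b; dstBuf[0 .. d])
                if (b != 0) return false;
            foreach (b; dstBuf[d + size .. $])
                if (b != 0) return false;
        }
    }
    return true;
}

unittest
{
    foreach (size; [0, 1, 15, 16, 17, 64, 100])
        assert(correctAtAllAlignments(size));
}
```

Checking the bytes around the destination window is an addition
not mentioned in the list; it is there to catch implementations
that overshoot with wide stores.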


each repository
   * Benchmark all sizes from at least 0~512 (preferably up to 
1024).  After 1024 exponentially increasing sizes up to at least 
65536.  They do not need to be powers of 2; consider even powers 
of 10 so it is easy to graph on a logarithmic scale.  An average 
of alignments is good for an overview, but the user should also 
be able to pick a single size and see how it performs for all 
variations of alignments.

// NOTE(stefanos): I don't test that many sizes in the experimental
// branch since the compile time explodes, to the point that it
// freezes Visual Studio. But I should have added a logarithmic
// scale; that was an oversight.

   * Be sure the benchmark is thorough enough to cover all 
implementations.
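
One way to produce the size schedule described above is sketched
below. `benchmarkSizes` is a hypothetical helper, and doubling
after 1024 is just one choice; the suggestion also allows powers of
10.

```d
// Hypothetical helper: every size from 0 to 1024, then exponentially
// increasing sizes up to 65536.
size_t[] benchmarkSizes()
{
    size_t[] sizes;
    foreach (size_t n; 0 .. 1025)
        sizes ~= n;
    for (size_t n = 2048; n <= 65_536; n *= 2)
        sizes ~= n;
    return sizes;
}

unittest
{
    auto sizes = benchmarkSizes();
    assert(sizes[0] == 0);
    assert(sizes[1024] == 1024);
    assert(sizes[1025] == 2048);
    assert(sizes[$ - 1] == 65_536);
}
```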



There is of course a lot to be said about the actual 
implementations and the decisions
taken but I guess the post would be very big, so I decided to 
focus on the final goals
and on the blockers. Please feel free to ask more specific 
questions on the implementations.




[1] https://github.com/baziotis/Dmemmove/tree/experimental - 
experimental branch
[2] https://github.com/baziotis/Dmemset

Jun 28 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
 // NOTE(stefanos): Classes are not tested, more on that on the 
 Blockers.

=== Blockers ===

-- Blocker 1 - DMD --

The main blocker was that the project was targeted to DMD. The
main problems are:
- The optimizer is limited.
- The generated code is often unpredictable (at least to me), both
in terms of performance and in terms of comprehensibility.
- Inline ASM can't be interleaved with pure D.

I want to stress that when writing such performance-sensitive
utilities, the language is used as a tool to generate the ASM more
efficiently (and with fewer errors) than writing it yourself. This
is a subjective opinion, but I guess that most people who have
worked on such utilities will agree.
This is why these utilities are either written in ASM or in a
language that is low-level enough, and has a good enough optimizer,
to let them be written in that higher-level language.

Now, I picked inline ASM as my preference because with pure D and
DMD there was:
- Poor debuggability. When the ASM is not hand-written, it is not
as easily comprehensible. To give that up, the ASM generated by
the compiler has to be predictable, which for me it wasn't.

- Poor tuning. One should not fight the optimizer. If I expect an
optimization to be done and it's not, then that's a problem.

- Poor scalability. If a person after me comes and tries to
optimize it further, I might have potentially created more
problems with pure D than I would have solved. For example, if I
were that person and I compiled the code and found an unexpected
load inside a loop that I couldn't get around by transforming the
code, then that would be a problem.
Basically, if we go the pure-whatever-language-we-choose way, we
must never, in the future, say "Better to have written it in ASM
from the start". And my prediction was that that would be the
case.

I can be a lot more specific on the reasons behind the pick of 
inline ASM, so feel free to ask.

Don't get me wrong, DMD is pretty good, but I, at least, could not
get it to match hand-written ASM.
I want to say that the inline ASM I'm talking about is being
minimized / removed and replaced with pure D for various reasons.

-- Blocker 2 - Test suite --

In this month, I was working with a test suite that I had not
examined carefully. That was certainly my biggest mistake up until
now, and that test suite was not good.
When I was advised to make a new test suite, the new suite
revealed serious bugs in the code. That was both good and bad. The
good thing was that I now had the chance to think hard about the
test suite and that, of course, the bugs were revealed. The bad
part was that Dmemcpy and Dmemmove had to be almost completely
remade in 3 days. It was done, but it was a serious blocker.

In that time, problems with Windows were revealed (specifically,
the calling convention), which were also solved, but that took a
lot of time as well.

-- Blocker 3 - Classes --

The problem with classes is that it is mentioned that the compiler
can change the layout of the fields in a class / struct. Even if
the two hidden fields (vptr and monitor) are still at the start,
it still seems hacky to take the class pointer, move forward 16
bytes and start the operations there (and the 16 bytes is not
standard because the pointer size depends on the target). So, we
decided to leave it for now.
My guess is that classes will probably never be used directly in
such low-level code.
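
For what it's worth, the hard-coded 16 could be avoided by asking
the compiler where the fields are. The sketch below is an
assumption-laden illustration, not something the project does:
`fieldBytes` and `Point` are hypothetical, only fields declared in
the class itself are considered (not those of base classes), and it
takes the lowest field offset rather than trusting declaration
order.

```d
class Point
{
    int x = 1;
    int y = 2;
}

// Hypothetical helper: the bytes from the lowest-offset declared field to
// the end of the instance, without hard-coding the hidden header size.
ubyte[] fieldBytes(C)(C obj) @system
if (is(C == class))
{
    size_t start = size_t.max;
    static foreach (i; 0 .. C.tupleof.length)
    {
        if (C.tupleof[i].offsetof < start)
            start = C.tupleof[i].offsetof;
    }
    enum end = __traits(classInstanceSize, C);
    return (cast(ubyte*) cast(void*) obj)[start .. end];
}

unittest
{
    auto p = new Point;
    auto bytes = fieldBytes(p);
    assert(bytes.length >= 2 * int.sizeof);
    assert(*cast(int*) bytes.ptr == 1); // the field x
}
```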

-- Blocker 4 - SIMD intrinsics --

When I started writing Dmemset, I decided to go pure-D first. In
that effort, there were 2 ASM instructions that I spent about 4
hours trying to get working. The ASM instructions are:
         movd    XMM0, ESI;
         pshufd  XMM0, XMM0, 0;

I don't know if more details on what I tried matter, but if anyone 
has an idea, please let me know.
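
For readers unfamiliar with the two instructions, here is a minimal scalar sketch of what they compute. It is not a SIMD implementation and not code from Dmemset: `movd` puts a 32-bit value in lane 0 of a 128-bit register and zeroes the other lanes, and `pshufd` with an immediate of 0 copies lane 0 into all four lanes. The names `modelMovd`, `modelPshufd` and `broadcastByte` are made up for illustration, and the byte replication in `broadcastByte` is an assumption about what a memset-style caller would do beforehand.

```d
// Scalar model of the two SSE2 instructions; nothing here emits SIMD code.

/// Models `movd XMM0, ESI`: lane 0 gets the value, lanes 1..3 are zeroed.
uint[4] modelMovd(uint value)
{
    uint[4] reg = [value, 0, 0, 0];
    return reg;
}

/// Models `pshufd dst, src, imm`: destination lane i takes the source
/// lane selected by the i-th 2-bit field of the immediate.
uint[4] modelPshufd(uint[4] src, ubyte imm)
{
    uint[4] dst;
    foreach (i; 0 .. 4)
        dst[i] = src[(imm >> (2 * i)) & 3];
    return dst;
}

/// Assumed caller: replicate the byte across 32 bits, then broadcast.
uint[4] broadcastByte(ubyte b)
{
    const uint word = b * 0x01010101u;
    return modelPshufd(modelMovd(word), 0);
}

unittest
{
    foreach (lane; broadcastByte(0xAB))
        assert(lane == 0xABABABAB);
}
```

The sketch can be checked with `dmd -unittest -main -run` on the file; any pure-D or intrinsic version should produce the same four lanes.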

Jun 28 2019

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Friday, 28 June 2019 at 12:14:13 UTC, Stefanos Baziotis wrote:
 On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis 
 wrote:
 // NOTE(stefanos): Classes are not tested, more on that on the 
 Blockers.

 === Blockers ===

 -- Blocker 1 - DMD --

 The main blocker was that the project was targeted at DMD. The 
 main problems are:
 - The optimizer is limited.
 - The code generated is a lot of times unpredictable (at least 
 to me).
 That is, both as far as performance is concerned but 
 comprehensibility as well.
 - Inline ASM can't be interleaved with pure D.

 I want to stress that when writing such performance sensitive 
 utilities, the language
 is used as the tool to generate the ASM more efficiently (and 
 with fewer errors) instead
 of writing it yourself. This is a subjective opinion, but I 
 guess that most people
 having worked on such utilities will agree.
 This is why these utilities are either written in ASM or in a 
 language that is low-level
 enough and with a good enough optimizer that will let them 
 write in this more high-level language.

 Now, I picked inline ASM as my preference because with pure D 
 and DMD there was:
 - Poor debuggability. When the ASM is not hand-written, it is 
 not as easily comprehensible.
 To sacrifice that, the ASM generated from the compiler has to 
 be predictable, which for me it wasn't.

 - Poor tuning. One should not fight the optimizer. If I expect 
 an optimization to be done
 and it's not, then that's a problem.

 - Poor scalability. If a person after me comes and tries to 
 optimize it further, I might have potentially created more 
 problems with pure D than what I would have solved. For 
 example, if I was that person and I did a compile and there was 
 an unexpected load inside a loop that I can't get around by 
 transforming the code, then that would be a problem.
 Basically, if we go the pure-whatever-language-we-choose way, 
 we must never, in the future, say "Better have written in ASM 
 from the start". And my prediction was that that would be the 
 case.

 I can be a lot more specific on the reasons behind the pick of 
 inline ASM, so feel free to ask.

 Don't get me wrong, DMD is pretty good but, at least I, could 
 not get it to the point
 of hand-written ASM.
 I want to say that this inline ASM I'm talking about is being 
 minimized / removed and is replaced with pure D for various 
 reasons.

Inline asm is generally very bad for the optimiser because it can 
have any side effects and is completely opaque. It is possible to 
generate the asm with string mixins, see e.g. the BigInt routines 
in phobos.

You should test your work with LDC at some point which has an 
optimiser worth using, but note the bit about opaque inline ASM 
hurting performance.

 -- Blocker 2 - Test suite --

 In this month, I was working with a test suite that I had not 
 examined carefully.
 That was certainly my biggest mistake up until now. And that 
 test suite was not good.
 When I got advised to make a new test suite, that new suite 
 revealed serious bugs in the code. That was both good and bad. 
 The good thing was that I now had the chance to think
 hard on the test suite and that of course the bugs were revealed.
 But the bad part was that Dmemcpy and Dmemmove had to almost be 
 completely remade in 3 days.
 It was done, but it was a serious blocker.

 In that time, problems with Windows were revealed 
 (specifically, the calling convention),
 which were also solved, but that was a lot of spent time as 
 well.

 -- Blocker 3 - Classes --

 The problem with classes is that it is mentioned that the 
 compiler can change the layout
 of the fields in a class / struct. Even if that means that the 
 two hidden fields
 (vptr and monitor) are still on the start, it still seems hacky 
 to take the class
 pointer, move forward 16 bytes and start the operations there 
 (and the 16 bytes is not standard because the pointer size 
 changes by the operating system). So, we decided
 to leave it for now.
 My guess is that classes probably will never be used directly 
 in such low-level code.

You should be able to get the offset of the first member with

int foo()
{
     static class A { int a; }
     return A.init.a.offsetof;
}

which will apply to any other non-nested class.
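
A slightly fuller sketch of the same idea, assuming the usual D class layout (vptr, then monitor, then the fields) for a plain, non-nested D class. The class `A` and the helper `firstFieldOffset` are illustrative names only; the point is that the offset comes from the compiler instead of a hard-coded 16.

```d
class A { int a; }

/// Offset of the first declared field, as computed by the compiler.
size_t firstFieldOffset()
{
    return A.init.a.offsetof;
}

unittest
{
    // Assumption: two hidden pointer-sized fields precede `a`, so the
    // check holds on 32-bit and 64-bit targets without hard-coding 16.
    assert(firstFieldOffset() == 2 * (void*).sizeof);
}
```

Low-level code could then start its operations at `cast(ubyte*) obj + firstFieldOffset()` rather than at a fixed byte count.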

 -- Blocker 4 - SIMD intrinsics --

 When I started writing Dmemset, I decided to go pure-D first. 
 In that effort, there
 were 2 ASM instructions that I was trying to get them work for 
 about 4 hours. The ASM
 instructions are:
         movd    XMM0, ESI;
         pshufd  XMM0, XMM0, 0;

 I don't know if more details on what I tried matter, but if anyone 
 has an idea, please inform me.

Take a look at https://github.com/AuburnSounds/intel-intrinsics

Keep up the good work!

Jun 28 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 28 June 2019 at 12:33:16 UTC, Nicholas Wilson wrote:
 inline asm is generally very bad for the optimiser because it 
 can have any side-effects and is completely opaque.

Exactly, that's the primary reason I mentioned that inline asm 
can't
be interleaved with D: performance. The compiler has
to be very conservative (more than one would expect), which means
that the only way to go is either pure D or full ASM and, in fact, 
`_naked`
ASM.

 It is possible to generate the asm with string mixins, see e.g. 
 the BigInt routines in phobos.

I suppose you mean this: 
https://github.com/dlang/phobos/blob/master/std/bigint.d
With a quick look I'm not sure I understand the reason to do 
string mixins.
I understand that it is for convenience (i.e. construct the ASM 
appropriately and not write a million different versions) and not 
performance reasons.

 You should test your work with LDC at some point which has an 
 optimiser worth using, but note the bit about opaque inline ASM 
 hurting performance.

It is tested with LDC but LDC was not a target for this project. 
Yes, inline ASM
is risky as is pure D for the reasons I said above. Maybe I 
should note explicitly
the risk of using only ASM as well, since I did for pure D.
It's a matter of compromise.

 You should be able to get the offset of the first member with

 int foo()
 {
     static class A { int a; }
     return A.init.a.offsetof;
 }

 which will apply to any other non-nested class.

Thanks, I had not considered that. I think I should make a 
separate post
asking the community whether they would 
like
classes to be supported, and how.


 Take a look at https://github.com/AuburnSounds/intel-intrinsics

Just a little more detail: from my research, these two ASM 
instructions should correspond roughly to these 2 
intrinsic calls:
     simd_stof!(XMM.LODD, void16)(v, XMM0);
     simd!(XMM.PSHUFD, 0, void16, void16)(XMM0, XMM0);

But I could not get them to work for the life of me.

I had not considered the intel intrinsics which is dumb if you 
consider
that there is a whole talk I watched on this topic.
It is this: https://www.youtube.com/watch?v=cmswsx1_BUQ
for anyone interested.

 Keep up the good work!

Thank you!

- Stefanos

Jun 28 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 28 June 2019 at 12:11:10 UTC, Stefanos Baziotis wrote:
 An update regarding the project. There was a lot of turbulence 
 in this project, so I'm sorry I did not post earlier.

I'm now moving to weekly updates. Before the updates of what I 
did, let me update
you on the state of the project.
The focus of the project has changed in the following ways:
- No assembly
- Generic and portable code
- Focus on LDC and GDC
- PRs to core.experimental

This week
==========
- Because of the above, this week I started replacing all the ASM 
with SSE intrinsics and providing a simple implementation for when 
SIMD is not available.
The goal was not only the replacement but also optimization 
for LDC.
Eventually (either as part of this summer or of future work), the 
simple implementation
should not be so "simple" and should be one that helps LDC and GDC 
optimize it without
the need to be explicitly in `version (D_SIMD)`.

- I moved the functions in a common repository: 
https://github.com/baziotis/Dmemutils

- I made a draft PR in the D runtime: 
https://github.com/dlang/druntime/pull/2662
(Thanks to lesderid and wilzbach for their help).


* A note on intel-intrinsics:
I first tried intel-intrinsics for the use of intrinsics.
That worked great in LDC (I think it's focused on LDC),
not so good in DMD and not at all in GDC.
Firstly, in DMD it didn't work, meaning it generated "wrong" code. 
The problem is
that doing a load/store with intel-intrinsics and doing a 
load/store with load/storeUnaligned of core.simd do not 
generate the same code.
This is important for Dmemmove because it is very
sensitive to the order of instructions because of the overlap 
(e.g. here [1]).

So, I made my own intrinsics that are different depending on if 
we use DMD or LDC.
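
To illustrate why overlap makes the order of loads and stores matter, here is a minimal byte-wise sketch. It is not the Dmemmove code (which works on SIMD-sized chunks); `naiveMemmove` is a hypothetical name, and the only point is that the copy direction has to depend on how the two buffers overlap.

```d
/// Byte-wise overlap-safe copy. Illustrative only.
void naiveMemmove(ubyte* dst, const(ubyte)* src, size_t n)
{
    if (dst < src)
    {
        // Destination starts below the source: copy forward, so every
        // source byte is read before a later store can overwrite it.
        foreach (i; 0 .. n)
            dst[i] = src[i];
    }
    else
    {
        // Destination starts at or above the source: copy backward.
        foreach_reverse (i; 0 .. n)
            dst[i] = src[i];
    }
}

unittest
{
    ubyte[5] buf = [1, 2, 3, 4, 5];
    ubyte[5] expected = [1, 1, 2, 3, 4];
    naiveMemmove(buf.ptr + 1, buf.ptr, 4); // shift right by one byte
    assert(buf == expected);
}
```

With wider loads and stores the same constraint applies per chunk, which is why two instruction sequences that look equivalent can differ in correctness once they are reordered.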

Regarding GDC, I just couldn't get it to compile.

My purpose is not to disparage intel-intrinsics, it's a great 
library. This was just
my experience, in which maybe I did something wrong. I also tried 
to contact
the creator, because maybe he has some insight.

* A note on GDC intrinsics:
GDC now compiles to the naive version, because I don't know of 
functions in GDC that correspond to load/storeUnaligned.
Iain told me that I could use the i386 intrinsics
(which as far as I know is this [2]), but I could not use them in 
GDC.

Blockers
========

Only what I said above regarding GDC intrinsics.


[1] 
https://github.com/baziotis/Dmemutils/blob/master/Dmemmove/Dmemmove.d#L267
[2] 
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html



Next week
==========
Sadly, I don't know. According to my schedule, the work on the 
allocator should have
started. But, there were a couple of problems in the project, 
which changed its focus and so there were things that had to be 
done that were not initially planned.
That means that the allocator, that should have started by now, 
hasn't.

Other than that, when the project started, the plans for the 
allocator changed to things
that I'm not fully experienced with (from malloc(), 
free() etc. to using std.experimental.allocator).

So, how the project will continue is currently an open discussion.

If std.experimental.allocator is interesting to the community, 
I'm happy to discuss it
and learn how to continue.

If we fall back to classic malloc(), free() implementations, this 
is something
that can't be fully done in the time available. To make a 
complete replacement
of malloc() et al., one has to make a serious attempt at 
multi-threading and optimization.

_HOWEVER_, one possible alternative is to provide minimalistic 
versions of those functions
for "baremetal" systems. That means either embedded systems or 
WASM. I think it would be
interesting to have no dependency on libc there and to have 
minimal
(in terms of resources and code) implementations.

Jul 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
 - Because of the above, this week I started with the 
 replacement of all the ASM
 with SSE intrinsics and providing simple implementation for 
 when SIMD is not available.
 The goal was not only the replacement but also the optimization 
 for LDC.
 Eventually (either as part of this summer or of future work), 
 the simple implementation
 should not be so "simple" and be one that helps LDC and GDC 
 optimize it without
 the need to be explicitly in `version (D_SIMD)`.

An important omission is that GDC and LDC optimize the simple 
version of Dmemset
for my AMD Ryzen in such a way that it reaches total parity with 
libc memset.
The difference is amazing when working with the LLVM / GCC 
back-ends.

Unfortunately, I don't have an Intel to test. It would be really 
good to have benchmarks from Intel users.

Jul 05 2019

Piotrek <dummy dummy.gov> writes:

On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
 Unfortunately, I don't have an Intel to test. It would be 
 really good to have benchmarks from Intel users.

Hi Stefanos,

This is great work. I hope Phobos will move away from clib some 
day.

As for the benchmarks.
I think you can post your results somewhere. Or you did. 
Unfortunately I cannot find them.

I tested Dmemset with dmd (ldc and gdc didn't compile) on an 
i3-3220 3.30GHz (Ubuntu).

The strange thing is I get different results when I change the 
following line in benchmarks.d
//note the upper bound
static foreach(i; 1..256)
  to
static foreach(i; 1..257)

size(bytes) Cmemmove(GB/s) Dmemmove(GB/s)

(1..256)
127 24.1439 20.7726
128 24.333 20.8421
129 24.3768 20.9648

(1..257)
127 24.4276 25.8072
128 24.679 26.2316
129 24.8052 26.0236

So the D version becomes better. Maybe this is related to the 
binary being different after compilation.


Some other results for "(1..257)" variant:

size(bytes) Cmemmove(GB/s) Dmemmove(GB/s)
1 0.269991 0.180151
2 0.438143 0.386652
3 0.657527 0.543067
4 1.00408 0.767028
5 1.26435 0.96617
6 1.51675 1.09579
7 1.76942 1.2771
8 2.02263 1.54563
9 2.27596 1.6421
10 2.52917 1.82534
11 2.78175 2.00729
12 3.03507 2.1897
13 3.28674 2.37267
14 3.53581 2.54155
15 3.79338 2.59328
16 5.25561 2.91728
17 5.58319 5.07972
18 5.91207 5.37934
19 6.24159 5.67784
20 6.56863 5.97583
21 6.84187 6.26141
22 7.22644 6.57598
23 7.55238 6.81922
24 7.88487 7.17182
...
39 9.85228 9.48541
40 10.1054 9.72436
41 10.3587 10.0661
42 10.5787 10.3286
43 10.862 10.661
44 11.1155 10.9688
45 11.3691 11.2042
46 11.6228 11.5771
47 11.8245 11.6284
48 12.1258 12.1853
49 12.3849 12.4931
...
59 14.7853 15.7441
60 15.165 16.1076
61 15.4095 16.4647
62 15.6639 16.803
63 15.9273 17.0932
64 16.1733 17.4991
65 11.862 17.671
66 12.0373 17.8678
67 12.2148 17.8533
68 12.4066 18.2475
69 12.5497 18.2762
...
124 23.6536 25.3192
125 23.9933 25.5515
126 24.2049 26.0169
127 24.4276 25.8072
128 24.679 26.2316
129 24.8052 26.0236
130 25.0353 26.446
131 24.8123 26.2339
132 25.2592 26.176
133 25.3562 26.6108
134 25.8571 26.8894
...
252 33.7209 33.9282
253 33.7367 34.1942
254 33.8958 34.59
255 33.412 33.6378
256 33.6542 34.661
500 39.5868 39.6527
700 43.7852 43.3711
3434 34.2489 45.8683
7128 35.2755 49.4049
13908 35.5447 51.2273
16343 35.0748 51.4501
27897 35.5615 51.0826
32344 35.1398 48.1469
46830 32.8887 34.9705
64349 33.2305 34.9398

Are they meaningful for you?

If you want, I can run additional benchmarks for you. For details, 
maybe we can continue on github. On the forum we can discuss some 
fundamental points.

Cheers,
Piotrek

Jul 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 5 July 2019 at 20:22:30 UTC, Piotrek wrote:
 Hi Stefanos,

 This is great work. I hope Phobos will move away from clib some 
 day.

Hello, thank you! Yes, I hope too. If the D runtime moves away, 
that
will be easier for the rest of D.

 As for the benchmarks.
 I think you can post your results somewhere. Or you did. 
 Unfortunately I cannot find them.

You're right, my mistake, there are no recent benchmarks. I'll 
try to post
today. They're similar to yours.

 I tested Dmemset with dmd (ldc and gdc didn't compile) on 
 i3-3220 3.30GHz (Ubuntu).

That's weird. Could you give some more info on how you 
compiled?
Did you use the procedure described in the README?
Meaning, `rdmd run benchmarks gdc` and `rdmd run benchmarks ldc`.
Now I checked and there was a regression which is now fixed. But 
with this
regression, I could compile benchmarks for gdc but not ldc or dmd.

 The strange thing is I get different results when I change the 
 following line in benchmarks.d
 So D version becomes better. Maybe this is related to different 
 binary file after compilation.

That is indeed strange but not too unexpected. A compiler (more 
likely
the DMD back-end) might decide to do strange things for reasons I 
don't know.
I'll try to re-create similar behavior on my machine.

 Some other results for "(1..257)" variant:

 Are they meaningful for you?

They are, thank you! The benchmarks are good.

Just some more info for anyone interested:
Regarding sizes 1-16. With GDC / LDC, in my benchmarks
(and by reading the ASM, I assume in all the benchmarks), it 
reaches parity
with libc (note that for sizes 1-16 the naive version is used,
meaning, a simple for loop). Now, for such small sizes, the 
standard way to go
is a fall-through switch (I can give more info on that if someone 
is interested).
The problem with that is that it's difficult for the compiler to 
optimize it along with the rest
of the code. Or at least, I didn't find 
a way
to do it. And so, I use the naive version, which is only slightly 
slower but
doesn't affect bigger sizes.
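
For anyone who has not seen the fall-through switch technique, here is a minimal sketch of the general idea, limited to sizes 0 to 4 to keep it short. It is a guess at the usual shape of such code, not something taken from Dmemset, and `smallSet` is a hypothetical name. Each case stores one byte and falls into the next, so a size of `n` executes exactly `n` stores with no loop counter.

```d
/// Sets the first n bytes of dst to val, for n in 0 .. 4 only.
void smallSet(ubyte* dst, ubyte val, size_t n)
{
    switch (n)
    {
        case 4: dst[3] = val; goto case; // D spells fall-through explicitly
        case 3: dst[2] = val; goto case;
        case 2: dst[1] = val; goto case;
        case 1: dst[0] = val; goto case;
        case 0: break;
        default: assert(0, "larger sizes belong to another code path");
    }
}

unittest
{
    ubyte[4] buf;
    ubyte[4] expected = [9, 9, 9, 0];
    smallSet(buf.ptr, 9, 3);
    assert(buf == expected);
}
```

A real version would cover sizes up to 16 and would typically use wider stores per case.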

Another important thing is that differences of +/- 1 GB/s should 
not be considered significant. The reason
is that at some point I benchmarked libc memset() against libc 
memset() and
there were +/- 1 GB/s differences.
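
The self-comparison described above can be sketched roughly as follows. This is not the benchmark code from the repository; `throughputGBs` and `noiseFloorGBs` are hypothetical helpers, and the buffer size and repetition count are arbitrary. The idea is only that timing the same function twice gives an estimate of how much of a measured difference is noise.

```d
import core.stdc.string : memset;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Throughput of `reps` memset calls over `buf`, in GB/s (bytes per ns).
double throughputGBs(ubyte[] buf, size_t reps)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (r; 0 .. reps)
        memset(buf.ptr, cast(int) (r & 0xFF), buf.length);
    const ns = sw.peek.total!"nsecs";
    return ns > 0 ? cast(double) (buf.length * reps) / ns : 0.0;
}

/// Measures the same function twice; the gap between the two results is
/// a rough noise floor for this machine and buffer size.
double noiseFloorGBs(size_t size, size_t reps)
{
    auto buf = new ubyte[size];
    const first = throughputGBs(buf, reps);
    const second = throughputGBs(buf, reps);
    return first > second ? first - second : second - first;
}
```

The numbers vary from run to run and machine to machine, so no expected output is shown; differences between two implementations that are smaller than this gap are not meaningful.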

 If you want I can run additional benchmarks for you.

Thanks, I don't want to pressure you. If you have time, I'm 
interested in
some feedback on GDC / LDC (if they compile and / or benchmarks).
My guess is that especially with GDC / LDC (and DMD, but I'm not 
yet sure
for DMD across different hardware), Dmemset can actually replace 
libc memset().

For Dmemmove / Dmemcpy it is harder to have a clear winner.

 For details, maybe we can continue on github. On forum we can 
 discuss some fundamentals points.

I'm available to you or anyone to give additional info / 
explanations etc.
on every line of code, decision, alternative implementations, 
possible
improvements etc. You can post here, contact me on Slack or email.
Some of these things will be added on the READMEs in the end, but 
we can
go in more detail.

Best regards,
Stefanos

Jul 06 2019

Piotrek <dummy dummy.gov> writes:

On Saturday, 6 July 2019 at 11:07:41 UTC, Stefanos Baziotis wrote:
 As for the benchmarks.
 I think you can post your results somewhere. Or you did. 
 Unfortunately I cannot find them.

 You're right, my mistake, there are no recent benchmarks. I'll 
 try to post
 today. They're similar to yours.


 I tested Dmemset with dmd (ldc and gdc didn't compile) on 
 i3-3220 3.30GHz (Ubuntu).

 That's weird. Could you give some more info on how did you 
 compile?

I used the old repo for Dmemset. With Dmemutils it works now. I 
removed static foreach from benchmark.d in order to run gdc.
Text results:
https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output

 The strange thing is I get different results when I change the 
 following line in benchmarks.d
 So D version becomes better. Maybe this is related to 
 different binary file after compilation.

 That is indeed strange but not too unexpected. A compiler (more 
 possible in
 the DMD back-end) might decide to do strange things for reasons 
 I don't know.
 I'll try to re-create similar behavior in mine.

It seems it wasn't related to this change. Looks like heisen 
optimization.

 Just some more info for anyone interested:
 Regarding sizes 1-16. With GDC / LDC, in my benchmarks
 (and by reading the ASM, I assume in all the benchmarks), it 
 reaches parity
 with libc (note that for sizes 1-16 the naive version is used,
 meaning, a simple for loop). Now, for such small sizes, the 
 standard way to go
 is a fall-through switch (I can give more info on that if 
 someone is interested).
 The problem with that is that it's difficult to be optimized 
 along with the rest
 of the code. Meaning, by the compiler. Or at least, I didn't 
 find a way
 to do it. And so, I use the naive version which is only 
 slightly slower but
 doesn't affect bigger sizes.

Funnily enough, DMD (with Dmemset) holds the speed record, over 
50 GB/s, copying some big block sizes.
However, aren't smaller sizes more important?


 My guess is that especially with GDC / LDC (and DMD, but I'm 
 not yet sure
 for DMD across different hardware), Dmemset can actually 
 replace libc memset().

One issue is that it should be tested on all variations of HW and OS.
At least it can be placed in an experimental module.


Cheers,
Piotrek

Jul 06 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Saturday, 6 July 2019 at 15:33:44 UTC, Piotrek wrote:
 I used the old repo for Dmemset. With Dmemutils it works now. I 
 removed static foreach from benchmark.d in order to run gdc.
 Text results:
 https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output

Great, earlier today I realized that there were problems with 
static foreach,
so now it's only using mixin in the main repo.

Basically, I should have been able to do:
version (GNU)
{
     // mixin
}
else
{
     static foreach
}

but that didn't work, meaning GDC tried to compile static foreach

Anyway, the benchmarks look good. In DMD, small sizes are not so 
good but the big
ones are better. But DMD is not the focus, since it now changed 
to GDC, LDC.

If you're interested, there are a lot of things to say regarding 
optimization for DMD. Some have been said in this thread as 
initially the project was focused on DMD. I'm actually thinking 
of writing an article so that maybe I can help the next guy that 
tries to optimize for DMD. I don't think it's a good decision to 
care at all about optimization in DMD, but one might do. And it's 
a hard road.
A tl;dr is that, for me at least, the only way to reach parity 
with libc is using (inline) ASM.

But the important benchmarks are for GDC, LDC, which agree with 
my benchmarks
on AMD and the result is that Dmemset reaches total parity with 
libc memset().
That's great to have from an Intel user as well, thanks for your 
time!

 It seems it wasn't related to this change. Looks like heisen 
 optimization.

Again, DMD. Quite an unpredictable compiler.

 Funnily enough, DMD (with Dmemset) holds the speed record, over 
 50 GB/s, copying some big block sizes.

DMD might have been able to get these results
due to inlining that was unrelated to the actual function (i.e. 
the benchmark code got inlined).

 However, aren't smaller sizes more important?

Again, fortunately DMD is not the focus but I guess one way one 
can somewhat answer this question is to do a report of the sizes 
used in the D runtime, since this is targeted to the D runtime.
Something like this: 
https://forum.dlang.org/post/jdfiqpronazgglrkmwfq forum.dlang.org

But this is not enough. A big part of optimization is to know the 
most
common cases (which could be the data format, size, hardware 
etc.) and optimize
for that first. And this is not adequate to show us the most 
common cases.

- For one, eventually different sizes might be added or removed 
and so the
common cases might change.
- Someone might want to use this function outside of the D 
runtime.

So, Dmemset() should be on par with or better than libc, which is 
(currently) achieved.

Note something interesting. GDC gets these results with the naive 
version. This
version is literally an 8-line for loop.
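
To give an idea of what such a naive version looks like, here is a minimal sketch. It is not the actual Dmemset source; `naiveMemset` is a hypothetical name and the signature is an assumption. The code is a plain byte loop with no intrinsics, which an optimising back-end is free to turn into wider stores.

```d
/// Plain byte loop with no SIMD and no inline ASM.
void naiveMemset(ubyte* dst, ubyte val, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = val;
}

unittest
{
    ubyte[8] buf;
    naiveMemset(buf.ptr, 0x5A, buf.length);
    foreach (b; buf)
        assert(b == 0x5A);
}
```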

 One issue is it should be tested on all variation of HW and OS.
 At least it can be placed in experimental module.

Right, it's currently PR'd to the D runtime: 
https://github.com/dlang/druntime/pull/2662
Just like you said, in an experimental module. :P

Best regards,
Stefanos

Jul 06 2019

Timon Gehr <timon.gehr gmx.ch> writes:

On 06.07.19 18:10, Stefanos Baziotis wrote:
 
 Basically, I should have been able to do:
 version (GNU)
 {
      // mixin
 }
 else
 {
      static foreach
 }
 
 but that didn't work, meaning GDC tried to compile static foreach

It won't compile it, but it will attempt to parse it.

You should be able to do:

version(GNU){ /+mixin+/ }
else mixin(q{ /+static foreach+/ });
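
A self-contained sketch of the same trick, with a made-up function `sumOfSizes` standing in for the benchmark code. In the `version (GNU)` branch an ordinary loop is used; in the other branch the `static foreach` lives inside a token string, so a front-end that takes the first branch only has to lex it, never parse it.

```d
// Path for a front-end that cannot parse static foreach.
version (GNU)
{
    int sumOfSizes()
    {
        int sum;
        foreach (i; 1 .. 4)
            sum += i;
        return sum;
    }
}
// The token string is only parsed when this branch is selected.
else mixin(q{
    int sumOfSizes()
    {
        int sum;
        static foreach (i; 1 .. 4)
            sum += i;
        return sum;
    }
});

unittest
{
    assert(sumOfSizes() == 6); // 1 + 2 + 3 on either branch
}
```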

Jul 23 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 5 July 2019 at 15:42:48 UTC, Stefanos Baziotis wrote:
 Unfortunately, I don't have an Intel to test. It would be 
 really good to have benchmarks from Intel users.

A kind request to anyone interested in helping: please, for the 
time being,
prioritize Dmemset (as Piotrek did).
It is in a somewhat final and polished state,
so we can have a more fruitful discussion without it 
undergoing big changes.

Dmemcpy / Dmemmove will undergo some changes (not fundamental I 
hope, but certainly
layout, naming etc.) before it can be PR'd to D runtime too.

Jul 06 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 5 July 2019 at 11:02:00 UTC, Stefanos Baziotis wrote:
 I'm now moving to weekly updates. Before the updates of what I 
 did, let me update
 you on the state of the project.

Last 2 Weeks
============

I could not do weekly updates because unfortunately, there are a 
lot of things out of schedule in the project.

So basically, the last 2 weeks I improved memcpy() / memmove() so 
they can be PR'd to the druntime. This [1] was the first PR. It 
had to be split into separate
PRs for memcpy() and memmove(). Yesterday, an important question 
was answered, which let me do a new PR for memcpy() [2].

Along with that, I created a memcmp() replacement [3]. I'm 
relatively satisfied with how the code looks, but this can't be 
PR'd yet to the druntime due to performance
problems (more on that in the blockers).

Blockers
========

-- On memcmp:

That was my post on Slack:

There are 3 major problems:
1) The performance is really really bad. Assuming that I have not 
done something stupid, it's just really bad. And actually, the 
asm generated (from `LDC`) is really weird too.
2) I made a version of Agner Fog's `memcmp` (which incidentally 
is similar to mine, it just goes reverse and does some smart 
things in subtractions). The thing is:
    a) Mine and Agner's should be about the same but they're not 
(Agner's is way better).
    b) Agner's is still quite slow compared to `libc`.
3) The `LDC` version gives some very weird results for `libc 
memcmp`. Meaning, in benchmarks. And actually, the -O3 ASM 
generated by LDC seems bad as well.

-- The state of the project

Right now, there is no specific roadmap nor any specific goals 
moving forward.
The project was divided in 2 parts. One was the memcpy() et al. 
which included
memcpy(), memmove() and memcmp() and the second was the allocator.
The first part is mostly done. After discussions with Seb, we 
decided
that the second part is not really needed after the release of 
Microsoft's mimalloc:
https://forum.dlang.org/thread/krsnngbaudausabfsqkn forum.dlang.org

So, currently I don't know how to move forward. I asked on the 
druntime
whether I can help with anything and zombinedev and Nicholas 
Wilson proposed
refactorings on core.thread. Nicholas helped me to start with 
that, so this
is going to be the next thing I will do. But this is supposed to 
be quick.

If anyone has any proposal on what to do next, I'm glad to 
discuss.


[1] https://github.com/dlang/druntime/pull/2671
[2] https://github.com/dlang/druntime/pull/2687
[3] https://github.com/baziotis/Dmemutils/tree/master/Dmemcmp

Jul 20 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

As it is mentioned in a previous post, this project has got 
hardly any attention. And since there was nothing to do, I did 
not post weekly.

=== Current State ===

--- Dmem* utilities ---

Fortunately, Nicholas Wilson has been helping me the last week 
get the 2 Dmem*
PRs I had done merged [1], [2]

I don't know of anything that these PRs need, although possibly I 
have done something wrong in the documentation.

I don't know if / when they will get merged since they're 
awaiting review.
I hope to have enough reviews to merge at least memset in the 
next 2.5 weeks.
And again, thanks a lot Nicholas for your time.

--- core.thread ---

Since there was nothing to do, I asked if there was anything that 
I could
do in the time. It was proposed that I could refactor core.thread.
With some help from Nicholas, I made a PR [3].
I'm glad that people seem to care about this change. It's going 
good I think.

=== Final 2.5 weeks ===

I honestly have no idea. Ideally, I would PR memmove() as well 
but I think it's
better to try to get at least one of the other 2 PRs merged first.
Other than that, if the core.thread gets merged, I will finish it.

One thing I proposed for the time remaining is a cross-compiler 
SIMD module.
I will write in a separate thread about that, but the idea came 
from the fact that when writing Dmem* utils, I could not find a 
way to use SIMD intrinsics
across compilers. So, I created something like a small SIMD 
library [4].
That is of course not really general, but it shows the idea.

[1] memset: https://github.com/dlang/druntime/pull/2662
[2] memcpy: https://github.com/dlang/druntime/pull/2687
[3] core.thread: https://github.com/dlang/druntime/pull/2689
[4] Mini SIMD module: 
https://github.com/dlang/druntime/pull/2687/files#diff-c2fcd73761ae6659ef91245ce1195b6d

Aug 02 2019

12345swordy <alexanderheistermann gmail.com> writes:

On Friday, 2 August 2019 at 14:51:25 UTC, Stefanos Baziotis wrote:
 As it is mentioned in a previous post, this project has got 
 hardly any attention. And since there was nothing to do, I did 
 not post weekly.

 === Current State ===

 --- Dmem* utilities ---

 Fortunately, Nicholas Wilson has been helping me the last week 
 get the 2 Dmem*
 PRs I had done merged [1], [2]

 I don't know of anything that these PRs need, although possibly 
 I have done something wrong in the documentation.

 I don't know if / when they will get merged since they're 
 awaiting review.
 I hope to have enough reviews to merge at least memset in the 
 next 2.5 weeks.
 And again, thanks a lot Nicholas for your time.

 --- core.thread ---

 Since there was nothing to do, I asked if there was anything 
 that I could
 do in the time. It was proposed that I could refactor 
 core.thread.
 With some help from Nicholas, I made a PR [3].
 I'm glad that people seem to care about this change. It's going 
 good I think.

 === Final 2.5 weeks ===

 I honestly have no idea. Ideally, I would PR memmove() as well 
 but I think it's
 better to try to get at least one of the other 2 PRs merged 
 first.
 Other than that, if the core.thread gets merged, I will finish 
 it.

 One thing I proposed for the time remaining is a cross-compiler 
 SIMD module.
 I will write in a separate thread about that, but the idea came 
 from the fact that when writing Dmem* utils, I could not find a 
 way to use SIMD intrinsics
 across compilers. So, I created something like a small SIMD 
 library [4].
 That is of course not really general, but it shows the idea.

 [1] memset: https://github.com/dlang/druntime/pull/2662
 [2] memcpy: https://github.com/dlang/druntime/pull/2687
 [3] core.thread: https://github.com/dlang/druntime/pull/2689
 [4] Mini SIMD module: 
 https://github.com/dlang/druntime/pull/2687/files#diff-c2fcd73761ae6659ef91245ce1195b6d

Is this project dead in the water? Great, another dead project in 
the graveyard of dead projects.

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 5 September 2019 at 15:53:20 UTC, 12345swordy wrote:
 Is this project dead in the water? Great, another dead project 
 in the graveyard of dead projects.

A dead project is a project that hasn't achieved its goals. This 
project
did twice, but both times the goals were not useful.
I explained that in the other thread [1].

Let's please only concern ourselves with constructive discussions 
from now on.

- Stefanos

[1] 
https://forum.dlang.org/post/triweshixkzzyxnaldlj forum.dlang.org

Sep 05 2019

12345swordy <alexanderheistermann gmail.com> writes:

On Thursday, 5 September 2019 at 16:26:08 UTC, Stefanos Baziotis 
wrote:
 On Thursday, 5 September 2019 at 15:53:20 UTC, 12345swordy 
 wrote:
 Is this project dead in the water? Great, another dead project 
 in the graveyard of dead projects.

 A dead project is a project that hasn't achieved its goals. 
 This project
 did twice, but both times the goals were not useful.
 I explained that in the other thread [1].

 Let's please only concern ourselves with constructive 
 discussions from now on.

 - Stefanos

 [1] https://forum.dlang.org/post/triweshixkzzyxnaldlj@forum.dlang.org

Is the implementation of memory allocation of the C standard 
library ever going to be achieved?

- Alex

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy wrote:
 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex

It depends on what you mean by "achieved". Let me pose some 
questions:
- Why do you want that memory allocator?
- What should this allocator be able to achieve?
- Why is the libc one not appropriate for the job?
- Why is no other allocator appropriate for the job?
- Can we create and maintain this allocator?

These questions are presented humbly. And they are important.
The fact that I did not set and answer such questions firstly _to 
myself_
for the first part of the project, meant that I did the project 
twice,
yet all this work was just thrown away as far as the D community 
is concerned.

- Stefanos

Sep 05 2019

12345swordy <alexanderheistermann gmail.com> writes:

On Thursday, 5 September 2019 at 17:56:07 UTC, Stefanos Baziotis 
wrote:
 On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy 
 wrote:
 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex

 It depends on what you mean "achieved". Let me state some 
 questions:
 - Why do you want that memory allocator ?
 - What this allocator should be able to achieve ?
 - Why the libc one is not appropriate for the job ?
 - Why no other allocator is appropriate for the job ?
 - Can we create and maintain this allocator ?

 These questions are presented humbly. And they are important.
 The fact that I did not set and answer such questions firstly 
 _to myself_
 for the first part of the project, meant that I did the project 
 twice,
 yet all this work was just thrown away as far as the D 
 community is concerned.

 - Stefanos

- It is easier to debug and read in the D language than in the C 
language.
- I was shown faster memory allocation speed compared to libc.
- Other memory allocators are not part of the D language standard 
library.

Most importantly, this is yet another disappointing development I 
have seen in regard to the development of the D language.

- Alex

Sep 05 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Sep 05, 2019 at 08:16:24PM +0000, 12345swordy via Digitalmars-d wrote:
[...]
 - It is easier to debug and read in the D language than in the C language.
 - I was shown faster memory allocation speed compared to libc.
 - Other memory allocators are not part of the D language standard library.
 
 Most importantly, this is yet another disappointing development I have
 seen in regard to the development of the D language.

[...]

Read the discussion that Stefanos referred to. Here are some of the
key blocking issues:

- C library APIs like memcpy, memset, etc., are not only in the C
  library, but are often implemented as *intrinsics* in compilers. One
  of the most important effects of this is that optimizers recognize
  them and understand their semantics, and can sometimes produce better
  code because of that. For example:

	int x, y=5;
	memcpy(&x, &y, int.sizeof); // C version
	... // optimizer knows that now x==5.

  Using a D version of memcpy in the above code can mean that the
  optimizer does *not* recognize that x==5, which can lead to poorer
  performance.

- Even if the previous point isn't an issue, there's still the problem
  of maintenance: the D version of mem* needs to be continuously updated
  because hardware is constantly evolving, and it takes significant
  manpower to (1) port the implementation to every supported
  architecture, (2) make sure they take maximum advantage of the quirks
  of the targeted platform, and (3) check that they are actually
  faster than the C implementations (which are available on basically
  any new platform anyway).

- D already has syntax for abstractly representing a memcpy operation:
  a[] = b[]. This syntax is type-safe, memory-safe, and the compiler can
  lower it to whatever it likes, including memcpy, or a custom
  implementation specialized for the target platform. That's where such
  primitives really belong, actually. (Historically they went into the C
  library, but these days compilers are more and more building them into
  intrinsics that can drive various codegen strategies (inlining,
  arch-specific optimizations, etc). They're gradually becoming more
  like low-level compiler primitives than your average C library
  functions.)
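
To make the a[] = b[] point concrete, here is a minimal sketch (not code
from the project; `copyInts` is a hypothetical helper named only for
illustration). The copy is written once at the language level, and how it
is lowered (a call to memcpy, inline moves, or something else) is left to
the compiler. Mismatched lengths are reported at run time when bounds
checks are enabled, instead of silently overrunning a buffer.

```d
// Hypothetical helper, named here only for illustration.
void copyInts(int[] dst, const(int)[] src) @safe
{
    // Element-wise array copy. The compiler picks the lowering;
    // the source code never names memcpy.
    dst[] = src[];
}

unittest
{
    int[4] a;
    int[4] b = [1, 2, 3, 4];
    copyInts(a[], b[]);
    assert(a == b);

    // Sub-slices work the same way, as long as the lengths match.
    copyInts(a[0 .. 2], b[2 .. 4]);
    assert(a == [3, 4, 3, 4]);
}
```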

The current work Stefanos has produced has a big performance impact
mainly only in DMD, which is known to have a weak optimizer, and anyone
who cares about runtime performance ought to be using GDC or LDC anyway.
In GDC/LDC using these custom D implementations wind up being worse
because they defeat the respective optimizers (they no longer recognize
memcpy/etc. semantics from these functions, so can't optimize based on
that).  So a lot of the effort ended up being directed towards working
around flaws in DMD's optimizer rather than producing *actual*
improvement over C's mem* primitives. This is really the wrong way to go
about things IMO; we should rather be fixing DMD's optimizer instead.
But once that's done there's even less reason to implement mem*
ourselves.

Note that this does not preclude the D compiler from, e.g., translating
statements like `a[] = b[]` into target-optimized instructions instead
of calling a function named 'memcpy'.  I'd argue that it's the
compiler's job (more specifically, the optimizer's job) to do the best
translation of a[] = b[] into machine code, not the standard library's
problem to account for N versions of M platforms in a gigantic
unmaintainable block of static if'd (or version'd) custom
implementation, whose only real value is to be able to pat ourselves on
the back that yes, we have our own memcpy/memset/etc., implementation
that we wrote in D, just because we can.  Porting the D compiler to a
new architecture already requires codegen work anyway, and work on
memory-copying/moving primitives really should be included under that
umbrella, rather than poorly reinvented in the runtime library.


T

-- 
Curiosity kills the cat. Moral: don't be the cat.

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 5 September 2019 at 21:17:07 UTC, H. S. Teoh wrote:

Thanks for the descriptive comment! Some comments from me:

 Read the discussion that Stefanos referred to. Here are some of 
 the key blocking issues:

 - C library APIs like memcpy, memset, etc., are not only in the 
 C
   library, but are often implemented as *intrinsics* in 
 compilers. One
   of the most important effects of this is that optimizers 
 recognize
   them and understand their semantics, and can sometimes 
 produce better
   code because of that. For example:

 	int x, y=5;
 	memcpy(&x, &y, int.sizeof); // C version
 	... // optimizer knows that now x==5.

   Using a D version of memcpy in the above code can mean that 
 the
   optimizer does *not* recognize that x==5, which can lead to 
 poorer
   performance.

 - Even if the previous point isn't an issue, there's still the 
 problem
   of maintenance: the D version of mem* needs to be 
 continuously updated
   because hardware is constantly evolving, and it takes 
 significant
   manpower to (1) port the implementation to every supported
   architecture, (2) make sure they take maximum advantage of 
 the quirks
   of the targeted platform, and (3) checking that they are 
 actually
   faster than the C implementations (which is available on 
 basically any
   new platform anyway).

- For the first 2, let me again thank Manu and Johan, who helped 
me realize them! Note also that we don't currently know of a way 
of informing LLVM or GCC about the semantics and thus getting this 
optimization. The closest thing we have is LLVM recognizing, by 
name, that a function does what e.g. memcpy() does. Which is a bad 
assumption to build upon.

 - D already has syntax for abstractly representing a memcpy 
 operation:
   a[] = b[]. This syntax is type-safe, memory-safe, and the 
 compiler can
   lower it to whatever it likes, including memcpy, or a custom
   implementation specialized for the target platform. That's 
 where such
   primitives really belong, actually. (Historically they went 
 into the C
   library, but these days compilers are more and more building 
 them into
   intrinsics that can drive various codegen strategies 
 (inlining,
   arch-specific optimizations, etc). They're gradually becoming 
 more
   like low-level compiler primitives than your average C library
   functions.)

AFAIK, this is implemented in the druntime. And the druntime
calls memcpy(). Essentially the goal of this project was to create
versions that would be used from the druntime, not the user. 
Other than that,
I agree!

 The current work Stefanos has produced has a big performance 
 impact mainly only in DMD, which is known to have a weak 
 optimizer,

Actually, when I was optimizing for DMD, I used assembly mainly 
because
I had to reach libc in performance. And using DMD, the only way 
to do
that is using assembly. A more useful goal would be to not try to 
reach
libc (certainly not in x86_64). Rather, create optimized versions
but using generic D. Meaning, to optimize purely based on 
algorithms,
with very few assumptions about the hardware. Much like MUSL.

 and anyone who cares about runtime performance ought to be 
 using GDC or LDC anyway. In GDC/LDC using these custom D 
 implementations wind up being worse because they defeat the 
 respective optimizers (they no longer recognize memcpy/etc. 
 semantics from these functions, so can't optimize based on 
 that).

Actually, this project reached libc in LDC, GDC in 1-1 benchmarks 
using D
and SIMD functions (but not ASM). The problem is when used in 
context exactly
for the reasons you described.


 So lot of the effort ended up being directed towards working 
 around flaws in DMD's optimizer rather than producing *actual* 
 improvement over C's mem* primitives.

Yes, essentially that was one of my first objections. To 
counteract the DMD flaws, you have to write ASM (if you want 
parity), which in turn raises the question: then why do it? This 
is what libc already does.

 This is really the wrong way to go about things IMO; we should 
 rather be fixing DMD's optimizer instead. But once that's done 
 there's even less reason to implement mem* ourselves.

IMHO, I don't think that fixing the DMD optimizer is a good way 
to go.
Rather, as I said above, aim for a generic D implementation, 
_without_ SIMD, based purely on algorithms. This can be useful for 
systems that don't have libc, and since the DMD optimizer does not 
use intrinsics the way LLVM / GCC do, the aforementioned problems 
are not problems there. Essentially, it's a win-win situation.
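
A minimal sketch of what "generic D, based purely on algorithms" could
look like. This is not the code from the PRs; `genericMemcpy` is a
hypothetical name, and the only hardware assumption is that copying one
aligned machine word at a time beats copying single bytes. Like memcpy,
it assumes the two buffers do not overlap.

```d
// Hypothetical name, chosen for this sketch only.
void* genericMemcpy(void* dst, const(void)* src, size_t n) nothrow @nogc
{
    enum W = size_t.sizeof;
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;

    // Word copies are only attempted when both pointers can become
    // word-aligned at the same time.
    if ((cast(size_t) d) % W == (cast(size_t) s) % W)
    {
        // Leading bytes, until d (and therefore s) is word-aligned.
        for (; n && (cast(size_t) d) % W; --n)
            *d++ = *s++;
        // One machine word per iteration.
        for (; n >= W; n -= W, d += W, s += W)
            *cast(size_t*) d = *cast(const(size_t)*) s;
    }
    // Remaining tail, or the whole buffer if the alignments differ.
    for (; n; --n)
        *d++ = *s++;
    return dst;
}

unittest
{
    ubyte[64] src, dst;
    foreach (i, ref b; src)
        b = cast(ubyte) i;
    genericMemcpy(dst.ptr + 1, src.ptr + 1, 50); // same relative alignment
    assert(dst[1 .. 51] == src[1 .. 51]);
    genericMemcpy(dst.ptr, src.ptr + 3, 20);     // alignments may differ
    assert(dst[0 .. 20] == src[3 .. 23]);
}
```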

- Stefanos

Sep 05 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Sep 05, 2019 at 09:50:04PM +0000, Stefanos Baziotis via Digitalmars-d
wrote:
[...]
 - For the first 2, let me thank again Manu and Johan, who helped me
   realize them! Note also that we don't currently know of a way of
   informing LLVM or GCC about the semantics and thus get this
   optimization. The closest thing we have is LLVM  recognizing that a
   function does what e.g. memcpy() does by name. Which is a bad
   assumption to build upon.

That's pretty scary that LLVM does that. It shakes my confidence in LLVM
a little. OTOH, the identifier "memcpy" is pretty unique and practically
universally understood to mean C's implementation of it, so it's a
reasonably safe assumption. Of course, if you ever wish to override
memcpy() with something that does something *other* than memcpy, you
could potentially have a vector for Thompson-style backdoors (function
does one thing when called, does something else when optimizer picks it
up).


[...]
 This is really the wrong way to go about things IMO; we should
 rather be fixing DMD's optimizer instead. But once that's done
 there's even less reason to implement mem* ourselves.

 
 IMHO, I don't think that fixing the DMD optimizer is a good way to go.
 Rather, as I said above, aim for generic D implementation, _without_
 SIMD, based purely on algorithms. This can be useful for systems that
 don't have libc and since the DMD optimizer does not use intrinsics as
 LLVM / GCC, the aforementioned problems, are not problems.
 Essentially, it's a win-win situation.

[...]

But that seems to me to be quite backwards.  If DMD were to target
systems that don't have libc, which AFAIK it currently doesn't, we'd
already have to do porting work in the form of how codegen is done. Then
whatever implementation of memcpy & co you end up with, will simply
become a part of this codegen implementation.  It could be instructions
directly produced by the backend, it could be calling a druntime
function version'd by that specific platform, etc.  But it'd be a
platform-specific, dmd-specific thing, not something generic that
applies across all platforms that D might target, and not something
that, e.g., GDC or LDC would use.

My point is that something like this seems to be more appropriate as
part of the support infrastructure for targeting libc-less platforms,
rather than a generic library function that can be used by everyone. So
any such implementation would be nested inside a version(platform_XYZ)
block, ostensibly something like:

	version(noLibC)
	{
		void _d_memcpy(...) { ... }
	}

where the compiler targeting any platform with no libc would define
version=noLibC, and emit references to _d_memcpy as part of the codegen
for copying memory.  It just wouldn't be something you could use in
general from any platform.
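
Fleshed out a little, that could look like the sketch below. The
assumptions are loud ones: `noLibC` is the made-up version identifier
from above, `copyBytes` is an illustrative name rather than a druntime
symbol, and the byte loop is a placeholder for whatever the port would
actually want. On a platform that does have libc, the same name simply
forwards to the C memcpy.

```d
version (noLibC)
{
    // Fallback for a hypothetical libc-less target: plain byte loop.
    void* copyBytes(void* dst, const(void)* src, size_t n) nothrow @nogc
    {
        auto d = cast(ubyte*) dst;
        auto s = cast(const(ubyte)*) src;
        foreach (i; 0 .. n)
            d[i] = s[i];
        return dst;
    }
}
else
{
    import core.stdc.string : memcpy;

    // libc is available: forward to the C implementation.
    void* copyBytes(void* dst, const(void)* src, size_t n) nothrow @nogc
    {
        return memcpy(dst, src, n);
    }
}

unittest
{
    int[3] a = [10, 20, 30];
    int[3] b;
    copyBytes(b.ptr, a.ptr, a.sizeof);
    assert(b == a);
}
```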


T

-- 
People tell me I'm stubborn, but I refuse to accept it!

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 5 September 2019 at 22:56:30 UTC, H. S. Teoh wrote:
 That's pretty scary that LLVM does that. It shakes my 
 confidence in LLVM a little. OTOH, the identifier "memcpy" is 
 pretty unique and practically universally understood to mean 
 C's implementation of it, so it's a reasonably safe assumption. 
 Of course, if you ever wish to override memcpy() with something 
 that does something *other* than memcpy, you could potentially 
 have a vector for Thompson-style backdoors (function does one 
 thing when called, does something else when optimizer picks it 
 up).

I don't like it either. Although, I _think_ that you can 
specifically turn this off, or that it is controlled by specific 
flags. I'd have to check.

 But that seems to me to be quite backwards.  If DMD were to 
 target systems that don't have libc, which AFAIK it currently 
 doesn't, we'd already have to do porting work in the form of 
 how codegen is done. Then whatever implementation of memcpy & 
 co you end up with, will simply become a part of this codegen 
 implementation.  It could be instructions directly produced by 
 the backend, it could be calling a druntime function version'd 
 by that specific platform, etc..  But it'd be a 
 platform-specific, dmd-specific thing, not something generic 
 that applies across all platforms that D might target, and not 
 something that, e.g., GDC or LDC would use.

I don't know if I understood this correctly.
For memcpy() et al to become part of the compiler codegen, they 
have
to be recognized as intrinsics. Like LLVM does. Is this what you 
refer to ?
Because that's another (interesting) discussion.

I was talking under the assumption that they're handled as just 
functions (as now), and things like a[] = b[] just call memcpy().
In that case, it doesn't pay to write an arch-specific 
implementation (written by the function implementor, that is, not 
generated by the compiler), because it can't be leveraged across 
architectures (or you have to write a specific one for each, which 
is not a good goal because of maintenance).

Suppose there was an LLVM-like thing where you can e.g. call 
vector extension intrinsics, and these are lowered to whatever 
arch-specific thing is available, even if the arch does not have 
the concept of vectorization. Even then, it would be better to 
focus on the algorithmic part, as the translation done by the 
compiler would be relatively basic.

I hope the above made _some_ sense. I feel I didn't articulate my 
thoughts
perfectly.

- Stefanos

Sep 05 2019

Johan Engelen <j j.nl> writes:

On Thursday, 5 September 2019 at 22:56:30 UTC, H. S. Teoh wrote:
 On Thu, Sep 05, 2019 at 09:50:04PM +0000, Stefanos Baziotis via 
 Digitalmars-d wrote: [...]
 - For the first 2, let me thank again Manu and Johan, who 
 helped me
   realize them! Note also that we don't currently know of a 
 way of
   informing LLVM or GCC about the semantics and thus get this
   optimization. The closest thing we have is LLVM  recognizing 
 that a
   function does what e.g. memcpy() does by name. Which is a bad
   assumption to build upon.

 That's pretty scary that LLVM does that. It shakes my 
 confidence in LLVM a little. OTOH, the identifier "memcpy" is 
 pretty unique and practically universally understood to mean 
 C's implementation of it, so it's a reasonably safe assumption.

FYI, GCC does the same.

(the opposite too: converting a copying pattern into memcpy is an 
optimization performed by LLVM, GCC, and MSVC)

-Johan

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 5 September 2019 at 20:16:24 UTC, 12345swordy wrote:
 - It is easier to debug and read in the D language than in the 
 C language.
 - I was shown faster memory allocation speed compared to libc.
 - Other memory allocators are not part of the D language 
 standard library.

 Most importantly, this is yet another disappointing development 
 I have seen in regard to the development of the D language.

 - Alex

Sorry, but IMHO, these reasons are not enough for me to start an 
allocator project.
You may want to consider that these reasons are not enough for 
you too and / or
the D community either.

The first one is subjective. Considering that we're part of the D 
community,
most of us would agree. But what is not subjective is how many 
people know
D vs e.g. C, meaning how many people can actually contribute.

For the second, I guess you mean "if you were shown". It's really 
very difficult
to create _and_ maintain a libc all-around equivalent in 
performance (for all
archs etc.). And even then, it probably is not a useful goal. 
Most people will
have the libc available if they care so much about performance.

Maybe a more useful goal would be to create a minimalistic 
allocator, which
is very different. And then you have to think if we actually need 
it.
I had asked a person who was working on WASM, which
would be one target if this moved forward and he told me that he 
could
do his job using the std.experimental.allocator.

For the third question, I'll reply with a question: So? :)

- Stefanos

Sep 05 2019

12345swordy <alexanderheistermann gmail.com> writes:

On Thursday, 5 September 2019 at 21:33:37 UTC, Stefanos Baziotis 
wrote:

 For the second, I guess you mean "if you were shown".

No, I *had* been shown.

Regardless, this is a major disappointment. Good work has gone to 
waste. I cannot believe it was accepted in the first place, if 
it was going to turn out to be pointless.
This speaks very poorly of the D Language Foundation, IMO. You had 
better close those PRs, as it is quite clear that they are 
never going to be accepted.

- Alex

Sep 05 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Friday, 6 September 2019 at 01:50:59 UTC, 12345swordy wrote:
 On Thursday, 5 September 2019 at 21:33:37 UTC, Stefanos 
 Baziotis wrote:

 No, I *had* been shown.

Ok, I'm not aware.

 Regardless, this is a major disappointment. Good work has gone 
 to waste. I cannot believe it was accepted in the first place, 
 if it was going to turn out to be pointless.
 This speaks very poorly of the D Language Foundation, IMO. You 
 had better close those PRs, as it is quite clear that they are 
 never going to be accepted.

 - Alex

They will be closed, yes. I hope that the work will not go 
completely to waste.
When I have time, I will gather the useful code in one repo with 
the D, C and assembly versions. I think there are some important 
things to take away.

- Stefanos

Sep 06 2019

a11e99z <black80 bk.ru> writes:

On Thursday, 5 September 2019 at 17:30:54 UTC, 12345swordy wrote:
 On Thursday, 5 September 2019 at 16:26:08 UTC, Stefanos

 Is the implementation of memory allocation of the C standard 
 library ever going to be achieved?

 - Alex

 ... mimalloc() of Microsoft:
 https://forum.dlang.org/thread/krsnngbaudausabfsqkn@forum.dlang.org

Sep 05 2019
