
digitalmars.D - Need a Faster Compressor

reply Walter Bright <newshound2 digitalmars.com> writes:
The current one is effective, but slow:

   https://github.com/dlang/dmd/blob/master/src/backend/compress.c

Anyone want to make a stab at making it faster? Changing the format is fair 
game, as well as down and dirty assembler if that's what it takes.

So really, how good are you at fast code?
May 21 2016
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 So really, how good are you at fast code?
I recall being able to get a marginal boost in speed by assigning the first 1-2 bytes in preassigned buckets (256 & 65536 respectively), so you have an immediate 1-2 byte matches (skipping the chafe so to speak). Unfortunately memory-wise it's highly inefficient and requires 2 passes over the data, and probably doesn't play well with cache. I assume this is related to compressing symbols thread? I mentioned possibly considering the LZO library.
May 21 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 2:27 PM, Era Scarecrow wrote:
 I assume this is related to compressing symbols thread?
Yes.
 I mentioned possibly considering the LZO library.
Give it a try, see if it pans out. Another possibility: the lookback dictionary is fixed at 1023 characters. Experimenting and benchmarking can indicate whether it should be larger or smaller. Perhaps 512 will yield just as good compression on real world identifier strings. Or try another scheme entirely, like Huffman coding.
May 21 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 21 May 2016 at 22:01:40 UTC, Walter Bright wrote:
 Give it a try, see if it pans out.

 Other possibilities are the lookback dictionary is fixed at 
 1023 characters. Experimenting and benchmarking can indicate if 
 it should be more or less. Perhaps 512 will yield just as good 
 compression on real world identifier strings.

 Or try another scheme entirely, like huffman coding.
I'm going to try for a lookback and scan which uses quite a bit of memory. On the other hand you can reuse the memory over and over again, so that isn't an issue, although it will take up a megabyte. Also written entirely in D, so... I think I'll go with a similar setup to yours for the encoding (3 bytes). Should be nearly identical in output.
May 21 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 5:52 PM, Era Scarecrow wrote:
  I'm going to try for a lookback and scan which uses quite a bit of memory. On
 the other hand you can reuse the memory over and over again, so that isn't an
 issue. Although it will take up a Meg in memory. Also written entirely in D,
so...

  I think I'll go with a similar setup you have for the encoding (3 bytes).
 Should be nearly identical in output.
Just a note: large lookup tables can have cold cache performance problems.
May 21 2016
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 01:36:39 UTC, Walter Bright wrote:
 On 5/21/2016 5:52 PM, Era Scarecrow wrote:
  I'm going to try for a lookback and scan which uses quite a 
 bit of memory.
Just a note: large lookup tables can have cold cache performance problems.
I know, if it doesn't fit in 64k then you can't rely on the cache...
May 21 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 01:46:23 UTC, Era Scarecrow wrote:
  I know, if it doesn't fit in 64k then you can't rely on the 
 cache...
Finished. It's blazing fast and in my test example it even compresses as well as or better than gzip/zlib (although maybe that's because it's headerless). Now I just need to stress test it against some really large input and check for failures. Regardless, it has compression and decompression functions, along with benchmarking against the compression algorithm I'm trying to beat.
May 21 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 04:35:15 UTC, Era Scarecrow wrote:
  Now I just need to stress test it against some really large 
 input and check for failures. Regardless it has a compression 
 and decompression functions, along with benchmarking it vs the 
 compression algorithm I'm trying to beat.
With 1 Million iterations:

new_compress: TickDuration(311404604)
id_compress:  TickDuration(385806589)

Approx 20% increase in speed (if I'm reading and did this right). I won't take this as the final verdict, but it is promising at least... Need a larger test sample to benchmark both functions against. Sample data is "E.s!(E.s!(E.s!(E.s!(E.s!(int).s(int).Result).s(E.s!(int)." etc. from the bug listing: https://issues.dlang.org/show_bug.cgi?id=15831#c4
May 21 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 11:26 PM, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 04:35:15 UTC, Era Scarecrow wrote:
  Now I just need to stress test it against some really large input and check
 for failures. Regardless it has a compression and decompression functions,
 along with benchmarking it vs the compression algorithm I'm trying to beat.
With 1 Million iterations:

new_compress: TickDuration(311404604)
id_compress:  TickDuration(385806589)

Approx 20% increase in speed (if I'm reading and did this right). I won't take this as the final verdict, but it is promising at least... Need a larger test sample to benchmark both functions against.
It is promising. Need more!
May 22 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 1:50 AM, Walter Bright wrote:
 It is promising. Need more!
I did figure out an improvement to the existing algorithm whereby it could skip forward by the length of the existing match rather than test every character (when it's looking for a longer match). But I have no clue how much of a speedup that would be.
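For comparison, a related common trick (the one zlib's longest_match uses, not necessarily the same as the skip described above) is to reject a candidate by checking only the byte at offset bestLen before doing the full compare. A rough sketch, assuming the candidate position is earlier than the current one:

// Returns the match length at `cand`, or 0 if it cannot beat `bestLen`.
size_t tryCandidate(const(char)[] id, size_t cur, size_t cand, size_t bestLen)
{
    // To improve on bestLen the candidate must at least match at offset bestLen,
    // so check that single byte first and skip the full comparison otherwise.
    if (cur + bestLen >= id.length || id[cand + bestLen] != id[cur + bestLen])
        return 0;
    size_t n = 0;
    while (cur + n < id.length && id[cand + n] == id[cur + n])
        ++n;
    return n;
}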
May 22 2016
prev sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 08:50:47 UTC, Walter Bright wrote:
 On 5/21/2016 11:26 PM, Era Scarecrow wrote:
 With 1 Million iterations:

 new_compress: TickDuration(311404604)
 id_compress   TickDuration(385806589)

  Approx 20% increase in speed (if i'm reading and did this 
 right).
It is promising. Need more!
Well, there's other good news. While I was working with about 2MB for the data size in general, it looks like it can be reduced to about 8k-16k and run entirely on the stack. It doesn't appear to gain any speed, or any speed gained is lost with better memory management. Although based on how things look, id_compress might perform better with a max window of 255 and a max match of 127. I'll give the original one a tweak based on my own findings and see if that turns out to be true.
May 22 2016
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 09:07:07 UTC, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 08:50:47 UTC, Walter Bright wrote:
 On 5/21/2016 11:26 PM, Era Scarecrow wrote:
 With 1 Million iterations:

 new_compress: TickDuration(311404604)
 id_compress   TickDuration(385806589)

  Approx 20% increase in speed (if i'm reading and did this 
 right).
It is promising. Need more!
Although based on how things look, the id_compress might perform better with a max window of 255 and max match of 127. I'll give the original one a tweak based on my own findings and see if it turns out true.
Oh noes! Making the tiny change to id_compress and it's faster!!! Now I can't take over the world!!!

new_compress:         TickDuration(306321939)  21% faster
modified id_compress: TickDuration(269275629)  31% faster
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 09:19:54 UTC, Era Scarecrow wrote:
 On 5/21/2016 11:26 PM, Era Scarecrow wrote:
 With 1 Million iterations:

         id_compress: TickDuration(385806589)  
 (original/baseline)
modified id_compress: TickDuration(269275629) 31% faster
And shrinking the lookback to a mere 127 makes it faster still, but compression starts taking a hit. Can't shrink it any more anyway.

modified id_compress2: TickDuration(230528074)  41% faster!!
May 22 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 22 May 2016 at 09:43:29 UTC, Era Scarecrow wrote:
 And shrinking the lookback to a mere 127 increases it faster 
 but compression starts taking a hit. Can't shrink it anymore 
 anyways.

 modified id_compress2: TickDuration(230528074)  41% faster!!
Nice results.
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 09:48:32 UTC, Stefan Koch wrote:
 Nice results.
This is with one sample to test against. I need a much bigger sample, but I'm not sure how to generate/find one to give it a full run for its money. Not to mention there's a huge memory leak while doing the tests, since I'm trying to keep malloc/free out of the timed code as much as possible. Still, with a 30%-40% boost it might be good enough, and it needs minimal code changes. Question for Walter (or Andrei): Would it be terrible to have lower characters as part of the identifier string? I'm referring to characters under 32 (control characters, especially \r, \n and \t).
May 22 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 22 May 2016 09:57:20 +0000,
Era Scarecrow <rtcvb32 yahoo.com> wrote:

 On Sunday, 22 May 2016 at 09:48:32 UTC, Stefan Koch wrote:
 Nice results.  
This is with one sample to test against. I need a much bigger sample, but I'm not sure how to generate/find it to give it a full run for it's money. Not to mention there's a huge memory leak while doing the tests since I'm trying not to have malloc/free in the code results as much as possible. Although with a 30%-40% boost it might be enough to be good enough, and has minimal code changes. Question Walter (or Andrei): Would it be terrible to have lower characters as part of the identifier string? I'm referring to characters under 32 (control characters, especially the \r, \n and \t).
That would be my question, too. What is the allowed alphabet? Control characters might break output of debuggers, and spaces would also look confusing. So I guess everything in the range [33..127] (95 symbols)? -- Marco
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 10:12:11 UTC, Marco Leise wrote:
 Question Walter (or Andrei): Would it be terrible to have 
 lower characters as part of the identifier string? I'm 
 referring to characters under 32 (control characters, 
 especially the \r, \n and \t).
That would be my question, too. What is the allowed alphabet? Control characters might break output of debuggers, and spaces would also look confusing. So I guess everything in the range [33..127] (95 symbols)?
Actually, if compressed text (which uses symbols 128-255) is okay, then that gives us closer to 223 symbols to work with, not 95. Curiously, the extra 95 symbols are actually just enough to keep compression up and bring the speed gain to about 37%, with what looks like no disadvantages. Then again I still need a larger set to test things on, but it's very promising!
May 22 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 3:32 AM, Era Scarecrow wrote:
  Curiously the extra 95 symbols actually is just enough to keep compression up
 and bring speed to about 37% speed gain, with it looks like no disadvantages.
 Then again I still need a larger set to test things on, but it's very
promising!
You're doing the gods' work, Era!
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 19:21:13 UTC, Walter Bright wrote:
 On 5/22/2016 3:32 AM, Era Scarecrow wrote:
 Curiously the extra 95 symbols actually is just enough to keep 
 compression up and bring speed to about 37% speed gain, with 
 it looks like no disadvantages.
You're doing the gods' work, Era!
Maybe... I have to wonder if, from the looks of things, LZ4 will take over instead. Have to see how it performs (and if we can integrate it). Meanwhile the main advantage of id_compress (as I'm tinkering with it) is that it makes no changes to the signature/behavior, so taking advantage of its boost in speed can be immediate (although it will no doubt have a bit weaker compression, and demangling needs to be rewritten for it). Honestly the majority of the speed gain with id_compress comes simply from having a much smaller window (1023 down to 220), which also gives a smaller footprint for the compressed output (3 bytes down to 2). By comparison, the speed gain from my new algorithm (about 20%) is pretty constant, and it retains a HUGE window (pre-scans 8k, has a 2k window) that it can find matches in, so it will win on compression. If LZ4 is half as good as the initial results suggest, I'd go with it instead.
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 19:44:08 UTC, Era Scarecrow wrote:
 ...
Well here's the rundown of some numbers. min_compress uses a tiny window, big_compress was my original algorithm but modified to use 16k total for a window. reduced_id_compress is the original except reduced to a 220 window and 2 byte constant output. Along with the compressed outputs of each.

min_compress:        [TickDuration(46410084)] 0.779836
big_compress:        [TickDuration(47998202)] 0.806545
orig_id_compress:    [TickDuration(59519257)] baseline
reduced_id_compress: [TickDuration(44033192)] 0.739894

1001 (original size)
72 testexpansion.s!(æεó▌æ╗int).så°Resulà≡Ñ╪¢╨╘ÿ¼É ↑─↑╜►╘fñv ÿ╜ ↑│↑Ä .foo()
73 testexpansion.s!(ÅæεÅó▌Åæ╗int).sÅå°ResulÅà≡ÅÑ╪Å¢╨Å╘ÿżÉ₧├ÿÄ╜É╝▓ÿëåâ.foo()
67 tes╤xpansion.s!(ÇææÇóóÇææint)∙⌡ResulÇàÅÇѺǢ»Ç╘τǼ∩ë├τü╜∩¢▓τ².foo()
78 testexpansion.s!(æ2óCæ2int).så(Resulà0ÑH¢P╘ê¼É╘íñÖ├ê┤ÿσ║ñ¬├ê¼É╘íñÖ├êÉ1å).foo()

min_compress:        [TickDuration(29210832)] 0.82391
big_compress:        [TickDuration(31058664)] 0.87601
orig_id_compress:    [TickDuration(35466130)] baseline
reduced_id_compress: [TickDuration(25032532)] 0.705977

629 (original size)
52 E.s!(à·è⌡àδint).så°Resulà≡ÖΣÅ▄╝╝ö┤ líd _Ög læ◄.foo()
61 E.s!(Åà·Åè⌡Åàδint).sÅå°ResulÅà≡ÅÖΣÅÅ▄Å╝╝Åö┤₧ç∞ÄÖΣ¡ó╠ïå╗.foo()
52 E.s!(ΣÇèèΣint)∙⌡ResulÇàÅÇÖ¢ÇÅúÇ╝├Çö╦ëçôüÖ¢Æó│².foo()
52 E.s!(à&è+à&int).så(Resulà0Ö<ÅD╝döl ┤í╝ ┴Ö╣ ┤æ9.foo()
May 22 2016
parent ZombineDev <petar.p.kirov gmail.com> writes:
On Monday, 23 May 2016 at 01:46:40 UTC, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 19:44:08 UTC, Era Scarecrow wrote:
 ...
 [...]
If you want, you can give the test case for issue 16039 a try. It produces 300-400MB binary, so it should be a nice test case :P
May 23 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 3:32 AM, Era Scarecrow wrote:
 [...]
My idea for speeding things up is every time a new character c is matched in pattern[], a bit is set in mask:

    ulong mask;
    ...
    mask |= 1UL << (c & 63);

Then, when scanning for longer matches, test:

    if (!((1UL << (id[i + matchlen - 1] & 63)) & mask))
        i += matchlen;

because we know that id[i + matchlen - 1] cannot match any of the characters in the current match. The two expressions look complex, but they should compile to single instructions on x64.
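As a standalone illustration of how that membership test lets a scan skip ahead (a hedged sketch; findWithSkip and its names are made up, and since there are only 64 buckets, unrelated characters can share a bit, which at worst costs a missed skip, never a missed match):

size_t[] findWithSkip(const(char)[] text, const(char)[] pat)
{
    size_t[] hits;
    if (pat.length == 0 || text.length < pat.length)
        return hits;

    ulong mask;
    foreach (char c; pat)
        mask |= 1UL << (c & 63);                 // one bit per character class in pat

    size_t i = 0;
    while (i + pat.length <= text.length)
    {
        char last = text[i + pat.length - 1];
        if (!((1UL << (last & 63)) & mask))
        {
            // `last` occurs nowhere in pat, and every start in [i, i + pat.length)
            // would have to cover it, so jump over the whole window.
            i += pat.length;
            continue;
        }
        if (text[i .. i + pat.length] == pat)
            hits ~= i;
        ++i;
    }
    return hits;
}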
May 22 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 2:57 AM, Era Scarecrow wrote:
  Question Walter (or Andrei): Would it be terrible to have lower characters as
 part of the identifier string? I'm referring to characters under 32 (control
 characters, especially the \r, \n and \t).
Currently there's a bug in id_compress() where it doesn't take into account UTF-8 sequences, as a subset of them are allowed as identifiers. I rue the day I allowed that in D, but we're kinda stuck with it. I kinda hate disallowing control characters, because then the compressor gets a bit specific.
May 22 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 2:07 AM, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 08:50:47 UTC, Walter Bright wrote:
 On 5/21/2016 11:26 PM, Era Scarecrow wrote:
 With 1 Million iterations:

 new_compress: TickDuration(311404604)
 id_compress   TickDuration(385806589)

  Approx 20% increase in speed (if i'm reading and did this right).
It is promising. Need more!
Well there's other good news. While I was working with about 2Mb for the data size in general, it can be reduced it looks like to about 8k-16k, and run entirely on the stack. It doesn't appear to gain any speed, or any speed gained is lost with better memory management. Although based on how things look, the id_compress might perform better with a max window of 255 and max match of 127. I'll give the original one a tweak based on my own findings and see if it turns out true.
There's also https://github.com/dlang/dmd/pull/5799 which should make things faster.
May 22 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Sunday, 22 May 2016 at 01:36:39 UTC, Walter Bright wrote:
 Just a note: large lookup tables can have cold cache 
 performance problems.
16k lookup tables are the gold standard.
May 22 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 5:50 AM, deadalnix wrote:
 On Sunday, 22 May 2016 at 01:36:39 UTC, Walter Bright wrote:
 Just a note: large lookup tables can have cold cache performance problems.
16k lookup table are the gold standard.
I like 64 bit vectors because the lookup and test can be done in one instruction, and the vector cached in a register!
May 22 2016
prev sibling parent reply Wyatt <wyatt.epp gmail.com> writes:
On Saturday, 21 May 2016 at 21:27:37 UTC, Era Scarecrow wrote:
 I assume this is related to compressing symbols thread? I 
 mentioned possibly considering the LZO library.
Maybe consider lz4 instead? Tends to be a bit faster, and it's BSD instead of GPL. https://cyan4973.github.io/lz4/ -Wyatt
May 23 2016
parent Wyatt <wyatt.epp gmail.com> writes:
On Monday, 23 May 2016 at 14:47:31 UTC, Wyatt wrote:
 Maybe consider lz4 instead?
Disregard that: I see it's come up already.
May 23 2016
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 21.05.2016 23:12, Walter Bright wrote:
 The current one is effective, but slow:

    https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the format is
 fair game, as well as down and dirty assembler if that's what it takes.

 So really, how good are you at fast code?
Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
May 21 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 21.05.2016 23:37, Timon Gehr wrote:
 On 21.05.2016 23:12, Walter Bright wrote:
 The current one is effective, but slow:

    https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the format is
 fair game, as well as down and dirty assembler if that's what it takes.

 So really, how good are you at fast code?
Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
(Those are worst case bounds, not sure what's better on average on D symbol strings.)
May 21 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 2:37 PM, Timon Gehr wrote:
 Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
I don't understand the terms you use, but as to the "why" it is based on what I knew about LZ77 compression. I don't pretend to be an expert on compression, and this is what I churned out. As mentioned elsewhere, the C++ mangling scheme has a primitive and ineffective LZ77 scheme built in, so I wouldn't waste time on that. A quick google search on finding the longest match in a string did not turn up anything obvious. If you want to throw your hat (i.e. expertise) into the ring and post a faster compressor, please do so!
May 21 2016
next sibling parent reply Guillaume Boucher <guillaume.boucher.d gmail.com> writes:
On Saturday, 21 May 2016 at 22:07:27 UTC, Walter Bright wrote:
 As mentioned elsewhere, the C++ mangling scheme has a primitive 
 and ineffective LZ77 scheme built in, so I wouldn't waste time 
 on that.
Sorry if I didn't memorize everything in this forum from the last 20 years; can you give a link to some reasoning? The only thing I found was some statements in https://github.com/dlang/dmd/pull/5793:
 Note that the VC++ and g++ name mangling schemes for C++ each 
 use a similar, but primitive (and fairly ineffective) form of 
 compression. It's like people who invent their own crypto 
 algorithms.
The name mangling in the Itanium C++ ABI is similar to LZ77 but heavily optimized for the typical use case in symbol names. I don't really see how the general LZ77 would be better. It's almost as if using information about the actual data leads to a better compression scheme.
 This is what C++ does, and results are pretty poor compression. 
 Scanning to determine the backward references would consume the 
 same order of time, anyway.
It uses a constant amount of time if implemented correctly, which is much faster.
May 21 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 21 May 2016 at 22:41:49 UTC, Guillaume Boucher wrote:
  It's almost as if using information about the actual data 
 leads to a better compression scheme.
Exactly. It always is. Since you know where to look for redundancy instead of blindly comparing bytes.
May 21 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 3:41 PM, Guillaume Boucher wrote:
 Sorry if I didn't memorize everything in this forum from the last 20 years, can
 you give a link to some reasoning?
DMC++ matches the Microsoft name mangling scheme, which includes such compression. It proved hopelessly inadequate, which is why I implemented compress.c in the first place (it was for the DMC++ compiler). (And frankly, I don't see how an ad-hoc scheme like that could hope to compare with a real compression algorithm.)
May 21 2016
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 22.05.2016 00:58, Walter Bright wrote:
 On 5/21/2016 3:41 PM, Guillaume Boucher wrote:
 Sorry if I didn't memorize everything in this forum from the last 20
 years, can
 you give a link to some reasoning?
DMC++ matches the Microsoft name mangling scheme, which includes such compression. It proved hopelessly inadequate, which is why I implemented compress.c in the first place (it was for the DMC++ compiler). (And frankly, I don't see how an ad-hoc scheme like that could hope to compare with a real compression algorithm.)
You are right about the symbols using the VC mangling. The test case "1.s.s.s.s.s" in the bugreport translated to C++ yields

?foo Result ?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s H testexpansion YA H Z testexpansion YA U1?1???$s H 2 YA H Z Z testexpansion YA U1?1???$s UResult ?1???$s H testexpansion YA H Z 2 YA U1?1???$s H 2 YA H Z Z Z testexpansion YA U1?1???$s UResult ?1???$s UResult ?1???$s H testexpansion YA H Z testexpansion YA U1?1???$s H 2 YA H Z Z 2 YA U1?1???$s UResult ?1???$s H testexpansion YA H Z 2 YA U1?1???$s H 2 YA H Z Z Z Z testexpansion YA U1?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s H testexpansion YA H Z testexpansion YA U1?1???$s H 2 YA H Z Z testexpansion YA U1?1???$s UResult ?1???$s H testexpansion YA H Z 2 YA U1?1???$s H 2 YA H Z Z Z 2 YA U1?1???$s UResult ?1???$s UResult ?1???$s H testexpansion YA H Z testexpansion YA U1?1???$s H 2 YA H Z Z 2 YA U1?1???$s UResult ?1???$s H testexpansion YA H Z 2 YA U1?1???$s H 2 YA H Z Z Z Z Z QAEXXZ

i.e. 936 characters. I think this is due to the very bad limitation of just back referencing the first 10 types.

The gcc mangling allows arbitrary back references and is closer to what some have proposed. It yields

_ZZN13testexpansion1sIZNS0_IZNS0_IZNS0_IZNS0_IiEEDaT_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_EN6Result3fooEv

which is only 39 characters longer than the compressed version of the D symbol. It uses a much smaller character set, though.

This is my translation of the test case to C++14:

namespace testexpansion
{
    template<class T> struct S
    {
        void foo() {}
    };

    template<class T> auto s(T t)
    {
#ifdef bad
        struct Result
        {
            void foo() {}
        };
        return Result();
#else
        return S<T>();
#endif
    }

    void xmain()
    {
        auto x = s(s(s(s(s(1)))));
        x.foo();
    }
}
May 22 2016
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 17:44:22 UTC, Rainer Schuetze wrote:
 You are right about the symbols using the VC mangling. The test 
 case "1.s.s.s.s.s" in the bugreport translated to C++ yields

 ?foo Result ?1???$s UResult ?1???$s UResult ?1???$s <snip>

 i.e. 936 characters. I think this is due to the very bad 
 limitation of just back referencing the first 10 types.
Unless ? is a proper/valid character there, those are far more likely to be failed UTF-8 decoding. Care to double check?
May 22 2016
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 22.05.2016 19:56, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 17:44:22 UTC, Rainer Schuetze wrote:
 You are right about the symbols using the VC mangling. The test case
 "1.s.s.s.s.s" in the bugreport translated to C++ yields

 ?foo Result ?1???$s UResult ?1???$s UResult ?1???$s <snip>

 i.e. 936 characters. I think this is due to the very bad limitation of
 just back referencing the first 10 types.
Unless ? is the proper/valid character those are far more likely to be failed UTF-8 decoding. Care to double check?
? is an allowed character in VC++ and, for example, permits unambiguous demangling. The gcc style using only alphanumeric characters and '_' can also just be plain C symbols.
May 22 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 18:04:25 UTC, Rainer Schuetze wrote:
 On 22.05.2016 19:56, Era Scarecrow wrote:
  Unless ? is the proper/valid character those are far more 
 likely to be failed UTF-8 decoding. Care to double check?
? is an allowed character in VC++ and for example permits unambiguous demangling. The gcc style using only alpha numeric characters and '_' can also just be plain C symbols.
Not quite what I asked. Is the sample provided accurate? Is having 3 ?s in a row what should actually be there, and not just a ? being emitted because the tool doesn't know how to decode something?
May 22 2016
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 22.05.2016 20:12, Era Scarecrow wrote:
 On Sunday, 22 May 2016 at 18:04:25 UTC, Rainer Schuetze wrote:
 On 22.05.2016 19:56, Era Scarecrow wrote:
  Unless ? is the proper/valid character those are far more likely to
 be failed UTF-8 decoding. Care to double check?
? is an allowed character in VC++ and for example permits unambiguous demangling. The gcc style using only alpha numeric characters and '_' can also just be plain C symbols.
Not quite what I asked. Is the sample provided accurate? Is having 3 ?s in a row what should be there not just spitting a ? because it doesn't know how to decode it?
The 3 consecutive ?s are also in the object file, so they are actually part of the mangled symbol. VC++ uses only ASCII characters for mangling (which D should do, too, IMO).
May 22 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 22 May 2016 20:22:06 +0200,
Rainer Schuetze <r.sagitario gmx.de> wrote:

 The 3 consecutive ?s are also in the object file, so they are actually 
 part of the mangled symbol. VC++ uses only ASCII characters for mangling 
 (which D should do, too, IMO).
In general I agree. The extended characters may be swallowed and the representation not copyable. In the worst case we may uncover bugs in debugging software. On the other hand, base-64 adds ~17% to the symbol length compared to base-128 and the linker in particular should only be interested in a terminating \0. -- Marco
May 22 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 10:44 AM, Rainer Schuetze wrote:
 You are right about the symbols using the VC mangling. The test case
 "1.s.s.s.s.s" in the bugreport translated to C++ yields

 ?foo Result ?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s H testexpansion  YA H Z  testexpansion  YA U1?1???$s H 2 YA H Z  Z  testexpansion  YA U1?1???$s UResult ?1???$s H testexpansion  YA H Z  2 YA U1?1???$s H 2 YA H Z  Z  Z  testexpansion  YA U1?1???$s UResult ?1???$s UResult ?1???$s H testexpansion  YA H Z  testexpansion  YA U1?1???$s H 2 YA H Z  Z  2 YA U1?1???$s UResult ?1???$s H testexpansion  YA H Z  2 YA U1?1???$s H 2 YA H Z  Z  Z  Z  testexpansion  YA U1?1???$s UResult ?1???$s UResult ?1???$s UResult ?1???$s H testexpansion  YA H Z  testexpansion  YA U1?1???$s H 2 YA H Z  Z  testexpansion  YA U1?1???$s UResult ?1???$s H testexpansion  YA H Z  2 YA U1?1???$s H 2 YA H Z  Z  Z  2 YA U1?1???$s UResult ?1???$s UResult ?1???$s H testexpansion  YA H Z  testexpansion  YA U1?1???$s H 2 YA H Z  Z  2 YA U1?1???$s UResult ?1???$s H testexpansion  YA H Z  2 YA U1?1???$s H 2 YA H Z  Z  Z  Z  Z QAEXXZ
Haha, thus showing why people should never invent their own ad-hoc crypto nor compression algorithms :-)
 _ZZN13testexpansion1sIZNS0_IZNS0_IZNS0_IZNS0_IiEEDaT_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_EN6Result3fooEv
Note "Result" appearing 5 times. It's still crappy compression. Ironically, this design predates the much worse VC++ one.
 which is only 39 characters longer than the compressed version of the D symbol.
 It uses a much smaller character set, though.
The much smaller character set can be done, if desired, by using base64 encoding. The compressor is careful to not emit 0 bytes, as those definitely would screw up bintools, but the rest seem unperturbed by binary strings.
May 22 2016
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 22.05.2016 00:07, Walter Bright wrote:
 On 5/21/2016 2:37 PM, Timon Gehr wrote:
 Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
I don't understand the terms you use, but as to the "why" it is based on what I knew about LZ77 compression. I don't pretend to be an expert on compression, and this is what I churned out. As mentioned elsewhere, the C++ mangling scheme has a primitive and ineffective LZ77 scheme built in, so I wouldn't waste time on that. A quick google search on finding the longest match in a string did not turn up anything obvious. ...
E.g. the Knuth-Morris-Pratt (KMP) string search algorithm can be modified to compute longest match instead of full match (as it just efficiently builds and runs a finite state machine whose state is the length of the longest match ending at the current position).

https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Implementation:

auto longestMatch(string x, string y)
{
    if (y.length > x.length) y = y[0 .. x.length];
    // KMP failure function for y
    auto f = new size_t[y.length + 1];
    for (size_t i = 0, j = f[0] = -1; i < y.length; f[++i] = ++j)
        while (~j && y[i] != y[j]) j = f[j];
    // run the automaton over x, remembering the longest state reached
    auto r = x[0 .. 0];
    for (size_t i = 0, j = 0; i < x.length && j < y.length; ++i, ++j)
    {
        while (~j && x[i] != y[j]) j = f[j];
        if (j + 1 > r.length) r = x[i - j .. i + 1];
    }
    return r;
}

This returns a slice of x representing the leftmost longest match with y. Running time is O(x.length). (In practice, if this should be really fast, the allocation for 'f' should probably be shared among multiple calls.) (This probably only improves running time in case there are sufficiently many sufficiently long matches.)

But this just improves longest_match. It should be possible to bring down the running time of the entire compression algorithm significantly using a suffix tree (the resulting running time bound is linear in the input size and independent of the window size).
 If you want to throw your hat (i.e. expertise) into the ring and post a
 faster compressor, please do so!
As far as I understand, the use case for this is compressing mangled names? I don't think that generating huge mangled names explicitly just to compress them afterwards is a viable strategy. This process throws away a lot of structure information during mangling that could be useful for fast and effective compression. Instead, compression should be performed while generating the mangled string. As far as I understand, mangled names only grow unmanageably large because the same symbols are mangled into them multiple times? Isn't it enough to create references to previously embedded mangled symbols (i-th symbol already mangled) while generating the mangled string?
May 23 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/23/2016 12:32 PM, Timon Gehr wrote:
 Instead, compression should be performed while generating the mangled string.
The trouble with that is the mangled string is formed from component pieces. Those component pieces may have common substrings with each other, which won't be detected until they are merged, when you're back to dealing with the string as a whole.
 Isn't it enough to create
 references to previously embedded mangled symbols (i-th symbol already mangled)
 while generating the mangled string?
That's been covered upthread, and VC++/g++ do that. But as shown, it is inadequate. There's a lot more redundancy than just the identifiers.
May 23 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 23.05.2016 22:34, Walter Bright wrote:
 On 5/23/2016 12:32 PM, Timon Gehr wrote:
 Instead, compression should be performed while generating the mangled
 string.
The trouble with that is the mangled string is formed from component pieces.
Then don't do that. I.e. re-mangle recursively from scratch for each mangled name and allow sharing parts between unrelated components within that mangled name. (That's similar to how the compiler stores the structure in memory, which also avoids the exponential blowup.)
 Those component pieces may have common substrings with each
 other, which won't be detected until they are merged,
 when you're back to dealing with the string as a whole.


 Isn't it enough to create
 references to previously embedded mangled symbols (i-th symbol already
 mangled)
 while generating the mangled string?
That's been covered upthread, and VC++/g++ do that. But as shown, it is inadequate. There's a lot more redundancy than just the identifiers.
Yes, but I think that should be covered (up to constants) by what I suggested above? You can't have exponential growth before compression. Any scheme that does not prevent running time exponential in template instantiation depth is also inadequate in the long run.
May 23 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/23/2016 2:17 PM, Timon Gehr wrote:
 Then don't do that. I.e. re-mangle recursively from scratch for each mangled
 name and allow sharing parts between unrelated components within that mangled
 name.
How is that essentially different from running a compression pass? The only real problem with the compressor was the compression speed, and the LZ4 test shows this is solvable.
 (That's similar to how the compiler stores the structure in memory, which
 also avoids the exponential blowup.)
No, it doesn't.
 You can't have exponential growth before compression.
The compiler is designed so that the internal names are pretty much write-only, uniquely hashed by their address. It's not spending time running strlen()'s on them.
 Any scheme that does not prevent running time exponential in template
instantiation depth is also
 inadequate in the long run.
Most algorithms seem to have pathological worst cases, but as long as they are rare, it doesn't impede their utility.
May 23 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
Just a heads up on the LZ4.
I have spent roughly 3 hours optimizing my decompressor.
And while I had stunning success, a speed-up of about 400%,
I am still about 600x slower than the C variant.

It is still a mystery to me why that is :)
since the generated code is both smaller and works almost
without spills.
May 23 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/23/2016 4:49 PM, Stefan Koch wrote:
 Just a heads up on the LZ4.
 I have spent roughly 3 hours optimizing my decompresser.
 And while I had stunning success, a speed-up of about 400%.
 I am still about 600x slower then the C variant.

 It is still a mystery to me why that is :)
 Since the generated code both smaller and works almost without spills.
I'm not sure why you're working on a decompressor. The speed needs to be in the compressor.
May 23 2016
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 24.05.2016 01:02, Walter Bright wrote:
 On 5/23/2016 2:17 PM, Timon Gehr wrote:
 Then don't do that. I.e. re-mangle recursively from scratch for each
 mangled
 name and allow sharing parts between unrelated components within that
 mangled
 name.
How is that essentially different from running a compression pass?
Speed. The title of this thread is "Need a faster compressor".
 The
 only real problem with the compressor was the compression speed, and the
 LZ4 test shows this is solvable.
 ...
No. Just increase the recursion depth by a small number of levels to completely kill the speedups gained by using a faster compression algorithm.
  > (That's similar to how the compiler stores the structure in memory,
 which
 also avoids the exponential blowup.)
No, it doesn't. ...
Yes, it does. The compiler does not use exponential space to store the AST.
 You can't have exponential growth before compression.
The compiler is designed so that the internal names are pretty much write-only, uniquely hashed by their address. It's not spending time running strlen()'s on them. ...
It's _exponential_ growth. We don't even want to spend the time and memory required to generate the strings.
 Any scheme that does not prevent running time exponential in template
 instantiation depth is also
 inadequate in the long run.
Most algorithms seem to have pathological worst cases, but as long as they are rare, it doesn't impede their utility.
The reason we have this discussion is that the worst case isn't rare enough to make this argument. Why compress in the first place if mangled names don't grow large in practice?
May 24 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 May 2016 at 16:22:56 UTC, Timon Gehr wrote:
 On 24.05.2016 01:02, Walter Bright wrote:
 [...]
Speed. The title of this thread is "Need a faster compressor".
 [...]
No. Just increase the recursion depth by a small number of levels to completely kill the speedups gained by using a faster compression algorithm.
 [...]
Yes, it does. The compiler does not use exponential space to store the AST.
 [...]
It's _exponential_ growth. We don't even want to spend the time and memory required to generate the strings.
[...]
The reason we have this discussion is that the worst case isn't rare enough to make this argument. Why compress in the first place if mangled names don't grow large in practice?
I completely agree! However, such a thing will only work if dmd is used as a pre-linker and if we can stabilize the hash, i.e. every run of the compiler generates the same hash.
May 24 2016
parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 24 May 2016 at 18:30:36 UTC, Stefan Koch wrote:
 On Tuesday, 24 May 2016 at 16:22:56 UTC, Timon Gehr wrote:
 On 24.05.2016 01:02, Walter Bright wrote:
 [...]
Speed. The title of this thread is "Need a faster compressor".
 [...]
No. Just increase the recursion depth by a small number of levels to completely kill the speedups gained by using a faster compression algorithm.
 [...]
Yes, it does. The compiler does not use exponential space to store the AST.
 [...]
It's _exponential_ growth. We don't even want to spend the time and memory required to generate the strings.
[...]
The reason we have this discussion is that the worst case isn't rare enough to make this argument. Why compress in the first place if mangled names don't grow large in practice?
I completely agree! However such a thing will only work if dmd is used as a pre-linker and If we can stablize the hash. I.E. every run of the compiler generates the same hash.
KAMOULOX!
May 24 2016
prev sibling next sibling parent reply Anon <anon anon.anon> writes:
On Tuesday, 24 May 2016 at 16:22:56 UTC, Timon Gehr wrote:
 It's _exponential_ growth. We don't even want to spend the time 
 and memory required to generate the strings.

 The reason we have this discussion is that the worst case isn't 
 rare enough to make this argument. Why compress in the first 
 place if mangled names don't grow large in practice?
Since Walter is evidently still having a hard time understanding this, I've done a few more pathological cases, comparing LZ4 to Back Referencing Named Types. For BRNT I manually replaced struct names in the mangling with N<number> numeric identifiers for all but one of their appearances, to simulate what could actually be done by the compiler. No other mangling changes (e.g., for eponymous templates or hidden types) were applied.

auto foo(T)(T x)
{
    static struct Vold { T payload; }
    return Vold(x);
}

At a chain of length 10: foo(5).foo.foo.foo.foo.foo.foo.foo.foo.foo

current: Generate a 57_581 byte symbol
lz4 -9:  Generate a 57_581 byte symbol, then compress it to 412 bytes
lzma -9: Generate a 57_581 byte symbol, then compress it to 239 bytes
BRNT:    Generate a 332 byte symbol

Upping it to twelve levels: foo(5).foo.foo.foo.foo.foo.foo.foo.foo.foo.foo.foo

current: Generate a 230_489 byte symbol
lz4 -9:  Generate a 230_489 byte symbol, then compress it to 1128 bytes
lzma -9: Generate a 230_489 byte symbol, then compress it to 294 bytes
BRNT:    Generate a 393 byte symbol

Upping it to fourteen levels: I'm too lazy to do more manual BRNT, so beyond this point its numbers are estimated based on the previously established fact that BRNT symbols grow linearly.

current: Generate a 922_127 byte symbol
lz4 -9:  Generate a 922_127 byte symbol, then compress it to 4012 bytes
lzma -9: Generate a 922_127 byte symbol, then compress it to 422 bytes
BRNT:    Generate a ~457 byte symbol

Upping it to sixteen levels:

current: Generate a 3_688_679 byte symbol
lz4 -9:  Generate a 3_688_679 byte symbol, then compress it to 15535 bytes
lzma -9: Generate a 3_688_679 byte symbol, then compress it to 840 bytes
BRNT:    Generate a ~527 byte symbol

I want to let that sink in: in the general case, BRNT beats even **LZMA**.

As if winning the compactness war while still being a symbol linkers won't have problems with wasn't enough, this approach also avoids generating giant symbols in the first place. Compression and/or hashing cannot do that. If D really wants to, it can compress symbols, but it *needs* to fix this problem properly *first*.
May 24 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 12:16 PM, Anon wrote:
 As if winning the compactness war while still being a symbol linkers won't have
 problems with wasn't enough, this approach also avoids generating giant symbols
 in the first place. Compression and/or hashing cannot do that. If D really
wants
 to, it can compress symbols, but it *needs* to fix this problem properly
*first*.
Because identifiers are generated from substrings that don't know about each other, to do your algorithm a scan of the combined string will still be necessary. In any case, you should be able to implement your algorithm as a replacement for id_compress(). How about giving it a go?
May 24 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 24 May 2016 19:16:07 +0000,
Anon <anon anon.anon> wrote:

 Upping it to sixteen levels:
 
 current: Generate a 3_688_679 byte symbol
 lz4 -9: Generate a 3_688_679 byte symbol, then compress it to 
 15535 bytes
 lzma -9: Generate a 3_688_679 byte symbol, then compress it to 
 840 bytes
 BRNT: Generate a ~527 byte symbol
 
 
 
 I want to let that sink in: in the general case, BRNT beats even 
 **LZMA**.
Let's go this route. It seems to be a scalable solution without any big memory operations. -- Marco
May 25 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 9:22 AM, Timon Gehr wrote:
 Yes, it does. The compiler does not use exponential space to store the AST.
BTW, all types in dmd have a 'deco' string which is the AST linearized as a string. This string is also used to build the mangled names. All the deco's are put into a hashtable, and then all types can be uniquely identified by their address in the hashtable. This reduces type comparisons to a single pointer comparison.
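In sketch form, the interning trick looks roughly like this (illustrative only, not dmd's actual code):

string[string] decoTable;   // each distinct deco string stored exactly once

string intern(string deco)
{
    if (auto p = deco in decoTable)
        return *p;                    // reuse the canonical copy
    decoTable[deco] = deco;
    return deco;
}

bool sameType(string a, string b)
{
    // both came from intern(), so equal types share the same storage
    return a.ptr is b.ptr;
}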
May 24 2016
parent reply Guillaume Boucher <guillaume.boucher.d gmail.com> writes:
On Tuesday, 24 May 2016 at 20:16:32 UTC, Walter Bright wrote:
 On 5/24/2016 9:22 AM, Timon Gehr wrote:
 Yes, it does. The compiler does not use exponential space to 
 store the AST.
BTW, all types in dmd have a 'deco' string which is the AST linearized as a string. This string is also used to build the mangled names. All the deco's are put into a hashtable, and then all types can be uniquely identified by their address in the hashtable. This reduces type comparisons to a single pointer comparison.
There's no reason not to use the compression in the deco name. Just make sure the references are relative and you're set. No exponential space to store the AST. Please, if you change the ABI, do it right the first time. Some arbitrary compression algorithm doesn't solve the problem.
May 24 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 1:27 PM, Guillaume Boucher wrote:
 There's no reason not to use the compression in the deco name.  Just make sure
 the references are relative and you're set.
Not totally, since you'd like to refer to types that are common across deco names.
May 24 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 May 2016 at 20:39:41 UTC, Walter Bright wrote:
 On 5/24/2016 1:27 PM, Guillaume Boucher wrote:
 There's no reason not to use the compression in the deco name.
  Just make sure
 the references are relative and you're set.
Not totally, since you'd like to refer to types that are common across deco names.
A serious question: do these templates even need a mangle? Is anyone ever going to call them from a different module? I don't think you could do that because those are voldemort types. Those symbols are only temporary, aren't they?
May 24 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 2:06 PM, Stefan Koch wrote:
 On Tuesday, 24 May 2016 at 20:39:41 UTC, Walter Bright wrote:
 On 5/24/2016 1:27 PM, Guillaume Boucher wrote:
 There's no reason not to use the compression in the deco name.
  Just make sure
 the references are relative and you're set.
Not totally, since you'd like to refer to types that are common across deco names.
On Tuesday, 24 May 2016 at 20:39:41 UTC, Walter Bright wrote: A serious question, do these templates even need a mangle ? Is anyone ever going to call them from a different module ? I don't think you could do that because those are voldemort types. Those symbols are only temporary aren't they ?
The types do exist in the compiler, and are used in template name generation, generating types, etc.
May 24 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 May 2016 at 21:26:58 UTC, Walter Bright wrote:
 On 5/24/2016 2:06 PM, Stefan Koch wrote:
 A serious question, do these templates even need a mangle ?
 Is anyone ever going to call them from a different module ?
 I don't think you could do that because those are voldemort 
 types.

 Those symbols are only temporary aren't they ?
The types do exist in the compiler, and are used in template name generation, generating types, etc.
But would one ever have to be able to reconstruct the mangle? Templates cannot be exported, nor can instantiations. Or can they?
May 24 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 2:34 PM, Stefan Koch wrote:
 But would one ever have to be able to reconstruct the mangle ?
Symbolic debugging, and dealing with various linker errors.
 templates cannot be exported nor can instanciations.
 Or can they ?
Of course they can!
May 24 2016
prev sibling parent Guillaume Boucher <guillaume.boucher.d gmail.com> writes:
On Monday, 23 May 2016 at 20:34:19 UTC, Walter Bright wrote:
 On 5/23/2016 12:32 PM, Timon Gehr wrote:
 Isn't it enough to create
 references to previously embedded mangled symbols (i-th symbol 
 already mangled)
 while generating the mangled string?
That's been covered upthread, and VC++/g++ do that. But as shown, it is inadequate. There's a lot more redundancy than just the identifiers.
Yes, not all redundancy is covered. However, I think all non-pathologic blow-ups are avoided.

Let's look at the properties of the Itanium C++ ABI mangling:

1. Guaranteed linear size name for recursive types that have exponential size names.
2. Compression speed can be linear in the size of the compressed name (i.e. no need to create an exponential garbage name just to compress it).
3. Decompression is easy, possible without tools, and its speed is linear in the size of the compressed name.
4. The name retains some redundancy (in itself and when compared to other mangled names), well suited for a follow-up compression.

Compare this to applying some arbitrary compression algorithm to a name of exponential size:

1. Exponential size names remain exponential size. Yes, up until the window size, some compression algorithms guarantee that, but they cannot guarantee it in the general case.
2. Compression needs the whole name first.
3. Decompression is only possible with tools.
4. Multiple mangled names cannot be compressed well.
May 24 2016
prev sibling next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

   
 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the 
 format is fair game, as well as down and dirty assembler if 
 that's what it takes.

 So really, how good are you at fast code?
Is LZ4 acceptable? If so, the current C implementation is hard to beat. And maybe it's best to just copy it in.
May 21 2016
parent Lionello Lunesu <lionello lunesu.remove.com> writes:
On 22/5/2016 05:51, Stefan Koch wrote:
 On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the format is
 fair game, as well as down and dirty assembler if that's what it takes.

 So really, how good are you at fast code?
Is LZ4 acceptable ? If so the current C implementation is hard to beat. And maybe it's best to just copy it in.
Agree. I've had the best results with LZ4. We can either link to the C code, or port it to D. It's very trivial. L.
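For reference, linking against the C lz4 API from D is about this much code (a rough sketch, assuming liblz4 is available to link against; error handling omitted):

extern (C) nothrow @nogc
{
    int LZ4_compressBound(int inputSize);
    int LZ4_compress_default(const(char)* src, char* dst, int srcSize, int dstCapacity);
}

char[] lz4Compress(const(char)[] id)
{
    auto dst = new char[LZ4_compressBound(cast(int) id.length)];
    const n = LZ4_compress_default(id.ptr, dst.ptr,
                                   cast(int) id.length, cast(int) dst.length);
    assert(n > 0);
    return dst[0 .. n];
}

The output is raw binary, so for a mangled name it would still need the kind of 7-bit re-encoding mentioned elsewhere in the thread to avoid embedded 0 bytes.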
May 21 2016
prev sibling next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

   
 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the 
 format is fair game, as well as down and dirty assembler if 
 that's what it takes.

 So really, how good are you at fast code?
Could you post a link to the code you use for benchmarking?
May 21 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/21/2016 3:13 PM, Stefan Koch wrote:
 Could you post a link to code you use for benchmarking ?
In the comments: https://github.com/dlang/dmd/pull/5793
May 21 2016
prev sibling next sibling parent reply qznc <qznc web.de> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

   
 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the 
 format is fair game, as well as down and dirty assembler if 
 that's what it takes.

 So really, how good are you at fast code?
A quick search led to this SO thread:

https://stackoverflow.com/questions/1138345/an-efficient-compression-algorithm-for-short-text-strings

It links to SMAZ:

https://github.com/antirez/smaz

It uses a codebook primed for English text. Since this is for name mangling, priming for Phobos might be a good idea. E.g. put "std.", "core.", "etc." into the codebook.
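A toy sketch of what such a primed codebook pass could look like (a hypothetical scheme, not SMAZ itself; the reserved byte values and word list are made up):

immutable string[] codebook = ["std.", "core.", "etc.", "unittest"];

char[] primeEncode(const(char)[] id)
{
    char[] result;
    outer: for (size_t i = 0; i < id.length; )
    {
        foreach (k, word; codebook)
        {
            if (id.length - i >= word.length && id[i .. i + word.length] == word)
            {
                result ~= cast(char)(0xF0 + k);   // one reserved byte per codebook entry
                i += word.length;
                continue outer;
            }
        }
        result ~= id[i++];                         // no codebook hit, copy the byte
    }
    return result;
}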
May 22 2016
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 10:40:46 UTC, qznc wrote:
 On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 It links to SMAZ:

 https://github.com/antirez/smaz

 It uses a codebook primed for english text. Since this is for 
 name mangling, priming for Phobos might be a good idea. E.g. 
 put "std.", "core.", "etc." into the codebook.
Curiously, premaking a few primer dictionaries like that for languages can give similar performance with zlib and gzip. I recall doing that back in 2009 when I had XML pages to deal with: it would compress against all my pages, find the best dictionary, and use that. Then decompression was simply a matter of finding the dictionary with the matching CRC32 and away it went! Priming something similar for the current id_compress wouldn't hurt early on either. 'unittest' I see quite a bit; entries for common templates would probably help too, for minor compression benefits.
May 22 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 22 May 2016 10:55:40 +0000,
Era Scarecrow <rtcvb32 yahoo.com> wrote:

 On Sunday, 22 May 2016 at 10:40:46 UTC, qznc wrote:
 On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 It links to SMAZ:

 https://github.com/antirez/smaz

 It uses a codebook primed for english text. Since this is for 
 name mangling, priming for Phobos might be a good idea. E.g. 
 put "std.", "core.", "etc." into the codebook.  
Curiously premaking a few primers as such for languages can give similar performance with zlib and gzip.
LZ4 can seemingly use a dictionary, too. -- Marco
May 22 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/22/16 6:40 AM, qznc wrote:
 On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the format is
 fair game, as well as down and dirty assembler if that's what it takes.

 So really, how good are you at fast code?
A quick search led to this SO thread: https://stackoverflow.com/questions/1138345/an-efficient-compression-algorithm-for-short-text-strings It links to SMAZ: https://github.com/antirez/smaz It uses a codebook primed for english text. Since this is for name mangling, priming for Phobos might be a good idea. E.g. put "std.", "core.", "etc." into the codebook.
A primed dictionary is a great idea. -- Andrei
May 22 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sat, 21 May 2016 14:12:15 -0700,
Walter Bright <newshound2 digitalmars.com> wrote:

 The current one is effective, but slow:
 
    https://github.com/dlang/dmd/blob/master/src/backend/compress.c
 
 Anyone want to make a stab at making it faster? Changing the format is fair 
 game, as well as down and dirty assembler if that's what it takes.
 
 So really, how good are you at fast code?
I can write something that compresses 2 times more compactly and 16 times faster (in microbenchmarks) for the 10-times-nested "testexpansion" test case from https://issues.dlang.org/show_bug.cgi?id=15831#c0, using BSD-licensed lz4 compression. Does it have to be C++ code or can it be D code? -- Marco
May 22 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:

⬅ Download proof of concept

This is a proof of concept micro-benchmark compressing the
37_640 bytes symbol from the bug report linked above with both
id_compress from the dmd backend and lz4 (without dictionary).
The lz4 result is additionally transformed to 7-bits similar
to base-64.

Original size: 37640 bytes

id_compress : 5243 ms for 10_000 iterations, 800 bytes
lz4         :   71 ms for 10_000 iterations, 389 bytes

That roughly corresponds to a 2x higher compression ratio at
70x faster compression speed.
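(For illustration only -- not the actual proof-of-concept code -- the 7-bit/base-64-style step can be as simple as packing 6 bits per output character into an identifier-friendly alphabet. The alphabet below is an assumption; '$' in particular may need replacing depending on what the object format accepts.)

immutable char[64] idAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_$";

// Pack arbitrary compressed bytes into characters that survive in a symbol
// name. The demangler needs the inverse mapping (and the byte count).
char[] toIdentifierChars(const(ubyte)[] data)
{
    char[] result;
    uint acc;   // bit accumulator
    int bits;   // number of valid bits currently in acc
    foreach (b; data)
    {
        acc = (acc << 8) | b;
        bits += 8;
        while (bits >= 6)
        {
            bits -= 6;
            result ~= idAlphabet[(acc >> bits) & 0x3F];
        }
    }
    if (bits > 0)   // flush the remaining bits, left-aligned into one character
        result ~= idAlphabet[(acc << (6 - bits)) & 0x3F];
    return result;
}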

--
Marco
May 22 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 22 May 2016 at 14:05:23 UTC, Marco Leise wrote:
 ⬅ Download proof of concept

 This is a proof of concept micro-benchmark compressing the 
 37_640 bytes symbol from the bug report linked above with both 
 id_compress from the dmd backend and lz4 (without dictionary). 
 The lz4 result is additionally transformed to 7-bits similar to 
 base-64.

 Original size: 37640 bytes

 id_compress : 5243 ms for 10_000 iterations, 800 bytes
 lz4         :   71 ms for 10_000 iterations, 389 bytes

 That roughly corresponds to a 2x higher compression ratio at 
 70x faster compression speed.
Please submit a PR.
May 22 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 22 May 2016 14:06:52 +0000
schrieb Stefan Koch <uplink.coder googlemail.com>:

 On Sunday, 22 May 2016 at 14:05:23 UTC, Marco Leise wrote:
 ⬅ Download proof of concept

 This is a proof of concept micro-benchmark compressing the
 37_640 bytes symbol from the bug report linked above with both
 id_compress from the dmd backend and lz4 (without dictionary).
 The lz4 result is additionally transformed to 7-bits similar to
 base-64.

 Original size: 37640 bytes

 id_compress : 5243 ms for 10_000 iterations, 800 bytes
 lz4         :   71 ms for 10_000 iterations, 389 bytes

 That roughly corresponds to a 2x higher compression ratio at
 70x faster compression speed.
 Please submit a PR.
There are a few open questions, for me at least. Are there other proposals with good properties? Are there any objections against the benchmarking method or the license? Can the implementation be in D? If not, should we link against the system liblz4.so/a or copy the C code into the backend? Is a pre-filled dictionary worthwhile? If so, which words should go in it? The latter is important because changes to the Dlang mangling need to be reflected in other projects outside our direct control, too. -- Marco
May 22 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 22 May 2016 at 16:06:07 UTC, Marco Leise wrote:

 There are a few open questions, for me at least. Are there 
 other proposals with good properties? Are there any objections 
 against the benchmarking method or the license? Can the 
 implementation be in D? If not, should we link against the 
 system liblz4.so/a or copy the C code into the backend? Is a 
 pre-filled dictionary worthwhile? If so, which words should go 
 in it? The latter is important because changes to the Dlang 
 mangling need to be reflected in other projects outside our 
 direct control, too.
Here are my answers, please take them with a grain of ignorance.

I do not think prefilling the dictionary will do much good, since LZ is all about finding prefixes anyway.

I am for copying the source code into our source tree; we don't want to depend on an LZ4 library being available on the user's system. I think re-implementing it in D is also fine, as long as the function is extern(C). However, I agree with Martin's point: it is a waste of resources to re-implement something that already has a perfectly fine implementation. The only reason I re-implemented the LZ4 decompression was to use it at CTFE.
May 22 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 22 May 2016 17:19:38 +0000
schrieb Stefan Koch <uplink.coder googlemail.com>:

 Here are my answers please take them with a grain of ignorance.
 […]
Maybe you are right about the dictionary. I haven't put much thought into it. Anyways, I was looking for input from the core devs who are affected by this.
 However I agree with Martins point: It is a waste of resources
 re-implementing something that already has a perfectly fine
 implementation.
I already loosely copied and fixed up the compressor in the attached proof of concept. Thanks to the two languages' similarities it was a matter of an hour or so. If you provide the decompressor, we're set. :) -- Marco
May 22 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 22 May 2016 at 18:18:46 UTC, Marco Leise wrote:
 If you provide the decompressor, we're set. :)
I am already working on a non-CTFE-able version that does not use slices and should therefore be faster. But decompression is only needed for demangling.
May 22 2016
prev sibling next sibling parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 22 May 2016 at 16:06:07 UTC, Marco Leise wrote:
 There are a few open questions, for me at least. Are there 
 other proposals with good properties? Are there any objections 
 against the benchmarking method or the license? Can the 
 implementation be in D? If not, should we link against the 
 system liblz4.so/a or copy the C code into the backend? Is a 
 pre-filled dictionary worthwhile? If so, which words should go 
 in it? The latter is important because changes to the Dlang 
 mangling need to be reflected in other projects outside our 
 direct control, too.
How it's being benchmarked does make me question it a bit. I avoided multi-threading, so my results are as they would run on one CPU/core, while adding multi-threading probably isn't too hard to do (at least in D). Then there's the question of how many cores, and other details; a compressor designed to run in parallel will naturally be in its own element there. This all assumes we are allowed to use those cores in the first place: the compiler may already be using them for other tasks (scanning other files, optimizing code, CTFE, etc.), at which point speed gained in compression may not translate into an overall speedup if you are stealing CPU cycles.

As for a primed dictionary, I'm trying to figure out what would go in it myself. Commonly used templates and keywords that appear in symbols are so far my best guess. I'd avoid library names unless they actually appear in the symbols (so 'std.core' might not mean much, while 'unittestL' does). Then again, I don't have a good sample of what I'd actually encounter live; maybe every major library import would be a good thing to have.
May 22 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 9:06 AM, Marco Leise wrote:
 Are there any objections
 against the benchmarking method or the license?
BSD may cause problems.
 Can the implementation be in D?
Not yet.
 If not, should we link against the
 system liblz4.so/a
No.
 or copy the C code into the backend?
Yes.
 Is a pre-filled dictionary worthwhile?
The trouble with that is the demangler would need the same dictionary, right? If so, that would be problematic.
May 22 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 22 May 2016 13:02:37 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 5/22/2016 9:06 AM, Marco Leise wrote:
 Are there any objections
 against the benchmarking method or the license?  
BSD may cause problems.
 Can the implementation be in D?  
Not yet.
 If not, should we link against the
 system liblz4.so/a  
No.
 or copy the C code into the backend?  
Yes.
Can you elaborate on how copying the source overrules your above license concerns? I would create a directory "lz4" for the sources, compile them like we compile zlib now and install dmd with a backend license stating that it uses lz4 licensed under BSD?
 Is a pre-filled dictionary worthwhile?  
The trouble with that is the demangler would need the same dictionary, right? If so, that would be problematic.
Right, it's probably more trouble than worth it. -- Marco
May 22 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/22/2016 10:52 PM, Marco Leise wrote:
 Can you elaborate on how copying the source overrules your
 above license concerns?
It does not. It just avoids having to link in a library that may or may not exist on every machine we want to create demanglers on.
 I would create a directory "lz4" for the sources, compile
 them like we compile zlib now and install dmd with a backend
 license stating that it uses lz4 licensed under BSD?
How many lines of code are we talking about here? The post you made makes it look like just a handful - why go to all the trouble of creating more directories and build complexity?
May 22 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
The file format: http://cyan4973.github.io/lz4/lz4_Block_format.html

It doesn't look too difficult. If we implement our own LZ4 compressor based on 
that, from scratch, we can Boost-license it.
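For scale, here is a rough sketch of a decoder for that block format (it trusts its input and does no bounds checking; the real thing needs both):

// One LZ4 block: a series of sequences. Each sequence is a token byte
// (high nibble = literal length, low nibble = match length - 4), optional
// length-extension bytes, the literals, then a 2-byte little-endian offset
// pointing back into already-decoded output. The last sequence has only literals.
ubyte[] lz4DecodeBlock(const(ubyte)[] src, size_t decodedSize)
{
    auto dst = new ubyte[decodedSize];
    size_t s, d;

    size_t readLength(size_t base)
    {
        size_t len = base;
        if (base == 15)                       // 15 means more length bytes follow
        {
            ubyte b;
            do { b = src[s++]; len += b; } while (b == 255);
        }
        return len;
    }

    while (d < decodedSize)
    {
        const token = src[s++];

        auto litLen = readLength(token >> 4); // copy the literal run
        dst[d .. d + litLen] = src[s .. s + litLen];
        s += litLen;  d += litLen;
        if (d >= decodedSize)
            break;                            // final sequence: literals only

        const offset = src[s] | (src[s + 1] << 8);
        s += 2;
        auto matchLen = readLength(token & 0xF) + 4;  // minimum match is 4
        foreach (_; 0 .. matchLen)            // byte-wise: the match may overlap itself
        {
            dst[d] = dst[d - offset];
            ++d;
        }
    }
    return dst;
}

The compressor is where the real work (match finding) lives, but the format itself stays this small.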
May 22 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 22 May 2016 23:42:33 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 The file format: http://cyan4973.github.io/lz4/lz4_Block_format.html
 
 It doesn't look too difficult. If we implement our own LZ4 compressor based on 
 that, from scratch, we can boost license it.
Ok, any volunteers? I'm not personally looking forward to reimplementing lz4 from scratch right now. As Stefan Koch said, we should be able to use the existing optimized code, like gcc uses gmp. -- Marco
May 23 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
Am Sun, 22 May 2016 23:42:33 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:
 The file format: 
 http://cyan4973.github.io/lz4/lz4_Block_format.html
 
 It doesn't look too difficult. If we implement our own LZ4 
 compressor based on that, from scratch, we can boost license it.
That is right. It's pretty simple.

On Monday, 23 May 2016 at 07:30:00 UTC, Marco Leise wrote:
 Ok, any volunteers?
Well, I am not a compression expert, but since I am already working on optimizing the decompressor, I can give it a try.

The method for achieving perfect compression is outlined here: https://github.com/Cyan4973/lz4/issues/183
May 23 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/23/2016 5:04 AM, Stefan Koch wrote:
 Am Sun, 22 May 2016 23:42:33 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:
 The file format: http://cyan4973.github.io/lz4/lz4_Block_format.html

 It doesn't look too difficult. If we implement our own LZ4 compressor based on
 that, from scratch, we can boost license it.
That is right. It's pretty simple.
Also, the LZ4 compressor posted here has a 64K string limit, which won't work for D because there are reported 8Mb identifier strings.
May 23 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 23 May 2016 at 15:33:45 UTC, Walter Bright wrote:
 Also, the LZ4 compressor posted here has a 64K string limit, 
 which won't work for D because there are reported 8Mb 
 identifier strings.
This is only partially true. The 64k limit does not apply to the input string; it only applies to the dictionary (the lookback window). It would only be hit if we found 64k of identifier without any repetition.
May 23 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Monday, 23 May 2016 at 16:00:20 UTC, Stefan Koch wrote:
 On Monday, 23 May 2016 at 15:33:45 UTC, Walter Bright wrote:
 Also, the LZ4 compressor posted here has a 64K string limit, 
 which won't work for D because there are reported 8Mb 
 identifier strings.
This is only partially true. The 64k limit does not apply to the input string. It does only apply to the dictionary. It would only hit if we find 64k of identifier without repetition.
If you want speed: 16k (aka half of the L1D).
May 23 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/23/2016 9:00 AM, Stefan Koch wrote:
 On Monday, 23 May 2016 at 15:33:45 UTC, Walter Bright wrote:
 Also, the LZ4 compressor posted here has a 64K string limit, which won't work
 for D because there are reported 8Mb identifier strings.
This is only partially true. The 64k limit does not apply to the input string.
The starting line of the compression function posted here tests for it and aborts if larger.
May 23 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Monday, 23 May 2016 at 20:36:04 UTC, Walter Bright wrote:
 On 5/23/2016 9:00 AM, Stefan Koch wrote:
 The 64k limit does not apply to the input string.
The starting line of the compression function posted here tests for it and aborts if larger.
So Walter, big question: How big before the compressor kicks in? What's the minimum size of the input string the compressor should expect?

Found something VERY interesting. I have 4 algorithms I'm benchmarking (the original and 3 edited compression routines). Most of them perform at similar speed (a 15-25% boost), but when I tested them compressing 20k of raw D code (as a large example), an obvious winner among the 4 suddenly emerged, performing 5x faster than the original id_compress. This gain in speed wasn't obvious with a smaller string (1k or so), so assuming I'm getting good results this is very promising.

id_compress:  1379ms (1379331μs), 22595 -> 8458 bytes
big_compress: 1186ms (1186608μs), 22595 -> 6784 bytes
id_reduced:    626ms (626750μs),  22595 -> 9618 bytes
min_compress:  262ms (262340μs),  22595 -> 9266 bytes

I assume speed is more important than a little loss in compression.
May 24 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2016 4:55 AM, Era Scarecrow wrote:
  So Walter, big question: How big before the compressor kicks in? What's the
 minimum size of the input string the compressor should expect?
It's currently set at 64 in the PR. But that number is just picked out of the air.
  I assume speed is more important than a little loss in compression.
Yes, but it's a judgement call.
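In code the guard amounts to something like this sketch (the names are illustrative, not the PR's):

// Only pay for compression when the mangled name is long enough to
// plausibly win; 64 is the cut-off currently picked in the PR.
const(char)[] maybeCompress(alias compressFn)(const(char)[] id)
{
    enum minLengthWorthCompressing = 64;
    if (id.length < minLengthWorthCompressing)
        return id;                        // short names pass through untouched
    auto packed = compressFn(id);
    return packed.length < id.length ? packed : id;
}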
May 24 2016
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Tuesday, 24 May 2016 at 20:07:02 UTC, Walter Bright wrote:
 On 5/24/2016 4:55 AM, Era Scarecrow wrote:
 So Walter, big question: How big before the compressor kicks 
 in? What's the minimum size of the input string the compressor 
 should expect?
It's currently set at 64 in the PR. But that number is just picked out of the air.
64 bytes? Hmm I had the impression it was going to be far larger...
May 24 2016
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Tuesday, 24 May 2016 at 20:25:18 UTC, Era Scarecrow wrote:
  64 bytes? Hmm I had the impression it was going to be far 
 larger...
Well, it's been a while since I last touched this topic (or the code), so I'll post it. There are three variants written, plus the control:

0) compress.c - the control code we are comparing against.
1) compress_reduced.c - a drop-in replacement. Shouldn't be any issue with using it, although it doesn't compress as well or as fast as it could.
2) compress_min.d - might need some API reworking, but other than that it is real fast by my tests.
3) compress_big.d - much more memory usage, but compresses similarly to compress_min, and does much better compression on large objects that are closer to normal text, like source code.

For sheer speed compress_min is the best choice, although it may not compress as well.

Note: wideint.d is referenced for a compression test; you can change the script to use something else, or copy something in and just fill the file.

https://github.com/rtcvb32/Side-Projects/commit/452532904ca2d43af09bced31854eadc0395f406
Oct 17 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 23 May 2016 12:04:48 +0000
schrieb Stefan Koch <uplink.coder googlemail.com>:

 The method for archiving perfect compression it outlined here:
 https://github.com/Cyan4973/lz4/issues/183
Nice, if it can keep the compression speed up. -- Marco
May 23 2016
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

   
 https://github.com/dlang/dmd/blob/master/src/backend/compress.c
Yes, LZO, LZ4, or snappy would be good choices. They are real-time compression algorithms, i.e. it's faster to compress and write to main memory than to write uncompressed.

Please don't write anything on your own. There is a meta-library for those algorithms, http://blosc.org/, for which I added Deimos headers: http://code.dlang.org/packages/d-blosc.
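For reference, pushing one symbol through blosc could look roughly like this sketch (the C prototypes are copied inline for illustration -- in practice they come from the d-blosc bindings -- and the helper name and fallback behaviour are made up):

extern (C)
{
    void blosc_init();
    void blosc_destroy();
    int blosc_set_compressor(const(char)* compname);
    int blosc_compress(int clevel, int doshuffle, size_t typesize,
                       size_t nbytes, const(void)* src, void* dest, size_t destsize);
}
enum BLOSC_MAX_OVERHEAD = 16;   // headroom required by blosc.h

// Compress a symbol with blosc's LZ4 backend; fall back to the raw bytes
// if compression doesn't help (blosc_compress returns 0 or negative).
ubyte[] bloscPack(const(ubyte)[] symbol)
{
    blosc_init();
    scope (exit) blosc_destroy();
    blosc_set_compressor("lz4");

    auto dst = new ubyte[symbol.length + BLOSC_MAX_OVERHEAD];
    const n = blosc_compress(1 /*clevel*/, 0 /*no shuffle*/, 1 /*typesize*/,
                             symbol.length, symbol.ptr, dst.ptr, dst.length);
    return n > 0 ? dst[0 .. n] : symbol.dup;
}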
May 22 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 22 May 2016 at 12:12:09 UTC, Martin Nowak wrote:

 Yes, LZO, LZ4, or snappy would be good choices. They are 
 real-time compression algorithms, i.e. it's faster to compress 
 and write to main memory than w/o compression.
Yes, the LZ compressors are very good for this use case.
 Please don't write anything on your own. There is a meta 
 library for those algorithms http://blosc.org/, for which I 
 added Deimos headers http://code.dlang.org/packages/d-blosc.
Agreed. The existing implementations have been fine-tuned for performance, and we should use them as leverage.
May 22 2016
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
 The current one is effective, but slow:

   
 https://github.com/dlang/dmd/blob/master/src/backend/compress.c

 Anyone want to make a stab at making it faster? Changing the 
 format is fair game, as well as down and dirty assembler if 
 that's what it takes.

 So really, how good are you at fast code?
Mandatory goldmine on compression: http://fastcompression.blogspot.fr/
May 22 2016