
digitalmars.D - struct Location size

reply Walter Bright <newshound2 digitalmars.com> writes:
This PR https://github.com/dlang/dmd/pull/15199 reduces its size by 8 bytes, resulting in about 20 MB of memory savings compiling druntime, according to dkorpel.

It is currently:

struct Loc
{
   uint linnum;      // line number, starting from 1
   ushort charnum;   // utf8 code unit index relative to start of line, starting from 1
   ushort fileIndex; // index into filenames[], starting from 1 (0 means no filename)
}

which is 8 bytes.

If it was 4 bytes, it could still store 4 billion unique locations, which ought 
to be enough for everyone. I toyed around with various encodings:

  6 bits for column - 1..64
15 bits for line - 1..32768
11 bits for file - 2047

I also thought of schemes like having the sign bit select an alternate encoding.

So, for great glory, can anyone come up with a clever scheme that uses only 32
bits?
May 08 2023
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
  6 bits for column - 1..64
 15 bits for line - 1..32768
 11 bits for file - 2047

 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
I wouldn't separate out column/line/file at all. Concatenate all the files together in memory and store only an offset into that gigantic array. If an error happens, then and only then go back to extract the details by slicing the filename out of a listing and rescanning it to determine line and column. (You'd probably have an index that does a bit of both, like every new file or every 5000 lines, add an entry to the lookup table. Then when you do hit an error, you just need to scan from the closest point forward instead of the whole thing.)

If there are no errors, it uses little memory and is fast. If there is an error, the rescanning time is not significant anyway relative to the time to fix the error.
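
A minimal sketch of that scheme, with invented names (this is not dmd code; it assumes checkpoints are taken at line starts and that the first checkpoint sits at offset 0):

```D
// Hypothetical types for the "one global offset + sparse checkpoint" scheme.
struct Loc
{
    uint globalOffset; // offset into the concatenation of all source files
}

struct Checkpoint
{
    uint globalOffset; // where this checkpoint is (assumed to be a line start)
    uint fileIndex;    // which file it falls in
    uint line;         // 1-based line number at that offset
}

Checkpoint[] checkpoints; // one per file start, plus one every ~5000 lines

// Called only when an error is actually reported.
void resolve(Loc loc, const(char)[][] fileText, const uint[] fileStartOffset,
             out uint fileIndex, out uint line, out uint column)
{
    // Binary search for the last checkpoint at or before loc.globalOffset
    // (assumes checkpoints is non-empty and checkpoints[0].globalOffset == 0).
    size_t lo = 0, hi = checkpoints.length;
    while (hi - lo > 1)
    {
        const mid = (lo + hi) / 2;
        if (checkpoints[mid].globalOffset <= loc.globalOffset) lo = mid;
        else hi = mid;
    }
    const cp = checkpoints[lo];
    fileIndex = cp.fileIndex;
    line = cp.line;
    column = 1;

    // Rescan forward from the checkpoint to the location.
    const text   = fileText[fileIndex];
    const start  = cp.globalOffset  - fileStartOffset[fileIndex];
    const target = loc.globalOffset - fileStartOffset[fileIndex];
    foreach (i; start .. target)
    {
        if (text[i] == '\n') { line++; column = 1; }
        else ++column;
    }
}
```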
May 08 2023
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2023 5:32 PM, Adam D Ruppe wrote:
 I wouldn't separate out column/line/file at all. Concatenate all the files together in memory and store only an offset into that gigantic array. If an error happens, then and only then go back to extract the details by slicing the filename out of a listing and rescanning it to determine line and column. (You'd probably have an index that does a bit of both, like every new file or every 5000 lines, add an entry to the lookup table. Then when you do hit an error, you just need to scan from the closest point forward instead of the whole thing.)

 If there are no errors, it uses little memory and is fast. If there is an error, the rescanning time is not significant anyway relative to the time to fix the error.
That is indeed a clever idea, thank you.
May 08 2023
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2023 5:32 PM, Adam D Ruppe wrote:
 I wouldn't separate out column/line/file at all. Concatenate all the files together in memory and store only an offset into that gigantic array. If an error happens, then and only then go back to extract the details by slicing the filename out of a listing and rescanning it to determine line and column. (You'd probably have an index that does a bit of both, like every new file or every 5000 lines, add an entry to the lookup table. Then when you do hit an error, you just need to scan from the closest point forward instead of the whole thing.)

 If there are no errors, it uses little memory and is fast. If there is an error, the rescanning time is not significant anyway relative to the time to fix the error.
Thinking about it a bit more, it isn't necessary to concatenate the files at all. Just, for each file, record the starting line number, which would be the sum of all the lines in the files previously parsed.

Then, the line number can be converted to file/line by looking up the range of line numbers it falls in, which gives the file, and subtracting the corresponding starting line number, which gives the line.

So if we leave 8 bits for column, 1..256, that gives 24 for the line number, which gives room for 16,777,216 lines, which ought to be enough, and no rescanning necessary.
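
A rough sketch of that packing and lookup, with invented names (not dmd code; it assumes a 0-based global line number that fits in 24 bits):

```D
// 24-bit global line number packed with an 8-bit column.
struct Loc
{
    uint raw;

    this(uint gline, uint col)
    {
        // Column saturates at 255; gline is assumed to fit in 24 bits.
        raw = (gline << 8) | (col > 255 ? 255 : col);
    }

    uint globalLine() const { return raw >> 8; }
    uint column()     const { return raw & 0xFF; }
}

// Per parsed file: the global line number it starts at (the sum of the line
// counts of all previously parsed files). Sorted by construction.
uint[]   fileStartLine;
string[] fileName;

// Convert a 0-based global line number back to (file, 1-based local line).
void toFileLine(uint gline, out string file, out uint line)
{
    size_t lo = 0, hi = fileStartLine.length;
    while (hi - lo > 1)
    {
        const mid = (lo + hi) / 2;
        if (fileStartLine[mid] <= gline) lo = mid;
        else hi = mid;
    }
    file = fileName[lo];
    line = gline - fileStartLine[lo] + 1;
}
```

Resolving a location then only touches a small per-file table, no source text.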
May 08 2023
parent zjh <fqbqrr 163.com> writes:
On Tuesday, 9 May 2023 at 03:51:21 UTC, Walter Bright wrote:


 So if we leave 8 bits for column, 1..256, that gives 24 for the 
 line number, which gives room for 16,777,216 lines which ought 
 to be enough, and no rescanning necessary.
That can represent projects of up to 16,777,216 lines (very roughly 1000 MB of source); projects exceeding that are not covered, but most projects are.
May 08 2023
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 9 May 2023 at 00:32:33 UTC, Adam D Ruppe wrote:
 On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
  6 bits for column - 1..64
 15 bits for line - 1..32768
 11 bits for file - 2047

 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
 I wouldn't separate out column/line/file at all. Concatenate all the files together in memory and store only an offset into that gigantic array. If an error happens, then and only then go back to extract the details by slicing the filename out of a listing and rescanning it to determine line and column. (You'd probably have an index that does a bit of both, like every new file or every 5000 lines, add an entry to the lookup table. Then when you do hit an error, you just need to scan from the closest point forward instead of the whole thing.)

 If there are no errors, it uses little memory and is fast. If there is an error, the rescanning time is not significant anyway relative to the time to fix the error.
What if someone already implemented this. Wait...

https://github.com/snazzy-d/sdc/blob/master/src/source/manager.d
https://github.com/snazzy-d/sdc/blob/master/src/source/location.d
May 11 2023
prev sibling next sibling parent reply zjh <fqbqrr 163.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:


  6 bits for column - 1..64
 15 bits for line - 1..32768
 11 bits for file - 2047
8 bits for column - 1..256
15 bits for line - 1..32768
9 bits for file - 512

64 columns is generally not enough. With the split above, typical projects would be covered. But for large projects there should be a switch to expand it back to '8' bytes.
May 08 2023
next sibling parent zjh <fqbqrr 163.com> writes:
On Tuesday, 9 May 2023 at 01:41:31 UTC, zjh wrote:

 But for large projects, there should be a switch, which needs 
 to be expanded to '8' bytes.
Of course, if large projects could be required to be compiled as sub-projects, then treating each sub-project that fits the above limits as a unit would generally be sufficient: a large project would be composed of multiple unit projects, each compiled separately. But then I don't know how the overall compilation would go.
May 08 2023
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2023 6:41 PM, zjh wrote:
 8 bits for column - 1..256
 15 bits for line - 1..32768
 9 bits for file - 512

 64 columns is generally not enough. With the split above, typical projects would be covered. But for large projects there should be a switch to expand it back to '8' bytes.
I don't think 512 files is enough. As for columns, those aren't very important, and it can just "top out" at 64. Not perfect.
May 08 2023
prev sibling next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 This PR https://github.com/dlang/dmd/pull/15199 reduces its 
 size by 8 bytes, resulting in about 20Mb of memory savings 
 compiling druntime, according to  dkorpel.

 [...]

 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
Yes, call me a madlad, but in styx the filename information is in the DMD equivalent of `Scope` (and the Loc is 2*32 bits, {line,col}). If you do the same in DMD you can reduce the size to 32 bits, i.e. 2*16 bits, {line,col}.

I'm not sure this would work for DMD, but for styx the rationale was:

1. in DMD the size of Loc is a known problem, that's why it's usually recommended to pass its instances by ref; let's not have the same problem.
2. there are far fewer Scopes than Locs.
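
For illustration, the split might look roughly like this (invented declarations, not styx's or dmd's actual ones):

```D
// Loc shrinks to 32 bits because the file is identified by the enclosing scope.
struct Loc
{
    ushort line;   // 1..65_535
    ushort column; // 1..65_535
}

class Scope
{
    string fileName; // stored once per scope, shared by every Loc inside it
    Scope  parent;
    // ... symbols, etc.
}
```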
May 09 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/9/2023 1:35 AM, Basile B. wrote:
 Yes, call me a madlad, but in styx the filename information is in the DMD equivalent of `Scope` (and the Loc is 2*32 bits, {line,col}). If you do the same in DMD you can reduce the size to 32 bits, i.e. 2*16 bits, {line,col}.

 I'm not sure this would work for DMD, but for styx the rationale was:

 1. in DMD the size of Loc is a known problem, that's why it's usually recommended to pass its instances by ref; let's not have the same problem.
 2. there are far fewer Scopes than Locs.
Indeed, it would eliminate the pass-by-ref of Loc.
May 09 2023
parent reply Commander Zot <no no.no> writes:
On Wednesday, 10 May 2023 at 03:16:19 UTC, Walter Bright wrote:
 On 5/9/2023 1:35 AM, Basile B. wrote:
 Yes, call me a madlad but in styx the filename information is 
 in the DMD equivalent of `Scope` (and the Loc is 2*32 bits, 
 {line,col}). If you do the same in DMD you can reduce the size 
 to 32 bits, i.e 2*16 bits, {line,col}.
 
 I'm not sure this would work for DMD but for styx the 
 rationale was
 
 1. in DMD the size of Loc is a known problem, that's why it's 
 usually recommended to pass its instances by ref, let's not 
 have the same problem.
 2. there's much less Scopes than Loc.
Indeed, it would eliminate the pass-by-ref of Loc.
how often do you need the _exact_ column? or to rephrase it, wouldn't column divided by 2 or 3 be good enough to figure out where an error happened?
May 10 2023
next sibling parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
And that's how you break already poor performing IDE support even further.

We don't need to save double digit megabytes when we are using gigabytes 
of memory, if we have to use extreme measures which result in poorer 
usability.

Instead it would be better to figure out how to minimize the number we 
store; if we can't, it's just the cost of compiling D.
May 10 2023
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 10 May 2023 at 16:09:53 UTC, Commander Zot wrote:
 how often do you need the _exact_ column? or to rephrase it, 
 wouldn't column divided by 2 or 3 be good enough to figure out 
 where an error happened?
The column is needed in every error message when the option `-vcolumns` is activated. It really has to be the right number.
May 10 2023
parent reply Commander Zot <no no.no> writes:
On Wednesday, 10 May 2023 at 17:57:17 UTC, Basile B. wrote:
 On Wednesday, 10 May 2023 at 16:09:53 UTC, Commander Zot wrote:
 how often do you need the _exact_ column? or to rephrase it, 
 wouldn't column divided by 2 or 3 be good enough to figure out 
 where an error happened?
The column is needed in every error message when the option `-vcolumns` is activated. It really has to be the right number.
I know that it does that, but is there a reason it has to be the exact column other than "we specified it to be that way"? What use case is there for it? Because if it's just to print the line where the error happened, it really doesn't matter if you start/end one or two chars earlier. But in case it is needed, then limiting it to 64 is also not an option.
May 11 2023
parent Basile B. <b2.temp gmx.com> writes:
On Thursday, 11 May 2023 at 09:31:16 UTC, Commander Zot wrote:
 On Wednesday, 10 May 2023 at 17:57:17 UTC, Basile B. wrote:
 On Wednesday, 10 May 2023 at 16:09:53 UTC, Commander Zot wrote:
 how often do you need the _exact_ column? or to rephrase it, 
 wouldn't column divided by 2 or 3 be good enough to figure 
 out where an error happened?
The column is needed in every error message when the option `-vcolumns` is activated. It really has to be the right number.
I know that it does that, but is there a reason it has to be the exact column other than "we specified it to be that way".
The column is useful for IDEs. For example, after a failed compilation it allows accurately highlighting (with a wavy red underline) the right identifier in a chain that contains an error, instead of the whole chain. Presumably formatters can use the information too.
 What use case is there for it? Because if it's just to print the line where the error happened, it really doesn't matter if you start/end one or two chars earlier.

 but in case it is needed, then limiting it to 64 is also not an 
 option.
May 11 2023
prev sibling next sibling parent reply Dennis <dkorpel gmail.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
I put mine in the PR already: https://github.com/dlang/dmd/pull/15199#issuecomment-1538120181

It's the same idea as Adam's. You really only need a file offset, which currently already exists, but only for DMD as a lib:

```D
version (LocOffset)
    uint fileOffset; /// utf8 code unit index relative to start of file, starting from 0
```

Line number and column number can be computed when needed, and can be accelerated with a binary search in a list of line numbers sorted by file offset (because of #line directives lines won't be monotonically increasing and need to be stored in the list).

To handle multiple files, add the sum of all previous file sizes to the file offset and have a mapping from the resulting 'global offset' to local file offset.

The cool thing is that DMDLIB can still access `fileOffset`, while keeping the small 4 byte `Loc`!
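
As a sketch of the lookup described here (illustrative only; it handles a single file and ignores #line directives, which, as noted above, would require storing line numbers alongside the offsets):

```D
import std.range : assumeSorted;

// Built once per file while lexing: the file offset at which each line starts.
uint[] lineStarts = [0]; // offset 0 begins line 1

void noteNewline(uint offsetAfterNewline)
{
    lineStarts ~= offsetAfterNewline;
}

// Recover a 1-based line and column from a file offset, on demand.
void toLineColumn(uint fileOffset, out uint line, out uint column)
{
    auto sorted = assumeSorted(lineStarts);
    // The number of line starts at or before fileOffset is the line number.
    line = cast(uint) sorted.lowerBound(fileOffset + 1).length;
    column = fileOffset - lineStarts[line - 1] + 1;
}
```

For multiple files, the same table could be consulted after mapping the 'global offset' back to a per-file offset.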
May 09 2023
next sibling parent Johan <j j.nl> writes:
On Tuesday, 9 May 2023 at 09:35:41 UTC, Dennis wrote:
 On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
 I put mine in the PR already: https://github.com/dlang/dmd/pull/15199#issuecomment-1538120181

 It's the same idea as Adam's. You really only need a file offset, which currently already exists, but only for DMD as a lib:

 ```D
 version (LocOffset)
     uint fileOffset; /// utf8 code unit index relative to start of file, starting from 0
 ```

 Line number and column number can be computed when needed, and can be accelerated with a binary search in a list of line numbers sorted by file offset (because of #line directives lines won't be monotonically increasing and need to be stored in the list).
FYI, this is indeed what Clang does. 32-bit offset, with a "SourceManager" to help convert the ID back to file:line.

https://github.com/llvm/llvm-project/blob/ec77d1f3d9fcf7105b6bda25fb4d0e5ed5afd0c5/clang/include/clang/Basic/SourceLocation.h#L71-L86

-Johan
May 09 2023
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/9/2023 2:35 AM, Dennis wrote:
 Line number and column number can be computed when needed, and can be accelerated with a binary search in a list of line numbers sorted by file offset (because of #line directives lines won't be monotonically increasing and need to be stored in the list).

 To handle multiple files, add the sum of all previous file sizes to the file offset and have a mapping from the resulting 'global offset' to local file offset.
That should work, and avoids the need to keep the source file text around.
May 09 2023
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 This PR https://github.com/dlang/dmd/pull/15199 reduces its 
 size by 8 bytes, resulting in about 20Mb of memory savings 
 compiling druntime, according to  dkorpel.

 It is currently:

 struct Loc
 {
   uint linnum;      // line number, starting from 1
   ushort charnum;   // utf8 code unit index relative to start of line, starting from 1
   ushort fileIndex; // index into filenames[], starting from 1 (0 means no filename)
 }

 which is 8 bytes.

 If it was 4 bytes, it could still store 4 billion unique 
 locations, which ought to be enough for everyone. I toyed 
 around with various encodings:

  6 bits for column - 1..64
 15 bits for line - 1..32768
 11 bits for file - 2047

 I also thought of schemes like having the sign bit set an 
 alternate encoding scheme.

 So, for great glory, can anyone come up with a clever scheme 
 that uses only 32 bits?
Does it concern also C files handled by ImportC? If yes, be aware that there's no limit in C on line length and that the pre-processor can generate humongous lines. In that case splitting line/column is a bad idea, and a file offset is just enough.
May 09 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/9/2023 4:57 AM, Patrick Schluter wrote:
 Does it concern also C files handled by ImportC?
Yes
 If yes, be aware that there's no limit in C on line length and that the pre-processor can generate humongous lines.
D has no line limit, either. The saving grace is that the column is not used for anything other than error messages, so it's not a catastrophe if it sticks at 64.
 In that case splitting line/column is a bad idea, and a file offset is just enough.
File offset is a performance problem and also requires keeping the source files in memory.
May 09 2023
next sibling parent reply Johan <j j.nl> writes:
On Wednesday, 10 May 2023 at 03:01:34 UTC, Walter Bright wrote:
 On 5/9/2023 4:57 AM, Patrick Schluter wrote:
 Does it concern also C files handled by ImportC?
Yes
 If yes, be aware that there's no limit in C in line length and 
 that the pre-processor can generate humongous lines.
 D has no line limit, either. The saving grace is that the column is not used for anything other than error messages, so it's not a catastrophe if it sticks at 64.
Then why save the column number at all? At a limit of 64, it will be wrong very often.
 In that case splitting line/column is a bad idea, and a file offset is just enough.

 File offset is a performance problem and also requires keeping the source files in memory.
It is not a problem for Clang.

-Johan
May 09 2023
parent reply max haughton <maxhaton gmail.com> writes:
On Wednesday, 10 May 2023 at 06:09:25 UTC, Johan wrote:

 In that case splitting line/column is a bad idea, and a file offset is just enough.

 File offset is a performance problem and also requires keeping the source files in memory.

 It is not a problem for Clang. -Johan
I also don't see why it's a perf issue?
May 10 2023
next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Wednesday, 10 May 2023 at 17:16:23 UTC, max haughton wrote:
 I also don't see why it's a perf issue?
I can imagine something like compiling the entire linux kernel at once being a big ask, but it doesn't require any physical memory - the array can be offsets into discardable pages of memory mapped files (which would probably perform quite well in general).
May 10 2023
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/10/2023 10:16 AM, max haughton wrote:
 I also don't see why it's a perf issue?
Every time the line number is needed, the file source has to be scanned from the start of the file. Line numbers are needed not just for error messages, but for symbolic debug info. Generally debug compiles should be fast. Scanning the source files also faults them into memory.

If one does a table of pairs of file offsets/line numbers, that will speed things up, but at a cost of 8 bytes per code line. There's also the binary search needed to reference it, which will cause the array to be faulted into memory.
May 10 2023
parent max haughton <maxhaton gmail.com> writes:
On Wednesday, 10 May 2023 at 20:21:18 UTC, Walter Bright wrote:
 On 5/10/2023 10:16 AM, max haughton wrote:
 I also don't see why its a perf issue?
 Every time the line number is needed, the file source has to be scanned from the start of the file. Line numbers are needed not just for error messages, but for symbolic debug info. Generally debug compiles should be fast.
Approximately once or never per symbol, in a pattern that is probably mostly linear (i.e. why do it from scratch every time?). We live in a world where we can *parse gigabytes* of JSON per second (https://github.com/simdjson/simdjson); it can be made fast. We can fit most full western names in a SIMD register now, even.
 Scanning the source files also faults them into memory.
As does parsing them. If you aren't going to use it except for a burst at the beginning and the end, you can tell the kernel as much very easily and it can use the physical side of the memory map for something else. Building some of the code we have at Symmetry literally takes up all the memory on my machine; it's not the files.

The overarching point here is that it needs to be measured properly. I would also like to point out that making `Loc` smaller is basically tittle-tattle at the scale of dmd's memory allocation at the moment. It only feels good in the numbers Dennis posted because (in relative terms) druntime doesn't stress the compiler that much, and (in absolute terms) dmd is still allocating way too many objects overall.

If you want another free saving, use fewer `Array`s as pointer-to-struct (with a pointer in it) and make it smaller. Last time I measured it, the modal length was either 0 or 1 depending on how I measured it, so a lot of them are just wasted memory even if they never actually allocate anything themselves. Anecdotally, most memory allocated with the bump-the-pointer scheme is never written to (or, more precisely, is always 0).

Longer term: make dmd more GC friendly. Currently, activating -lowmem often makes no difference because there's usually a reference to everything somewhere. I'm not sure how you'd find easy things to change, though. I suppose the academic solution would be to find a way to export the GC's graph and stare at it; the engineering solution might be to print a report of which objects were still alive when the program exits and then stare at that instead.
May 10 2023
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 10 May 2023 at 03:01:34 UTC, Walter Bright wrote:
 File offset is a performance problem and also requires keeping 
 the source files in memory.
Just do it once, and you can even SWAR it. https://github.com/snazzy-d/sdc/blob/master/src/source/lexwhitespace.d#L115-L137
May 11 2023
parent deadalnix <deadalnix gmail.com> writes:
On Thursday, 11 May 2023 at 23:17:32 UTC, deadalnix wrote:
 On Wednesday, 10 May 2023 at 03:01:34 UTC, Walter Bright wrote:
 File offset is a performance problem and also requires keeping 
 the source files in memory.
Just do it once, and you can even SWAR it. https://github.com/snazzy-d/sdc/blob/master/src/source/lexwhitespace.d#L115-L137
The SWAR: https://github.com/snazzy-d/sdc/blob/master/src/source/swar/newline.d
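
For reference, the core SWAR trick for counting newlines looks something like this (a from-scratch sketch of the general technique, not SDC's actual code):

```D
import core.bitop : popcnt;

// Count '\n' bytes in a buffer eight bytes at a time, so that an offset can
// be turned back into a line number in a single fast pass.
size_t countNewlines(const(ubyte)[] src)
{
    enum ulong low7 = 0x7F7F7F7F7F7F7F7F;
    enum ulong nl   = 0x0A0A0A0A0A0A0A0A; // '\n' broadcast into all 8 bytes

    size_t count = 0;
    size_t i = 0;
    for (; i + 8 <= src.length; i += 8)
    {
        // Unaligned load; fine on x86, would need care on stricter targets.
        const ulong w = *cast(const(ulong)*) (src.ptr + i);
        const ulong v = w ^ nl;          // bytes equal to '\n' become zero
        ulong t = (v & low7) + low7;     // bit 7 set iff the low 7 bits are nonzero
        t = ~(t | v | low7);             // byte becomes 0x80 iff it was '\n'
        count += popcnt(t);              // one set bit per newline
    }
    foreach (b; src[i .. $])             // scalar tail
        count += (b == '\n');
    return count;
}
```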
May 11 2023
prev sibling next sibling parent max haughton <maxhaton gmail.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 This PR https://github.com/dlang/dmd/pull/15199 reduces its 
 size by 8 bytes, resulting in about 20Mb of memory savings 
 compiling druntime, according to  dkorpel.

 [...]
https://issues.dlang.org/show_bug.cgi?id=23902 while you're there, reserve some bits for fixing this.
May 09 2023
prev sibling parent cc <cc nevernet.com> writes:
On Tuesday, 9 May 2023 at 00:24:44 UTC, Walter Bright wrote:
 This PR https://github.com/dlang/dmd/pull/15199 reduces its 
 size by 8 bytes, resulting in about 20Mb of memory savings 
 compiling druntime, according to  dkorpel.

  6 bits for column - 1..64
I'm just sitting here blinking my eyes over and over at the thought of considering (essentially) losing the ability to track column numbers for errors in order to potentially save 20MB of memory usage in a process that is currently using... *presses compile and glances at second monitor* 2GB.

In other news, the City Council recently decided that doing away with crosswalks would be the best solution to dealing with the robot-cars-barreling-over-people epidemic. After careful consideration, it was determined that the benefit of meeting human needs was trivial compared to the 1% improvement in machine efficiency.
Jun 05 2023