
digitalmars.D.learn - Splitting up large dirty file

Dennis <dkorpel gmail.com> writes:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb 
would fit but I get an out of memory error when using 
std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in 
the middle of lines)

I want to write a program that splits it up into multiple files, 
with the splits happening every n lines. I keep encountering 
roadblocks though:

- You can't give Yes.useReplacementDchar to `byLine`, and `byLine` 
(or `readln`) throws an Exception upon encountering an invalid 
character.
- decodeFront doesn't work on inputRanges like 
`byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you have the risk of a split 
being in the middle of a character with multiple code units

Is there a simple way to do this?
May 15 2018
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/15/18 4:36 PM, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit 
 but I get an out of memory error when using std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes in the 
 middle of lines)
 
 I want to write a program that splits it up into multiple files, with 
 the splits happening every n lines. I keep encountering roadblocks though:
 
 - You can't give Yes.useReplacementDchar to `byLine`, and `byLine` (or 
 `readln`) throws an Exception upon encountering an invalid character.
 - decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
 - std.algorithm.splitter doesn't work on inputRanges either
 - When you convert chunks to arrays, you have the risk of a split being 
 in the middle of a character with multiple code units
 
 Is there a simple way to do this?
 
Using iopipe, you can split on N lines (iopipe doesn't autodecode when searching for newlines), or split on a pre-determined chunk size (and ensure you don't split a code point).

Splitting on N lines:

```
import iopipe.bufpipe;
import iopipe.textpipe;

auto infile = openDev("filename").bufd.assumeText.byLine;
foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer
```

Splitting on a pre-determined chunk size:

```
auto infile = openDev("filename")
    .bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
    .assumeText              // it's text, not ubyte
    .ensureDecodeable;       // do not end in the middle of a codepoint
```

The output isn't as straightforward. Ideally you would want to simply create an output pipe that splits into multiple files, and process the whole thing at once. I haven't created such a thing yet though (will add an enhancement request to do so). The easiest thing to do is to write the entire window of the input pipe into an output pipe, or cast it back to ubyte[] and write directly to an output device. e.g.:

```
auto infile = ...             // one of the above ideas
    .encodeText;              // convert to ubyte
auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // flush the input buffer
... // refill the buffer using the chosen technique above.
```

-Steve
May 15 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb
 would fit but I get an out of memory error when using
 std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes in
 the middle of lines)

 I want to write a program that splits it up into multiple files,
 with the splits happening every n lines. I keep encountering
 roadblocks though:

 - You can't give Yes.useReplacementDchar to `byLine`, and `byLine`
 (or `readln`) throws an Exception upon encountering an invalid
 character.
 - decodeFront doesn't work on inputRanges like
 `byChunk(4096).joiner`
 - std.algorithm.splitter doesn't work on inputRanges either
 - When you convert chunks to arrays, you have the risk of a split
 being in the middle of a character with multiple code units

 Is there a simple way to do this?
If you're on a *nix system, and you're simply looking for a solution to split files and don't necessarily care about writing one, I'd suggest trying the split utility:

https://linux.die.net/man/1/split

If I had to write it in D, I'd probably just use std.mmfile and operate on the file as a dynamic array of ubytes, since if what you care about is '\n', that can easily be searched for without needing any decoding, and using mmap avoids having to chunk anything.

- Jonathan M Davis
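A minimal sketch of the memory-mapped approach, assuming hypothetical file names and the 8-million-line split point mentioned later in the thread; std.mmfile wraps mmap, and the scan treats the data as raw bytes so invalid UTF-8 never matters:

```
import std.mmfile : MmFile;
import std.stdio : File;

void main()
{
    // map the file; the OS pages it in as needed, so 1.5 GB is fine
    auto mmf = new MmFile("inputfile.json");
    auto data = cast(ubyte[]) mmf[];

    // find the end of the first n lines by scanning for '\n' bytes
    enum n = 8_000_000;
    size_t count = 0, splitAt = data.length;
    foreach (i, b; data)
        if (b == '\n' && ++count == n)
        {
            splitAt = i + 1;
            break;
        }

    File("part1.json", "wb").rawWrite(data[0 .. splitAt]);
    File("part2.json", "wb").rawWrite(data[splitAt .. $]);
}
```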
May 15 2018
Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb 
 would fit but I get an out of memory error when using 
 std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes 
 in the middle of lines)

 I want to write a program that splits it up into multiple 
 files, with the splits happening every n lines. I keep 
 encountering roadblocks though:

 - You can't give Yes.useReplacementDchar to `byLine`, and 
 `byLine` (or `readln`) throws an Exception upon encountering an 
 invalid character.
Can you show the program you are using that throws when using byLine? I tried a very simple program that reads and outputs line-by-line, then fed it a file that contained invalid utf-8. I did not see an exception.

The invalid utf-8 was created by taking part of this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a commonly used file with utf-8 edge cases), plus adding a number of random hex characters, including null. I don't see exceptions thrown. The program I used:

```
int main(string[] args)
{
    import std.stdio;
    import std.conv : to;

    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes)) write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
```
May 15 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
 Can you show the program you are using that throws when using 
 byLine?
Here's a version that only outputs the first chunk:

```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
    enforce(args.length == 2, "Pass one filename as argument");
    auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
    new File("output.txt", "w").write(lineChunks.front.joiner);
}
```

```
dmd splitFile -g
./splitFile.exe UTF-8-test.txt

std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
----------------
0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)
```
May 16 2018
drug <drug2004 bk.ru> writes:
On 16.05.2018 10:06, Dennis wrote:
 
 Here's a version that only outputs the first chunk:
 ```
 import std.stdio;
 import std.range;
 import std.algorithm;
 import std.file;
 import std.exception;
 
 void main(string[] args) {
      enforce(args.length == 2, "Pass one filename as argument");
      auto lineChunks = File(args[1], 
 "r").byLine.drop(4).chunks(10_000_000/10);
      new File("output.txt", "w").write(lineChunks.front.joiner);
 }
 ```
 
 dmd splitFile -g
 ./splitFile.exe UTF-8-test.txt
 
 std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
 ----------------
 0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
 0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
 0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
 0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)
What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for the exception.
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
 What is the purpose of `.drop(4)`? I'm pretty sure this is the 
 reason of the exception.
The file in question is a .json database dump with an array "rows" of 10 million 8-line objects. The newlines in the string fields are escaped, but they still contain other invalid characters which makes std.json reject it. The first 4 lines of the file are basically "header" and the last 2 lines are a closing ] and }, so I want to split every 4 + 8*(10_000_000/amountOfFiles) lines and also remove the trailing comma, add brackets, drop the last 2 lines etc.

I thought it wouldn't be hard to crudely split this file using D's range functions and basic string manipulation, but the combination of being too large for a string and having invalid encoding seems to defeat most simple solutions. For now I decided to use Git Bash and do:

```
tail -n80000002 inputfile.json | split -l 8000000 - outputfile
```

And now I have files that do fit in memory. I'm still interested in complete D solutions though, thanks for the iopipe and memory mapped file suggestions Steven and Jonathan. I will check those out.
May 16 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, May 16, 2018 08:57:10 Dennis via Digitalmars-d-learn wrote:
 I thought it wouldn't be hard to crudely split this file using
 D's range functions and basic string manipulation, but the
 combination of being too large for a string and having invalid
 encoding seems to defeat most simple solutions.
D is designed with the idea that a string is valid UTF-8, a wstring is valid UTF-16, and a dstring is valid UTF-32. For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't.

If you're ever dealing with a different encoding (or with invalid Unicode), you really need to use integral types like ubyte (e.g. by using std.string.representation or by reading the data in as ubytes rather than as a string) and not try to use character types like char or string. If you try to use char or string with invalid UTF-8 without having it throw any exceptions, you're pretty much guaranteed to fail.

- Jonathan M Davis
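A minimal sketch of the ubyte route, assuming a hypothetical file name; byChunk yields ubyte[], so nothing in this loop ever decodes or validates UTF-8:

```
import std.stdio;

void main()
{
    // scan for '\n' as a plain byte; invalid UTF-8 passes through untouched
    size_t lines = 0;
    foreach (ubyte[] chunk; File("inputfile.json", "r").byChunk(4096))
        foreach (b; chunk)
            if (b == '\n')
                ++lines;
    writeln(lines, " lines");
}
```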
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 For various reasons, that doesn't always hold true like it 
 should, but pretty much all of Phobos is written with that 
 assumption and will generally throw an exception if it isn't.
It's unfortunate that Phobos tells you 'there's problems with the encoding' without providing any means to fix it or even diagnose it. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.

On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 If you're ever dealing with a different encoding (or with 
 invalid Unicode), you really need to use integral types like 
 ubyte
I tried something like `byChunk(4096).joiner.splitter(cast(ubyte) '\n')` but it turns out splitter wants at least a forward range, even when the separator is a single element.
May 17 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote:
 On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 For various reasons, that doesn't always hold true like it
 should, but pretty much all of Phobos is written with that
 assumption and will generally throw an exception if it isn't.
 It's unfortunate that Phobos tells you 'there's problems with the encoding' without providing any means to fix it or even diagnose it. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.
UTFException has a sequence member and a len member (which appear to be public but undocumented) which should contain the invalid sequence of code units.

In general though, exceptions aren't a great way to deal with this problem. I think that you either want to be calling decode manually (in which case, you have direct access to where the invalid Unicode is and have the freedom to deal with it however is appropriate), or using the Unicode replacement character would be better (which std.utf.decode supports, but it's not what's used by default).

Really, what's biting you here is the auto-decoding. With Phobos, you have to fight to have it not happen by doing stuff like special-casing your code for strings or using std.string.representation or std.utf.byCodeUnit. In principle, the way that Unicode would ideally be handled would be to validate all character data when it enters the program (doing whatever is appropriate with invalid Unicode at that point), and then the rest of the program either is always dealing with valid Unicode, or it's dealing with integral values that it doesn't treat as Unicode (e.g. ubyte[]). But the way that Phobos is written, it ends up decoding and validating all over the place.
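A minimal sketch of the manual decode route, with a hypothetical input string; with Yes.useReplacementDchar, std.utf.decode returns U+FFFD for bad sequences instead of throwing:

```
import std.typecons : Yes;
import std.utf : decode;

void main()
{
    string s = "abc\xFFdef"; // \xFF is not valid UTF-8
    size_t i = 0;
    while (i < s.length)
    {
        // never throws: invalid sequences decode to the replacement character
        dchar c = decode!(Yes.useReplacementDchar)(s, i);
        // ... handle c ('\uFFFD' for the bad byte) ...
    }
}
```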
 On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 If you're ever dealing with a different encoding (or with
 invalid Unicode), you really need to use integral types like
 ubyte
 I tried something like `byChunk(4096).joiner.splitter(cast(ubyte) '\n')` but it turns out splitter wants at least a forward range, even when the separator is a single element.
Actually, I'm pretty sure that splitter currently requires a random-access range (even though it should theoretically work with a forward range). I don't think that it can be made to work with an input range though given how the range API works - or at least, if it were made to work with it, you'd have to deal with the fact that popping front on the splitter range would invalidate anything that had been returned from front. And it would be difficult to implement it safely if what gets returned by front is not completely independent of the splitter range (which means that it needs save).

Basic input ranges in general tend to be extremely limited in what they can do, which can get really annoying when you deal with stuff like files or sockets where making it a forward range likely means either reading it all into memory or having buffers that potentially have to be dup-ed by each call to save.

- Jonathan M Davis
May 17 2018
Dennis <dkorpel gmail.com> writes:
On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
 It's unfortunate that Phobos tells you 'there's problems with 
 the encoding' without providing any means to fix it or even 
 diagnose it.
I have to take that back since I found out about std.encoding, which has functions like `sanitize`, but also `transcode`. (My file turned out to actually be encoded with ANSI / Windows-1252, not UTF-8.) Documentation is scarce however, and it requires strings instead of forward ranges. (A transcoding sketch follows at the end of this message.)

Jon Degenhardt wrote:
 Instead of:
 
      auto outputFile = new File("output.txt");
 
 try:
 
     auto outputFile = File("output.txt", "w");
Wow, I really butchered that code. So it is the `drop(4)` that triggers the UTFException? I find Exceptions in range code hard to interpret.

Kagamin wrote:
 Do it old school?
I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too) it seems. Thanks for the example.
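A minimal sketch of the `transcode` route, assuming hypothetical file names and that the input really is Windows-1252 (as it turned out to be here):

```
import std.encoding : transcode, Windows1252String;
import std.file : read, write;

void main()
{
    // reinterpret the raw bytes as Windows-1252; no UTF-8 validation happens
    auto src = cast(Windows1252String) read("input.txt");

    string utf8;
    transcode(src, utf8); // convert Windows-1252 to UTF-8
    write("output.txt", utf8);
}
```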
May 21 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
 On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
 It's unfortunate that Phobos tells you 'there's problems with
 the encoding' without providing any means to fix it or even
 diagnose it.
 I have to take that back since I found out about std.encoding which has functions like `sanitize`, but also `transcode`. (My file turned out to actually be encoded with ANSI / Windows-1252, not UTF-8.) Documentation is scarce however, and it requires strings instead of forward ranges.

 Jon Degenhardt wrote:
 Instead of:
      auto outputFile = new File("output.txt");

 try:
     auto outputFile = File("output.txt", "w");
 Wow, I really butchered that code. So it is the `drop(4)` that triggers the UTFException?
drop is range-based, so if you give it a string, it's going to decode because of the whole auto-decoding mess with std.range.primitives.front and popFront. If you can't have auto-decoding, you either have to be dealing with functions that you know avoid it, or you need to use something like std.string.representation or std.utf.byCodeUnit to get around the auto-decoding.

If you're dealing with invalid Unicode, you basically have to either convert it all up front or do something like treating it as binary data, or Phobos is going to try to decode it as Unicode and give you UTFExceptions.
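A minimal sketch of the `representation` workaround, with a hypothetical input; it works with splitter because immutable(ubyte)[] is a random-access range and nothing auto-decodes:

```
import std.algorithm : splitter;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string dirty = "one\ntwo\xFFtwo\nthree"; // \xFF is not valid UTF-8
    // representation returns immutable(ubyte)[]: same bytes, no decoding
    foreach (line; dirty.representation.splitter(cast(ubyte) '\n'))
        writeln(cast(string) line); // writes the raw bytes of each line as-is
}
```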
 I find Exceptions in range code hard to interpret.
Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.

The big problem here really is that all you're really being told is that your string has invalid Unicode in it somewhere, along with the chain of function calls that resulted in std.utf.decode being called on your invalid Unicode. But even if you weren't dealing with ranges, if you passed invalid Unicode to something completely string-based which did decoding, you'd run into pretty much the same problem. The data is being used outside of its original context where you could easily figure out what it relates to, so it's going to be a problem by its very nature. The only real solution there is to be controlling the decoding yourself, and even then, it's easy to be in a position where it's hard to figure out where in the data the bad data is unless you've done something like keep track of exactly what index you're at, which really doesn't work well once you're dealing with slicing data.
 Kagamin wrote:

 Do it old school?
 I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too) it seems. Thanks for the example.
The whole auto-decoding mess makes things worse than they should be, but if you find procedural examples more flexible, then I would guess that that would be simply a matter of getting more experience with ranges. Ranges are far more composable in terms of how they're used, which tends to inherently make them more flexible. However, it does result in code that's a mixture of functional and procedural programming, which can be quite a shift for some folks. So, there's no question that it takes some getting used to, but D does allow for the more classic approaches, and ranges are not always the best approach.

As for performance, that depends on the code and the compiler. It wouldn't surprise me if dmd didn't optimize out the range stuff as much as it really should, but it's my understanding that ldc typically manages to generate code where the range abstraction didn't cost you anything. If there's an issue, I think that it's frequently an algorithmic one or the fact that some range-processing has a tendency to process the same data multiple times, because that's the easiest, most abstract way to go about it and works in general but isn't always the best solution.

For instance, because of how the range API works, when using splitter, if you iterate through the entire range, you pretty much have to iterate through it twice, because it does look-ahead to find the delimiter and then returns you a slice up to that point, after which, you process that chunk of the data to do whatever it is you want to do with each split piece. At a conceptual level, what you're doing with your code with splitter is then really clean and easy to write, and often, it should be plenty efficient, but it does require going over the data twice, whereas if you looped over the data yourself, looking for each delimiter, you'd only need to iterate over it once. So, in cases like that, I'd fully expect the abstraction to cost you, though whether it costs enough to matter depends on what you're doing.

As is the case when dealing with most abstractions, I think that it's mostly a matter of using it where it makes sense to write cleaner code more quickly and then later figuring out the hot spots where you need to optimize better. In many cases, ranges will be pretty much the same as writing loops, and in others, the abstraction is worth the cost. Where it isn't, you don't use them or implement something yourself rather than using the standard function for it, because you can write something faster for your use case. Just the other day, I refactored some code to not use splitter, because in that particular case, it was costing too much, but there are still tons of cases where I'd use splitter without thinking twice about it, because it's the simplest, fastest way to get the job done, and it's going to be fast enough in most cases.

- Jonathan M Davis
May 21 2018
Dennis <dkorpel gmail.com> writes:
On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn 
 wrote:
 drop is range-based, so if you give it a string, it's going to 
 decode because of the whole auto-decoding mess with 
 std.range.primitives.front and popFront.
In this case I used drop to drop lines, not characters. The exception was thrown by the joiner, it turns out.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 I find Exceptions in range code hard to interpret.
 Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.
Exactly that: a stack trace full of weird mangled names of template functions, lambdas etc. And because of lazy evaluation and chains of range functions, the line number doesn't easily show who the culprit is.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 In many cases, ranges will be pretty much the same as writing 
 loops, and in others, the abstraction is worth the cost.
From the benchmarking I did, I found that ranges are easily an order of magnitude slower even with compiler optimizations:

https://run.dlang.io/gist/5f243ca5ba80d958c0bc16d5b73f2934?compiler=ldc&args=-O3%20-release

```
LDC -O3 -release
            Range     Procedural
Stringtest: ["267ns", "11ns"]
Numbertest: ["393ns", "153ns"]

DMD -O -inline -release
            Range     Procedural
Stringtest: ["329ns", "8ns"]
Numbertest: ["1237ns", "282ns"]
```

This first range test is an opcode scanner I wrote for an assembler. The range code is very nice and it works, but it needlessly allocates a new string. So I switched to a procedural version, which runs (and compiles) faster. This procedural version did have some bugs initially though.

The second test is a simple number calculation. I thought that the range code inlines to roughly the same procedural code so it could be optimized the same, but there remains a factor 2 gap. I don't know where the difficulty is, but I did notice that switching the maximum number from int to enum makes the procedural version 0 ns (calculated at compile time) while LDC can't deduce the outcome in the range version (which still runs for >300 ns).
May 21 2018
Jon Degenhardt <jond noreply.com> writes:
On Monday, 21 May 2018 at 15:00:09 UTC, Dennis wrote:
 I want to be convinced that Range programming works like a 
 charm, but the procedural approaches remain more flexible (and 
 faster too) it seems. Thanks for the example.
On Monday, 21 May 2018 at 22:11:42 UTC, Dennis wrote:
 In this case I used drop to drop lines, not characters. The 
 exception was thrown by the joiner it turns out.
  ...
 From the benchmarking I did, I found that ranges are easily an 
 order of magnitude slower even with compiler optimizations:
My general experience is that range programming works quite well. It's especially useful when used to do lazy processing and as a result minimize memory allocations. I've gotten quite good performance with these techniques (see my DConf talk slides: https://dconf.org/2018/talks/degenhardt.html).

Your benchmarks are not against the file split case, but if you benchmarked that you may have also seen it as slow. In that case you may be hitting specific areas where there are opportunities for performance improvement in the standard library. One is that joiner is slow (PR: https://github.com/dlang/phobos/pull/6492). Another is that the write[fln] routines are much faster when operating on a single large object than many small objects. e.g. It's faster to call write[fln] with an array of 100 characters than: (a) calling it 100 times with one character; (b) calling it once, with 100 characters as individual arguments (template form); (c) calling it once with a range of 100 characters, each processed one at a time.

When joiner is used as in your example, you not only hit the joiner performance issue, but the write[fln] issue. This is due to something that may not be obvious at first: when joiner is used to concatenate arrays or ranges, it flattens out the array/range into a single range of elements. So, rather than writing a line at a time, your example is effectively passing a character at a time to write[fln].

So, in the file split case, using byLine in an imperative fashion as in my example will have the effect of passing a full line at a time to write[fln], rather than individual characters. Mine will be faster, but not because it's imperative. The same thing could be achieved procedurally.

Regarding the benchmark programs you showed - this is very interesting. It would certainly be worth additional looks into this. One thing I wonder is if the performance penalty may be due to a lack of inlining due to crossing library boundaries. The imperative versions aren't crossing these boundaries. If you're willing, you could try adding LDC's LTO options and see what happens. There are some instructions in the release notes for LDC 1.9.0 (https://github.com/ldc-developers/ldc/releases). Make sure you use the form that includes druntime and phobos.

--Jon
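A small sketch of the flattening Jon describes, with hypothetical data; joiner turns a range of line slices into a range of individual characters, so write() no longer sees whole lines:

```
import std.algorithm : joiner;
import std.stdio : File;

void main()
{
    auto lines = ["first line\n", "second line\n"];
    auto f = File("out.txt", "w");

    // one buffered write call per whole line slice
    foreach (line; lines)
        f.write(line);

    // same bytes, but the joined range is consumed one character at a time
    f.write(lines.joiner);
}
```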
May 21 2018
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote:
 On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
 Can you show the program you are using that throws when using 
 byLine?
 Here's a version that only outputs the first chunk:

 ```
 import std.stdio;
 import std.range;
 import std.algorithm;
 import std.file;
 import std.exception;

 void main(string[] args) {
     enforce(args.length == 2, "Pass one filename as argument");
     auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
     new File("output.txt", "w").write(lineChunks.front.joiner);
 }
 ```
If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid utf-8 characters.
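A minimal sketch of that style, with hypothetical file names; byLine hands out raw char[] slices without decoding them, a counter decides when to start the next output file, and rawWrite avoids any validation on output:

```
import std.conv : to;
import std.stdio;

void main(string[] args)
{
    enum linesPerFile = 8_000_000;
    auto input = File(args[1], "r");
    File output;
    size_t lineNum = 0;

    foreach (line; input.byLine(KeepTerminator.yes))
    {
        // start a new output file every linesPerFile lines
        if (lineNum % linesPerFile == 0)
            output = File("part" ~ to!string(lineNum / linesPerFile) ~ ".txt", "wb");
        output.rawWrite(line);
        ++lineNum;
    }
}
```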
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
 If you write it in the style of my earlier example and use 
 counters and if-tests it will work. byLine by itself won't try 
 to interpret the characters (won't auto-decode them), so it 
 won't trigger an exception if there are invalid utf-8 
 characters.
When printing to stdout it seems to skip any validation, but writing to a file does give an exception:

```
auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
```

```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877): (No error)
```

According to the documentation, byLine can throw a UTFException so relying on the fact that it doesn't in some cases doesn't seem like a good idea.
May 17 2018
Jon Degenhardt <jond noreply.com> writes:
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
 On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
 If you write it in the style of my earlier example and use 
 counters and if-tests it will work. byLine by itself won't try 
 to interpret the characters (won't auto-decode them), so it 
 won't trigger an exception if there are invalid utf-8 
 characters.
 When printing to stdout it seems to skip any validation, but writing to a file does give an exception:

 ```
 auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
 auto outputFile = new File("output.txt");
 foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
 ```

 std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877): (No error)

 According to the documentation, byLine can throw a UTFException so relying on the fact that it doesn't in some cases doesn't seem like a good idea.
Instead of:

```
auto outputFile = new File("output.txt");
```

try:

```
auto outputFile = File("output.txt", "w");
```

That works for me. The second arg ("w") opens the file for write. When I omit it, I also get an exception, as the default open mode is for read:

* If file does not exist: Cannot open file `output.txt' in mode `rb' (No such file or directory)
* If file does exist: (Bad file descriptor)

The second error presumably occurs when writing.

As an aside - I agree with one of your bigger picture observations: it would be preferable to have more control over utf-8 error handling behavior at the application level.
May 17 2018
Kagamin <spam here.lot> writes:
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
 ```
     auto inputStream = (args.length < 2 || args[1] == "-") ? 
 stdin : args[1].File;
 	auto outputFile = new File("output.txt");
     foreach (line; inputStream.byLine(KeepTerminator.yes)) 
 outputFile.write(line);
 ```
Do it old school?

```
import std.algorithm : countUntil;
import std.stdio;

void main(string[] args)
{
    auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
    int line;
    auto outputFile = File("output.txt", "wb");
    foreach (chunk; inputStream.byChunk(4 << 10))
    {
        auto rem = chunk;
        while (rem != null)
        {
            auto i = rem.countUntil(10); // index of '\n', or -1 if none in this chunk
            auto len = i + 1;
            if (i < 0) len = rem.length; // no newline: write the rest of the chunk
            else line++;                 // completed a line
            outputFile.rawWrite(rem[0 .. len]);
            rem = rem[len .. $];
        }
    }
}
```
May 18 2018
Neia Neutuladh <neia ikeran.org> writes:
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb 
 would fit but I get an out of memory error when using 
 std.file.read)
Memory mapping should work. That's in core.sys.posix.sys.mman for Posix systems, and Windows has some equivalent probably. (But nobody uses Windows, right?)
 - It is dirty (contains invalid Unicode characters, null bytes 
 in the middle of lines)
std.algorithm should generally work with sequences of anything, not just strings. So memory map, cast to ubyte[], and deal with it that way?
 - When you convert chunks to arrays, you have the risk of a 
 split being in the middle of a character with multiple code 
 units
It's straightforward to scan for the start of a Unicode character; you just skip past characters where the highest bit is set and the next-highest is not. (0b1100_0000 through 0b1111_1110 is the start of a multibyte character; 0b0000_0000 through 0b0111_1111 is a single-byte character.) That said, you seem to only need to split based on a newline character, so you might be able to ignore this entirely, even if you go by chunks.
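A tiny sketch of that bit test; a byte begins a UTF-8 sequence unless it is a continuation byte of the form 0b10xx_xxxx:

```
// true if b is the first byte of a UTF-8 sequence (single- or multi-byte)
bool startsCodePoint(ubyte b)
{
    return (b & 0b1100_0000) != 0b1000_0000;
}

unittest
{
    assert(startsCodePoint('a'));          // ASCII byte
    assert(startsCodePoint(0b1100_0010));  // lead byte of a 2-byte sequence
    assert(!startsCodePoint(0b1011_0000)); // continuation byte
}
```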
May 17 2018
ag0aep6g <anonymous example.com> writes:
On 05/17/2018 11:40 PM, Neia Neutuladh wrote:
 0b1100_0000 through 0b1111_1110 is the start of a 
 multibyte character
Nitpick: It only goes up to 0b1111_0100. The highest code point is U+10FFFF. There are no sequences with more than four bytes.
May 17 2018