
digitalmars.D.learn - Splitting up large dirty file

Dennis <dkorpel gmail.com> writes:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb 
would fit but I get an out of memory error when using 
std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in 
the middle of lines)

I want to write a program that splits it up into multiple files, 
with the splits happening every n lines. I keep encountering 
roadblocks though:

- You can't give Yes.useReplacementDchar to `byLine`, and `byLine` 
(or `readln`) throws an Exception upon encountering an invalid 
character.
- decodeFront doesn't work on inputRanges like 
`byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you have the risk of a split 
being in the middle of a character with multiple code units

Is there a simple way to do this?
May 15 2018
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/15/18 4:36 PM, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit 
 but I get an out of memory error when using std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes in the 
 middle of lines)
 
 I want to write a program that splits it up into multiple files, with 
 the splits happening every n lines. I keep encountering roadblocks though:
 
 - You can't give Yes.useReplacementDchar to `byLine`, and `byLine` (or 
 `readln`) throws an Exception upon encountering an invalid character.
 - decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
 - std.algorithm.splitter doesn't work on inputRanges either
 - When you convert chunks to arrays, you have the risk of a split being 
 in the middle of a character with multiple code units
 
 Is there a simple way to do this?
 
Using iopipe, you can split on N lines (iopipe doesn't autodecode when searching for newlines), or split on a pre-determined chunk size (and ensure you don't split a code point).

Splitting on N lines:

```
import iopipe.bufpipe;
import iopipe.textpipe;

auto infile = openDev("filename").bufd.assumeText.byLine;
foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer
```

Splitting on a pre-determined chunk size:

```
auto infile = openDev("filename")
    .bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
    .assumeText              // it's text, not ubyte
    .ensureDecodeable;       // do not end in the middle of a codepoint
```

The output isn't as straightforward. Ideally you would want to simply create an output pipe that splits into multiple files, and process the whole thing at once. I haven't created such a thing yet though (will add an enhancement request to do so). The easiest thing to do is to write the entire window of the input pipe into an output pipe, or cast it back to ubyte[] and write directly to an output device. e.g.:

```
auto infile = ...             // one of the above ideas
    .encodeText;              // convert to ubyte
auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // flush the input buffer
... // refill the buffer using the chosen technique above.
```

-Steve
May 15 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb
 would fit but I get an out of memory error when using
 std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes in
 the middle of lines)

 I want to write a program that splits it up into multiple files,
 with the splits happening every n lines. I keep encountering
 roadblocks though:

 - You can't give Yes.useReplacementDchar to `byLine`, and `byLine`
 (or `readln`) throws an Exception upon encountering an invalid
 character.
 - decodeFront doesn't work on inputRanges like
 `byChunk(4096).joiner`
 - std.algorithm.splitter doesn't work on inputRanges either
 - When you convert chunks to arrays, you have the risk of a split
 being in the middle of a character with multiple code units

 Is there a simple way to do this?
If you're on a *nix system, and you're simply looking for a solution to split files and don't necessarily care about writing one, I'd suggest trying the split utility:

https://linux.die.net/man/1/split

If I had to write it in D, I'd probably just use std.mmfile and operate on the file as a dynamic array of ubytes, since if what you care about is '\n', that can easily be searched for without needing any decoding, and using mmap avoids having to chunk anything.

- Jonathan M Davis
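A minimal sketch of the memory-mapped approach, assuming hypothetical file names and the 8-million-line split point mentioned later in the thread; std.mmfile wraps mmap, and the scan treats the data as raw bytes so invalid UTF-8 never matters:

```
import std.mmfile : MmFile;
import std.stdio : File;

void main()
{
    // map the file; the OS pages it in as needed, so 1.5 GB is fine
    auto mmf = new MmFile("inputfile.json");
    auto data = cast(ubyte[]) mmf[];

    // find the end of the first n lines by scanning for '\n' bytes
    enum n = 8_000_000;
    size_t count = 0, splitAt = data.length;
    foreach (i, b; data)
        if (b == '\n' && ++count == n)
        {
            splitAt = i + 1;
            break;
        }

    File("part1.json", "wb").rawWrite(data[0 .. splitAt]);
    File("part2.json", "wb").rawWrite(data[splitAt .. $]);
}
```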
May 15 2018
Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb 
 would fit but I get an out of memory error when using 
 std.file.read)
 - It is dirty (contains invalid Unicode characters, null bytes 
 in the middle of lines)

 I want to write a program that splits it up into multiple 
 files, with the splits happening every n lines. I keep 
 encountering roadblocks though:

 - You can't give Yes.useReplacementDchar to `byLine`, and 
 `byLine` (or `readln`) throws an Exception upon encountering an 
 invalid character.
Can you show the program you are using that throws when using byLine? I tried a very simple program that reads and outputs line-by-line, then fed it a file that contained invalid utf-8. I did not see an exception.

The invalid utf-8 was created by taking part of this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a commonly used file with utf-8 edge cases), plus adding a number of random hex characters, including null. I don't see exceptions thrown. The program I used:

```
int main(string[] args)
{
    import std.stdio;
    import std.conv : to;

    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes)) write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
```
May 15 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
 Can you show the program you are using that throws when using 
 byLine?
Here's a version that only outputs the first chunk:

```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
    enforce(args.length == 2, "Pass one filename as argument");
    auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
    new File("output.txt", "w").write(lineChunks.front.joiner);
}
```

```
dmd splitFile -g
./splitFile.exe UTF-8-test.txt

std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
----------------
0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)
```
May 16 2018
drug <drug2004 bk.ru> writes:
On 16.05.2018 10:06, Dennis wrote:
 
 Here's a version that only outputs the first chunk:
 ```
 import std.stdio;
 import std.range;
 import std.algorithm;
 import std.file;
 import std.exception;
 
 void main(string[] args) {
      enforce(args.length == 2, "Pass one filename as argument");
      auto lineChunks = File(args[1], 
 "r").byLine.drop(4).chunks(10_000_000/10);
      new File("output.txt", "w").write(lineChunks.front.joiner);
 }
 ```
 
 dmd splitFile -g
 ./splitFile.exe UTF-8-test.txt
 
 std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
 ----------------
 0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
 0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
 0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
 0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)
What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for the exception.
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
 What is the purpose of `.drop(4)`? I'm pretty sure this is the 
 reason of the exception.
The file in question is a .json database dump with an array "rows" of 10 million 8-line objects. The newlines in the string fields are escaped, but they still contain other invalid characters which makes std.json reject it. The first 4 lines of the file are basically "header" and the last 2 lines are a closing ] and }, so I want to split every 4 + 8*(10_000_000/amountOfFiles) lines and also remove the trailing comma, add brackets, drop the last 2 lines etc.

I thought it wouldn't be hard to crudely split this file using D's range functions and basic string manipulation, but the combination of being too large for a string and having invalid encoding seems to defeat most simple solutions. For now I decided to use Git Bash and do:

```
tail -n80000002 inputfile.json | split -l 8000000 - outputfile
```

And now I have files that do fit in memory. I'm still interested in complete D solutions though, thanks for the iopipe and memory mapped file suggestions Steven and Jonathan. I will check those out.
May 16 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, May 16, 2018 08:57:10 Dennis via Digitalmars-d-learn wrote:
 I thought it wouldn't be hard to crudely split this file using
 D's range functions and basic string manipulation, but the
 combination of being too large for a string and having invalid
 encoding seems to defeat most simple solutions.
D is designed with the idea that a string is valid UTF-8, a wstring is valid UTF-16, and a dstring is valid UTF-32. For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't.

If you're ever dealing with a different encoding (or with invalid Unicode), you really need to use integral types like ubyte (e.g. by using std.string.representation or by reading the data in as ubytes rather than as a string) and not try to use character types like char or string. If you try to use char or string with invalid UTF-8 without having it throw any exceptions, you're pretty much guaranteed to fail.

- Jonathan M Davis
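A minimal sketch of the ubyte route, assuming a hypothetical file name; byChunk yields ubyte[], so nothing in this loop ever decodes or validates UTF-8:

```
import std.stdio;

void main()
{
    // scan for '\n' as a plain byte; invalid UTF-8 passes through untouched
    size_t lines = 0;
    foreach (ubyte[] chunk; File("inputfile.json", "r").byChunk(4096))
        foreach (b; chunk)
            if (b == '\n')
                ++lines;
    writeln(lines, " lines");
}
```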
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 For various reasons, that doesn't always hold true like it 
 should, but pretty much all of Phobos is written with that 
 assumption and will generally throw an exception if it isn't.
It's unfortunate that Phobos tells you 'there's problems with the encoding' without providing any means to fix it or even diagnose it. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.

On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 If you're ever dealing with a different encoding (or with 
 invalid Unicode), you really need to use integral types like 
 ubyte
I tried something like `byChunk(4096).joiner.splitter(cast(ubyte) '\n')` but it turns out splitter wants at least a forward range, even when the separator is a single element.
May 17 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote:
 On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 For various reasons, that doesn't always hold true like it
 should, but pretty much all of Phobos is written with that
 assumption and will generally throw an exception if it isn't.
 It's unfortunate that Phobos tells you 'there's problems with the encoding' without providing any means to fix it or even diagnose it. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.
UTFException has a sequence member and a len member (which appear to be public but undocumented) which should contain the invalid sequence of code units.

In general though, exceptions aren't a great way to deal with this problem. I think that you either want to be calling decode manually (in which case, you have direct access to where the invalid Unicode is and have the freedom to deal with it however is appropriate), or using the Unicode replacement character would be better (which std.utf.decode supports, but it's not what's used by default).

Really, what's biting you here is the auto-decoding. With Phobos, you have to fight to have it not happen by doing stuff like special-casing your code for strings or using std.string.representation or std.utf.byCodeUnit. In principle, the way that Unicode would ideally be handled would be to validate all character data when it enters the program (doing whatever is appropriate with invalid Unicode at that point), and then the rest of the program either is always dealing with valid Unicode, or it's dealing with integral values that it doesn't treat as Unicode (e.g. ubyte[]). But the way that Phobos is written, it ends up decoding and validating all over the place.
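A minimal sketch of the manual decode route, with a hypothetical input string; with Yes.useReplacementDchar, std.utf.decode returns U+FFFD for bad sequences instead of throwing:

```
import std.typecons : Yes;
import std.utf : decode;

void main()
{
    string s = "abc\xFFdef"; // \xFF is not valid UTF-8
    size_t i = 0;
    while (i < s.length)
    {
        // never throws: invalid sequences decode to the replacement character
        dchar c = decode!(Yes.useReplacementDchar)(s, i);
        // ... handle c ('\uFFFD' for the bad byte) ...
    }
}
```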
 On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
 If you're ever dealing with a different encoding (or with
 invalid Unicode), you really need to use integral types like
 ubyte
 I tried something like `byChunk(4096).joiner.splitter(cast(ubyte) '\n')` but it turns out splitter wants at least a forward range, even when the separator is a single element.
Actually, I'm pretty sure that splitter currently requires a random-access range (even though it should theoretically work with a forward range). I don't think that it can be made to work with an input range though given how the range API works - or at least, if it were made to work with it, you'd have to deal with the fact that popping front on the splitter range would invalidate anything that had been returned from front. And it would be difficult to implement it safely if what gets returned by front is not completely independent of the splitter range (which means that it needs save).

Basic input ranges in general tend to be extremely limited in what they can do, which can get really annoying when you deal with stuff like files or sockets where making it a forward range likely means either reading it all into memory or having buffers that potentially have to be dup-ed by each call to save.

- Jonathan M Davis
May 17 2018
Dennis <dkorpel gmail.com> writes:
On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
 It's unfortunate that Phobos tells you 'there's problems with 
 the encoding' without providing any means to fix it or even 
 diagnose it.
I have to take that back since I found out about std.encoding, which has functions like `sanitize`, but also `transcode`. (My file turned out to actually be encoded with ANSI / Windows-1252, not UTF-8.) Documentation is scarce however, and it requires strings instead of forward ranges. (A transcoding sketch follows at the end of this message.)

Jon Degenhardt wrote:
 Instead of:
 
      auto outputFile = new File("output.txt");
 
 try:
 
     auto outputFile = File("output.txt", "w");
Wow, I really butchered that code. So it is the `drop(4)` that triggers the UTFException? I find Exceptions in range code hard to interpret.

Kagamin wrote:
 Do it old school?
I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too) it seems. Thanks for the example.
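A minimal sketch of the `transcode` route, assuming hypothetical file names and that the input really is Windows-1252 (as it turned out to be here):

```
import std.encoding : transcode, Windows1252String;
import std.file : read, write;

void main()
{
    // reinterpret the raw bytes as Windows-1252; no UTF-8 validation happens
    auto src = cast(Windows1252String) read("input.txt");

    string utf8;
    transcode(src, utf8); // convert Windows-1252 to UTF-8
    write("output.txt", utf8);
}
```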
May 21 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
 On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
 It's unfortunate that Phobos tells you 'there's problems with
 the encoding' without providing any means to fix it or even
 diagnose it.
 I have to take that back since I found out about std.encoding which has functions like `sanitize`, but also `transcode`. (My file turned out to actually be encoded with ANSI / Windows-1252, not UTF-8.) Documentation is scarce however, and it requires strings instead of forward ranges.

 Jon Degenhardt wrote:
 Instead of:
      auto outputFile = new File("output.txt");

 try:
     auto outputFile = File("output.txt", "w");
 Wow, I really butchered that code. So it is the `drop(4)` that triggers the UTFException?
drop is range-based, so if you give it a string, it's going to decode because of the whole auto-decoding mess with std.range.primitives.front and popFront. If you can't have auto-decoding, you either have to be dealing with functions that you know avoid it, or you need to use something like std.string.representation or std.utf.byCodeUnit to get around the auto-decoding.

If you're dealing with invalid Unicode, you basically have to either convert it all up front or do something like treating it as binary data, or Phobos is going to try to decode it as Unicode and give you UTFExceptions.
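A minimal sketch of the `representation` workaround, with a hypothetical input; it works with splitter because immutable(ubyte)[] is a random-access range and nothing auto-decodes:

```
import std.algorithm : splitter;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string dirty = "one\ntwo\xFFtwo\nthree"; // \xFF is not valid UTF-8
    // representation returns immutable(ubyte)[]: same bytes, no decoding
    foreach (line; dirty.representation.splitter(cast(ubyte) '\n'))
        writeln(cast(string) line); // writes the raw bytes of each line as-is
}
```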
 I find Exceptions in range code hard to interpret.
Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.

The big problem here really is that all you're really being told is that your string has invalid Unicode in it somewhere, along with the chain of function calls that resulted in std.utf.decode being called on your invalid Unicode. But even if you weren't dealing with ranges, if you passed invalid Unicode to something completely string-based which did decoding, you'd run into pretty much the same problem. The data is being used outside of its original context where you could easily figure out what it relates to, so it's going to be a problem by its very nature. The only real solution there is to be controlling the decoding yourself, and even then, it's easy to be in a position where it's hard to figure out where in the data the bad data is unless you've done something like keep track of exactly what index you're at, which really doesn't work well once you're dealing with slicing data.
 Kagamin wrote:

 Do it old school?
 I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too) it seems. Thanks for the example.
The whole auto-decoding mess makes things worse than they should be, but if you find procedural examples more flexible, then I would guess that that would be simply a matter of getting more experience with ranges. Ranges are far more composable in terms of how they're used, which tends to inherently make them more flexible. However, it does result in code that's a mixture of functional and procedural programming, which can be quite a shift for some folks. So, there's no question that it takes some getting used to, but D does allow for the more classic approaches, and ranges are not always the best approach.

As for performance, that depends on the code and the compiler. It wouldn't surprise me if dmd didn't optimize out the range stuff as much as it really should, but it's my understanding that ldc typically manages to generate code where the range abstraction didn't cost you anything. If there's an issue, I think that it's frequently an algorithmic one or the fact that some range-processing has a tendency to process the same data multiple times, because that's the easiest, most abstract way to go about it and works in general but isn't always the best solution.

For instance, because of how the range API works, when using splitter, if you iterate through the entire range, you pretty much have to iterate through it twice, because it does look-ahead to find the delimiter and then returns you a slice up to that point, after which, you process that chunk of the data to do whatever it is you want to do with each split piece. At a conceptual level, what you're doing with your code with splitter is then really clean and easy to write, and often, it should be plenty efficient, but it does require going over the data twice, whereas if you looped over the data yourself, looking for each delimiter, you'd only need to iterate over it once. So, in cases like that, I'd fully expect the abstraction to cost you, though whether it costs enough to matter depends on what you're doing.

As is the case when dealing with most abstractions, I think that it's mostly a matter of using it where it makes sense to write cleaner code more quickly and then later figuring out the hot spots where you need to optimize better. In many cases, ranges will be pretty much the same as writing loops, and in others, the abstraction is worth the cost. Where it isn't, you don't use them or implement something yourself rather than using the standard function for it, because you can write something faster for your use case. Just the other day, I refactored some code to not use splitter, because in that particular case, it was costing too much, but there are still tons of cases where I'd use splitter without thinking twice about it, because it's the simplest, fastest way to get the job done, and it's going to be fast enough in most cases.

- Jonathan M Davis
May 21 2018
Dennis <dkorpel gmail.com> writes:
On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn 
 wrote:
 drop is range-based, so if you give it a string, it's going to 
 decode because of the whole auto-decoding mess with 
 std.range.primitives.front and popFront.
In this case I used drop to drop lines, not characters. The exception was thrown by the joiner, it turns out.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 I find Exceptions in range code hard to interpret.
 Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.
Exactly that: a stack trace full of weird mangled names of template functions, lambdas etc. And because of lazy evaluation and chains of range functions, the line number doesn't easily show who the culprit is.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
 In many cases, ranges will be pretty much the same as writing 
 loops, and in others, the abstraction is worth the cost.
From the benchmarking I did, I found that ranges are easily an order of magnitude slower even with compiler optimizations:

https://run.dlang.io/gist/5f243ca5ba80d958c0bc16d5b73f2934?compiler=ldc&args=-O3%20-release

```
LDC -O3 -release
            Range     Procedural
Stringtest: ["267ns", "11ns"]
Numbertest: ["393ns", "153ns"]

DMD -O -inline -release
            Range     Procedural
Stringtest: ["329ns", "8ns"]
Numbertest: ["1237ns", "282ns"]
```

This first range test is an opcode scanner I wrote for an assembler. The range code is very nice and it works, but it needlessly allocates a new string. So I switched to a procedural version, which runs (and compiles) faster. This procedural version did have some bugs initially though.

The second test is a simple number calculation. I thought that the range code inlines to roughly the same procedural code so it could be optimized the same, but there remains a factor 2 gap. I don't know where the difficulty is, but I did notice that switching the maximum number from int to enum makes the procedural version 0 ns (calculated at compile time) while LDC can't deduce the outcome in the range version (which still runs for >300 ns).
May 21 2018
Jon Degenhardt <jond noreply.com> writes:
On Monday, 21 May 2018 at 15:00:09 UTC, Dennis wrote:
 I want to be convinced that Range programming works like a 
 charm, but the procedural approaches remain more flexible (and 
 faster too) it seems. Thanks for the example.
On Monday, 21 May 2018 at 22:11:42 UTC, Dennis wrote:
 In this case I used drop to drop lines, not characters. The 
 exception was thrown by the joiner it turns out.
  ...
 From the benchmarking I did, I found that ranges are easily an 
 order of magnitude slower even with compiler optimizations:
My general experience is that range programming works quite well. It's especially useful when used to do lazy processing and as a result minimize memory allocations. I've gotten quite good performance with these techniques (see my DConf talk slides: https://dconf.org/2018/talks/degenhardt.html).

Your benchmarks are not against the file split case, but if you benchmarked that you may have also seen it as slow. In that case you may be hitting specific areas where there are opportunities for performance improvement in the standard library. One is that joiner is slow (PR: https://github.com/dlang/phobos/pull/6492). Another is that the write[fln] routines are much faster when operating on a single large object than many small objects. e.g. It's faster to call write[fln] with an array of 100 characters than: (a) calling it 100 times with one character; (b) calling it once, with 100 characters as individual arguments (template form); (c) calling it once with a range of 100 characters, each processed one at a time.

When joiner is used as in your example, you not only hit the joiner performance issue, but the write[fln] issue. This is due to something that may not be obvious at first: when joiner is used to concatenate arrays or ranges, it flattens out the array/range into a single range of elements. So, rather than writing a line at a time, your example is effectively passing a character at a time to write[fln].

So, in the file split case, using byLine in an imperative fashion as in my example will have the effect of passing a full line at a time to write[fln], rather than individual characters. Mine will be faster, but not because it's imperative. The same thing could be achieved procedurally.

Regarding the benchmark programs you showed - this is very interesting. It would certainly be worth additional looks into this. One thing I wonder is if the performance penalty may be due to a lack of inlining due to crossing library boundaries. The imperative versions aren't crossing these boundaries. If you're willing, you could try adding LDC's LTO options and see what happens. There are some instructions in the release notes for LDC 1.9.0 (https://github.com/ldc-developers/ldc/releases). Make sure you use the form that includes druntime and phobos.

--Jon
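A small sketch of the flattening Jon describes, with hypothetical data; joiner turns a range of line slices into a range of individual characters, so write() no longer sees whole lines:

```
import std.algorithm : joiner;
import std.stdio : File;

void main()
{
    auto lines = ["first line\n", "second line\n"];
    auto f = File("out.txt", "w");

    // one buffered write call per whole line slice
    foreach (line; lines)
        f.write(line);

    // same bytes, but the joined range is consumed one character at a time
    f.write(lines.joiner);
}
```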
May 21 2018
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote:
 On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
 Can you show the program you are using that throws when using 
 byLine?
 Here's a version that only outputs the first chunk:

 ```
 import std.stdio;
 import std.range;
 import std.algorithm;
 import std.file;
 import std.exception;

 void main(string[] args) {
     enforce(args.length == 2, "Pass one filename as argument");
     auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
     new File("output.txt", "w").write(lineChunks.front.joiner);
 }
 ```
If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid utf-8 characters.
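A minimal sketch of that style, with hypothetical file names; byLine hands out raw char[] slices without decoding them, a counter decides when to start the next output file, and rawWrite avoids any validation on output:

```
import std.conv : to;
import std.stdio;

void main(string[] args)
{
    enum linesPerFile = 8_000_000;
    auto input = File(args[1], "r");
    File output;
    size_t lineNum = 0;

    foreach (line; input.byLine(KeepTerminator.yes))
    {
        // start a new output file every linesPerFile lines
        if (lineNum % linesPerFile == 0)
            output = File("part" ~ to!string(lineNum / linesPerFile) ~ ".txt", "wb");
        output.rawWrite(line);
        ++lineNum;
    }
}
```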
May 16 2018
Dennis <dkorpel gmail.com> writes:
On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
 If you write it in the style of my earlier example and use 
 counters and if-tests it will work. byLine by itself won't try 
 to interpret the characters (won't auto-decode them), so it 
 won't trigger an exception if there are invalid utf-8 
 characters.
When printing to stdout it seems to skip any validation, but writing to a file does give an exception:

```
auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
```

```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877): (No error)
```

According to the documentation, byLine can throw a UTFException so relying on the fact that it doesn't in some cases doesn't seem like a good idea.
May 17 2018
Jon Degenhardt <jond noreply.com> writes:
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
 On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
 If you write it in the style of my earlier example and use 
 counters and if-tests it will work. byLine by itself won't try 
 to interpret the characters (won't auto-decode them), so it 
 won't trigger an exception if there are invalid utf-8 
 characters.
 When printing to stdout it seems to skip any validation, but writing to a file does give an exception:

 ```
 auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
 auto outputFile = new File("output.txt");
 foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
 ```

 std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877): (No error)

 According to the documentation, byLine can throw a UTFException so relying on the fact that it doesn't in some cases doesn't seem like a good idea.
Instead of:

```
auto outputFile = new File("output.txt");
```

try:

```
auto outputFile = File("output.txt", "w");
```

That works for me. The second arg ("w") opens the file for write. When I omit it, I also get an exception, as the default open mode is for read:

* If file does not exist: Cannot open file `output.txt' in mode `rb' (No such file or directory)
* If file does exist: (Bad file descriptor)

The second error presumably occurs when writing.

As an aside - I agree with one of your bigger picture observations: it would be preferable to have more control over utf-8 error handling behavior at the application level.
May 17 2018
Kagamin <spam here.lot> writes:
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
 ```
     auto inputStream = (args.length < 2 || args[1] == "-") ? 
 stdin : args[1].File;
 	auto outputFile = new File("output.txt");
     foreach (line; inputStream.byLine(KeepTerminator.yes)) 
 outputFile.write(line);
 ```
Do it old school?

```
import std.algorithm : countUntil;
import std.stdio;

void main(string[] args)
{
    auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
    int line;
    auto outputFile = File("output.txt", "wb");
    foreach (chunk; inputStream.byChunk(4 << 10))
    {
        auto rem = chunk;
        while (rem != null)
        {
            auto i = rem.countUntil(10); // index of '\n', or -1 if none in this chunk
            auto len = i + 1;
            if (i < 0) len = rem.length; // no newline: write the rest of the chunk
            else line++;                 // completed a line
            outputFile.rawWrite(rem[0 .. len]);
            rem = rem[len .. $];
        }
    }
}
```
May 18 2018
Neia Neutuladh <neia ikeran.org> writes:
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
 I have a file with two problems:
 - It's too big to fit in memory (apparently, I thought 1.5 Gb 
 would fit but I get an out of memory error when using 
 std.file.read)
Memory mapping should work. That's in core.sys.posix.sys.mman for Posix systems, and Windows has some equivalent probably. (But nobody uses Windows, right?)
 - It is dirty (contains invalid Unicode characters, null bytes 
 in the middle of lines)
std.algorithm should generally work with sequences of anything, not just strings. So memory map, cast to ubyte[], and deal with it that way?
 - When you convert chunks to arrays, you have the risk of a 
 split being in the middle of a character with multiple code 
 units
It's straightforward to scan for the start of a Unicode character; you just skip past characters where the highest bit is set and the next-highest is not. (0b1100_0000 through 0b1111_1110 is the start of a multibyte character; 0b0000_0000 through 0b0111_1111 is a single-byte character.) That said, you seem to only need to split based on a newline character, so you might be able to ignore this entirely, even if you go by chunks.
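A tiny sketch of that bit test; a byte begins a UTF-8 sequence unless it is a continuation byte of the form 0b10xx_xxxx:

```
// true if b is the first byte of a UTF-8 sequence (single- or multi-byte)
bool startsCodePoint(ubyte b)
{
    return (b & 0b1100_0000) != 0b1000_0000;
}

unittest
{
    assert(startsCodePoint('a'));          // ASCII byte
    assert(startsCodePoint(0b1100_0010));  // lead byte of a 2-byte sequence
    assert(!startsCodePoint(0b1011_0000)); // continuation byte
}
```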
May 17 2018
ag0aep6g <anonymous example.com> writes:
On 05/17/2018 11:40 PM, Neia Neutuladh wrote:
 0b1100_0000 through 0b1111_1110 is the start of a 
 multibyte character
Nitpick: It only goes up to 0b1111_0100. The highest code point is U+10FFFF. There are no sequences with more than four bytes.
May 17 2018