digitalmars.D.learn - Prevent opening binary/other garbage files
- helxi (6/6) Sep 29 2018 I'm writing a utility that checks for specific keyword(s) found
- Adam D. Ruppe (14/20) Sep 29 2018 Simplest might be to read the first few bytes (like couple
- helxi (5/18) Sep 29 2018 Thanks. Would you say
- Adam D. Ruppe (2/5) Sep 29 2018 Eh, not really, most text files will not have one.
- helxi (78/83) Oct 01 2018 Hi,
- Adam D. Ruppe (11/15) Oct 01 2018 Yes. Any random collection of bytes <= 127 is valid utf-8. Lines
- bauss (11/17) Sep 29 2018 What I would do is read the frist 512 bytes and the last 512
I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file? If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?
Sep 29 2018
On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?

Simplest might be to read the first few bytes (a couple hundred, probably) and if any of them are < 32 && != '\t' && != '\r' && != '\n' && != 0, there's a good chance it is a binary file. Text files frequently contain tabs and newlines, but rarely the other low bytes. If you do find a bunch of 0's, but not the other values, you might have a UTF-16 file.

> If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?

For text on POSIX systems they are most likely going to be UTF-8, and you can try Phobos' readText function. It will throw if it encounters non-UTF-8, so you catch that and move on to the next file. But the simpler check described above will probably work too, and it can read less of the file.
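A minimal sketch of that byte check in D (the looksBinary name and the 256-byte sample size are illustrative, not from the post):

import std.stdio : File;

// True if the first few hundred bytes contain a control character other than
// tab, CR, LF or NUL -- a strong hint that the file is binary.
bool looksBinary(string path)
{
    ubyte[256] buf;
    auto data = File(path, "rb").rawRead(buf[]);
    foreach (b; data)
        if (b < 32 && b != '\t' && b != '\r' && b != '\n' && b != 0)
            return true;
    return false;
}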
Sep 29 2018
On Saturday, 29 September 2018 at 16:01:18 UTC, Adam D. Ruppe wrote:
> On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
>> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?
>
> Simplest might be to read the first few bytes (a couple hundred, probably) and if any of them are < 32 && != '\t' && != '\r' && != '\n' && != 0, there's a good chance it is a binary file. Text files frequently contain tabs and newlines, but rarely the other low bytes. If you do find a bunch of 0's, but not the other values, you might have a UTF-16 file.

Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?
Sep 29 2018
On Saturday, 29 September 2018 at 23:46:26 UTC, helxi wrote:
> Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?

Eh, not really; most text files will not have one.
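For reference, checking for a BOM is only a few lines with std.encoding.getBOM, though, as noted, it reports BOM.none for most text files (the detectBOM helper below is illustrative, not from the thread):

import std.encoding : getBOM, BOM;
import std.file : read;

// Report which BOM, if any, a file starts with. Plain POSIX text files
// will almost always come back as BOM.none.
BOM detectBOM(string path)
{
    // the longest recognised BOMs are only a handful of bytes
    auto bytes = cast(ubyte[]) read(path, 8);
    return getBOM(bytes).schema;
}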
Sep 29 2018
On Sunday, 30 September 2018 at 03:19:11 UTC, Adam D. Ruppe wrote:
> On Saturday, 29 September 2018 at 23:46:26 UTC, helxi wrote:
>> Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?
>
> Eh, not really; most text files will not have one.

Hi,
I tried out https://dlang.org/library/std/utf/validate.html before manually checking the encoding myself, so I ended up with the code below. I was fairly surprised that "*.o" (object) files are UTF encoded! Is that normal?

import std.stdio : File, lines, stdout;

void panic(in string message, int exitCode = 1)
{
    import core.stdc.stdlib : exit;
    import std.stdio : stderr, writeln;

    stderr.writeln(message);
    exit(exitCode);
}

void writeFunc(ulong occurrenceNumber, ulong lineNumber, in ref string fileName,
        in ref string line, File ofile = stdout)
{
    import std.stdio : writef;

    ofile.writef("%s: L:%s: F:\"%s\":\n%s\n", occurrenceNumber, lineNumber, fileName, line);
}

void traverseDirectories(in string path, in string term)
in
{
    import std.file : isDir;

    if (!isDir(path))
        panic("Cannot access directory: " ~ path);
}
do
{
    import std.file : dirEntries, SpanMode;

    ulong occurrenceNumber, filesChecked, filesIgnored; // default-initialized to 0
    File currentFile;
    foreach (string fileName; dirEntries(path, SpanMode.breadth))
    {
        try
        {
            currentFile = File(fileName, "r");
            ++filesChecked;
            foreach (ulong lineNumber, string currentLine; lines(currentFile))
            {
                if (lineNumber == 0)
                {
                    // check whether the file is encoded as proper UTF;
                    // if line 0 is not UTF encoded, move on to the next file
                    // (I hope the compiler unrolls this if condition)
                    import std.utf : validate;

                    validate(currentLine); // throws if the line is not valid UTF
                }
                import std.algorithm : canFind;

                if (canFind(currentLine, term))
                {
                    writeFunc(++occurrenceNumber, lineNumber, fileName, currentLine);
                }
            }
        }
        catch (Exception e)
        {
            filesIgnored++;
        }
    }

    // summarize
    import std.stdio : writefln;

    writefln("Total match found:\t%s\nTotal files checked:\t%s\nTotal files ignored:\t%s\n",
            occurrenceNumber, filesChecked, filesIgnored);
}

void main(string[] args)
{
    import std.getopt : getopt;

    string term, directory;
    getopt(args, "term|t", &term, "directory|d", &directory);
    if (!directory)
    {
        // if no directory is specified, start from the current directory
        import std.file : getcwd;

        directory = getcwd();
    }
    if (!term)
        panic("Term not specified.");
    traverseDirectories(directory, term);
}

/* Output: https://pastebin.com/PZ8nCaYf */
Oct 01 2018
On Monday, 1 October 2018 at 15:21:24 UTC, helxi wrote:
> I tried out https://dlang.org/library/std/utf/validate.html before manually checking the encoding myself, so I ended up with the code below. I was fairly surprised that "*.o" (object) files are UTF encoded! Is that normal?

Yes. Any random collection of bytes <= 127 is valid UTF-8. lines will read until it sees a byte 10 and cuts off there. Quite a few file formats put a 10 early on to detect text/binary transmission corruption, but even if they don't, it is a fairly common byte to see before long, and that cuts off your scan for later bytes.

You really are better off looking for those < 32 bytes like I described earlier - a .o file will likely have some 1's and 3's early on, which that check will quickly detect, but which will also pass the validate test.
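To illustrate the point (this snippet is not from the thread): validate only rejects malformed UTF sequences, so ASCII control characters sail right through, while the low-byte scan flags them.

import std.utf : validate;
import std.algorithm : any;

void main()
{
    // Control characters are well-formed UTF-8, so validate() accepts this
    // even though it is clearly not ordinary text.
    string s = "\x01\x02\x03 not really text";
    validate(s); // does not throw

    // The control-character scan described above does flag it.
    bool suspicious = s.any!(c => c < 32 && c != '\t' && c != '\r' && c != '\n' && c != 0);
    assert(suspicious);
}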
Oct 01 2018
On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file? If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?

What I would do is read the first 512 bytes and the last 512 bytes, and if over 50% of those bytes are below 32 and not 8, 9, 10, 11, 12 or 13, then chances are you have a binary file. There is nothing stopping someone from writing "invalid" bytes into a text file, though; there are no limitations on what a file can hold, and the system generally treats all files the same.

The reason I recommend reading the first 512 and the last 512 bytes is that some binary files may contain legitimate text strings, so by sampling two places chances are you won't hit two segments that both look like text.
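A rough sketch of that sampling heuristic in D (the probablyBinary name, buffer sizes and 50% threshold are illustrative, not from the post):

import std.stdio : File;
import core.stdc.stdio : SEEK_END;

// Sample up to 512 bytes from the start and the end of the file and call it
// binary if more than half of the sampled bytes are control characters other
// than bytes 8-13 (backspace, tab, LF, VT, FF, CR).
bool probablyBinary(string path)
{
    auto f = File(path, "rb");
    ubyte[512] head, tail;
    auto headData = f.rawRead(head[]);

    ubyte[] tailData;
    if (f.size > 512)
    {
        f.seek(-512, SEEK_END);
        tailData = f.rawRead(tail[]);
    }

    size_t total, suspicious;
    foreach (b; headData ~ tailData)
    {
        ++total;
        if (b < 32 && (b < 8 || b > 13))
            ++suspicious;
    }
    return total != 0 && suspicious * 2 > total;
}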
Sep 29 2018