digitalmars.D.learn - Checking, whether string contains only ascii.
- berni (7/12) Feb 22 2017 In my program, I read a postscript file. Normal postscript files
- H. S. Teoh via Digitalmars-d-learn (21/27) Feb 22 2017 [...]
- H. S. Teoh via Digitalmars-d-learn (12/26) Feb 22 2017 [...]
- jklm (7/20) Feb 22 2017 void foo(string postscript)
- jklm (2/24) Feb 22 2017 \s postscript args[0]
- Adam D. Ruppe (4/6) Feb 22 2017 Easiest:
- aberba (3/9) Feb 22 2017 :)
- ag0aep6g (21/34) Feb 22 2017 Making full use of the standard library:
- Ali Çehreli (12/47) Feb 22 2017 One more:
- kinke (6/17) Feb 22 2017 One more again as I couldn't believe no one went for 'any' yet:
- H. S. Teoh via Digitalmars-d-learn (9/15) Feb 22 2017 You win 1 intarwebs for the shortest solution posted so far. ;-)
In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?

Here is a sketch of my function:

    void foo(string postscript)
    {
        // throw Exception, if postscript is not all ascii

        // other stuff, assuming codeunit=codepoint
    }
Feb 22 2017
On Wed, Feb 22, 2017 at 07:26:15PM +0000, berni via Digitalmars-d-learn wrote:
> In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?
[...]

Hmm... What about:

    import std.range.primitives;

    bool isAsciiOnly(R)(R input)
        if (isInputRange!R && is(ElementType!R : dchar))
    {
        import std.algorithm.iteration : fold;
        return input.fold!((a, b) => a && b < 0x80)(true);
    }

    unittest
    {
        assert(isAsciiOnly("abcdefg"));
        assert(!isAsciiOnly("abcбвг"));
    }

Basically, it iterates over the string / range of characters and checks that every character is less than 0x80, since anything that's 0x80 or greater cannot be ASCII.

T

-- 
INTEL = Only half of "intelligence".
Feb 22 2017
On Wed, Feb 22, 2017 at 11:43:00AM -0800, H. S. Teoh via Digitalmars-d-learn wrote:
[...]
> import std.range.primitives;
>
> bool isAsciiOnly(R)(R input)
>     if (isInputRange!R && is(ElementType!R : dchar))
> {
>     import std.algorithm.iteration : fold;
>     return input.fold!((a, b) => a && b < 0x80)(true);
> }
>
> unittest
> {
>     assert(isAsciiOnly("abcdefg"));
>     assert(!isAsciiOnly("abcбвг"));
> }
[...]

Ah, missing the Exception part:

    void foo(string input)
    {
        if (!input.isAsciiOnly)
            throw new Exception("...");
    }

T

-- 
Why are you blatanly misspelling "blatant"? -- Branden Robinson
Feb 22 2017
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?
>
> Here is a sketch of my function:
>
> void foo(string postscript)
> {
>     // throw Exception, if postscript is not all ascii
>
>     // other stuff, assuming codeunit=codepoint
> }

    void foo(string postscript)
    {
        import std.ascii, std.algorithm.iteration;
        if (!args[0].filter!(a => !isASCII(a)).empty)
            throw new Exception("bla");
    }
Feb 22 2017
On Wednesday, 22 February 2017 at 19:57:22 UTC, jklm wrote:
> On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> > In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?
> >
> > Here is a sketch of my function:
> >
> > void foo(string postscript)
> > {
> >     // throw Exception, if postscript is not all ascii
> >
> >     // other stuff, assuming codeunit=codepoint
> > }
>
> void foo(string postscript)
> {
>     import std.ascii, std.algorithm.iteration;
>     if (!postscript.filter!(a => !isASCII(a)).empty)
>         throw new Exception("bla");
> }

s/args[0]/postscript/
Feb 22 2017
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> Therefore I'd like to make sure that the string the program read is only made up of ascii characters.

Easiest:

    foreach(char ch; postscript)
        if(ch > 127)
            throw new Exception("non-ascii detected");
Feb 22 2017
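A minimal sketch of the foreach approach above, extended so that the exception says where the first offending byte sits. The name `enforceAscii` and the wording of the message are assumptions made for illustration, not something posted in the thread:

```d
import std.conv : to;

/// Hypothetical helper: throws if `postscript` contains a non-ASCII code unit.
void enforceAscii(string postscript)
{
    // Iterating with `char` visits the UTF-8 code units (bytes) without
    // decoding; `i` is the byte offset into the string.
    foreach (i, char ch; postscript)
    {
        if (ch > 127)
            throw new Exception("non-ASCII byte at offset " ~ i.to!string);
    }
}

unittest
{
    import std.exception : assertNotThrown, assertThrown;

    assertNotThrown(enforceAscii("%!PS-Adobe-3.0"));
    assertThrown(enforceAscii("abcбвг"));
}
```

The loop stops at the first byte above 127, so a large file with an early non-ASCII byte is rejected quickly.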
On Wednesday, 22 February 2017 at 20:01:57 UTC, Adam D. Ruppe wrote:
> On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> > Therefore I'd like to make sure that the string the program read is only made up of ascii characters.
>
> Easiest:
>
> foreach(char ch; postscript)
>     if(ch > 127)
>         throw new Exception("non-ascii detected");

:)
Feb 22 2017
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?
>
> Here is a sketch of my function:
>
> void foo(string postscript)
> {
>     // throw Exception, if postscript is not all ascii
>
>     // other stuff, assuming codeunit=codepoint
> }

Making full use of the standard library:

----
import std.algorithm: all;
import std.ascii: isASCII;
import std.exception: enforce;

enforce(postscript.all!isASCII);
----

That checks on the code point level (because strings are ranges of dchars). If you want to be clever, you can avoid decoding and check on the code unit level:

----
/* other imports as above */
import std.utf: byCodeUnit;

enforce(postscript.byCodeUnit.all!isASCII);
----

Or you can do it manually, avoiding all those imports:

----
foreach (char c; postscript)
    if (c > 0x7F)
        throw new Exception("not ASCII");
----
Feb 22 2017
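To make the code point versus code unit distinction above concrete, here is a small sketch that wraps both variants in functions and checks that they agree on valid UTF-8 input. The function names `asciiByCodePoint` and `asciiByCodeUnit` are made up for this example:

```d
import std.algorithm.searching : all;
import std.ascii : isASCII;
import std.utf : byCodeUnit;

/// Decodes each code point (auto-decoding) before testing it.
bool asciiByCodePoint(string s)
{
    return s.all!isASCII;
}

/// Tests the raw UTF-8 code units; no decoding takes place.
bool asciiByCodeUnit(string s)
{
    return s.byCodeUnit.all!isASCII;
}

unittest
{
    // On valid UTF-8 both variants give the same answer.
    foreach (s; ["abcdefg", "", "hellö wörld", "abcбвг"])
        assert(asciiByCodePoint(s) == asciiByCodeUnit(s));
}
```

One difference worth keeping in mind for untrusted files: as far as I know, the decoding variant can throw a UTFException when the input is not valid UTF-8, whereas the code unit variant simply reports false for any byte above 0x7F.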
On 02/22/2017 12:02 PM, ag0aep6g wrote:
> On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
> > In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters. This simplifies the code thereafter, because I can then assume that codeunit==codepoint. Is there a simple way to do so?
> >
> > Here is a sketch of my function:
> >
> > void foo(string postscript)
> > {
> >     // throw Exception, if postscript is not all ascii
> >
> >     // other stuff, assuming codeunit=codepoint
> > }
>
> Making full use of the standard library:
>
> ----
> import std.algorithm: all;
> import std.ascii: isASCII;
> import std.exception: enforce;
>
> enforce(postscript.all!isASCII);
> ----
>
> That checks on the code point level (because strings are ranges of dchars). If you want to be clever, you can avoid decoding and check on the code unit level:
>
> ----
> /* other imports as above */
> import std.utf: byCodeUnit;
>
> enforce(postscript.byCodeUnit.all!isASCII);
> ----
>
> Or you can do it manually, avoiding all those imports:
>
> ----
> foreach (char c; postscript)
>     if (c > 0x7F)
>         throw new Exception("not ASCII");
> ----

One more:

    bool isAscii(string s)
    {
        import std.string : representation;
        import std.algorithm : canFind;
        return !s.representation.canFind!(c => c >= 0x80);
    }

    unittest
    {
        assert(isAscii("hello world"));
        assert(!isAscii("hellö wörld"));
    }

Ali
Feb 22 2017
On Wednesday, 22 February 2017 at 20:07:34 UTC, Ali Çehreli wrote:
> One more:
>
> bool isAscii(string s)
> {
>     import std.string : representation;
>     import std.algorithm : canFind;
>     return !s.representation.canFind!(c => c >= 0x80);
> }
>
> unittest
> {
>     assert(isAscii("hello world"));
>     assert(!isAscii("hellö wörld"));
> }
>
> Ali

One more again as I couldn't believe no one went for 'any' yet:

---
import std.algorithm;
return !s.any!"a > 127"; // code-point level
---
Feb 22 2017
On Wed, Feb 22, 2017 at 09:16:24PM +0000, kinke via Digitalmars-d-learn wrote:
[...]
> One more again as I couldn't believe no one went for 'any' yet:
>
> ---
> import std.algorithm;
> return !s.any!"a > 127"; // code-point level
> ---

You win 1 intarwebs for the shortest solution posted so far. ;-)

Though, according to the OP, an exception is wanted, so it should be more along the lines of:

    enforce(!s.any!"a > 127");

T

-- 
A bend in the road is not the end of the road unless you fail to make the turn. -- Brian White
Feb 22 2017
On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh wrote:
> enforce(!s.any!"a > 127");

Phew, there are lots of possibilities to choose from now... I thought of something like the foreach loop but wasn't sure whether that is correct for all UTF encodings. All in all, I think I'll take the any approach, because it feels a little bit more like looking at the string as a whole, and I like to use enforce.

Thanks for all your answers!
Feb 23 2017
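On the question of whether a per-code-unit check is correct for all UTF encodings: in UTF-8 every byte of a multi-byte sequence has its high bit set, and in UTF-16 both halves of a surrogate pair lie in the range 0xD800 to 0xDFFF, so a code unit below 0x80 always stands for an ASCII character. A sketch that runs the same check over string, wstring and dstring follows; the name `onlyAsciiUnits` is an assumption for this example:

```d
import std.algorithm.searching : all;
import std.utf : byCodeUnit;

/// Hypothetical generic check over the code units of any D string type.
bool onlyAsciiUnits(S)(S s)
{
    // byCodeUnit yields char, wchar or dchar elements without decoding.
    return s.byCodeUnit.all!(c => c < 0x80);
}

unittest
{
    assert(onlyAsciiUnits("plain"));
    assert(onlyAsciiUnits("plain"w));
    assert(onlyAsciiUnits("plain"d));

    assert(!onlyAsciiUnits("hellö"));
    assert(!onlyAsciiUnits("hellö"w));
    assert(!onlyAsciiUnits("hellö"d));

    // U+1F600 takes four bytes in UTF-8 and a surrogate pair in UTF-16.
    assert(!onlyAsciiUnits("\U0001F600"));
    assert(!onlyAsciiUnits("\U0001F600"w));
}
```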
On Thursday, 23 February 2017 at 08:34:53 UTC, berni wrote:
> On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh wrote:
> > enforce(!s.any!"a > 127");
>
> Phew, there are lots of possibilities to choose from now... I thought of something like the foreach loop but wasn't sure whether that is correct for all UTF encodings. All in all, I think I'll take the any approach, because it feels a little bit more like looking at the string as a whole, and I like to use enforce.
>
> Thanks for all your answers!

All the examples given here are very nice. But alas, this will not work with postscript files as found in the wild.

> In my program, I read a postscript file. Normal postscript files should only be composed of ascii characters, but one never knows what users give us. Therefore I'd like to make sure that the string the program read is only made up of ascii characters.

Generally, postscript files may contain binary data. Think of included images or font data. So in postscript files there should normally be no UTF-8 encoded text, but binary data is quite usual. Think of postscript files as a sequence of ubytes.
Feb 23 2017
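Treating the file as a sequence of ubytes, as suggested above, could look roughly like the following sketch. The function name `isSevenBit` is hypothetical, and how the bytes get loaded is left out; `std.file.read` returns a `void[]` that could be cast to `const(ubyte)[]` for this purpose:

```d
import std.algorithm.searching : all;

/// Hypothetical check on raw bytes, before anything is treated as text.
bool isSevenBit(const(ubyte)[] data)
{
    return data.all!(b => b < 0x80);
}

unittest
{
    import std.string : representation;

    // A plain-text header passes ...
    assert(isSevenBit("%!PS-Adobe-3.0\n".representation));
    // ... while bytes with the high bit set do not.
    assert(!isSevenBit([0x25, 0x21, 0xC5, 0xD0]));
}
```

Because the function takes a slice, it can be applied to just the part of the file where binary data must not occur, rather than to the whole file.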
On Thursday, 23 February 2017 at 17:44:05 UTC, HeiHon wrote:
> Generally, postscript files may contain binary data. Think of included images or font data. So in postscript files there should normally be no UTF-8 encoded text, but binary data is quite usual. Think of postscript files as a sequence of ubytes.

As far as I know, images and font data have to be in clean7bit too (they are not human readable though). But postscript files can contain preview images, which can be binary. I know about this. I just tried to keep my question simple -- and actually I'm only testing the part of the postscript file where I know that binary data must not occur.
Feb 23 2017