digitalmars.D - Questions about builtin RegExp
- Andrew Fedoniouk (10/10) Feb 16 2006 1) Will builtin RegExp increase minimal size of D executable?
- Oskar Linde (12/21) Feb 17 2006 No. This was as far as I understood one of the considerations.
- Andrew Fedoniouk (10/29) Feb 17 2006 And what is this opNext() doing exactly?
- Regan Heath (5/30) Feb 17 2006 I think you're thinking inside the box. :)
- Andrew Fedoniouk (11/44) Feb 18 2006 I believe there is a sort of misunderstanding about what scripting is an...
- Regan Heath (13/61) Feb 18 2006 I think there is some overlap, i.e. some scripting tasks do not require ...
- Andrew Fedoniouk (23/42) Feb 18 2006 1) Scripting languages are usually used as built into some other
- Walter Bright (5/13) Feb 18 2006 I agree. But I don't believe that there's anything special about scripti...
- kris (3/5) Feb 18 2006 Really? Do you have some kind of data to back that assertion?
- Walter Bright (12/16) Feb 18 2006 Peer reviewed statistical research studies? Nope. But it's a pretty good...
- Lucas Goss (6/7) Feb 18 2006 I've never used scripting languages for that purpose. The only reason
- Walter Bright (7/16) Feb 17 2006 No.
- Andrew Fedoniouk (42/51) Feb 17 2006 Next questions then:
- Ivan Senji (46/126) Feb 17 2006 Instead of an answer a quick example of what I tried and what works:
- Andrew Fedoniouk (25/151) Feb 17 2006 Thanks, Ivan, see below:
- Ivan Senji (6/25) Feb 17 2006 Naturally, but this was just a see-if-it-can-be-done example. :)
- Andrew Fedoniouk (13/27) Feb 17 2006 :)
- Ivan Senji (23/61) Feb 17 2006 Well it wouldn't be the first time that the documentation is
- Andrew Fedoniouk (8/29) Feb 17 2006 And what is this opNext for then?
- Walter Bright (3/7) Feb 17 2006 m/regex/g => RegExp("regex", "g")
- Walter Bright (8/9) Feb 17 2006 For startsWith(), sure. But if that was all regex was used for, nobody w...
- Walter Bright (16/48) Feb 17 2006 None. Operator overloading requires one object be a class or a struct. B...
- Andrew Fedoniouk (24/77) Feb 17 2006 And this RegExp("string") ~~ "string" is more honest, isn't it?
- Walter Bright (8/27) Feb 18 2006 That doesn't give the match results, though.
- Andrew Fedoniouk (10/27) Feb 18 2006 Who cares in most of cases?
- Walter Bright (9/17) Feb 18 2006 In a very large fraction of cases, it matters. After all, if you are
- Andrew Fedoniouk (25/42) Feb 18 2006 Probably in some Perl-ish use cases this is really so needed.
- Walter Bright (3/9) Feb 19 2006 I'd like to see strtok() parse an email address out of a body of text.
- Andrew Fedoniouk (16/22) Feb 19 2006 I don't really understand "parse an email address out of a body of text....
- Chris Sauls (24/59) Feb 19 2006 I think he meant something more like (using MatchExpr, sorry):
- Unknown W. Brackets (21/52) Feb 19 2006 Andrew Fedoniouk,
- Regan Heath (44/59) Feb 19 2006 Here's how I'd do it:
- Walter Bright (3/4) Feb 19 2006 Yours is a lot of code to do what a regex does. Now recognize a url ...
- Regan Heath (10/15) Feb 19 2006 This is true, though my code is likely faster.
- Georg Wrede (23/83) Feb 20 2006 DISCLAIMER INSERTED WHEN PROOFREADING:
- Georg Wrede (13/39) Feb 20 2006 Had I to do stuff on the M$ "platform", I'd definitely look long and
1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.
2) Is it possible to override operator ~~ ?
3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
4) When is a regexp checked for syntax correctness: at compile time or at runtime?
   "..." ~~ "..."
If ~~ is a part of the language syntax, then one can assume that the expression is getting compiled somehow.

Andrew.
Feb 16 2006
Andrew Fedoniouk wrote:
> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.

No. As far as I understood, this was one of the considerations.

> 2) Is it possible to override operator ~~ ?

Yes. opMatch() and opNext().

> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?

To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.

> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>    "..." ~~ "..."
> If ~~ is a part of the language syntax, then one can assume that the expression is getting compiled somehow.

At runtime. For now, at least. In the future it could possibly be compiled at compile time, but there will always be a need to support run-time regexps anyway.

/Oskar
Feb 17 2006
"Oskar Linde" <olREM OVEnada.kth.se> wrote in message news:dt40sg$29nc$1 digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.
> No. As far as I understood, this was one of the considerations.
>> 2) Is it possible to override operator ~~ ?
> Yes. opMatch() and opNext().

And what is this opNext() doing exactly? The next sub-expression, or the next match from the last position matched (/g)?

>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
> To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.

??? An alternative to a scripting language can be another scripting language. An alternative to a natively compilable language can be another natively compilable language.

>> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>>    "..." ~~ "..."
>> If ~~ is a part of the language syntax, then one can assume that the expression is getting compiled somehow.
> At runtime. For now, at least. In the future it could possibly be compiled at compile time, but there will always be a need to support run-time regexps anyway.

Having "builtin" regexps without strings in the language seems unnatural.

Andrew.
Feb 17 2006
On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk <news terrainformatica.com> wrote:
> "Oskar Linde" <olREM OVEnada.kth.se> wrote in message news:dt40sg$29nc$1 digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.
>> No. As far as I understood, this was one of the considerations.
>>> 2) Is it possible to override operator ~~ ?
>> Yes. opMatch() and opNext().
> And what is this opNext() doing exactly? The next sub-expression, or the next match from the last position matched (/g)?
>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>> To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.
> ??? An alternative to a scripting language can be another scripting language. An alternative to a natively compilable language can be another natively compilable language.

I think you're thinking inside the box. :) With the recent additions, is it not possible to write scripts in D?

Regan
Feb 17 2006
"Regan Heath" <regan netwin.co.nz> wrote in message news:ops45qq5rn23k2f5 nrage.netwin.co.nz...
> On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk <news terrainformatica.com> wrote:
>> "Oskar Linde" <olREM OVEnada.kth.se> wrote in message news:dt40sg$29nc$1 digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.
>>> No. As far as I understood, this was one of the considerations.
>>>> 2) Is it possible to override operator ~~ ?
>>> Yes. opMatch() and opNext().
>> And what is this opNext() doing exactly? The next sub-expression, or the next match from the last position matched (/g)?
>>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>>> To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.
>> ??? An alternative to a scripting language can be another scripting language. An alternative to a natively compilable language can be another natively compilable language.
> I think you're thinking inside the box. :) With the recent additions, is it not possible to write scripts in D?

I believe there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecode languages and compiled native languages. These three groups have their own niches. D, as a compiled language, will never reach the flexibility of e.g. prototype-based JavaScript or Ruby. There are just different definitions of flexibility for these groups: different and sometimes even orthogonal tasks.

Andrew.
Feb 18 2006
On Sat, 18 Feb 2006 00:36:23 -0800, Andrew Fedoniouk <news terrainformatica.com> wrote:
> "Regan Heath" <regan netwin.co.nz> wrote in message news:ops45qq5rn23k2f5 nrage.netwin.co.nz...
>> On Fri, 17 Feb 2006 20:46:01 -0800, Andrew Fedoniouk <news terrainformatica.com> wrote:
>>> "Oskar Linde" <olREM OVEnada.kth.se> wrote in message news:dt40sg$29nc$1 digitaldaemon.com...
>>>> Andrew Fedoniouk wrote:
>>>>> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.
>>>> No. As far as I understood, this was one of the considerations.
>>>>> 2) Is it possible to override operator ~~ ?
>>>> Yes. opMatch() and opNext().
>>> And what is this opNext() doing exactly? The next sub-expression, or the next match from the last position matched (/g)?
>>>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>>>> To make regexps more accessible, I guess. It makes D seem like an alternative to scripting languages.
>>> ??? An alternative to a scripting language can be another scripting language. An alternative to a natively compilable language can be another natively compilable language.
>> I think you're thinking inside the box. :) With the recent additions, is it not possible to write scripts in D?
> I believe there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecode languages and compiled native languages. These three groups have their own niches. D, as a compiled language, will never reach the flexibility of e.g. prototype-based JavaScript or Ruby. There are just different definitions of flexibility for these groups: different and sometimes even orthogonal tasks.

I think there is some overlap, i.e. some scripting tasks do not require the flexibility you mention; instead, the important factor may be one or more of:
- how fast can I code the solution
- how easily can I code the solution
- how easily can I maintain the solution
- how likely is my solution to contain bugs
- how easy will it be to find those bugs

Assuming you're a D programmer and assuming the D std lib contains the tools to achieve your task, why not use D?

Regan
Feb 18 2006
>> I believe there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecode languages and compiled native languages. These three groups have their own niches. D, as a compiled language, will never reach the flexibility of e.g. prototype-based JavaScript or Ruby. There are just different definitions of flexibility for these groups: different and sometimes even orthogonal tasks.
> I think there is some overlap, i.e. some scripting tasks do not require the flexibility you mention; instead, the important factor may be one or more of:
> - how fast can I code the solution
> - how easily can I code the solution
> - how easily can I maintain the solution
> - how likely is my solution to contain bugs
> - how easy will it be to find those bugs
> Assuming you're a D programmer and assuming the D std lib contains the tools to achieve your task, why not use D?

1) Scripting languages are usually used built into some other environment. This use case is quite different from the D execution model. Different life cycle.
2) Scripting languages are safe. Tremendous effort is needed to cause a GPF in a scripting environment. In D, causing a GPF is a piece of cake. I mean not because of bugs in the language or libs, but because you can dereference a null object pointer, for example.
3) Scripting languages provide a very high-level and convenient set of ready-to-use, task-oriented classes/objects.

Example: for building D projects you would rather use make or build scripts than D itself, right? Even if you had something like std.build, I bet you would use some scripting tool for your builds.

What I want to say: writing a fast scripting engine in D is possible, and this is what D is best for (among other things). But to write something D-ish in scripting.... Completely different areas of use, to be short.
Feb 18 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt6ma6$1jt0$1 digitaldaemon.com...
> I believe there is a sort of misunderstanding about what scripting is and why there are scripting (typeless) languages, compiled bytecode languages and compiled native languages. These three groups have their own niches. D, as a compiled language, will never reach the flexibility of e.g. prototype-based JavaScript or Ruby. There are just different definitions of flexibility for these groups: different and sometimes even orthogonal tasks.

I agree. But I don't believe that there's anything special about scripting that makes it especially suited for regex, but regex is a large reason people use scripting languages.
Feb 18 2006
Walter Bright wrote:
> [snip]
> regex is a large reason people use scripting languages.

Really? Do you have some kind of data to back that assertion?
Feb 18 2006
"kris" <fu bar.org> wrote in message news:dt7m3l$2hc5$1 digitaldaemon.com...
> Walter Bright wrote:
> [snip]
>> regex is a large reason people use scripting languages.
> Really? Do you have some kind of data to back that assertion?

Peer reviewed statistical research studies? Nope. But it's a pretty good impression one gets by reading the examples in manuals for scripting languages, listening to what people say about those languages, and looking at a sampling of actual scripts.

Here's a quote from "Programming Perl"'s preface by Larry Wall: "Perl is no longer just for text processing." That means, to me, that Perl was DESIGNED to be a text processing language. Why would the backbone of that, regex, not be why a large number of people use Perl? Perl stands for "Practical Extraction and Report Language", i.e. text manipulation. Larry goes out of his way to say that Perl is a superset of sed and awk, which are regex string manipulation scripting languages.
Feb 18 2006
Walter Bright wrote:
> ... but regex is a large reason people use scripting languages.

I've never used scripting languages for that purpose. The only reason I've used scripting languages is that they are often easier and quicker, and have a huge library for writing portable code. D almost matches them in being as easy and as quick, but lacks the huge standard library.
Feb 18 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt3v1o$27nk$1 digitaldaemon.com...
> 1) Will builtin RegExp increase the minimal size of a D executable? I mean, if the executable is not using regexp at all.

No.

> 2) Is it possible to override operator ~~ ?

Overload, yes. With opMatch().

> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?

Make them easier to use.

> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>    "..." ~~ "..."

Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.

> If ~~ is a part of the language syntax, then one can assume that the expression is getting compiled somehow.
Feb 17 2006
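Walter's remark that a string-literal pattern could be diagnosed at compile time can be illustrated with present-day D. This is only a sketch under assumptions: it uses Phobos `std.regex` (`ctRegex`, `matchFirst`), which postdates this thread, instead of the `~~` operator being discussed, and `looksLikeInteger` is a hypothetical name chosen for the example.

```d
import std.regex : ctRegex, matchFirst;

/// Hypothetical helper: true when `text` consists only of ASCII digits.
/// The pattern is a template argument, so it is parsed while the
/// program is being compiled; a malformed pattern fails the build.
bool looksLikeInteger(string text)
{
    auto re = ctRegex!(`^[0-9]+$`);
    // matchFirst yields an empty capture set when nothing matches
    return !matchFirst(text, re).empty;
}
```

Patterns that are only known at run time would still have to go through the ordinary `regex(...)` constructor, which is the case Oskar says will always need support.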
Thanks, Walter,

>> 2) Is it possible to override operator ~~ ?
> Overload, yes. With opMatch().

Next questions then:

  [char string literal] ~~ [char string literal]

1) For what object do I need to override opMatch for it to be invoked in the line above?
2) For some types of RE(-like) expressions there is no need to create an instance of RegExp, e.g. the test "*.ext" ~~ file_name can be implemented many times faster than standard RE creation/invocation.
3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?

>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
> Make them easier to use.

Easier? What is wrong with the standard way:

  regexp re = new regexp(".....");
  re.test(...);

And easier does not mean more effective.

  while( true )
  {
    if( "mask" ~~ file_name ) ....
  }

As far as I understand, you will generate:

  while( true )
  {
    regexp re = new regexp("mask");
    re.test(file_name);
    ....
  }

>> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>>    "..." ~~ "..."
> Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.

If it does not compile this regexp at compile time, then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.
So the real meaning of the arg1 ~~ arg2 notation is just a shortcut for arg1.test(arg2).
In general shortcuts are good, but in this particular case it has hidden side effects: the creation of a new RegExp object on each test invocation.

Andrew.
Feb 17 2006
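Andrew's second point, that a mask such as "*.ext" can be tested without building a regular expression at all, can be sketched as a small user-defined type. Everything here is an assumption made for illustration: `SuffixMask` and `matches` are hypothetical names, and the code handles only the "anything followed by a fixed suffix" case, not general wildcards.

```d
/// Hypothetical type for masks of the form "*<suffix>", e.g. "*.ext".
/// No pattern compilation and no memory allocation is involved.
struct SuffixMask
{
    string suffix; // the part after the '*', e.g. ".ext"

    /// True when `name` ends with the stored suffix.
    bool matches(const(char)[] name) const
    {
        return name.length >= suffix.length
            && name[$ - suffix.length .. $] == suffix;
    }
}
```

A call such as `SuffixMask(".ext").matches(file_name)` then does a single slice comparison per test, which is the kind of cost Andrew has in mind for simple masks.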
Andrew Fedoniouk wrote:
> Thanks, Walter,
>>> 2) Is it possible to override operator ~~ ?
>> Overload, yes. With opMatch().
> Next questions then:
>   [char string literal] ~~ [char string literal]
> 1) For what object do I need to override opMatch for it to be invoked in the line above?
> 2) For some types of RE(-like) expressions there is no need to create an instance of RegExp, e.g. the test "*.ext" ~~ file_name can be implemented many times faster than standard RE creation/invocation.
> 3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?

Instead of an answer, a quick example of what I tried and what works:

<CODE>
import std.stdio;

class ArrayBeginsWith
{
  static ArrayBeginsWith opCall(int a)
  {
    check = a;
    return instance;
  }

  static ArrayBeginsWith instance;
  static int check;

  static this()
  {
    instance = new ArrayBeginsWith;
  }

  static bool opMatch(int[] nums)
  {
    if(nums.length < 1) return false;
    if(nums[0] == check)
      return true;
    else
      return false;
  }
}

static bool opMatch(int[] nums)
{
  if(nums.length < 2) return false;
  if(nums[0] == 0 && nums[1] == 1)
    return true;
  else
    return false;
}

void main()
{
  static int[] somearray1 = [0,1,2];
  static int[] somearray2 = [2,1,2];
  writefln(ArrayBeginsWith(0) ~~ somearray1);
  writefln(ArrayBeginsWith(0) ~~ somearray2);
  writefln(ArrayBeginsWith(2) ~~ somearray1);
  writefln(ArrayBeginsWith(2) ~~ somearray2);
}
</CODE>

>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>> Make them easier to use.
> Easier? What is wrong with the standard way:
>   regexp re = new regexp(".....");
>   re.test(...);

Nothing is wrong with this, but ~~ is easier :)

> And easier does not mean more effective.
>   while( true )
>   {
>     if( "mask" ~~ file_name ) ....
>   }
> As far as I understand, you will generate:
>   while( true )
>   {
>     regexp re = new regexp("mask");
>     re.test(file_name);
>     ....
>   }

I don't think this is too hard to optimize away. The compiler can even generate a global RegExp instance for each regular expression literal and use it many times.

>>> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>>>    "..." ~~ "..."
>> Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.
> If it does not compile this regexp at compile time, then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.
> So the real meaning of the arg1 ~~ arg2 notation is just a shortcut for arg1.test(arg2).
> In general shortcuts are good, but in this particular case it has hidden side effects: the creation of a new RegExp object on each test invocation.

This generation of a new RegExp doesn't have to happen. But ~~ provides us with a way of testing arbitrary types for arbitrary things.
Feb 17 2006
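Ivan's observation that the pattern object only needs to be created once can be shown by hand in present-day D. This is a sketch under assumptions, not what the 2006 compiler did: it relies on Phobos `std.regex` (`regex`, `matchFirst`), which postdates this thread, and `countMatches` is a hypothetical helper name.

```d
import std.regex : matchFirst, regex;

/// Hypothetical helper: counts the names that match `pattern`.
/// The pattern is compiled once, before the loop, and then reused,
/// instead of being rebuilt on every iteration.
size_t countMatches(string[] names, string pattern)
{
    auto re = regex(pattern); // single run-time compilation
    size_t hits = 0;
    foreach (name; names)
    {
        // matchFirst yields an empty capture set when nothing matches
        if (!matchFirst(name, re).empty)
            ++hits;
    }
    return hits;
}
```

Hoisting the construction out of the loop is the manual equivalent of the per-literal global instance Ivan suggests the compiler could generate.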
Thanks, Ivan, see below:

"Ivan Senji" <ivan.senji_REMOVE_ _THIS__gmail.com> wrote in message news:dt5b54$h1q$1 digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> Thanks, Walter,
>>>> 2) Is it possible to override operator ~~ ?
>>> Overload, yes. With opMatch().
>> Next questions then:
>>   [char string literal] ~~ [char string literal]
>> 1) For what object do I need to override opMatch for it to be invoked in the line above?
>> 2) For some types of RE(-like) expressions there is no need to create an instance of RegExp, e.g. the test "*.ext" ~~ file_name can be implemented many times faster than standard RE creation/invocation.
>> 3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?
> Instead of an answer, a quick example of what I tried and what works:
> <CODE>
> import std.stdio;
>
> class ArrayBeginsWith
> {
>   static ArrayBeginsWith opCall(int a)
>   {
>     check = a;
>     return instance;
>   }
>
>   static ArrayBeginsWith instance;
>   static int check;
>
>   static this()
>   {
>     instance = new ArrayBeginsWith;
>   }
>
>   static bool opMatch(int[] nums)
>   {
>     if(nums.length < 1) return false;
>     if(nums[0] == check)
>       return true;
>     else
>       return false;
>   }
> }
>
> static bool opMatch(int[] nums)
> {
>   if(nums.length < 2) return false;
>   if(nums[0] == 0 && nums[1] == 1)
>     return true;
>   else
>     return false;
> }
>
> void main()
> {
>   static int[] somearray1 = [0,1,2];
>   static int[] somearray2 = [2,1,2];
>   writefln(ArrayBeginsWith(0) ~~ somearray1);
>   writefln(ArrayBeginsWith(0) ~~ somearray2);
>   writefln(ArrayBeginsWith(2) ~~ somearray1);
>   writefln(ArrayBeginsWith(2) ~~ somearray2);
> }
> </CODE>

  bool startsWith( int[] arr, int v )
  {
    if(arr.length < 1) return false;
    return arr[0] == v;
  }

and its usage:

  static int[] somearray2 = [2,1,2];
  if( somearray2.startsWith( 0 ) ) ...

will be more
a) compact
b) human readable
c) maintainable
d) natural

The same applies to

  bool match( const char[] str, RegExp re ) { ... }
  if( mystr.match(someRe) ) ....

------------------------------------
I would go for a normal implementation of outer methods instead of this :p~~.

>>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>>> Make them easier to use.
>> Easier? What is wrong with the standard way:
>>   regexp re = new regexp(".....");
>>   re.test(...);
> Nothing is wrong with this, but ~~ is easier :)
>> And easier does not mean more effective.
>>   while( true )
>>   {
>>     if( "mask" ~~ file_name ) ....
>>   }
>> As far as I understand, you will generate:
>>   while( true )
>>   {
>>     regexp re = new regexp("mask");
>>     re.test(file_name);
>>     ....
>>   }
> I don't think this is too hard to optimize away. The compiler can even generate a global RegExp instance for each regular expression literal and use it many times.
>>>> 4) When is a regexp checked for syntax correctness: at compile time or at runtime?
>>>>    "..." ~~ "..."
>>> Right now, at runtime. But the compiler is allowed to diagnose it at compile time, if it's a string literal.
>> If it does not compile this regexp at compile time, then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.
>> So the real meaning of the arg1 ~~ arg2 notation is just a shortcut for arg1.test(arg2).
>> In general shortcuts are good, but in this particular case it has hidden side effects: the creation of a new RegExp object on each test invocation.
> This generation of a new RegExp doesn't have to happen. But ~~ provides us with a way of testing arbitrary types for arbitrary things.

As I said, having a defined function with the name 'match' and clearly defined parameters is way better than making the syntax of the language look like an Xmas tree, with all possible smiley notations (http://www.helpbytes.co.uk/smileys.php)

Andrew.
Feb 17 2006
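The free-function alternative Andrew describes can be written as follows in present-day D. This is a minimal sketch: it assumes a compiler with uniform function call syntax for module-level functions, and the `const` parameter and single-expression body are editorial choices rather than anything stated in the thread.

```d
/// Returns true when `arr` is non-empty and its first element equals `v`.
/// Declared at module level so it can be called as `arr.startsWith(v)`.
bool startsWith(const int[] arr, int v)
{
    // The length check comes first, so an empty array is never indexed.
    return arr.length > 0 && arr[0] == v;
}
```

With `int[] somearray2 = [2, 1, 2];`, the call `somearray2.startsWith(0)` reads like a method call even though `startsWith` is an ordinary function.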
Andrew Fedoniouk wrote:
> Thanks, Ivan, see below:
...
>   bool startsWith( int[] arr, int v )
>   {
>     if(arr.length < 1) return false;
>     return arr[0] == v;
>   }
> and its usage:
>   static int[] somearray2 = [2,1,2];
>   if( somearray2.startsWith( 0 ) ) ...
> will be more
> a) compact
> b) human readable
> c) maintainable
> d) natural

Naturally, but this was just a see-if-it-can-be-done example. :)

> As I said, having a defined function with the name 'match' and clearly defined parameters is way better than making the syntax of the language look like an Xmas tree -

Well, I don't see it like that. I see it as an abstracted concept of "matching", and that can be interpreted as an elementary operation. Plus, we can overload ~~ to mean matching of any kind we want that makes sense.
Feb 17 2006
>>   static int[] somearray2 = [2,1,2];
>>   if( somearray2.startsWith( 0 ) ) ...
>> will be more
>> a) compact
>> b) human readable
>> c) maintainable
>> d) natural
> Naturally, but this was just a see-if-it-can-be-done example. :)

:D or better :~~D

>> As I said, having a defined function with the name 'match' and clearly defined parameters is way better than making the syntax of the language look like an Xmas tree -
> Well, I don't see it like that. I see it as an abstracted concept of "matching", and that can be interpreted as an elementary operation. Plus, we can overload ~~ to mean matching of any kind we want that makes sense.

:)
1) According to http://www.digitalmars.com/d/expression.html#MatchExpression "Both operands must be implicitly convertible to char[]." so your "matching of any kind we want" is not strictly true.
2) ~~ has side effects. Moreover, it is implemented as a stateful comparison, so consecutive ~~'s on the same arguments will yield different results.
3) while(true) { bool r = "a" ~~ r"\w"; } must allocate a new RegExp.
Feb 17 2006
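The stateful, match-after-match behaviour Andrew objects to has an explicit counterpart in present-day D, where the successive matches are exposed as a range instead of hidden state. The sketch below is an assumption-based illustration using Phobos `std.regex` (`regex`, `matchAll`), which postdates this thread; `allMatches` is a hypothetical helper name.

```d
import std.regex : matchAll, regex;

/// Hypothetical helper: collects every non-overlapping match of
/// `pattern` in `text`, in order of appearance.
string[] allMatches(string text, string pattern)
{
    string[] found;
    // matchAll returns a range; each element describes one match,
    // so the "next match" state lives in the range, not in an operator.
    foreach (m; matchAll(text, regex(pattern)))
        found ~= m.hit; // m.hit is the full text of this match
    return found;
}
```

Because the iteration state is held by the range returned from `matchAll`, repeating the call on the same arguments starts again from the beginning of the text.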
Andrew Fedoniouk wrote:
>>>   static int[] somearray2 = [2,1,2];
>>>   if( somearray2.startsWith( 0 ) ) ...
>>> will be more
>>> a) compact
>>> b) human readable
>>> c) maintainable
>>> d) natural
>> Naturally, but this was just a see-if-it-can-be-done example. :)
> :D or better :~~D

That's a good smiley.

>>> As I said, having a defined function with the name 'match' and clearly defined parameters is way better than making the syntax of the language look like an Xmas tree -
>> Well, I don't see it like that. I see it as an abstracted concept of "matching", and that can be interpreted as an elementary operation. Plus, we can overload ~~ to mean matching of any kind we want that makes sense.
> :)
> 1) According to http://www.digitalmars.com/d/expression.html#MatchExpression "Both operands must be implicitly convertible to char[]." so your "matching of any kind we want" is not strictly true.

Well, it wouldn't be the first time that the documentation is wrong/incomplete. Both types *do* have to be implicitly convertible to char[], unless you use a match expression with your own type that defines an opMatch operator.

> 2) ~~ has side effects. Moreover, it is implemented as a stateful comparison, so consecutive ~~'s on the same arguments will yield different results.

char[] ~~ char[] is implemented that way, but a user's Foo ~~ Bar[] doesn't have to behave that way (though it can, if it makes sense that there are more matches).

> 3) while(true) { bool r = "a" ~~ r"\w"; } must allocate a new RegExp.

Why? Why couldn't a compiler optimize this away into something like:

  RegExp __regexp0001;
  static this()
  {
    __regexp0001 = new RegExp("a");
  }

and then later, whenever the literal "a" is used as a regex:

  while(true)
  {
    bool r = __regexp0001 ~~ r"\w";
  }

So it is true that a new RegExp is allocated, but it needs to be done only once.
Feb 17 2006
>> 3) while(true) { bool r = "a" ~~ r"\w"; } must allocate a new RegExp.
> Why? Why couldn't a compiler optimize this away into something like:
>   RegExp __regexp0001;
>   static this()
>   {
>     __regexp0001 = new RegExp("a");
>   }
> and then later, whenever the literal "a" is used as a regex:
>   while(true)
>   {
>     bool r = __regexp0001 ~~ r"\w";
>   }
> So it is true that a new RegExp is allocated, but it needs to be done only once.

And what is this opNext for then?

And more: traditionally there are two "test" operations in RegExps, 'match' and 'test', as far as I remember. match returns the matched substring and test returns a boolean.

There is also the /g flag, which allows scanning the whole string (Perl):

  $i = 0;
  while ($string =~ m/regex/g) { }

So what exactly does this ~~ do?

Andrew.
Feb 17 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt5ton$10qu$1 digitaldaemon.com...
> There is also the /g flag, which allows scanning the whole string (Perl):
>   $i = 0;
>   while ($string =~ m/regex/g) { }
> So what exactly does this ~~ do?
> Andrew.

m/regex/g => RegExp("regex", "g")
Feb 17 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt5eo9$kgu$1 digitaldaemon.com...
> will be more
> a) compact
> b) human readable
> c) maintainable
> d) natural

For startsWith(), sure. But if that was all regex was used for, nobody would have ever invented them. Regexes can search for arbitrarily complex patterns, and are used that way. Writing a library of custom functions for each is out of the question.

What you're also missing in the examples is using the match result, not just testing for the match.
Feb 17 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt591g$erk$1 digitaldaemon.com...
> Next questions then:
>   [char string literal] ~~ [char string literal]
> 1) For what object do I need to override opMatch for it to be invoked in the line above?

None. Operator overloading requires one object be a class or a struct. But you could do:

  RegExp("string") ~~ "string"

and overload opMatch for RegExp.

> 2) For some types of RE(-like) expressions there is no need to create an instance of RegExp, e.g. the test "*.ext" ~~ file_name can be implemented many times faster than standard RE creation/invocation.

Sure. Create your own MyReg object, and use it like:

  MyReg("*.ext") ~~ filename

> 3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?

Operator overloading happens before implicit conversions.

>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>> Make them easier to use.
> Easier? What is wrong with the standard way:
>   regexp re = new regexp(".....");
>   re.test(...);

For whatever reason, people find that confusing and impractical.

> And easier does not mean more effective.

True. I didn't say it was more effective.

> If it does not compile this regexp at compile time, then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.

It isn't worth trying to do them at compile time if the feature itself doesn't catch on.

> So the real meaning of the arg1 ~~ arg2 notation is just a shortcut for arg1.test(arg2).

It's more than that, because of the implicit declaration of the match results.

> In general shortcuts are good, but in this particular case it has hidden side effects: the creation of a new RegExp object on each test invocation.

Yes, but why is that a bad thing?
Feb 17 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:dt6da8$1ci6$1 digitaldaemon.com...
> "Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt591g$erk$1 digitaldaemon.com...
>> Next questions then:
>>   [char string literal] ~~ [char string literal]
>> 1) For what object do I need to override opMatch for it to be invoked in the line above?
> None. Operator overloading requires one object be a class or a struct. But you could do:
>   RegExp("string") ~~ "string"
> and overload opMatch for RegExp.

And this RegExp("string") ~~ "string" is more honest, isn't it?
Or as in Harmonia:

  string s = ....
  bool r = s.like("str*");

>> 2) For some types of RE(-like) expressions there is no need to create an instance of RegExp, e.g. the test "*.ext" ~~ file_name can be implemented many times faster than standard RE creation/invocation.
> Sure. Create your own MyReg object, and use it like:
>   MyReg("*.ext") ~~ filename

But I want my own function for char[] ~~ char[] !
A simple pattern match does not require a compilation phase or even memory allocation...

>> 3) Some objects have no string representation of the match operation. For example, a CSS selector as an object has a match operation with a DOM element as an argument. But you have a requirement: "Both operands must be implicitly convertible to char[]." What to do in this case?
> Operator overloading happens before implicit conversions.

I don't understand why not allow this:

  bool opMatch(char[] a, char[] b) ?

>>>> 3) What is the main purpose of incorporating interpretable regexps in a natively compilable language?
>>> Make them easier to use.
>> Easier? What is wrong with the standard way:
>>   regexp re = new regexp(".....");
>>   re.test(...);
> For whatever reason, people find that confusing and impractical.

uh, people.... I see.

>> And easier does not mean more effective.
> True. I didn't say it was more effective.
>> If it does not compile this regexp at compile time, then this is just a fake and not a solution at all for a language of D's level. Even Perl compiles its regular expressions at compile time.
> It isn't worth trying to do them at compile time if the feature itself doesn't catch on.
>> So the real meaning of the arg1 ~~ arg2 notation is just a shortcut for arg1.test(arg2).
> It's more than that, because of the implicit declaration of the match results.
>> In general shortcuts are good, but in this particular case it has hidden side effects: the creation of a new RegExp object on each test invocation.
> Yes, but why is that a bad thing?

You need to explain very well what is going on under the hood of this ~~ - it is a stateful operator (if it is /g).

<ot>
I am using a stream tokenizer in Harmonia instead of this /g.
(class TokenizerT(CHAR) // harmonia/string.d)
A simple like(pattern) method is enough in 90% of cases.
Perl is a completely different story - it is built around RegExp. And it is typeless.
</ot>

BTW: Have you seen Nemerle and its way of meta-programming?
http://nemerle.org/

Andrew.
Feb 17 2006
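Andrew's point that a plain wildcard test needs neither a compile step nor an allocation can be sketched roughly as follows. This is only an illustration, not Harmonia's actual like(): the function body, the choice of '*' and '?' as the only metacharacters, and the use of a current D2 compiler (where a free function can be called as s.like(...)) are all assumptions.

```d
/// Hypothetical wildcard matcher: '*' matches any run of characters,
/// '?' matches exactly one. It scans in place and allocates nothing.
bool like(const(char)[] text, const(char)[] pattern)
{
    size_t t = 0, p = 0;
    size_t star = size_t.max; // index of the most recent '*' in pattern
    size_t mark = 0;          // text position where that '*' began matching

    while (t < text.length)
    {
        if (p < pattern.length && pattern[p] == '*')
        {
            star = p++;       // remember the star, tentatively match nothing
            mark = t;
        }
        else if (p < pattern.length
                 && (pattern[p] == '?' || pattern[p] == text[t]))
        {
            ++t;
            ++p;
        }
        else if (star != size_t.max)
        {
            p = star + 1;     // backtrack: let the last '*' absorb one more char
            t = ++mark;
        }
        else
        {
            return false;
        }
    }
    // Only trailing stars may remain unmatched in the pattern.
    while (p < pattern.length && pattern[p] == '*')
        ++p;
    return p == pattern.length;
}

void main()
{
    string s = "string";
    assert(s.like("str*"));
    assert("report.ext".like("*.ext"));
    assert(!"report.txt".like("*.ext"));
}
```

As Walter notes elsewhere in the thread, a test like this only answers yes or no; it does not report where or what matched.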
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt6gbc$1eig$1 digitaldaemon.com..."Walter Bright" <newshound digitalmars.com> wrote in message news:dt6da8$1ci6$1 digitaldaemon.com...That doesn't give the match results, though.None. Operator overloading requires one object be a class or a struct. But you could do: RegExp("string") ~~ "string" and overload opMatch for RegExp.And this RegExp("string") ~~ "string" is more honest, isn't it? Or as in Harmonia: string s = .... bool r = s.like("str*");Consider overloading the '+' in '1+2'? To overload operators, one of the operands must be a user defined type.Sure. Create your own MyReg object, and use it like: MyReg("*.ext") ~~ filenameBut I want my own function for char[] ~~ char[] !I don't understand why not allow this: bool opMatch(char[] a, char[] b) ?For the same reason opAdd(int a, int b) is not allowed. Such a function would apply globally, all the library code will break, etc.BTW: Have you seen Nemerle and its way of meta-programming? http://nemerle.org/I don't know anything about it. I'll take a look at the link.
Feb 18 2006
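The rule Walter describes, that an overloaded operator needs at least one operand of a user-defined type, can be illustrated with a small wrapper struct. This sketch assumes a current D2 compiler and Phobos; it does not use the ~~ operator or opMatch discussed in this thread, but borrows the existing `in` operator instead. The name Glob is hypothetical, in the spirit of Walter's MyReg suggestion; std.path.globMatch is an existing Phobos function.

```d
import std.path : globMatch;

/// Hypothetical wrapper type. Its only job is to give the operator a
/// user-defined operand to dispatch on; both sides being plain strings
/// would leave nothing to overload.
struct Glob
{
    string pattern;

    /// Handles `name in Glob(...)`: the struct is the right-hand operand.
    bool opBinaryRight(string op : "in")(string name)
    {
        return globMatch(name, pattern);
    }
}

void main()
{
    assert("notes.ext" in Glob("*.ext"));
    assert(!("notes.txt" in Glob("*.ext")));
}
```

Because the overload lives on Glob, it cannot change the meaning of operators applied to two built-in strings anywhere else in a program, which is the breakage Walter is warning about.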
"Walter Bright" <newshound digitalmars.com> wrote in message news:dt6nug$1lhe$1 digitaldaemon.com...Who cares in most of cases? user input validation tasks or simple filename matching ... When you need match results you will use regexp or something more effective like tokenizers.Or as in Harmonia: string s = .... bool r = s.like("str*");That doesn't give the match results, though.Take a look. A bit ugly on my taste but some ideas of Nemerle macros can be reused. They allow to add your own problem specific notation and syntax to the language.Consider overloading the '+' in '1+2'? To overload operators, one of the operands must be a user defined type.Sure. Create your own MyReg object, and use it like: MyReg("*.ext") ~~ filenameBut I want my own function for char[] ~~ char[] !I don't understand why not allow this: bool opMatch(char[] a, char[] b) ?For the same reason opAdd(int a, int b) is not allowed. Such a function would apply globally, all the library code will break, etc.BTW: Have you seen Nemerle and its way of meta-programming? http://nemerle.org/I don't know anything about it. I'll take a look at the link.
Feb 18 2006
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt7qm4$2kn0$1 digitaldaemon.com..."Walter Bright" <newshound digitalmars.com> wrote in message news:dt6nug$1lhe$1 digitaldaemon.com...In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.That doesn't give the match results, though. Who cares in most of cases?string s = .... bool r = s.like("str*");When you need match results you will use regexp or something more effective like tokenizers.Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.
Feb 18 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:dt80n7$2qiu$3 digitaldaemon.com..."Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt7qm4$2kn0$1 digitaldaemon.com...Probably in some Perl-ish use cases this is really so needed. In my http://blocknote.net hyperlink auto-recognition start working on each complete non-ws sequence - I already know position. But this is a particular use case."Walter Bright" <newshound digitalmars.com> wrote in message news:dt6nug$1lhe$1 digitaldaemon.com...In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.That doesn't give the match results, though. Who cares in most of cases?string s = .... bool r = s.like("str*");Why? Here is simple Tokenizer for C/C++/D/etc. alike texts module harmonia.string; class TokenizerT(CHAR) { enum token { EOT, SPACE, WORD, QUOTE, DELIMETER, COMMENT } ... } And module harmonia.html.scanner; is simple HTML/XML push parser (scanner) ---------------------- I mean that std.lib should have multiple text handling tools. RegExp is not only one possible. I would like to see something like customizeable TokenizerT above in std lib. Frequently such tokenizer is what really needed rather than regexp and scriptin style poor man tokenizing using array.split and the like. Andrew.When you need match results you will use regexp or something more effective like tokenizers.Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.
Feb 18 2006
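The kind of small tokenizer Andrew has in mind can be sketched as below. This is not Harmonia's TokenizerT (its body is elided in the post above); the TokenType, Token and tokenize names are hypothetical, only three of the token classes are handled, and a current D2 compiler with std.ascii is assumed.

```d
import std.ascii : isAlphaNum, isWhite;

/// Hypothetical token classes, a subset of the enum quoted in the post.
enum TokenType { space, word, delimiter }

struct Token
{
    TokenType type;
    string text;    // slice of the input, no copy is made
}

/// Splits src into runs of whitespace, runs of identifier characters,
/// and single-character delimiters.
Token[] tokenize(string src)
{
    Token[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        immutable start = i;
        TokenType type;
        if (isWhite(src[i]))
        {
            type = TokenType.space;
            while (i < src.length && isWhite(src[i]))
                ++i;
        }
        else if (isAlphaNum(src[i]) || src[i] == '_')
        {
            type = TokenType.word;
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_'))
                ++i;
        }
        else
        {
            type = TokenType.delimiter;
            ++i;
        }
        tokens ~= Token(type, src[start .. i]);
    }
    return tokens;
}

void main()
{
    auto toks = tokenize("x = foo(1);");
    assert(toks[0] == Token(TokenType.word, "x"));
    assert(toks[1].type == TokenType.space);
    assert(toks[2] == Token(TokenType.delimiter, "="));
}
```

Quotes and comments, the parts that make a real lexer laborious, are deliberately left out here, which is roughly the effort Walter is pointing at.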
"Andrew Fedoniouk" <news terrainformatica.com> wrote in message news:dt87fd$314d$1 digitaldaemon.com..."Walter Bright" <newshound digitalmars.com> wrote in message news:dt80n7$2qiu$3 digitaldaemon.com...I'd like to see strtok() parse an email address out of a body of text.Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.Why?
Feb 19 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:dt9ho8$20e4$3 digitaldaemon.com...I don't really understand "parse an email address out of a body of text." Do you mean something like this: char* pw = text; url u; forever { pw = strtok( pw, " \t\n\r" ); if( !pw ) return; if( !u.parse(pw) ) continue; if( u.protocol() == url::MAILTO ) //found - do something here ; }; ? Andrew.I'd like to see strtok() parse an email address out of a body of text.Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.Why?
Feb 19 2006
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound digitalmars.com> wrote in message news:dt9ho8$20e4$3 digitaldaemon.com...
>>>> Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.
>>> Why?
>> I'd like to see strtok() parse an email address out of a body of text.
> I don't really understand "parse an email address out of a body of text." Do you mean something like this:
> char* pw = text; url u; forever { pw = strtok( pw, " \t\n\r" ); if( !pw ) return; if( !u.parse(pw) ) continue; if( u.protocol() == url::MAILTO ) //found - do something here ; }; ?
> Andrew.

I think he meant something more like (using MatchExpr, sorry):

Granted, I just tossed that together in five seconds flat, so it's probably not quite right. I'm just recently starting to lean into the RegExp camp myself. It's made parsing of Lyra scripts a dream.

One thing I miss from a scripting language in doing the above is PHP's lovely list() construct. Pretending we had this in D:

-- Chris Nicholson-Sauls
Feb 19 2006
Andrew Fedoniouk,

What he's saying is... essentially... please take this string:

char[] some_text = "The email address Walter is posting from is newshound digitalmars.com. The headers for your message have <news terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown simplemachines.org\">my email</a>";

Now use strtok to output just the email addresses. I would expect the output to be like this:

1: newshound digitalmars.com
2: news terrainformatica.com
3: unknown simplemachines.org

How many lines will it take to grab those addresses, without using a regular expression? You can use "like()" all you like, and strtok(), or even strpos()...

He does not mean a whitespace separated list of addresses, why would you need to work to parse that? Most people would not use a regular expression for that, it'd be silly.

I think you're looking at this from a different angle than Walter is. Just illustrating,

-[Unknown]

> "Walter Bright" <newshound digitalmars.com> wrote in message news:dt9ho8$20e4$3 digitaldaemon.com...
>>>> Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarassingly inadequate.
>>> Why?
>> I'd like to see strtok() parse an email address out of a body of text.
> I don't really understand "parse an email address out of a body of text." Do you mean something like this:
> char* pw = text; url u; forever { pw = strtok( pw, " \t\n\r" ); if( !pw ) return; if( !u.parse(pw) ) continue; if( u.protocol() == url::MAILTO ) //found - do something here ; }; ?
> Andrew.
Feb 19 2006
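For comparison, the regex side of this challenge might look roughly like the sketch below. It assumes a current D2 compiler and Phobos' std.regex rather than the builtin ~~ syntax under discussion; the findEmails name, the deliberately loose pattern, and the placeholder addresses in the sample text are all invented for illustration. The pattern is nowhere near a full RFC-grade address grammar.

```d
import std.regex : matchAll, regex;
import std.stdio : writefln;

/// Hypothetical helper: returns every substring that looks like
/// local-part@host.tld, in order of appearance.
string[] findEmails(string text)
{
    // Loose pattern: word-ish local part, '@', then dotted host labels.
    auto re = regex(`[A-Za-z0-9._%+\-]+@[A-Za-z0-9\-]+(?:\.[A-Za-z0-9\-]+)+`);
    string[] found;
    foreach (m; matchAll(text, re))
        found ~= m.hit;     // m.hit is the whole matched slice
    return found;
}

void main()
{
    // Placeholder text shaped like the challenge string above.
    auto text = "Write to alice@example.com. The headers show <bob@example.org>, "
              ~ "and the page has <a href=\"mailto:carol@example.net\">mail</a>.";
    foreach (i, addr; findEmails(text))
        writefln("%d: %s", i + 1, addr);
}
```

The sentence-ending period, the angle brackets and the mailto: prefix all fall outside the character classes, so no separate trimming pass is needed; that is the convenience being argued for.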
On Sun, 19 Feb 2006 14:47:43 -0800, Unknown W. Brackets <unknown simplemachines.org> wrote:
> Andrew Fedoniouk,
>
> What he's saying is... essentially... please take this string:
>
> char[] some_text = "The email address Walter is posting from is newshound digitalmars.com. The headers for your message have <news terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown simplemachines.org\">my email</a>";
>
> Now use strtok to output just the email addresses. I would expect the output to be like this:
>
> 1: newshound digitalmars.com
> 2: news terrainformatica.com
> 3: unknown simplemachines.org
>
> How many lines will it take to grab those addresses, without using a regular expression? You can use "like()" all you like, and strtok(), or even strpos()...

Here's how I'd do it:

import std.stdio;
import std.string;

char[] some_text = "The email address Walter is posting from is newshound digitalmars.com. The headers for your message have <news terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown simplemachines.org\">my email</a>";

void main()
{
    char[][] res;
    res = parse_string(some_text);
    foreach(int i, char[] r; res)
        writefln("%d. %s",i+1,r);
}

bool valid_email_char(char c)
{
    char* special = "<>()[]\\.,;: \"";
    if (c == '.') return true;
    if (c <= 0x1F) return false;
    if (c == 0x7F) return false;
    if (c == ' ') return false;
    if (strchr(special,c)) return false;
    return true;
}

char[][] parse_string(char[] text)
{
    char[][] res;
    char* raw = toStringz(text);
    char* p;
    char* e;

    for(p = strchr(raw,' '); p; p = strchr(e,' '))
    {
        for(e = p+1; valid_email_char(*e); e++) {}
        if (e > raw && *(e-1) == '.') e--;
        for(; p > raw && valid_email_char(*(p-1)); p--) {}
        res ~= p[0..(e-p)]; //add .dup if required
    }
    return res;
}

Regan
Feb 19 2006
"Regan Heath" <regan netwin.co.nz> wrote in message news:ops48ur1em23k2f5 nrage.netwin.co.nz...Here's how I'd do it:Your's is a lot of code to do what a regex does. Now recognize a url <g>.
Feb 19 2006
On Sun, 19 Feb 2006 18:52:19 -0800, Walter Bright <newshound digitalmars.com> wrote:
> "Regan Heath" <regan netwin.co.nz> wrote in message news:ops48ur1em23k2f5 nrage.netwin.co.nz...
>> Here's how I'd do it:
> Your's is a lot of code to do what a regex does.

This is true, though my code is likely faster.

> Now recognize a url <g>.

Nah. You've made your point.. in fact I was secretly trying to help. <g>

Regex is a good general purpose string parsing facility. I personally find composing a regex can be complicated, likely it's easier with practice. A custom piece of code is probably faster and I find it easier to tweak. In the end, unless it was performance critical or has resisted my initial efforts at composing a regex, I'd probably use a regex.

Regan
Feb 19 2006
Regan Heath wrote:
> Walter Bright <newshound digitalmars.com> wrote:
>> "Regan Heath" <regan netwin.co.nz> wrote
>>> Here's how I'd do it:
>>>
>>> import std.stdio;
>>> import std.string;
>>>
>>> char[] some_text = "The email address Walter is posting from is newshound digitalmars.com. The headers for your message have <news terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown simplemachines.org\">my email</a>";
>>>
>>> void main()
>>> {
>>>     char[][] res;
>>>     res = parse_string(some_text);
>>>     foreach(int i, char[] r; res)
>>>         writefln("%d. %s",i+1,r);
>>> }
>>>
>>> bool valid_email_char(char c)
>>> {
>>>     char* special = "<>()[]\\.,;: \"";
>>>     if (c == '.') return true;
>>>     if (c <= 0x1F) return false;
>>>     if (c == 0x7F) return false;
>>>     if (c == ' ') return false;
>>>     if (strchr(special,c)) return false;
>>>     return true;
>>> }
>>>
>>> char[][] parse_string(char[] text)
>>> {
>>>     char[][] res;
>>>     char* raw = toStringz(text);
>>>     char* p;
>>>     char* e;
>>>
>>>     for(p = strchr(raw,' '); p; p = strchr(e,' '))
>>>     {
>>>         for(e = p+1; valid_email_char(*e); e++) {}
>>>         if (e > raw && *(e-1) == '.') e--;
>>>         for(; p > raw && valid_email_char(*(p-1)); p--) {}
>>>         res ~= p[0..(e-p)]; //add .dup if required
>>>     }
>>>     return res;
>>> }
>> Your's is a lot of code to do what a regex does.
> This is true, though my code is likely faster.
>> Now recognize a url <g>.
> Nah. You've made your point.. in fact I was secretly trying to help. <g>

DISCLAIMER INSERTED WHEN PROOFREADING: I'm not attacking you, or anybody's opinion here, I'm just thinking aloud -- mostly to sort out my own opinion on this issue! :-)

> Regex is a good general purpose string parsing facility. I personally find composing a regex can be complicated, likely it's easier with practice. A custom piece of code is probably faster and I find it easier to tweak. In the end, unless it was performance critical or has resisted my initial efforts at composing a regex, I'd probably use a regex.

Heh, interestingly, I have the same feeling about all three!! (I.e. composing nontrivial regexes is hard, custom code is faster and easier to tweak.)

But I can't but wonder whether I'm wrong on all three!

In other words, writing custom code to do the same as a nontrivial regexp might feel the easier choice at the outset, but the sheer number of lines required (for example for the url recognition task) makes the code error prone and unobvious.

And I too _feel_ that the custom code would be faster, but, on second thought, I'd probably have to do some intensive optimizing cycles if I were against an average regexp implementation. ;-( This regexp stuff is "well understood" and polished during decades, after all.

As to "easier to tweak", suppose that Boss comes to you 2 months later and wants this Url Recognizer (which you had to write in a hurry to compete with the regexp guy in the next cubicle) to only accept top-level domains in country specific urls, you'd be hard put to know where to start tweaking, while the other guy gets it right in 30 seconds flat tweaking his regexp code.

(The boss' tweak accepts foo.fi but not foo.bar.fi nor foo.com)
Feb 20 2006
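The "boss' tweak" from the post above is the sort of change that stays small in a regex. A rough sketch, assuming a current D2 compiler with std.regex, and further assuming that "country specific" can be approximated as "exactly one label followed by a two-letter top-level domain" (the isCountryTopLevelHost name and that approximation are both invented here):

```d
import std.regex : matchFirst, regex;

/// Hypothetical check: one host label plus a two-letter TLD, nothing else.
bool isCountryTopLevelHost(string host)
{
    // ^ and $ anchor the whole string; "i" makes the match case-insensitive.
    auto re = regex(`^[a-z0-9\-]+\.[a-z]{2}$`, "i");
    return !matchFirst(host, re).empty;
}

void main()
{
    assert(isCountryTopLevelHost("foo.fi"));       // accepted
    assert(!isCountryTopLevelHost("foo.bar.fi"));  // extra label
    assert(!isCountryTopLevelHost("foo.com"));     // three-letter TLD
}
```

Whether a two-letter test is a good enough stand-in for real country codes is a separate question; the point is only that the rule lives in one line of pattern.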
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound digitalmars.com>
>> "Andrew Fedoniouk" <news terrainformatica.com>
>>> In general shortcuts are good but in this particular case it has hidden side effects in creation of new RegExp object on each test invocation.
>> Yes, but why is that a bad thing?
> You need to explain very well what is going on under the hood of this ~~ - it is statefull operator (if it is /g).
> <ot> I am using stream tokenizer in Harmonia instead of this /g. (class TokenizerT(CHAR) // harmonia/string.d) Simple like(pattern) method is enough in 90% of cases. Perl is completely different story - it is built around RegExp. And it is typeless. </ot>
> BTW: Have you seen Nemerle and its way of meta-programming? http://nemerle.org/

Had I to do stuff on the M$ "platform", I'd definitely look long and

The macro thing looks quite a bit like what I had in mind last winter when we were discussing whether the high level (that is, metaprogramming) features of D should be implemented in a syntax distinct from the "normal" language syntax or not.

Seems I lost. :-) (No hard feelings, Walter and Don are really amazing me, over and over again!)

Still, there's a lot of obvious stuff that seems trivial with a separate syntax, while either impossible or cumbersome with the current one. (But hey, with the rate W&D are going, all that will also be fixed by D 1.5.)
Feb 20 2006