D - regexp suggestion
- Pavel Minayev (14/14) Feb 08 2002 It would be really nice to have a method of RegExp similar to test(),
- Walter (4/18) Feb 08 2002 I believe you can already do that with regexp by looking at the match ar...
- Pavel Minayev (4/6) Feb 08 2002 array
- Walter (3/9) Feb 08 2002 You can also use the "g" attribute.
- Pavel Minayev (4/5) Feb 08 2002 Sorry, I'm not very familiar with regexp... how is
- Walter (4/9) Feb 09 2002 If you use the "g" attribute to the RegExp constructor, and repeated cal...
- Pavel Minayev (4/6) Feb 09 2002 But doesn't it try to search for the regexp further if it doesn't
- Walter (4/10) Feb 09 2002 calls
- Pavel Minayev (14/17) Feb 09 2002 Then I don't understand how it can be used to tokenize the string.
- Sean L. Palmer (6/24) Feb 09 2002 I think sscanf could do this if it could return a pointer to how far it ...
- Pavel Minayev (5/8) Feb 09 2002 got
- Sean L. Palmer (7/15) Feb 09 2002 sscanf has a lot more power than most people realize. I myself didn't
- Walter (7/13) Feb 09 2002 If you're changing the regular expression you're searching for, which is
- Pavel Minayev (33/38) Feb 09 2002 for
- Pavel Minayev (5/9) Feb 09 2002 Sorry =) This should of course look:
- Walter (6/6) Feb 09 2002 All you have to do is:
- Karl Bochert (12/23) Feb 09 2002 Looks really awkward. Why doesn't the RegExp class have some query functi...
- Pavel Minayev (5/11) Feb 10 2002 If the first token will be r2, and not r1, but there are some r1s
- Walter (7/21) Feb 10 2002 Yes, but if you are using multiple RegExp's on the same string, you need...
- Pavel Minayev (14/17) Feb 10 2002 two
- Karl Bochert (16/37) Feb 10 2002 I may be missing the point here but:
- Pavel Minayev (10/13) Feb 10 2002 overall
- Karl Bochert (37/57) Feb 10 2002 I probably have some details wrong here, but
- Pavel Minayev (5/23) Feb 10 2002 Yep, right. Now I have all the tokens, how do I determine
- Walter (5/9) Feb 10 2002 There is no difference if the global attribute is set. If the global
- Karl Bochert (6/18) Feb 10 2002 I think I understand. match() without the global attribute set finds al...
- Walter (5/14) Feb 10 2002 That's not a problem with parenthesized subexpressions. You can tell whi...
- Pavel Minayev (4/7) Feb 10 2002 Walter, where is that match[][] thing? match() returns char[][], which
- Walter (6/13) Feb 10 2002 which
- Pavel Minayev (9/11) Feb 11 2002 char[][] is the list of tokens, or, to be more exact, the list of their
- Walter (14/21) Feb 11 2002 Suppose
- Karl Bochert (9/36) Feb 11 2002 Or:
It would be really nice to have a method of RegExp similar to test(), but only matching regexp at the position given, not advancing further on error, and returning number of bytes read (or 0 on failure). It could be used for easy token parsing:

    RegExp identifier = new RegExp('\w', "");
    char[] code, token;
    int pos;
    ...
    int count = identifier.get(code, pos);
    if (count)
    {
        token = code[pos .. pos + count];
        pos += count;    // next token
    }
Feb 08 2002
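For readers who want to see the requested behaviour in running code, here is a minimal sketch in Python rather than D. It is not the D RegExp API: the helper name match_at is hypothetical, and the identifier pattern is an assumption chosen for the demo. Python's compiled patterns have a match(text, pos) method that only tries the given position and never scans ahead, which is the semantics asked for above.

```python
import re

def match_at(pattern, text, pos):
    """Return the length of the match starting exactly at pos, or 0 if none."""
    m = pattern.match(text, pos)  # anchored at pos; does not search further on failure
    return m.end() - pos if m else 0

# Assumed identifier pattern for the demo: a letter or underscore, then word characters.
identifier = re.compile(r"[A-Za-z_]\w*")

code = "foo123 = bar456"
pos = 0
count = match_at(identifier, code, pos)
if count:
    token = code[pos:pos + count]
    pos += count  # advance to the next token
```

Returning 0 on failure mirrors the proposal, with the caveat that a pattern able to match the empty string would be indistinguishable from a failed match.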
I believe you can already do that with regexp by looking at the match array and using it to slice the input array.

"Pavel Minayev" <evilone omen.ru> wrote in message news:a41ccn$2m50$1 digitaldaemon.com...
> It would be really nice to have a method of RegExp similar to test(), but only matching regexp at the position given, not advancing further on error, and returning number of bytes read (or 0 on failure). It could be used for easy token parsing:
>
>     RegExp identifier = new RegExp('\w', "");
>     char[] code, token;
>     int pos;
>     ...
>     int count = identifier.get(code, pos);
>     if (count)
>     {
>         token = code[pos .. pos + count];
>         pos += count;    // next token
>     }
Feb 08 2002
"Walter" <walter digitalmars.com> wrote in message news:a41imc$2pnk$1 digitaldaemon.com...
> I believe you can already do that with regexp by looking at the match array and using it to slice the input array.

Yes, but it's sloooooow!
Feb 08 2002
You can also use the "g" attribute.

"Pavel Minayev" <evilone omen.ru> wrote in message news:a41jep$2q3p$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a41imc$2pnk$1 digitaldaemon.com...
>> I believe you can already do that with regexp by looking at the match array and using it to slice the input array.
> Yes, but it's sloooooow!
Feb 08 2002
"Walter" <walter digitalmars.com> wrote in message news:a41oek$2se5$1 digitaldaemon.com...
> You can also use the "g" attribute.

Sorry, I'm not very familiar with regexp... how is it supposed to do what I want?
Feb 08 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a42jse$6h1$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a41oek$2se5$1 digitaldaemon.com...
>> You can also use the "g" attribute.
> Sorry, I'm not very familiar with regexp... how is it supposed to do what I want?

If you pass the "g" attribute to the RegExp constructor, repeated calls to exec() will each pick up where the previous one left off.
Feb 09 2002
"Walter" <walter digitalmars.com> wrote in message news:a42tc9$hrc$1 digitaldaemon.com...
> If you pass the "g" attribute to the RegExp constructor, repeated calls to exec() will each pick up where the previous one left off.

But doesn't it try to search for the regexp further if it doesn't match at the current position?
Feb 09 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a433vk$l3i$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a42tc9$hrc$1 digitaldaemon.com...
>> If you pass the "g" attribute to the RegExp constructor, repeated calls to exec() will each pick up where the previous one left off.
> But doesn't it try to search for the regexp further if it doesn't match at the current position?

Yes.
Feb 09 2002
"Walter" <walter digitalmars.com> wrote in message news:a43tq3$11uk$2 digitaldaemon.com...
>> But doesn't it try to search for the regexp further if it doesn't match at the current position?
> Yes.

Then I don't understand how it can be used to tokenize the string. Suppose I have:

    foo123 = bar456 + 789;

Now I first search for identifiers, and get "foo123" and "bar456". Then I search for numbers and get "123", "456" and "789" - and only the last is correct...

With my suggestion implemented, however, it'd look somewhat different. First I check for an identifier, and get "foo123". Now I advance past the end of that token, and perform another check... when I get to "789", I check if it matches an identifier /\w.../ - it doesn't, so I check if it is a number /[0-9]+/ and succeed... that's how it is supposed to work.
Feb 09 2002
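The position-anchored loop described in the post above can be sketched as follows. This is a Python illustration under stated assumptions, not D code: the TOKEN_TYPES table, its pattern definitions, and the tokenize function are all hypothetical names invented for the example. Each pattern is tried only at the current position, and the first one that matches decides the token type, so "123" inside "foo123" is never reported as a number.

```python
import re

# Hypothetical token table; order matters because the first pattern that matches wins.
TOKEN_TYPES = [
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("number", re.compile(r"[0-9]+")),
    ("symbol", re.compile(r"[=+;]")),
]

def tokenize(text):
    """Split text into (type, value) pairs, matching only at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        for kind, pattern in TOKEN_TYPES:
            m = pattern.match(text, pos)  # anchored at pos, never scans ahead
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character at position {pos}: {text[pos]!r}")
    return tokens

result = tokenize("foo123 = bar456 + 789;")
```

Because the type is known at the moment of the match, no second classification pass over the tokens is needed.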
I think sscanf could do this if it could return a pointer to how far it got in the input string during processing in addition to how many fields were converted. sscanf as it exists in C is not so useful.

Sean

"Pavel Minayev" <evilone omen.ru> wrote in message news:a443lq$147s$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a43tq3$11uk$2 digitaldaemon.com...
>>> But doesn't it try to search for the regexp further if it doesn't match at the current position?
>> Yes.
> Then I don't understand how it can be used to tokenize the string. Suppose I have:
>
>     foo123 = bar456 + 789;
>
> Now I first search for identifiers, and get "foo123" and "bar456". Then I search for numbers and get "123", "456" and "789" - and only the last is correct...
>
> With my suggestion implemented, however, it'd look somewhat different. First I check for an identifier, and get "foo123". Now I advance past the end of that token, and perform another check... when I get to "789", I check if it matches an identifier /\w.../ - it doesn't, so I check if it is a number /[0-9]+/ and succeed... that's how it is supposed to work.
Feb 09 2002
"Sean L. Palmer" <spalmer iname.com> wrote in message news:a444t2$14qa$1 digitaldaemon.com...
> I think sscanf could do this if it could return a pointer to how far it got in the input string during processing in addition to how many fields were converted. sscanf as it exists in C is not so useful.

Also if sscanf understood regexps... =) That's why I suggest RegExp.scan();
Feb 09 2002
sscanf has a lot more power than most people realize. I myself didn't discover a lot of it until recently. But it won't tell you where it got to in the string.

Sean

"Pavel Minayev" <evilone omen.ru> wrote in message news:a447tq$161o$1 digitaldaemon.com...
> "Sean L. Palmer" <spalmer iname.com> wrote in message news:a444t2$14qa$1 digitaldaemon.com...
>> I think sscanf could do this if it could return a pointer to how far it got in the input string during processing in addition to how many fields were converted. sscanf as it exists in C is not so useful.
> Also if sscanf understood regexps... =) That's why I suggest RegExp.scan();
Feb 09 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a443lq$147s$1 digitaldaemon.com...
> With my suggestion implemented, however, it'd look somewhat different. First I check for an identifier, and get "foo123". Now I advance past the end of that token, and perform another check... when I get to "789", I check if it matches an identifier /\w.../ - it doesn't, so I check if it is a number /[0-9]+/ and succeed... that's how it is supposed to work.

If you're changing the regular expression you're searching for, which is what you're doing by switching from looking for an identifier to looking for a number, you'll need to create a new RegExp for each different regular expression. Then apply them as required to the remainder of the input string.
Feb 09 2002
"Walter" <walter digitalmars.com> wrote in message news:a446n4$15hm$1 digitaldaemon.com...
> If you're changing the regular expression you're searching for, which is what you're doing by switching from looking for an identifier to looking for a number, you'll need to create a new RegExp for each different regular expression. Then apply them as required to the remainder of the input string.

I pre-create them all in the form of an array:

    RegExp[] tokens;
    static this()
    {
        tokens = new RegExp('\w+', ""),    // word
                 new RegExp('\d+', ""),    // number
                 ...
    }

Now how do I apply them to the remainder of the input string (whatever this means)? I can of course first retrieve identifiers, and remove them from the array, then get rid of numbers, symbols... etc. But it would be damn slow.

This could also be done by a "regexp comparison" function, if there were one:

    // read a token
    for (int i = 0; i < tokens.length; i++)
    {
        // RegExp.cmp() returns the number of chars at the beginning
        // of given string that match the regexp, or 0 if no match
        int len = tokens[i].cmp(text[pos .. text.length]);
        if (len)    // match!
        {
            token = text[pos .. pos + len];
            pos += len;
            break;
        }
    }

Regexp comparison is a good idea anyhow, IMO. Can be used for lots of different things.
Feb 09 2002
>     tokens = new RegExp('\w+', ""),    // word
>              new RegExp('\d+', ""),    // number
>              ...

Sorry =) This should of course look like:

    tokens = new RegExp('\w+', "") ~    // word
             new RegExp('\d+', "") ~    // number
             ...
Feb 09 2002
All you have to do is:

    r1 = new RegExp(...);
    m1 = r1.match(input);
    if (m1.length)
        m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);

and so on...
Feb 09 2002
On Sat, 9 Feb 2002 15:56:56 -0800, "Walter" <walter digitalmars.com> wrote:
> All you have to do is:
>
>     r1 = new RegExp(...);
>     m1 = r1.match(input);
>     if (m1.length)
>         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
>
> and so on...

Looks really awkward. Why doesn't the RegExp class have some query functions to hide the gore?

    r1 = new RegExp (...);
    r1.exec(input);
    x = r1.matches ();       // returns number of parenthesized matches
    tail = r1.tail ();       // returns portion of input after match
    m1 = r1.getMatch (n);    // returns the nth matching substring

Regular expressions are very powerful but can also be very complicated. Shouldn't the class help by providing well-named queries? In addition it would be more like PCRE, which is already well understood.

Karl Bochert
Feb 09 2002
"Walter" <walter digitalmars.com> wrote in message news:a44fdn$18t6$1 digitaldaemon.com...
> All you have to do is:
>
>     r1 = new RegExp(...);
>     m1 = r1.match(input);
>     if (m1.length)
>         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
>
> and so on...

If the first token matches r2, and not r1, but there are some r1s further in the string, the first match() will skip the r2 and get the r1.
Feb 10 2002
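The objection above is easy to reproduce. The sketch below uses Python's re module as a stand-in (an assumption, since the D RegExp class discussed here is not available in that form): search() scans forward like an unanchored match, while match() only tries the starting position. The names r1, r2, and the sample text are hypothetical.

```python
import re

r1 = re.compile(r"[A-Za-z_]\w*")  # identifier
r2 = re.compile(r"[0-9]+")        # number
text = "789 + foo"

# Unanchored search: skips over the number and finds the identifier further along.
searched = r1.search(text)

# Anchored match: reports that no identifier starts at position 0.
anchored = r1.match(text)

# Only now is it safe to try the next pattern at the same position.
number = r2.match(text)
```

With the unanchored search, the leading "789" is silently skipped, which is exactly the case where a tokenizer would lose a token.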
"Pavel Minayev" <evilone omen.ru> wrote in message news:a45a2l$1kk4$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a44fdn$18t6$1 digitaldaemon.com...
>> All you have to do is:
>>
>>     r1 = new RegExp(...);
>>     m1 = r1.match(input);
>>     if (m1.length)
>>         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
>>
>> and so on...
> If the first token matches r2, and not r1, but there are some r1s further in the string, the first match() will skip the r2 and get the r1.

Yes, but if you are using multiple RegExp's on the same string, you need to decide which slices get searched for which patterns. If you are using one RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.
Feb 10 2002
"Walter" <walter digitalmars.com> wrote in message news:a45e05$1m8o$1 digitaldaemon.com...
> RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.

This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"... of course a check can be done for starting position == 0 - which involves too many checks, IMO, or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in RegExp.match(), then by my type detection routine. Wouldn't it be slow?

I'm not asking for much... just the version of test() with the for-loop removed.
Feb 10 2002
On Sun, 10 Feb 2002 17:54:52 +0300, "Pavel Minayev" <evilone omen.ru> wrote:
> "Walter" <walter digitalmars.com> wrote in message news:a45e05$1m8o$1 digitaldaemon.com...
>> RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.
> This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"... of course a check can be done for starting position == 0 - which involves too many checks, IMO, or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in RegExp.match(), then by my type detection routine. Wouldn't it be slow?
>
> I'm not asking for much... just the version of test() with the for-loop removed.

I may be missing the point here but:

The power of regular expressions is their ability to search for multiple patterns at once. If the next thing in the input is either a number or a word which could have embedded digits then

    "\w[\w\d]*" matches a word
    "\d+" matches a number
    "(\w[\w\d]*)|(\d+)" matches a word or a number
    and "[\t ]*(\w[\w\d]*)|(\d+)" matches any spaces followed by a word or a number.

In the last 2 cases, the result of the search is up to 3 substrings: the overall match, and the substrings within the parentheses. Perform the search and then the lengths of the substrings will tell you what you found.

Documentation on standard regex's can be found at:
    http://compy.ww.tu-berlin.de/doc/packages/pcre/pcre.html
among many other places.
Feb 10 2002
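A rough sketch of the "one pattern, several parenthesized alternatives" idea from the post above, written in Python as an illustration rather than as the D RegExp API. Two details are assumptions made for the demo: the word alternative is written as [A-Za-z_]\w* because \w on its own also matches digits in most engines (so "\w[\w\d]*" would accept "123" as a word), and the alternation is wrapped in a non-capturing group so the optional leading whitespace applies to both branches. The function name classify is hypothetical.

```python
import re

# Group 1 captures a word, group 2 captures a number; leading whitespace is skipped.
word_or_number = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\d+))")

def classify(text, pos=0):
    """Return (kind, value, end) for the token at pos, or None if nothing matches."""
    m = word_or_number.match(text, pos)
    if m is None:
        return None
    # The group that participated in the match tells us which alternative won.
    kind = "word" if m.group(1) is not None else "number"
    return kind, m.group(m.lastindex), m.end()

first = classify("bar123 456")
second = classify("bar123 456", first[2])
```

Here the token type falls out of which capture group is set, so each token is examined only once.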
"Karl Bochert" <kbochert ix.netcom.com> wrote in message news:1103_1013361883 bose...
> In the last 2 cases, the result of the search is up to 3 substrings: the overall match, and the substrings within the parentheses. Perform the search and then the lengths of the substrings will tell you what you found.

How can these lengths tell? Token type is determined by the forming characters (described by regexp in my case), not by the length - or am I missing something? Suppose the input was:

    foo bar123 456 baz

Now I get the following tokens:

    "foo", "bar123", "baz", "123", "456"

How do I know that "123" is not supposed to be here?
Feb 10 2002
On Sun, 10 Feb 2002 20:32:11 +0300, "Pavel Minayev" <evilone omen.ru> wrote:
> "Karl Bochert" <kbochert ix.netcom.com> wrote in message news:1103_1013361883 bose...
>> In the last 2 cases, the result of the search is up to 3 substrings: the overall match, and the substrings within the parentheses. Perform the search and then the lengths of the substrings will tell you what you found.
> How can these lengths tell? Token type is determined by the forming characters (described by regexp in my case), not by the length - or am I missing something? Suppose the input was:
>
>     foo bar123 456 baz
>
> Now I get the following tokens:
>
>     "foo", "bar123", "baz", "123", "456"
>
> How do I know that "123" is not supposed to be here?

I probably have some details wrong here, but

Declare a regular expression:

    p = Regexp( "(\w[\w\d]*)|(\d+)" )

then:

    p.match ("123test")

produces 3 substrings:

    "123"    -- the overall match
    ""       -- the match for the first set of parens
    "123"    -- the match for the second set of parens

In PCRE (the common C implementation) the substrings are returned as an array of pointers into the string (6 in this case). I suspect D returns an equivalent array of offsets (slices?) into the string? The non-zero length of the third substring shows that a number ("\d+") was found.

In your example:

    p.exec ("foo bar123 baz 123");    produces: "foo" "foo" ""
    p.exec ("bar123 baz 123");        produces: "bar123" "bar123" ""
    p.exec ("123 456");               produces: "123" "" "123"

I have used exec() here because it is probably the same as PCRE's exec function. I have read the RegExp documentation but do not understand the difference between the exec() and match() methods. Maybe match() is just exec() anchored to the start of the text?

Karl
Feb 10 2002
"Karl Bochert" <kbochert ix.netcom.com> wrote in message news:1103_1013375566 bose...
> In your example:
>
>     p.exec ("foo bar123 baz 123");    produces: "foo" "foo" ""
>     p.exec ("bar123 baz 123");        produces: "bar123" "bar123" ""
>     p.exec ("123 456");               produces: "123" "" "123"

Yep, right. Now I have all the tokens, how do I determine the _type_ of each (identifier, number, string...), with regexp describing those types?
Feb 10 2002
"Karl Bochert" <kbochert ix.netcom.com> wrote in message news:1103_1013375566 bose...
> I have used exec() here because it is probably the same as PCRE's exec function. I have read the RegExp documentation but do not understand the difference between the exec() and match() methods. Maybe match() is just exec() anchored to the start of the text?

There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.
Feb 10 2002
On Sun, 10 Feb 2002 15:47:34 -0800, "Walter" <walter digitalmars.com> wrote:
> "Karl Bochert" <kbochert ix.netcom.com> wrote in message news:1103_1013375566 bose...
>> I have used exec() here because it is probably the same as PCRE's exec function. I have read the RegExp documentation but do not understand the difference between the exec() and match() methods. Maybe match() is just exec() anchored to the start of the text?
> There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.

I think I understand. match() without the global attribute set finds all matches in the subject string, but loses the 'which substring' information. That might explain Pavel's problem -- to parse the next token and get its type info he should use exec() or global match().

Karl Bochert
Feb 10 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a461ka$1tv0$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a45e05$1m8o$1 digitaldaemon.com...
>> RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.
> This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, given its regexp.

That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression.
Feb 10 2002
"Walter" <walter digitalmars.com> wrote in message news:a470kt$2art$1 digitaldaemon.com...
> That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression.

Walter, where is that match[][] thing? match() returns char[][], which ain't what I need...
Feb 10 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a47ir2$2i4j$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a470kt$2art$1 digitaldaemon.com...
>> That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression.
> Walter, where is that match[][] thing? match() returns char[][], which ain't what I need...

It sounds like just what you need. I guess I just don't understand what's wrong.
Feb 10 2002
"Walter" <walter digitalmars.com> wrote in message news:a47r1i$2lhb$1 digitaldaemon.com...
> It sounds like just what you need. I guess I just don't understand what's wrong.

char[][] is the list of tokens, or, to be more exact, the list of their _values_. But how do I know their _types_ (string or number or ..)? Suppose the regexp was:

    ([A-Za-z_]+|[0-9]+)

And I get 10 tokens. How do I tell if the first matched the [A-Za-z_]+ part or the [0-9]+ part, without checking it separately (which results in two checks per token)?
Feb 11 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a485eh$2rbg$1 digitaldaemon.com...
> char[][] is the list of tokens, or, to be more exact, the list of their _values_. But how do I know their _types_ (string or number or ..)? Suppose the regexp was:
>
>     ([A-Za-z_]+|[0-9]+)
>
> And I get 10 tokens. How do I tell if the first matched the [A-Za-z_]+ part or the [0-9]+ part, without checking it separately (which results in two checks per token)?

You can tell which parenthesized subexpression matched by checking to see which index it was in:

    char[][] m;
    r = new RegExp("(a)|(b)", "g");    // search for "a" or "b"
    while ((m = r.exec("a b and a b")) != null)
    {
        if (m[1])
            ;    // matched an "a"
        else if (m[2])
            ;    // matched a "b"
    }
Feb 11 2002
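For comparison, the same "which subexpression matched" loop can be sketched in Python, where finditer() plays the role of repeated exec() calls on a global pattern. This is an analogy, not the D API; count_matches is a hypothetical helper written for the example.

```python
import re

pattern = re.compile(r"(a)|(b)")  # group 1 matches "a", group 2 matches "b"

def count_matches(text):
    """Count how often each alternative matched, using the capture groups."""
    counts = {"a": 0, "b": 0}
    for m in pattern.finditer(text):  # each step resumes where the last match ended
        if m.group(1) is not None:
            counts["a"] += 1          # first subexpression matched
        elif m.group(2) is not None:
            counts["b"] += 1          # second subexpression matched
    return counts

counts = count_matches("a b and a b")
```

Note that the "a" inside "and" is counted too, since the search is unanchored and resumes wherever the previous match ended.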
On Mon, 11 Feb 2002 14:57:58 -0800, "Walter" <walter digitalmars.com> wrote:
> "Pavel Minayev" <evilone omen.ru> wrote in message news:a485eh$2rbg$1 digitaldaemon.com...
>> char[][] is the list of tokens, or, to be more exact, the list of their _values_. But how do I know their _types_ (string or number or ..)? Suppose the regexp was:
>>
>>     ([A-Za-z_]+|[0-9]+)
>>
>> And I get 10 tokens. How do I tell if the first matched the [A-Za-z_]+ part or the [0-9]+ part, without checking it separately (which results in two checks per token)?
> You can tell which parenthesized subexpression matched by checking to see which index it was in:
>
>     char[][] m;
>     r = new RegExp("(a)|(b)", "g");    // search for "a" or "b"
>     while ((m = r.exec("a b and a b")) != null)
>     {
>         if (m[1])
>             ;    // matched an "a"
>         else if (m[2])
>             ;    // matched a "b"
>     }

Or:

    m = r.exec (...);
    switch (m.length)
    {
        case 0:    // no match
        case 2:    // matched 'a'
        case 3:    // matched 'b'
        ...

???
Feb 11 2002