digitalmars.D.learn - hyperlink regular expression pattern

John C <johnch_atms hotmail.com> writes:

I want to split an HTML anchor tag into its constituent parts. I have a regular
expression pattern that works with .NET's Regex class, but not with std.regexp
- it errors out with "*+? not allowed in atom". I think this means something in
the pattern is non-standard.

Here's my code:

if (auto m = std.regexp.search(
  "<a href=\"www.google.com\">Google</a>", 
  r"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>")) {
  string url = m.match(1);
  string name = m.match(2);
}

The problematic parts are "?<url>" and "?<name>" - but not being a whiz with
regular expressions, I don't know what to use instead.

Perhaps someone's got a better pattern they could post?

John.

May 09 2008

novice2 <sorry noem.ail> writes:

 Perhaps someone's got a better pattern they could post?

this works for me (dmd 1.029)

import std.regexp;

void main()
{
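 // Group 1 is the optional quote, group 2 the href value, group 3 the link text.
 // The backreference \1 (written \\1 in the D string) requires the closing
 // quote to be the same character as the opening one.
 if (auto m = std.regexp.search(
              "<a href=\"www.google.com:8080/dfs?a1=1&a2=2\">This is Google link</a>",
              "<a[^>]+href=(['\"]?)(.*?)\\1.*?>(.*)</a>"))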
 {
   for(int i=0; i<10; i++)
   {
     printf("%d=\"%.*s\"\n", i, m.match(i));
   }
 }
}
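
For anyone reading this on a current D compiler, where std.regex has taken the place of std.regexp, here is a minimal sketch of the same pattern with the named groups John originally wanted. It assumes std.regex and its (?P<name>...) syntax for named groups; the helper name extractLink is hypothetical and not from this thread.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: returns [url, text] for the first anchor in html,
// or null when no anchor is found.
string[] extractLink(string html)
{
    // Group 1 (unnamed) captures the optional opening quote; \1 requires the
    // same character to close the href value. The other two groups are named.
    auto re = regex(`<a[^>]+href=(['"]?)(?P<url>.*?)\1[^>]*>(?P<name>.*?)</a>`);
    auto m = matchFirst(html, re);
    if (m.empty)
        return null;
    return [m["url"], m["name"]];
}

void main()
{
    auto parts = extractLink(
        `<a href="www.google.com:8080/dfs?a1=1&a2=2">This is Google link</a>`);
    if (parts !is null)
        writefln("url=\"%s\" name=\"%s\"", parts[0], parts[1]);
}
```

Named groups are still numbered, so m[2] and m[3] would give the same two values if you prefer indexes.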

May 09 2008

John C <johnch_atms hotmail.com> writes:

novice2 Wrote:

 Perhaps someone's got a better pattern they could post?

 
 this works for me (dmd 1.029)
 
 

Thanks - that seems to extract the href and text. What about getting other
attributes like name and title, as in this link:

<a href="www.google.com" name="googleLink" title="Click Me">Google Link</a>

May 09 2008

novice2 <sorry noem.ail> writes:

John C Wrote:

 Thanks - that seems to extract the href and text.

You got what you asked for :)
"The question is half of the answer" :)

 What about getting other attributes like name and title, as in this link:

 <a href="www.google.com" name="googleLink" title="Click Me">Google Link</a>

IMHO, it can't be done with a single regexp match, because the attributes can
appear in any order. You should first grab the whole attribute string of the
<a> tag, then iterate over the attributes in it, something like the code below.
Sorry, but it can't catch attributes without quotes.
Maybe plain std.string functions (no regexp) would be better for parsing the
attributes.

//////
import std.regexp;
import std.stdio;

void main()
{
 // First match: group 1 is the whole attribute string, group 2 the tag content.
 if (auto m = std.regexp.search(
              "<anothertag><a href=\"www.google.com:8080/dfs?a1=1&a2=2\" name='google Link' color=red title=\"Click Me\">This is Google link</a></anothertag>",
              "<a(\\s.*?)>(.*?)</a>"))
 {
   writefln("tag attributes: \"%s\"", m.match(1));
   writefln("tag content: \"%s\"", m.match(2));
   // Second pass: walk the attribute string one name=value pair at a time.
   // Group 2 is the optional opening quote and \\2 must match it again.
   foreach(s; RegExp("(\\S+?)=(['\"]?)(.*?)\\2").search(m.match(1)))
   {
     writefln("found attribute: name=\"%s\", value=\"%s\"", s.match(1), s.match(3));
   }
 }
}
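
The same two-step idea (grab the attribute string, then iterate over it) can be sketched with the newer std.regex module, collecting the pairs into an associative array. This is only a sketch under assumptions: it uses std.regex rather than the std.regexp module above, the helper name parseAttributes is hypothetical, and the quote is made mandatory, so unquoted attributes such as color=red are simply skipped.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: collects quoted name=value pairs from an attribute string.
string[string] parseAttributes(string attributeText)
{
    string[string] attrs;
    // (['"]) captures the opening quote; \2 requires the same quote to close.
    foreach (a; matchAll(attributeText, regex(`(\S+?)=(['"])(.*?)\2`)))
        attrs[a[1]] = a[3];
    return attrs;
}

void main()
{
    auto html = `<a href="www.google.com" name='google Link' title="Click Me">This is Google link</a>`;

    // Step 1: split the tag into its attribute string and its content.
    auto tag = matchFirst(html, regex(`<a(\s[^>]*)>(.*?)</a>`));
    if (tag.empty)
        return;

    writeln("content: ", tag[2]);
    // Step 2: iterate over the attributes; associative array order is unspecified.
    foreach (name, value; parseAttributes(tag[1]))
        writefln("%s = \"%s\"", name, value);
}
```

With the attributes in an associative array, the order they appear in the tag no longer matters: look up attrs["title"] or attrs["name"] directly, and use the in operator to check whether an attribute was present at all.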

May 09 2008
