www.digitalmars.com

D Programming Language 2.0


Last update Mon Dec 31 02:58:02 2012

Regular expressions

by Dmitry Olshansky, the author of std.regex

Introduction

String processing is a kind of daily routine that most applications do in a one way or another. It should come as no wonder that many programming languages have standard libraries equipped with a variety of specialized functions for common string manipulation needs. The D programming language standard library among others offers a nice assortment in std.string, as well as generic functions from std.algorithm that work with strings. Still no amount of fixed functionality could cover all needs, as naturally flexible text data needs flexible solutions.

Here is where come in handy, often succinctly called as regexes. Regexes are simple yet powerful language for defining patterns of strings, put together with a substitution mechanism, they form a Swiss Army knife of text processing. They are considered so useful that a number of languages provides built-in support for regular expressions. This does not nessary mean that built-in implies faster processing or more features. It's more a matter of providing convenient and friendly syntax for typical operations and usage patterns.

The D programming language provides a standard library module std.regex. Being a highly expressive systems language, D allows regexes to be efficiently implemented within the language itself, yet have the same level of readability and usability as built-in version would provide. We'll see below how close to built-in regexes look and feel we can achieve.

By the end of article you'll have a good understanding of regular expression capabilities in this library, and how to utilize its API in a most straightforward way. Examples in this article assume that the reader has fairly good understanding of regex elements, yet it's not required to understand the API.

A warm up

How do you check if something is a phone number by looking at it?

Yes, it's something with numbers, and there may be a country code in front of that... Ok, going for an international format should make it more strict. As this is the first time, let's put together a full program:

import std.stdio, std.regex;
void main(string argv[]){
    string phone = argv[1]; //assuming phone is passed as first argument on the command line
    if(match(phone, r"^\+[1-9][0-9]* [0-9 ]*$"))
        writeln("It looks like a phone number.");
    else
        writeln("Nope, it's not a phone number.");
}
And that's it! Just keep in mind the boundaries of regular expressions power - to truly establish a validness of a phone number, one has to try dialing it or contact the authority.

Now, come to think of it, this tiny sample showed a lot of useful things already:

Continuing with the phone number example, it could be useful to get the exact value of country code, as well as the whole number. For the sake of experiment let's also explicitly store compiled regex pattern via regex to see how it works.

string phone = "+31 650 903 7158"; //fictional, any coincidence is just that
auto phoneReg = regex(r"^\+([1-9][0-9]*) [0-9 ]*$");
auto m = match(phone, phoneReg);
assert(m);
assert(m.captures[0] == "+31 650 903 7158");
assert(m.captures[1] == "31");
//you shouldn't really care about the type of regex object,
//but here it is
static assert(is(typeof(phoneReg) : Regex!char));

To search and replace

There is a frequent need to find all matches in piece of text. Picking an easy task, let's see how to filter out all white space-only lines. There is no special routine for looping over input like search() or similar as found in some libraries. Instead std.regex provides a natural syntax for looping over all matches via plain foreach.

auto buffer = std.file.readText("regex.d");
foreach(m; match(buffer, regex(r"^.*[^\p{WhiteSpace}]+.*$","gm"))){
    writeln(m.hit); //hit is a shorthand for m.captures[0]
}

It just works as the result of match is an input range. An element type is Captures for the string type being used, it is a random access range of submatches. I won't go into full detail of the range concept, suffice to say, that it's a primary way of representing sequences of data in D. Random access here implies one can peek at any element with direct indexing. To have a feeling of how this affects the API:

auto m = match("Ranges are hot!", r"(\w)\w*(\w)"); //at least 3 "word" symbols
assert(m.front[0] == "Ranges");
//m.captures is a historical alias for the first element of match range (.front).
assert(m.captures[1] == m.front[1]);
auto captures = m.front;
assert(captures[2] == "s");
foreach(c; captures)
    writeln(c); //will show lines: Ranges, R, s

By playing by the rules std.regex gets some nice benefits in interaction with other modules e.g. this is how one could count non-empty lines in text buffer (just don't forget to import std.algorithm when trying that at home):

auto buffer = std.file.readText("regex.d");
int count = count(match(text, regex(r"^.*\P{WhiteSpace}+.*$","gm")));

This by the way tells me that std.regex has 7128 non-blank lines as of this writing. But let's get back to the regular expression itself. A seasoned regex user catches instantly that Unicode properties are supported with perl-style \p{xxx}, to spice that all of Scripts and Blocks are supported. Let us dully note that \P{xxx} means not having an xxx property, i.e here not a white space character. Refer to the Unicode standard UTS 18 for full list and specific details.

Another thing of importance is the option string - "gm", where g stands for global and m for multi-line mode. While global is obvious in this context, what multi-line mode does deserves some explanation.

Historically utilities that supported regex patterns (unix grep, sed, etc.) processed text line by line. At that time anchors like ^, $ meant the start of the whole input buffer that has been same as that of line. As regular expressions became more ubiquitous the need to recognize multiple lines in a chunk of text, with anchors ^ & $ matching before and after line break literally. For the curious, line break is defined as (\n | \v | \r | \f | \u0085 | \u2028 | \u2029 | \r\n). Needless to say, one need not to turn on multi-line mode if not using any of ^, $.

Now that search was covered, the topic suggest that it's about time to do some replacements. For instance to replace all dates in a text from "MM/dd/YYYY" format to a sortable version of "YYYY-MM-dd":

auto text = readText(...);
auto replaced = replace(text,
    regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})","g"), "$3-$1-$2");

The "all dates" part comes from the fact that regex has global option set, omit it to replace only the first occurrence. As seen the replacement is controlled by format string not unlike one in C's famous printf. The $1, $2, $3 substituted with content of sub-expresions. Aside from referencing sub-matches, one can include the whole part of input preceding the match via $` and $' for the content following right after the match.

Ok, now let's aim for something bigger, this time to show that std.regex can do things that are unattainable by classic textual substitution alone. Imagine you want to translate a web shop catalog so that it display prices in your currency. Yes, one can use calculator or estimate it, once having current ratio. Being programmers we can do better, so let's wrap up a simple program that converts text to use correct prices everywhere. For the sake of example let it be UK pounds and US dollars.

import std.conv, std.regex, std.range, std.file, std.stdio;
import std.string : format;

void main(string[] argv){
    immutable ratio = 1.5824; //UK pounds to US dollar as of this writing
    auto to_dollars(Captures!string price){
        real value = to!real(price["integer"]);
        if(!price["fraction"].empty)
            value += 0.01*to!real(price["fraction"]);
        return format("$%.2f",value * ratio);
    }
    string text = std.file.readText(argv[1]);
    auto converted = replace!to_dollars(text,
            regex(r"£\s*(?P<integer>[0-9]+)(\.(?P<fraction>[0-9]{2}))?","g"));
    write(converted);
}

Getting current conversion rates and supporting more currencies is left as an exercise for the reader. What at work here is so-called replace with delegate, analogous to a callout ability found in other languages and regex libraries. The magic is simple: whenever replace finds a match it calls a user supplied callback on the captured piece, then it uses the return value as replacement. And so on for other matches, given the "g" flag of regex.

I couldn't resist to spice this example up with yet another feature - named groups. Names work just like opaque aliases for numbers of captured subexpressions. Meaning that with the same exact regular expression one could as well change these lines to:

real value = to!real(price[1]);
if(!price[3].empty)
            value += 0.01*to!real(price[3]);
Though that lacks readability and is not future proof. Also note that optional captures are still represented, it's just they can be an empty string if not matched.

More features, more options

As core functionallity was already presented, let's move on to extras. Sometimes it's useful to do almost the opposite of searching - split up input using regex as separator. Like the following sample, that outputs text by sentences:

foreach(sentence; splitter(argv[1], regex(r"(?<=[.?!]+(?![?!]))\s*")))
    writeln(sentence);

Again the type of splitter is range, thus allowing foreach iteration. Notice the usage of lookaround in regex, it's a neat trick here as stripping off final punctuation is not our intention. Breaking down this example, (?<=[.?!]) part looks behind for first ., ? or !. This get us half way to our goal because \s* also matches between elements of punctuation like "?!", so a negative lookahead is introduced inside lookbehind to make sure we are past all of the punctuation marks. Admittedly, barrage of ? and ! makes this regex rather obscure, more then it's actually is. Observe that there are no restrictions on contents of lookaround expressions, one can go for lookahead inside lookbehind and so on. However in general it's recommended to use them sparingly, keeping them as weapon of last resort.

Now for something completely different yet exactly the same. For one, there is an ability to precompile constant regex at compile-time:

static r = regex("Boo-hoo");
assert(match("What was that? Boo-hoo?", r));

Importantly it's the exact same Regex object that works through all of the API we've seen so far. It's just takes less then 1 μs of run-time to initialize. Run-time version took around 10-20 μs on my machine, keep in mind that it was a trivial pattern.

Now stepping futher in this direction there is an ability to construct specialized native machine code that matches a given regex to speed up matching. All of this at compile time, and again the module usage stays consistent!

Recalling the phone number example:

//It's only a 5 characters difference!
string phone = "+31 650 903 7158";
auto phoneReg = ctRegex!r"^\+([1-9][0-9]*) [0-9 ]*$";
auto m = match(phone, phoneReg);
assert(m);
assert(m.captures[0] == "+31 650 903 7158");
assert(m.captures[1] == "31");

In this particular case it matched roughly 50% faster for me though I haven't done comprehensive analysis of this case. Not to spoil the excitement, but as of this writing the feature is experimental and temporarily lacks some extensions. Continuing the gloomy trend, be warned that compiler might choke on it, and it's also less tested. That being said, there is no doubt ctRegex facility will only improve over time. So start with run-time version, then go compile-time if willing to experiment or when need to squeeze out some more performance.

Conclusion

The article represents a walkthrough of std.regex focused on showcasing the API. By following a series of easy yet meaningful tasks, its features are exposed in combinations, that underline the elegance and flexibility of this library solution. The good thing that not only the API is natural, but it also follows established standards and integrates well with the rest of Phobos. Putting together its major features for a short-list, std.regex is:

The format of this article is intentionally more of an overview, as such it doesn't stop to talk in-depth about certain capabilities like case-insensitive matching (via simple casefolding rules of Unicode), backreferences, lazy quantifiers. And even more features are coming, as there are interesting opportunities not exploited.





Forums | Comments |  D  | Search | Downloads | Home