digitalmars.com                        
Last update Sat Oct 7 20:19:42 2023

Component Programming in D

September 27, 2012

written by Walter Bright

The idea of writing reusable software is as old as programming, and is so well ingrained into programmers that we take it as an axiom. It’s accepted, and non-controversial. Of course we all strive to produce reusable software, after all, we’re professionals.

But as I look back on 35 years of programming, I note with chagrin that practically none of the code I’ve written has been usable in another project. I will ‘reuse’ code by copy-paste-modify, but that’s cheating, and even that doesn’t happen too often. Somehow, I seemed to have missed the boat somewhere. I ask other long time programmers, and many of them seem to have the same frustrating experience.

This starts out, then, as journey into figuring out what went wrong. Why does that compressor I wrote before not work in another project? Why is my macro expander not reusable? Why did I chuck all my carefully written UTF code? Why do I write symbol tables over (and over again)? Why is that disassembler I wrote completely unusable in another application? Why do I keep reinventing the wheel?

My first thought was that I am now a better programmer than I was, and so rewrites are justified. That’s just too easy a rationalization, though. A more careful examination turns up two troubling characteristics of my components:

  1. The abstractions are leaky. I have failed to encapsulate the problem, and the component’s dependencies have leaked out into the surrounding application, and the surrounding application’s dependencies have leaked into the component.
  2. The components are too specific. They are not about generic type T, but about specific type X. The implementation contains and takes advantage of specific information about X.

Things need some rethinking. Let’s go back to first principles, and think about what a ‘component’ is, anyway.

What is a Component?

One aspect of a component is that it is reusable software. But being reusable isn’t enough. In C, for example, there are lots of available reusable libraries for all kinds of things. But they aren’t necessarily components. A component has a predefined interface that it conforms to. Most C libraries roll their own unique interfaces, and the user wishing to use them has to build custom scaffolding for each one.

But what sort of defined interface could possibly be general enough to work for a wide range of software components?

Paring a program down to its fundamentals, they are:

  1. read input
  2. process that input
  3. write output

Even if a program itself doesn’t fit that model, it is usually composed of subsystems that conform. Let’s rewrite this in pseudo code:

source => algorithm => sink

Chaining things together makes sense:

source => algorithm1 => algorithm2 => sink

etc. Hmm. This looks awfully familiar. Where have we seen this before? Of course it’s the Unix command line, the files & filters model. Files form the sources and sinks, and the algorithms are called “filters”. The Unix command line is a brilliant innovation where the components are files and filter programs. Each filter program does one smallish job, and to get things done, the command line user strings them along, the output of each feeding into the input of the next, and the final sink is a file or the console.

For endless examples of how successful this component style is, see commandlinefu.com.

Since Unix is largely written in C, this model has found its way into common C usage, and is the ubiquitous file interface. Component programming in C relies on being able to express sources and sinks as files, and algorithms that operate on the file interfaces. This is so successful that there are often pseudo-file systems.

Of course there are limitations. It shows the usual characteristics of something that evolved while retaining backward compatibility, rather than being conceived up front, as anyone trying to figure out ioctl can attest. Worse, it wants to view data as a stream of bytes, and that’s awkward for a lot of uses as certain algorithms imply specific supporting data structures, as we’ll get into later

But we owe this design an enormous debt for showing what a component is and what component programming can do to save us time and money.

Looking Back At My Code

Now that I know what a component is, I look back at all my failures at reusable code and notice something else - it looks nothing at all like:

source => algorithm => sink

In fact, it looks like a bunch of nested loops. The source data enters at the top, and gets swirled around and around in ever smaller and tighter loops, and leaves via the sink in the center of that maelstrom.

For example, here’s a simple program in pseudo-code that reads a text file from stdin line by line, tests to see if each line matches a pattern, and writes those lines out to stdout:

void main(string[] args)
{
    string pattern = args[1];
    while (!feof(stdin))
    {
        string line = getLine(stdin);
        if (match(pattern, line))
            writeLine(stdout, line);
    }
}

No wonder it is not reusable. There’s no way to separate out the looping part from the source and that darned sink in the middle of it all can hardly be composed with other loops and sinks. Also bad is that it isn’t immediately obvious what it does.

Compare it with this pseudo-code version:

void main(string[] args)
{
    string pattern = args[1];
    stdin => byLines => match(pattern) => stdout;
}
The loops have vanished, along with the intermediate variable ‘line’ that was used to glue things together. The code is easier to reason about (and hence more likely to be correct), shorter, and is made up of composable items.

The Next Design

In the 1980s along came C++. C++ brought OOP (Object Oriented Programming) to the C world, but somehow C++ OOP did not produce better component programming. (There are many successful C++ OOP libraries that have stood the test of time, and can be viewed as component libraries, but as far as components as discussed here, I don’t think there was much improvement.) C++ did update the C file design and generalized it a bit with iostreams. Things started looking like our desired:

source => algorithm => sink

with iostream’s >> operator. But that just didn’t seem to catch on, either, and didn’t see much use outside of reading and writing files. I think the iostream style could have been generalized and expanded beyond merely serialization, but that never happened.

In the 1990s Alexander Stepanov arrived on the C++ scene and revolutionized it with his introduction of the C++ STL (Standard Template Library). At last, components could be more than files, and algorithms could be generic and automatically adapt themselves to the data types being processed. Components followed a common interface, could be composed, and could be compiled to highly efficient code.

But it doesn’t quite get us where we want to be:

  1. The syntax doesn’t resemble
    source => algorithm => sink
    

    it keeps winding up looking like loops

    for (i = L.begin(); i != L.end(); ++i)
       ... do something with *i ...
    

    just like my old C code.

    There is some improvement in getting rid of loops with std::for_each(), std::transform(), std::copy(), std::find(), etc., but there are composition problems with them. For example, std::find() returns an iterator, but where is the other end of the range? The components don’t fit together easily.

    Also, STL’s design was marred by the absence of lambda functions in C++ at the time. This rendered many STL-based idioms awkward, to the extent that STL containers tend to enjoy much more usage than STL’s higher-order algorithms.

  2. Iterators are an abstraction of pointers. But pointers don’t know where the beginning or end of their data is, so when using iterators one finds himself passing around pairs of iterators. This is exactly analogous to C’s passing arrays around as two separate items - a pointer to the start of the array, and the dimension of the array. (I have characterized this as C’s Greatest Mistake, a mistake that has unfortunately propagated forward into C++ and the STL design.) Needing to pass around two disjoint pieces of data to embody an abstraction (as opposed to raising the abstraction to first class status) hurts composition and forces designs into pedestrian manipulations of fragments. A better component design would combine the two.

Back To Basics

Going back to what we want from a component,

source => algorithm => sink
let’s list some specifics. What kinds of sources are there?
  1. streams - Streams provide data sequentially, generally with no history of what came before, and no knowledge of what comes after. Streams live in the now. Files typically present themselves as streams, as is the input from your keyboard, data coming from a network socket, data flowing from a sensor, etc.
  2. containers - Containers are fixed collections of data that have at last one method of iterating through all elements. Typical containers would include arrays, hash tables, trees and linked lists.
  3. generators - Generators are algorithms that produce a stream of data. Examples of generators would be random number generators, iotas, prime number generators, producing successive digits of Π, etc. Generators have a known starting point (a seed), and sometimes can produce an infinite succession of values.
  4. list comprehensions - A source passed through an algorithm (such as a filter for list comprehensions), is also a source.

What kinds of algorithms are there?

  1. filter - A filter takes its input, and produces some subset of that for its output. Examples include selecting only odd values or prime values, searching for regex matches, validating the data, or passing only employees with a security clearance.
  2. map - Maps take input data, transform it, and output the transformed data. Examples include capitalizing words, squaring numbers, and sorting the input.
  3. reduce - Reduce consumes its input and produces as output a single value. Examples include summing all the data, finding the maximum, etc.

These would be of course just a few examples of many. There are many algorithms, each with its own personality. Sorting, searching, pattern matching, encryption, compression, and more come to mind, not to mention a plethora of numeric algorithms. Notably, algorithms have specific demands on the layout of the data they operate on. Filter, map, and reduce all operate with simple streams, but many algorithms need more sophisticated data access patterns.

And sinks:

  1. streams - writing to files, the console, a network socket, etc.
  2. containers - the container gets built from the data.
Forming a table with three columns:
Source Algorithm Sink
filesortfile
tree filter tree
array map array
socket reduce socket
list max list
iota search
random numbersodd
word count

A perfect component design should be able to take any item from the Source column, apply any Algorithm, and output to any Sink, even if those Sources, Algorithms, and Sinks were developed independently by groups that have never heard of each other.

Furthermore, they need to be able to be snapped together (i.e. be composable) like:

file => sort => odd => list

for any combination that makes sense. Some data format transformation may be in order. For example, quicksort needs random access, something a streaming file doesn’t offer reasonably well. So we’d need to insert an adapter component that transforms an input stream into an array, leading to:

file => make_array => sort => odd => list

And, of course, we won’t accept performance compromises. C++ STL showed that component design can be just as efficient as hand-built non-component code, and often turns out to be even faster. Any component design based on indirection (i.e. virtual dispatch), boxing, or dynamic typing is going to have a harder time delivering the performance.

After all, giving users motivation to replace the abstraction with hand-coded equivalents when they ‘really mean it’ is a sign that something went wrong in the abstracting process. Components must work for industrial code.

Summing up what is needed for a modern, excellent component design:

  1. Ability to ‘snap together’ independently developed sources, algorithms, and sinks, i.e. composability
  2. Strong support for encapsulation
  3. Ability to generate code as efficient as hand-coded versions
  4. A natural syntax that looks like:
  5. source => algorithm => sink
    
  6. Ability to work with types that are not known in advance

The D Programming Language

Being a more recent programming language, D enjoys an invaluable opportunity to stand on the shoulders of giants and take advantage of new ideas, and put the flesh on them. Fundamentally improving component programming has enormous potential to change and significantly improve the way we write software.

Next I will focus on the design of components in D, and what underlying features enable it to work.

Component Programming in D part 2

In the first installment, we discussed the motivation for component programming, and what we’re looking for in a component programming design. Here is how the D standard library does it.

Sources

What are the minimum requirements for a source?

  1. determine if there is input data available
  2. read the current input datum
  3. advance to the next input datum

To satisfy them, an InputRange is defined as having the following:

  1. a property:
    bool empty;
    
    which returns true if there is no more input, and false if there is.
  2. a property:
    E front;
    
    which returns the current input datum of type E. Input ranges are parameterized by their element type (denoted here as E).
  3. a method:
    void popFront();
    
    which advances to the next input datum.

Note that this implies a one-element buffer because client code may call front multiple times without moving to the next element. There exists an even simpler notion, the unbuffered range; it only enables a modicum of algorithms and currently it is not formally abstracted in D’s standard library.

A very interesting thing about an InputRange is it isn’t itself a type. A type is an InputRange if it has the primitives empty, front, and popFront. This makes an InputRange a concept, rather than a type.

Since we all know C file I/O so well, here’s an InputRange that reads characters from stdin:

private import core.stdc.stdio;

struct StdinByChar {

    @property bool empty() {
        if (hasChar)
            return false;
        auto c = fgetc(stdin);
        if (c == EOF)
            return true;
        ch = cast(char)c;
        hasChar = true;
        return false;
    }

    @property char front() {
        return ch;
    }

    void popFront() {
        hasChar = false;
    }

  private:
    char ch;
    bool hasChar;
}

It has the expected one element buffer and a flag that keeps track of it.

The use of it is straightforward, here’s code to print it to stdout:

for (auto r = StdinByChar(); !r.empty; r.popFront()) {
    auto c = r.front;
    fputc(c, stdout);
}

It’s pretty clear that anything with those 3 primitives can be expressed in a for loop the same way, such that the D foreach statement can automatically do the equivalent with:

foreach (c; StdinByChar())
    fputc(c, stdout);

Look, ma, no types! Already, we’re on to something powerfully useful.

The InputRange just described is streaming only. One pass streams are hardly sufficient for complex processing. Let’s extend it.

ForwardRange

The first extension is called a ForwardRange. It adds a property:

@property R save;

(Where R is the ForwardRange type). save returns a copy of the ForwardRange in its current state. It does not make a copy of the data being iterated over, only the position. Hence, the original and the copy can now traverse the ForwardRange independently of each other. A singly-linked list is a typical forward range.

A ForwardRange is useful when lookahead is needed. Algorithms that require a ForwardRange are brute force subsequence searching, and generally any pattern matching that needs to do multiple passes over its inputs, are good examples. Merge sort and bring to front also come to mind.

Note that a file stream usually cannot efficiently be a ForwardRange, the way the buffering is done precludes it.

BidirectionalRange

The next extension is for a bidirectional range, and two additional primitives are required:

@property E back;
void popBack();

These work analogously to front and popFront(), except they work their way backwards from the end of the range rather than the beginning.

An example of a bidirectional range would be a doubly linked list. But doubly-linked lists are not used all that often. What makes bidirectional ranges worth studying is that UTF8- and UTF16-encoded strings are, in fact, bidirectional ranges. Bidirectionality enables a few more efficient algorithms, such as finding the last occurrence of a sequence in another.

RandomAccessRange

The RandomAccessRange overloads the [ ] operator so it can be indexed to refer to the nth element:

E opIndex(size_t n);

In addition to indexing, there are two options for the RandomAccessRange’s functionality:

  1. a BidirectionalRange that offers the length property or is infinite:
    @property size_t length;
    
  2. a ForwardRange that is infinite. (An infinite range always returns false for the empty property.)

More details about InputRanges can be found at https://dlang.org/phobos/std_range.html.

Sinks

Sinks in D are called OutputRanges. An OutputRange is nothing more than a type that supports a method called put:

void put(E e);

The idea is that element e of type E gets “put” into the range. (Some other variations on an OutputRange are detailed at https://dlang.org/phobos/std_range.html#put.)

For example, an OutputRange that writes characters to stdout would be:

struct StdoutByChar {
    void put(char c) {
        if (fputc(c, stdout) == EOF)
            throw new Exception("stdout error");
    }
}

Recall our earlier:

foreach (c; StdinByChar())
    fputc(c, stdout);

Using an OutputRange, this becomes:

StdoutByChar r;
foreach (c; StdinByChar())
    r.put(c);

This is starting to look pretty generic. And something subtle has happened along the way: our earlier version ignored fputc errors, but the range version doesn’t. With exception handling, the errors can be detected in the range code, and it won’t uglify the algorithm code.

Algorithms

Consider what that last snippet of code does - it copies the data, element by element, from an InputRange and puts it into the OutputRange. Generalizing a bit, and calling it copy (as obvious as hammer on a cockroach):

void copy(ref StdinByChar source, ref StdoutByChar sink) {
    foreach (c; source)
        sink.put(c);
}

This is lovely as long as you’ve got types StdinByChar and StdoutByChar, and it’s back to copy/pasta for any other types. Fixing that by making copy a template:

void copy(Source, Sink)(ref Source source, ref Sink sink) {
    foreach (c; source)
        sink.put(c);
}

and copy now will generically take any type! Including, of course, types that are not valid InputRanges or OutputRanges. The hapless programmer will see the horrible obscure failure messages coming from the body of copy. This is not so nice. The generic version of copy has the opposite problem from the specific one.

Recall that I said that InputRanges and OutputRanges were concepts, not types. The algorithm can be fixed by adding constraints (which are D’s idea of concepts):

Sink copy(Source, Sink)(ref Source source, ref Sink sink)
    if (isInputRange!Source &&
        isOutputRange!(Sink, ElementType!Source))
{
    foreach (c; source)
        sink.put(c);
    return sink;
}

isInputRange is a library template that can determine if its argument is a valid InputRange, and isOutputRange is a library template that determines if its argument is a valid OutputRange that is compatible with the element type of the InputRange.

(The return sink; was added to help make copy composable.)

How is our top level code looking now?

StdinByChar source;
StdoutByChar sink;
copy(source, sink);

Almost there. The syntax does not read left to right. But D has a feature UFCS (Uniform Function Call Syntax) where:

func(a,b,c)

can be written as:

a.func(b,c)

so now the component copy looks like:

StdinByChar source;
StdoutByChar sink;
source.copy(sink);

Filters

The copy algorithm is a trivial example of a filter. A filter copies some subset of its input to its output. What would a slightly more sophisticated filter look like? How about all the integers less than 3?

int[] arr = [1,2,3,4,5];
auto r = arr.filter!(a => a < 3);
writeln(r);

which will print:

[1, 2]

Note the simple lambda (a => a < 3) as the selection predicate for the filter.

As should be apparent, arrays are InputRanges, or more precisely, RandomAccessRanges. One nice thing about input, forward, bidirectional, and random-access ranges is that they form a hierarchy; the more powerful ranges are refined version of the simpler ones, so they can be substituted for them.

Maps

Map algorithms read their input, transform it, and return an InputRange that is the transformed result. For example, to square the input:

int[] arr = [1,2,3,4,5];
auto r = arr.map!(a => a * a);
writeln(r);

which will print:

[1, 4, 9, 16, 25]

Reducers

A reducer algorithm starts with an initial value, and combines that value with the values of all the elements in the range, and produces a single value as a result. For example, to get the sum of the input:

int[] arr = [1,2,3,4,5];
auto r = arr.reduce!((a,b) => a + b);
writeln(r);

which will print:

15

Note the two argument lambda (a,b) => a + b.

Just to show how crazy flexible algorithms can be, reduce can also compute multiple values with one pass through the data (which is pretty useful for streaming data which would be expensive to save for a second pass through it). Multiple lambdas produce a tuple result, here the sum and sum of squares is computed:

int[] arr = [1,2,3,4,5];
auto r = arr.reduce!((a,b) => a + b, (a,b) => a + b * b);
writefln("sum = %s, sum of squares = %s", r[0], r[1]);

which will print:

sum = 15, sum of squares = 55

Sorting

Sorting is a broader topic, so let’s focusing on the quicksort variety. Unlike the other algorithms discussed so far, quicksort needs its data in memory, and for good performance, it needs random access.

In order to do the impedance adaptation of a streaming input to quicksort, we’d need to accumulate all input in an array. Quicksort being an in-place algorithm, it would modify the array and return a view of that same array that offers extra amenities available to sorted arrays, such as binary search and merge operations.

Putting It Together

Let’s nail these bad boys together and solve a straightforward problem, and see how it looks. The problem is to read lines from stdin, sort them, and write the sorted result to stdout. The complete program is:

import std.stdio;
import std.array;
import std.algorithm;

void main() {
    stdin.byLine(KeepTerminator.yes)                     // 1
    map!(a => a.idup).                                   // 2
    array.                                               // 3
    sort.                                                // 4
    copy(                                                // 5
       stdout.lockingTextWriter());                      // 6
}
  1. This creates an InputRange that reads from stdin line by line, keeping the line terminator at the end of the line.
  2. In order to read lines fast, byLine reuses the same character buffer. In order to create an array of lines, we’ll need to copy the lines returned by byLine. The map function (from std.algorithm) handles that nicely, calling a lambda that maps each input line to a copy of that line.
  3. The lines need to be coalesced into an array of lines, which is what std.array.array does. An array is a RandomAccessRange, which is needed by the sort algorithm.
  4. What could be simpler than this call to the sort algorithm?
  5. copy() is the algorithm we wrote about earlier. It copies its first argument, an InputRange, to its second argument, an OutputRange.
  6. The lockingTextWriter locks the output stream, which is stdout, writes its input (which is an InputRange of lines of text) to that stream.
This does an excellent job of meeting out criteria for components. The flow of control follows a logical progression from top to bottom, with no loops or conditional statements. It’s easy to reason about (and hence much more likely to be correct). The individual components are independent of each other, and each is reusable, and replaceable with others.

Features Needed For Components

Quite a few features of D come together to make this work:

  1. Exception handling for errors rather than error codes - This removes all the error handling code from the use of the components, while fully supporting error detection. Maybe there is a clean way to do this with error codes, but I haven’t thought of one.
  2. Generic functions - Since ranges are concepts, not types, the straightforward way to write algorithms is as function templates that can adapt themselves to the final range types.
  3. Template constraints - In order to stave off the user-hostile problem of inscrutable error messages, template constraints prevent algorithms from instantiating unless the range arguments match the corresponding range concepts.
  4. UFCS (Uniform Function Call Syntax) - This enables a natural left-to-right order of composition and evaluation of ranges and algorithms. At the same time, it allows algorithms to be implemented outside any particular type or family of types, fostering better modularity.
  5. Language recognition of InputRanges for foreach loops - While not strictly required, this makes for convenient and robust iteration over InputRanges.
  6. Ranges are concepts, not types - Ranges can be implemented as value or reference types, whichever is more appropriate for the specific case. It is not necessary to use virtual functions or other specific interface types.
  7. Inlining, customization, and optimization - The compiler is fully empowered by the language to inline, customize, and optimize a generic implementation into one customized by the particular types involved. This is necessary in order to match the performance of hand-built, custom code.
  8. Specialization - Although not shown in the examples, the implementor of the components can add specialized versions based on inquiring about the argument types at compile time.
  9. Type Deduction - D is a statically typed language, but the file sorting program doesn’t explicitly mention any types. The compiler can deduce the required types from the arguments.
  10. Tuples - The example that computes both the sums and squares in a single pass relies on tuples to produce its return value(s). Being able to combine multiple operations into a single pass over the data is an important optimization.

Conclusion

I’ve suffered the heartbreak of non-reusable code for a long time. One solution is component programming. A successful component design is a combination of convention with some careful language support.

Many advanced features of D come together to enable development of generic components that can snap together in a simple, robust manner. While writing the components is slightly more work than the traditional method of custom loops, the ease of use of those components, and their reusability, more than balance the ledger towards the profit side.

D’s component design builds on the earlier successes of the Unix “files and filters” model, followed by C’s file stream interface, followed by C++’s iterators and algorithms. While this is relatively new in D, early indications are that it is a robust and extremely promising design. An awful lot of code can be refactored into ranges and algorithms, and hence can be easily made into reusable software components.

The Inevitable Comparisons

Clearly, this invites comparison to how this is or may be done in other languages. It’s beyond the scope of this article to do a thorough survey of this, and even if I tried, it’s easy to miss best practices in a language one is not thoroughly used to programming in. The best I can do is discuss the rationale for certain design choices for D, and to speculate how D would have fared for component-based programming for other choices.

For example, the classic functional languages’ approach to component programming is to use a list as the lingua franca connection between components. The list is an example of a D forward range. A forward range is a very bad fit for a variety of important algorithms, such as quicksort, binary search, median, and nth element. These need random access. High performance code also can mandate very specific data layouts to maximize memory throughput. The D range concept allows appropriate data structures to be used, rather than trying to make lists fit everything.

Other languages implement parameterized containers and algorithms (such as higher-order functions) by using indirection (usually virtual functions). This constrains the options for data layout (again, data layout can be performance critical), and presumes the existence of a “sufficiently smart optimizer” that can unwind the indirections to generate code specialized to the instantiation types. Without that, one winds up with inefficient components that invite programmers to not use them. The D approach is to use so-called “heterogeneous” generics, where generating specialized code upon instantiation is relatively straightforward. Furthermore, this specialized code instantiation is composable - components can be layered on top of each other without concern about falling off of a performance cliff.

And lastly, there’s the issue of manageability. Does a component design rely on incidental scaffolding, tricks, exploits, and cautions? Is there a lot of boilerplate code surrounding the meat? At least the use of the components should be straightforward enough that a relative newcomer to the language can get it to work, and get it to generate industrial quality results. I’ve had enough of code that looks simple and elegant in a tutorial, only to find out it doesn’t perform well, and that “experts don’t write code that way”, and I’m sure you have, too.

Other Work

There’s an enormous amount of history behind component programming and body of work written about software component programming, as well as a smorgasbord of ideas about what components should be. For just a taste, see Component-based software engineering.

Further Reading

  1. The D Programming Language web site
  2. The D Programming Language by Andrei Alexandrescu
  3. Ranges
  4. Algorithms

References

Acknowledgements

Home | Runtime Library | IDDE Reference | STL | Search | Download | Forums