digitalmars.D - D graph library
- Joseph Rushton Wakeling (73/73) May 06 2013 Hello all,
- w0rp (11/11) May 06 2013 I have an interest in having a nice, reliable graph library in
- Joseph Rushton Wakeling (8/18) May 07 2013 Very much the same. The actual range of specific issues I need to solve...
- qznc (15/15) May 08 2013 I am thinking about something like this as well, but finishing things is...
- Joseph Rushton Wakeling (14/31) May 08 2013 That would be a great perspective to have. I should add that the use of...
- H. S. Teoh (146/330) May 10 2013 I suppose we can make it an optional thing, such that algorithm A may
Hello all,

I've recently been having some very tentative thoughts about whether to try and develop an effective graph/network library in D. The goal would be a library that allows for highly efficient representation of graphs and multigraphs, both as static and dynamic entities, with very fast retrieval of graph properties (e.g. the neighbours of a given vertex, the weights of given links) for use in simulations _on_ the graph, and very fast calculation of graph metrics of various kinds. The general field of application would be in various branches of complexity science, with particular attention to those stemming from interdisciplinary physics.

There are a number of existing solutions in other languages. igraph, written in C, is probably the most feature-complete solution, with Python and R interfaces also available; however, it has (to my eyes) a very complex and unfriendly API, albeit one that improves when the Python or R versions are used. I haven't personally used the Boost Graph Library, beyond going through some of the documentation to explore it as an option, but more experienced colleagues have downplayed it as very complex to use. Finally, NetworkX is a Python library that seems quite friendly to use, but when exploring it I found some issues that made me reluctant to trust it as a scientifically reliable solution. (To be fair to the maintainers, when I raised those issues with them, they did something about it quite quickly.)

I know that Philippe Sigaud started some work in this direction for D: http://svn.dsource.org/projects/dranges/trunk/dranges/docs/graph.html ... but I don't know if there is any ongoing status or intent to develop it further. I haven't yet gone through the source code with sufficient scrutiny to decide whether it makes a good basis for the kind of project I have in mind.

My main reasons for being tentative about moving forward with this are -- well, first, it's a major undertaking; as my day job is producing research, not code, there's something of a pressure to just use the tools already available rather than trying to optimize for my personal language choice, and I don't want to start something only to drop it because my research activities take me in a different direction.

Second, I don't know how useful it would be to others -- Python and C/C++ are still definitely the standard (and growing) tools. So it would be good to know if anyone else would like to have such a library explicitly in D.

Third, in terms of producing productive code in a short space of time, I don't know if it's really worth implementing things in D rather than just providing bindings/wrappers to an existing library, with igraph being the obvious choice as it is written in C. The reason for not doing that so far has been a certain amount of nervousness about trying to provide .di interfaces for a very large and complex codebase with lots of custom data structures (the default approach recommended for using igraph is to import the headers for THE ENTIRE LIBRARY, so it doesn't lend itself to piecewise construction of bindings).

Finally, given that many of my complaints about the existing tools relate to complex APIs and not-very-user-friendly code, I'm rather worried about my own ability (as a researcher and not a well-trained developer :-) to do something better. My experience of code written by researchers is in general not good, though the D community has shown there are exceptions!
Now, that said, there is one very big positive motivation -- I think that D provides a language in which one could create a graph library with a genuinely friendly and well-designed API and with easily readable code, yet which has all the efficiency and scalability of igraph. And I'd much rather work with that than with any of the other solutions out there.

Some goals for a project like this would be along the following lines:

* Hyper-efficient performance for storage and manipulation of graphs. I'd want to be able to perform analyses of, and/or run simulations on, extremely large graphs. This would include features such as effective use of parallelism (when appropriate), dynamic recalculation of various network quantities as nodes/edges are added/removed, and so on.

* Written in idiomatic D style as exemplified by the best of Phobos, making the best use of D's generics, templates, etc.

* Easy to hook into other languages -- I think my colleagues would very much appreciate a Python interface.

* Ideally, would rely _only_ on Phobos and not on any 3rd-party libraries. It may turn out that some desirable functionality is not yet in Phobos (or does not yet perform adequately), in which case those features should be implemented or improved in Phobos rather than in this library.

Anyway, for a very tentative proposal (and I have to say, I'm really not sure whether I will commit to this), I think that's a long enough email for now -- I'd really welcome people's thoughts on these ideas, and to get a feel for whether there's interest in seeing something like this implemented in D.

Thanks & best wishes,

    -- Joe
May 06 2013
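[To make the "idiomatic D, range-based" aspiration above a little more concrete, here is a rough sketch of the kind of interface that could be meant. Everything in it -- Graph, addNode, addEdge, nodes, neighbours -- is invented for illustration and is not taken from any existing library.]

    import std.range : iota;

    /// A deliberately tiny adjacency-list digraph, sketched only to show
    /// the style of range-based API under discussion.
    struct Graph
    {
        private size_t[][] adj;   // adj[v] = out-neighbours of node v

        /// Add a node and return its ID.
        size_t addNode()
        {
            adj.length = adj.length + 1;
            return adj.length - 1;
        }

        /// Add a directed edge u -> v.
        void addEdge(size_t u, size_t v)
        {
            adj[u] ~= v;
        }

        /// Lazy range over all node IDs.
        auto nodes() const { return iota(adj.length); }

        /// Range over the neighbours of node v.
        auto neighbours(size_t v) const { return adj[v]; }
    }

    void main()
    {
        Graph g;
        auto a = g.addNode(), b = g.addNode(), c = g.addNode();
        g.addEdge(a, b);
        g.addEdge(a, c);

        // Algorithms can then be written against ranges alone:
        foreach (n; g.nodes)
            foreach (m; g.neighbours(n))
            {
                // visit edge n -> m
            }
    }

The point of the sketch is simply that node and neighbour traversal can be exposed as ordinary ranges, so that generic algorithms need not care how the graph is stored underneath.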
I have an interest in having a nice, reliable graph library in any language I use, whether that's Python, Java, JavaScript, or D. This desire comes from the fact that having a nice graph implementation makes it easy to solve a few particular problems. I would be interested in assisting with development of a nice graph library in some fashion, if you decide to start such a project. I am merely the holder of a BSc, and I don't have far reaching experience with implementing graph data structures and algorithms. Nonetheless, implementing graph data structures and algorithms is something I just enjoy doing, and it's a good learning experience for me.
May 06 2013
On 05/06/2013 07:02 PM, w0rp wrote:
> I have an interest in having a nice, reliable graph library in any language I use, whether that's Python, Java, JavaScript, or D. This desire comes from the fact that having a nice graph implementation makes it easy to solve a few particular problems.

Very much the same. The actual range of specific issues I need to solve is fairly narrow, but having a good general solution would make it much easier.

> I would be interested in assisting with development of a nice graph library in some fashion, if you decide to start such a project. I am merely the holder of a BSc, and I don't have far reaching experience with implementing graph data structures and algorithms. Nonetheless, implementing graph data structures and algorithms is something I just enjoy doing, and it's a good learning experience for me.

Qualifications are not important; what matters is curiosity, fun and engagement with the problem :-) Besides, if you already have some existing experience implementing these kinds of data structures and algorithms, you might actually have more practical understanding of the problem than I do right now.

Thanks for the interest! It makes the whole idea seem much more worthwhile.
May 07 2013
I am thinking about something like this as well, but finishing things is not my strength. ;)

What I want is probably quite different from the requirements behind other graph libraries. My background is compiler optimization, where programs are represented as graphs: libFirm [0].

1) A transaction mechanism. In debug mode, libFirm runs some checks on every graph change. However, sometimes a sequence of changes must be completed before the graph is in a valid state again. A transaction mechanism could be used to run the verifier only after a complete transaction.

2) Node maps. Compiler optimizations sometimes rely on multiple analyses, which usually compute information for each node in the graph. While a hash map can be used to annotate nodes, a graph-specific map should be more efficient. Additionally, a graph might know about its corresponding maps and update them implicitly when the graph changes.

[0] http://pp.info.uni-karlsruhe.de/firm/Main_Page
May 08 2013
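[A hedged sketch of how the two features described above -- per-change verification suspended during a "transaction", and a node map stored densely by node ID -- might look in D. All names here (Graph, NodeMap, beginTransaction, commit, verify) are hypothetical, not from libFirm or any existing D library.]

    import std.exception : enforce;

    /// Node map: per-node annotations stored in a dense array indexed by
    /// node ID, rather than a general-purpose hash map.
    struct NodeMap(T)
    {
        private T[] data;

        ref T opIndex(size_t nodeId)
        {
            if (nodeId >= data.length)
                data.length = nodeId + 1;   // grow lazily as the graph grows
            return data[nodeId];
        }
    }

    /// Toy graph with a verifier that normally runs on every change, but
    /// can be suspended for the duration of a transaction.
    struct Graph
    {
        size_t[][] adj;
        private bool inTransaction = false;

        void addEdge(size_t u, size_t v)
        {
            adj[u] ~= v;
            debug if (!inTransaction) verify();   // per-change check in debug mode
        }

        /// Verifier: every edge must point at an existing node.
        void verify()
        {
            foreach (targets; adj)
                foreach (t; targets)
                    enforce(t < adj.length, "dangling edge target");
        }

        void beginTransaction() { inTransaction = true; }

        void commit()
        {
            inTransaction = false;
            verify();              // check only the final state
        }
    }

    void main()
    {
        Graph g;
        g.adj.length = 3;

        g.beginTransaction();
        g.addEdge(0, 1);           // intermediate states are not individually checked
        g.addEdge(1, 2);
        g.commit();                // the verifier runs once, at the end

        NodeMap!double weight;
        weight[2] = 1.5;           // per-node annotation, stored densely by node ID
    }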
On 05/08/2013 09:14 AM, qznc wrote:
> I am thinking about something like this as well, but finishing things is not my strength. ;)

Oh, tell me about it. :-P

> What I want is probably quite different from the requirements behind other graph libraries. My background is compiler optimization, where programs are represented as graphs: libFirm [0].
>
> 1) A transaction mechanism. In debug mode, libFirm runs some checks on every graph change. However, sometimes a sequence of changes must be completed before the graph is in a valid state again. A transaction mechanism could be used to run the verifier only after a complete transaction.
>
> 2) Node maps. Compiler optimizations sometimes rely on multiple analyses, which usually compute information for each node in the graph. While a hash map can be used to annotate nodes, a graph-specific map should be more efficient. Additionally, a graph might know about its corresponding maps and update them implicitly when the graph changes.

That would be a great perspective to have. I should add that the use of things like hash maps etc. for efficient data storage/retrieval is something I'm really not that familiar with, so it would be good to be able to get concrete advice on those aspects. So, your thoughts and insights may still be very important even if you can't commit to a lot of programming time.

I suspect that where effective hash maps and other container types are concerned, it may be necessary to either make (or push for) some quite extensive additions to Phobos and/or druntime. We could of course use Steve Schveighoffer's excellent work, but I have a strong preference for avoiding 3rd-party dependencies in cases where the functionality clearly _should_ be available in the standard libraries.

> [0] http://pp.info.uni-karlsruhe.de/firm/Main_Page

I love that you provided a brainfuck frontend ... :-)
May 08 2013
On Wed, May 08, 2013 at 07:25:26PM +0200, Joseph Rushton Wakeling wrote:
On 05/08/2013 06:12 PM, H. S. Teoh wrote:

I suppose we can make it an optional thing, such that algorithm A may ask for only an InputRange (and a graph with a bidirectional range of nodes will still work) but algorithm B may ask for more, say a ForwardRange or RandomAccessRange. Each algorithm will require the minimum necessary to perform efficiently (though of course it can be special-cased to take advantage of additional features of a higher range, if that is available in a given instantiation).

In fact, now that I think of it... most of the features you listed can arguably be optional: only a few algorithms (none that I know of, actually) require enumeration of edges, some algorithms may only require an input range of nodes, each of which gives an input range of edges, etc. Would it make sense to define a set of graph properties (e.g., via templates ala std.range.is{Input,Forward,Bidirectional,...}Range, or hasLength, hasSwappableElements, etc.), and have each algorithm require whatever minimal subset of them is necessary to do the work? Because it makes little sense to require, say, a graph that returns a total count of nodes, if a given algorithm never needs to use that information anyway. This way, a structure for which the total number of nodes is expensive to compute can still be used with that algorithm. IOW, we can have things like hasRandomAccessNodes, hasNodeCount, hasEdgeCount, canEnumerateEdges, etc., and each algorithm will use a subset of these in their signature constraints.

Hmm. Is it necessary to provide random access ranges? For some graph algorithms that's necessary, but for others, it may not be.

Agree that it's debatable. It's probably desirable but not necessary.

Makes sense.

If nodes() and edges() return RandomAccessRanges then they will automatically have the .length property. So you could add a courtesy function like nodecount() or edgecount() that in this case would just wrap nodes.length but for other cases might calculate the node or edge count in a different way.

Note also that both the above automatically define the count of nodes/edges, but we could of course add courtesy functions to give them directly.

Something along the lines of std.range's .hasLength, I guess? Makes sense.

Hmm. I'm undecided here until I see some actual algorithms in which this choice makes a difference. I don't want to fall into premature generalization here. ;-)

Well, I was considering whether making it a range of IDs might be more flexible with respect to underlying data structures. I may be falling into "premature optimization" here. :-)

Finally, a reasonable question is whether they should return ranges of actual nodes or edges, or ranges of node/edge IDs.

I think the distinction is mostly superfluous. One could always wrap IDs in a struct.

Not if the graph is computed on-the-fly. E.g. chess analysis applications in which the complete graph is infeasible to compute.

Another thought I have is whether we should require a method that enumerates all edges. The number of edges in a graph can grow very quickly w.r.t. the number of nodes, and some graphs may be so complex that the edges can't all be stored in memory at once, but only computed per-node. In such cases, requiring edge enumeration may be counterproductive. I'm inclined towards leaving out edge enumeration unless some specific algorithms require it (even then, we could just make it a requirement for those particular algorithms). I admit my knowledge of graph algorithms is rather limited, but AFAIK none of them require edge enumeration on the entire graph.
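[The hasNodeCount / canEnumerateEdges style of trait suggested above might look roughly like this in D -- a speculative sketch modelled on std.range's isInputRange / hasLength, with every trait and type name here invented purely for illustration.]

    import std.range : iota;

    /// The bare minimum: something with an iterable .nodes range and a
    /// per-node .neighbours(n) range is treated as a graph.
    enum isGraph(G) = is(typeof((G g) {
        foreach (n; g.nodes)
            foreach (m; g.neighbours(n)) {}
    }));

    /// Optional capabilities, checked separately so each algorithm can
    /// demand only what it actually uses.
    enum hasNodeCount(G) = is(typeof((G g) {
        size_t n = g.nodeCount;      // cheap total node count
    }));

    enum canEnumerateEdges(G) = is(typeof((G g) {
        foreach (e; g.edges) {}      // whole-graph edge range
    }));

    /// An algorithm constrains itself on the minimal subset it needs; a
    /// graph lacking nodeCount or edges still qualifies.
    size_t degreeSum(G)(G g)
        if (isGraph!G)
    {
        size_t total = 0;
        foreach (n; g.nodes)
            foreach (m; g.neighbours(n))
                ++total;
        return total;
    }

    // Toy model type used to check that the traits behave as intended.
    struct ToyGraph
    {
        size_t[][] adj;
        auto nodes() { return iota(adj.length); }
        auto neighbours(size_t v) { return adj[v]; }
    }

    static assert(isGraph!ToyGraph);
    static assert(!hasNodeCount!ToyGraph);
    static assert(!canEnumerateEdges!ToyGraph);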
Well, if we accept that we're going to be offering the user a range that lists all edges, the underlying code might handle that in many different ways -- just reading from an array, caching input from a database, or whatever else is appropriate. That shouldn't make it difficult to get the overall number of edges -- that's always going to be a stored number of one kind or another, whether it's an array length or a record of the number of fields in a database, or whatever.

In that case, I wonder if it's even necessary to unify the two graph types. Why not just have two separate types and have the type system sort out compatibility with algorithms for us?

Does it make sense to have a common interface between, say, directed and non-directed graphs? That is, are there algorithms that can operate on both, or do most of the interesting algorithms only work on one or the other? I'm just wondering if it's worth the trouble of doing compile-time (or runtime) checks if we could just have two unrelated graph types.

Reasonable question. As far as I can tell most graph libraries offer a broadly common interface even though the underlying data structure may vary quite a bit. The practical implementation of those checks might well be simply reading an immutable boolean variable in the graph type.

Are there any algorithms that need to do something different depending on whether the input graph is bipartite or not? Just wondering if it's worth the trouble of introducing an additional field (it may well be necessary -- I admit my knowledge of graph algorithms is rather spotty).

Also, I'm not sure about the usefulness of isBipartiteGraph, unless it entails some specific graph method (say a method that returns nodes grouped by which partition each lies in) that takes advantage of this fact. OTOH, maybe it does serve the purpose of letting the user promise that the input graph will be bipartite, so if it isn't and the algorithm blows up, it's not our fault. :-P

Yes, that would be the purpose. The bipartite graph might also be just a different data type.

Thought so. :)

Personally, I'd shorten it to ngbrs, but most people here would disagree. :-P

I don't like those kinds of abbreviations in general, I like English. :-P

Right. Which brings me back to my idea that perhaps these graph properties should be optional, or perhaps we should have some kind of hierarchy of graph types (ala input range, forward range, bidirectional range, etc.), so that individual algorithms can choose which features are mandatory and which are optional. (I really dislike algorithms that are artificially restricted just because the author decided that feature X is required, no matter if the algorithm doesn't even use it.)

On Fri, May 10, 2013 at 02:04:28PM +0200, Joseph Rushton Wakeling wrote:

I'm not so sure about providing both in and out neighbours / edges. In some cases this may be very inefficient. I think it may be best simply to adopt the convention that a directed graph will always return out neighbours, and undirected graphs will always return all neighbours. We can always provide an optional method for inverting a graph where this is useful / necessary.

Maybe, but in some cases you want to be able to quickly get the in-degree or whatever. Inverting the graph is going to be an expensive method that will require a lot of calculation (and maybe also a lot more memory, if you wind up storing both versions). I'd much rather have the option to get in-links and say, "Hey, in some cases this may be really expensive" than to have no option at all by design.
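[One possible compromise on the in-neighbours question, sketched here purely as an illustration: a directed graph that maintains both a forward and a reverse adjacency list, so that in-neighbours and in-degrees are as cheap to query as out-neighbours, at the cost of extra memory and slightly slower edge insertion. None of these names come from an existing library.]

    struct DiGraph
    {
        size_t[][] outAdj;   // outAdj[u] = targets of edges u -> v
        size_t[][] inAdj;    // inAdj[v]  = sources of edges u -> v

        size_t addNode()
        {
            outAdj.length = outAdj.length + 1;
            inAdj.length  = inAdj.length + 1;
            return outAdj.length - 1;
        }

        void addEdge(size_t u, size_t v)
        {
            outAdj[u] ~= v;   // forward list
            inAdj[v]  ~= u;   // reverse list, kept in sync
        }

        auto outNeighbours(size_t v) { return outAdj[v]; }
        auto inNeighbours(size_t v)  { return inAdj[v]; }

        size_t outDegree(size_t v) { return outAdj[v].length; }
        size_t inDegree(size_t v)  { return inAdj[v].length; }
    }

    void main()
    {
        DiGraph g;
        auto a = g.addNode(), b = g.addNode();
        g.addEdge(a, b);
        assert(g.outDegree(a) == 1 && g.inDegree(b) == 1);
    }

Whether the reverse list is worth its memory cost is exactly the trade-off being debated above; the point is only that "cheap in-links" and "out-links only" need not be the same type.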
On 05/08/2013 06:12 PM, H. S. Teoh wrote:

True. Though another way of designing the API would be:

    foreach (n; g.nodes)
    {
        foreach (m; g.neighbours(n))
        {
            ...
        }
    }

which would work with IDs. And arguably, this may be more efficient in some cases.

On Wed, May 08, 2013 at 02:34:32PM +0200, Joseph Rushton Wakeling wrote:

I'm being daft. Of _course_ the ranges should be of nodes or edges, because you want to be able to do things like

    foreach (n; g.nodes)    // g is the graph ... :-)
    {
        foreach (m; n.neighbours)
        {
            ...
        }
    }

Finally, a reasonable question is whether they should return ranges of actual nodes or edges, or ranges of node/edge IDs.

I think the distinction is mostly superfluous. One could always wrap IDs in a struct:

I'm kinda torn about that one. Should the graph structure itself keep track of these annotations (which is basically what these properties are)? Or should it be the algorithm that tracks them? I can imagine cases for which it's useful to do something like this:

    auto myAlgo(G)(G graph)
        if (isGraph!G)
    {
        struct Annotations
        {
            float weight;
            float length;
            Color color;
        }
        Annotations[G.NodeIdType] annotations;
        ...
        foreach (n; graph.nodes)
        {
            ...
            annotations[n].weight = 1.0;
            ...
        }
        while (...)
        {
            auto n = graph.nodes[someId];
            if (annotations[n].weight == 1.0) { ... }
        }
    }

The advantage of not storing these annotations in G is that it allows myAlgo to apply to a far wider variety of graph-like things. Say very large graphs that are computed on-the-fly (e.g. chess positions), for which it would be infeasible to store arbitrary data with every node or edge.

I don't think what I was proposing was at odds with what you describe here. For sure it's helpful to be able to map attributes as you describe. What I mean is simply that it's helpful to be able to define arbitrary properties (and property names) that will be associated with edges, not that edges should be _obliged_ to have a certain set of associated parameters.

It should also be possible to define arbitrary collections of node properties, such as colour, type, textual labels, numerical properties etc., and to query if a node has them (preferably at compile-time).

Hmm. I'm not sure about requiring such attributes. I'd go for a more generic approach: if an algorithm requires some attribute (say weight), then it could take a compile-time parameter that maps .weight to whatever actual attribute is in the graph node / edge. For example:

    auto findShortestPath(alias weight, G)(G graph, G.node start, G.node end)
    {
        ...
    }

    struct Graph
    {
        struct Node { ... }
        alias node = Node;

        struct Edge
        {
            double length;
            @property double euclideanDistance();
            ...
        }
    }

    Graph g;
    Graph.Node x, y;
    auto pathByLength = findShortestPath!"length"(g, x, y);
    auto pathByDist = findShortestPath!"euclideanDistance"(g, x, y);

This allows the same graph to be evaluated multiple ways, as this example shows, without needing to create wrappers or proxy objects.

I was just thinking it would be nice if we could write:

    auto pathLength(G)(G graph, G.NodeType start, G.NodeType end) { ... }

instead of:

    auto pathLength(G, N)(G graph, N start, N end)
        if (is(N == typeof(G.nodes.front)))
    { ... }

Or worse:

    auto pathLength(G, N)(G graph,
                          typeof(G.nodes.front) start,
                          typeof(G.nodes.front) end)
    { ... }

The first example is the most self-documenting and pleasant to read, IMO.
But now that I think about it, perhaps the best way is just:

    template NodeType(G)
        if (isGraph!G)
    {
        alias NodeType = typeof(G.init.nodes.front);
    }

    auto pathLength(G)(G graph, NodeType!G start, NodeType!G end) { ... }

OK, so I retract what I said earlier. :) [...]

I'm tentatively considering defining the node type within the graph itself (G.node above), as a way of reducing the number of compile-time parameters, but this may not be necessary, and may be limiting in some cases. Have to think about it a bit more.

Can you expand on this?

Ah, I see.

Also, I'm not sure about isWeighted... I would expect most graph algorithms to either not care about weights (in which case the entire graph can be weightless) or require weights on all edges (in which case isWeighted is redundant). Or am I ignorant of algorithms that process weights that only exist on some edges but not others?

Consider path length -- e.g. a path v1 -> v2 -> v3 has length 2 for an unweighted graph, and length w_{1,2} + w_{2,3} for a weighted graph. That in turn has implications for shortest-path-length calculations, and the optimal algorithm to calculate shortest paths thus depends on whether you're dealing with the weighted or unweighted case.

All of this in turn impacts on any metrics or algorithms that depend on distances between nodes. For example, today I'm going to be implementing [*] an algorithm to calculate betweenness centrality: http://www.cs.ucc.ie/~rb4/resources/Brandes.pdf -- this does need to be modified, albeit trivially, for the weighted case.

[* Inside some custom-written code for a very particular application, I'm afraid. This is why I want to create a general-purpose graph library, to stop me having to do things like this :-P ]

Right. For maximum generality, I'm thinking something along these lines:

    auto findPathLength(G)(G graph, NodeType!G start, NodeType!G end)
    {
        ...  // weightless version
    }

    auto findPathLength(string weightField, G)(G graph, NodeType!G start, NodeType!G end)
    {
        // weighted version
        alias weight = __traits(getMember, G, weightField);
        ...
        auto w = node.weight;
        ...
    }

So basically, the user specifies which property should be treated as the weight, and that automatically selects the weighted version of the algorithm. If omitted, the unweighted version is selected instead. This lets us do things like:

    void main()
    {
        FlowGraph g;
        ...
        auto numHops = findPathLength(g, n1, n2);
        auto totalFlow = findPathLength!"flowRate"(g, n1, n2);
        auto totalDistance = findPathLength!"distance"(g, n1, n2);
        ...
    }

Thanks! I'll take a look when I have time.

I think it would help discussion if we had a list (even if tentative) of algorithms we'd like to implement, along with what each algorithm expects from the graph (weights, colors, directedness, etc.). Then we can evaluate which properties/functions/etc. are really necessary, and which are just icing on the cake that can be omitted with no real drawbacks. I think the more minimal the API the better -- it makes the resulting module easier to understand and use, and makes it more likely we'll actually finish implementing it. :-P

Agree about a minimal API, but I think what I've described previously is pretty minimal ... :-) As for algorithms, igraph has a very substantial list: http://igraph.sourceforge.net/doc/html/igraph-Structural.html

We could probably "scratch our own itch" to a degree (ha!) in terms of choosing what to implement, but stuff like shortest path calculation, centrality measures and so on seem like good starting points. [...]
Yeah, shortest path is probably the most basic class of algorithms to implement. Though there *are* quite a variety of them -- Dijkstra, A*, etc. I've been looking into A* recently, since for some applications the graph is simply too large for Dijkstra to be feasible.

T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
May 10 2013
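[To make the weighted shortest-path discussion concrete, here is a deliberately simple O(V^2) Dijkstra sketch in D. It is only an illustration under assumed conventions (nodes are size_t IDs, edges are (target, weight) tuples); a library version would presumably use a priority queue and the trait-constrained graph types discussed earlier in the thread, and this is not anyone's proposed implementation.]

    import std.typecons : Tuple;

    /// Weighted directed edge: target node ID and edge weight.
    alias WEdge = Tuple!(size_t, "target", double, "weight");

    /// Single-source shortest path distances from `source`.
    double[] dijkstra(const WEdge[][] adj, size_t source)
    {
        auto dist = new double[](adj.length);
        dist[] = double.infinity;
        dist[source] = 0.0;

        auto done = new bool[](adj.length);

        foreach (_; 0 .. adj.length)
        {
            // Pick the unfinished node with the smallest tentative distance.
            size_t u = size_t.max;
            foreach (v; 0 .. adj.length)
                if (!done[v] && (u == size_t.max || dist[v] < dist[u]))
                    u = v;
            if (u == size_t.max || dist[u] == double.infinity)
                break;
            done[u] = true;

            // Relax all outgoing edges of u.
            foreach (e; adj[u])
                if (dist[u] + e.weight < dist[e.target])
                    dist[e.target] = dist[u] + e.weight;
        }
        return dist;
    }

    void main()
    {
        // 0 --1.0--> 1 --2.5--> 2, plus a direct 0 --5.0--> 2
        auto adj = [
            [WEdge(1, 1.0), WEdge(2, 5.0)],
            [WEdge(2, 2.5)],
            (WEdge[]).init,
        ];
        auto dist = dijkstra(adj, 0);
        assert(dist[2] == 3.5);   // the two-hop route beats the direct edge
    }

For the unweighted case the same interface degenerates to a breadth-first search (every edge weight treated as 1), which is one concrete reason the thread distinguishes the weighted and unweighted variants of path-length algorithms.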