digitalmars.D - D graph library -- update
- Joseph Rushton Wakeling (41/41) Jul 10 2013 Hi all,
- bearophile (5/9) Jul 10 2013 If you want a meaningful memory comparison then perhaps you need
- Joseph Rushton Wakeling (4/6) Jul 10 2013 I know, and it's coming. :-) The main memory-related issues will probab...
- Joseph Rushton Wakeling (36/39) Jul 11 2013 For comparison, I performed the same tests but with a 10_000 node graph....
- John Colvin (3/60) Jul 11 2013 This is very promising. Great work!
- Andrei Alexandrescu (7/13) Jul 11 2013 Unstable sort should be faster than stable sort. If I remember correctly...
- Joseph Rushton Wakeling (3/7) Jul 11 2013 That's _exactly_ the case here. :-\
- Dmitry Olshansky (9/16) Jul 11 2013 In such a case if inserting exactly one element into a sorted range you
- Joseph Rushton Wakeling (12/19) Jul 11 2013 ... which would basically correspond to an insertion sort, right?
- Dmitry Olshansky (13/32) Jul 11 2013 Yup, but done in one step. I'd put it the other way around: simply put
- Joseph Rushton Wakeling (5/16) Jul 12 2013 Thanks very much for that, I'll try it out and report back on performanc...
- Joseph Rushton Wakeling (24/35) Jul 12 2013 So, just to report back -- this is the code I came up with based on your...
- Joseph Rushton Wakeling (23/27) Jul 12 2013 An attempt at an alternative:
- Joseph Rushton Wakeling (4/7) Jul 11 2013 Do you have a link? I couldn't find it on the bugzilla, though I
- w0rp (15/15) Jul 11 2013 I don't like it. Here is why.
- H. S. Teoh (18/34) Jul 11 2013 Yeah, constraining the number of vertices/edges at ctor time is not
- Joseph Rushton Wakeling (14/30) Jul 11 2013 This is a core graph type designed explicitly for speed and
- H. S. Teoh (14/36) Jul 11 2013 That makes sense. The core graph type is what's used internally by the
- Joseph Rushton Wakeling (3/13) Jul 11 2013 Yes, I agree. The goal right now is to get the internal stuff
- Joseph Rushton Wakeling (3/5) Jul 11 2013 I'm not sure I understand why this is a problem with the class presented...
- Joseph Rushton Wakeling (16/24) Jul 11 2013 You can add as many edges as you like -- right now, the implementation d...
- Joseph Rushton Wakeling (42/46) Jul 12 2013 I want to reply at greater length on this point, because I don't want yo...
Hi all, Following earlier discussion about a D graph library <http://forum.dlang.org/thread/5187D514.5070109 webdrake.net>, this evening I sat down and had a go at coming up with some basic code to support such a venture. You can find it here: https://github.com/WebDrake/Dgraph This takes the basic data structure from the widely-used igraph library <http://igraph.sourceforge.net> but builds around it using idiomatic D structures and algorithms. The code currently consists of the basic data structure, implemented as a final class with methods to extract the number of vertices, the list of edges, and the degree and neighbours of a vertex. There is also a simple test file that can be used for benchmarking against comparable igraph code. With the current method of adding edges one by one, this code already benchmarks as faster than its igraph equivalent, running in 2.4s on my machine when compiled with gdmd -O -release -inline and 1.4s when compiled with ldmd2 and the same flags -- compared to 3s for the igraph code written in C. However, when igraph's option to add the edges all in one go in a vector is enabled, igraph is significantly faster. This surely reflects a mix of memory management (how many allocs/appends?) and also the amount of sorting and other updates that occur when edges are added. So, I think that igraph can probably still be beaten here. The memory usage for the example D code is slightly higher than for its comparable igraph C code, clocking in at about 2MB as opposed to 1.7. I've chosen igraph as a point of comparison because it's known for being both the fastest and most scalable graph library out there. Many of igraph's design choices seem focused on minimizing memory usage, sometimes at the expense of all-out speed: for example, if an algorithm needs an adjacency list representation of the graph, igraph will generate one at that moment, and destroy it afterwards. However, on the basis of the simple work done here, I'm fairly confident that D can do better. The code here is _much_ simpler than the equivalent functions in igraph, and has not yet been optimized in any way either for speed or for memory management. Yet it seems to be performing on par with or better than igraph within the limits of its current design constraints. I'll be trying to implement a few additional little pieces of functionality in the next days, perhaps some graph metrics which can give another point of performance comparison. Anyway, comments welcome, both positive and negative. Thanks & best wishes, -- Joe
Jul 10 2013
Joseph Rushton Wakeling:The memory usage for the example D code is slightly higher than for its comparable igraph C code, clocking in at about 2MB as opposed to 1.7.If you want a meaningful memory comparison then perhaps you need a 10 or 100 (or more) times larger graph. Bye, bearophile
Jul 10 2013
On 07/11/2013 02:18 AM, bearophile wrote:If you want a meaningful memory comparison then perhaps you need a 10 or 100 (or more) times larger graph.I know, and it's coming. :-) The main memory-related issues will probably not show up in a situation like this where all we're doing is storing the graph data, but in the case where algorithms are being performed on the data.
Jul 10 2013
On 07/11/2013 02:24 AM, Joseph Rushton Wakeling wrote:I know, and it's coming. :-) The main memory-related issues will probably not show up in a situation like this where all we're doing is storing the graph data, but in the case where algorithms are being performed on the data.For comparison, I performed the same tests but with a 10_000 node graph. Here we see similar memory use, but igraph outperforms dgraph by a factor of nearly 10 even with the insert of nodes one at a time. Profiling shows that the time difference is accounted for by the sorting algorithm used in the indexEdges() method: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 93.12 110.01 110.01 20000 0.01 0.01 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165Z11SortedRange561__T13quickSortImplS5066dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165 Z11SortedRange11__lambda168TS3std5range14__T3ZipTAmTAmZ3ZipZ13quickSortImplMFS3std5range14__T3ZipTAmTAmZ3ZipZv 3.88 114.59 4.58 20000 0.00 0.00 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165Z11SortedRange561__T13quickSortImplS5066dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165 Z11SortedRange11__lambda168TS3std5range14__T3ZipTAmTAmZ3ZipZ13quickSortImplMFS3std5range14__T3ZipTAmTAmZ3ZipZv 1.81 116.73 2.14 1124043790 0.00 0.00 _D3std5range14__T3ZipTAmTAmZ3Zip7opIndexMFNaNbNfmZS3std8typecons14__T5TupleTmTmZ5Tuple 0.59 117.42 0.70 203164131 0.00 0.00 _D3std9algorithm43__T6swapAtTS3std5range14__T3ZipTAmTAmZ3ZipZ6swapAtFNaNbNfS3std5range14__T3ZipTAmTAmZ3ZipmmZv 0.42 117.92 0.50 20000 0.00 0.01 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv By default, schwartzSort uses SwapStrategy.unstable, which means quicksort is used as the sorting mechanism. If we replace this with SwapStrategy.stable or SwapStrategy.semistable, TimSort is used instead, and this dramatically cuts the running time -- from almost 2 minutes to under 3 seconds (compared to 13 seconds for igraph with one-by-one edge addition), and under 2 if ldmd2 is used for compiling. This comes at a cost of increasing memory usage from about 3.7 MB (almost identical to igraph) to 5.6. Probably the best way to secure optimal performance without the memory hit is to use an insertion sort (or maybe a smoothsort?). I guess that Timsort would be best to use in the case of adding multiple edges in one go, unless no edges at all have been added before, in which case quicksort would probably be optimal; though quicksort would probably remain best if memory management is the priority. So, the new D code is still competitive with igraph, but needs some smoothing around the edges (quite literally!:-).
Jul 11 2013
On Thursday, 11 July 2013 at 10:25:40 UTC, Joseph Rushton Wakeling wrote:On 07/11/2013 02:24 AM, Joseph Rushton Wakeling wrote:This is very promising. Great work!I know, and it's coming. :-) The main memory-related issues will probably not show up in a situation like this where all we're doing is storing the graph data, but in the case where algorithms are being performed on the data.For comparison, I performed the same tests but with a 10_000 node graph. Here we see similar memory use, but igraph outperforms dgraph by a factor of nearly 10 even with the insert of nodes one at a time. Profiling shows that the time difference is accounted for by the sorting algorithm used in the indexEdges() method: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 93.12 110.01 110.01 20000 0.01 0.01 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165Z11SortedRange561__T13quickSortImplS5066dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda4VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165 Z11SortedRange11__lambda168TS3std5range14__T3ZipTAmTAmZ3ZipZ13quickSortImplMFS3std5range14__T3ZipTAmTAmZ3ZipZv 3.88 114.59 4.58 20000 0.00 0.00 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165Z11SortedRange561__T13quickSortImplS5066dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ12schwartzSortMFAmZS6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv222__T11SortedRangeTAmS1986dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv132__T12schwartzSortS606dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv9__lambda2VAyaa5_61203c2062VE3std9algorithm12SwapStrategy0TAmZ11__lambda165 Z11SortedRange11__lambda168TS3std5range14__T3ZipTAmTAmZ3ZipZ13quickSortImplMFS3std5range14__T3ZipTAmTAmZ3ZipZv 1.81 116.73 2.14 1124043790 0.00 0.00 _D3std5range14__T3ZipTAmTAmZ3Zip7opIndexMFNaNbNfmZS3std8typecons14__T5TupleTmTmZ5Tuple 0.59 117.42 0.70 203164131 0.00 0.00 _D3std9algorithm43__T6swapAtTS3std5range14__T3ZipTAmTAmZ3ZipZ6swapAtFNaNbNfS3std5range14__T3ZipTAmTAmZ3ZipmmZv 0.42 117.92 0.50 20000 0.00 0.01 _D6dgraph5graph13__T5GraphVb0Z5Graph10indexEdgesMFZv By default, schwartzSort uses SwapStrategy.unstable, which means quicksort is used as the sorting mechanism. If we replace this with SwapStrategy.stable or SwapStrategy.semistable, TimSort is used instead, and this dramatically cuts the running time -- from almost 2 minutes to under 3 seconds (compared to 13 seconds for igraph with one-by-one edge addition), and under 2 if ldmd2 is used for compiling. This comes at a cost of increasing memory usage from about 3.7 MB (almost identical to igraph) to 5.6. Probably the best way to secure optimal performance without the memory hit is to use an insertion sort (or maybe a smoothsort?). I guess that Timsort would be best to use in the case of adding multiple edges in one go, unless no edges at all have been added before, in which case quicksort would probably be optimal; though quicksort would probably remain best if memory management is the priority. So, the new D code is still competitive with igraph, but needs some smoothing around the edges (quite literally!:-).
Jul 11 2013
On 7/11/13 3:25 AM, Joseph Rushton Wakeling wrote:By default, schwartzSort uses SwapStrategy.unstable, which means quicksort is used as the sorting mechanism. If we replace this with SwapStrategy.stable or SwapStrategy.semistable, TimSort is used instead, and this dramatically cuts the running time -- from almost 2 minutes to under 3 seconds (compared to 13 seconds for igraph with one-by-one edge addition), and under 2 if ldmd2 is used for compiling.Unstable sort should be faster than stable sort. If I remember correctly (1) our TimSort is indeed slower than quicksort on random numbers, (2) there is a performance bug that makes our quicksort perform quadratically on data that's essentially sorted but has one unsorted element at the end. Is that the case? Andrei
Jul 11 2013
On 07/11/2013 05:17 PM, Andrei Alexandrescu wrote:Unstable sort should be faster than stable sort. If I remember correctly (1) our TimSort is indeed slower than quicksort on random numbers, (2) there is a performance bug that makes our quicksort perform quadratically on data that's essentially sorted but has one unsorted element at the end. Is that the case?That's _exactly_ the case here. :-\ Thanks for the clarification!
Jul 11 2013
11-Jul-2013 20:02, Joseph Rushton Wakeling пишет:On 07/11/2013 05:17 PM, Andrei Alexandrescu wrote:In such a case if inserting exactly one element into a sorted range you may want to special case it to lowerbound + insert. E.g. something like this: auto idx = sortedRange.lowerbound(val).length; //insert should be called on array backing that sortedRange insert(sortedRange, idx, val);Unstable sort should be faster than stable sort. If I remember correctly (1) our TimSort is indeed slower than quicksort on random numbers, (2) there is a performance bug that makes our quicksort perform quadratically on data that's essentially sorted but has one unsorted element at the end. Is that the case?That's _exactly_ the case here. :-\Thanks for the clarification!-- Dmitry Olshansky
Jul 11 2013
On 07/11/2013 10:43 PM, Dmitry Olshansky wrote:In such a case if inserting exactly one element into a sorted range you may want to special case it to lowerbound + insert. E.g. something like this: auto idx = sortedRange.lowerbound(val).length; //insert should be called on array backing that sortedRange insert(sortedRange, idx, val);... which would basically correspond to an insertion sort, right? I'm quite tired on reading this so will have to think it over fresh tomorrow. What I'm not certain of is how to tweak it given that actually here we have two arrays: size_t[] values; // doesn't change, but we've added a new value at the end size_t[] sortedIndices; // indices 0 .. values.length, sorted according // to values[i] ... so, it's sortedIndices that we need to sort, but according to the corresponding entries in the array of values. Anyway, thanks for the thought -- lowerbound() is a function I've never used before so it wouldn't have occurred to me.
Jul 11 2013
12-Jul-2013 02:23, Joseph Rushton Wakeling пишет:On 07/11/2013 10:43 PM, Dmitry Olshansky wrote:Yup, but done in one step. I'd put it the other way around: simply put this is inserting an item into a sorted range => binary search + insert. An insertion sort is built around this operation performed repeatedly.In such a case if inserting exactly one element into a sorted range you may want to special case it to lowerbound + insert. E.g. something like this: auto idx = sortedRange.lowerbound(val).length; //insert should be called on array backing that sortedRange insert(sortedRange, idx, val);... which would basically correspond to an insertion sort, right?I'm quite tired on reading this so will have to think it over fresh tomorrow. What I'm not certain of is how to tweak it given that actually here we have two arrays: size_t[] values; // doesn't change, but we've added a new value at the end size_t[] sortedIndices; // indices 0 .. values.length, sorted according // to values[i]Append new value to values. Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices. The last step is to figure out what range to call lowerBound on - I'd say something like: assumeSorted(sortedIndices.map!(x => values[x])) then use that to find a suitable place to insert in sortedIndices.... so, it's sortedIndices that we need to sort, but according to the corresponding entries in the array of values. Anyway, thanks for the thought -- lowerbound() is a function I've never used before so it wouldn't have occurred to me.-- Dmitry Olshansky
Jul 11 2013
On 07/12/2013 12:37 AM, Dmitry Olshansky wrote:Append new value to values. Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices. The last step is to figure out what range to call lowerBound on - I'd say something like: assumeSorted(sortedIndices.map!(x => values[x])) then use that to find a suitable place to insert in sortedIndices.Thanks very much for that, I'll try it out and report back on performance. :-) I think I may move the discussion over to D.learn as I have some other new profiling results to follow up on -- it'll be an interesting lesson in how to manage memory for performance.
Jul 12 2013
On 07/12/2013 12:37 AM, Dmitry Olshansky wrote:Append new value to values. Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices. The last step is to figure out what range to call lowerBound on - I'd say something like: assumeSorted(sortedIndices.map!(x => values[x])) then use that to find a suitable place to insert in sortedIndices.So, just to report back -- this is the code I came up with based on your suggestion: void indexEdges() { immutable size_t l = _indexHead.length; foreach(e; iota(l, _head.length)) { size_t i = _indexHead.map!(a => _head[a]).assumeSorted.lowerBound(_head[e]).length; insertInPlace(_indexHead, i, e); i = _indexTail.map!(a => _tail[a]).assumeSorted.lowerBound(_tail[e]).length; insertInPlace(_indexTail, i, e); } } This is _much_ faster than the previous code (although still not as fast as igraph when adding multiple edges in one go), and also more memory efficient. I'm a little bothered that the insertion implies potentially many re-allocs, and I wonder if it might be even better if the length of _indexHead and _indexTail can be increased in one go, and the "insertion" of the new edge index happen without risking a reallocation. Anyway, thanks very much for the useful idea! igraph is still ultimately faster in the case where multiple edges are added in a single go, but that's an issue we can confront later.
Jul 12 2013
On 07/12/2013 03:12 PM, Joseph Rushton Wakeling wrote:I'm a little bothered that the insertion implies potentially many re-allocs, and I wonder if it might be even better if the length of _indexHead and _indexTail can be increased in one go, and the "insertion" of the new edge index happen without risking a reallocation.An attempt at an alternative: immutable size_t l = _indexHead.length; _indexHead.length = _head.length; _indexTail.length = _tail.length; foreach(e; iota(l, _head.length)) { size_t i, j; i = _indexHead[0 .. e].map!(a => _head[a]).assumeSorted.lowerBound(_head[e]).length; for(j = e; j > i; --j) _indexHead[j] = _indexHead[j - 1]; _indexHead[i] = e; i = _indexTail[0 .. e].map!(a => _tail[a]).assumeSorted.lowerBound(_tail[e]).length; for(j = e; j > i; --j) _indexTail[j] = _indexTail[j - 1]; _indexTail[i] = e; } It doesn't seem to make any difference in speed or memory consumption to the current code (OK, perhaps a tiny speed increase, but very tiny). It might have some relevance in the case where multiple edges are added in one go. I'll keep it in mind for that.
Jul 12 2013
On Thursday, 11 July 2013 at 15:17:03 UTC, Andrei Alexandrescu wrote:(2) there is a performance bug that makes our quicksort perform quadratically on data that's essentially sorted but has one unsorted element at the end.Do you have a link? I couldn't find it on the bugzilla, though I do remember a discussion of this from a while back.
Jul 11 2013
I don't like it. Here is why. 1. You can only use a size_t type as the vertex type. What about strings? 2. The distinction between directed and undirected graphs is made with an boolean type parameter? Why not an enum? Better yet, why not just use interfaces? (You are already using classes and garbage collected memory.) 3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges. 4. Adding an edge doesn't add vertices. My most common use of graphs is to start from nothing, build from adding edges arbitrarily, usually from IDs which may be integers, and may be strings, and then use graph algorithms to gain some understanding of the relation between objects mapped to these IDs. With this library, this will be difficult or impossible.
Jul 11 2013
On Thu, Jul 11, 2013 at 08:14:00PM +0200, w0rp wrote:I don't like it. Here is why. 1. You can only use a size_t type as the vertex type. What about strings? 2. The distinction between directed and undirected graphs is made with an boolean type parameter? Why not an enum? Better yet, why not just use interfaces? (You are already using classes and garbage collected memory.)I think an enum would be better.3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges.Yeah, constraining the number of vertices/edges at ctor time is not feasible. We need to have an implementation that allows modifying the graph post-construction, such as adding edges. (Deleting may not be as important, and may be tricky to implement, but there *are* use cases for that too.)4. Adding an edge doesn't add vertices.Couldn't you just add vertices manually, since you already know what the edge is?My most common use of graphs is to start from nothing, build from adding edges arbitrarily, usually from IDs which may be integers, and may be strings, and then use graph algorithms to gain some understanding of the relation between objects mapped to these IDs. With this library, this will be difficult or impossible.Yeah we need to cater to this use case too. However, you should be able to use a helper class/graph to collect all vertices/edges, then create the library's graph type using that once all the information is known (e.g. number of vertices/edges, etc.). I wouldn't say this is impossible. Troublesome, yes, but definitely possible. T -- He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter
Jul 11 2013
On Thursday, 11 July 2013 at 18:14:01 UTC, w0rp wrote:I don't like it. Here is why. 1. You can only use a size_t type as the vertex type. What about strings?This is a core graph type designed explicitly for speed and efficiency. If you want string ids for vertices, it's better to wrap this with maps from string to size_t and back again. Having strings or other arbitrary data types involved in the core data type will just be a nightmare space- and speed-wise.2. The distinction between directed and undirected graphs is made with an boolean type parameter? Why not an enum?It is an enum.why not just use interfaces? (You are already using classes and garbage collected memory.)It's in consideration. I haven't yet identified a clear need.3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges. 4. Adding an edge doesn't add vertices.Again, those things are probably best achieved by wrappers that mediate the adding of edges.My most common use of graphs is to start from nothing, build from adding edges arbitrarily, usually from IDs which may be integers, and may be strings, and then use graph algorithms to gain some understanding of the relation between objects mapped to these IDs. With this library, this will be difficult or impossible.I understand those requirements, I just think that the core data structure on which calculations are done is not the ideal place to implement them. (Sorry for terse replies, I'm writing on my phone...)
Jul 11 2013
On Thu, Jul 11, 2013 at 08:34:17PM +0200, Joseph Rushton Wakeling wrote:On Thursday, 11 July 2013 at 18:14:01 UTC, w0rp wrote:[...]I don't like it. Here is why. 1. You can only use a size_t type as the vertex type. What about strings?This is a core graph type designed explicitly for speed and efficiency. If you want string ids for vertices, it's better to wrap this with maps from string to size_t and back again. Having strings or other arbitrary data types involved in the core data type will just be a nightmare space- and speed-wise.That makes sense. The core graph type is what's used internally by the graphing algorithms, right? So it should be optimized for maximum performance in those algorithms. However, I think we should also provide friendlier representations for user-facing types. One possibility is a wrapper type that accepts string IDs for nodes/edges, and internally maintains a mapping between them and numerical IDs, so that the user doesn't have to keep doing this manually. This way the algorithms can run at top speed, and the user can still get a nice API for string-based IDs. T -- Freedom of speech: the whole world has no right *not* to hear my spouting off!My most common use of graphs is to start from nothing, build from adding edges arbitrarily, usually from IDs which may be integers, and may be strings, and then use graph algorithms to gain some understanding of the relation between objects mapped to these IDs. With this library, this will be difficult or impossible.I understand those requirements, I just think that the core data structure on which calculations are done is not the ideal place to implement them. (Sorry for terse replies, I'm writing on my phone...)
Jul 11 2013
On Thursday, 11 July 2013 at 18:42:10 UTC, H. S. Teoh wrote:However, I think we should also provide friendlier representations for user-facing types. One possibility is a wrapper type that accepts string IDs for nodes/edges, and internally maintains a mapping between them and numerical IDs, so that the user doesn't have to keep doing this manually. This way the algorithms can run at top speed, and the user can still get a nice API for string-based IDs.Yes, I agree. The goal right now is to get the internal stuff really well done, with a friendly face to follow.
Jul 11 2013
On 07/11/2013 08:14 PM, w0rp wrote:3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges.I'm not sure I understand why this is a problem with the class presented here. Can you clarify what your concern is?
Jul 11 2013
On 07/11/2013 08:27 PM, H. S. Teoh wrote:Yeah, constraining the number of vertices/edges at ctor time is not feasible. We need to have an implementation that allows modifying the graph post-construction, such as adding edges.You can add as many edges as you like -- right now, the implementation doesn't even constrain you from adding _the same edge_ multiple times. (I think that's a feature rather than a bug -- it allows for support of multigraphs.) You can also extend the number of vertices. Just use the addVertices() method, although to be honest I think the easiest thing might be to just allow the user to assign to the vertexCount property, with some checks to make sure that they don't reduce the total number of vertices below the largest head/tail value.(Deleting may not be as important, and may be tricky to implement, but there *are* use cases for that too.)I am also concerned about complications related to deletion, but I'm sure it's possible to address.Couldn't you just add vertices manually, since you already know what the edge is?Before calling addEdge(head, tail), calculate m = max(head, tail). Then if m >= vertexCount, set vertexCount = m + 1. Then add your edges. This is easily done with wrappers. I think there's a benefit to the core class being strict and _not_ just automatically extending the number of vertices, just as an array doesn't automatically expand if you try and read/write from an index beyond its current length.
Jul 11 2013
On 07/11/2013 08:14 PM, w0rp wrote:1. You can only use a size_t type as the vertex type. What about strings?I want to reply at greater length on this point, because I don't want you to feel I'm dismissing your concerns. I remember that back in our earlier discussions you posted a basic data type of something like, WeightType[VertexID][VertexID]; where VertexID might be an integer, a string, or potentially something else. Let's simplify and suppose we have bool[VertexID][VertexID], so that it's just true of there's a link, false otherwise. In principle you could simplify still further and just have a set of VertexID pairs, except that Phobos still doesn't have a decent Set implementation (... hint, hint ... :-) This will be quick (and fairly memory efficient, I'd have thought) to create, and the distinction between Set and MultiSet would allow an easy constraint on whether you have a graph or a multigraph. You also have a ready and efficient way to check if a given link (head, tail) is in the graph, although I think I can probably do a good job on that with the current implementation (more on that another time:-). What I'm not sure that you have so readily is an efficient way to extract e.g. degree(v), or neighbours(v), and I'm concerned that the extra data you'll have to carry to change this may actually lead to a memory blow-up. I'd be very happy to be proved wrong about this, so if you have some ideas, or better still code for comparison and benchmarking, I'll happily reconsider the design. But for now my contention is that generic vertex IDs are only relevant when it comes to input or output, and that's where they should be implemented, not in the core processing code.3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges.I don't see the capacity issue as being different from e.g. the fact that an array's capacity must be controlled manually unless you explicitly attempt to append to the end of it. As for arbitrary numbers for edges: I think in practice this should be OK, so long as you ensure that vertexCount is greater than the maximum vertex ID number, and there are not _too_ many gaps between the vertex IDs that are used. (e.g. if your vertex IDs are 1 to 50 rather than 0 to 49, you should have no problem.) Then, most results can be adapted simply by filtering out "nodes" with degree == 0. If on the other hand your node IDs are 10, 20, 30, ... you might not get such a good performance ... :-)4. Adding an edge doesn't add vertices.Here I don't think I have anything to add apart from that I don't like the idea of a graph that might surreptitiously add an arbitrary number of vertices simply on the basis of an edge being added. I'd rather force the user to check that the link being added is valid given the vertices available, and provide a wrapper to offer those kinds of checks and the opportunity to expand the number of vertices if needed.
Jul 12 2013