## digitalmars.D - [OT] Best algorithm for extremely large hashtable?

- H. S. Teoh (25/25) Nov 15 2013 This isn't directly related to D (though the code will be in D), and I
- Dmitry Olshansky (7/12) Nov 15 2013 Store as a bit-flag in whatever structure that serves as node. It's
- H. S. Teoh (18/31) Nov 15 2013 [...]
- bearophile (5/14) Nov 15 2013 Look for Succinct Data-Structures, Graphs. You can impose further
- Andrei Alexandrescu (4/9) Nov 15 2013 The classic trick here is parallel arrays - you store a bit array in
- qznc (3/50) Nov 15 2013 A Trie might be an idea, especially the bitwise variant.
- Vladimir Panteleev (13/25) Nov 15 2013 A while ago I set out to write a solver for a group of problems
- H. S. Teoh (8/28) Nov 15 2013 Thanks!! I think delayed duplicate detection is what I'm looking for.
- John Colvin (8/10) Nov 15 2013 Could perhaps elaborate on this point?
- John Colvin (2/4) Nov 15 2013 sorry, I meant *integer multiples* of the euclidian basis vectors.
- H. S. Teoh (15/27) Nov 15 2013 Basically, adjacent nodes differ only in a single coordinate, and that
- Ivan Kazmenko (36/52) Nov 15 2013 How dense is the graph?
- John Colvin (9/38) Nov 15 2013 If you're going to chuck away the orthogonal location of the
- qznc (11/45) Nov 15 2013 So, -10 to 10 in discrete steps. This means 5 bits per dimension
- H. S. Teoh (22/53) Nov 15 2013 The graph is not explicitly stored, because it is far too large. The
- Peter Alexander (17/32) Nov 15 2013 Have multiple levels of lookup.

This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)), and (2) doesn't require GB's of storage (i.e., some kind of compression would be nice). The graph nodes can be represented in various ways, but possibly the most convenient representation is as n-dimensional vectors of (relatively small) integers. Furthermore, graph edges are always between vectors that differ only by a single coordinate; so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid. The hashtable, therefore, needs to represent some connected subset of this grid in a space-efficient manner, that still allows fast lookup times. The naïve approach of using an n-dimensional bit array is not feasible because n can be quite large (up to 100 or so), and the size of the grid itself can get up to about 10 in each direction, so we're looking at a potential maximum size of 10^100, clearly impractical to store explicitly. Are there any known good algorithms for tackling this problem? Thanks! T -- Creativity is not an excuse for sloppiness.

Nov 15 2013

15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)),Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node). -- Dmitry Olshansky

Nov 15 2013

On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:[...] There are too many nodes to keep in memory, actually. The algorithm also doesn't need to store the nodes; it can easily generate them on-the-fly when it needs them. The challenge is when it needs to know whether a particular generated node has been visited before. I'd like to be able to have a fast lookup (hopefully O(1)) that tells me whether node X has been seen before. The problem is, there may have been a very large number of previously-seen nodes, so I need some way of storing this information in a space-efficient way. I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes). T -- I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike DresserThis isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)),Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node).

Nov 15 2013

H. S. Teoh:I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes).Look for Succinct Data-Structures, Graphs. You can impose further memory savings in your problem. Bye, bearophile

Nov 15 2013

On 11/15/13 11:20 AM, H. S. Teoh wrote:I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes).The classic trick here is parallel arrays - you store a bit array in parallel with an array of nodes. Andrei

Nov 15 2013

On Friday, 15 November 2013 at 19:21:35 UTC, H. S. Teoh wrote:On Fri, Nov 15, 2013 at 10:52:15PM +0400, Dmitry Olshansky wrote:A Trie might be an idea, especially the bitwise variant. http://en.wikipedia.org/wiki/Trie15-Nov-2013 22:41, H. S. Teoh Ð¿Ð¸ÑˆÐµÑ‚:[...] There are too many nodes to keep in memory, actually. The algorithm also doesn't need to store the nodes; it can easily generate them on-the-fly when it needs them. The challenge is when it needs to know whether a particular generated node has been visited before. I'd like to be able to have a fast lookup (hopefully O(1)) that tells me whether node X has been seen before. The problem is, there may have been a very large number of previously-seen nodes, so I need some way of storing this information in a space-efficient way. I'm hoping for a kind of compressed representation where multiple adjacent nodes can be represented in a very compact way, by somehow taking advantage of the fact that they lie on an n-dimensional grid and the fact that graph edges are always a subset of grid edges (as opposed to edges between two arbitrary nodes).This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)),Store as a bit-flag in whatever structure that serves as node. It's going to be the fastest anyway and you indeed get to use only 1 bit per node (and given padding and whatnot you may already have a couple of bytes to spare per node).

Nov 15 2013

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)), and (2) doesn't require GB's of storage (i.e., some kind of compression would be nice).A while ago I set out to write a solver for a group of problems which can be described as traversing in breath extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM. What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables. You mentioned compression - perhaps a bloom filter will fit your needs, as an optimization?

Nov 15 2013

On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:[...]Thanks!! I think delayed duplicate detection is what I'm looking for. The bloom filter idea is an interesting one; I'll definitely consider it as a later optimization once I get DDD working. T -- It won't be covered in the book. The source code has to be useful for something, after all. -- Larry WallI'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)), and (2) doesn't require GB's of storage (i.e., some kind of compression would be nice).A while ago I set out to write a solver for a group of problems which can be described as traversing in breath extremely large implicit graphs. Some code here (C++): https://github.com/CyberShadow/DDD. The project uses delayed duplicate detection, to allow the number of nodes to exceed available RAM. What we found was that in certain cases, delayed duplicate detection beat hash tables even while filtering duplicates in memory, I think mainly because DDD is more parallelizable than hash tables. You mentioned compression - perhaps a bloom filter will fit your needs, as an optimization?

Nov 15 2013

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid.Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives. How is this the same as the edges of an n-dimensional grid?

Nov 15 2013

On Friday, 15 November 2013 at 19:48:21 UTC, John Colvin wrote:i.e. the edges are the euclidean basis vectors and their negatives.sorry, I meant *integer multiples* of the euclidian basis vectors.

Nov 15 2013

On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:Actually, i can only be 1 or -1.so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid.Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.How is this the same as the edges of an n-dimensional grid?Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1. So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's. I was hoping that this fact can be taken advantage of, to make a compact representation of visited nodes. T -- By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality. -- D. Knuth

Nov 15 2013

On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:How dense is the graph? For example, if it contains every possible edge described (+-1 to every single coordinate), for a breadth-first search, we can manage to just keep track of a single integer: the farthest distance traveled. For very dense graphs, you can perhaps apply a similar idea: represent large visited areas as "diamonds" with center and radius, then try to "factor" the visited areas into diamonds on-the-fly. Possibly they will be "diamonds with a [much shorter] list of holes". For example, we say that all nodes not further than 3 from (0, 0, ..., 0) are visited, and store a single center (origin) and a radius (3) instead of all (dimensions[100] to the power of distance[3]) of them. The lookup will be at most O (number-of-diamonds) plus a hash table lookup for the individual nodes. Perhaps for certain graphs, we can find a good compromise between the number of diamonds and the number of individual stored nodes. The above is just an idea, perhaps it won't be feasible by itself when you get to the details, but can inspire some related optimization. Also, the size of the border of n-dimensional area is a (n-1)-dimensional object, and for dense enough graphs, you can hope that the number of elements in it is less by an order of magnitude than the total number of visited nodes. However, for too much dimensions (as in your case, 10^100 -> 10^99), it does not seem to help much. Another question is, when will the graph traversal end? For example, if you are certain that you won't need to visit more than say one million nodes, a simple hash table storing the node representations at the hash indices will suffice. On the other hand, if you plan to visit 10^12 nodes, and the graph is not very sparse or very dense (and not regular in any obvious way besides what is described), perhaps you won't get the required compression level (1/1000) anyway. Ivan Kazmenko.How is this the same as the edges of an n-dimensional grid?Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1. So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's. I was hoping that this fact can be taken advantage of, to make a compact representation of visited nodes.

Nov 15 2013

On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:Ah, ok.On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:Actually, i can only be 1 or -1.so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid.Could perhaps elaborate on this point? As far as I can see, the possible values of the nodes are all the points in N^n (or Z^n, dunno whether you have negatives) and the edges are all in the set of i*e_j for i in Z and j in N, where e_j is the unit vector along one axis, i.e. the edges are the euclidean basis vectors and their negatives.How is this the same as the edges of an n-dimensional grid?Basically, adjacent nodes differ only in a single coordinate, and that difference can only be 1 or -1.So for the 2D case, if you graph the nodes on the plane, the edges would be horizontal/vertical line segments of unit length. If you consider the 2D grid's edges to be *all* possible such line segments, then in that sense you could think of the graph's edges as being a subset of the 2D grid's.If you're going to chuck away the orthogonal location of the edges and only consider parallel position then you might as well go all the way: In the 2-D example, the edges are all +/-(0,1) and +/-(1,0). There's no difference between the edge connecting (x, p) to (x, p+1) and the edge connecting (x, q) to (x, q+1). In the general case there's only 2N possible unique edges.

Nov 15 2013

On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:This isn't directly related to D (though the code will be in D), and I thought this would be a good place to ask. I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)), and (2) doesn't require GB's of storage (i.e., some kind of compression would be nice). The graph nodes can be represented in various ways, but possibly the most convenient representation is as n-dimensional vectors of (relatively small) integers. Furthermore, graph edges are always between vectors that differ only by a single coordinate; so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid. The hashtable, therefore, needs to represent some connected subset of this grid in a space-efficient manner, that still allows fast lookup times. The naÃ¯ve approach of using an n-dimensional bit array is not feasible because n can be quite large (up to 100 or so), and the size of the grid itself can get up to about 10 in each direction, so we're looking at a potential maximum size of 10^100, clearly impractical to store explicitly.So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed of a compute cluster or does it fit into single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")? Can you preprocess? I mean, walk all the nodes O(n) to compute a good (perfect?) hash function. In general, I think you should either store the flag right in the graph node or mirror the graph structure. I do not know any concrete algorithms.

Nov 15 2013

On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:[...]The graph is not explicitly stored, because it is far too large. The nodes are generated on-the-fly as needed; the challenge is to know whether a particular generated node has been generated before. Basically, the equivalent of a hashtable of previously-visited nodes, but I'm hoping for (1) some kind of compressed representation so that as many nodes as possible can be kept in RAM for speed, and (2) something more efficient for disk I/O, because I'm expecting the number of generated nodes will not fit in RAM and will have to be put on disk (and hashtables don't do very well with disk I/O due to the latency).I'm trying to implement an algorithm that traverses a very large graph, and I need some kind of data structure to keep track of which nodes have been visited, that (1) allows reasonably fast lookups (preferably O(1)), and (2) doesn't require GB's of storage (i.e., some kind of compression would be nice). The graph nodes can be represented in various ways, but possibly the most convenient representation is as n-dimensional vectors of (relatively small) integers. Furthermore, graph edges are always between vectors that differ only by a single coordinate; so the edges of the graph may be thought of as a subset of the edges of an n-dimensional grid. The hashtable, therefore, needs to represent some connected subset of this grid in a space-efficient manner, that still allows fast lookup times. The naïve approach of using an n-dimensional bit array is not feasible because n can be quite large (up to 100 or so), and the size of the grid itself can get up to about 10 in each direction, so we're looking at a potential maximum size of 10^100, clearly impractical to store explicitly.So, -10 to 10 in discrete steps. This means 5 bits per dimension and 500 bits for a single coordinate. Is the graph distributed of a compute cluster or does it fit into single computer? With a few GB of RAM, this means your graph is quite sparse, yet nodes are connected ("differ only by a single coordinate")?Can you preprocess? I mean, walk all the nodes O(n) to compute a good (perfect?) hash function.Walking the nodes defeats the purpose, because the whole idea is to *not* have to generate every node, just those selected by the algorithm (those are already very unwieldy in number, nevermind generating *all* nodes).In general, I think you should either store the flag right in the graph node or mirror the graph structure.[...] That doesn't solve the problem when graph nodes are generated on-the-fly, since I wouldn't know where to look up the node structure without first knowing that it has been generated before. T -- Ph.D. = Permanent head Damage

Nov 15 2013

On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:The graph is not explicitly stored, because it is far too large. The nodes are generated on-the-fly as needed; the challenge is to know whether a particular generated node has been generated before. Basically, the equivalent of a hashtable of previously-visited nodes, but I'm hoping for (1) some kind of compressed representation so that as many nodes as possible can be kept in RAM for speed, and (2) something more efficient for disk I/O, because I'm expecting the number of generated nodes will not fit in RAM and will have to be put on disk (and hashtables don't do very well with disk I/O due to the latency).Have multiple levels of lookup. For example, let's say your entire grid G is 1,000,000 x 1,000,000. Have a top level grid H that is 1000 x 1000. Each H[i, j] corresponds to a 1000 x 1000 region in G. If any cell has been visited in that region then H[i, j] points to a direct representation of that 1000 x 1000 subregion of G, if not, then it points to null. Checking if G[i, j] has been visited is as easy as: visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, j%1000]; You can easily page-in and page-out those 1000 x 1000 subregions to disk if needed, just keeping the active ones in memory. H will stay in memory (only 8MB). Obviously, you'd be better off using a power of 2 instead of 1000 for fast div and mod. And of course, you can have as many levels to this as you like.

Nov 15 2013