www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Splitting std.algorithm

reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
OK, so the past weekend, I was foolhardy enough to attempt (again) to
split std.algorithm into more manageable pieces... and unlucky enough to
actually succeed this time:

	https://github.com/D-Programming-Language/phobos/pull/2879

However, right now this PR is in a rather precarious situation, because
git doesn't understand the concept of moving content between files; as a
result, *any* further changes to std/algorithm.d in master must be
manually merged into the branch (as in, the diffs must be applied by
hand). As we all (should) know, such by-hand merges are extremely
error-prone, and subtle bugs can get introduced inadvertently. A
careless mistake on my part could accidentally revert a bugfix PR, for
example.

Besides, such by-hand merges are also very time-consuming, and I really
rather do better things than to sit around manually applying diffs all
day, and all that without knowing whether or not this PR is going to get
merged at all.

So I have a request: can we please decide ASAP whether or not this PR is
worth it, and, if it is, merge it ASAP? Since we're currently in the
process of improving docs, there's an extremely high chance that
std/algorithm.d will be touched again in the near future. I've already
spent hours manually applying the diffs (and coaxing git to behave,
which is challenging in this situation) for a *single* affected PR that
was merged today, and I really do not want to keep doing this if at all
possible. If this is a bad idea, I'd like to know right now and not
waste any more time on it.

So far, the arguments for splitting std.algorithm are:

- The file is too big. It increases compilation time and compiler memory
  usage, and makes locating a particular piece of code more difficult
  than it needs to be.

- I can't even run the Phobos unittests on my machine because dmd runs
  out of memory and dies. This means either (1) I submit untested PRs so
  that I can leverage the autotester to run the tests for me, which
  means lots of wasted autotester resources and waste of time for me as
  I have to do the code-compile-test cycle on the autotester; or (2) I
  have to manually delete large swaths of code from std.algorithm while
  working on the PR, just so I can unittest my changes properly. I'm
  sure I'm not the only one here who has trouble running Phobos
  unittests; this means the barrier to contribution is needlessly high.

- It's a conglomeration of only tenuously-related functions, and as a
  result, importing std.algorithm will pull in roughly half of Phobos in
  dependencies that you may not actually need. In the process of
  splitting, I have found that I could eliminate many module-level
  imports, and/or otherwise reduce dependencies to other Phobos modules.

- Although this PR doesn't do this yet, having smaller submodules means
  that other Phobos modules that need something from std.algorithm won't
  have to import the entire thing (which in turn would cause a snowball
  effect of also importing every dependency of std.algorithm, and their
  respective recursive dependencies, most of which are unnecessary
  anyway, since it may be just 1 or 2 functions that are actually
  needed). With the new submodules, if you need map(), for example, you
  could just import std.algorithm.iteration : map, and you won't incur
  the cost of also importing stuff that only, say, cartesianProduct
  needs.

- An overly large module makes it difficult for new users to understand
  what the module does, or whether it happens to contain something they
  need. This is to some extent alleviated by proper documentation, but
  even then, functions categorized into 6 submodules is a lot more
  browseable than a single gigantic module that contains everything
  including the kitchen sink.

The only argument against splitting std.algorithm (that I know of) is:

- Andrei doesn't approve because apparently some people think "big files
  are not a problem".


So, what do you think? Should we merge this, or should we not?


T

-- 
Many open minds should be closed for repairs. -- K5 user
Jan 20 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".
cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o). -- Andrei
Jan 20 2015
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, January 20, 2015 16:10:26 Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".
cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).
I honestly think that many developers are overly interested in having small modules. Splitting stuff up too much causes maintenance problems (e.g. it becomes that much harder to find everything when there are a lot of files to look through, and it's that much less obvious which module something might live in), and in my experience, large modules like std.algorithm or std.datetime are actually quite maintainable. However, that doesn't mean that we wouldn't be better off splitting up the particularly large ones. I just started on splitting std.datetime again the other day, and hopefully I can find time enough to finish it before I end up having to deal with merging other changes in. As for std.algorithm, I think that the fact that the unit tests take up too much memory on some of the Phobos developers' machines is enough to merit at least looking at splitting it up. It's an actual, objective problem rather than a subjective one. And std.algorithm contains enough disparate functions that it certainly wouldn't hurt us to split it up from an organizational point of view either. So, if H. S. Teoh has managed to split it in a sane way, it makes good sense for us to look it over and merge it if it looks good. Certainly, letting it sit is just going to cause all kinds of grief - especially when it's probably the most-edited file in all of Phobos. I had thought that the consensus was already that we should split std.algorithm at some point. The trick was spending the time to do it and get it right. - Jonathan M Davis
Jan 20 2015
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Jan 20, 2015 at 05:26:41PM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 On Tuesday, January 20, 2015 16:10:26 Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big
 files are not a problem".
cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).
I honestly think that many developers are overly interested in having small modules. Splitting stuff up too much causes maintenance problems (e.g. it becomes that much harder to find everything when there are a lot of files to look through, and it's that much less obvious which module something might live in),
Although I agree that we shouldn't be splitting up stuff just for the sake of splitting it up, I don't agree that it's a problem to find stuff. In this day and age, most software projects are too large to scan through stuff manually; you'd use search tools like grep or IDE search or whatever. Besides, a "real" editor like vim :-P is most useful when you navigate via the search function instead of the directional/paging keys anyway, and with D support in ctags, I don't see why splitting things into smaller files would be a problem. That on its own doesn't justify splitting, of course, but neither does it count against splitting IMO.
 and in my experience, large modules like std.algorithm or std.datetime
 are actually quite maintainable.  However, that doesn't mean that we
 wouldn't be better off splitting up the particularly large ones. I
 just started on splitting std.datetime again the other day, and
 hopefully I can find time enough to finish it before I end up having
 to deal with merging other changes in.
std.datetime is one of those things that has grown large enough that it's causing a noticeable pause when I open it in my editor or search for a symbol... I think that's nearing the point where splitting just on basis of size may become justifiable. :-P (Having said that, though, std.datetime unittests actually compile and run on my machine, in spite of their far larger number, yet std.algorithm doesn't. I think it's because of too many deeply-nested templates in std.algorithm, which probably includes a problem of my own making, namely one of the overloads of cartesianProduct, that causes an exponential number of recursive template instantiations. I've been meaning to fix that, and have in fact managed to fix it for the finite range case, but the infinite range case thus far eludes me.)
 As for std.algorithm, I think that the fact that the unit tests take
 up too much memory on some of the Phobos developers' machines is
 enough to merit at least looking at splitting it up.
Yeah, I haven't been able to run Phobos unittests locally for months now (perhaps even a year?). I think that's pretty near the point of being ridiculous.
 It's an actual, objective problem rather than a subjective one. And
 std.algorithm contains enough disparate functions that it certainly
 wouldn't hurt us to split it up from an organizational point of view
 either. So, if H. S. Teoh has managed to split it in a sane way, it
 makes good sense for us to look it over and merge it if it looks good.
I think the disparate functions part is probably the biggest reason to split it. Lumping disparate functions into one file means all the disparate dependencies of said functions also get lumped into one file, so if you import X, you're also forced to import Z just because Y, which you don't need, happens to sit in the same file as X and Y imports Z. I'm pretty sure this tangled web of interdependencies between Phobos modules is responsible for a significant proportion of complaints about Phobos template bloat / excessive executable sizes. As well as the somewhat amusing finding of mine some time ago (dunno if it's still true) that importing std.algorithm (and not actually referencing anything in it) will introduce a dependency on std.complex to your program, even though you never use anything that might remotely need to reference std.complex. I wasn't able to track down the source of this issue before, because std.algorithm was just far too big to manage; but perhaps after the split it will become more tractable. It also introduces some inadvertent circular dependencies that can cause hard-to-understand bugs, especially when conditional compilation is involved. If you have two modules A and B, and A.x depends on B.x and B.y depends on A.y, then you have a circular dependency between A and B even though in actuality they *aren't* circularly dependent. But since D's import granularity is the module, the circular dependency is there, at which point it becomes a tricky thing to make sure things are instantiated in the right order to resolve the apparent dependency loop, otherwise a static if somewhere might fail where it shouldn't. (I've seen this problem before but didn't have the patience to actually unravel it down to the actual cause. Reducing the amount of gratuitous dependencies would help a lot in making this easier to track down.) [...]
 I had thought that the consensus was already that we should split
 std.algorithm at some point. The trick was spending the time to do it
 and get it right.
[...] That's what I thought too, which is why I was a bit taken aback when Andrei seemed to disapprove of the PR. T -- I think the conspiracy theorists are out to get us...
Jan 20 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/20/15 6:02 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what I thought too, which is why I was a bit taken aback when
 Andrei seemed to disapprove of the PR.
I'm fine with reviewing by a few of us and getting that in. It deserves a fair shake of the stick, and I'll review it soon, too. -- Andrei
Jan 20 2015
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/20/15 7:10 PM, Andrei Alexandrescu wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".
cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).
I honestly don't care whether std.algorithm is split or not. The documentation needs to be split. But that can be done with the future doc generator. However, anything that furthers the reduction in cross-importing all of phobos is good in my book. I also think this is pasting over the real problem -- dmd consumes too much memory. -Steve
Jan 20 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Jan 20, 2015 at 10:16:18PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
[...]
 I honestly don't care whether std.algorithm is split or not.
 
 The documentation needs to be split. But that can be done with the
 future doc generator.
 
 However, anything that furthers the reduction in cross-importing all
 of phobos is good in my book.
 
 I also think this is pasting over the real problem -- dmd consumes too
 much memory.
[...] Well, splitting std.algorithm is easier than fixing dmd's memory hogging ways, I think. :-P But yeah... something needs to be done about that. It's downright painful to run dmd on my older work PC, which is an old dual core pentium. It's not *ancient* ancient, but old enough to have a small enough memory that things slow to an unusable crawl as soon as I compile something non-trivial with dmd. Because of that, I've given up using D for script-like code at work, whereas if dmd was a bit more conservative in memory usage, I'd replace all my shell and perl scripts in an instant. (Well to be fair, most of that memory is hogged by the browser, so dmd doesn't really have that much wriggling room to begin with, but that belongs in another rant. :-P) T -- Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Jan 20 2015
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/21/15 1:50 AM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, Jan 20, 2015 at 10:16:18PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
 [...]
 I honestly don't care whether std.algorithm is split or not.

 The documentation needs to be split. But that can be done with the
 future doc generator.

 However, anything that furthers the reduction in cross-importing all
 of phobos is good in my book.

 I also think this is pasting over the real problem -- dmd consumes too
 much memory.
[...] Well, splitting std.algorithm is easier than fixing dmd's memory hogging ways, I think. :-P
I think you misunderstand my note. I am ambivalent to splitting std.algorithm. If it's split, fine, if not, fine too, I'm not objecting. But I think if splitting will reduce the cross-importing, I DO want to do that. I read that as one of the bonuses. But splitting just for the sake of splitting, meh... :)
 But yeah... something needs to be done about that.
 It's downright painful to run dmd on my older work PC, which is an old
 dual core pentium.
It's not just old PCs. I had to up my vmware linux image's memory from 1GB to 2GB just to compile the default vibe.d program. This is unacceptable. I rent a VPS with minimum memory, and I have to compile vibe.d locally because any compiling of it on that system will crash the server. Fast compilation is great to a point, but if your system can't even compile, time taken goes to infinity. I think there has been too much emphasis on dmd compilation speed. -Steve
Jan 22 2015
parent reply "Zach the Mystic" <reachzach gggmail.com> writes:
On Thursday, 22 January 2015 at 12:04:43 UTC, Steven 
Schveighoffer wrote:
 It's not just old PCs. I had to up my vmware linux image's 
 memory from 1GB to 2GB just to compile the default vibe.d 
 program. This is unacceptable. I rent a VPS with minimum 
 memory, and I have to compile vibe.d locally because any 
 compiling of it on that system will crash the server.

 Fast compilation is great to a point, but if your system can't 
 even compile, time taken goes to infinity. I think there has 
 been too much emphasis on dmd compilation speed.

 -Steve
Sometimes it seems to me that condensing some of dmd's data structures could be an overall win. I guess the downside would be that it makes reading dmd's code harder. For example, a Loc has a pointer to a file string, a line number and a character position number. On a 64-bit program, that's the same 64-bit pointer over and over for every token. It could instead be a 32-bit unique integer bumped up at every new Loc, pointing to a binary tree which calculates the file string as the greatest position <= the current Loc number. Also, packing some of the booleans and small integers more tightly, of course. I dunno, might be too hard to do when so many data structures are in flux and negatively affects code readability.
Jan 22 2015
parent reply "Dicebot" <public dicebot.lv> writes:
DMD FE memory issues are almost fully because of template 
instance explosions and CTFE. Micro-optimising structure layouts 
is a bad way to fix it.
Jan 22 2015
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, January 22, 2015 19:12:55 Dicebot via Digitalmars-d wrote:
 DMD FE memory issues are almost fully because of template
 instance explosions and CTFE. Micro-optimising structure layouts
 is a bad way to fix it.
I forget what the numbers were, but IIRC, when Don looked at that a while back, he found that templates like isForwardRange were being instantiated millions of times as part of the unit tests. And maybe the numbers weren't quite that high, but they were _high_. std.datetime used to have similar problems when I used assertPred for its unit tests, which was a very handy templated function for making unit tests more informative, but it was a template, and it was used a lot in the tests, so it wasn't cheap, and once it was removed, the memory consumption and time to compile definitely went down. It would be a huge win if we could figure out how to avoid having all of those templates instantiated that many times. - Jonathan M Davis
Jan 22 2015
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 11:29:16AM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 On Thursday, January 22, 2015 19:12:55 Dicebot via Digitalmars-d wrote:
 DMD FE memory issues are almost fully because of template instance
 explosions and CTFE. Micro-optimising structure layouts is a bad way
 to fix it.
I forget what the numbers were, but IIRC, when Don looked at that a while back, he found that templates like isForwardRange were being instantiated millions of times as part of the unit tests. And maybe the numbers weren't quite that high, but they were _high_. std.datetime used to have similar problems when I used assertPred for its unit tests, which was a very handy templated function for making unit tests more informative, but it was a template, and it was used a lot in the tests, so it wasn't cheap, and once it was removed, the memory consumption and time to compile definitely went down. It would be a huge win if we could figure out how to avoid having all of those templates instantiated that many times.
[...] One idea would be to detect if a template doesn't actually generate any code (doesn't contain any functions or variable declarations in the outermost scope), and switch to a CTFE-style evaluation of the template without actually storing any of its instantiations. But I've no idea whether such an idea is feasible in the current compiler implementation, or whether it will actually help anything. DMD not reclaiming *any* memory whatsoever in the name of speed is the bugbear that hinders improvement in this area. T -- "The whole problem with the world is that fools and fanatics are always so certain of themselves, but wiser people so full of doubts." -- Bertrand Russell. "How come he didn't put 'I think' at the end of it?" -- Anonymous
Jan 22 2015
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, January 20, 2015 18:02:47 H. S. Teoh via Digitalmars-d wrote:
 On Tue, Jan 20, 2015 at 05:26:41PM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 and in my experience, large modules like std.algorithm or std.datetime
 are actually quite maintainable.  However, that doesn't mean that we
 wouldn't be better off splitting up the particularly large ones. I
 just started on splitting std.datetime again the other day, and
 hopefully I can find time enough to finish it before I end up having
 to deal with merging other changes in.
std.datetime is one of those things that has grown large enough that it's causing a noticeable pause when I open it in my editor or search for a symbol... I think that's nearing the point where splitting just on basis of size may become justifiable. :-P (Having said that, though, std.datetime unittests actually compile and run on my machine, in spite of their far larger number, yet std.algorithm doesn't. I think it's because of too many deeply-nested templates in std.algorithm, which probably includes a problem of my own making, namely one of the overloads of cartesianProduct, that causes an exponential number of recursive template instantiations. I've been meaning to fix that, and have in fact managed to fix it for the finite range case, but the infinite range case thus far eludes me.)
Oh. I think that std.datetime is large enough that it would be better to split it up. It was actually split when I originally proposed it. It was just split so badly that I was told to put it back together again (and it improved quite a bit over the course of the review process). It's large enough that github won't display the whole thing, and it's far larger than any other module in Phobos. So, there are definitely good reasons to split it up. But as far as maintaining it goes, I haven't found it to be a problem at all that it's in a single file. - Jonathan M Davis
Jan 20 2015
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/21/15 12:03 AM, Jonathan M Davis via Digitalmars-d wrote:
 It's large
 enough that github won't display the whole thing, and it's far larger than
 any other module in Phobos.
This is the only reason I think we need to split it :) Doing reviews on std.datetime PRs are painful. -Steve
Jan 22 2015