digitalmars.D - Splitting std.algorithm

H. S. Teoh via Digitalmars-d (66/66) Jan 20 2015 OK, so the past weekend, I was foolhardy enough to attempt (again) to

Andrei Alexandrescu (3/5) Jan 20 2015 cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).

Jonathan M Davis via Digitalmars-d (24/28) Jan 20 2015 I honestly think that many developers are overly interested in having sm...
H. S. Teoh via Digitalmars-d (60/89) Jan 20 2015 Although I agree that we shouldn't be splitting up stuff just for the

Andrei Alexandrescu (3/5) Jan 20 2015 I'm fine with reviewing by a few of us and getting that in. It deserves

Steven Schveighoffer (9/13) Jan 20 2015 I honestly don't care whether std.algorithm is split or not.

H. S. Teoh via Digitalmars-d (18/28) Jan 20 2015 [...]

Steven Schveighoffer (15/33) Jan 22 2015 I think you misunderstand my note. I am ambivalent to splitting

Zach the Mystic (15/24) Jan 22 2015 Sometimes it seems to me that condensing some of dmd's data

Dicebot (3/3) Jan 22 2015 DMD FE memory issues are almost fully because of template

Jonathan M Davis via Digitalmars-d (13/16) Jan 22 2015 I forget what the numbers were, but IIRC, when Don looked at that a whil...
H. S. Teoh via Digitalmars-d (13/30) Jan 22 2015 [...]

Jonathan M Davis via Digitalmars-d (10/29) Jan 20 2015 Oh. I think that std.datetime is large enough that it would be better to

Steven Schveighoffer (4/7) Jan 22 2015 This is the only reason I think we need to split it :) Doing reviews on

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

OK, so the past weekend, I was foolhardy enough to attempt (again) to
split std.algorithm into more manageable pieces... and unlucky enough to
actually succeed this time:

	https://github.com/D-Programming-Language/phobos/pull/2879

However, right now this PR is in a rather precarious situation, because
git doesn't understand the concept of moving content between files; as a
result, *any* further changes to std/algorithm.d in master must be
manually merged into the branch (as in, the diffs must be applied by
hand). As we all (should) know, such by-hand merges are extremely
error-prone, and subtle bugs can get introduced inadvertently. A
careless mistake on my part could accidentally revert a bugfix PR, for
example.

Besides, such by-hand merges are also very time-consuming, and I really
rather do better things than to sit around manually applying diffs all
day, and all that without knowing whether or not this PR is going to get
merged at all.

So I have a request: can we please decide ASAP whether or not this PR is
worth it, and, if it is, merge it ASAP? Since we're currently in the
process of improving docs, there's an extremely high chance that
std/algorithm.d will be touched again in the near future. I've already
spent hours manually applying the diffs (and coaxing git to behave,
which is challenging in this situation) for a *single* affected PR that
was merged today, and I really do not want to keep doing this if at all
possible. If this is a bad idea, I'd like to know right now and not
waste any more time on it.

So far, the arguments for splitting std.algorithm are:

- The file is too big. It increases compilation time and compiler memory
  usage, and makes locating a particular piece of code more difficult
  than it needs to be.

- I can't even run the Phobos unittests on my machine because dmd runs
  out of memory and dies. This means either (1) I submit untested PRs so
  that I can leverage the autotester to run the tests for me, which
  means lots of wasted autotester resources and waste of time for me as
  I have to do the code-compile-test cycle on the autotester; or (2) I
  have to manually delete large swaths of code from std.algorithm while
  working on the PR, just so I can unittest my changes properly. I'm
  sure I'm not the only one here who has trouble running Phobos
  unittests; this means the barrier to contribution is needlessly high.

- It's a conglomeration of only tenuously-related functions, and as a
  result, importing std.algorithm will pull in roughly half of Phobos in
  dependencies that you may not actually need. In the process of
  splitting, I have found that I could eliminate many module-level
  imports, and/or otherwise reduce dependencies to other Phobos modules.

- Although this PR doesn't do this yet, having smaller submodules means
  that other Phobos modules that need something from std.algorithm won't
  have to import the entire thing (which in turn would cause a snowball
  effect of also importing every dependency of std.algorithm, and their
  respective recursive dependencies, most of which are unnecessary
  anyway, since it may be just 1 or 2 functions that are actually
  needed). With the new submodules, if you need map(), for example, you
  could just import std.algorithm.iteration : map, and you won't incur
  the cost of also importing stuff that only, say, cartesianProduct
  needs.

- An overly large module makes it difficult for new users to understand
  what the module does, or whether it happens to contain something they
  need. This is to some extent alleviated by proper documentation, but
  even then, functions categorized into 6 submodules is a lot more
  browseable than a single gigantic module that contains everything
  including the kitchen sink.

The only argument against splitting std.algorithm (that I know of) is:

- Andrei doesn't approve because apparently some people think "big files
  are not a problem".


So, what do you think? Should we merge this, or should we not?


T

-- 
Many open minds should be closed for repairs. -- K5 user

Jan 20 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".

cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o). 
-- Andrei

Jan 20 2015

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Tuesday, January 20, 2015 16:10:26 Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".

 cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).

I honestly think that many developers are overly interested in having small
modules. Splitting stuff up too much causes maintenance problems (e.g. it
becomes that much harder to find everything when there are a lot of files to
look through, and it's that much less obvious which module something might
live in), and in my experience, large modules like std.algorithm or
std.datetime are actually quite maintainable. However, that doesn't mean
that we wouldn't be better off splitting up the particularly large ones. I
just started on splitting std.datetime again the other day, and hopefully I
can find time enough to finish it before I end up having to deal with
merging other changes in.

As for std.algorithm, I think that the fact that the unit tests take up too
much memory on some of the Phobos developers' machines is enough to merit at
least looking at splitting it up. It's an actual, objective problem rather
than a subjective one. And std.algorithm contains enough disparate functions
that it certainly wouldn't hurt us to split it up from an organizational
point of view either. So, if H. S. Teoh has managed to split it in a sane
way, it makes good sense for us to look it over and merge it if it looks
good. Certainly, letting it sit is just going to cause all kinds of grief -
especially when it's probably the most-edited file in all of Phobos.

I had thought that the consensus was already that we should split
std.algorithm at some point. The trick was spending the time to do it and
get it right.

- Jonathan M Davis

Jan 20 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Jan 20, 2015 at 05:26:41PM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 On Tuesday, January 20, 2015 16:10:26 Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big
 files are not a problem".

 cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).

 
 I honestly think that many developers are overly interested in having
 small modules. Splitting stuff up too much causes maintenance problems
 (e.g. it becomes that much harder to find everything when there are a
 lot of files to look through, and it's that much less obvious which
 module something might live in),

Although I agree that we shouldn't be splitting up stuff just for the
sake of splitting it up, I don't agree that it's a problem to find
stuff. In this day and age, most software projects are too large to scan
through stuff manually; you'd use search tools like grep or IDE search
or whatever. Besides, a "real" editor like vim :-P is most useful when
you navigate via the search function instead of the directional/paging
keys anyway, and with D support in ctags, I don't see why splitting
things into smaller files would be a problem. That on its own doesn't
justify splitting, of course, but neither does it count against
splitting IMO.


 and in my experience, large modules like std.algorithm or std.datetime
 are actually quite maintainable.  However, that doesn't mean that we
 wouldn't be better off splitting up the particularly large ones. I
 just started on splitting std.datetime again the other day, and
 hopefully I can find time enough to finish it before I end up having
 to deal with merging other changes in.

std.datetime is one of those things that has grown large enough that
it's causing a noticeable pause when I open it in my editor or search
for a symbol... I think that's nearing the point where splitting just on
basis of size may become justifiable. :-P

(Having said that, though, std.datetime unittests actually compile and
run on my machine, in spite of their far larger number, yet
std.algorithm doesn't. I think it's because of too many deeply-nested
templates in std.algorithm, which probably includes a problem of my own
making, namely one of the overloads of cartesianProduct, that causes an
exponential number of recursive template instantiations. I've been
meaning to fix that, and have in fact managed to fix it for the finite
range case, but the infinite range case thus far eludes me.)


 As for std.algorithm, I think that the fact that the unit tests take
 up too much memory on some of the Phobos developers' machines is
 enough to merit at least looking at splitting it up.

Yeah, I haven't been able to run Phobos unittests locally for months
now (perhaps even a year?). I think that's pretty near the point of
being ridiculous.


 It's an actual, objective problem rather than a subjective one. And
 std.algorithm contains enough disparate functions that it certainly
 wouldn't hurt us to split it up from an organizational point of view
 either. So, if H. S. Teoh has managed to split it in a sane way, it
 makes good sense for us to look it over and merge it if it looks good.

I think the disparate functions part is probably the biggest reason to
split it. Lumping disparate functions into one file means all the
disparate dependencies of said functions also get lumped into one file,
so if you import X, you're also forced to import Z just because Y, which
you don't need, happens to sit in the same file as X and Y imports Z.
I'm pretty sure this tangled web of interdependencies between Phobos
modules is responsible for a significant proportion of complaints about
Phobos template bloat / excessive executable sizes. As well as the
somewhat amusing finding of mine some time ago (dunno if it's still
true) that importing std.algorithm (and not actually referencing
anything in it) will introduce a dependency on std.complex to your
program, even though you never use anything that might remotely need to
reference std.complex. I wasn't able to track down the source of this
issue before, because std.algorithm was just far too big to manage; but
perhaps after the split it will become more tractable.

It also introduces some inadvertent circular dependencies that can cause
hard-to-understand bugs, especially when conditional compilation is
involved. If you have two modules A and B, and A.x depends on B.x and
B.y depends on A.y, then you have a circular dependency between A and B
even though in actuality they *aren't* circularly dependent. But since
D's import granularity is the module, the circular dependency is there,
at which point it becomes a tricky thing to make sure things are
instantiated in the right order to resolve the apparent dependency loop,
otherwise a static if somewhere might fail where it shouldn't. (I've
seen this problem before but didn't have the patience to actually
unravel it down to the actual cause. Reducing the amount of gratuitous
dependencies would help a lot in making this easier to track down.)


[...]
 I had thought that the consensus was already that we should split
 std.algorithm at some point. The trick was spending the time to do it
 and get it right.

[...]

That's what I thought too, which is why I was a bit taken aback when
Andrei seemed to disapprove of the PR.


T

-- 
I think the conspiracy theorists are out to get us...

Jan 20 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/20/15 6:02 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what I thought too, which is why I was a bit taken aback when
 Andrei seemed to disapprove of the PR.

I'm fine with reviewing by a few of us and getting that in. It deserves 
a fair shake of the stick, and I'll review it soon, too. -- Andrei

Jan 20 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/20/15 7:10 PM, Andrei Alexandrescu wrote:
 On 1/20/15 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
 - Andrei doesn't approve because apparently some people think "big files
    are not a problem".

 cc Jonathan M. Davis, Steve Schveighoffer if I remember correctly :o).

I honestly don't care whether std.algorithm is split or not.

The documentation needs to be split. But that can be done with the 
future doc generator.

However, anything that furthers the reduction in cross-importing all of 
phobos is good in my book.

I also think this is pasting over the real problem -- dmd consumes too 
much memory.

-Steve

Jan 20 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Jan 20, 2015 at 10:16:18PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
[...]
 I honestly don't care whether std.algorithm is split or not.
 
 The documentation needs to be split. But that can be done with the
 future doc generator.
 
 However, anything that furthers the reduction in cross-importing all
 of phobos is good in my book.
 
 I also think this is pasting over the real problem -- dmd consumes too
 much memory.

[...]

Well, splitting std.algorithm is easier than fixing dmd's memory hogging
ways, I think. :-P  But yeah... something needs to be done about that.
It's downright painful to run dmd on my older work PC, which is an old
dual core pentium. It's not *ancient* ancient, but old enough to have a
small enough memory that things slow to an unusable crawl as soon as I
compile something non-trivial with dmd. Because of that, I've given up
using D for script-like code at work, whereas if dmd was a bit more
conservative in memory usage, I'd replace all my shell and perl scripts
in an instant.

(Well to be fair, most of that memory is hogged by the browser, so dmd
doesn't really have that much wriggling room to begin with, but that
belongs in another rant. :-P)


T

-- 
Real men don't take backups. They put their source on a public FTP-server and
let the world mirror it. -- Linus Torvalds

Jan 20 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/21/15 1:50 AM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, Jan 20, 2015 at 10:16:18PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
 [...]
 I honestly don't care whether std.algorithm is split or not.

 The documentation needs to be split. But that can be done with the
 future doc generator.

 However, anything that furthers the reduction in cross-importing all
 of phobos is good in my book.

 I also think this is pasting over the real problem -- dmd consumes too
 much memory.

 [...]

 Well, splitting std.algorithm is easier than fixing dmd's memory hogging
 ways, I think. :-P

I think you misunderstand my note. I am ambivalent to splitting 
std.algorithm. If it's split, fine, if not, fine too, I'm not objecting. 
But I think if splitting will reduce the cross-importing, I DO want to 
do that. I read that as one of the bonuses.

But splitting just for the sake of splitting, meh... :)

 But yeah... something needs to be done about that.
 It's downright painful to run dmd on my older work PC, which is an old
 dual core pentium.

It's not just old PCs. I had to up my vmware linux image's memory from 
1GB to 2GB just to compile the default vibe.d program. This is 
unacceptable. I rent a VPS with minimum memory, and I have to compile 
vibe.d locally because any compiling of it on that system will crash the 
server.

Fast compilation is great to a point, but if your system can't even 
compile, time taken goes to infinity. I think there has been too much 
emphasis on dmd compilation speed.

-Steve

Jan 22 2015

"Zach the Mystic" <reachzach gggmail.com> writes:

On Thursday, 22 January 2015 at 12:04:43 UTC, Steven 
Schveighoffer wrote:
 It's not just old PCs. I had to up my vmware linux image's 
 memory from 1GB to 2GB just to compile the default vibe.d 
 program. This is unacceptable. I rent a VPS with minimum 
 memory, and I have to compile vibe.d locally because any 
 compiling of it on that system will crash the server.

 Fast compilation is great to a point, but if your system can't 
 even compile, time taken goes to infinity. I think there has 
 been too much emphasis on dmd compilation speed.

 -Steve

Sometimes it seems to me that condensing some of dmd's data 
structures could be an overall win. I guess the downside would be 
that it makes reading dmd's code harder. For example, a Loc has a 
pointer to a file string, a line number and a character position 
number. On a 64-bit program, that's the same 64-bit pointer over 
and over for every token. It could instead be a 32-bit unique 
integer bumped up at every new Loc, pointing to a binary tree 
which calculates the file string as the greatest position <= the 
current Loc number.

Also, packing some of the booleans and small integers more 
tightly, of course. I dunno, might be too hard to do when so many 
data structures are in flux and negatively affects code 
readability.

Jan 22 2015

"Dicebot" <public dicebot.lv> writes:

DMD FE memory issues are almost fully because of template 
instance explosions and CTFE. Micro-optimising structure layouts 
is a bad way to fix it.

Jan 22 2015

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Thursday, January 22, 2015 19:12:55 Dicebot via Digitalmars-d wrote:
 DMD FE memory issues are almost fully because of template
 instance explosions and CTFE. Micro-optimising structure layouts
 is a bad way to fix it.

I forget what the numbers were, but IIRC, when Don looked at that a while
back, he found that templates like isForwardRange were being instantiated
millions of times as part of the unit tests. And maybe the numbers weren't
quite that high, but they were _high_. std.datetime used to have similar
problems when I used assertPred for its unit tests, which was a very handy
templated function for making unit tests more informative, but it was a
template, and it was used a lot in the tests, so it wasn't cheap, and once
it was removed, the memory consumption and time to compile definitely went
down.

It would be a huge win if we could figure out how to avoid having all of
those templates instantiated that many times.

- Jonathan M Davis

Jan 22 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, Jan 22, 2015 at 11:29:16AM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 On Thursday, January 22, 2015 19:12:55 Dicebot via Digitalmars-d wrote:
 DMD FE memory issues are almost fully because of template instance
 explosions and CTFE. Micro-optimising structure layouts is a bad way
 to fix it.

 
 I forget what the numbers were, but IIRC, when Don looked at that a
 while back, he found that templates like isForwardRange were being
 instantiated millions of times as part of the unit tests. And maybe
 the numbers weren't quite that high, but they were _high_.
 std.datetime used to have similar problems when I used assertPred for
 its unit tests, which was a very handy templated function for making
 unit tests more informative, but it was a template, and it was used a
 lot in the tests, so it wasn't cheap, and once it was removed, the
 memory consumption and time to compile definitely went down.
 
 It would be a huge win if we could figure out how to avoid having all
 of those templates instantiated that many times.

[...]

One idea would be to detect if a template doesn't actually generate any
code (doesn't contain any functions or variable declarations in the
outermost scope), and switch to a CTFE-style evaluation of the template
without actually storing any of its instantiations.

But I've no idea whether such an idea is feasible in the current
compiler implementation, or whether it will actually help anything. DMD
not reclaiming *any* memory whatsoever in the name of speed is the
bugbear that hinders improvement in this area.


T

-- 
"The whole problem with the world is that fools and fanatics are always so
certain of themselves, but wiser people so full of doubts." -- Bertrand
Russell. "How come he didn't put 'I think' at the end of it?" -- Anonymous

Jan 22 2015

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Tuesday, January 20, 2015 18:02:47 H. S. Teoh via Digitalmars-d wrote:
 On Tue, Jan 20, 2015 at 05:26:41PM -0800, Jonathan M Davis via Digitalmars-d
wrote:
 and in my experience, large modules like std.algorithm or std.datetime
 are actually quite maintainable.  However, that doesn't mean that we
 wouldn't be better off splitting up the particularly large ones. I
 just started on splitting std.datetime again the other day, and
 hopefully I can find time enough to finish it before I end up having
 to deal with merging other changes in.

 std.datetime is one of those things that has grown large enough that
 it's causing a noticeable pause when I open it in my editor or search
 for a symbol... I think that's nearing the point where splitting just on
 basis of size may become justifiable. :-P

 (Having said that, though, std.datetime unittests actually compile and
 run on my machine, in spite of their far larger number, yet
 std.algorithm doesn't. I think it's because of too many deeply-nested
 templates in std.algorithm, which probably includes a problem of my own
 making, namely one of the overloads of cartesianProduct, that causes an
 exponential number of recursive template instantiations. I've been
 meaning to fix that, and have in fact managed to fix it for the finite
 range case, but the infinite range case thus far eludes me.)

Oh. I think that std.datetime is large enough that it would be better to
split it up. It was actually split when I originally proposed it. It was
just split so badly that I was told to put it back together again (and it
improved quite a bit over the course of the review process). It's large
enough that github won't display the whole thing, and it's far larger than
any other module in Phobos. So, there are definitely good reasons to split
it up. But as far as maintaining it goes, I haven't found it to be a problem
at all that it's in a single file.

- Jonathan M Davis

Jan 20 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/21/15 12:03 AM, Jonathan M Davis via Digitalmars-d wrote:
 It's large
 enough that github won't display the whole thing, and it's far larger than
 any other module in Phobos.

This is the only reason I think we need to split it :) Doing reviews on 
std.datetime PRs are painful.

-Steve

Jan 22 2015

D Programming

C/C++ Programming

Other

digitalmars.D - Splitting std.algorithm