
digitalmars.D.bugs - [Issue 3827] New: automatic joining of adjacent strings is bad

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3827

           Summary: automatic joining of adjacent strings is bad
           Product: D
           Version: 2.040
          Platform: All
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



import std.stdio;
void main() {
    string[] a = ["foo", "bar" "baz", "spam"];
    writeln(a);
}

This code prints:
foo barbaz spam

But probably the programmer meant to create an array with 4 strings.
D has the ~ concat operator, so to prevent possible programming bugs it's
better to remove the implicit concat of strings separated by whitespace.

Everywhere the programmer wants to concat strings the explicit concat operator
can be used:

string s = "this is a very long string that doesn't fit in" ~
           " a line";

The "Python Zen" has a rule that says:

Explicit is better than implicit.

The compiler can optimize the concat away at compile time.

C code ported to D that doesn't put a ~ just raises a compile time error that's
easy to understand and fix.
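
A minimal sketch of the explicit form proposed above, assuming only that an `enum` initialiser is evaluated by the compiler (the name `longText` is made up for illustration). The `static assert` shows that the `~` between the two literals is resolved during compilation, so nothing is joined at run time:

```d
import std.stdio : writeln;

// Both operands are literals, so the compiler has to evaluate the `~`
// itself in order to initialise this manifest constant.
enum string longText = "this is a very long string that doesn't fit in" ~
                       " a line";

// Checked during compilation; no run-time concatenation is involved.
static assert(longText == "this is a very long string that doesn't fit in a line");

void main()
{
    // With a comma between every element the array has four strings.
    string[] a = ["foo", "bar", "baz", "spam"];
    assert(a.length == 4);
    writeln(longText);
}
```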

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 18 2010
d-bugmail puremagic.com writes:

---
Created an attachment (id=571)
patch for parse.c

Vote++ and patch

Feb 18 2010
d-bugmail puremagic.com writes:

 Created an attachment (id=571) [details]
 patch for parse.c
 
 Vote++ and patch

Thank you. But is DMD doing the joining with ~ at compile time? If not, then
you can add that optimization to your patch (if you are able to).
Feb 18 2010
d-bugmail puremagic.com writes:

 Thank you. But is DMD doing the joining with ~ at compile time? If not, then
 you can add that optimization to your patch (if you are able to).

And if you think it's needed, you can add the clear error message I was talking
about :-)
Feb 18 2010
d-bugmail puremagic.com writes:

Alexey Ivanov <aifgi90 gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aifgi90 gmail.com



---
 Thank you. But is DMD doing the joining with ~ at compile time? If not, then
 you can add that optimization to your patch (if you are able to).

I think DMD is doing the joining at compile time (constfold.c, from line 1387).
Feb 28 2010
d-bugmail puremagic.com writes:

The error message for the missing ~ can be something like this (adapted from
the "'l' suffix is deprecated, use 'L' instead" error message generated by the
usage of a 10l long literal):

adjacent string literals concatenation is deprecated, add ~ between them
instead.

Jun 20 2010
d-bugmail puremagic.com writes:

Ellery Newcomer <ellery-newcomer utulsa.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ellery-newcomer utulsa.edu



16:29:07 PDT ---

 The "Python Zen" has a rule that says:
 
 Explicit is better than implicit.
 

The Python compiler has a rule that says to do exactly the same thing as what D
is doing. Your serve.
Jun 20 2010
d-bugmail puremagic.com writes:

I know Python, but I hope D will become better than Python on this syntax
detail.

Jun 20 2010
d-bugmail puremagic.com writes:

A particularly nice example of why untidy syntax easily leads to bugs (this
comes from two different sources of sloppiness of the D2 language):


enum string[5] data = ["green", "magenta", "blue" "red", "yellow"];
static assert(data[4] == "yellow"); // asserts
void main() {}

Aug 21 2010
d-bugmail puremagic.com writes:

Another bug caused in my code by that anti-feature:


unittest {
    auto tests = [["", "0000"], ["12346", "0000"], ["he", "H000"],
                  ["soundex", "S532"], ["example", "E251"],
                  ["ciondecks", "C532"], ["ekzampul", "E251"],
                  ["resume", "R250"], ["Robert", "R163"],
                  ["Rupert", "R163"], ["Rubin" "R150"],
                  ["Ashcraft", "A226"], ["Ashcroft", "A226"]];
    foreach (pair; tests)
        assert(processit(pair[0]) == pair[1]);
}


That code compiles with no errors with DMD 2.050, and then causes a Range
violation at runtime because one of those arrays isn't a pair.
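
One way to catch a malformed row like this before the tests run is a compile-time length check. This is only a sketch: the table is shortened, and turning it into an `enum` so that `std.algorithm.searching.all` can inspect it during compilation is my assumption, not part of the original code:

```d
import std.algorithm.searching : all;

// Shortened copy of the test table, stored as a manifest constant so the
// compiler can inspect it.
enum string[][] tests = [["he", "H000"], ["Rupert", "R163"],
                         ["Rubin", "R150"], ["Ashcraft", "A226"]];

// Compilation stops here if any row is not a pair, for example a row
// that collapsed into the single string "RubinR150".
static assert(tests.all!(row => row.length == 2));

void main()
{
    // Indexing pair[1] is now known to be safe for every row.
    foreach (pair; tests)
        assert(pair[1].length == 4);
}
```

The check moves the failure from a run-time Range violation to a compile-time error on the `static assert` line.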

Nov 10 2010
d-bugmail puremagic.com writes:

that it doesn't perform automatic joining of adjacent strings:


public class Test {
    public static void Main() {
        string s = "hello " "world";
    }
}


prog.cs(3,35): error CS1525: Unexpected symbol `world'
Compilation failed: 1 error(s), 0 warnings

Nov 10 2010
d-bugmail puremagic.com writes:

Walter agrees:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=121830

Nov 12 2010
d-bugmail puremagic.com writes:

A comment from Andrei Alexandrescu:
Walter, please don't forget to tweak the associativity rules: var ~ " literal "
~ " literal " concatenates literals first.

Nov 12 2010
d-bugmail puremagic.com writes:

A comment from Stewart Gordon:

 You mean make ~ right-associative?  I think this'll break more code than
 it fixes.

 But implementing a compiler optimisation so that var ~ ctc ~ ctc is
 processed as var ~ (ctc ~ ctc), _in those cases where they're
 equivalent_, would be sensible.

 ctc = compile-time constant
Nov 13 2010
d-bugmail puremagic.com writes:

19:26:18 PST ---
You don't need to mess with associativity rules, you just need to be able to
handle two or three AST cases:

1. (~ str str)        ie  str ~ str
2. (~ (~ x str) str)  ie  x ~ str ~ str
3. (~ str (~ str x))  ie  str ~ (str ~ x)
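
The three cases above can be sketched on a toy AST. Everything here (`Expr`, `Var`, `Str`, `Cat`, `fold`, `show`) is a hypothetical illustration and not DMD's real classes, and it assumes `x` is itself a string, so that regrouping does not change the result:

```d
import std.stdio : writeln;

// Hypothetical toy AST; these are not DMD's classes.
abstract class Expr {}
final class Var : Expr { string name;  this(string n) { name = n; } }
final class Str : Expr { string value; this(string v) { value = v; } }
final class Cat : Expr { Expr lhs, rhs; this(Expr l, Expr r) { lhs = l; rhs = r; } }

// Folds adjacent string literals, assuming every operand is a string.
Expr fold(Expr e)
{
    auto c = cast(Cat) e;
    if (c is null)
        return e;
    auto l = fold(c.lhs);
    auto r = fold(c.rhs);
    auto ls = cast(Str) l;
    auto rs = cast(Str) r;

    // Case 1: str ~ str
    if (ls !is null && rs !is null)
        return new Str(ls.value ~ rs.value);

    // Case 2: (x ~ str) ~ str  becomes  x ~ (str ~ str)
    if (rs !is null)
        if (auto lc = cast(Cat) l)
            if (auto mid = cast(Str) lc.rhs)
                return new Cat(lc.lhs, new Str(mid.value ~ rs.value));

    // Case 3: str ~ (str ~ x)  becomes  (str ~ str) ~ x
    if (ls !is null)
        if (auto rc = cast(Cat) r)
            if (auto mid = cast(Str) rc.lhs)
                return new Cat(new Str(ls.value ~ mid.value), rc.rhs);

    return new Cat(l, r);
}

// Renders an expression with explicit parentheses.
string show(Expr e)
{
    if (auto v = cast(Var) e) return v.name;
    if (auto s = cast(Str) e) return "\"" ~ s.value ~ "\"";
    auto c = cast(Cat) e;
    return "(" ~ show(c.lhs) ~ " ~ " ~ show(c.rhs) ~ ")";
}

void main()
{
    auto e = new Cat(new Cat(new Var("x"), new Str("a")), new Str("b"));
    writeln(show(e), "  =>  ", show(fold(e)));
    // prints: ((x ~ "a") ~ "b")  =>  (x ~ "ab")
}
```

No associativity rule changes here; the parser still builds the left-leaning tree and only the folding step regroups it.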

Nov 13 2010
d-bugmail puremagic.com writes:

Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clugdbug yahoo.com.au




 you don't need to mess with associativity rules, you just need to be able to
 handle two or three ast cases:
 
 1. (~ str str)        ie  str ~ str
 2. (~ (~ x str) str)  ie  x ~ str ~ str
 3. (~ str (~ str x))  ie  str ~ (str ~ x)

Like this (optimize.c, line 1023):

Expression *CatExp::optimize(int result)
{   Expression *e;

    //printf("CatExp::optimize(%d) %s\n", result, toChars());
    e1 = e1->optimize(result);
    e2 = e2->optimize(result);
+    if (e1->op == TOKcat && (e2->op == TOKstring || e2->op == TOKnull)
+            && (((CatExp *)e1)->e2->op == TOKstring || ((CatExp *)e1)->e2->op == TOKnull))
+    {
+        // Convert  (e ~ str) ~ str into  e ~ (str ~ str)
+        CatExp *ce = ((CatExp *)e1);
+        e1 = ce->e1;
+        ce->e1 = ce->e2;
+        ce->e2 = e2;
+        e2 = ce;
+    }
    e = Cat(type, e1, e2);
    if (e == EXP_CANT_INTERPRET)
        e = this;
    return e;
}
Nov 13 2010
d-bugmail puremagic.com writes:

Sorry, missed out a line:

    if (e1->op == TOKcat && (e2->op == TOKstring || e2->op == TOKnull)
            && (((CatExp *)e1)->e2->op == TOKstring || ((CatExp *)e1)->e2->op == TOKnull))
    {
        // Convert  (e ~ str) ~ str into  e ~ (str ~ str)
        CatExp *ce = ((CatExp *)e1);
        e1 = ce->e1;
        ce->e1 = ce->e2;
        ce->e2 = e2;
        e2 = ce;
+        e2 = e2->optimize(result);
    }

Nov 13 2010
d-bugmail puremagic.com writes:

Stewart Gordon <smjg iname.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smjg iname.com




 The error message for the missing ~ can be something like this (adapted from
 the "'l' suffix is deprecated, use 'L' instead" error message generated by the
 usage of a 10l long literal):
 
 adjacent string literals concatenation is deprecated, add ~ between them
 instead.

Better watch out for cases where just adding ~ changes the behaviour. For
example, if a is a string[], then a ~ "this" "that" and a ~ "this" ~ "that"
evaluate to different strings.

Not that there's any real use case for "this" "that" anyway. And those rare
use cases it does have in D can be fixed by inserting the ~, though there may
be easier-to-miss cases of the above of which to be wary.
Nov 16 2010
d-bugmail puremagic.com writes:

 For example, if a is a string[], then a ~ "this" "that" and a ~ "this" ~ "that"
 evaluate to different strings.

Different string arrays even.
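
To make the difference concrete, here is a small sketch with explicit `~` only. The function names `appendSeparately` and `appendJoined` are made up for illustration:

```d
import std.stdio : writeln;

// ~ is left-associative, so this appends two separate elements.
string[] appendSeparately(string[] a)
{
    return a ~ "this" ~ "that";
}

// The parentheses join the literals first, so one element is appended.
string[] appendJoined(string[] a)
{
    return a ~ ("this" ~ "that");
}

void main()
{
    string[] a = ["x"];
    writeln(appendSeparately(a)); // three elements
    writeln(appendJoined(a));     // two elements
}
```

So code that relied on `a ~ "this" "that"` needs the parenthesised form to keep its old meaning.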
Nov 16 2010
d-bugmail puremagic.com writes:

nfxjfg gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nfxjfg gmail.com




 Not that there's any real use case for "this" "that" anyway.  And those rare
 use cases

I use automatic joining all the time for long string literals. I want them to
span multiple source lines without containing line breaks.
No, not a rarely used feature.
Nov 16 2010
d-bugmail puremagic.com writes:

 Not that there's any real use case for "this" "that" anyway.  And those rare
 use cases
I use automatic joining all the time for long string literals. I want them to span multiple source lines without containing line breaks. No, not a rarely used feature.

Stewart Gordon was just talking about code like:
a ~ "this" "that"
where a is a string[].

To join multiple lines you may add a ~ at their end:

string text = "I use automatic joining all the time for long string literals. I want them to " ~
              "span multiple source lines without containing line breaks. " ~
              "No, not a rarely used feature.";
Nov 16 2010
d-bugmail puremagic.com writes:

Steven Schveighoffer <schveiguy yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schveiguy yahoo.com



21:33:05 PST ---


 The error message for the missing ~ can be something like this (adapted from
 the "'l' suffix is deprecated, use 'L' instead" error message generated by the
 usage of a 10l long literal):
 
 adjacent string literals concatenation is deprecated, add ~ between them
 instead.
Better watch out for cases where just adding ~ changes the behaviour. For example, if a is a string[], then a ~ "this" "that" and a ~ "this" ~ "that" evaluate to different strings.

Doesn't this solve that problem?

a ~ ("this" ~ "that")

BTW, I don't expect very many cases like this (in fact, I bet there are none).
Nov 16 2010
d-bugmail puremagic.com writes:

 doesn't this solve that problem? a ~ ("this" ~ "that")

It does. My point was that somebody might accidentally not add the brackets.
Nov 17 2010
d-bugmail puremagic.com writes:

12:04:03 PST ---
If constfold can access a's type, it can make the right decision.

Nov 17 2010
d-bugmail puremagic.com writes:

A recent note by Walter:

 Andrei's right. This is not about making it right-associative. It is about
 defining in the language that:
 
     ((a ~ b) ~ c)
 
 is guaranteed to produce the same result as:
 
     (a ~ (b ~ c))
 
 Unfortunately, the language cannot make such a guarantee in the face of
 operator overloading. But it can do it for cases where operator overloading
 is not in play.
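
A small sketch of why overloading gets in the way of that guarantee. The struct `Log` and the helper functions are hypothetical; `Log` simply counts how many times its own `~` runs, so the two groupings give visibly different results, while plain strings agree either way:

```d
import std.stdio : writeln;

// Hypothetical type whose ~ is not associative: it counts its own calls.
struct Log
{
    int appends;

    Log opBinary(string op)(string rhs) if (op == "~")
    {
        return Log(appends + 1);
    }
}

// (l ~ "b") ~ "c": the overloaded operator runs twice.
int leftFirst()
{
    Log l;
    return ((l ~ "b") ~ "c").appends;
}

// l ~ ("b" ~ "c"): the literals are joined first, the overload runs once.
int rightFirst()
{
    Log l;
    return (l ~ ("b" ~ "c")).appends;
}

// For built-in strings both groupings produce the same value.
bool stringsAgree(string a, string b, string c)
{
    return ((a ~ b) ~ c) == (a ~ (b ~ c));
}

void main()
{
    writeln(leftFirst(), " ", rightFirst());
    writeln(stringsAgree("a", "b", "c"));
}
```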
Nov 22 2010
d-bugmail puremagic.com writes:

See also:

http://stackoverflow.com/questions/2504536/why-allow-concatenation-of-string-literals

Mar 20 2011
d-bugmail puremagic.com writes:

An example of the problems this avoids:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=22649

Andrej Mitrovic:

 I see you are not the only one who started writing string array
 literals like this:
 
 enum PEGCode = grammarCode!(
      "Grammar <- S Definition+ EOI"
     ,"Definition <- RuleName Arrow Expression"
     ,"RuleName   <- Identifier>(ParamList?)"
     ,"Expression <- Sequence (OR Sequence)*"
 );
 
 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)
Mar 10 2012
d-bugmail puremagic.com writes:

Andrej Mitrovic <andrej.mitrovich gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrej.mitrovich gmail.com



17:56:16 PST ---

 enum PEGCode = grammarCode!(
      "Grammar <- S Definition+ EOI"
     ,"Definition <- RuleName Arrow Expression"
     ,"RuleName   <- Identifier>(ParamList?)"
     ,"Expression <- Sequence (OR Sequence)*"
 );

Note that this is Philippe Sigaud's code. So you can add him, and me, to the
list of people affected by this.

I'm doing string processing in D on a day-to-day basis, and whenever I have a
list of strings I eventually end up shooting myself in the foot because of a
missing comma. It's very easy (at least for clumsy me) to make the mistake.
E.g. writing some headers to ignore:

string[] ignoredHeaders =
[
    "foo.bar"  // todo: have to fix this later
    "foo.do",  // todo: later
];

When I have comments next to the strings it makes it easy to miss the missing
comma, especially if the strings are of a different length.
Mar 10 2012