digitalmars.D - d2sqlite3 db.run, where lies the bug?

Ralph Amissah (169/170) Apr 10 2018 /+

ag0aep6g (42/49) Apr 10 2018 [... code ...]

Ralph Amissah (38/77) Apr 10 2018 ag0aep6g, thank you most of all for fixing the bug, the removal of my im...

ag0aep6g (2/4) Apr 11 2018 There you go: https://github.com/biozic/d2sqlite3/pull/43

biozic (3/8) Apr 15 2018 Thanks for fixing this !

Ralph Amissah <ralph.amissah gmail.com> writes:

/+
Not sure where to report this, nor of where the bug lies. I hope
SQLite (and d2sqlite3) is used widely enough for this to be of
interest here.

I have sets of document files that are broken up and placed
(inserted) into an sqlite3 db, some of which fail with what is to me
an inexplicable utf-8 error as they contain no special characters
and it is "corrected" without the removal of any character in
particular, and I so far, cannot predict them. Below I have
concocted a sample string that fails and variations of it that pass
(with one character removed or added).

Note,
(i) There is no problem inserting the string in question into
  sqlite3 directly (i.e. using sqlite3 directly).
(ii) Likewise there is no problem using d2sqlite3 with a prepared
  statement on the string (.execute), but,... But for the use case,
  each document has several thousand content strings for each of
  which sqlite3 then requires a data locks and releases for the
  insertion of the next, thousands per document and this makes the
  operation conservatively several tens of time slower, than:
  generating a prepared sql statement inserting all document rows
  (objects/paragraphs) as a single sql statement with db.run begin
  and commit. Basically, it makes a significant difference that this
  works.
(iii) There do not appear to be any offending utf-8 characters

Assumption, most likely candidate for blame is the d2sqlite3 wrapper
(or my code, (something that needs to be escaped in certain
circumstances? please tell me)); sqlite3 does not have a problem
with the string in question, and the C api for which d2sqlite3
provides a wrapper is much used and seems unlikely to be to blame.

The exact location of problem may be provided in the error statement
"core.exception.UnicodeException src/rt/util/utf.d(292): invalid
UTF-8 sequence".

From a laymans perspecive it would appear that db.run the passing of

an sql statement as a string to sqlite3 should be the simplest type
of transaction; pass a statement/string unchanged to sqlite3

Sample offending text used:
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries; funny.",

The prepared sql statement that fails with d2sqlite3 (but succeeds
using sqlite3 directly):

        BEGIN;
        DROP TABLE IF EXISTS test;
        CREATE TABLE test (
          lid     BIGINT PRIMARY KEY,
          txt     TEXT NULL
        );
        INSERT INTO test (txt) VALUES ('Contrary to Peter’s cake; all
versions of Peter’s, custard. Eating without the cook’s permission is, of
course, naughty. Now, a quick check of my menu shows that sure enough, a number
of files have Peter’s recipe in them. John had better tell us if he ate
Peter’s lunch. You didn’t eat Peter’s berries; funny.');
        COMMIT;

core.exception.UnicodeException rt/util/utf.d(292): invalid UTF-8 sequence

in the sample problem text/string/statement for d2sqlite3 (above):
  - utf-8 error goes away on removal of either:
    - one of the semi-colons (of which there are 2)
    - any of the closing single quote mark symbols ’ occurring between the
two semi-colons (of which there are 6)
  - comments
    - placing a random semi-colon within the two offending semi-colons
sometimes works
    - double semi-colon does not help

Mock problem string with test code follows (d2sqlite3 required):

+/

module d2sqlite3_utf8.issue;
import std.conv : to;
import std.format;
import std.stdio;
import d2sqlite3;
void main() {
  string[] info_tag = ["pass", "fault"];
  foreach (t, tests; [ok_check, faults]) {
    foreach (i, insert_test; tests) {
      // auto db = Database("test.sqlite");
      auto db = Database(":memory:");
      string _sql_statement = format(q"¶
        BEGIN;
        DROP TABLE IF EXISTS test;
        CREATE TABLE test (
          lid     BIGINT PRIMARY KEY,
          txt     TEXT NULL
        );
        INSERT INTO test (txt) VALUES ('%s');
        COMMIT;
      ¶",
        insert_test[0],
      );
      writeln(
        (i + 1), ". ", info_tag[t], ": ", insert_test[1],
        "\n", _sql_statement);
      db.run(_sql_statement);
      foreach (parse; db.execute("SELECT txt FROM test;")) {
        foreach (s; parse) {
          writeln(
            "  (test string == sqlite database content): ",
            (insert_test[0] == s.to!string), "\n");
          assert(insert_test[0] == s.to!string);
          // writeln(s);
        }
      }
      assert(db.totalChanges == 1);
      db.close;
    }
  }
}
string[][] faults = [
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries; funny.",
    "sample problem text, utf-8
error?\n\"core.exception.UnicodeException src/rt/util/utf.d(292): invalid UTF-8
sequence\"?",
  ]
];
/+
  in the sample problem text (above):
    - utf-8 error goes away on removal of either:
      - one of the semi-colons (of which there are 2)
      - any of the closing single quote mark symbols ’ occurring between the
two semi-colons (of which there are 6)
    - comments
      - placing a random semi-colon within the two offending semi-colons
sometimes works
      - double semi-colon does not help
  eat text that passes tests below
+/

string[][] ok_check = [
  [
    "Contrary to Peter’s cake all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries; funny.",
     "remove one of the semi-colons (of which there are 2) from sample problem
text:"
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries funny.",
     "remove one of the semi-colons (of which there are 2) from sample problem
text:"
  ],
  [
    "Contrary to Peter’s cake; all versions of Peters, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem
text\n[here changing the word \"berries\" before the last semi-colon to a less
than 7 character word like \"sugar\" or \"garlic\" results in failure]:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cooks permission is, of course, naughty. Now, a quick check of my
menu shows that sure enough, a number of files have Peter’s recipe in them.
John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s
berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peters recipe in them.
John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s
berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peters lunch. You didn’t eat
Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didnt eat
Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peters berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring
between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating
without the cook’s permission is, of course, naughty. Now, a quick check of
my menu; shows that sure enough, a number of files have Peter’s recipe in
them. John had better tell us if he ate Peter’s lunch. You didn’t eat
Peter’s berries; funny.",
    "place a random semi-colon within the two offending semi-colons and sample
problem text sometimes works, an example:",
  ],
];

/+
  Sorry this became so verbose, (also if unclear, or if I have
  overlooked anything obvious), I probably should report directly to
  Nicolas Sicard first but the bug seems so off the wall. Nicolas
  thanks for the d2sqlite3 wrapper btw, and you are probably my best
  bet to have this fixed, unless I have stumbled upon something even
  more interesting.

  Thanks,
  Ralph Amissah

  P.S. if relevant, currently using:
  - d2sqlite3 0.16.0;
  - DMD64 D Compiler v2.074.1
    | LDC - the LLVM D compiler (1.8.0) (based on DMD v2.078.3 and
      LLVM 5.0.1)
    | gdc (Debian 8-20180402-1) 8.0.1 20180402 (experimental)
  - Debian (latest, so called unstable);

+/

Apr 10 2018

ag0aep6g <anonymous example.com> writes:

On 04/10/2018 08:04 PM, Ralph Amissah wrote:
 The exact location of problem may be provided in the error statement
 "core.exception.UnicodeException src/rt/util/utf.d(292): invalid
 UTF-8 sequence".
 

[...]
 Mock problem string with test code follows (d2sqlite3 required):
 

[... code ...]

A more minimal test case, reduced from your code:

----
module d2sqlite3_utf8.issue;
import d2sqlite3;
void main() {
   string[] info_tag = ["pass", "fault"];
     auto db = Database(":memory:");
     string _sql_statement = `SELECT '’’';`;
     db.run(_sql_statement);
     db.close;
}
----

 From the exception's stack trace we see that 
`d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd` 
is the deepest non-Phobos function involved. So that's a good first spot 
to look for a bug. Let's check it out.

https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83

And indeed, there's a bug in these lines:

----
auto tail = sql[pos .. $];
immutable offset = tail.countUntil(';') + 1;
pos += offset;
----

`pos` is used to slice the string `sql`. That means, `pos` is 
interpreted as a number of UTF-8 code *units*. But then the result of 
`countUntil` is added. `countUntil` counts code *points*. So a number of 
code points is mistaken as a number of code units. That means the next 
slicing can be incorrect and split up a multibyte sequence. And then 
`countUntil` will complain about broken UTF-8.

This can be fixed by letting `countUntil` operate on count code units 
instead:

----
import std.utf: byCodeUnit;
immutable offset = tail.byCodeUnit.countUntil(';') + 1;
----

If you want, you can make a bug report or a pull request with the fix. 
Otherwise, if you're not up to that, I can make one.

[...]
    - DMD64 D Compiler v2.074.1

That's rather old. I'd recommend updating if possible.

Apr 10 2018

Ralph Amissah <ralph.amissah gmail.com> writes:

ag0aep6g, thank you most of all for fixing the bug, the removal of my immediate
frustration! Also for the thorough step-wise instruction on sleuthing and fixing
it.

 If you want, you can make a bug report or a pull request with the fix.
 Otherwise, if you're not up to that, I can make one.

Generous, I merely stumbled on it and shouted out, you fixed it. I would be
grateful if you please file the bug and your fix. (I thought I might have more
to report, but your fix has resolved my multitude of issues with several texts).
I did copy my original post to the author of d2sqlite3 (and again this reply
which includes your fix), so he hopefully has the information required to make
the upstream fix. The diff/patch I used:

diff -Naur d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d
d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d
--- d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d	2018-04-10
20:34:11.584498926 -0400
+++ d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d	2018-04-10
20:46:09.869812899 -0400
   -65,13 +65,14   
         {
             import std.algorithm : countUntil;
             import std.string : toStringz;
+            import std.utf: byCodeUnit;

             size_t pos;
             bool complete;
             do
             {
                 auto tail = sql[pos .. $];
-                immutable offset = tail.countUntil(';') + 1;
+                immutable offset = tail.byCodeUnit.countUntil(';') + 1;
                 pos += offset;
                 if (offset == 0)
                     pos = sql.length;

(if patch is contained in patchfile called d2sqlite3_bugfix.diff and at the same
directory level as subdirectory containing d2sqlite3-0.16.0)
cp -av d2sqlite3-0.16.0 d2sqlite3-0.16.0-1 \
&& cd d2sqlite3-0.16.0-1 \
&& patch -Np1 < ../d2sqlite3_bugfix.diff

On Tue, Apr 10 2018, ag0aep6g via Digitalmars-d <digitalmars-d puremagic.com>
wrote:

 On 04/10/2018 08:04 PM, Ralph Amissah wrote:
 The exact location of problem may be provided in the error statement
 "core.exception.UnicodeException src/rt/util/utf.d(292): invalid
 UTF-8 sequence".

 [...]
 Mock problem string with test code follows (d2sqlite3 required):

 [... code ...]

[... snip ...]

  From the exception's stack trace we see that
 `d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd`
 is the deepest non-Phobos function involved. So that's a good first spot
 to look for a bug. Let's check it out.

 https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83

 And indeed, there's a bug in these lines:

 ----
 auto tail = sql[pos .. $];
 immutable offset = tail.countUntil(';') + 1;
 pos += offset;
 ----

 `pos` is used to slice the string `sql`. That means, `pos` is
 interpreted as a number of UTF-8 code *units*. But then the result of
 `countUntil` is added. `countUntil` counts code *points*. So a number of
 code points is mistaken as a number of code units. That means the next
 slicing can be incorrect and split up a multibyte sequence. And then
 `countUntil` will complain about broken UTF-8.

 This can be fixed by letting `countUntil` operate on count code units
 instead:

 ----
 import std.utf: byCodeUnit;
 immutable offset = tail.byCodeUnit.countUntil(';') + 1;
 ----

 If you want, you can make a bug report or a pull request with the fix.
 Otherwise, if you're not up to that, I can make one.

 [...]
    - DMD64 D Compiler v2.074.1

 That's rather old. I'd recommend updating if possible.

for the time being it reflects the status of the rolling development branch of
Debian. LDC is based on a newer version.

Thanks again.
Ralph Amissah

Apr 10 2018

ag0aep6g <anonymous example.com> writes:

On 04/11/2018 04:08 AM, Ralph Amissah wrote:
 Generous, I merely stumbled on it and shouted out, you fixed it. I would be
 grateful if you please file the bug and your fix.

There you go: https://github.com/biozic/d2sqlite3/pull/43

Apr 11 2018

biozic <dransic gmail.com> writes:

On Wednesday, 11 April 2018 at 11:00:30 UTC, ag0aep6g wrote:
 On 04/11/2018 04:08 AM, Ralph Amissah wrote:
 Generous, I merely stumbled on it and shouted out, you fixed 
 it. I would be
 grateful if you please file the bug and your fix.

 There you go: https://github.com/biozic/d2sqlite3/pull/43

Thanks for fixing this !

-- Nicolas

Apr 15 2018

D Programming

C/C++ Programming

Other

digitalmars.D - d2sqlite3 db.run, where lies the bug?