www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - d2sqlite3 db.run, where lies the bug?

reply Ralph Amissah <ralph.amissah gmail.com> writes:
/+
Not sure where to report this, nor of where the bug lies. I hope
SQLite (and d2sqlite3) is used widely enough for this to be of
interest here.

I have sets of document files that are broken up and placed
(inserted) into an sqlite3 db, some of which fail with what is to me
an inexplicable utf-8 error as they contain no special characters
and it is "corrected" without the removal of any character in
particular, and I so far, cannot predict them. Below I have
concocted a sample string that fails and variations of it that pass
(with one character removed or added).

Note,
(i) There is no problem inserting the string in question into
  sqlite3 directly (i.e. using sqlite3 directly).
(ii) Likewise there is no problem using d2sqlite3 with a prepared
  statement on the string (.execute), but,... But for the use case,
  each document has several thousand content strings for each of
  which sqlite3 then requires a data locks and releases for the
  insertion of the next, thousands per document and this makes the
  operation conservatively several tens of time slower, than:
  generating a prepared sql statement inserting all document rows
  (objects/paragraphs) as a single sql statement with db.run begin
  and commit. Basically, it makes a significant difference that this
  works.
(iii) There do not appear to be any offending utf-8 characters

Assumption, most likely candidate for blame is the d2sqlite3 wrapper
(or my code, (something that needs to be escaped in certain
circumstances? please tell me)); sqlite3 does not have a problem
with the string in question, and the C api for which d2sqlite3
provides a wrapper is much used and seems unlikely to be to blame.

The exact location of problem may be provided in the error statement
"core.exception.UnicodeException src/rt/util/utf.d(292): invalid
UTF-8 sequence".

From a laymans perspecive it would appear that db.run the passing of
an sql statement as a string to sqlite3 should be the simplest type of transaction; pass a statement/string unchanged to sqlite3 Sample offending text used: "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", The prepared sql statement that fails with d2sqlite3 (but succeeds using sqlite3 directly): BEGIN; DROP TABLE IF EXISTS test; CREATE TABLE test ( lid BIGINT PRIMARY KEY, txt TEXT NULL ); INSERT INTO test (txt) VALUES ('Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.'); COMMIT; core.exception.UnicodeException rt/util/utf.d(292): invalid UTF-8 sequence in the sample problem text/string/statement for d2sqlite3 (above): - utf-8 error goes away on removal of either: - one of the semi-colons (of which there are 2) - any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) - comments - placing a random semi-colon within the two offending semi-colons sometimes works - double semi-colon does not help Mock problem string with test code follows (d2sqlite3 required): +/ module d2sqlite3_utf8.issue; import std.conv : to; import std.format; import std.stdio; import d2sqlite3; void main() { string[] info_tag = ["pass", "fault"]; foreach (t, tests; [ok_check, faults]) { foreach (i, insert_test; tests) { // auto db = Database("test.sqlite"); auto db = Database(":memory:"); string _sql_statement = format(q"¶ BEGIN; DROP TABLE IF EXISTS test; CREATE TABLE test ( lid BIGINT PRIMARY KEY, txt TEXT NULL ); INSERT INTO test (txt) VALUES ('%s'); COMMIT; ¶", insert_test[0], ); writeln( (i + 1), ". ", info_tag[t], ": ", insert_test[1], "\n", _sql_statement); db.run(_sql_statement); foreach (parse; db.execute("SELECT txt FROM test;")) { foreach (s; parse) { writeln( " (test string == sqlite database content): ", (insert_test[0] == s.to!string), "\n"); assert(insert_test[0] == s.to!string); // writeln(s); } } assert(db.totalChanges == 1); db.close; } } } string[][] faults = [ [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "sample problem text, utf-8 error?\n\"core.exception.UnicodeException src/rt/util/utf.d(292): invalid UTF-8 sequence\"?", ] ]; /+ in the sample problem text (above): - utf-8 error goes away on removal of either: - one of the semi-colons (of which there are 2) - any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) - comments - placing a random semi-colon within the two offending semi-colons sometimes works - double semi-colon does not help eat text that passes tests below +/ string[][] ok_check = [ [ "Contrary to Peter’s cake all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "remove one of the semi-colons (of which there are 2) from sample problem text:" ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries funny.", "remove one of the semi-colons (of which there are 2) from sample problem text:" ], [ "Contrary to Peter’s cake; all versions of Peters, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text\n[here changing the word \"berries\" before the last semi-colon to a less than 7 character word like \"sugar\" or \"garlic\" results in failure]:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cooks permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peters recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peters lunch. You didn’t eat Peter’s berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didnt eat Peter’s berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peters berries; funny.", "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:", ], [ "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu; shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.", "place a random semi-colon within the two offending semi-colons and sample problem text sometimes works, an example:", ], ]; /+ Sorry this became so verbose, (also if unclear, or if I have overlooked anything obvious), I probably should report directly to Nicolas Sicard first but the bug seems so off the wall. Nicolas thanks for the d2sqlite3 wrapper btw, and you are probably my best bet to have this fixed, unless I have stumbled upon something even more interesting. Thanks, Ralph Amissah P.S. if relevant, currently using: - d2sqlite3 0.16.0; - DMD64 D Compiler v2.074.1 | LDC - the LLVM D compiler (1.8.0) (based on DMD v2.078.3 and LLVM 5.0.1) | gdc (Debian 8-20180402-1) 8.0.1 20180402 (experimental) - Debian (latest, so called unstable); +/
Apr 10 2018
parent reply ag0aep6g <anonymous example.com> writes:
On 04/10/2018 08:04 PM, Ralph Amissah wrote:
 The exact location of problem may be provided in the error statement
 "core.exception.UnicodeException src/rt/util/utf.d(292): invalid
 UTF-8 sequence".
 
[...]
 Mock problem string with test code follows (d2sqlite3 required):
 
[... code ...] A more minimal test case, reduced from your code: ---- module d2sqlite3_utf8.issue; import d2sqlite3; void main() { string[] info_tag = ["pass", "fault"]; auto db = Database(":memory:"); string _sql_statement = `SELECT '’’';`; db.run(_sql_statement); db.close; } ---- From the exception's stack trace we see that `d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd` is the deepest non-Phobos function involved. So that's a good first spot to look for a bug. Let's check it out. https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83 And indeed, there's a bug in these lines: ---- auto tail = sql[pos .. $]; immutable offset = tail.countUntil(';') + 1; pos += offset; ---- `pos` is used to slice the string `sql`. That means, `pos` is interpreted as a number of UTF-8 code *units*. But then the result of `countUntil` is added. `countUntil` counts code *points*. So a number of code points is mistaken as a number of code units. That means the next slicing can be incorrect and split up a multibyte sequence. And then `countUntil` will complain about broken UTF-8. This can be fixed by letting `countUntil` operate on count code units instead: ---- import std.utf: byCodeUnit; immutable offset = tail.byCodeUnit.countUntil(';') + 1; ---- If you want, you can make a bug report or a pull request with the fix. Otherwise, if you're not up to that, I can make one. [...]
    - DMD64 D Compiler v2.074.1
That's rather old. I'd recommend updating if possible.
Apr 10 2018
parent reply Ralph Amissah <ralph.amissah gmail.com> writes:
ag0aep6g, thank you most of all for fixing the bug, the removal of my immediate
frustration! Also for the thorough step-wise instruction on sleuthing and fixing
it.

 If you want, you can make a bug report or a pull request with the fix.
 Otherwise, if you're not up to that, I can make one.
Generous, I merely stumbled on it and shouted out, you fixed it. I would be grateful if you please file the bug and your fix. (I thought I might have more to report, but your fix has resolved my multitude of issues with several texts). I did copy my original post to the author of d2sqlite3 (and again this reply which includes your fix), so he hopefully has the information required to make the upstream fix. The diff/patch I used: diff -Naur d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d --- d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d 2018-04-10 20:34:11.584498926 -0400 +++ d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d 2018-04-10 20:46:09.869812899 -0400 -65,13 +65,14 { import std.algorithm : countUntil; import std.string : toStringz; + import std.utf: byCodeUnit; size_t pos; bool complete; do { auto tail = sql[pos .. $]; - immutable offset = tail.countUntil(';') + 1; + immutable offset = tail.byCodeUnit.countUntil(';') + 1; pos += offset; if (offset == 0) pos = sql.length; (if patch is contained in patchfile called d2sqlite3_bugfix.diff and at the same directory level as subdirectory containing d2sqlite3-0.16.0) cp -av d2sqlite3-0.16.0 d2sqlite3-0.16.0-1 \ && cd d2sqlite3-0.16.0-1 \ && patch -Np1 < ../d2sqlite3_bugfix.diff On Tue, Apr 10 2018, ag0aep6g via Digitalmars-d <digitalmars-d puremagic.com> wrote:
 On 04/10/2018 08:04 PM, Ralph Amissah wrote:
 The exact location of problem may be provided in the error statement
 "core.exception.UnicodeException src/rt/util/utf.d(292): invalid
 UTF-8 sequence".
[...]
 Mock problem string with test code follows (d2sqlite3 required):
[... code ...]
[... snip ...]
  From the exception's stack trace we see that
 `d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd`
 is the deepest non-Phobos function involved. So that's a good first spot
 to look for a bug. Let's check it out.

 https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83

 And indeed, there's a bug in these lines:

 ----
 auto tail = sql[pos .. $];
 immutable offset = tail.countUntil(';') + 1;
 pos += offset;
 ----

 `pos` is used to slice the string `sql`. That means, `pos` is
 interpreted as a number of UTF-8 code *units*. But then the result of
 `countUntil` is added. `countUntil` counts code *points*. So a number of
 code points is mistaken as a number of code units. That means the next
 slicing can be incorrect and split up a multibyte sequence. And then
 `countUntil` will complain about broken UTF-8.

 This can be fixed by letting `countUntil` operate on count code units
 instead:

 ----
 import std.utf: byCodeUnit;
 immutable offset = tail.byCodeUnit.countUntil(';') + 1;
 ----

 If you want, you can make a bug report or a pull request with the fix.
 Otherwise, if you're not up to that, I can make one.

 [...]
    - DMD64 D Compiler v2.074.1
That's rather old. I'd recommend updating if possible.
for the time being it reflects the status of the rolling development branch of Debian. LDC is based on a newer version. Thanks again. Ralph Amissah
Apr 10 2018
parent reply ag0aep6g <anonymous example.com> writes:
On 04/11/2018 04:08 AM, Ralph Amissah wrote:
 Generous, I merely stumbled on it and shouted out, you fixed it. I would be
 grateful if you please file the bug and your fix.
There you go: https://github.com/biozic/d2sqlite3/pull/43
Apr 11 2018
parent biozic <dransic gmail.com> writes:
On Wednesday, 11 April 2018 at 11:00:30 UTC, ag0aep6g wrote:
 On 04/11/2018 04:08 AM, Ralph Amissah wrote:
 Generous, I merely stumbled on it and shouted out, you fixed 
 it. I would be
 grateful if you please file the bug and your fix.
There you go: https://github.com/biozic/d2sqlite3/pull/43
Thanks for fixing this ! -- Nicolas
Apr 15 2018