digitalmars.D - Python-like slicing and handling UTF-8 strings as a bonus
- FG (118/118) Dec 29 2012 Slices are great but not really what I had expected, coming from Python.
- Vladimir Panteleev (5/19) Dec 29 2012 This is a common fallacy when dealing with Unicode. Please see
- FG (8/24) Dec 29 2012 Probably because I like concise code. I always prefer:
- FG (4/9) Dec 29 2012 Actually, when I look at this, it can be a one-liner after all. :)
- bearophile (6/10) Dec 29 2012 In std.algorithm there is min(), that helps a little:
- Peter Alexander (12/14) Dec 29 2012 std.range have drop and take, which work on code points, not code
- bearophile (6/8) Dec 29 2012 Right, 90% of the code doesn't need to slice strings (and
- FG (6/15) Dec 29 2012 At least dropping off the back is also possible s[2..$-5]:
- monarch_dodra (18/39) Dec 30 2012 But as a general rule, making a range out of the first (or last)
Slices are great but not really what I had expected, coming from Python. I've seen code like s[a..$-b] used without checking the values, just to end up with a Range violation. But there are 3 constraints to check here:

    a >= 0 && a + b <= s.length && b >= 0

That's way too much coding for a simple program/script that shortens a string, before it prints it on a screen. If I can't write s[0..80] without fear, then let there at least be a function that does it like Python would.

Additionally, as strings are UTF-8-encoded, I'd like such a function to give me proper substrings, without multibyte characters cut in the middle, where s[0..80] would mean 80 characters on the screen and not 80 bytes.

I would envision it being part of std.string eventually. Forgive me if such a function already exists -- I couldn't find it. I also still don't speak D too well, so don't laugh. :)

    import std.array, std.range, std.stdio;

    auto getSlice(T)(T[] s, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max) pure @safe
    {
        bool start_from_back, end_from_back;
        size_t full_len = s.length;
        ptrdiff_t len;

        if (full_len > ptrdiff_t.max)
            len = ptrdiff_t.max;
        else
            len = cast(ptrdiff_t) full_len;

        // Normalize negative indices and clamp both ends to [0, len].
        if (end < 0)
        {
            end_from_back = true;
            end += len;
        }
        if (end > len)
            end = len;
        if (start < 0)
        {
            if (0 - start >= len)
                start = 0;
            else
            {
                start += len;
                start_from_back = true;
            }
        }
        if (start < 0)
            start = 0;
        if (start > end || start >= len || end <= 0)
            return s[0..0];

        static if (is(T == char) || is(T == immutable(char)) || is(T : wchar) || is(T : immutable(wchar)))
        {
            ptrdiff_t real_start = -1, real_end = -1, loop, last_pos;

            // Forward pass: turn positions counted from the front into code-unit offsets.
            if (!start_from_back || !end_from_back)
            {
                foreach (ptrdiff_t i, dchar c; s)
                {
                    if (!start_from_back && loop >= start && real_start < 0)
                        real_start = i;
                    if (!end_from_back && loop >= end && real_end < 0)
                        real_end = i;
                    if ((start_from_back || real_start > -1) && (end_from_back || real_end > -1 || end == len))
                        break;
                    loop++;
                }
            }

            // Switch to negative offsets and resolve positions counted from the back.
            start -= len;
            end -= len;
            loop = -1;
            if (start_from_back || end_from_back)
            {
                foreach_reverse (ptrdiff_t i, dchar c; s)
                {
                    if (start_from_back && loop <= start && real_start < 0)
                        real_start = i;
                    if (end_from_back && loop <= end && real_end < 0)
                        real_end = i;
                    if ((!start_from_back || real_start > -1) && (!end_from_back || real_end > -1))
                        break;
                    loop--;
                }
            }
            if (real_end < 0)
                real_end = (end_from_back ? 0 : len);
            if (real_start < 0)
                real_start = (start_from_back ? 0 : len);
            if (real_start > real_end)
                real_start = real_end = 0;
            return s[real_start..real_end];
        }
        else
            return s[start..end];
    }

    unittest
    {
        string s = "okrągły stół";
        dstring d = "okrągły stół"d;
        auto t = [0, 1, 2, 3, 4];

        assert(t.getSlice(0, -1) == [0, 1, 2, 3]);
        assert(t.getSlice(1, -2) == [1, 2]);
        assert(t.getSlice(-4, -2) == [1, 2]);
        assert(t.getSlice(-5, 7) == [0, 1, 2, 3, 4]);

        assert(s.getSlice(0, 0) == "");
        assert(s.getSlice(0, 1) == "o");
        assert(s.getSlice(0) == s);
        assert(s.getSlice(8) == "stół");
        assert(s.getSlice(8, -1) == "stó");
        assert(s.getSlice(8, -2) == "st");
        assert(s.getSlice(8, -4) == "");
        assert(s.getSlice(10, 11) == "ó");
        assert(s.getSlice(10, -1) == "ó");
        assert(s.getSlice(10, 12) == "ół");
        assert(s.getSlice(11, 12) == "ł");
        assert(s.getSlice(11, 15) == "ł");

        assert(d.getSlice(0, 0) == ""d);
        assert(d.getSlice(0, 1) == "o"d);
        assert(d.getSlice(0) == d);
        assert(d.getSlice(8) == "stół"d);
        assert(d.getSlice(8, -1) == "stó"d);
        assert(d.getSlice(8, -2) == "st"d);
        assert(d.getSlice(8, -4) == ""d);
        assert(d.getSlice(10, 11) == "ó"d);
        assert(d.getSlice(10, -1) == "ó"d);
        assert(d.getSlice(10, 12) == "ół"d);
        assert(d.getSlice(11, 12) == "ł"d);
        assert(d.getSlice(11, 15) == "ł"d);
    }
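For the non-Unicode half of the request, the Python-style clamping can be sketched much more briefly. The helper below is only an illustration: the name `pySlice` is hypothetical (it is not part of Phobos), and it counts array elements, so for a `string` it would count UTF-8 code units rather than characters.

```d
/// Hypothetical helper (not in Phobos): Python-style slice of an array.
/// Negative indices count from the end; out-of-range values are clamped.
T[] pySlice(T)(T[] a, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
{
    immutable len = cast(ptrdiff_t) a.length;

    // Map a possibly negative index into the range [0, n].
    static ptrdiff_t norm(ptrdiff_t i, ptrdiff_t n)
    {
        if (i < 0)
            i += n;                 // -1 refers to the last element
        return i < 0 ? 0 : (i > n ? n : i);
    }

    immutable lo = norm(start, len);
    immutable hi = norm(end, len);

    // An inverted or empty range yields an empty slice instead of a Range violation.
    return lo < hi ? a[lo .. hi] : a[0 .. 0];
}
```

With this, `[0, 1, 2, 3, 4].pySlice(-4, -2)` gives `[1, 2]`, and an over-long request such as `pySlice(0, 80)` on a short array simply returns the whole array.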
Dec 29 2012
On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
> Slices are great but not really what I had expected, coming from Python. I've seen code like s[a..$-b] used without checking the values, just to end up with a Range violation. But there are 3 constraints to check here:
>
>     a >= 0 && a + b <= s.length && b >= 0
>
> That's way too much coding for a simple program/script that shortens a string, before it prints it on a screen. If I can't write s[0..80] without fear, then let there at least be a function that does it like Python would.

Why?

> Additionally, as strings are UTF-8-encoded, I'd like such a function to give me proper substrings, without multibyte characters cut in the middle, where s[0..80] would mean 80 characters on the screen and not 80 bytes.

This is a common fallacy when dealing with Unicode. Please see the linked and the following points: http://utf8everywhere.org/#myth.utf32.o1
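To make the point concrete: one character on the screen can consist of several code points, so counting code points does not give "characters on the screen" either. The sketch below assumes a Phobos release that ships `std.uni.byGrapheme`, which postdates this thread; the helper name `graphemeCount` is made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";
    assert(s.length == 3);          // UTF-8 code units
    assert(s.walkLength == 2);      // code points
    assert(graphemeCount(s) == 1);  // what the reader sees
}
```

Even grapheme counts are not display widths (some characters are double-width, some are zero-width), so this is a closer approximation, not an exact answer.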
Dec 29 2012
On 2012-12-29 23:35, Vladimir Panteleev wrote:
> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>> Slices are great but not really what I had expected, coming from Python. I've seen code like s[a..$-b] used without checking the values, just to end up with a Range violation. But there are 3 constraints to check here:
>>
>>     a >= 0 && a + b <= s.length && b >= 0
>>
>> That's way too much coding for a simple program/script that shortens a string, before it prints it on a screen. If I can't write s[0..80] without fear, then let there at least be a function that does it like Python would.
>
> Why?

Probably because I like concise code. I always prefer:

    if (A) print(getMessage().getSlice(0, 100));

to writing something like this:

    auto message = getMessage();
    if (A) print(message.length > 100 ? message[0..100] : message);

>> Additionally, as strings are UTF-8-encoded, I'd like such a function to give me proper substrings, without multibyte characters cut in the middle, where s[0..80] would mean 80 characters on the screen and not 80 bytes.
>
> This is a common fallacy when dealing with Unicode. Please see the linked and the following points: http://utf8everywhere.org/#myth.utf32.o1

True. I didn't think about all the languages out there. Just some common European ones.
Dec 29 2012
On 2012-12-29 23:55, FG wrote:
> Probably because I like concise code. I always prefer:
>
>     if (A) print(getMessage().getSlice(0, 100));
>
> to writing something like this:
>
>     auto message = getMessage();
>     if (A) print(message.length > 100 ? message[0..100] : message);

Actually, when I look at this, it can be a one-liner after all. :)

    if (A) print(getMessage()[0..($>100?100:$)]);

Didn't expect this to work.
Dec 29 2012
FG:
> to writing something like this:
>
>     auto message = getMessage();
>     if (A) print(message.length > 100 ? message[0..100] : message);

In std.algorithm there is min(), that helps a little:

    if (A) print(message[0 .. min($, 100)]);

Bye,
bearophile
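Wrapped in a function, the `min($, n)` idiom looks like the sketch below. The name `firstUnits` is hypothetical; the important caveat is that it counts UTF-8 code units, so it never throws a Range violation but can still cut a multibyte character in half.

```d
import std.algorithm : min;

/// Hypothetical helper: at most `n` leading code units of `s`.
/// Safe against over-long requests, but NOT aware of character boundaries.
string firstUnits(string s, size_t n)
{
    return s[0 .. min($, n)];
}
```

For example, `firstUnits("hi", 100)` returns `"hi"` unchanged, while `firstUnits("stół", 3)` returns three bytes that end in the middle of "ó" and are therefore not valid UTF-8.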
Dec 29 2012
On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
> Forgive me if such a function already exists -- I couldn't find it.

std.range has drop and take, which work on code points, not code units. They also handle over-dropping or over-taking gracefully. For example:

    string s = "okrągły stół";
    writeln(s.drop(8).take(3));     // "stó"
    writeln(s.drop(8).take(100));   // "stół"
    writeln(s.drop(100).take(100)); // ""

http://dpaste.dzfl.pl/2f8ebf49

It doesn't support negative indexing.

Generally speaking though, the vast majority of user code should never need to index into a Unicode string.
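One detail worth knowing: `take` on a string returns a lazy wrapper range, not a `string`. If an actual slice of the original string is wanted, `drop` alone is enough, because `drop` on a string returns a string. The sketch below uses that; the name `codePointSlice` is an assumption for illustration and not a Phobos function.

```d
import std.range : drop;

/// Hypothetical helper: `count` code points of `s`, starting at code point
/// `start`. Returns a slice of `s` (no allocation); out-of-range values clamp.
string codePointSlice(string s, size_t start, size_t count)
{
    auto rest = s.drop(start);     // skip `start` code points
    auto tail = rest.drop(count);  // whatever follows the wanted part
    return rest[0 .. $ - tail.length];
}
```

Both `drop` calls walk the string, so the cost is linear in `start + count`, the same as `drop` followed by `take`.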
Dec 29 2012
Peter Alexander:
> Generally speaking though, the vast majority of user code should never need to index into a Unicode string.

Right, 90% of the code doesn't need to slice strings (and generally strings are Unicode). But the other 90% of the code needs to slice things...

Bye,
bearophile
Dec 29 2012
On 2012-12-30 00:03, Peter Alexander wrote:
> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>> Forgive me if such a function already exists -- I couldn't find it.
>
> std.range has drop and take, which work on code points, not code units. They also handle over-dropping or over-taking gracefully. For example:
>
>     string s = "okrągły stół";
>     writeln(s.drop(8).take(3));     // "stó"
>     writeln(s.drop(8).take(100));   // "stół"
>     writeln(s.drop(100).take(100)); // ""

Ah, so this is the way of doing it. Thanks.

> It doesn't support negative indexing.

At least dropping off the back is also possible, e.g. for s[2..$-5]:

    writeln(s.retro.drop(5).retro.drop(2)); // "rągły"

(or with dropBack, without retro, if available)

I have no idea how to do s[$-4..$-2] though.
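The `dropBack` variant mentioned above can be sketched as follows. `dropBack` and `drop` are real std.range functions; the wrapper name `trimCodePoints` is hypothetical.

```d
import std.range : drop, dropBack;

/// Hypothetical helper: remove `front` code points from the start of `s`
/// and `back` code points from its end. Over-dropping yields an empty string.
string trimCodePoints(string s, size_t front, size_t back)
{
    return s.dropBack(back).drop(front);
}
```

`trimCodePoints("okrągły stół", 2, 5)` gives `"rągły"`, the same result as the `retro` chain, and the return value is a plain `string` slice.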
Dec 29 2012
On Sunday, 30 December 2012 at 00:02:17 UTC, FG wrote:
> On 2012-12-30 00:03, Peter Alexander wrote:
>> On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
>>> Forgive me if such a function already exists -- I couldn't find it.
>>
>> std.range has drop and take, which work on code points, not code units. They also handle over-dropping or over-taking gracefully. For example:
>>
>>     string s = "okrągły stół";
>>     writeln(s.drop(8).take(3));     // "stó"
>>     writeln(s.drop(8).take(100));   // "stół"
>>     writeln(s.drop(100).take(100)); // ""
>
> Ah, so this is the way of doing it. Thanks.
>
>> It doesn't support negative indexing.
>
> At least dropping off the back is also possible s[2..$-5]:
>
>     writeln(s.retro.drop(5).retro.drop(2)); // "rągły"
>
> (or with dropBack, without retro, if available)

dropBack is available IFF retro is available. (AFAIK)

> I have no idea how to do s[$-4..$-2] though.

But as a general rule, making a range out of the first (or last) elements of a non-RA range is a limitation of how ranges can "only shrink". Strings are a special case: a non-RA, non-sliceable range that you can nevertheless index and slice...

Anyways, you can always get creative with length:

    //----
    s = "hello world";
    s[s.dropBack(4).length .. s.dropBack(2).length]; // "or"
    //----

In this particular example, it is a bit suboptimal, but quite frankly, I'd assume readability trumps performance for this kind of code (and is what I'd use in my end code).

One last thing: keep in mind "drop/take" are linear operations. If you are handling Unicode, then everything is linear anyways, so I'm not saying these functions are slow or anything, just don't forget they aren't the O(1) functions you'd get with ASCII.
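The length trick works because `s.dropBack(n)` is a prefix of `s`, so its `.length` is exactly the code-unit offset at which the last `n` code points begin. A small sketch that packages it, with a guard for inverted ranges; the name `sliceFromBack` is hypothetical.

```d
import std.range : dropBack;

/// Hypothetical helper: code-point equivalent of s[$-a .. $-b].
/// Over-large values clamp; an inverted range yields an empty slice.
string sliceFromBack(string s, size_t a, size_t b)
{
    immutable lo = s.dropBack(a).length; // code-unit offset where the slice starts
    immutable hi = s.dropBack(b).length; // code-unit offset where it ends
    return lo < hi ? s[lo .. hi] : s[0 .. 0];
}
```

`sliceFromBack("hello world", 4, 2)` gives `"or"`, and on the thread's Polish example `sliceFromBack("okrągły stół", 4, 2)` gives `"st"`. Each call walks the string from the back twice, which matches the linear-cost remark above.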
Dec 30 2012