          Issue ID: 14458
           Summary: very slow ubyte[] assignment (dmd doesn't use memset)
           Product: D
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P1
         Component: DMD
          Assignee: nobody puremagic.com
          Reporter: code dawg.eu

Tracked down a severe performance issue in my new AA implementation, where it
zeroed a freshly allocated entry.

DMD generates the following code for the assignment.
void zero(ubyte[] ary) { ary[] = 0; }
        mov     rcx, rdi                                ; 0008 _ 48: 89. F9
        xor     rax, rax                                ; 000B _ 48: 31. C0
        mov     rdi, rsi                                ; 000E _ 48: 8B. FE
        rep stosb                                       ; 0011 _ F3: AA

This is a bytewise store of 0 and is about 4x slower than memset for sz >= 4,
where sz is the length of the array in bytes. It's slightly faster for sz < 4.
I'm not sure why `rep stosb` suddenly becomes 4x slower when sz increases from 3
to 4 bytes, but in any case the compiler should either lower the small case to
direct assignments and the big case to a memset call, or always use memset.
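
To illustrate the lowering suggested above, here is a minimal hand-written
sketch in D. The helper name `zeroFill` is hypothetical and is not part of DMD
or druntime, and the cutoff of 4 bytes is simply taken from the measurement
described in this report, not a tuned value. It only shows the shape of the
code the compiler could emit for `ary[] = 0`, not how DMD's backend would
actually implement it.

```d
import core.stdc.string : memset;

// Hypothetical helper (not a DMD or druntime function): zeroes a ubyte
// slice using direct stores for lengths below 4 and memset otherwise.
void zeroFill(ubyte[] ary)
{
    switch (ary.length)
    {
        case 0:
            return;
        case 1:
            ary[0] = 0;
            return;
        case 2:
            ary[0] = 0; ary[1] = 0;
            return;
        case 3:
            ary[0] = 0; ary[1] = 0; ary[2] = 0;
            return;
        default:
            // Big case: defer to the C library's memset.
            memset(ary.ptr, 0, ary.length);
    }
}
```

Calling `zeroFill(buf[0 .. 3])` takes the direct-store path, while
`zeroFill(buf[4 .. 16])` goes through memset. A compiler could just as well
always call memset, which is the other option mentioned above.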

Apr 17 2015