digitalmars.D.learn - Optimization tips for alpha blending / rasterization loop
- Mikko Ronkainen, Nov 21 2013
- Craig Dillabaugh, Nov 21 2013
- Andrea Fontana, Nov 22 2013
- Andrea Fontana, Nov 22 2013
- bearophile, Nov 22 2013
- Craig Dillabaugh, Nov 22 2013
- bearophile, Nov 22 2013
- Mikko Ronkainen, Nov 22 2013
I'm trying to learn some software rasterization stuff. Here's what I'm doing:

32-bit DMD on 64-bit Windows.
The framebuffer is an int[]; each int is a pixel in 0xAABBGGRR format (this seems fastest for my CPU + GPU).
The framebuffer is handed as-is to OpenGL and rendered as a textured quad.

Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a few variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:

class Framebuffer
{
    int[] data;
    int width;
    int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
{
    foreach (i; y .. y + height)
    {
        int start = x + i * framebuffer.width;

        foreach (j; 0 .. width)
        {
            byte* bg = cast(byte*)&framebuffer.data[start + j];
            byte* fg = cast(byte*)&color;

            int alpha = (fg[3] & 0xff) + 1;
            int inverseAlpha = 257 - alpha;

            bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
            bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
            bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
            bg[3] = cast(byte)0xff;
        }
    }
}

I would like to make this as fast as possible, since it is done for almost every pixel every frame. Am I doing something stupid that is slowing things down? Cache thrashing, or even branch prediction errors? :) Is this kind of algorithm + data even a candidate for SIMD usage?

Also, even though fg is of type byte, fg[0] returns values outside the 0..0xff range; it needs to be (fg[0] & 0xff) to make things work. I wonder why?
Nov 21 2013
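For reference, a minimal sketch of why the masking is needed: byte in D is signed, so reading a channel through a byte* promotes it to int with sign extension, and any channel value above 0x7F comes out negative. The values below are just illustrative:

void main()
{
    byte  sb = cast(byte) 0xAA; // same bit pattern as a channel value of 0xAA
    ubyte ub = 0xAA;

    int a = sb;        // sign-extended: a == -86 (0xFFFFFFAA)
    int b = sb & 0xff; // promoted to int first, then masked: b == 170
    int c = ub;        // zero-extended: c == 170, no mask needed

    assert(a == -86 && b == 170 && c == 170);
}

Using ubyte* instead of byte* (as suggested in the replies below) gives the zero-extending behaviour directly and makes the & 0xff masks unnecessary.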
On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen wrote:
> byte* bg = cast(byte*)&framebuffer.data[start + j];
> byte* fg = cast(byte*)&color;
> [...]

Do you want to use a ubyte instead of a byte here?

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range that can be stored in a ubyte.

Craig
Nov 21 2013
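As a quick sanity check of the ranges involved (illustrative worst-case numbers, not taken from the thread): alpha + inverseAlpha is always 257, so the pre-shift sum never exceeds 257 * 255 = 65535 and the blended channel always fits back into a ubyte after the >> 8:

void main()
{
    // worst case: fully opaque foreground (fg[3] == 255) over a 255 background channel
    immutable int alpha    = 255 + 1;     // 256
    immutable int invAlpha = 257 - alpha; // 1
    immutable int blended  = (alpha * 255 + invAlpha * 255) >> 8; // 65535 >> 8
    assert(blended == 255); // still representable in a ubyte
}

The intermediate arithmetic is done in int, which is what Mikko points out in his reply further down.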
On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
> Do you want to use a ubyte instead of a byte here?
> [...]

If I'm right, all of these lines:

byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

are constant, so you can move them outside both foreach loops using an enum. You can also pre-calculate (alpha * (fg[0] & 0xff)) before the foreach.
Nov 22 2013
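In code, the hoisting Andrea describes looks roughly like the sketch below; this is just the shape of the change, and Mikko's follow-up at the end of the thread contains the complete reworked function:

void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
{
    // Everything derived from color is loop-invariant, so compute it once up front.
    immutable ubyte* fg       = cast(immutable ubyte*) &color;
    immutable int    alpha    = fg[3] + 1;
    immutable int    invAlpha = 257 - alpha;
    immutable int    afg0     = alpha * fg[0]; // likewise afg1 and afg2

    foreach (i; y .. y + height)
    {
        // ... only the per-pixel blend remains inside the loops ...
    }
}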
On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
> are constant, so you can move them outside both foreach loops using an enum.

Of course I mean immutable, not enum :)
Nov 22 2013
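The difference matters here: enum declares a manifest constant that must be computable at compile time, while immutable is initialized once at run time and is read-only afterwards. A small illustration (blendSetup is just a hypothetical name for the example):

void blendSetup(uint color)
{
    enum int shift = 8;                       // fine: known at compile time
    // enum ubyte* fg = cast(ubyte*)&color;   // would not compile: &color is a run-time address
    immutable ubyte* fg = cast(immutable ubyte*)&color; // computed once at run time
    immutable uint alpha = fg[3] + 1;
}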
Craig Dillabaugh:
> Do you want to use a ubyte instead of a byte here?

See:
http://d.puremagic.com/issues/show_bug.cgi?id=3850

Bye,
bearophile
Nov 22 2013
On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
> See:
> http://d.puremagic.com/issues/show_bug.cgi?id=3850

Yes, it is pretty easy to mix that up. A lot of my work is with images with single-byte pixels, so I am pretty used to using ubyte now. I can't remember if I ever used byte, and if I did it was likely a bug.

Craig
Nov 22 2013
Craig Dillabaugh:
> I can't remember if I ever used byte, and if I did it was likely a bug.

Vote for the bug if you like it :-)

Bye,
bearophile
Nov 22 2013
> Do you want to use a ubyte instead of a byte here?

Yes, that was a silly mistake. It seems that fixing it removed the need for all the masking operations, which gave the biggest speedup.

> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range that can be stored in a ubyte.

I think my logic should be correct: the calculations are done with ints, and the result is then just cast to a ubyte. The reason for the +1 is the >> 8, which divides by 256.

class Framebuffer
{
    uint[] data;
    uint width;
    uint height;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color)
{
    immutable ubyte* fg = cast(immutable ubyte*)&color;

    immutable uint alpha = fg[3] + 1;
    immutable uint invAlpha = 257 - alpha;

    immutable uint afg0 = alpha * fg[0];
    immutable uint afg1 = alpha * fg[1];
    immutable uint afg2 = alpha * fg[2];

    foreach (i; y .. y + height)
    {
        uint start = x + i * framebuffer.width;

        foreach (j; 0 .. width)
        {
            ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

            bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
            bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
            bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
            bg[3] = 0xff;
        }
    }
}

Can this be made faster with SIMD? (I don't know much about it; maybe the data and algorithm don't fit it?) Can this be parallelized with any real gains?
Nov 22 2013
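On the parallelism question: each row of the rectangle touches a disjoint slice of the framebuffer, so the outer loop can be handed to std.parallelism. A sketch along those lines is below (drawRectangleParallel is just an illustrative name, and it assumes the Framebuffer class from the posts above; whether it actually wins depends on rectangle size, core count and memory bandwidth, since for small rectangles the task overhead can easily dominate):

import std.parallelism : parallel;
import std.range : iota;

void drawRectangleParallel(Framebuffer framebuffer, uint x, uint y,
                           uint width, uint height, uint color)
{
    immutable ubyte* fg = cast(immutable ubyte*)&color;

    immutable uint alpha    = fg[3] + 1;
    immutable uint invAlpha = 257 - alpha;

    immutable uint afg0 = alpha * fg[0];
    immutable uint afg1 = alpha * fg[1];
    immutable uint afg2 = alpha * fg[2];

    // Rows never overlap, so they can be blended independently on the task pool.
    foreach (i; parallel(iota(y, y + height)))
    {
        uint start = x + i * framebuffer.width;

        foreach (j; 0 .. width)
        {
            ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

            bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
            bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
            bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
            bg[3] = 0xff;
        }
    }
}

As for SIMD, the blend is a reasonable candidate in principle (four channels, or several pixels, processed per instruction), but getting a real win usually means core.simd or compiler intrinsics plus careful 8-bit to 16-bit widening, which goes beyond this sketch.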