Rounded corner bitmaps more efficient

lpcware · ‎06-15-2016

Content originally posted in LPCWare by amlwwalker on Tue Oct 29 11:33:50 MST 2013
hi everyone.
LPC1788 here.
I've written some code, which I'm quite proud of, that draws a bitmap to the LCD but rounds the corners by a certain radius. The benefit over bitmaps with the corners already rounded is that you get the background colour in the corners rather than "white" or whatever colour is on the bitmap (see picture attached).

Anyway, here is my code. I just wondered whether anyone can see a more efficient way of doing it, or anything I'm doing more than once unnecessarily, as I can see it rendering the screen (it does the same thing for all 10 images).
What it's doing is going through each pixel of the bitmap and checking which corner it's in.
Then it passes the coordinates of the pixel to a function that checks whether that pixel is within the bounds of the corner (a quarter of the Bresenham circle algorithm). If it is, then it plots that pixel.

Any improvements would be much appreciated.
Alex

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Tue Nov 05 11:32:15 MST 2013
Sorry for not explaining it properly.

Any checking should be done outside the loops.
The above example only draws a filled rectangle in one single color.
It does this by drawing a line from startx to endx.
So startx and endx must be calculated in advance, and you need to make sure they're both in range, before you start the loop.

This eliminates 100 checks when drawing a small 10x10 pixel rectangle.

10000 checks on 100x100 pixels; so you can easily imagine that it'll be faster...

Now, to morph that into a rounded rectangle, you would have to make a modification after you know the line number; e.g. inside the outer loop.
That would involve adding an offset to 'd' and reducing w by two for each pixel of offset.

for(... ; i < h; ...){
d = &a[offset];
w = width - (offset << 1);
if(w && ...)
...

When you need to copy a picture from memory, you only need to add the offset to the source address too. Apart from that, you'd need to make sure you do not read a word or halfword from an odd address.

Reading/writing words from/to addresses divisible by 4 would be faster than reading/writing words from/to addresses not divisible by 4 (e.g. where the remainder is 1, 2 or 3).
Best case is remainder = 0. Worst case is remainder = 1 or 3. 2 is somewhere in between.
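
A minimal sketch of how the line-based idea might look for a bitmap copy rather than a single-colour fill. The name blit_rounded and the inset[] table are hypothetical: inset[i] is assumed to hold the number of pixels to skip at each end of the i-th line counted from the top (and, mirrored, from the bottom), for i = 0 .. r-1. It uses memcpy for each line so alignment is left to the library; it is untested on real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a w x h RGB565 bitmap to the screen, skipping
 * inset[i] pixels at both ends of the i-th line from the top and from the
 * bottom (i = 0 .. r-1). Lines in between are copied in full. */
static void blit_rounded(uint16_t *screen, uint32_t linesize,
                         const uint16_t *bitmap, uint32_t x, uint32_t y,
                         uint32_t w, uint32_t h,
                         const uint16_t *inset, uint32_t r)
{
    uint16_t *a = &screen[y * linesize + x];  /* top/left corner on screen */
    const uint16_t *s = bitmap;               /* start of current source line */

    assert(2 * r <= h);                       /* corners must not overlap */

    for (uint32_t i = 0; i < h; i++) {
        uint32_t offset = 0;                  /* decided once per line */
        if (i < r)
            offset = inset[i];                /* top corners */
        else if (i >= h - r)
            offset = inset[h - 1 - i];        /* bottom corners, mirrored */

        if (w > 2 * offset)                   /* anything left on this line? */
            memcpy(a + offset, s + offset,
                   (w - 2 * offset) * sizeof(uint16_t));

        a += linesize;                        /* next line on screen */
        s += w;                               /* next line in bitmap */
    }
}
```

All decisions happen once per line; the pixels in the corners are simply never written, so the background stays visible there.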

lpcware · ‎06-15-2016

Content originally posted in LPCWare by amlwwalker on Thu Oct 31 13:36:24 MST 2013
That's great, Pacman - I've just implemented the first example.

For the second example - the even more efficient one, where do I put my checks?
For instance you are setting a color in three different places:

*d++ = color;
*wd++ = wcolor;
*(uint16_t *)d = color; /* write the last 16-bit word */

I am drawing a bitmap, so it needs to be the next value from my bitmap array. However, I am also doing checks to see whether to plot that pixel in the colour from the bitmap or in the background colour.
You have two conditions and then the for loop all setting the pixel colour?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Oct 30 21:26:24 MST 2013
It's good you get this from the beginning and you don't have to wait for months or years to find the best possible solution.

I think you'll get the entire picture now.

Instead of ...

for (int i = 0; i < 10; i++){
   for (int j = 0; j < 10; j++){
        drawPixel(i, j, *bitmap++);
   }
}


__inline void drawPixel(uint16_t Xpos, uint16_t Ypos, uint16_t color) {
volatile uint16_t *pLCDbuf = (uint16_t *) LCD_VRAM_BASE_ADDR; /* LCD buffer start address */

pLCDbuf[Ypos * GLCD_X_SIZE + Xpos] = color;
}

...which is quite good, you can do better, way better: move *ALL* calculations outside the loop.

uint16_t *screen;
uint16_t linesize; /* width in pixels of a line */

uint16_t x, y; /* x and y position of box */
uint16_t w, h; /* box width and height */
uint16_t color; /* color of box */
uint16_t *a; /* precalculated address of starting top/left corner */
uint16_t *d; /* destination */

screen = (uint16_t *)LCD_VRAM_BASE_ADDR;  /* this is what I usually call the screen base address */
linesize = GLCD_X_SIZE;

x = 17;
y = 33;

w = 50;  /* width of our box */
h = 10; /* height of our box */

a = &screen[y * linesize + x];  /* get address of the top/left corner of our 'box' on screen */

for (int i = 0; i < h; i++){
    d = a;  /* point d to left side of box */
    for(int j = 0; j < w; j++)
    {
        *d++ = color;
    }
    a += linesize;  /* advance to next line */
}

This should be quite fast. It can be optimized further, but it should show you approximately what the idea is.

So why is the loop faster ?
...Technically speaking, because it will be compiled into assembler code that uses only a single instruction for writing the bytes. Something like this I guess:

loop:
 subs r2,r2,#1 ; decrement width counter, update condition codes (takes 1 clock cycle)
 strh r0,[r1],#2 ; store color (takes one clock cycle)
 bne loop ; go round loop (2 clock cycles)

In the following code, we optimize a little further. We'll write almost twice as fast...

uint32_t wcolor;
uint32_t *wd;
uint16_t ww; /* number of 32-bit words (pixel pairs) on a line */
uint16_t n;  /* pixels left to write on the current line */

wcolor = ((uint32_t)color << 16) | color;  /* two pixels in one 32-bit word */

for (int i = 0; i < h; i++){
    d = a;  /* point d to left side of box */
    n = w;
    if(n && (2 & (uint32_t) d)) /* if d is not on a 32-bit boundary */
    {
        *d++ = color;  /* write one pixel to get aligned */
        n--;
    }
    wd = (uint32_t *)d;
    ww = n >> 1;  /* remaining width / 2 */
    for(int j = 0; j < ww; j++)
    {
        *wd++ = wcolor;
    }
    if(n & 1) /* if one pixel is left over... */
    {
        *(uint16_t *)wd = color; /* write the last 16-bit word */
    }
    a += linesize;  /* advance to next line */
}

The new loop basically looks like this in assembler:

loop:
 subs r2,r2,#1 ; decrement width counter, update condition codes (1 clk)
 str r0,[r1],#4 ; store two colors (1 clk)
 bne loop ; go round loop (2 clk)

...It can be optimized even further. You could increase your 4-byte blocks to, for instance, 16-byte blocks, so the assembler code for the loop would look like the following...

loop:
 subs r2,r2,#1 ; decrement width counter and set condition code (1 clk)
 str r0,[r1,#4] ; store two colors at address d + 4 (1 clk)
 str r0,[r1,#8] ; store two colors at address d + 8 (1 clk)
 str r0,[r1,#12] ; store two colors at address d + 12 (1 clk)
 str r0,[r1],#16 ; store two colors at address d, then d = d + 16 (1 clk)
 bne loop ; go round loop (2 clk)

You see... we'll get much more interesting stuff done per clock cycle now.

This is all 'standard stuff', which you'll have to know.
When you're confident and have written a routine, which is portable, you can consider using the DMA for copying your rectangle. It'll be even faster than using the CPU (use memory-to-memory copy, no sync, low priority; eg. DMA channel 7).

lpcware · ‎06-15-2016

Content originally posted in LPCWare by amlwwalker on Wed Oct 30 09:56:22 MST 2013
It's quite hard to tell where your code fits into mine, but I think I understand.

At the moment for each pixel I am calling the routine to draw the pixel.

So if it was a bitmap, 10 pixels, by 10 pixels, I am doing something like:

for (int i = 0; i < 10; i++){
   for (int j = 0; j < 10; j++){
        drawPixel(i, j, *bitmap++);
   }
}


__inline void drawPixel(uint16_t Xpos, uint16_t Ypos, uint16_t color) {
volatile uint16_t *pLCDbuf = (uint16_t *) LCD_VRAM_BASE_ADDR; /* LCD buffer start address */

pLCDbuf[Ypos * GLCD_X_SIZE + Xpos] = color;
}

but what you are saying (and my version is much simpler than yours, I haven't quite got my head around yours yet) is that I could do:
uint32_t line[10];

for (int i = 0; i < 10; i++){
   for (int j = 0; j < 10; j++){
        line[j] = bitmap++;
   }
//and then somehow write the whole line to pLCDbuf[]???
}

Actually no, I'm confused.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Oct 30 02:37:57 MST 2013
Here's some pseudo-code example of a way that's probably slightly faster (note: untested code, probably a lot of bugs in it).

The code is not intended to work, but it's intended for you to see how you should think of the rectangle as a bunch of lines with a starting point and an ending point, rather than a box of pixels.

You can use the very same principles for a circle, an oval, a triangle and a few more polygon types (but not for those polygons that have 'holes' in them, and not if there is more than one starting/ending point set).

void rbox(void *picture, int32_t dstx, int32_t dsty, int32_t width, int32_t height, int32_t radius)
{
  int32_t dx, sx, ex;
  int32_t save_x;
  int32_t srcx, srcy;
  ...

  for(srcy = 0; srcy < height; srcy++)
  {
    dx = calcx(radius, srcy, height);
    sx = dx;
    ex = width - dx;

    save_x = dstx;
    dstx += sx;
    for(srcx = sx; srcx < ex; srcx++)
    {
      copypixel(picture,srcx,srcy,dstx++,dsty);
    }
    dstx = save_x;
    dsty++;
  }
}

The calcx function should calculate the starting x position for the line specified. It can do that, since we know the radius, the y position and the height of the rectangle.

The inner loop should be as fast as possible, doing as little as possible; eg. all calculations, if's, etc should be outside the loop.
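
One possible calcx, as a sketch only. It assumes the corner occupies radius lines at the top and bottom, measures dy from the innermost corner line, and uses a small helper isqrt (a hypothetical name) for the integer square root. Other pixel conventions would give slightly different insets.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root: largest s with s * s <= n. Fine for small radii. */
static int32_t isqrt(int32_t n)
{
    int32_t s = 0;
    while ((s + 1) * (s + 1) <= n)
        s++;
    return s;
}

/* Number of pixels to skip at each end of line srcy (0 = top line). */
int32_t calcx(int32_t radius, int32_t srcy, int32_t height)
{
    int32_t dy;  /* vertical distance from the innermost corner line */

    assert(radius >= 0 && 2 * radius <= height);

    if (srcy < radius)
        dy = radius - 1 - srcy;            /* top corners */
    else if (srcy >= height - radius)
        dy = srcy - (height - radius);     /* bottom corners */
    else
        return 0;                          /* middle lines are full width */

    /* Pythagoras: half-width of the circle at distance dy from its centre */
    return radius - isqrt(radius * radius - dy * dy);
}
```

It is called once per line, so the multiplications and the square root cost very little compared with a check for every pixel.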

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Oct 30 02:37:31 MST 2013
Whoops. Sorry, I was not reading your code properly.

I now copied it to my text-editor to get a better formatting.
Indeed, it seems that 'bitmap' is the screen base address.

...But I think you get the idea. Instead of checking each pixel, calculate the starting pixel and the ending pixel for every line.
Then you just need to calculate the starting address you need to write to and copy <length> pixels from source to destination.

...Remember that you can copy 32-bit words as soon as you have more than 1 16-bit word:

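const uint32_t *ws;
uint32_t *wd;
const uint16_t *hs = s;  /* source, seen as 16-bit pixels */
uint16_t *hd = d;        /* destination, seen as 16-bit pixels */

if(length & 1)  /* odd number of pixels: copy one 16-bit word first */
{
  *hd++ = *hs++;
}
length = length >> 1;  /* number of 32-bit words left to copy */
wd = (uint32_t *)hd;
ws = (const uint32_t *)hs;
while(length--)
{
  *wd++ = *ws++;  /* copy two pixels at a time */
}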
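
The snippet above does not look at where the pointers are aligned. A sketch of a copy routine that does, assuming a GCC or Clang toolchain (the may_alias attribute is compiler-specific and is only there to keep the 16-bit to 32-bit pointer cast well defined); copy_pixels and word_t are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang-specific: a 32-bit type that may alias the 16-bit pixels. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Copy 'length' RGB565 pixels from s to d, two at a time where possible. */
static void copy_pixels(uint16_t *d, const uint16_t *s, size_t length)
{
    assert(d != NULL && s != NULL);

    /* 32-bit copies only work out if source and destination sit at the
     * same position within a 32-bit word; otherwise copy 16-bit words. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 2) == 0) {
        if (length && ((uintptr_t)d & 2)) {  /* one pixel to get aligned */
            *d++ = *s++;
            length--;
        }
        word_t *wd = (word_t *)d;
        const word_t *ws = (const word_t *)s;
        for (size_t n = length >> 1; n; n--)
            *wd++ = *ws++;                   /* two pixels per store */
        d = (uint16_t *)wd;
        s = (const uint16_t *)ws;
        length &= 1;                         /* at most one pixel left */
    }
    while (length--)                         /* leftover or unaligned case */
        *d++ = *s++;
}
```

The checks run once per call, so they stay outside the copying loops.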

lpcware · ‎06-15-2016

Content originally posted in LPCWare by amlwwalker on Wed Oct 30 01:51:34 MST 2013
Hi Pacman,
Yeah, we talked about this, didn't we, when we were discussing fonts...
I didn't quite get it then, so I haven't reimplemented my "draw_pixel" routine.
I understand that that will be one of the places I can make the most difference though.
I can currently draw a square bitmap to the screen. I'm reading the bitmap data out of an array through a pointer.
I'm not sure what you mean by "screen base address", but I probably already have that as I'm writing to the screen?
Width/Height, yeah.
For my bitmaps I am only supporting 16bpp RGB 565. For the 32-bit copy, you mean to read two pixels' worth of data at a time and write them to the screen at the same time?
So I need to change my pixel writing routine to one that uses pointers to the data, not values.
And once I have a reference I should read 32 bits out at a time, not 16.
I'm not quite sure how I'm going to do that, especially if I have to do checks on both pixels like in my algorithm above?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Oct 30 01:42:43 MST 2013
If you want to see real speed, you need to get rid of all the setpixel stuff and start using pointers. :)

You can probably imagine that copying 10 pixels like this...

while(s < e)
{
*d++ = *s++;
}

is way faster than using a function call to read a pixel and then using another function call to write the pixel, repeated 10 times.
Imagine how many times these functions have to calculate addresses, check bounds, go to the grocery store, read the newspapers and have a cup of hot chocolate. ;)

Copying a rectangular bitmap is probably where you want to start, when you explore faster ways.

Find the base screen address, write some values there and see that it changes.
From that point on, it's just your imagination that stops you from writing real fast graphics-routines. :)

You need...
1: Screen base address
2: Width of screen in pixels
3: Height of screen in pixels
4: Size of a line in bytes.
5: Size of a pixel in bytes.
6: Bits per pixel

You don't necessarily need the last two, in case you only want to support a certain bit depth.
Note: If the image you are copying is 16 bits per pixel, you can use a 32-bit copy; this will speed things up to be almost twice as fast.
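
As a rough sketch, the items in the list could be kept together in one small structure for a fixed 16 bpp screen. The type screen_t, its field names and pixel_addr are made up for illustration and do not come from any NXP library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a 16 bpp screen (items 1 to 4 in the list). */
typedef struct {
    uint16_t *base;       /* 1: screen base address */
    uint32_t  width;      /* 2: width of screen in pixels */
    uint32_t  height;     /* 3: height of screen in pixels */
    uint32_t  linebytes;  /* 4: size of a line in bytes */
} screen_t;

/* Address of pixel (x, y), or NULL if it is off-screen. Meant to be called
 * once per line or per box, not once per pixel. */
static uint16_t *pixel_addr(const screen_t *scr, uint32_t x, uint32_t y)
{
    assert(scr != NULL);
    if (x >= scr->width || y >= scr->height)
        return NULL;
    return (uint16_t *)((uint8_t *)scr->base + y * scr->linebytes) + x;
}
```

Writing a value through pixel_addr(&scr, 0, 0) and watching the top/left pixel change is a quick way to confirm the base address.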

lpcware · ‎06-15-2016

Content originally posted in LPCWare by amlwwalker on Wed Oct 30 00:18:50 MST 2013
Wow! Whoops, I forgot the code!
Sorry, here it is:

Bool GLCD_checkBresenhamCorner(uint16_t h, uint16_t k, uint16_t r,
        uint16_t which, uint16_t xC, uint16_t yC) {

    int x = 0;
    int y = r;
    int p = (3 - (2 * r));
    do {
        switch (which) {
        case 1: { //Testing if its outside the top left corner
            if (xC <= h - x && yC <= k - y) {
                return 0;
            } else if (xC <= h - y && yC <= k - x) {
                return 0;
            }
            break;
        }
        case 2: { //Testing if its outside the top right corner
            if (xC >= h + y && yC <= k - x) {
                return 0;
            } else if (xC >= h + x && yC <= k - y) {
                return 0;
            }
            break;
        }
        case 3: { //Testing if its outside the bottom right corner
            if (xC >= h + x && yC >= k + y) {
                return 0;
            } else if (xC >= h + y && yC >= k + x) {
                return 0;
            }
            break;
        }
        case 4: { //Testing if its outside the bottom left corner
            if (xC <= h - y && yC >= k + x) {
                return 0;
            } else if (xC <= h - x && yC >= k + y) {
                return 0;
            }
            break;
        }
        }

        x++;

        if (p < 0)
            p += ((4 * x) + 6);

        else {
            y--;
            p += ((4 * (x - y)) + 10);
        }
    } while (x <= y);
    return 1;
}

void GLCD_displayBitmapInCirle(uint16_t x, uint16_t y, uint16_t w, uint16_t h,
        uint16_t r, uint16_t * bitmap) {
    uint16_t atx = 0, aty = 0;
    uint16_t j = 0;
    uint16_t k = 0;
    uint16_t index = 0;

    for (k = 0; k < h; k++) {
        for (j = 0; j < w; j++) {
            atx = x + j;
            aty = y + k;
            if (atx <= x + r && aty <= y + r) { //is it in the top left corner
                if (GLCD_checkBresenhamCorner(x + r, y + r, r, 1, atx, aty)
                        == 1) {
                    GLCD_SetPixel_16bpp(atx, aty, bitmap[index]);
                }
            } else if (atx >= x + w - r && aty <= y + r) { //is it in the top right corner
                if (GLCD_checkBresenhamCorner(x + w - r, y + r, r, 2, atx, aty)
                        == 1) {
                    GLCD_SetPixel_16bpp(atx, aty, bitmap[index]);
                }
            } else if (atx >= x + w - r && aty >= y + h - r) { //is it in the bottom right corner
                if (GLCD_checkBresenhamCorner(x + w - r, y + h - r, r, 3, atx,
                        aty) == 1) {
                    GLCD_SetPixel_16bpp(atx, aty, bitmap[index]);
                }
            } else if (atx <= x + r && aty >= y + h - r) { //is it in the bottom left corner
                if (GLCD_checkBresenhamCorner(x + r, y + h - r, r, 4, atx, aty)
                        == 1) {
                    GLCD_SetPixel_16bpp(atx, aty, bitmap[index]);
                }
            } else { //its not in a corner so draw it
                GLCD_SetPixel_16bpp(atx, aty, bitmap[index]);
            }
            index++;
        }
    }

    x++;
}

lpcware · ‎06-15-2016

Content originally posted in LPCWare by MarcVonWindscooting on Tue Oct 29 13:56:15 MST 2013

Quote: amlwwalker
hi everyone.
LPC1788 here.
I've written some code which Im quite proud of that draws a bitmap to the lcd but rounds the corners by a certain radius.
...
Any improvements would be much appreciated.
Alex

Being proud is a good thing, as it keeps programmers from hiding their achievements!
If you really need improvement, I'd divide the area into a rectangular region free of circles and copy that without checking.
Also, the remaining parts involve somehow the same calculations 4 times. Maybe split the remaining parts into 4 rectangles and 4 quarter-circles. Adding/subtracting is less expensive than multiplication on many CPUs. Checking inside/outside must be done for one of the 4 quarter-circles only...
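
A sketch of the "one quarter-circle only" idea: walk one octant of the circle with the midpoint (Bresenham) algorithm, using only additions and shifts, and store how far each corner line is inset. The function name corner_insets and the table layout (r + 1 entries, entry 0 being the outermost line) are assumptions for illustration. The same table then serves all four corners by mirroring, and can be reused for every image with the same radius.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill inset[0..r]: inset[i] is the number of pixels to skip at each end
 * of the i-th line from the top (or, mirrored, from the bottom). */
static void corner_insets(uint16_t *inset, uint16_t r)
{
    int32_t x = 0;
    int32_t y = r;
    int32_t d = 3 - 2 * (int32_t)r;      /* decision variable */

    assert(inset != NULL);

    for (uint16_t i = 0; i <= r; i++)
        inset[i] = r;                    /* start fully inset, shrink below */

    while (x <= y) {
        /* Circle point (x, y): the line at distance y from the centre
         * reaches x pixels sideways; the mirrored point (y, x) gives the
         * line at distance x a reach of y. Keep the smallest inset. */
        if (r - x < inset[r - y])
            inset[r - y] = (uint16_t)(r - x);
        if (r - y < inset[r - x])
            inset[r - x] = (uint16_t)(r - y);

        if (d < 0) {
            d += (x << 2) + 6;
        } else {
            d += ((x - y) << 2) + 10;
            y--;
        }
        x++;
    }
}
```

With the table filled once, drawing a line of the bitmap only needs a table lookup to find where it starts and ends.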

EDIT:
Sorry, the 'remaining' rectangles are only 2, not 4.
Also, where is your code?

LPC17xx