Checksum Method for MC9S12ZVCA(192)


Contributor II

Hello,

During development I need to calculate a checksum over a fixed area of flash. I use the method below to read 4 bytes at a time directly from flash and sum them:

getFlash.twobytes = *((bl_u32_t *)((bl_u32_t)pu32FloatingPointer + (u32LocalIndex << 2)));

But I see two issues:

1. It costs a lot of time, more than 20 ms. We want to decrease this.

2. Sometimes the result is not correct; it seems one byte cannot be read out from flash, so the sum is wrong (I have not found the root cause while debugging).

Could you please suggest another method to reach the goal: calculate a checksum or CRC and compare it with a result saved earlier in a known location?
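For reference, a minimal host-testable sketch of this word-summing approach (names and the caller-supplied buffer are illustrative; on the target, the pointer would be the fixed flash start address):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an area as 32-bit words. On target, `area` would point at the
 * fixed flash range; here it can be any 4-byte-aligned buffer. */
static uint32_t sum32(const uint32_t *area, size_t len_bytes)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len_bytes / 4u; i++) {
        sum += area[i];   /* one 32-bit word per iteration */
    }
    return sum;
}
```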

Thank you very much.


Senior Contributor IV
  • 2. Sometimes the result is not correct; it seems one byte cannot be read out from flash, so the sum is wrong (I have not found the root cause while debugging).

Do you calculate your sum from the application image? Perhaps you didn't take into account the security byte, which may be changed by your flash programmer. If you did take it into account, you should be aware of the S12(X) vs. S12Z difference. On the older S12(X), after mass erase and unsecure the flash state was all ones, even for the security byte; later, software like P&E Unsecure could program the security byte to 0xFE (security off) to ease debugging. On the S12Z, after the unsecure command the flash security byte is already 0xFE; this is done by the S12Z itself, not by software running on the PC. In any case, I think you should exclude the security byte from your checksum area.

  • 1. It costs a lot of time, more than 20 ms. We want to decrease this.

How many ms is acceptable? Do you want a fast startup to the functional state? What about starting up as usual and running the checksum check in the background? That would allow you to use a slower but proper CRC instead of a semi-working sum check.
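A background check like that could keep the CRC state across main-loop passes and feed it one small chunk per pass. A sketch with CRC-16/CCITT (bitwise, table-free; names are illustrative, and chunked updates give the same result as a single pass):

```c
#include <stddef.h>
#include <stdint.h>

/* Update a CRC-16/CCITT (poly 0x1021) with one chunk of data.
 * Call repeatedly from the main loop with small chunks so the
 * check runs in the background instead of delaying startup. */
static uint16_t crc16_update(uint16_t crc, const uint8_t *data, size_t len)
{
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Starting from 0xFFFF and feeding the whole range, in one call or many, yields the same final CRC to compare against the stored value.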

192 KB / 4 = 48K dwords in 20 ms at a 32 MHz bus clock is about 13 bus cycles per dword. An optimized loop should look like this:

L1:  ADD  D6, (X+)
     CMP  X, #top_of_flash
     BLO  L1

This should take about 5 bus cycles per dword, and thus still about 8 ms for the whole flash array. Reducing loop overhead by summing several dwords per iteration will reduce the effective cycles per dword to 3+, still about 5 ms for the whole array, if not more.

L1:  ADD  D6, (X+)
     ADD  D6, (X+)
     ADD  D6, (X+)
     ADD  D6, (X+)
     CMP  X, #top_of_flash
     BLO  L1

Edward

Contributor II

Hello Edward

Thank you for sharing. Actually I have already implemented CRC32 with eDMA on the S32K144 with NXP support:

Pflash completeness check   

But there is no CRC32 module and no eDMA in the MC9S12ZVCA, so I am testing a checksum now.

I calculate two areas:

Area1:

#define PP4G_APPLICATION_ADDRESS_START (0x00FD0408u)
#define PP4G_APPLICATION_ADDRESS_END (0x00FF6DDFu)

Area2:

#define PP4G_CALIBRATION_ADDRESS_START (0x00FF7000u)
#define PP4G_CALIBRATION_ADDRESS_END (0x00FF77DFu)

[attached image: pastedImage_2.png]

It seems I need 75 ms for this calculation. I will update the clock setting today per your suggestion.

By the way, is my C method equivalent to your assembly? C is more convenient for my code at this point.

Thanks

Charles

Senior Contributor IV

Hello Charles,

Hmm, didn't older CodeWarrior for MCUs have an option to erase/program only the used flash/EEPROM areas? I remember I had to use something like that in the CW debugger settings to recover from an S12Z machine exception on an EEPROM ECC fault. I can't find such an option in CW for MCUs 11.1. But if you had it enabled, it could also cause problems verifying the checksum: you calculate the checksum assuming all unused bytes are 0xFF, so if the program gets shorter and not all previously used sectors are erased, the checksum check will fail.

My asm pseudocode uses a 32-bit checksum; it's easy to adapt to 16 bits, though a bit longer.

Yes, C is always convenient and often as effective as assembler, provided your algorithm is efficient and CPU-friendly, and your C code leaves the optimizer little or nothing left to do.

The first line of your loop body 1) shifts the index left, 2) adds the result to the pointer, 3) dereferences the pointer, and 4) stores the flash dword to a file- or global-scope variable. Then there are 5, 6) two adds. A global-scope variable reduces the optimizer's freedom to eliminate that store and reuse values in CPU registers instead; a local variable on the stack should be better, I think. Then the for loop 7) increments the index, 8) compares it to the limit, and 9) branches to the start of the loop body. The optimizer may group and eliminate some of these steps, but not all. My code 1) dereferences the pointer, adds the data to the checksum, and advances the pointer to the next location in one step (for a 16-bit checksum there would be an identical step 2), then 3) compares the pointer to the limit and 4) branches to the start. Quite a bit faster, isn't it?
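The same shape in C, which an optimizer can map onto a dereference-and-advance loop (a sketch; names and the word-count interface are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Local pointer + local accumulator: the compiler can keep both in
 * registers and emit one read-add-advance step per iteration,
 * comparing the pointer against the end address instead of an index. */
static uint16_t sum16(const uint16_t *p, size_t n_words)
{
    const uint16_t *top = p + n_words;
    uint16_t sum = 0;

    while (p < top) {
        sum = (uint16_t)(sum + *p++);   /* read, add, advance in one step */
    }
    return sum;
}
```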

Here's a routine for you to calculate a 16-bit sum:

__asm unsigned short cksum(unsigned long start, unsigned long top)
{
    // Load start into the X pointer register;
    // the first 32-bit argument is passed in D6.
    TFR  D6, X
    // Load Y with the end address.
    // top is assumed to be word aligned:
    // start..top of 0x100..0x110 returns the same result as 0x100..0x10F.
    // +1 because Y is a 24-bit pointer, not 32 bits like top.
    LD   Y, top+1
    CLR  D2           // initialize 16-bit checksum = 0
L1: ADD  D2, (X+)     // add and advance pointer
    // ... ADD  D2, (X+)
    CMP  X, Y         // compare with top
    BLO  L1           // keep looping while X < Y
    // return value is already in D2
}

You may also try uncommenting the second ADD, provided start and top differ by a multiple of 4. You could also replace the two ADDs with a 32-bit fetch, a shift, and two adds to check whether 32-bit access is faster. Try replacing the code from L1 down with this:

L1: LD   D6, (X+)      // 32-bit fetch
    ADD  D2, D6        // add low word
    LSR  D6, D6, #16   // get high word
    ADD  D2, D6        // add high word
    CMP  X, Y          // compare with top
    BLO  L1            // keep looping while X < Y
    // return value is already in D2
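A host-side C model of this 32-bit-fetch variant (illustrative names; it also shows why the result matches the halfword-at-a-time loop, since a 16-bit modular sum is order-independent):

```c
#include <stddef.h>
#include <stdint.h>

/* Fetch one dword per iteration, then fold its low and high
 * halfwords into a 16-bit checksum. */
static uint16_t sum16_dwords(const uint32_t *p, size_t n_dwords)
{
    uint16_t sum = 0;
    size_t i;

    for (i = 0; i < n_dwords; i++) {
        uint32_t d = p[i];                            /* 32-bit fetch */
        sum = (uint16_t)(sum + (uint16_t)d);          /* add low word  */
        sum = (uint16_t)(sum + (uint16_t)(d >> 16));  /* add high word */
    }
    return sum;
}
```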

Edward

Senior Contributor IV

I want to correct myself about execution times. First of all, I missed that when the branch is taken, which happens every time the loop continues, an instruction-queue refill takes place and eats an additional 1-2 bus cycles: 1 in the case where the L1 label's address is dword aligned, 2 in the other 3 cases. Another fact I missed is the default flash wait-state setting. It is enabled by default and documented as 1 additional cycle, but wait states on vs. off seems to add about 1.5 bus cycles to a single loop iteration with the 32-bit fetch.

Real benchmarking: with L1 dword aligned and wait states switched off, the 32-bit fetch + add + shift + add takes 10 bus cycles per loop iteration (single dword). Wait state enabled: +1 bus cycle; L1 misaligned: +1 bus cycle.

The two-16-bit-adds variant, L1 aligned, no wait states: 9.5 bus cycles. Wait states enabled: 11 bus cycles (I wonder why +1.5 bus cycles instead of +2). L1 misalignment should add one more bus cycle. So the best you can get is 11 × (192 × 1024 / 4) / 32 MHz ≈ 17 ms. Wait states should be enabled above 25 MHz, so there is practically no difference between 32 MHz with wait states and 25 MHz without them.
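The estimate above is just cycles per dword × number of dwords / bus frequency; a tiny helper (illustrative, using Edward's measured cycle counts) reproduces the ~17 ms figure:

```c
#include <stdint.h>

/* Estimated checksum time in milliseconds for a flash array
 * summed one dword per loop iteration. */
static double cksum_time_ms(double cycles_per_dword,
                            uint32_t flash_bytes, double fbus_hz)
{
    double dwords = (double)flash_bytes / 4.0;
    return cycles_per_dword * dwords / fbus_hz * 1000.0;
}
```

With 11 cycles per dword, 192 KB of flash, and a 32 MHz bus this evaluates to about 16.9 ms.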

Edward

Contributor II

Hello Edward

Thank you for the detailed information. I have no time to test today and will try tomorrow. Sorry, I still need to ask about two more details to help my quick test:

1. How do I configure my MCU to 25 MHz with the internal oscillator?

2. How do I disable the watchdog?
Thank you again.

Charles

Senior Contributor IV

Hello Charles,

1. Well, if it's too hard for you to configure the PLL by hand, then perhaps try using the CW Processor Expert to configure it.

2. The COP watchdog is disabled by default. See the CPMUCOP register.
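For question 1, a register-level sketch of the PLL setup (assumption-heavy: the register names and bit values below are from the S12 MagniV CPMU as generally documented, not verified against this exact derivative; check every value against the MC9S12ZVC reference manual before use). With the internal 1 MHz IRC as the PLL reference, fVCO = 2 × fREF × (SYNDIV + 1) and fBUS = fPLL / 2:

```c
/* Sketch only -- verify against the MC9S12ZVC(A) reference manual.
 * Goal: fBUS = 25 MHz from the internal 1 MHz IRC reference.
 * fVCO = 2 * 1 MHz * (24 + 1) = 50 MHz,
 * fPLL = fVCO / (POSTDIV + 1) = 50 MHz, fBUS = fPLL / 2 = 25 MHz. */
void clock_init_25mhz(void)
{
    CPMUREFDIV  = 0x00;        /* REFDIV = 0, REFFRQ = 00 (1..2 MHz ref) */
    CPMUSYNR    = 0x40 | 24;   /* VCOFRQ = 01 (48..50 MHz), SYNDIV = 24  */
    CPMUPOSTDIV = 0x00;        /* POSTDIV = 0                            */
    while ((CPMUIFLG & 0x08) == 0) {
        /* wait for the PLL LOCK flag before trusting the new clock */
    }
}
```

For question 2, since the COP is off after reset and CPMUCOP is write-once, it is usually enough to make sure your startup code never enables it.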

Edward
