mcf 5282 access error operand read code warrior 6.4

davidteo · ‎08-11-2021

I am using a MCF5282, with code warrior 6.4.

We have an exception logging mechanism which captured "access error, operand read" in 2 different software versions. So far this has only occurred on some units, and it is not easily reproducible even on the same machine.

Based on the MAP file, both logged addresses point to the same offset in the same function of the CodeWarrior library.

Unit #1:

the logged exception PC address &f006e6da points to:
F006E6B8 0000009C .text Block_link (C_TRK_4i_CF_MSL.a alloc.o )
F006E754 00000060 .text Block_unlink (C_TRK_4i_CF_MSL.a alloc.o )

Unit #2:

the logged exception PC address &f006f96a points to:
F006F948 0000009C .text Block_link (C_TRK_4i_CF_MSL.a alloc.o )
F006F9E4 00000060 .text Block_unlink (C_TRK_4i_CF_MSL.a alloc.o )

- In both cases, I do not have the status register values available.

- While I understand that the logged PC address may not necessarily be correct at the time of the exception, these two cases are strikingly similar.

- Both units point to offset 022H in Block_link().
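
The offset can be checked mechanically from the map entries quoted above. This is only a minimal sketch: `map_entry` and `fault_offset` are hypothetical names made up for illustration, not part of CodeWarrior or the logging code.

```c
#include <stdint.h>

/* One line of the linker map: start address and size of a function. */
struct map_entry {
    uint32_t start;
    uint32_t size;
};

/* Returns the offset of pc inside the function, or -1 if pc lies outside it. */
static int32_t fault_offset(uint32_t pc, struct map_entry fn)
{
    if (pc < fn.start || pc - fn.start >= fn.size)
        return -1;
    return (int32_t)(pc - fn.start);
}
```

Feeding it the two logged PCs (F006E6DA and F006F96A) with the Block_link entries (start F006E6B8 and F006F948, size 9C) gives 0x22 in both cases.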

I managed to disassemble "C_TRK_4i_CF_MSL.a": (asm snippet below)

>>>

Address ObjectCode Label Opcode Operands Comment

0x00000000 _Block_link:
; Block_link:
0x00000000 0x4FEFFFEC lea -20(a7),a7
0x00000004 0x48EF50800008 movem.l d7/a4-a4/a6,8(a7)
0x0000000A 0x206F001C movea.l 28(a7),a0
0x0000000E 0x7EF8 moveq #-8,d7
0x00000010 0x2410 move.l (a0),d2
0x00000012 0x70FD moveq #-3,d0
0x00000014 0x2207 move.l d7,d1
0x00000016 0xC082 and.l d2,d0
0x00000018 0xC282 and.l d2,d1
0x0000001A 0x2080 move.l d0,(a0)
0x0000001C 0x2241 movea.l d1,a1
0x0000001E 0xD3C8 adda.l a0,a1
0x00000020 0x70FB moveq #-5,d0
0x00000022 0xC191 and.l d0,(a1)
0x00000024 0x286F0018 movea.l 24(a7),a4
0x00000028 0x2341FFFC move.l d1,-4(a1)
0x0000002C 0xCEAC000C and.l 12(a4),d7
0x00000030 0x4DF478FC lea (-4,a4,d7.l),a6
...

<<<

1) Can someone explain how/which instruction(s) can result in "access error operand read"?

2) If this function can be a possible root cause, may I know which user code statements can reach this library function?
Ie. My assumption is "new array[size];", or "delete ptr;", etc?

3) At time of writing, I have also read posts, which recommend to check the hardware errata for "Internal Flash Speculation Address Qualification Incomplete", which I am unable to verify the hw cpu version yet.

Thanks.

TomE · ‎08-13-2021

> C_TRK_4i_CF_MSL.a alloc.o

So that looks like it's a memory allocator. "Malloc" or "new" or a relative of one of those functions.

I'd say the first thing to check is that it might have run out of heap memory. Either run out properly, or the heap got fragmented. Very few programmers check to see if an allocation failed and very few check how their code handles that failure. The code usually blunders along with null pointers until it goes to free a null pointer and then the library blows up and crashes.

It is a really good idea to set up your system memory map so there's NO memory at zero. That means not putting the 64k SRAM at zero and also not putting the 512k CFM (Flash) at zero either. Likewise the DRAM memory block (if used). That way the first time something generates a null pointer it'll get an instant trap then, rather than blundering on and failing somewhere else, or corrupting something else.

> Can someone explain how/which instruction(s) can result in "access error operand read"?

Firstly that's an "Access Error" Exception. That can happen for one of 15 different reasons in the ColdFire architecture. For the MCF52, only 4 can happen: Instruction Fetch, Read, Write and Write to Write Protected. You've got a simple memory read failure.

You have to look at the CFPRM.pdf document referenced at the top of the "ColdFire Core" section in the Reference manual.

Any instruction that can read memory can cause that exception. So pretty much every instruction except for the register-only operations. In your case it looks like the following "and" instruction failed because Register A1 ended up pointing to somewhere invalid.

> 0x00000022 0xC191 and.l d0,(a1)

By "invalid" I mean there's a 4G address space and about 0.014% of that is occupied by the SRAM and Flash and module registers. A1 was pointing "somewhere else", maybe to zero.
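
As a rough check of that percentage, here is a small sketch that only counts the 64k SRAM and 512k Flash mentioned earlier (the module register space is left out, so the result is a lower bound). The function name is made up for illustration.

```c
#include <stdint.h>

#define KIB 1024.0

/* Percentage of the 32-bit (4 GiB) address space covered by `bytes` of memory. */
static double percent_of_4g(double bytes)
{
    return 100.0 * bytes / (4.0 * 1024.0 * 1024.0 * KIB);
}
```

`percent_of_4g(64 * KIB + 512 * KIB)` comes out at a little over 0.0137, which is in line with the "about 0.014%" figure.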

You should enhance your "Exception Logging" system to record all of the CPU registers and the relevant parts of the exception stack frame. You need the "Exception Frame Format/Vector Word" as the "FS" bits in there tell you what went wrong. That's the first part. Except you know it was a "read error", so you must be decoding that part already.
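
A sketch of how a logger might decode the first longword of the exception stack frame. The bit layout (Format in bits 31..28, FS[3:2] in 27..26, Vector in 25..18, FS[1:0] in 17..16, SR in 15..0) and the four FS codes are written from memory of the ColdFire Programmer's Reference Manual, so treat them as assumptions and check them against CFPRM before relying on them. `frame_info`, `decode_frame` and `fs_name` are hypothetical names.

```c
#include <stdint.h>

/* Decoded fields of the first longword of a ColdFire exception stack frame. */
struct frame_info {
    unsigned format;   /* bits 31..28 */
    unsigned fs;       /* fault status: FS[3:2] from bits 27..26, FS[1:0] from bits 17..16 */
    unsigned vector;   /* bits 25..18 */
    uint16_t sr;       /* bits 15..0: status register at the time of the exception */
};

static struct frame_info decode_frame(uint32_t w)
{
    struct frame_info f;
    f.format = (w >> 28) & 0xFu;
    f.fs     = (((w >> 26) & 0x3u) << 2) | ((w >> 16) & 0x3u);
    f.vector = (w >> 18) & 0xFFu;
    f.sr     = (uint16_t)(w & 0xFFFFu);
    return f;
}

/* FS codes as assumed above; anything else is reported as "other/reserved". */
static const char *fs_name(unsigned fs)
{
    switch (fs) {
    case 0x4: return "error on instruction fetch";
    case 0x8: return "error on operand write";
    case 0x9: return "attempted write to write-protected space";
    case 0xC: return "error on operand read";
    default:  return "other/reserved";
    }
}
```

Logging the raw longword together with the stacked PC (the second longword of the frame) and D0-D7/A0-A7 keeps the handler short and lets the decoding happen offline.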

There are only two ways to proceed with this. The first part is to try to get it to happen more often. See what is triggering this. Does the unit have to be on for a long time? Is it on a network? Could it be related to the volume or type of network traffic? Is something overloading it? If you can get it to trigger "at will" then you run it with the debugger connected and set to stop on these traps. Then you'll be able to look back and see the function calling path on the stack.

If you can't get it to fail on a debugger you have to log enough information to be able to do that manually. That means the stack frame, registers, and at least 1k of "stack dump", so you can decode the stack and see all of the previous stack frames and function pointers. It is actually quite easy to have a "trap handler" walk the stack frame to print out (or in your case log) all of the previous function pointers. I've worked on an MPC860 based system that had a compressed "map" file in the file system and recorded the stack and crash information into a very large 8-entry ring. After a crash it retrieved the crash and decoded it, annotating the stack dump with the function names from the map file. That was very useful.
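
One simple way to pull "previous function pointers" out of a logged stack dump is to scan it for values that fall inside the code region. This is a heuristic sketch, not a real frame walker: it can report stale values that merely look like code addresses. The function name and the code range passed in are assumptions; take the real range from the map file.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scan a raw stack dump and collect every word that falls inside the
 * code region [text_start, text_end). Returns the number of candidates
 * written to out[] (at most max_out).
 */
static size_t find_return_candidates(const uint32_t *stack, size_t nwords,
                                     uint32_t text_start, uint32_t text_end,
                                     uint32_t *out, size_t max_out)
{
    size_t found = 0;
    size_t i;
    for (i = 0; i < nwords && found < max_out; i++) {
        uint32_t w = stack[i];
        /* Instructions are 16-bit aligned, so odd values cannot be return addresses. */
        if (w >= text_start && w < text_end && (w & 1u) == 0)
            out[found++] = w;
    }
    return found;
}
```

Each candidate can then be looked up in the map file the same way as the faulting PC.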

Tom

davidteo · ‎08-13-2021

1. I have checked on the heap (code and map review), there is plenty of memory, however I was not able to verify the fragmentation possibility.

2. Your reply mentioned the 022H instruction as the "cause" (PC offset).
I was assuming the PC indicates the next instruction to execute, so shouldn't the cause be from instruction(s) before 022H?
- This was a point I cannot ascertain about the CPU behaviour, nor from the Access error for operand read description.
- Are you or TX able to confirm that the PC is indicating the instruction it failed on, or should be some instruction before the PC?

- Note: FYI, the MAP trace is based on my local rebuild (same build process) of the 2 software versions on which the errors occurred. The original versions' map files are not available, so for now I can only assume my binaries are "similar and accurate" to the originals unless proven otherwise.

3. Yes, we have been trying to reproduce the issue on the reported unit itself. It has now been running for longer than it had when the issue was reported, without a recurrence.

Thanks a lot for the informative reply. I will take note of what is mentioned and see what else I can find from my investigation.

TomE · ‎08-13-2021

> 2. Your reply mentioned the 022H instruction as the "cause" (PC offset).
> I was assuming the PC indicates the next instruction to execute, so
> shouldn't the cause be from instruction(s) before 022H?

0x0000001A 0x2080 move.l d0,(a0)    This would give a write error
0x0000001C 0x2241 movea.l d1,a1     This doesn't read
0x0000001E 0xD3C8 adda.l a0,a1      This doesn't read
0x00000020 0x70FB moveq #-5,d0      This doesn't read
0x00000022 0xC191 and.l d0,(a1)     This does read
0x00000024 0x286F0018 movea.l 24(a7),a4 This is from the stack
0x00000028 0x2341FFFC move.l d1,-4(a1) 0x22 would have failed first

Now I'll read the Reference Manual. "Table 2-5. Exception Vector Assignments" says that for the Access Error, the PC that is stacked is "Fault": "Fault refers to the PC of the instruction that caused the exception." This doesn't apply in the case of writes as it is detailed that "The V2 ColdFire processor uses an imprecise reporting mechanism for access errors on operand writes.". But that's not your problem here. Your problem is with 0x22.

I've just been analysing some HC08 code. The compiler we're using (Cosmic) supplies the full sources for all the libraries. Does CW do that? If it does provide the library sources for "alloc()" then you'll be able to get a better idea of how that pointer got broken. Leading to "what sort of error could cause that".

I'm trying to work out what it is doing from your decode:

0x0000001E 0xD3C8 adda.l a0,a1      Pointer work
0x00000020 0x70FB moveq #-5,d0      d0 = 0xfffffffb = ...1111011
0x00000022 0xC191 and.l d0,(a1)     (a1) & ...1111011 --> (a1)

It is knocking out the third bit (bit 2) from the memory at that location. That's not an address operation - I was expecting it to be masking off bits to get a word address. That looks like it is clearing a flag bit in a memory location used to store the state of something.
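
The three immediates in the disassembly can be expanded the same way. A tiny sketch (the helper name is made up) of what `moveq` loads and what the resulting masks do:

```c
#include <stdint.h>

/* moveq sign-extends its 8-bit immediate to a full 32-bit value. */
static uint32_t moveq_value(int8_t imm)
{
    return (uint32_t)(int32_t)imm;
}
```

So `#-5` is 0xFFFFFFFB (clears bit 2), `#-3` is 0xFFFFFFFD (clears bit 1) and `#-8` is 0xFFFFFFF8 (clears the low three bits).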

The function is called "_Block_link". So I guess it is linking blocks together. And that's certainly going to go badly with bad pointers. I'm still going with a null pointer - someone freeing a corrupted block, or writing off the end of an allocated block. Classic "buffer overrun". Like I said, is this thing on a network? Search for all "strcpy()" like functions that don't check the length of the copy.
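
For comparison, this is roughly what a length-checked replacement for a bare strcpy() looks like. It is a sketch only; `copy_checked` is a hypothetical helper, not something from MSL.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (capacity dst_size bytes, including the terminator).
 * Returns 0 on success, -1 if src did not fit or an argument was invalid.
 * On truncation dst still ends up terminated.
 */
static int copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len;
    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;
    len = strlen(src);
    if (len >= dst_size) {
        memcpy(dst, src, dst_size - 1);
        dst[dst_size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);
    return 0;
}
```

The return value matters as much as the bound: a caller that ignores -1 has only traded a heap overrun for silently truncated data.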

Google can't find "_Block_link" but it can find "C_TRK_4i_CF_MSL.a"; but it doesn't help and you should have this document already:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.699.1583&rep=rep1&type=pdf

Tom

davidteo · ‎08-17-2021

Code extract of the Block_link() function in alloc.c located in the CW 6.4 installation. (Assuming it is representative of that in the library "C_TRK_4i_CF_MSL.a")

I have commented the original statements and replaced with the #defines equivalent.

static void Block_link(Block* ths, SubBlock* sb)
{
  SubBlock** st;

  /* SubBlock_set_free(sb); // #define SubBlock_set_free(ths) \
                                 mem_size this_size = SubBlock_size((ths)); \\ #define SubBlock_size(ths) ((ths)->size_ & size_flag) \
                                   (ths)->size_ &= ~this_alloc_flag; \
                                   *(mem_size*)((char*)(ths) + this_size) &= ~prev_alloc_flag; \
                                   *(mem_size*)((char*)(ths) + this_size - sizeof(mem_size)) = this_size
  */
  /* Inside the macro, "ths" is the macro parameter, which is "sb" here, not the Block* ths. */
  mem_size this_size = ((sb)->size_ & size_flag);
  (sb)->size_ &= ~this_alloc_flag;
  *(mem_size*)((char*)(sb) + this_size) &= ~prev_alloc_flag;
  *(mem_size*)((char*)(sb) + this_size - sizeof(mem_size)) = this_size;


  /* st = &Block_start(ths); // #define Block_start(ths) (*(SubBlock**) 
     ((char*)(ths) + Block_size((ths)) - sizeof(mem_size)))
  */
  st = &(*(SubBlock**)((char*)(ths) + Block_size((ths)) - sizeof(mem_size)));
  if (*st != 0)
  {
    sb->prev_ = (*st)->prev_;
    sb->prev_->next_ = sb;
    sb->next_ = *st;
    (*st)->prev_ = sb;
    *st = sb;
    *st = SubBlock_merge_prev(*st, st);
    SubBlock_merge_next(*st, st);
  }
  else
  {
    *st = sb;
    sb->prev_ = sb;
    sb->next_ = sb;
  }
  if (ths->max_size_ < SubBlock_size(*st))
    ths->max_size_ = SubBlock_size(*st);
}

I have also read your post below, but not able to reply accordingly. Thanks for the informative reply!

TomE · ‎08-17-2021

There's something wrong with that code. There's a missing "=" or a missing ";" or something. Anyway, there's enough to go on:

From your original disassembly, the code blew up on the third "&" operation. So if I look for the third one in your source and try to line them up, I get this. I've rearranged the operations so you can see the data flow better:

/* SubBlock_set_free(sb); // #define SubBlock_set_free(ths) \
  mem_size this_size = SubBlock_size((ths)); \\ 
    #define SubBlock_size(ths) 
      ((ths)->size_ & size_flag) \
0x0000000E 0x7EF8 moveq #-8,d7
0x00000014 0x2207 move.l d7,d1
0x00000018 0xC282 and.l d2,d1

      (ths)->size_ &= ~this_alloc_flag; \
0x00000012 0x70FD moveq #-3,d0
0x00000016 0xC082 and.l d2,d0
0x0000001A 0x2080 move.l d0,(a0)

      *(mem_size*)((char*)(ths) + this_size) &= ~prev_alloc_flag; \
0x0000001C 0x2241 movea.l d1,a1
0x0000001E 0xD3C8 adda.l a0,a1
0x00000020 0x70FB moveq #-5,d0
0x00000022 0xC191 and.l d0,(a1)

      *(mem_size*)((char*)(ths) + this_size - sizeof(mem_size)) = this_size

You could check the values of "size_flag", "this_alloc_flag" and "prev_alloc_flag" and see if they correspond to the numbers in the disassembly.

I'm pretty sure it thinks it is reading a block descriptor and trying to read the size. It then uses that to calculate where the end of the block (or the start of the next block) is. That's the "adda.l a0,a1". The "and.l d0,(a1)" that follows is the read that blew up. That's probably because the "size" it read was something bigger than the expected size, so it was trying to access "over there", where "there" might be a gigabyte away from where it should be.
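
To make that failure mode concrete, here is a small host-side model of those first steps with a bounds check added. It is not the MSL code: the flag values are inferred from the immediates in the disassembly (-8, -3, -5), the heap is just an array, and `set_free_checked` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values inferred from the immediates in the disassembly (assumption). */
#define SIZE_MASK        0xFFFFFFF8u  /* moveq #-8 */
#define THIS_ALLOC_FLAG  0x00000002u  /* moveq #-3 is ~2 */
#define PREV_ALLOC_FLAG  0x00000004u  /* moveq #-5 is ~4 */

/*
 * Model of the first steps of SubBlock_set_free on a byte-addressed heap.
 * hdr_off is the byte offset of the sub-block header inside heap[].
 * Returns 0 on success, -1 if the size word would send the next access
 * outside the heap (the situation that faults on the real hardware).
 * Unlike the real code, nothing is written until the size has been checked.
 */
static int set_free_checked(uint32_t *heap, size_t heap_bytes, size_t hdr_off)
{
    uint32_t size_word, this_size;
    size_t next_off;

    if (hdr_off % 4 != 0 || hdr_off + 4 > heap_bytes)
        return -1;
    size_word = heap[hdr_off / 4];
    this_size = size_word & SIZE_MASK;

    /* The next header must still lie inside the heap. */
    if (this_size < 8 || this_size > heap_bytes - hdr_off - 4)
        return -1;
    next_off = hdr_off + this_size;            /* the adda.l a0,a1 step */

    heap[hdr_off / 4]      = size_word & ~THIS_ALLOC_FLAG;
    heap[next_off / 4]    &= ~PREV_ALLOC_FLAG; /* the and.l d0,(a1) step */
    heap[next_off / 4 - 1] = this_size;        /* the move.l d1,-4(a1) step */
    return 0;
}
```

With a sane header the call succeeds; overwrite the size word with something like 0x40000002 and the check trips at exactly the point where the real code dereferences a1.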

That still doesn't tell you anything except that the memory heap is or has been corrupted by something. It is usually at this point that you need to build with a debug-version of the memory allocator that checks for errors at every stage, and can tell you when it FIRST went wrong. The worst bugs are ones where the heap was corrupted a long time ago and a subsequent operation tripped over it.
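
If the toolchain does not ship a debug allocator, a thin wrapper with guard bytes is a common substitute. The sketch below is a generic illustration on top of standard malloc/free, not a CodeWarrior feature; all names are hypothetical, and it ignores alignment requirements stricter than the header size and overflow of the size calculation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES 8
#define GUARD_FILL  0xA5

/* Header stored in front of every debug allocation. */
struct dbg_header {
    size_t size;                         /* user-requested size */
    unsigned char guard[GUARD_BYTES];    /* front guard */
};

static void *dbg_malloc(size_t size)
{
    unsigned char *raw = malloc(sizeof(struct dbg_header) + size + GUARD_BYTES);
    struct dbg_header *h = (struct dbg_header *)raw;
    if (raw == NULL)
        return NULL;
    h->size = size;
    memset(h->guard, GUARD_FILL, GUARD_BYTES);
    memset(raw + sizeof *h + size, GUARD_FILL, GUARD_BYTES);  /* rear guard */
    return raw + sizeof *h;
}

/* Returns 1 if both guards are intact, 0 if something wrote outside the block. */
static int dbg_check(const void *user)
{
    const unsigned char *p = user;
    const struct dbg_header *h = (const struct dbg_header *)(p - sizeof *h);
    size_t i;
    for (i = 0; i < GUARD_BYTES; i++) {
        if (h->guard[i] != GUARD_FILL || p[h->size + i] != GUARD_FILL)
            return 0;
    }
    return 1;
}

static void dbg_free(void *user)
{
    if (user != NULL)
        free((unsigned char *)user - sizeof(struct dbg_header));
}
```

Calling `dbg_check()` on every live block from a periodic task, and again just before each free, narrows down when the overrun happened rather than when the allocator finally tripped over it.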

But then you still need a "crash dump" with the registers and a stack you can analyze.

Or you need to "stare at the code" until you see the bug. Check every allocation and free and see what's writing to them. Mainly, what has CHANGED in the use of the product. What has changed in incoming data that is now overwriting a buffer when it didn't used to.

Is this connected to a network? What protocols is it using? UDP, TCP, HTTP, Zeroconf, MDNS?

Is this receiving serial data streams over an RS232 port? Check the protocol decoder code to see how "trusting" it is. Can it handle ANY corruption in the data stream, like a message too long, missing delimiters, loss of sync, bad CRC, corrupted in-protocol length bytes, corrupted protocol selector bytes?

Likewise receiving and decoding CAN messages.
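
As an illustration of a decoder that does not trust its input, here is a sketch for an invented frame format: sync byte 0x7E, one length byte, payload, 8-bit additive checksum. The format, names and limits are all hypothetical; the point is that every length is checked against both the received byte count and the destination size before anything is copied.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SYNC   0x7Eu
#define MAX_PAYLOAD  32u

enum frame_status { FRAME_OK = 0, FRAME_SHORT, FRAME_BAD_SYNC, FRAME_BAD_LEN, FRAME_BAD_SUM };

/*
 * Validate a received frame and copy its payload into out[MAX_PAYLOAD].
 * Nothing is written to out or out_len unless the whole frame is valid.
 */
static enum frame_status parse_frame(const uint8_t *rx, size_t rx_len,
                                     uint8_t *out, size_t *out_len)
{
    uint8_t len, sum = 0;
    size_t i;

    if (rx_len < 3)
        return FRAME_SHORT;
    if (rx[0] != FRAME_SYNC)
        return FRAME_BAD_SYNC;
    len = rx[1];
    if (len > MAX_PAYLOAD || (size_t)len + 3 > rx_len)
        return FRAME_BAD_LEN;
    for (i = 0; i < len; i++)
        sum = (uint8_t)(sum + rx[2 + i]);
    if (sum != rx[2 + len])
        return FRAME_BAD_SUM;
    memcpy(out, &rx[2], len);
    *out_len = len;
    return FRAME_OK;
}
```

A decoder that skipped the `FRAME_BAD_LEN` check would copy up to 255 bytes into a 32-byte buffer the first time noise hit the length byte.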

Is it anywhere near high current sources like welders, motors, electric furnaces, fans, contactors, electric vehicles, generators, mining operations, Railways, Tramlines, Elevators, conveyor systems, Air Conditioning units?

Is it on a radio network or near any radio sources? That includes Walkie Talkies, CB Radio, WiFi and Mobile Phones. Near it or its comm lines or power source or remote sensors. What else gets plugged into its power point? Vacuum cleaners and floor polishers can trigger problems when the cleaner comes around.

Is there any potential for "Ground Shift" between that CPU and things it is connected to? Are its comms lines opto-isolated or transformer isolated (and CAN doesn't count, that needs a common ground; lots of people get that wrong)?

Single-point earthing, on the board and to the case?

Lightning? Electrostatic discharge? I remember helping with a fingerprint scanner that wasn't earthed properly. People's fingers would touch the unit and the ESD would jump from the CPU address lines across to the Ethernet network output signals. They had a big box full of dead CPU cards.

If you want to try and trigger the problem, try any and all of the above nasty things. Run radios and phones near the unit. Hit it with ESD. Turn big electric things on and off. Wrap an electric welding cable around it (don't do that :-).

In case the above isn't clear, I suspect there's some code that doesn't error check some data coming in from outside. When this data is corrupted (by noise of some sort), the code fails and corrupts the memory heap, causing the crash. So the way to try and make this happen more often is to corrupt the data a lot.
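
On the bench, the same idea can be applied in software by flipping bits in captured input before it reaches the decoder. A minimal sketch, assuming recorded frames are available as byte buffers; the function names are made up, and the generator is a plain xorshift32 so that a failing run can be repeated from its seed.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator so test runs are repeatable; state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Flip nflips pseudo-randomly chosen bits in buf[0..len). */
static void corrupt_bits(uint8_t *buf, size_t len, unsigned nflips, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;
    unsigned i;
    if (len == 0)
        return;
    for (i = 0; i < nflips; i++) {
        uint32_t r = xorshift32(&state);
        buf[(r >> 3) % len] ^= (uint8_t)(1u << (r & 7u));
    }
}
```

Running the receive path over thousands of corrupted copies of known-good traffic, with heap checks enabled, is usually much faster than waiting days for the field failure.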

Tom

davidteo · ‎08-19-2021

Several of the use cases you mentioned here are being retested; so far, none has yielded a reproducible result.

I will take note of your suggestions where applicable.

Hopefully, I can update this post again in near future with some results/root cause for sharing.

Thanks a lot for the replies!

Hui_Ma · ‎08-11-2021

Hi,

1) Can someone explain how/which instruction(s) can result in "access error operand read"?

TS: <MCF5282UM> chapter 2.3.4.1 Access Error Exception describes access errors in detail. It says the following about an error on an operand read:

If the access error occurs on an operand read, the processor immediately aborts the current instruction’s execution and initiates exception processing. In this situation, any address register updates attributable to the auto-addressing modes, (for example, (An)+,-(An)), have already been performed, so the programming model contains the updated An value. In addition, if an access error occurs during a MOVEM instruction loading from memory, any registers already updated before the fault occurs contain the operands from memory.

A MOVE instruction with a memory source operand performs such a read.

2) If this function can be a possible root cause, may I know which user code statements can reach this library function?
Ie. My assumption is "new array[size];", or "delete ptr;", etc?

TS: I checked, and the PC address is located in Flash memory. Is there any Flash read/erase/write operation running at the same time as firmware execution?

3) At time of writing, I have also read posts, which recommend to check the hardware errata for "Internal Flash Speculation Address Qualification Incomplete", which I am unable to verify the hw cpu version yet.

TS: Could you provide the marking printed on the surface of the MCF5282 chip? Then we could check whether its Date Code is affected by the MCF5282 Errata.

Thanks for the attention.

B.R.

Mike

davidteo · ‎08-13-2021

2.3.4.1 Access Error Exception

I have read the section, however, it is my assumption is that a read operand is practically valid to almost all instructions, where data must be read, except for comparison instructions? So, I can't ascertain that I can interpret the paragraph easily to disassembly code. Was hoping that NXP can help narrow down/explain which of those instructions specifically might result in this read operand error only, if my assumption was wrong.

TS: I checked, and the PC address is located in Flash memory. Is there any Flash read/erase/write operation running at the same time as firmware execution?

As I am fairly new to both this CPU and source code which I am debugging for this issue, I am unable to verify but will take note of this scenario mentioned.

TS: Could you provide the marking printed on the surface of the MCF5282 chip? Then we could check whether its Date Code is affected by the MCF5282 Errata.

The unit is not on same continent with me. When I have info on this, I will update this again.

TomE · ‎08-13-2021

I've read all the errata. It is very unlikely that any of them will be causing you trouble. If you had the Flash programming wrong it should be corrupting the code reads all over the place. If you were getting cache flushing wrong, that would be very unlikely to give consistent results. This is a software bug. Probably externally triggered.

> it is my assumption is that a read operand is practically valid to almost all instructions,
> where data must be read,

I have no idea what you mean there. The Instruction Fetch reads the instruction from memory. If that cycle fails you get a different error. When the instruction execution requires data to be read or written, if that goes wrong you get the Access Error Exception. Now we have to define "wrong".

Here's the big problem with this architecture and the documentation though. Lots of CPU architectures derive from really simple 1970's design (I mean 8080 and 6502 and 6802). Many of them still have all the limitations of those original designs, meaning the CPU blindly performs reads and write cycles, and if there's nothing at that address it is just waving its legs in the breeze and reading an "empty bus". If you need a cycle to take longer than the default, then it takes extra hardware to force "Wait States".

There's no concept of a "bad read". If you need one, then that has to be "bolted on" to the original design, usually badly. So you can get bad pointers and nothing throws an error that can be trapped, and so the CPU just heads off into the weeds. That was fine for 1970, but we need something better now. Anyone used to that architecture gets confused when someone else finds a better way to operate.

The Coldfire derives from the 68000 which copied the way the PDP-11 worked, and it (and the 68000) performed "Asynchronous Bus Cycles" where the addressed peripheral or memory could take as long as it liked to perform the operation, and told the CPU when the cycle was done by driving "DTACK" or "BERR" if it went wrong. The bus is not synchronous, and that is by design. The system had to come with "bus timeout" hardware to interrupt the CPU if nothing was answering. This capability lives on in the ColdFire chips. The older ones have the same external CPU bus as the 68000 but with "DTACK/BERR" called "TA/TEA". This chip has all of that internally (the SRAM, Flash and peripherals all drive an internal "TA"). The EIM has "TA" and "TEA" pins. "TA" can be driven internally by the chip select decoder for a synch cycle or it can be driven by the peripheral. But if nothing drives it, the cycle never finishes and the CPU stalls forever (or until the watchdog resets it). That is obviously "suboptimal".

So the TA and TEA are a straight line back to the 68000 and PDP11. If you've read all the manuals back to about 1970, then you know how all this works. If you haven't read them then there's no surprise that you're confused as the modern manuals don't give any of the background to help you understand how this works.

So there should be some hardware that can time out these cycles when the software gets confused and generates a bad pointer that tries to access the 99.986% of your memory space that is unoccupied.

I'm sure some NXP CPUs have this, so let's search. Here's a "Bus Monitor" in the MCF5445x (and read that for what goes wrong with it):

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/5445x-Bus-Monitor/m-p/177624

And in that article:

We are using bus monitor on MCF5282 processor, it is configured to 1024 cycles in CCR register and we get ACCESS ERROR trap on wrong access. It operates regulary [sic] on our project.

So you do have a "Bus Monitor" and that throws the Access Exception when something accesses the 99.986% of your memory space that is unoccupied. You probably don't know it exists as it is enabled out of reset.

So (given the history), all Coldfire chips have to have Bus Monitors, right?

This seems to be true for the MCF53 and MCF54 chips, but not the MCF548x. But the MCF52 are a mixed bag. Later ones have one. Early ones like the MCF5208 have one, but it doesn't work properly. The MCF5250 (and MCF548x) have the Software Watchdog terminating stuck bus cycles, but with TA and not TEA, and then throwing an interrupt instead of an exception.

I can't find anything in any MCF51 series chip that looks like a "Bus Monitor". Some manuals say that "Reserved memory in table 4.1" generate a Bus Error, but there's no "reserved" memory in that table. Other later manuals list the "Unallocated" memory spaces as generating Bus Errors, so it is likely all of them do it (but it just isn't documented).

Tom

davidteo · ‎08-30-2021

1) I was just updated that the correct cpu involved in this access error should be a MCF5280, not MCF5282.
(Is there any way to recorrect this post title to mcf5280?)

The MCF5280 errata (last updated 2009) contains many unresolved errata with software workarounds, but mentions none regarding access error exceptions or operand reads.

- This issue occurred only recently on several units. The time to failure on each reported device is not consistent (anywhere from about 30 hours to more than 3 days before the access error).

2) The access error operand read issue is still valid, and verified matching with the released file.

I managed to locate the first 22H bytes of the disassembled block_link() function at the same linked address in my own rebuilt version (rebuilt.s), and they match the released version (release.s) at that address.
- There were no changes between the 2 software versions at this function's zero-offset to the 22H bytes.
- Read operand access error did indeed occur at this location.

2a) I did a debug to find out how this can be reached, and in most cases, it comes primarily from freeing dynamic memory operation. (I have yet to debug more to find out)

- vector.push_back(), list.push_back(), pop_front(): not every consecutive calls of such statements may reach block_link().

- delete: via "operator <<" for string.

TomE · ‎08-30-2021

It's a double free or buffer overrun error. Are any of those "strings" being handed to any code that doesn't respect the allocated length, or assumes "the buffer is big enough"? That's what you're looking for.
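
One cheap way to catch a double free in a test build is to route allocations through a wrapper that remembers which pointers are live. This is only a sketch with a fixed-size table and made-up names; hooking it under C++ new/delete or the string class is left open.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_LIVE 64

static void *live[MAX_LIVE];   /* pointers currently allocated through trk_malloc */

static void *trk_malloc(size_t size)
{
    void *p = malloc(size);
    size_t i;
    if (p == NULL)
        return NULL;
    for (i = 0; i < MAX_LIVE; i++) {
        if (live[i] == NULL) {
            live[i] = p;
            return p;
        }
    }
    free(p);        /* table full: refuse rather than lose track */
    return NULL;
}

/* Returns 0 if p was live and has been freed, -1 for a double or invalid free. */
static int trk_free(void *p)
{
    size_t i;
    if (p == NULL)
        return 0;
    for (i = 0; i < MAX_LIVE; i++) {
        if (live[i] == p) {
            live[i] = NULL;
            free(p);
            return 0;
        }
    }
    return -1;      /* not in the table: log it instead of calling free() */
}
```

A -1 from `trk_free()` is the point to log the caller's return address, since that is the code path that freed the same block twice or freed something it never owned.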

Tom