There's something wrong with that code. There's a missing "=" or a missing ";" or something. Anyway, there's enough to go on:
From your original disassembly, the code blew up on the third "&" operation. So if I look for the third one in your source and try to line them up, I get this. I've rearranged the operations so you can see the data flow better:
/* SubBlock_set_free(sb); */
#define SubBlock_set_free(ths) \
mem_size this_size = SubBlock_size((ths)); \
    #define SubBlock_size(ths) ((ths)->size_ & size_flag)
        0x0000000E 0x7EF8 moveq #-8,d7
        0x00000014 0x2207 move.l d7,d1
        0x00000018 0xC282 and.l d2,d1
(ths)->size_ &= ~this_alloc_flag; \
        0x00000012 0x70FD moveq #-3,d0
        0x00000016 0xC082 and.l d2,d0
        0x0000001A 0x2080 move.l d0,(a0)
*(mem_size*)((char*)(ths) + this_size) &= ~prev_alloc_flag; \
        0x0000001C 0x2241 movea.l d1,a1
        0x0000001E 0xD3C8 adda.l a0,a1
        0x00000020 0x70FB moveq #-5,d0
        0x00000022 0xC191 and.l d0,(a1)
*(mem_size*)((char*)(ths) + this_size - sizeof(mem_size)) = this_size
You could check the values of "size_flag", "this_alloc_flag" and "prev_alloc_flag" and see if they correspond to the numbers in the disassembly.
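Here's a rough sketch of that cross-check. The three flag values are my guesses read back from the moveq immediates (moveq sign-extends its 8-bit immediate to 32 bits), not values taken from your headers, and I'm assuming mem_size is 32 bits. The helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t mem_size;   /* assumption: 32-bit mem_size on this target */

/* moveq sign-extends an 8-bit immediate to a 32-bit register value. */
static mem_size moveq_mask(int8_t imm)
{
    return (mem_size)(int32_t)imm;
}

/* Candidate values inferred from the disassembly; compare them with
   the real definitions in the allocator source. */
static const mem_size size_flag       = ~(mem_size)0x07;
static const mem_size this_alloc_flag = 0x02;
static const mem_size prev_alloc_flag = 0x04;

/* Returns 1 if the candidates reproduce the three masks in the listing. */
static int flags_match_disassembly(void)
{
    return moveq_mask(-8) == size_flag                    /* moveq #-8,d7 */
        && moveq_mask(-3) == (mem_size)~this_alloc_flag   /* moveq #-3,d0 */
        && moveq_mask(-5) == (mem_size)~prev_alloc_flag;  /* moveq #-5,d0 */
}
```

If your real definitions differ from these, then the lineup between source and disassembly is wrong somewhere and should be redone.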
I'm pretty sure it thinks it is reading a block descriptor and trying to read the size. It then uses that to calculate where the end of the block (or the start of the next block) is. That's the "adda.l a0,a1". The "and.l d0,(a1)" right after it is the access that blew up. That's probably because the "size" it read was much bigger than the real size, so it was trying to write "over there", where "there" might be a gigabyte away from where it should be.
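To show how far off "over there" can be, here's a small sketch of the offset calculation. The size mask is my assumption from the moveq #-8 in the listing, mem_size is assumed to be 32 bits, and both function names are invented for this example:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t mem_size;             /* assumption: 32-bit mem_size */
#define SIZE_FLAG (~(mem_size)0x07)    /* assumption: from moveq #-8  */

/* Offset from a block header to the next header, as the macro computes it. */
static mem_size next_header_offset(mem_size size_field)
{
    return size_field & SIZE_FLAG;     /* and.l d2,d1 */
}

/* Returns 1 if the next header would still be inside a heap of heap_bytes,
   for a block that sits block_offset bytes into that heap. A debug build
   could run a check like this before the and.l d0,(a1) store. */
static int next_header_in_heap(mem_size block_offset, mem_size size_field,
                               mem_size heap_bytes)
{
    mem_size off = next_header_offset(size_field);
    return block_offset < heap_bytes && off <= heap_bytes - block_offset;
}
```

A sane size field like 0x46 gives an offset of 0x40. If the field has been overwritten with the text "ABCD" (0x41424344), the offset becomes 0x41424340, which is about a gigabyte.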
That still doesn't tell you anything except that the memory heap is or has been corrupted by something. It is usually at this point that you need to build with a debug-version of the memory allocator that checks for errors at every stage, and can tell you when it FIRST went wrong. The worst bugs are ones where the heap was corrupted a long time ago and a subsequent operation tripped over it.
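If your toolchain doesn't come with a debug allocator, a very minimal one can be sketched on top of malloc. This is only an illustration of the guard-word idea, not how your library does it; dbg_malloc, dbg_check, dbg_free and the GUARD pattern are all names I made up:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD 0xDEADBEEFu   /* arbitrary pattern, chosen for this sketch */

typedef struct {
    size_t   size;          /* user-visible size of the block */
    uint32_t front;         /* guard word ahead of the user data */
} dbg_header;

/* Allocate n bytes with a guard word on each side of the user data. */
static void *dbg_malloc(size_t n)
{
    uint32_t g = GUARD;
    dbg_header *h = malloc(sizeof *h + n + sizeof g);
    if (h == NULL)
        return NULL;
    h->size = n;
    h->front = GUARD;
    memcpy((unsigned char *)(h + 1) + n, &g, sizeof g);  /* rear guard */
    return h + 1;
}

/* Returns 1 if both guards are intact, 0 if something overwrote them. */
static int dbg_check(const void *p)
{
    const dbg_header *h = (const dbg_header *)p - 1;
    uint32_t rear;
    memcpy(&rear, (const unsigned char *)p + h->size, sizeof rear);
    return h->front == GUARD && rear == GUARD;
}

/* Free a block, stopping right away if it has been damaged. */
static void dbg_free(void *p)
{
    assert(dbg_check(p));
    free((dbg_header *)p - 1);
}
```

Calling dbg_check on every live block at regular points (start of each message handler, each timer tick) narrows down when the damage first happened, rather than when something finally tripped over it.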
But then you still need a "crash dump" with the registers and a stack you can analyze.
Or you need to "stare at the code" until you see the bug. Check every allocation and free and see what's writing to them. Mainly, what has CHANGED in the use of the product? What has changed in the incoming data that is now overwriting a buffer when it didn't use to?
Is this connected to a network? What protocols is it using? UDP, TCP, HTTP, Zeroconf, MDNS?
Is this receiving serial data streams over an RS232 port? Check the protocol decoder code to see how "trusting" it is. Can it handle ANY corruption in the data stream, like a message too long, missing delimiters, loss of sync, bad CRC, corrupted in-protocol length bytes, corrupted protocol selector bytes?
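As an example of what an "untrusting" decoder looks like, here's a sketch for a made-up frame format (STX, length byte, payload, 8-bit sum, ETX). The format, the names and the MAX_PAYLOAD limit are all assumptions for illustration; your protocol will differ, but the order of the checks is the point:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STX 0x02
#define ETX 0x03
#define MAX_PAYLOAD 32

typedef enum { FRAME_OK, FRAME_SHORT, FRAME_BAD_DELIM,
               FRAME_BAD_LEN, FRAME_BAD_SUM } frame_status;

/* Hypothetical frame: STX, len, payload[len], sum, ETX.
   sum is the 8-bit sum of len and the payload bytes. */
static frame_status decode_frame(const uint8_t *in, size_t in_len,
                                 uint8_t *out, size_t out_cap,
                                 size_t *out_len)
{
    if (in_len < 4)
        return FRAME_SHORT;        /* cannot even hold an empty frame */
    if (in[0] != STX)
        return FRAME_BAD_DELIM;

    size_t len = in[1];
    if (len > MAX_PAYLOAD || len > out_cap)
        return FRAME_BAD_LEN;      /* never trust the length byte */
    if (in_len != len + 4)
        return FRAME_BAD_LEN;      /* length must agree with bytes received */
    if (in[len + 3] != ETX)
        return FRAME_BAD_DELIM;

    uint8_t sum = in[1];
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + in[2 + i]);
    if (sum != in[len + 2])
        return FRAME_BAD_SUM;

    memcpy(out, in + 2, len);      /* copy only after every check passed */
    *out_len = len;
    return FRAME_OK;
}
```

Nothing is written to the output buffer until the length, both delimiters and the checksum have been verified, so a corrupted length byte can't cause an overrun.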
Likewise receiving and decoding CAN messages.
Is it anywhere near high current sources like welders, motors, electric furnaces, fans, contactors, electric vehicles, generators, mining operations, railways, tramlines, elevators, conveyor systems, air conditioning units?
Is it on a radio network or near any radio sources? That includes walkie-talkies, CB radio, WiFi and mobile phones. Near it or its comm lines or power source or remote sensors. What else gets plugged into its power point? Vacuum cleaners and floor polishers can trigger problems when the cleaner comes around.
Is there any potential for "Ground Shift" between that CPU and things it is connected to? Are its comms lines opto-isolated or transformer isolated (and CAN doesn't count, that needs a common ground; lots of people get that wrong)?
Single-point earthing, on the board and to the case?
Lightning? Electrostatic discharge? I remember helping with a fingerprint scanner that wasn't earthed properly. People's fingers would touch the unit and the ESD would jump from the CPU address lines across to the Ethernet network output signals. They had a big box full of dead CPU cards.
If you want to try and trigger the problem, try any and all of the above nasty things. Run radios and phones near the unit. Hit it with ESD. Turn big electric things on and off. Wrap an electric welding cable around it (don't do that :-).
In case the above isn't clear, I suspect there's some code that doesn't error check some data coming in from outside. When this data is corrupted (by noise of some sort), the code fails and corrupts the memory heap, causing the crash. So the way to try and make this happen more often is to corrupt the data a lot.
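You can also corrupt the data in software instead of with a welder. Here's a rough sketch of a harness that flips random bits in a known-good message and feeds the result to the decoder. The decoder_fn signature and all the names are assumptions; adapt them to whatever your receive path looks like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small deterministic PRNG so a failing run can be replayed from its seed. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Copy src to dst and flip `flips` randomly chosen bits in the copy. */
void corrupt_copy(uint8_t *dst, const uint8_t *src, size_t len,
                  unsigned flips, uint32_t *state)
{
    memcpy(dst, src, len);
    if (len == 0)
        return;
    for (unsigned i = 0; i < flips; i++) {
        uint32_t r = next_rand(state);
        dst[(r >> 3) % len] ^= (uint8_t)(1u << (r & 7));
    }
}

/* Signature of the code under test; hypothetical, adapt to your decoder. */
typedef void (*decoder_fn)(const uint8_t *data, size_t len);

/* Feed `rounds` corrupted copies of a known-good message to the decoder.
   Messages longer than the local buffer are truncated. */
void hammer_decoder(decoder_fn decode, const uint8_t *good, size_t len,
                    unsigned rounds, uint32_t seed)
{
    uint8_t buf[256];
    if (len > sizeof buf)
        len = sizeof buf;
    for (unsigned i = 0; i < rounds; i++) {
        corrupt_copy(buf, good, len, 1 + i % 4, &seed);
        decode(buf, len);
    }
}
```

Run it against a build with heap checking turned on, and log the seed, so that when the heap does get corrupted you can replay exactly the message that did it.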
Tom