Coldfire SCI Transmit Errors

R_R_Ritchey · ‎03-05-2014

I have a strange problem happening on transmit. The first or second character of my transmit sequence gets stepped on, but only about once every 1000 transmits. The code is completely interrupt driven. I have an MCF51JM running at 48MHz, transmitting serial at 100K baud. I transmit a message to another device on the bus and it sends back a message. I am looping on the exact same message. The transmit message should start out as 0x01, 0x00, 0x78. It works great except that about once in 1000 transmits one of the first two bytes gets stepped on: I see 0x01, 0x78 or 0x00, 0x78 transmitted. I can breakpoint on a timeout, as the other device will not respond to this message. When I check the transmit buffer it is correct. It seems as if the TDRE interrupt happens once too often. I seem to be reloading the transmit buffer too fast and a value gets stepped on. It's as though TDRE stays set after I leave the interrupt routine and I get interrupted back into the routine before the transmit data register moves to the transmit register. Looking for a solution. Thanks,

TomE · ‎03-05-2014

I suspect a software bug. It looks like an "interrupt hazard" of some sort.

> It's as though TDRE stays set after I leave the interrupt routine and I get interrupted back

I would suggest adding some temporary code to your transmit routine that checks the state of TDRE immediately after you've written SCIxD in that code. See if it ever remains set and, if it does, whether this correlates with your "stomped" messages. See if adding this test has changed the timing enough so the problem then "goes away".

There's another problem you may be triggering. In your SCI Transmit interrupt routine you MUST read SCIxS1 and then TEST for TDRE. Sometimes you may find it isn't set. Obviously you shouldn't write to SCIxD then, but you can have your code tell you if this happens and if it correlates with your problem. Also note (1) below.

That it only happens to the first or second character indicates a problem with starting the message transmission. So how are you STARTING a transmission in your code?

In this chip, the TDRE is a Status bit that says "the register is empty". It is not a flag that is set when the register BECOMES empty that can trigger just one interrupt. That is a pain as it means that when you've finished a message you have to clear TIE when there's nothing more to send, and in your mainline you have to set TIE to start a new transmission.

So to start a new message, are you just setting TIE or are you setting TIE *and* writing the first byte to SCIxD?

If you're writing the first byte in the mainline, then don't. Just set TIE and let the interrupt routine send the first byte. See if that fixes the problem.
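A minimal host-side sketch of the "just set TIE" start described above. The `sim_*` variables are stand-ins for SCIxS1, SCIxC2 and SCIxD so the logic can be exercised off-target, and the function names are made up for illustration. On the real part the hardware clears TDRE on the data write; the sketch fakes that by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCI_S1_TDRE 0x80u  /* SCIxS1: transmit data register empty */
#define SCI_C2_TIE  0x80u  /* SCIxC2: transmit interrupt enable    */

/* Stand-ins for the real registers (assumption: simulation only). */
static volatile uint8_t sim_S1 = SCI_S1_TDRE;  /* idle transmitter */
static volatile uint8_t sim_C2;
static volatile uint8_t sim_D;

static const uint8_t *volatile tx_ptr;
static volatile size_t tx_left;

/* Mainline: set up the buffer, then enable TIE. SCIxD is not touched. */
void tx_start(const uint8_t *msg, size_t len)
{
    assert(msg != NULL && len > 0);
    tx_ptr = msg;
    tx_left = len;
    sim_C2 |= SCI_C2_TIE;   /* TDRE is already set, so the ISR runs next */
}

/* Transmit ISR body. Returns 1 if a byte was written, 0 otherwise. */
int sci_tx_isr(void)
{
    uint8_t status = sim_S1;             /* read SCIxS1 first ...         */
    if (!(status & SCI_S1_TDRE)) {
        return 0;                        /* ... and test it: not empty    */
    }
    if (tx_left == 0) {
        sim_C2 &= (uint8_t)~SCI_C2_TIE;  /* nothing left: stop interrupts */
        return 0;
    }
    sim_D = *tx_ptr;                     /* write SCIxD                   */
    tx_ptr = tx_ptr + 1;
    tx_left = tx_left - 1;
    sim_S1 &= (uint8_t)~SCI_S1_TDRE;     /* simulation of the hardware    */
    return 1;
}
```

The point of the shape is that only the ISR ever writes the data register, so the mainline and the ISR never race on the first byte.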

(1) Otherwise, if you're writing the first byte in the mainline, are you writing it PROPERLY? The Manual says "To clear TDRE, read SCIxS1 with TDRE set and then write to the SCI data register (SCIxD).". That's tricky. That's not simple. That's a recipe for problems. I would suggest you have to do those two things with CPU interrupts DISABLED or you risk getting an interrupt between those two operations and messing up the timing. Like a Receive interrupt. You must write to SCIxD this way in your ISR too.

I suspect that your "once in 1000 transmits" happens because some other interrupt (like the TPM) occurs between a few critical instructions in the mainline or in the SCI ISR. If you're not hard-disabling all interrupts in the SCI service routine, the TPM can interrupt the SCI interrupt. That may change the timing or access something else.

My experience with these chips is that when you initially write the first byte to SCIxD, TDRE drops instantly, but it takes one whole SCI baud-rate-clock (or one whole bit time) before TDRE will assert again. So you can write to SCIxD and then set TIE and the interrupt won't happen until you've left that function. Unless that function is interrupted by something else, in which case the SCI interrupt may happen before you've finished some sensitive operations. So if your transmit code looks like "SCIxD = *ptr; SCIxC2 |= TIE; ptr += 1;" then sometimes the transmit routine will get in before the "ptr += 1" and you'll send characters in the wrong order. Code like that is a hazard, as both the mainline and the ISR are changing "ptr", so all accesses to it must be locked somehow (interrupts disabled usually), and it has to be declared as "volatile" too, or even if you write your code in the "right order" (increment ptr before setting TIE) the compiler is free to re-order those operations.
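The ordering hazard above can be reproduced on a host with a small model. This is only a sketch: `wire`, `tie`, `isr()` and the two `start_*` functions are invented for illustration, and the `preempted` flag stands in for "some other interrupt delayed the mainline so the SCI interrupt fired right after TIE was set".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t wire[8];                 /* bytes "seen on the line"      */
static size_t wire_len;
static const uint8_t *volatile ptr;     /* shared by mainline and ISR    */
static volatile int tie;                /* stand-in for SCIxC2[TIE]      */

static void write_data(uint8_t b)       /* stand-in for a write to SCIxD */
{
    assert(wire_len < sizeof wire);
    wire[wire_len++] = b;
}

void isr(void)                          /* transmit ISR: send next byte  */
{
    if (tie) {
        write_data(*ptr);
        ptr = ptr + 1;
    }
}

/* Order quoted in the post: write, enable TIE, then advance ptr. */
void start_hazard(const uint8_t *msg, int preempted)
{
    wire_len = 0;
    ptr = msg;
    write_data(*ptr);
    tie = 1;
    if (preempted) {
        isr();                          /* ISR sees the stale pointer    */
    }
    ptr = ptr + 1;
}

/* Same steps, but ptr is advanced before TIE is enabled. */
void start_ordered(const uint8_t *msg, int preempted)
{
    wire_len = 0;
    ptr = msg;
    write_data(*ptr);
    ptr = ptr + 1;
    tie = 1;
    if (preempted) {
        isr();
    }
}
```

With the message 0x01, 0x00, 0x78 and `preempted` set, one more `isr()` call after `start_hazard()` leaves 0x01, 0x01, 0x78 in `wire` (a byte repeated and one skipped), while `start_ordered()` leaves 0x01, 0x00, 0x78. On real hardware the ordered version would still need the interrupt lock and the `volatile` qualifier mentioned above; the model only shows why the order matters.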

I then checked the Errata for this chip. Nothing is listed for the SCI.

I typed "SCI Errata" into Freescale's search, and found the following interrupt bug in some HCS12 chips. This isn't your problem, but makes for interesting reading. It is amazing what can go wrong sometimes.

http://cache.freescale.com/files/microcontrollers/doc/eng_bulletin/EB614.pdf

Tom

R_R_Ritchey · ‎03-06-2014

Don't know if you are interested but did put a test for TDRE still set after servicing the interrupt:

(void)(SCI1S1 == 0);        /* Status register read to clear flags */
SCI1D = TxByte;             /* Write byte to SCI 1 data register */
if (SCI1S1 & SCI_S1_TDRE)   /* TDRE still set? */
{
    asm { halt }            /* DEBUG */
}

I do occasionally hit the halt. Interrupts are disabled around this code. I don't hit the halt as often as I see dropped bytes. I am probably just going to live with this. I have my serial protocol designed to handle this type of situation so it's not a killer, just annoying as anything.

TomE · ‎03-06-2014

> Don't know if you are interested

I'm interested in responses to all of my suggestions and questions. How are you starting transmission? Are you just enabling the transmit interrupt or are you writing a byte and enabling? What does this chip REQUIRE that you do? Do any expected operations not work and require weird workarounds? I call reading SCI1S1 first to clear TDRE unnecessary, weird and likely to cause problems.

If you haven't done so I'd suggest changing that "(void)(SCI1S1 == 0);" to test for TDRE not being set. If you have run this test then please say so. I can't see what you have and haven't done from here. :-)

> I do occasionally hit the halt.

Now that I think about this I would expect that. I'm assuming this is the first write into an empty SCI. Both the shift and transfer registers are empty. So you write to SCI1D and then on the next SCI clock it gets transferred to the shift register and TDRE gets set again. Somewhere between ZERO and 10us later. So depending on the phasing of that clock and your code it will sometimes transfer between the write to SCI1D and the read of SCI1S1.

But what about the other tests?

You could change your code to work on

How do you know it isn't the receiver dropping the bytes? Do you have an independent comms monitor proving it is the transmitter's fault? If the receiver ever gets delayed by 100us then the bytes will overrun the receiver and you'll lose them there. Does the receiver code detect and report overruns?

Tom

R_R_Ritchey · ‎03-07-2014

Hi Tom, I think I answered some of your questions in the first post. To clear TDRE it is required to read SCIxS1 then write SCIxD. As I stated in the earlier post, I enable interrupts in the main line and nothing else in the main line. The interrupt handler takes it from there. TDRE will always be set at "(void)(SCI1S1 == 0);" because that is what caused the interrupt and it's before the data register is written which actually clears TDRE.

Not sure about which other tests you are talking about.

I have a logic analyzer on the line and can see the dropped byte, I will attach a jpg. I am not depending on the receiver to tell me when there are errors. I am catching the hard error on the logic analyzer. You can see the dropped byte on the very right. Also, you can see all the interrupt code in the zip attachment to the earlier post.

Thanks for your input.

TomE · ‎03-08-2014

> I do occasionally hit the halt. Interrupts are disabled around this code.

> I don't hit the halt as often as I see dropped bytes.

As I replied, I'd expect this to occasionally happen on the very FIRST byte loaded in at the start of a message, but not otherwise. Since you're stopping on the halt, you can't tell if that "TDRE is still set" is the first byte or it is then just about to drop a byte and might lead to the cause. Can you change the "halt" to count the occurrence and then inspect the counter after detected drops to see if they ever line up?
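One way to replace the halt with counters, as a rough sketch. The struct and function names are invented here; the caller is assumed to pass a fresh read of SCI1S1 taken right after the SCI1D write, plus the index of the byte just written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCI_S1_TDRE 0x80u  /* SCIxS1: transmit data register empty */

typedef struct {
    uint32_t first_byte;   /* TDRE still set after writing byte 0       */
    uint32_t later_byte;   /* TDRE still set after writing a later byte */
} tdre_hits_t;

/* Count instead of halting, and keep first-byte hits apart from the rest. */
void tdre_check(volatile tdre_hits_t *hits, uint8_t s1_after_write,
                unsigned byte_index)
{
    assert(hits != NULL);
    if (!(s1_after_write & SCI_S1_TDRE)) {
        return;                     /* TDRE dropped as expected */
    }
    if (byte_index == 0u) {
        hits->first_byte++;
    } else {
        hits->later_byte++;
    }
}
```

If `later_byte` stays at zero while bytes are still being dropped, the "TDRE still set" observation is probably just the expected first-byte case and not the cause.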

> Some of the blocks are new and some are made from Bakelite,

Is it possible that a module like this may have a crack in it? If we don't look we won't find out.

> Gone are the days when manufacturers provided simple gate-level circuits for interrupt controllers

TDRE has a "read-before-clear-by-write" requirement which means it isn't simply implemented like UARTs always used to be in simple gates. I've found a bug in the LCD Controller in the MCF5329. In there is an "MCF_LCDC_LISR" register that reports the current interrupts. It is a read-to-clear register (always a bad sign), and it is only meant to be read immediately after an interrupt has happened. I was polling it to see when the next interrupt happened, and it didn't work sometimes. If the chip is trying to SET a bit in that register on the same clock cycle that it is being READ, the "read" wins and the bit-set fails completely. So you lose interrupt and status indications. On an LCD screen. Where you can SEE the dropped interrupts.

The same thing might be happening here. If the chip is trying to set a bit in the SCIxS1 register on the same clock cycle that you're either reading it (to set up the clear) or writing to SCIxD, then it might fail to clear TDRE, you get the double-interrupt and clobber the byte.

> Does that mean the SCI is receiving its own transmissions?

The following only applies if you are doing that. If so then the chip may be trying to set RDRF in the same cycle it is trying to clear TDRE, and failing. The combination of the CPU clock speed, the 100kHz baud rate and your code might be lining these clocks up sometimes. That's easy to find, just read and save SCIxS1 before and after the write to SCIxD, and then test to see if RDRF is set (or has become set) at the same time that TDRE stays set and it isn't the first byte written. And then it goes on to drop a byte.
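A sketch of that before/after check, written as a pure function so it can be tried off-target. The snapshot struct and function name are hypothetical; on the chip the two status values would be reads of SCIxS1 taken immediately before and after the SCIxD write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCI_S1_TDRE 0x80u  /* SCIxS1: transmit data register empty */
#define SCI_S1_RDRF 0x20u  /* SCIxS1: receive data register full   */

typedef struct {
    uint8_t  before;       /* SCIxS1 read just before the SCIxD write */
    uint8_t  after;        /* SCIxS1 read just after the SCIxD write  */
    unsigned index;        /* index of the byte written               */
} s1_snapshot_t;

/* Returns 1 when TDRE stayed set on a non-first byte while RDRF was set
   (or became set) around the same write, 0 otherwise. */
int rdrf_coincides(const s1_snapshot_t *snap)
{
    assert(snap != NULL);
    int tdre_stuck = (snap->after & SCI_S1_TDRE) != 0 && snap->index != 0u;
    int rdrf_seen  = ((snap->before | snap->after) & SCI_S1_RDRF) != 0;
    return tdre_stuck && rdrf_seen;
}
```

Saving the snapshots that return 1 and comparing them against the messages that actually lost a byte would show whether the two events line up.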

If you are reading back your own transmissions, you can catch this error during the readback.

This chip has the SCIxC2[RE] bit, which means you could disable the receiver during transmits to see if the problem goes away. The only tricky bit is that you don't want to enable the receiver again until TC is set or you'll receive the last byte or part thereof and have to handle this apparent error.
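As a diagnostic experiment only, the receiver gating could look roughly like this. The helper names are invented, and the register is passed in by pointer so the logic can be run on a host; on the target the pointer would be the address of SCIxC2 and the status byte a read of SCIxS1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCI_S1_TC 0x40u  /* SCIxS1: transmission complete */
#define SCI_C2_RE 0x04u  /* SCIxC2: receiver enable       */

/* Turn the receiver off before the first byte of a message goes out. */
void rx_disable_for_tx(volatile uint8_t *c2)
{
    assert(c2 != NULL);
    *c2 &= (uint8_t)~SCI_C2_RE;
}

/* Turn the receiver back on only once TC is set, so the tail of the
   last transmitted byte is not received. Returns 1 when re-enabled. */
int rx_enable_when_done(uint8_t s1, volatile uint8_t *c2)
{
    assert(c2 != NULL);
    if (!(s1 & SCI_S1_TC)) {
        return 0;        /* last byte still shifting out */
    }
    *c2 |= SCI_C2_RE;
    return 1;
}
```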

Tom

TomE · ‎03-08-2014

> Hi Tom, I think I answered some of your questions in the first post.

Yes, sorry. I didn't read the post to Shaun, but managed to scroll past it and only saw (and replied to) the second one.

> have about 30 years experience writing low level routines on micros in both assembler and C.

Me too. You've done everything I'd have done up to this point. So I'll tell you what I'd do next.

I'd disassemble all the code and read the assembly. You're lucky (in a way) that you have a compiler that respects the "register" keyword, (gcc knows better so it won't), but just check that it is doing what you want. Having that facility can make you write "unreliable" code though. I *REALLY* don't like the "SaveSR_DI()" and "RestoreSR()" pseudo-functions as they are really PSEUDO functions, being macros. Are you sure the compiler guarantees to expand in-line assembly and NOT be using the D0 register for something else during that function? In gcc I'd be forced to write them as "inline assembly functions" with a more complicated syntax, but one that at least guarantees register allocations. Does your compiler provide assembly functions possibly named "asm_set_ipl()" rather than using your own?

The next thing I'd do is to start timestamping and logging everything to get an idea of the execution timing when things go wrong. Program up a spare DMA Timer to free-run at 1MHz (or more, but 1MHz/1us is convenient) from 0x00000000 to 0xffffffff and then simply write "log entries" consisting of a 32-bit "entry identifier" and the 32-bit DTCNn register into a circular log. Then you can track what happened when, and if the interrupt did happen twice and too close together. You could even write the SCI Status register into that log.
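A circular log of that kind can be very small. The sketch below is host-runnable and makes a few assumptions: the names are invented, the timestamp is passed in as a parameter (on the target it would be a read of the free-running timer count), and the log size is a power of two so the index can wrap with a mask.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_SIZE 256u      /* must be a power of two for the mask below */

typedef struct {
    uint32_t id;           /* "entry identifier": which event this was  */
    uint32_t stamp;        /* timer count when the event was logged     */
} log_entry_t;

static volatile log_entry_t log_buf[LOG_SIZE];
static volatile uint32_t log_head;     /* total number of entries written */

/* Append one entry, overwriting the oldest once the log is full.
   Call with interrupts masked if both mainline and ISRs log events. */
void log_event(uint32_t id, uint32_t stamp)
{
    uint32_t slot = log_head & (LOG_SIZE - 1u);
    log_buf[slot].id = id;
    log_buf[slot].stamp = stamp;
    log_head = log_head + 1u;
}

/* Read back the entry written `age` events ago (0 = newest). */
log_entry_t log_get(uint32_t age)
{
    assert(age < LOG_SIZE && age < log_head);
    uint32_t slot = (log_head - 1u - age) & (LOG_SIZE - 1u);
    log_entry_t e = { log_buf[slot].id, log_buf[slot].stamp };
    return e;
}
```

Logging one entry at ISR entry and one after the data register write, with the status register value folded into the identifier, would show whether two transmit interrupts ever land closer together than one character time.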

Are you sure it is only the first or second bytes that go missing, or are those the only ones you NOTICE going missing because they mess up the protocol? "Only the first two" means something, but I don't know what.

Your Analyser trace shows two types of messages. I assume they're the transmit and receive data. If your analyser is monitoring a single line, then it looks like you have a Serial Bus. Does that mean the SCI is receiving its own transmissions? You may have found a problem related to simultaneous receive and transmit interrupt generation/handling/clearing, or the code that enables/disables the receive interrupts.

> TDRE will always be set at "(void)(SCI1S1 == 0);" because that is what caused the interrupt and it's before the data register is written which actually clears TDRE.

Yes, but. I'd call that an unwarranted assumption until proved otherwise. Gone are the days when manufacturers provided simple gate-level circuits for interrupt controllers that we could easily understand. Or that they could easily understand and document properly. Something is wrong, so I'd guess that assumption might be wrong and would add code to check it. It might lead to a workaround. It might lead to a demonstration of the problem you can send to Freescale.

> have about 30 years experience

Try this for some nostalgia:

http://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-how-it.html?m=1

> Also, you can see all the interrupt code in the zip attachment

That looks OK. It is hard to see how that might go wrong. I'd disassemble it and check that the compiler headers have "volatile" in all the right places and that they're accessing all registers at the correct width (unlike the PIT headers).

Remember these chips are "lego sets" where the functional blocks are clipped together. Some of the blocks are new and some are made from Bakelite, and have evolved from 6800 peripherals via the 68HC11. The 8-bit peripherals (SCI, PWM) are the old ones while the 32-bit DMA Timers are new ones. Some old peripherals have very slow access times; that's why they made Rapid GPIO on some chips, as the ordinary GPIO takes so many clock cycles to read and write. So you may have found an "old slow peripheral on fast CPU" problem. Try adding a bunch of NOPs in your interrupt service routines.

If all else fails, transmit on Transmit Finished rather than TDRE.

I've searched for "SCI TDRE" in these forums and got a lot of hits, but nothing matching your problem.

Tom

R_R_Ritchey · ‎03-08-2014

Hi Tom. Not sure why you don't like "SaveSR_DI()" and "RestoreSR()", it's the only way to bump the Coldfire to uninterruptible and then back to the previous level (0-6), but I have checked them thoroughly in both disassembled code and by stepping through them. The compiler is doing the right thing. These macros really should be pre-defined, as on a micro with multiple interrupt levels you often need to block interrupts completely for a moment in an interrupt routine and then go back to the interrupt's actual level. CodeWarrior has nothing predefined as far as I can find.

Yes, this is a serial bus, transmissions both ways on the same wire. Messy but necessary in my case. It may be related to that. I basically disable all receive interrupts while transmitting but leave the receiver active as I utilize the idle-line detect function to see end of packets and RDRF must go high to enable the idle-line interrupt later.

Like the Z80 stuff. I actually started programming on an F8 in college. Still have the manual for that thing.

At this point I am going to move on. Thanks for all your help. The protocol I designed will handle this, it just really bugs me.

TomE · ‎03-08-2014

> Not sure why you don't like "SaveSR_DI()" and "RestoreSR()",

Because you can use numeric registers in that assembly that the C code might be using to store important variables that should be preserved, and that the assembly might overwrite. I suspect that instead of using "D0" you should be using a temporary register variable from the function itself.

> These macros really should be pre-defined as on a micro with multiple interrupt levels

They were apparently supplied in CW7. There should be an equivalent.

Google for "asm_set_ipl()" and follow links back to this forum.

Here's some source, which matches what I use, but probably won't work in CodeWarrior without translation

http://subversion.assembla.com/svn/coldfire-thermal/trunk/thermal_coldfire/Sources/mcf51xx/mcf5xxx_l...

Here's a previous post on this subject:

Re: Problem with interrupt and interrupt counter

There's a reference in the above to the "support files" in older Code Warrior versions that had "asm_set_ipl()" in them. CW 10 should provide something similar. If it doesn't I don't know what they're thinking:

https://community.freescale.com/message/144503#144503

Here's another question where CW5 supplied these files but CW10 doesn't. The solution is to include the CW5 files in the CW10 project. The take-home message is that CW10 seems to easily support "LED-blinky" projects, but not ones using interrupts.

https://community.freescale.com/message/88858#88858

> I basically disable all receive interrupts while transmitting but leave the receiver active

So you're getting solid overruns on the receiver? That's what you're doing that is unusual. You may be triggering a bug where the receive error flags are behaving differently to the Receive Enable flags and their setting is interfering with the transmits. The problem may go away if you received your transmissions, either by enabling interrupts or by reading during the transmit interrupt.

> At this point I am going to move on.

If you ever move back to this and nail the real cause, please let us know. You may save others wasted time and failed products and unhappy customers.

Tom

R_R_Ritchey · ‎03-06-2014

Hi Shaun, this is a custom application. Both devices know to run at 100K baud. The receiving device is working just fine, I can see all this on the logic analyzer.

Hi Tom, let me go through your suggestions but I have tried most. I always read SCIxS1 immediately before stuffing SCIxD. I also disable interrupts so that I cannot be interrupted between the SCIxS1 read and the SCIxD write. I enable TIE in the main line code but do not stuff anything there. I realize that TDRE will already be set and I will go immediately to the interrupt routine to stuff the first byte. I have verified that the buffer pointer is set up before TIE is enabled. The only code that increments the pointer is in the interrupt routine, and interrupts are disabled around all the code that manipulates the buffer pointer down to the SCIxD write. I may sound inexperienced posting this but I have about 30 years experience writing low level routines on micros in both assembler and C. I really have tried everything you mentioned here except for the interrupt routine code (that was next). Thanks for all your suggestions but I did try all this before posting. I don't like to post trivial issues. I realize I did not phrase the original question well. Also, if you would like to look at the relevant interrupt code I extracted it to one file and it's attached.

yibbidy · ‎03-05-2014

100K baud is an unusual rate. Are you confident that the device that is receiving the message is doing so correctly, and that it is not dropping the character?

How long are the messages? Are you losing one out of three characters when it fails or one out of 1000?

What happens if you change the baud rate to something slower?

Shaun
