Unrecoverable state following extra byte read on MKL27Z4

chrismc · ‎08-25-2016

We were seeing instances of the I2C bus getting stuck in an unrecoverable state during I2C master reads on the MKL27Z4. The issue seemed to occur more often when the CPU was 'under pressure', i.e. when the I2C wasn't being serviced quickly. Looking at the I2C bus activity, it appeared the I2C controller would attempt to read one more byte than expected, after which the I2C lines would be left in their non-default (low) state.

We guessed that the I2C controller's behavior was not following the pattern expected by the CPU firmware, so that the two fell out of sync and either the controller generated an extra byte read or the CPU dropped a byte read. Looking into it, that did seem to be occurring...

We were using I2C_MasterTransferBlocking() of KSDK 2.0. With this, multi-byte reads should have been triggered one byte at a time in the status-polling read loop in I2C_MasterReadBlocking(). We tried disabling interrupts globally within the loop and the issue stopped occurring, which implies there's an unexpected timing dependency involved (between I2C state changes and CPU servicing). When we then added a 25usec delay within this loop, the issue would always occur. With just a 20usec delay (shorter than a one-byte transmission on the 400K I2C bus), the issue wouldn't occur. From the reference manual, my impression was that byte reads or writes should only be triggered by accesses of the DATA register, but from the behavior mentioned above that didn't seem to be the case. It appeared that the controller would perform another read before the CPU was ready to trigger it (by reading the value out of the data register). Possibly the controller was just continuing with the multi-byte read, or perhaps an extra byte read was being triggered unexpectedly early in the sequence by a different register access?
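As a rough check on the 20usec / 25usec observation: one byte on the bus takes 9 SCL periods (8 data bits plus the ACK/NACK bit), which at a nominal 400 kHz is 22.5usec. The sketch below only does that arithmetic. The function names are made up for illustration, and it assumes a nominal clock with no clock stretching, so it is consistent with the observed threshold rather than a confirmed explanation of it.

```c
#include <stdint.h>

/* Time in nanoseconds for one I2C byte on the wire: 8 data bits plus the
 * ACK/NACK bit, i.e. 9 SCL periods. Assumes a nominal SCL frequency and
 * ignores clock stretching and START/STOP overhead. */
static uint32_t i2c_byte_time_ns(uint32_t scl_hz)
{
    return (uint32_t)((9ULL * 1000000000ULL) / scl_hz);
}

/* Returns 1 if a servicing delay is shorter than one byte time, else 0.
 * (Hypothetical helper, not part of any SDK.) */
static int i2c_delay_fits_in_byte(uint32_t delay_ns, uint32_t scl_hz)
{
    return delay_ns < i2c_byte_time_ns(scl_hz);
}
```

With these assumptions, a 20usec delay fits inside one byte time at 400 kHz and a 25usec delay does not, while at 100 kHz the byte time grows to 90usec.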

We've been able to work around this with either the global interrupt disable within the read loop or switching to I2C_MasterTransferNonBlocking() (which is interrupt based instead of polling based). However, it'd be good to get confirmation of this behavior and whether it's a known issue or not.

Thanks.

rastislav_pavlanin · ‎09-05-2016

Hello Chris,

After internal discussion, the errata quoted in my earlier reply will also cover master mode (not only slave), as also verified by our team. This most probably also affects your application. I believe that DMA can work as a workaround for you. However, you need to ensure that in your application the DMA has the highest priority/arbitration on the crossbar switch, to minimize the latency on the system bus (to within the 20us you described).

regards

R.

rastislav_pavlanin · ‎08-25-2016

Hello Chris,

Do you utilize the double buffer feature of the I2C? Does it also happen when the baudrate is decreased to 100k?

We have an erratum that has not yet been updated, but it relates more to I2C slave mode than to master mode.

Title:

−I2C: Slave does not hold bus between byte transfers and may result in lost data

Errata Description:

−When the I2C module is used as a slave device, the bus is not held by the slave between byte transfers. If the I2C slave I2C_D register and the data buffer are full, incoming data from an I2C master will overwrite data in the data buffer.

Errata Workaround:

−When configured as a slave, the delay in processing incoming bytes should be minimized. Delay can be minimized by the use of DMA or increased interrupt priority

regards

R.

chrismc · ‎08-30-2016

Thanks for the reply and apologies for my slow followup.

On your questions:

Double buffer feature: We're using the NXP SDK 2.0 code. As far as I can tell the code is accounting for it but not attempting to make use of it.
Decreasing the baudrate: this could help alleviate the problem, but I don't feel it's really addressing the fundamental issue.
Errata: that doesn't really seem directly relevant.

This issue is continuing to pose a problem for us. I was wrong in my assumption that just using the interrupt based I2C_MasterTransferNonBlocking() interface would provide a good workaround.

The fundamental issue seems to be that while on the surface the I2C controller appears to have an interaction model that is driven by the CPU (which would allow the CPU to do things as it wants without timing constraints), it actually does place service timing conditions on the CPU. This occurs whether the transaction is polling, interrupt or DMA based.

If I put a 25usec delay in any of the polling loop, the I2C interrupt handler or the DMA completion handler, the transaction will fail every time. Could you confirm this is expected/acceptable behavior? If so, it places quite tight restrictions on the CPU software. With us already running I2C at the highest IRQ priority, I can think of the following options to work around the problem, but none of them are great for us:

Spend time ensuring global interrupts are not disabled for longer than 20usec (with a 400K I2C speed) anywhere in software.
Use the slower 100K I2C speed to lower the chance of the issue occurring (this has a major impact for us due to data length / time).
Use polling-based transfers with global interrupts disabled (not viable for us due to data length / time).

Reading the reference manual and looking at the SDK I2C polling and interrupt code, I can't see where this service timing dependency comes from. With the I2C DMA code however, a timing dependency would make sense if the DMA transfer was kicking off the last byte on the I2C bus... that would mean the completion handler would need to set the NACK for the last byte read before it completed. However, that would mean that the TXAK would be applying to the current byte in transfer, which is not consistent with the manual.

Would you be able to provide any insight into this?

Thanks.

mjbcswitzerland · ‎08-31-2016

Hi Chris

The double-buffered I2C controller in the newer parts has become a source of difficulty reported in many forum posts. Initially NXP stated that there was "no software impact" (see original migration guide) but changed this later (see newest migration guide) to essentially state that the impact on the software is "huge". It seems the NXP software is somewhere between the two and tends to use various seemingly random time-delay waits (also in interrupts) to somehow work around things that don't seem to be understood or documented. It may or may not work, depending on the wind direction it seems ;-)

See also
https://community.nxp.com/message/575517
and the attached PDF showing how to handle interrupts in master and slave modes.

This has been used in the uTasker project in intensive industrial use (400kb/s with multiple I2C slaves used intensively in parallel with various other peripherals, such as USB audio) and may therefore serve as a base to clean up the example code (no delays are required and no need for low latency in master mode is known).
I have attached the uTasker I2C driver as reference, which has been used in many Kinetis based products since 2011 - including the newer double-buffer operation marked accordingly by defines.

Regards

Mark

chrismc · ‎09-01-2016

Hi Mark

Thanks greatly for that reply.

I've read through the linked post and your post on the KL43. Unfortunately we've hit the situation mentioned by Bob - "Problem with putting it in the migration guide is if it is a new design it is not likely to be looked at, as there is nothing being migrated."

I've just had a quick look at the code you generously attached. For the "double buffering" I could see how you were using the start interrupt where the KSDK uses a short delay for the repeat start issue. The delay used in KSDK isn't long enough to be a major issue for us I think (even in an IRQ handler).

It seems to me that I'm hitting something very similar to what you saw on the KL43 (but on the KL27). From a high level it seems as if, with the introduction of the double buffering, the synchronous interaction model between the I2C controller and software control was lost to a degree: the I2C controller seems to continue bus activity without initiation from software control. This results in the CPU being required to keep up with the controller, which is the service timing dependency. In fact the KL43 post essentially states that - "Our recommendation is to do this using interrupts since a polling firmware migh not be fast enough to detect when the 2nd to las byte has been received and set the TXACK flag on time." So the double-buffered parts attempt to read until they are told to stop, and if they aren't told to stop in time (within approx. 20usec with a 400K clock), the transfer will fail.
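The ordering that the quoted recommendation relies on can be written down as a small decision function. This is only a sketch of the classic Kinetis master-receive sequence (set TXAK once the second-to-last byte has arrived, generate STOP before reading the last byte). The names rx_action_t and rx_next_action are hypothetical, no register is touched, and the real double-buffered sequence may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* What the receive handler should do before reading the data register.
 * Hypothetical names, for illustration only. */
typedef enum {
    RX_READ_DATA,           /* plain read; the read starts the next byte  */
    RX_SET_NACK_THEN_READ,  /* set TXAK so the final byte gets a NACK     */
    RX_STOP_THEN_READ       /* generate STOP, then read the final byte    */
} rx_action_t;

/* bytes_left counts the bytes not yet read out of the data register,
 * including the one that has just been received. */
static rx_action_t rx_next_action(size_t bytes_left)
{
    assert(bytes_left > 0);
    if (bytes_left == 1) {
        return RX_STOP_THEN_READ;
    }
    if (bytes_left == 2) {
        return RX_SET_NACK_THEN_READ;
    }
    return RX_READ_DATA;
}
```

If the behavior described in this thread is right, the RX_SET_NACK_THEN_READ step is the one with a deadline: on the double-buffered parts it apparently has to happen within roughly one byte time, otherwise the controller clocks in a byte that the software never asked for.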

I think we'll need to look more at how recoverable the situation is. Do you have any ideas on that Mark? If we were able to detect and recover through a reinitialization sequence that would be great for our usage scenario (as a short single data drop does not have a critical impact). I noticed your driver had code/comments around bus lockup state recovery - was that for the Kinetis side or troublesome I2C parts?

Cheers, Chris

mjbcswitzerland · ‎09-02-2016

Hello Chris

Possibly you already saw it, but I originally posted about problems that were encountered when migrating from parts based on the original I2C controller to double-buffered ones (all newer parts with the double-buffered design are relevant, since I tested on quite a lot of newer chips too with identical results, including the KL27): KL43 I2C problem

It was quite obvious that the master could be seen sending out continuous clocks, rather than sending out 9 and allowing a slave to hold the bus, but after much experimentation to find the best way to handle the master and slave modes using interrupts I don't know that this is an issue in practice any more.

During intensive tests there was an additional possibility of a spurious loss of bus ownership that was worked around quite simply (see modification {1} in the code).

The I2C bus recovery code is not related to buffered mode but is a standard requirement when slave devices are not reset together with a master:
https://community.nxp.com/thread/336594
All I2C drivers should have it to avoid the risk of a product resetting and never being able to work with slaves. NXP framework code doesn't use it because it is more "example code" where there has been no consideration of real-world situations in real products, and it can thus be made to fail catastrophically quite simply.
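For readers who have not seen this kind of recovery before, here is a minimal sketch of the usual approach: with the pins under GPIO control, clock SCL up to nine times until the slave releases SDA, then generate a STOP. It is not the uTasker code. The i2c_pins_t hooks and all function names are hypothetical, and the small simulated slave is included only so that the logic can be run on a host.

```c
#include <stdbool.h>

/* Hypothetical pin hooks. A real port would drive SCL/SDA as open-drain
 * GPIOs while the I2C peripheral is disabled. */
typedef struct {
    void (*scl_write)(void *ctx, bool high);
    void (*sda_write)(void *ctx, bool high);
    bool (*sda_read)(void *ctx);
    void (*delay_half_period)(void *ctx);
    void *ctx;
} i2c_pins_t;

/* Clock SCL up to 9 times until SDA is released, then send STOP.
 * Returns true if SDA ends up high (bus idle). */
static bool i2c_bus_recover(const i2c_pins_t *p)
{
    p->sda_write(p->ctx, true);                  /* release SDA */
    for (int i = 0; i < 9 && !p->sda_read(p->ctx); i++) {
        p->scl_write(p->ctx, false);
        p->delay_half_period(p->ctx);
        p->scl_write(p->ctx, true);
        p->delay_half_period(p->ctx);
    }
    if (!p->sda_read(p->ctx)) {
        return false;                            /* SDA still held low */
    }
    /* STOP condition: SDA rises while SCL is high. */
    p->scl_write(p->ctx, false);
    p->sda_write(p->ctx, false);
    p->delay_half_period(p->ctx);
    p->scl_write(p->ctx, true);
    p->delay_half_period(p->ctx);
    p->sda_write(p->ctx, true);
    p->delay_half_period(p->ctx);
    return p->sda_read(p->ctx);
}

/* Simulated slave for host-side testing only (not hardware code): it
 * holds SDA low until it has seen a given number of SCL rising edges. */
typedef struct {
    bool scl;
    bool master_sda;
    int clocks_until_release;
} sim_bus_t;

static void sim_scl_write(void *ctx, bool high)
{
    sim_bus_t *b = ctx;
    if (high && !b->scl && b->clocks_until_release > 0) {
        b->clocks_until_release--;               /* one rising edge seen */
    }
    b->scl = high;
}

static void sim_sda_write(void *ctx, bool high)
{
    ((sim_bus_t *)ctx)->master_sda = high;
}

static bool sim_sda_read(void *ctx)
{
    const sim_bus_t *b = ctx;
    /* Wired-AND: the line is high only if nobody pulls it low. */
    return b->master_sda && b->clocks_until_release == 0;
}

static void sim_delay(void *ctx) { (void)ctx; }

static bool sim_recover(int clocks_until_release)
{
    sim_bus_t bus = { true, true, clocks_until_release };
    i2c_pins_t pins = { sim_scl_write, sim_sda_write, sim_sda_read,
                        sim_delay, &bus };
    return i2c_bus_recover(&pins);
}
```

In the simulation, sim_recover(3) succeeds because the slave lets go after three clocks, while a slave that needs more than nine clocks is reported as still stuck, at which point a product would have to fall back on something else (for example power-cycling the slave, if the hardware allows it).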

Master mode I2C bus recovery itself is not difficult since it is essentially a re-initialisation of the I2C interface, but the situation probably has to be detected at a higher level (i.e. in the application [some form of polling watchdog perhaps]), and the method is also application dependent. The impact of the disturbance at the application level is also an important factor in the design of this, since some applications may be more sensitive to such failures than others. Presently the products that I am involved with which are based on double-buffered parts seem to be adequately reliable (at 400kHz and based on several months of intensive operation) but I am not 100% convinced that there couldn't be something that could occur in situations of higher latency - or, e.g., when more features are added which start to cause some new effects to become apparent. For mission-critical designs I would presently avoid parts with double-buffered controllers - when the I2C is a factor - in favour of more established I2C controller designs which have been proven over a number of years.
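One possible shape for such a polling watchdog, purely as an illustration: the application calls a check periodically, and if either line is seen low on several consecutive polls while no transfer is supposed to be in progress, the bus is treated as stuck and the re-initialisation is triggered. The type and function names below are hypothetical, and a suitable threshold depends entirely on the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-level detector for a stuck I2C bus. */
typedef struct {
    uint32_t idle_low_polls;  /* consecutive polls with a line low while idle */
    uint32_t threshold;       /* polls required before declaring the bus stuck */
} i2c_stuck_detector_t;

/* Call periodically. Returns true once SCL or SDA has been low for
 * `threshold` consecutive polls with no transfer in progress. */
static bool i2c_stuck_poll(i2c_stuck_detector_t *d, bool scl_high,
                           bool sda_high, bool transfer_active)
{
    if (transfer_active || (scl_high && sda_high)) {
        d->idle_low_polls = 0;               /* bus busy by design, or idle */
        return false;
    }
    if (d->idle_low_polls < d->threshold) {
        d->idle_low_polls++;
    }
    return d->idle_low_polls >= d->threshold;
}
```

The counter resets whenever a transfer is active or both lines are high, so only a line that stays low while the driver believes the bus is idle trips the detector.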

Regards

Mark

chrismc · ‎09-02-2016

Thanks for those extra comments Mark. Understood on all of that.

I think the situation is now probably about as clear as it's going to get for us unless NXP comes back with something.

rastislav_pavlanin · ‎08-31-2016

Hi Chris,

I have forwarded this to our I2C expert. As soon as I get a response I will let you know.

regards

R.

chrismc · ‎09-01-2016

Hi Rastislav

Thanks for the reply.

Could you please give an indication of when a response will be available?

It would be good if the response could also confirm (or not) the new comments by Mark and myself.

The I2C double buffering is currently presented as a feature in the reference manual. As the situation sits now, it would seem not only that this new functionality cannot be taken advantage of, but also that it adds restrictions on software usage. More than that, while the issues were reported quite some time ago, they haven't yet been covered in the errata documents, receiving just cursory coverage in the migration guide (which would not naturally be referenced for new project development). Can this situation be addressed?

Cheers, Chris
