RT685 i3c port with Zephyr - driver with i2c devices dies after minutes of use

galenc · ‎02-04-2025

I have a system based around the MIMXRT685-EVK running Zephyr. I have the i3c port configured and interfacing with a few i2c devices. They all work for some time, but depending on the frequency of communication, the bus/driver will just die after a few minutes.

I have experienced this on the EVK and on our custom board, both with all the peripherals on the bus and with just a single one.

Here you can see the scope capture (yellow SDA, green SCL, violet device interrupt). It just stops trying any comms on the bus when it should be servicing the interrupt:

[Image: logic output]

It happens with this device, an IMU (lsm6dso, using the built-in Zephyr driver), sampling at a higher frequency from interrupts. When communicating with another device at a slower frequency, the bus stays up longer but eventually crashes as well (maybe 5-10 minutes with a polling message every second).

Here are the warnings we saw from the i3c_mcux driver just before the bus goes dead. The errors are failed attempts to read from device registers (the bus in the scope shows no activity from subsequent attempts, shown in red as errors):

[Image: warning from i3c_mcux before bus crash]

Has anyone seen similar issues with the i3c bus, and with the Zephyr implementation specifically?

Our whole bus is essentially i2c devices, and we don't need the features of i3c (we are using it because all the other ports are in use). Might it be more stable if there is a way to "force" it into i2c mode all the time?

I don't know if there is a bug in the driver or some way to configure it better. We are using it pretty much as it came. See the device tree sections:

&i3c0 {
	status = "okay";

	i2c-scl-hz = <400000>;
	i3c-scl-hz = <400000>;
	i3c-od-scl-hz = <400000>;

	clk-divider = <12>;
	clk-divider-slow = <1>;
	clk-divider-tc = <1>;

	pinctrl-0 = <&pinmux_i3c>;
	pinctrl-names = "default";

	lsm6dso0: lsm6dso@6a0000000000000050 {
		status = "okay";
		compatible = "st,lsm6dso";
		reg = <0x6a 0x00 0x50>;
		irq-gpios = <&gpio1 10 GPIO_ACTIVE_HIGH>;
	};
};

	pinmux_i3c: pinmux_i3c {
		group0 {
			pinmux = <I3C0_SCL_PIO2_29>,
					 <I3C0_SDA_PIO2_30>;
			input-enable;
			slew-rate = "slow";
			drive-strength = "high";
		};

		group1 {
			pinmux = <I3C0_PUR_PIO2_31>;
			slew-rate = "normal";
			drive-strength = "normal";
		};
	};

Any help is much appreciated.

Thanks, Galen

galenc · ‎02-24-2025

So I have been doing a lot of debugging and solved a couple of the failure modes. The last one I have narrowed down to this line of code in the i3c driver in zephyr:

https://github.com/zephyrproject-rtos/zephyr/blob/70f55bccf8af2866331f6fd84e05c600f4457704/drivers/i...

Basically the `mcux_i3c_do_request_emit_stop()` function in `drivers/i3c/i3c_mcux.c` gets stuck in this loop with a k_busy_wait(10) forever, unable to emit a stop. Because it's a busy wait, it also essentially blocks Zephyr from context switching, so my entire app is dead.
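One generic way to keep a polling loop like this from hanging the system is to bound the number of polls and return an error to the caller. The following is only a minimal sketch of that idea, not the Zephyr driver's code: `wait_state_exit`, the callback types, and the fake controller are hypothetical names so that the logic can run off-target. In the driver, the delay hook would correspond to the `k_busy_wait(10)` call.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical hooks so the sketch can run without hardware. */
typedef uint32_t (*read_state_fn)(void *ctx);
typedef void (*delay_us_fn)(void *ctx, uint32_t us);

/*
 * Poll until the controller leaves busy_state or max_tries polls have
 * been made. Returns 0 on success, -ETIMEDOUT if the state never changed.
 */
static int wait_state_exit(read_state_fn read_state, delay_us_fn delay_us,
			   void *ctx, uint32_t busy_state,
			   uint32_t max_tries, uint32_t step_us)
{
	for (uint32_t tries = 0; tries < max_tries; tries++) {
		if (read_state(ctx) != busy_state) {
			return 0;
		}
		delay_us(ctx, step_us);
	}
	return -ETIMEDOUT;
}

/* Fake controller: reports state 3 for polls_left reads, then state 0. */
struct fake_ctrl {
	uint32_t polls_left;
	uint32_t waited_us;
};

static uint32_t fake_read_state(void *ctx)
{
	struct fake_ctrl *c = ctx;

	if (c->polls_left > 0) {
		c->polls_left--;
		return 3U;
	}
	return 0U;
}

static void fake_delay_us(void *ctx, uint32_t us)
{
	struct fake_ctrl *c = ctx;

	c->waited_us += us;
}
```

With a bound in place, the caller decides what to do on `-ETIMEDOUT` (retry, attempt recovery, or report the failure) instead of the whole application stalling inside the busy wait.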

Like I mentioned, we aren't using any i3c features; all our peripherals are actually i2c. I'm not exactly sure what emitting a stop does, but it seems to be associated with the IBI? How is it possible that the subsystem has lost control of the bus completely? I'm not sure if you have some good ideas on how to solve this. I had two ideas, but I'm not sure how to implement them:

1. Disable all i3c features, or at least any that would make it unable to emit a stop (or need to at all), or force it into a mode where we can ignore the stop condition?

2. Is there a way to recover control of the bus here and get back to an initial state?

Other than that, can you provide some context as to what is going on in this function, i.e. what exactly is it waiting for? How can we make sure that condition eventually happens?

Thanks,
Galen

Dezheng_Tang · ‎02-25-2025

Hi, Galen,

IBI is an I3C-only feature and should be disabled if you are doing I2C communication only, e.g. make sure you are NOT doing anything to the MIBIRULES register.

But regardless of I2C or I3C, Emit Start/Stop requests are still used for generating the Start/Stop conditions of I2C. See Section "Reading and writing I2C messages using normal method" for more details.

I don't know what other I3C features you have enabled; regardless, don't register IBI.

Regarding recovering control, it depends on what kind of error you got first. A few things I can think of: you should make sure the master is in IDLE state, the master configuration is back to the POR default state, and there is no pending interrupt, i.e. all master status bits are cleared.

galenc · ‎02-26-2025

Hi, thank you for the quick response.

I get intermittent errors with some transactions that do not result in failures (I can share those too if you'd like), but here are the errors I get just before the bus fails. This is after communicating with the peripherals a few times a second for over an hour:

[01:09:10.791,048] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[01:09:10.791,059] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[01:09:10.791,063] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[01:09:10.791,067] <err> i3c_mcux: Timeout error
[01:09:10.916,786] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000
[01:09:12.042,217] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000
[01:09:13.169,180] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000

I modified `mcux_i3c_do_request_emit_stop()` to look like this to give me better logging (and to attempt some recovery by clearing everything; it doesn't work, the bus was still dead after these logs were emitted):

static inline int mcux_i3c_do_request_emit_stop(I3C_Type *base, bool wait_stop)
{
	reg32_update(&base->MCTRL,
		     I3C_MCTRL_REQUEST_MASK | I3C_MCTRL_DIR_MASK | I3C_MCTRL_RDTERM_MASK,
		     I3C_MCTRL_REQUEST_EMIT_STOP);

	/*
	 * EMIT_STOP request doesn't result in MCTRLDONE being cleared
	 * so don't wait for it.
	 */

	if (wait_stop) {
		/*
		 * Note that we don't exactly wait for I3C_MSTATUS_STATE_IDLE.
		 * If there is an incoming IBI, it will get stuck forever
		 * as state would be I3C_MSTATUS_STATE_SLVREQ.
		 */
		int stop_tries = 0;
		while (reg32_test_match(&base->MSTATUS, I3C_MSTATUS_STATE_MASK,
					I3C_MSTATUS_STATE_NORMACT)) {
			if (mcux_i3c_has_error(base)) {
				/*
				 * A timeout error has been observed on
				 * an EMIT_STOP request. Refman doesn't say
				 * how that could occur but clear it
				 * and return the error.
				 */
				if (reg32_test(&base->MERRWARN,
					       I3C_MERRWARN_TIMEOUT_MASK)) {
					mcux_i3c_errwarn_clear_all_nowait(base);
					return -ETIMEDOUT;
				}
				return -EIO;
			}
			k_busy_wait(10);
			stop_tries++;
			if (stop_tries > 10000) {
				uint32_t mstatus = base->MSTATUS;
				uint32_t merrwarn = base->MERRWARN;
				uint32_t ibirules = base->MIBIRULES;
				uint32_t mctrl = base->MCTRL;
				int err;

				LOG_ERR("Tried to emit stop 10000times..clearing all! MCTRL 0x%08x MSTATUS 0x%08x MERRWARN 0x%08x MIBIRULES 0x%08x",
					mctrl, mstatus, merrwarn, ibirules);
				stop_tries = 0;
				k_msleep(1000);
				mcux_i3c_errwarn_clear_all_nowait(base);

				/* Same mask as clear-all, but without the infinite busy wait. */
				uint32_t clear_mask = I3C_MSTATUS_MCTRLDONE_MASK |
					I3C_MSTATUS_COMPLETE_MASK |
					I3C_MSTATUS_IBIWON_MASK |
					I3C_MSTATUS_ERRWARN_MASK;

				/* Clear all, waiting at most 1 ms. */
				err = mcux_i3c_status_clear_timeout(base, clear_mask, 1000);
				if (err != 0) {
					LOG_WRN("timeout clearing status mask = 0x%08x", clear_mask);
				}
				/* return -ETIMEDOUT; */
			}
		}
	}
	return 0;
}

So it looks like there are no IBIs and the bus is in NORMACT and I2C mode, but it's trapped in the emit-stop wait indefinitely. When you say I should be in IDLE and POR state, which bits should I be setting in which registers to force that?

Thanks,
Galen

Dezheng_Tang · ‎02-26-2025

It looks like the I3C controller is in NORMACT state and the configuration is OK too. Both MCTRL and MSTATUS are OK. The problem is: why is your TIMEOUT bit (bit 20) set in MERRWARN even before you request emit stop? These are the conditions under which the TIMEOUT bit is set:

• The transmit FIFO or receive FIFO is not handled, and the bus is stuck in the middle of a message.
• No STOP is issued after data transfer (there is a gap between messages).

Can you check whether there is a condition where data sits in the FIFO for a period of time and you don't read it promptly?

I think the bus is in a fault condition and not back to idle state even before you emit stop. The problem is NOT in mcux_i3c_do_request_emit_stop().

[01:09:10.791,063] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
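For reference, the MSTATUS and MERRWARN values in these logs can be decoded with a few masks. This is a minimal sketch with made-up helper names, assuming the field positions match the RT685 reference manual: STATE in MSTATUS bits 2:0, ERRWARN in MSTATUS bit 15, and TIMEOUT in MERRWARN bit 20 (the last one is stated above). Please check the masks against your reference manual before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field positions; verify against the reference manual. */
#define MSTATUS_STATE_MASK 0x7U
#define MSTATUS_ERRWARN    (1U << 15)
#define MERRWARN_TIMEOUT   (1U << 20)

/* Extract the controller state field (3 means NORMACT). */
static uint32_t mstatus_state(uint32_t mstatus)
{
	return mstatus & MSTATUS_STATE_MASK;
}

/* Human-readable name for the state field. */
static const char *mstatus_state_name(uint32_t mstatus)
{
	static const char *const names[8] = {
		"IDLE", "SLVREQ", "MSGSDR", "NORMACT",
		"MSGDDR", "DAA", "IBIACK", "IBIRCV",
	};

	return names[mstatus_state(mstatus)];
}

/* True if MSTATUS reports that something is pending in MERRWARN. */
static bool mstatus_has_errwarn(uint32_t mstatus)
{
	return (mstatus & MSTATUS_ERRWARN) != 0U;
}

/* True if MERRWARN reports a timeout. */
static bool merrwarn_is_timeout(uint32_t merrwarn)
{
	return (merrwarn & MERRWARN_TIMEOUT) != 0U;
}
```

Applied to the log line above, MSTATUS 0x00009003 decodes to state NORMACT with ERRWARN set, and MERRWARN 0x00100000 is the timeout flag. The later 0x00001003 values are still NORMACT, but with no error pending.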

galenc · ‎02-27-2025

Well, it's been doing basically the same thing for an hour, over and over again, for thousands of transactions. As best I can tell I'm handling things promptly, but Zephyr has many tasks.

My core issue might be resolved if there was a way to recover the bus after the error has occurred. I'm not sure it's practical to prevent every timeout from happening. I actually get timeouts on transactions every couple of minutes and the bus is fine afterwards. If there is a special type of timeout condition causing the bus to die completely, how do I handle that and get back to normal operation?

I tried adding a `mcux_i3c_xfer_reset()` when it gets stuck, but it does not have any effect. The log looks slightly different, but the last line repeats forever.

[00:07:29.338,707] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,719] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,723] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,729] <err> i3c_mcux: Timeout error
[00:07:29.464,782] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001803 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:07:30.592,655] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:07:31.718,942] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:07:32.846,540] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:07:33.974,438] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000

Dezheng_Tang · ‎02-27-2025

[00:07:29.338,707] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,719] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,723] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0002d350 MSTATUS 0x00009003 MERRWARN 0x00100000
[00:07:29.338,729] <err> i3c_mcux: Timeout error

Looking into the Zephyr driver, I see this inside the read path:

/*
 * If controller says timed out, we abort the transaction.
 */
if (mcux_i3c_has_error(base)) {
	if (mcux_i3c_error_is_timeout(base)) {
		ret = -ETIMEDOUT;
	}
	/* clear error */
	base->MERRWARN = base->MERRWARN;

	/* for ibi, ignore timeout err if any bytes were
	 * read, since the code doesn't know how many
	 * bytes will be sent by device. for regular
	 * application read request, return err always.
	 */
	if ((ret == -ETIMEDOUT) && ibi && offset) {
		break;
	} else {
		if (ret == -ETIMEDOUT) {
			LOG_ERR("Timeout error");
		}
		goto one_xfer_read_out;
	}
}

I have a few questions:

(1) Based on the above code, if the mcux_i3c_has_error dbg message comes out, "Timeout error" should follow immediately. Why does the dbg message come out 3 times before "Timeout error" finally appears?

(2) If the TIMEOUT bit is set in MERRWARN, I think we should call mcux_i3c_fifo_flush(). Otherwise, the line below only hides the problem.

/* clear error */
base->MERRWARN = base->MERRWARN;

(3) mcux_i3c_xfer_reset() doesn't do anything to put the bus into IDLE state. mcux_i3c_recover_bus() looks like the right API to put the bus into idle state. I think both mcux_i3c_fifo_rx_drain() and mcux_i3c_fifo_flush() should be performed inside the bus recovery API.

(4) I don't understand why the timeout occurs every few minutes. I do think the bus will die if the timeout bit is set. However, the state machine of the master may be wrong and may not know how to handle FIFO data after this bit is set. We need to understand why the timeout occurs before thinking about a recovery mechanism.

(5) Inside mcux_i3c_transfer(), I am not sure whether there is OS starvation here. If you want to do an I3C transfer, the mutex is locked but the bus is not idle. mcux_i3c_wait_idle() is polling, right?

k_mutex_lock(&dev_data->lock, K_FOREVER);

mcux_i3c_wait_idle(dev_data, base);

galenc · ‎02-28-2025

Thank you. Those are helpful things to consider.

To answer some of your questions:
1) With debug enabled, each time we check for an error it prints that line, which happens 3 times in a single call of `mcux_i3c_do_one_xfer()` if it's a read request.

2) I added this to the driver.

3) I already have an implementation with `mcux_i3c_recover_bus()` -- basically if I get 10 errors in a row (without any successes) I will call that, but it almost never happens. Maybe there is a better time to call it?

4) & 5) These are interesting ideas. I will add some more logging and try some things to see if I can understand this better. I also considered that a higher-priority thread might preempt this one in the middle of a transfer, causing the timeout, so I adjusted some priorities. I still see timeouts regularly though, and I need to keep this at a relatively low priority because this bus serves secondary peripherals.

I noticed I also see IO errors on occasion. Again, they recover and move on; the bus only dies completely on the timeout ones.

[00:24:52.345,217] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0000d250 MSTATUS 0x00009403 MERRWARN 0x00100000
[00:24:52.345,228] <dbg> i3c_mcux: mcux_i3c_has_error: ERROR: MCTRL 0x0000d250 MSTATUS 0x00009403 MERRWARN 0x00100000
[00:24:52.345,249] <err> mp2651: failed to read CONFIG_REGISTER_2 err=-5
[00:24:52.345,256] <err> charger: error with battery data. recover count = 1

Dezheng_Tang · ‎03-04-2025

Regarding debugging the timeout: can you use GPIO toggling instead of LOG_DBG or LOG_ERR printouts when the timeout occurs?

Toggle GPIO1 when the timeout shows up in MSTATUS and GPIO2 when MRDATAB is read from the register. You can then trigger on the edge of GPIO1; I2C SCL and GPIO2 should be well synchronized, with the data read on the falling edge of SCL.

If there is a long delay before the data is read from MRDATAB after SCL goes low, there is something wrong in your app.

galenc

Hi @Dezheng_Tang, I set up the debug outputs as suggested. I also added a third GPIO to the attempted-read loop.

I made the changes as follows:

static int mcux_i3c_do_one_xfer_read(I3C_Type *base, uint8_t *buf, uint8_t buf_sz, bool ibi)
{
	int ret = 0;
	int offset = 0;
	
	while (offset < buf_sz) {
		/*
		 * Transfer data from FIFO into buffer. Read
		 * in a tight loop to reduce chance of losing
		 * FIFO data when the i3c speed is high.
		 */
		while (offset < buf_sz) {
			gpio_pin_set_dt(&dbg_2, 1);
			if (mcux_i3c_fifo_rx_count_get(base) == 0) {
				gpio_pin_set_dt(&dbg_2, 0);
				break;
			}
			gpio_pin_set_dt(&dbg_1, 1);
			buf[offset++] = (uint8_t)base->MRDATAB;
			gpio_pin_set_dt(&dbg_1, 0);
		}
		/*
		 * If controller says timed out, we abort the transaction.
		 */
		if (mcux_i3c_has_error(base)) {
			if (mcux_i3c_error_is_timeout(base)) {
				gpio_pin_set_dt(&dbg_0, 1);
				LOG_WRN("Flushing fifo due to timeout offset=%d, buf_sz=%d", offset, buf_sz);
				mcux_i3c_fifo_flush(base); /* added per NXP suggestion */
				ret = -ETIMEDOUT;
				gpio_pin_set_dt(&dbg_0, 0);
			}
			/* clear error  */
			base->MERRWARN = base->MERRWARN;

			/* for ibi, ignore timeout err if any bytes were
			 * read, since the code doesn't know how many
			 * bytes will be sent by device. for regular
			 * application read request, return err always.
			 */
			if ((ret == -ETIMEDOUT) && ibi && offset) {
				break;
			} else {
				if (ret == -ETIMEDOUT) {
					LOG_ERR("Timeout error");
				}
				goto one_xfer_read_out;
			}
		}
	}

	ret = offset;

one_xfer_read_out:
	gpio_pin_set_dt(&dbg_2, 0);
	return ret;
}

Here is a good read:

[Image: good read]

Here is a timeout where no data was actually read:

[Image: timeout]

Timeouts happen when no bytes are actually read, sometimes when one byte is read, and sometimes even when both are. It looks like that function in the driver is taking too long to read from the FIFO; perhaps a context switch in Zephyr has occurred. I haven't yet caught the error that takes the bus down.

Dezheng_Tang

If you think it could be a context-switching issue, can you use something like irq_lock() and irq_unlock() to protect your critical sections, at least to see whether it improves things?

galenc

Hi @Dezheng_Tang, yeah, irq_lock() worked well when I wrap the read/write calls in it. Context switching mid-transaction seems to be the cause of the timeouts. If I wrap the majority of mcux_i3c_do_one_xfer(), the errors are mostly gone (some timeouts when emitting stop still occur). I'm not sure disabling interrupts for ~250 us when it only takes ~70 us to read from the FIFO is acceptable for the rest of my app, but I guess I can do more testing there.

I was able to discover that the EIO errors I was getting were really just timeouts as well, just in places not calling mcux_i3c_error_is_timeout(). You can see this in the case above. The I3C module calls it a timeout even if everything works: even in cases where the FIFO has the correct data, MERRWARN = 0x00100000 (timeout) will still be set. I want to believe I can just ignore these errors (or live with irq_lock for an entire transaction period).

The open item that remains is the timeouts when "emitting stop", where the bus goes completely down as a result. Those can take hours to occur. I will keep monitoring; I haven't had one occur yet today. Do you have any advice on what could still be causing that? Do you think preventing general timeouts will improve the stability of that failure mode too?

Dezheng_Tang

I don't know why, from time to time, the post update notification was not forwarded to my Outlook although I have subscribed on the community.

I would minimize the use of the EIO error if possible and use GPIO toggling for debugging, which I trust more.

Preventing general timeouts will definitely improve the stability of that failure mode. On the other hand, I am also curious whether your I2C target is doing clock stretching, so that your EmitStop() never finishes.

On RT68x, clock stretching is not supported in our I3C controller; this is one of the limitations of its I2C compatibility. Please take a look at the beginning of the I3C chapter. I saw a similar post related to I2C clock stretching:

https://community.nxp.com/t5/MCX-Microcontrollers/I3C-timeout-due-to-target-clock-stretching/td-p/19...

If so, you will have to switch to the I2C controller to communicate with your I2C target. In the past, whenever there was a "dead" I2C target on the bus, I also implemented some bus recovery logic: configure SCL as a GPIO pin and toggle it ~20 times, make sure the bus is back to idle, then switch the pin back to the SCL function.
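The toggling logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not NXP or Zephyr code: `struct bus_pins`, `clock_out_stuck_target`, and the fake target are hypothetical, and on real hardware the hooks would drive SCL as a GPIO and read back SDA.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical pin-access hooks; on hardware these would use GPIO calls. */
struct bus_pins {
	void (*set_scl)(void *ctx, bool level);
	bool (*get_sda)(void *ctx);
	void (*half_period_delay)(void *ctx);
	void *ctx;
};

/*
 * Pulse SCL up to max_pulses times while a target holds SDA low.
 * Returns the number of pulses used, or -EBUSY if SDA is still low.
 */
static int clock_out_stuck_target(const struct bus_pins *pins, int max_pulses)
{
	int pulses = 0;

	while (!pins->get_sda(pins->ctx) && pulses < max_pulses) {
		pins->set_scl(pins->ctx, false);
		pins->half_period_delay(pins->ctx);
		pins->set_scl(pins->ctx, true);
		pins->half_period_delay(pins->ctx);
		pulses++;
	}
	return pins->get_sda(pins->ctx) ? pulses : -EBUSY;
}

/* Fake target that releases SDA after bits_left rising SCL edges. */
struct fake_target {
	int bits_left;
	bool scl;
};

static void fake_set_scl(void *ctx, bool level)
{
	struct fake_target *t = ctx;

	if (level && !t->scl && t->bits_left > 0) {
		t->bits_left--;
	}
	t->scl = level;
}

static bool fake_get_sda(void *ctx)
{
	const struct fake_target *t = ctx;

	return t->bits_left == 0;
}

static void fake_half_delay(void *ctx)
{
	(void)ctx;
}
```

Stopping as soon as SDA reads high keeps the recovery short; a real implementation would typically follow it with a STOP condition before handing the pins back to the controller.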

galenc

No worries, I appreciate how prompt you are regardless. Yeah, the GPIO works well. Unfortunately, on our custom board we don't have the ability to use GPIO toggling for debugging as on the EVK. I will see if I can repurpose some IO there to use it.

Right now I'm seeing that it pretty reliably brings the bus down after 10-20 min on the custom board (which is a much more complicated system), but it's pretty stable on the EVK now. I like the idea of forcing the SCL line as GPIO; I will just have to figure out a good way to do this in Zephyr.

So far I don't have much else to go on, but I have noticed that MCTRL seems to say it's in I3C mode: 0x42 would mean bits 5:4 are 0 = I3C. I'm not sure why that would be, as everything should be in I2C. Our only device that is capable of I3C is the LSM6DSOX, which uses an off-the-shelf Zephyr driver (and actually only works for a couple of minutes at bootup), but it's forced into i2c mode on init.
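The bit arithmetic behind that reading of MCTRL can be checked with a small decoder. This is a minimal sketch with made-up names, assuming REQUEST sits in bits 2:0 and TYPE in bits 5:4 (0 = I3C, 1 = I2C, 2 = DDR), and assuming the request encodings listed in the enum; please verify both against the reference manual.

```c
#include <stdint.h>

/* Assumed TYPE field encodings (MCTRL bits 5:4). */
enum mctrl_type_val {
	MCTRL_TYPE_I3C = 0,
	MCTRL_TYPE_I2C = 1,
	MCTRL_TYPE_DDR = 2,
};

/* Assumed REQUEST field encodings (MCTRL bits 2:0). */
enum mctrl_req_val {
	MCTRL_REQ_NONE = 0,
	MCTRL_REQ_EMITSTARTADDR = 1,
	MCTRL_REQ_EMITSTOP = 2,
	MCTRL_REQ_FORCEEXIT = 6,
};

/* Extract the pending request from an MCTRL value. */
static unsigned int mctrl_request(uint32_t mctrl)
{
	return mctrl & 0x7U;
}

/* Extract the bus type from an MCTRL value. */
static unsigned int mctrl_type(uint32_t mctrl)
{
	return (mctrl >> 4) & 0x3U;
}
```

Under these assumptions, 0x52 decodes as an emit-stop request with type I2C, while 0x42 is the same request with type I3C, which matches the observation above.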

[00:38:31.275,094] <inf> charger: Vin=260mV, State=0, Tj=31.701500 DegC, NTC=0.461914, TS=0.283203: Battery V=7970mV I=0mA, cell_count=2, per_cell=3985 ADC_s=1 ADC_c=1
[00:38:32.282,810] <err> i3c_mcux: IBI Timeout error
[00:38:32.282,825] <wrn> i3c_mcux: ERROR from transfer read=1, error=-116
[00:38:32.415,111] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000042 MSTATUS 0x00001803 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:38:33.415,389] <err> i3c_mcux: FORCE EXIT
[00:38:33.515,582] <wrn> i3c_mcux: EMIT STOP return timeout
[00:38:33.515,597] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:38:33.648,417] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000002 MSTATUS 0x00001003 MERRWARN 0x00000000 MIBIRULES 0x00000000

Here is a plot of the bus as it dies on our custom board (the lines are steady state like that forever, i.e. SDA low and SCL High):

Dezheng_Tang

This debug log is very different from what I saw before. All the waveforms you have posted before were I2C traffic, and MCTRL bits 0~7 were either 0x50 or 0x52, which also indicates I2C. You also mentioned before that when EmitStop() failed to complete, there was no IBI, and the bus was in NORMACT and in I2C mode.

You will have to debug when and where the MCTRL register bits 5~4 (TYPE) got changed and why an IBI message suddenly pops up. The other interesting thing is that FORCE EXIT is for HDR/DDR and EMIT STOP is for SDR. FORCE EXIT and EMIT STOP should never appear in the log at the same time.

galenc

I cleaned up some of the logging, but it looks like the issue still follows that waveform. SDA is low and does not return high. In this plot it only recovers after I reset the target twice:

A closer look at the "pulse" in the middle (the first time I press reset):

Zephyr actually cannot get past the pre-main init code after the first reset.
The bus only recovers after the second reset, when SDA finally transitions high. What could cause SDA to be held low like this? We have multiple pull-ups on the lines, so something must be driving it low. Do you think your suggestion of switching the pin mux to GPIO and toggling will resolve this?

Dezheng_Tang

It looks like your target device is "dead" and holds SDA low forever. In cases like this, I3C controller reinitialization or reset won't help either. It doesn't have anything to do with RT68x.

It's not guaranteed, but there are a few things you can try:

(1) Slow the I2C clock down a little, and maybe fine-tune the duty cycle and rise/fall times.

(2) As a hardware consideration, make sure there is no glitch on VDD and that it never drops below 1.71V, according to the LSM6DSO datasheet. Maybe switch to 3.3V to see whether the same problem occurs. I am concerned about the I2C slave timing violation mentioned in their datasheet.

(3) Make sure the SCL/SDA lines on the bus are clean; maybe try stronger pull-ups.

(4) Check that IOCON full-drive mode is enabled.

(5) Slow the sensor reading rate a little to see if it helps.

(6) Switch to GPIO pins, toggle 20~30 times, switch back to I3C SCL/SDA, and see whether the bus is in idle state, i.e. both SCL and SDA are high.

galenc

Hi @Dezheng_Tang, sorry for the delay; we were trying a few things. We tried to isolate which one of our devices is causing the issue. Still no definitive signal there, unfortunately.

I was able to implement the hardware recovery similar to what you suggested. I actually pulled the i2c_bitbang driver, which is used by the i2c module for recovery, into the i3c driver:

int i2c_bitbang_recover_bus(struct i2c_bitbang *context)
{
	int i;

	/*
	 * The I2C-bus specification and user manual (NXP UM10204
	 * rev. 6, section 3.1.16) suggests the master emit 9 SCL
	 * clock pulses to recover the bus.
	 *
	 * The Linux kernel I2C bitbang recovery functionality issues
	 * a START condition followed by 9 STOP conditions.
	 *
	 * Other I2C slave devices (e.g. Microchip ATSHA204a) suggest
	 * issuing a START condition followed by 9 SCL clock pulses
	 * with SDA held high/floating, a REPEATED START condition,
	 * and a STOP condition.
	 *
	 * The latter is what is implemented here.
	 */

It seems to work, returning SDA high:

and I can even trigger it a couple of times:

but the i3c controller does not get back to IDLE (FYI, I added the error EHOSTDOWN to denote this inability to get into IDLE when emitting stop).

[00:17:30.601,824] <err> i3c_mcux: do_one_xfer_read Timeout error
[00:17:30.601,843] <wrn> i3c_mcux: ERROR from transfer read=1, error=-116
[00:17:30.736,449] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001903 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:31.736,672] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:31.736,688] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:31.871,770] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:32.871,977] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:32.871,993] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:33.006,332] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:34.006,569] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:34.006,586] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:34.139,991] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:35.140,169] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:35.140,180] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:35.274,973] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:36.275,072] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:36.275,088] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:36.410,054] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:37.410,277] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:37.410,295] <err> i3c_mcux: Error waiting for stop EHOSTDOWN
[00:17:37.410,303] <err> i3c_mcux: STOP ERROR = -117
[00:17:37.410,306] <wrn> i3c_mcux: RETURN EHOSTDOWN
[00:17:37.543,104] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
fusion_core:~$i3c_recover
--- 21 messages dropped ---
[00:17:38.543,281] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:38.543,297] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:38.678,129] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:38.748,052] <wrn> charger: TRUING TO RECOVR BUS
[00:17:38.848,277] <wrn> i3c_mcux: pnx_mcux_i3c_recover_bus ... using i2c bus reovery!!!!
[00:17:39.678,370] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:39.678,388] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:39.813,108] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:40.813,268] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:40.813,281] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:40.948,013] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:41.948,273] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:41.948,292] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:42.081,881] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:43.082,069] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:43.082,087] <wrn> i3c_mcux: Timeout on emit stop, retrying
[00:17:43.216,617] <err> i3c_mcux: Tried to emit stop 10000times..clearing all! MCTRL 0x00000052 MSTATUS 0x00001103 MERRWARN 0x00000000 MIBIRULES 0x00000000
[00:17:51.046,370] <wrn> i3c_mcux: EMIT STOP return EHOSTDOWN
[00:17:51.046,389] <err> i3c_mcux: Error waiting for stop EHOSTDOWN

My question is: after I'm able to recover the SDA line via GPIO toggling/bitbanging, what steps should I follow to get the i3c controller back? It seems it's still stuck somewhere in NORMACT, but it's not transmitting anything.

Zephyr doesn't allow "re-initializing" devices, so I have to build one from scratch, and what I'm doing doesn't seem to be working.

Thanks,
Galen

Dezheng_Tang

Glad your I2C bitbang bus recovery works. On the master side, there are several things you should check:

(1) The pin mux needs to be set back to the I3C SCL and SDA functions.

(2) Make sure both the MSTATUS and MERRWARN registers are reset to their default state.

(3) In our SDK code, we have a function called I3C_MasterTransferAbort() to abort the master transfer. Please take a close look at it. I don't know why you would stay in NORMACT state even after you have issued a STOP request. After you emit stop, you should wait for CTRL DONE. Once CTRL DONE is seen, it should be back in IDLE state.

(4) If (2) and (3) don't work, I don't know what else you can do; maybe reset I3C by writing 1 to the I3C0 bit in RSTCTL1_PRSTCTL2. You may want to check all the I3C registers, especially MCONFIG, and make sure they are the same before and after resetting via the I3C bit.
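The order of these checks can be written down as a small escalation routine. This is a sketch only: `struct recovery_ops`, `recover_controller`, and the fake hardware are hypothetical names, and each hook stands in for board- or driver-specific code (pin mux switching, status clearing, the SDK abort, the peripheral reset) that is not shown here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical hooks wrapping the board- or driver-specific steps. */
struct recovery_ops {
	int  (*bitbang_recover)(void *ctx);  /* clock SCL as GPIO until SDA is released */
	int  (*restore_pinmux)(void *ctx);   /* step (1): pins back to the I3C function */
	void (*clear_status)(void *ctx);     /* step (2): clear MSTATUS and MERRWARN */
	int  (*abort_transfer)(void *ctx);   /* step (3): emit STOP, wait for done */
	bool (*is_idle)(void *ctx);          /* controller back in IDLE? */
	int  (*reset_and_reinit)(void *ctx); /* step (4): peripheral reset, reconfigure */
	void *ctx;
};

/* Try the cheap steps first and fall back to a full reset. */
static int recover_controller(const struct recovery_ops *ops)
{
	int ret = ops->bitbang_recover(ops->ctx);

	if (ret != 0) {
		return ret; /* a target still holds SDA low */
	}
	ret = ops->restore_pinmux(ops->ctx);
	if (ret != 0) {
		return ret;
	}
	ops->clear_status(ops->ctx);
	(void)ops->abort_transfer(ops->ctx);
	if (ops->is_idle(ops->ctx)) {
		return 0;
	}
	ret = ops->reset_and_reinit(ops->ctx);
	if (ret != 0) {
		return ret;
	}
	return ops->is_idle(ops->ctx) ? 0 : -EIO;
}

/* Fake hardware used to exercise the ordering. */
struct fake_hw {
	bool sda_stuck;
	bool abort_reaches_idle;
	bool idle;
	int resets;
};

static int fake_bitbang(void *ctx)
{
	struct fake_hw *hw = ctx;

	return hw->sda_stuck ? -EBUSY : 0;
}

static int fake_pinmux(void *ctx)
{
	(void)ctx;
	return 0;
}

static void fake_clear(void *ctx)
{
	(void)ctx;
}

static int fake_abort(void *ctx)
{
	struct fake_hw *hw = ctx;

	if (hw->abort_reaches_idle) {
		hw->idle = true;
	}
	return hw->idle ? 0 : -ETIMEDOUT;
}

static bool fake_is_idle(void *ctx)
{
	const struct fake_hw *hw = ctx;

	return hw->idle;
}

static int fake_reset(void *ctx)
{
	struct fake_hw *hw = ctx;

	hw->resets++;
	hw->idle = true;
	return 0;
}
```

Keeping the reset as the last resort avoids reconfiguring the controller when clearing status and aborting the transfer is already enough.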

galenc

@Dezheng_Tang that works! Actually, I was able to just re-initialize the entire i3c controller after I run the bitbang recovery (which makes those calls). I might be able to pare it down a bit, but it's good enough for now. It was running for 6 hours straight last night and recovered itself 3 times before I powered it off.

Thank you for your assistance and patience!
-Galen

Dezheng_Tang

One more thing to add: if you are using a level shifter between the RT685 and the sensor, can you try to bypass it and see if that helps? I suspect the signal quality plays some role in the lockup of the bus.
