i.MX53: Linux 2.6.35 FlexCAN Fifo Handling Bug causing Delayed Receives.

TomE · ‎11-30-2015

We have a working system running Linux 3.4 on an i.MX53.

We've had to go back to 2.6.35 as that's the only OS that supports the video-in hardware on the chip.

Now I'm trying to get the rest of the hardware working as reliably as it does on Kernel 3.4.

The 2.6.35 FlexCAN driver defaults to using 32 receive MBs and 32 transmit MBs. This may work for very simple ID-based messaging, but anything that needs to send a stream of data over CAN (booting, flashing, configuring, debugging) requires the messages be received in the same order they are sent. The default setup can't do that. This has been covered before:

https://community.freescale.com/message/592846#592846

To send in the right order the driver can be told to use one TX MB. To receive in the right order it can be told to use the FIFO.

I tried that, and then the FIFO got in a very funny state. It seemed to be keeping 3 or 4 messages internally, and it would only signal one was available when the next one arrived. So it was behaving like the input and output pointers got misaligned.

Checking the manual, it says the FIFO has to be handled by:

i.MX53 Reference Manual

34.4.7 Rx FIFO

Upon receiving the interrupt, the ARM must read the frame
(accessing an message buffer in the 0x80 address) and then clear the
interrupt. The act of clearing the interrupt triggers
the FIFO engine to replace the message buffer in 0x80 with the next
frame in the queue, and then issue another interrupt to the ARM.

The Linux 3.4 FlexCAN FIFO driver obeys those rules (edited down to show the important instructions):

static void flexcan_read_fifo(const struct net_device *dev,
                  struct can_frame *cf) {
    reg_ctrl = flexcan_read(&mb->can_ctrl);
    reg_id = flexcan_read(&mb->can_id);
    *(__be32 *)(cf->data + 0) = cpu_to_be32(flexcan_read(&mb->data[0]));
    *(__be32 *)(cf->data + 4) = cpu_to_be32(flexcan_read(&mb->data[1]));
    /* mark as read */
    flexcan_write(FLEXCAN_IFLAG_RX_FIFO_AVAILABLE, &regs->iflag1);
    flexcan_read(&regs->timer);
}

The Linux 2.6.35 FIFO driver does this:

void flexcan_mbm_isr(struct net_device *dev) {
    /* Read iflag1 and iflag2 and work out the masking */
    iflag1 = __raw_readl(flexcan->io_base + CAN_HW_REG_IFLAG1) &
             __raw_readl(flexcan->io_base + CAN_HW_REG_IMASK1);
    iflag2 = __raw_readl(flexcan->io_base + CAN_HW_REG_IFLAG2) &
             __raw_readl(flexcan->io_base + CAN_HW_REG_IMASK2);

    __raw_writel(iflag1, flexcan->io_base + CAN_HW_REG_IFLAG1);    /* Clear ALL set interrupts! */
    __raw_writel(iflag2, flexcan->io_base + CAN_HW_REG_IFLAG2);

    if (flexcan->fifo) {
        flexcan_fifo_isr(dev, iflag1);        /* This function reads one message only. */
        iflag1 &= 0xFFFFFF00;
    }
    ...
}

It clears the interrupts and THEN reads the FIFO. That's completely the wrong order. Reading the manual, I'm surprised it ever delivers the second message. I'm guessing the reception of a new message into the FIFO sets the interrupt request again, so that explains why it can read multiple MBs late.
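A toy model may help show why the order matters. The sketch below takes the manual text literally: clearing the flag makes the FIFO engine move the next frame into the output buffer. Everything in it (`toy_fifo`, `toy_receive`, the two `toy_isr_*` functions) is hypothetical and is not the real register layout. The real silicon clearly behaves differently in detail, since the symptom reported here is late frames rather than lost ones, so treat this only as an illustration of the ordering problem.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIFO_DEPTH 6

/* Toy stand-in for the Rx FIFO; not the real FlexCAN register layout. */
struct toy_fifo {
    int queue[TOY_FIFO_DEPTH]; /* queue[0] plays the role of the MB at 0x80 */
    size_t count;              /* frames currently queued */
    int irq_flag;              /* "frame available" interrupt flag */
};

/* A frame arrives from the bus and raises the interrupt flag. */
static void toy_receive(struct toy_fifo *f, int frame)
{
    assert(f->count < TOY_FIFO_DEPTH);
    f->queue[f->count++] = frame;
    f->irq_flag = 1;
}

/* Read the output buffer; -1 means nothing valid is there. */
static int toy_read(const struct toy_fifo *f)
{
    return f->count ? f->queue[0] : -1;
}

/* Clearing the flag makes the engine move the next frame to the output. */
static void toy_clear_irq(struct toy_fifo *f)
{
    if (f->count) {
        for (size_t i = 1; i < f->count; i++)
            f->queue[i - 1] = f->queue[i];
        f->count--;
    }
    f->irq_flag = f->count != 0;
}

/* Order from the manual and the 3.4 driver: read, then clear. */
static int toy_isr_read_then_clear(struct toy_fifo *f)
{
    int frame = toy_read(f);
    toy_clear_irq(f);
    return frame;
}

/* Order used by the 2.6.35 driver: clear, then read. */
static int toy_isr_clear_then_read(struct toy_fifo *f)
{
    toy_clear_irq(f);
    return toy_read(f);
}
```

In this model, read-then-clear hands back every frame in arrival order, while clear-then-read never returns the frame that raised the interrupt and returns "nothing valid" when only one frame was queued.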

It should also probably loop reading the FIFO while there's something in it. Instead it reads ONE message, and then cycles through up to 56 Transmit message buffers before returning for another interrupt to read the next one. That's inefficient and is ignoring the highest priority interrupt. Thinking about this though, the state machine (that transfers the next message in) is probably running off a slow clock and may take a long time (relative to the CPU) to get the FIFO ready for the next read. So it may not be worth checking or waiting for. I may measure this later.
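The "loop while there's something in it" idea could look roughly like the sketch below. It is not code from either driver: `fifo_ops`, `drain_fifo` and the `mock_*` helpers are hypothetical names, the register access goes through callbacks so the loop can run against a mock, and the `budget` limit is an added assumption so a busy bus cannot keep the handler spinning forever. Bit 5 of IFLAG1 is assumed to be the "frames available" flag.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: bit 5 of IFLAG1 is the "frames available in FIFO" flag. */
#define FIFO_AVAILABLE (1u << 5)

/* Hypothetical register accessors, so the loop can run against a mock. */
struct fifo_ops {
    uint32_t (*read_iflag1)(void *ctx);
    void (*read_frame)(void *ctx);                  /* copy frame out of the MB */
    void (*clear_iflag1)(void *ctx, uint32_t mask); /* write-1-to-clear */
};

/* Read, then clear, and repeat while the flag is set, up to "budget" frames. */
static int drain_fifo(const struct fifo_ops *ops, void *ctx, int budget)
{
    int handled = 0;

    assert(budget > 0);
    while (handled < budget && (ops->read_iflag1(ctx) & FIFO_AVAILABLE)) {
        ops->read_frame(ctx);
        ops->clear_iflag1(ctx, FIFO_AVAILABLE);
        handled++;
    }
    return handled;
}

/* Mock hardware: "pending" frames queued, "delivered" frames read out. */
struct mock_can {
    int pending;
    int delivered;
};

static uint32_t mock_read_iflag1(void *ctx)
{
    const struct mock_can *m = ctx;
    return m->pending > 0 ? FIFO_AVAILABLE : 0u;
}

static void mock_read_frame(void *ctx)
{
    struct mock_can *m = ctx;
    m->delivered++;
}

static void mock_clear_iflag1(void *ctx, uint32_t mask)
{
    struct mock_can *m = ctx;
    if ((mask & FIFO_AVAILABLE) && m->pending > 0)
        m->pending--; /* clearing the flag advances the mock FIFO */
}

static const struct fifo_ops mock_ops = {
    .read_iflag1 = mock_read_iflag1,
    .read_frame = mock_read_frame,
    .clear_iflag1 = mock_clear_iflag1,
};
```

With three frames pending and a budget of eight, the loop handles three frames and stops. With a budget of two it stops early and leaves the rest for the next interrupt.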

This bug would show up in a protocol where two devices are exchanging commands and responses. The replies would have to be "pushed" through the FIFO by protocol retries. It might work, but would be very slow.

To trigger this condition, something else has to keep interrupts disabled long enough for 2 (or 3 or 4) messages to accumulate in the FIFO before its interrupt is serviced.

I can't find any fixes for this anywhere, as the FlexCAN driver in 2.6.35 is an "orphan". By 2.6.38 it had been replaced by the mainstream driver, which gets it right. I don't have the option of using that driver, as too much changed in the kernel.

Has anyone found and fixed this already (or found they had the problem and gave up)?

The Reference Manual doesn't have any instructions on when to clear message buffer interrupts for "normal" receive and transmit buffers. Since the buffer has to be locked to work on it, the order shouldn't matter, except that the interrupts should be cleared before re-enabling a buffer for sending or receiving (and

Tom

TomE · ‎12-01-2015

I wrote a small amount of debug code in the flexcan_mbm_isr() function in net/can/flexcan/mbm.c. It counted 10 FIFO interrupts and then dropped one (it didn't call flexcan_fifo_isr() that time through). Subsequent CAN messages I sent to it didn't elicit any more interrupts until I sent something on that port, and then it started receiving data again.

FIVE messages late. I "cansend" message number "7" (with "7" in the data field to mark it) and "candump" sees message number "2".

But it is worse than that. Bringing the CAN interface down and up (which is meant to reset it) doesn't fix this problem.

Here's what I sent:

root@triton1:/sys/devices/platform/FlexCAN.0# ifconfig can0 down
root@triton1:/sys/devices/platform/FlexCAN.0# ifconfig can0 up
# /usr/local/bin/cansend can1 55 1
# /usr/local/bin/cansend can1 55 2
# /usr/local/bin/cansend can1 55 3
# /usr/local/bin/cansend can1 55 4
# /usr/local/bin/cansend can1 55 5
# /usr/local/bin/cansend can1 55 6
# /usr/local/bin/cansend can1 55 7

And here's what I received:

<0x001> [2] 37 07
<0x001> [2] 37 02
<0x001> [2] 37 03
<0x001> [2] 37 04
<0x001> [2] 37 05
<0x001> [2] 37 01
<0x001> [2] 37 02

I then took it down, reconfigured it with the FIFO disabled, brought it up, brought it down, re-enabled the FIFO and started sending:

# /usr/local/bin/cansend can1 10 1
# /usr/local/bin/cansend can1 10 2
# /usr/local/bin/cansend can1 10 3
# /usr/local/bin/cansend can1 10 4
# /usr/local/bin/cansend can1 10 5
# /usr/local/bin/cansend can1 10 6
# /usr/local/bin/cansend can1 10 7
# /usr/local/bin/cansend can1 10 8
# /usr/local/bin/cansend can1 10 9
# /usr/local/bin/cansend can1 10 10

Here's what it received, starting with zero-length messages:

<0x000> [0]
<0x000> [0]
<0x000> [0]
<0x000> [0]
<0x000> [0]
<0x001> [2] 0a 01
<0x001> [2] 0a 02
<0x001> [2] 0a 03
<0x001> [2] 0a 04
<0x001> [2] 0a 05

It is still 5 behind. As the driver doesn't support any mechanism for resetting the chip (apart from BOFF recovery), this can only be fixed by power-cycling the whole device!

Tom

TomE · ‎12-01-2015

I didn't think today could get any worse, but it just did.

I changed my /etc/init.d/ startup script to initialise the CAN interfaces in FIFO mode before bringing them up the first time. That didn't work at all well.

The following is in two columns for the two terminal connections. The left one shows transmissions to CAN1 and the right one shows reception on CAN0. They're connected together in hardware.

                                  # /usr/local/bin/candump can0
# /usr/local/bin/cansend can1 1
                                  <0x04e3f9ad> [6] da 01 7a 49 74 df
# /usr/local/bin/cansend can1 2
                                  <0x322> [4] e9 65 54 aa remote request
# /usr/local/bin/cansend can1 3

# [  150.036498] ------------[ cut here ]------------
[  150.041182] WARNING: at net/can/af_can.c:633 can_rcv+0x70/0x16c()
[  150.047360] PF_CAN: dropped non conform skbuf: dev type 280, len 16, can_dlc 9
[  150.054663] Backtrace:
[  150.057165] [<c002afe8>] (dump_backtrace+0x0/0x10c) from [<c030cedc>] (dump_stack+0x18/0x1c)
... and so on and so on ...
[  150.308701] ---[ end trace 3ee7a352dfce1c8c ]---

# /usr/local/bin/cansend can1 4
                                  <0x3a1> [0] remote request
# /usr/local/bin/cansend can1 5
                                  <0x675> [0]
# /usr/local/bin/cansend can1 6
                                  <0x001> [1] 01
# /usr/local/bin/cansend can1 7
                                  <0x001> [1] 02
# /usr/local/bin/cansend can1 8
                                  <0x001> [1] 03
# /usr/local/bin/cansend can1 9
                                  <0x001> [1] 04
# /usr/local/bin/cansend can1 10
                                  <0x001> [1] 05

The first 5 "receives" are garbage, and the third one is bad enough to trigger a kernel WARNING with a backtrace, probably because of the field that says there are NINE bytes in the 8-byte-maximum CAN message [1]. The first real message only comes out after the sixth one goes in.

This is repeatable. It happens every power-up.

It works OK if I initialise in "non FIFO mode" and then change it later.

This is really confusing. I've checked the code and entering "ifconfig can0 up" to a "down" interface definitely drops a SOFT_RST command into the MCR. So it should be resetting it OK.

Maybe this is a hardware bug that even a soft reset won't clear?

I am setting "maxmb" (which converts to MCR[MAXMB]) to "9". Maybe it doesn't like that?

Note 1: ISO/CD 11898 covers this. It says that DLC values 8-15 all mean 8 data bytes. So the code that treats this as an error is also a bug.
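A minimal sketch of the clamping that Note 1 describes, assuming a classic CAN frame with a 4-bit DLC field. The helper name `dlc_to_len` and the `CAN_MAX_DATA_LEN` constant are made up for illustration and are not taken from the 2.6.35 driver:

```c
#include <assert.h>
#include <stdint.h>

#define CAN_MAX_DATA_LEN 8u /* classic CAN carries at most 8 data bytes */

/* Map a raw 4-bit DLC to a payload length: values 9..15 are treated as 8. */
static uint8_t dlc_to_len(uint8_t dlc)
{
    uint8_t len;

    dlc &= 0x0Fu; /* the DLC field is only 4 bits wide */
    len = dlc > CAN_MAX_DATA_LEN ? (uint8_t)CAN_MAX_DATA_LEN : dlc;
    assert(len <= CAN_MAX_DATA_LEN);
    return len;
}
```

A receive path that passes the raw DLC through a helper like this would hand a length of 8 up the stack for the "can_dlc 9" frame above, instead of a value that the PF_CAN layer rejects.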

Tom

TomE · ‎12-02-2015

All of these horrible problems go away if I change the code to follow the user manual.

I've changed the code ordering shown in my first post above for 2.6.35 to the following:

void flexcan_mbm_isr(struct net_device *dev)
...
    if (flexcan->fifo) {
        flexcan_fifo_isr(dev, iflag1);
    }

    __raw_writel(iflag1, flexcan->io_base + CAN_HW_REG_IFLAG1);
    __raw_writel(iflag2, flexcan->io_base + CAN_HW_REG_IFLAG2);

    if (flexcan->fifo) {
        iflag1 &= 0xFFFFFF00;
    }

It now works fine without getting the FIFO hardware all messed up.
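For readers wondering about the `iflag1 &= 0xFFFFFF00` line: with the FIFO enabled, the low eight IFLAG1 bits no longer belong to ordinary message buffers, and, as far as I understand the FlexCAN register layout, bits 5, 6 and 7 report "frame available", "FIFO warning" and "FIFO overflow". The sketch below only illustrates that split. The macro names, `iflag1_view` and `decode_iflag1` are hypothetical and do not come from the 2.6.35 driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FIFO-mode meaning of the low IFLAG1 bits. */
#define FIFO_FRAME_AVAILABLE (1u << 5)
#define FIFO_WARNING         (1u << 6)
#define FIFO_OVERFLOW        (1u << 7)
#define FIFO_REGION_MASK     0x000000FFu /* bits taken over by the FIFO */

struct iflag1_view {
    int frame_available; /* at least one frame waiting in the FIFO */
    int warning;         /* FIFO is almost full */
    int overflow;        /* a frame was lost */
    uint32_t mb_flags;   /* flags left for ordinary message buffers */
};

/* Split a raw IFLAG1 value into FIFO status and ordinary MB flags. */
static struct iflag1_view decode_iflag1(uint32_t iflag1)
{
    struct iflag1_view v;

    v.frame_available = (iflag1 & FIFO_FRAME_AVAILABLE) != 0;
    v.warning = (iflag1 & FIFO_WARNING) != 0;
    v.overflow = (iflag1 & FIFO_OVERFLOW) != 0;
    v.mb_flags = iflag1 & ~FIFO_REGION_MASK; /* same as & 0xFFFFFF00 */
    assert((v.mb_flags & FIFO_REGION_MASK) == 0);
    return v;
}
```

Masking with `0xFFFFFF00` keeps the later message buffer loop from treating the FIFO status bits as buffers 0 to 7.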

I've also written a test that disables the FIFO receive interrupt for a while and re-enables it when the "level warning" interrupt is seen. That allows the FIFO to partly fill and be emptied. It works fine.

I've also tried to measure how long it takes for the FlexCAN hardware to make the next message ready in the FIFO after the last one has been emptied. It is less than a microsecond.

Now I've got to work out why it can't SEND properly. I'll raise another post for that problem.

Tom

alejandrolozan1 · ‎12-09-2015

Hi Tom,

Thanks for sharing your solution. It is very appreciated.

/Alejandro

TomE · ‎12-09-2015

This driver can't TRANSMIT either.

I've documented that problem plus another SEVEN bugs in this driver here:

i.MX53 Linux FlexCAN Driver Can't Send Properly & other bugs.

I've attached Patches to fix all of these problems here:

Submit i.MX53 & i.MX28 Linux kernel patches

Tom
