i.MX53 Linux FlexCAN Driver Can't Send Properly & other bugs.

TomE · ‎12-02-2015

The Linux 2.6.35 FlexCAN driver has all sorts of problems. The latest one I've run into is that it can't send data properly.

CAN on Linux has problems caused by it being bolted into the Network code. Normally when sending data you'd prefer flow control to work by blocking the call before you get ENOBUFS. The way networking is set up this is what normally happens, as you run into the SO_SNDBUF limit (which blocks) before you hit the "/sys/class/net/eth0/tx_queue_len" limit, which gives ENOBUFS. Because the CAN buffers are small, and the "tx_queue_len" for CAN defaults to FIVE you get ENOBUFS all the time unless you push "tx_queue_len" up over 380 or so.

So if you run "cansequence" without any parameters you'll get this:

# /usr/local/bin/cansequence can1
interface = can1, family = 29, type = 3, proto = 1
write: No buffer space available

The writers of that program thought of that, so you can also do this, which has it polling for a POLLOUT condition:

# /usr/local/bin/cansequence can1
interface = can1, family = 29, type = 3, proto = 1

I can then use "cansequence -r -v" to have it print one line for every 256 incoming messages.

And nothing happens. And nothing KEEPS happening. So instead:

# /usr/local/bin/cansequence -r -v -v
interface = can0, family = 29, type = 3, proto = 1
received frame. sequence number: 146
received frame. sequence number: 147
received frame. sequence number: 148
received frame. sequence number: 149
received frame. sequence number: 150

What I can see is that the above is printing ONE LINE PER SECOND.

That is because "cansequence -p" on top of the supplied drivers is only able to send one CAN message per second on a 1 MBit/second CAN bus.

An oscilloscope bears this out. Running "cansequence" has it sending messages in a burst up until the transmit buffer size, and then reverting to sending one per second.

One second is the timeout passed to the "poll()" call by cansequence.

That tells me the driver can't be properly signalling when a buffer comes free. Here's the code that does that:

static void flexcan_mb_bottom(struct net_device *dev, int index)
{
...
        if (hwmb->mb_cs & (CAN_MB_TX_INACTIVE << MB_CS_CODE_OFFSET)) {
            if (netif_queue_stopped(dev))
                netif_start_queue(dev);
            return;

And the above would be wrong because the definition of "netif_start_queue()" basically comes down to:

include/linux/nedevice.h:
clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state);

Every other driver calls "netif_wake_queue() in that place, and that function does more than the above as it actually causes a reschedule:

static inline void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
    if (test_and_clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state))
        __netif_schedule(dev_queue->qdisc);
}

Changing the call in flexcan_mb_bottom() to netif_wake_queue() fixes this serious bug.

Isn't anyone else using CAN on the i.MX28 and i.MX53? I know of one other, and he ported the mainstream driver back to his project to gt it working:

https://community.freescale.com/thread/272930

Tom

p.s.

While I'm here I might as well list all of the other problems I've found with it so far:

FIFO code doesn't work: https://community.freescale.com/thread/381075
sysfs code forgets to add 1 when printing /sys/devices/platform/FlexCAN.0/rjw.
Code attempting to limit xmit_maxmb to less than maxmb does the reverse.
Code doesn't force DLC values above "8" to "8" to stop netif panics.
dump_rx_mb and dump_xmit_mb functions kills everything when the interface is down.
dump_*_mb() functions don't handle fifo mode.
Sysfs code printing rjw should add one to it.

Here's some more problems with it:

Sysfs supports "Clock Selection" but that only applies for i.MX35 and not any of the other ones.
The "flexcan_set_bitrate()" function starts with the misleading comment "TODO:: implement in future", and then implements the "future". This matches the provided Documentation ("mx53_linux.pdf"), which says that the bitrate setting doesn't work when it does and did prior to that documentation being written.

These latter ones are documented here:

https://community.freescale.com/message/590070#590070

Message was edited by: Tom Evans, adding some more bugs.

Here's some more problems with it, detailed later on in this thread:

There's a seriously bad interrupt hazard in the transmit code. The transmit interrupt attempts to enable the queue, but if it interrupts the mainline transmit code, that disables the queue after it has been enabled by the interrupt. This gives a solid lockup, but you normally have to have it set to one TX MB to have this happen.
The lockup should be able to be cleared by taking the port down and up again, but the open() and stop() functions don't operate on the queue.

Message was edited by: Tom Evans, adding some more bugs.

The Bus Off Recovery doesn't work at all. There are 5 or more separate bugs involved in this one.

https://community.freescale.com/message/599099#599099

Message was edited by: Tom Evans to add the Bus Off Recovery problem pointer.

TomE · ‎12-16-2015

I fixed this with the following basic change:

static int flexcan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct can_frame *frame = (struct can_frame *)skb->data;
    struct flexcan_device *flexcan = netdev_priv(dev);
    struct net_device_stats *stats = &dev->stats;

    BUG_ON(!flexcan);

    if (frame->can_dlc > 8)
        return -EINVAL;

    if (!flexcan_mbm_xmit(flexcan, frame)) {
        dev_kfree_skb(skb);
        stats->tx_bytes += frame->can_dlc;
        stats->tx_packets++;
        dev->trans_start = jiffies;
        return NETDEV_TX_OK;
    }
    netif_stop_queue(dev);
    return NETDEV_TX_BUSY;
}

The above was changed to the following. Stopping the queue at the start of the function and starting it at the end (instead of stopping it at the end) avoids the hazard.

static int flexcan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct can_frame *frame = (struct can_frame *)skb->data;
    struct flexcan_device *flexcan = netdev_priv(dev);
    struct net_device_stats *stats = &dev->stats;
    int ret;

    BUG_ON(!flexcan);

    if (frame->can_dlc > 8)
        return -EINVAL;

    /* Stop queue before transmitting to avoid interrupt hazard. */
    netif_stop_queue(dev);
    /* flexcan_mbm_xmit() returns:
     *    -1 if the transmit failed (all MBs full),
     *     0 if the transmit succeeded and there might be more free MBs,
     *     1 if the transmit succeeded and there are no free MBs left.
     */
    ret = flexcan_mbm_xmit(flexcan, frame);
    if (ret >= 0) {
        dev_kfree_skb(skb);
        stats->tx_bytes += frame->can_dlc;
        stats->tx_packets++;
        dev->trans_start = jiffies;
        if (ret == 0)
            netif_start_queue(dev);
        return NETDEV_TX_OK;
    }
    return NETDEV_TX_BUSY;
}

The "flexcan_mbm_xmit()" function was also changed to return -1, 0 and 1 values. That isn't necessary for this fix, but halves the transmit overhead in the case of using single transmit MBs.

People using more than one Transmit MB are less likely to see this problem, as each successive transmit interrupt has the opportunity to fix the damage caused by the one that had the interrupt hazard.

Full sources for this fix posted here:

https://community.freescale.com/message/597951#597951

Tom

View solution in original post

TomE · ‎12-09-2015

I've attached patches for the Receive, Transmit and all the other bugs on this thread:

Submit i.MX53 & i.MX28 Linux kernel patches

And to help others use this driver...

The Reference for this manual is "Chapter 32" in the " i.MX53 EVK Linux Reference Manual". It lists the variables in the sysfs area without saying how to use them. Most map directly through to bits in the FlexCAN registers, but some don't.

In order to get transmissions in the right order it is necessary to reduce the number of transmit buffers to ONE.

The parameters that control this are "fifo", "maxmb" and "xmit_maxmb". You might think that "xmit_maxmb" controls this, but it controls the number of RECEIVE MBs and not the number of transmits. I'm guessing the driver suffered a redesign at some point (to reverse the Receive/Transmit allocation) and the control variable didn't get renamed.

"maxmb" does what it says on the tin. It is the total number of MBs in use. Su it ranges from 1 to 64. The value of this MINUS ONE is written to FLEXCANx_MCR[MAXMB].

The buffers are divided up into the ones to be used for reception first, then the rest are used for transmission. If "fifo" is "1" then the FIFO is enabled, 8 MBs are used for receptionand "maxmb - 8" are used for transmission. "xmit_maxmb" is ignored by the code (except for the "dump" code which are another set of bugs I've fixed).

So if you've enabled the FIFO and want to transmit in order, set "maxmb = 9" and ignore "xmit_maxmb" as long as you've added my patches.

If not using the FIFO, the number of RECEIVE MB's is "xmit_maxmb", and the number of TRANSIT MB's is "maxmb - xmit_maxmb". Yes, very confusing, the reverse of what you'd expect from the names.

The code sets up the receive filters on the receive MBs so that odd ones receive Extended IDs and even ones receive Standard IDs. So you can have a minimum of 2 receive buffers if you need to receive both types.

If you're using Kernel 2.6.35 then all data is received under interrupts. If you're using 2.6.38, 3.x or 4.x then they use a completely different driver, and none of the above applies, but then you'll run into worse problems if you're using the MMC/eMMC/ESDHC driver:

frames

Tom

TomE · ‎12-15-2015

I wrote:

> I've attached patches for the Receive, Transmit and all the other bugs on this thread:

And even with all those patches the driver STILL doesn't work!

It still locks up when transmitting, and it locks up so badly you have to power cycle to recover it. Taking the port down and up again doesn't fix it.

Simply looking at the code shows it has a serious interrupt hazard. There's one mutex in there, and no spinlocks. And the mutex is only used to protext the sysfs variable. There is NO protection of the mainline code against the interrupt service routine.

Other drivers (for other chips and the mainline one for this chip) avoid the interrupt hazard by only having a single transmit MB and calling "netif_stop_queue(dev)" at the start of every transmit. The transmit routine then calls "netif_wake_queue(dev)" to start it again.

Instead this code searches for a spare MB, sends and returns with the queue still running. If the queue is now full, the transmit function is called again, finds no spare MBs, stops the queue and returns "NETDEV_TX_BUSY". This is a bad idea as this is documented in linux/Documentation/networking/netdevices.txt as:

NETDEV_TX_BUSY Cannot transmit packet, try later
Usually a bug, means queue start/stop flow control is broken in
the driver. Note: the driver must NOT put the skb in its DMA ring.

It is also stunningly inefficient. When the queue is full, every single MB send requires two calls to the transmit function. The first sends the MB and the second finds there's no MBs available. Worse, the code SEARCHES through all transmit MBs to see if it can find one. Since the FlexCAN is on a slow bus on the other size of a bridge, this takes about 200ns (EDIT: just wrote code to measure this at 180ns/read) for each read, which wastes 180 CPU instruction cycles at 1GHz. That means it takes about 6000ns in theory (actually 7000ns measured) to scan through the 32 transmit MBs once. And it does it TWICE so it takes between 900ns and 14000ns depending on whether the first scan found one quickly or not, just for the transmit to send one MB. If you've selected 56 transmit MBs for some reason it'll take a maximum of about 20us to transmit one MB. If you're running on a 1MHz CAN bus which can transmit one message every 50us, this means the CPU is busy for between 20% and 40% of the time just looping through the MBs twice, looking for a free one. The take-home message is to use as FEW transmit MBs as you can to avoid this.

As well as being slow, the interrupt hazard means that the transmit interrupt sometimes happens (I've got print statements showing this case) just prior to it returning "NETDEV_TX_BUSY". it then never interrupts again and the transmitter it totally wedged.

That's another thing I'm going to have to fix.

Tom

TomE · ‎12-16-2015

I fixed this with the following basic change:

static int flexcan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct can_frame *frame = (struct can_frame *)skb->data;
    struct flexcan_device *flexcan = netdev_priv(dev);
    struct net_device_stats *stats = &dev->stats;

    BUG_ON(!flexcan);

    if (frame->can_dlc > 8)
        return -EINVAL;

    if (!flexcan_mbm_xmit(flexcan, frame)) {
        dev_kfree_skb(skb);
        stats->tx_bytes += frame->can_dlc;
        stats->tx_packets++;
        dev->trans_start = jiffies;
        return NETDEV_TX_OK;
    }
    netif_stop_queue(dev);
    return NETDEV_TX_BUSY;
}

The above was changed to the following. Stopping the queue at the start of the function and starting it at the end (instead of stopping it at the end) avoids the hazard.

static int flexcan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct can_frame *frame = (struct can_frame *)skb->data;
    struct flexcan_device *flexcan = netdev_priv(dev);
    struct net_device_stats *stats = &dev->stats;
    int ret;

    BUG_ON(!flexcan);

    if (frame->can_dlc > 8)
        return -EINVAL;

    /* Stop queue before transmitting to avoid interrupt hazard. */
    netif_stop_queue(dev);
    /* flexcan_mbm_xmit() returns:
     *    -1 if the transmit failed (all MBs full),
     *     0 if the transmit succeeded and there might be more free MBs,
     *     1 if the transmit succeeded and there are no free MBs left.
     */
    ret = flexcan_mbm_xmit(flexcan, frame);
    if (ret >= 0) {
        dev_kfree_skb(skb);
        stats->tx_bytes += frame->can_dlc;
        stats->tx_packets++;
        dev->trans_start = jiffies;
        if (ret == 0)
            netif_start_queue(dev);
        return NETDEV_TX_OK;
    }
    return NETDEV_TX_BUSY;
}

The "flexcan_mbm_xmit()" function was also changed to return -1, 0 and 1 values. That isn't necessary for this fix, but halves the transmit overhead in the case of using single transmit MBs.

People using more than one Transmit MB are less likely to see this problem, as each successive transmit interrupt has the opportunity to fix the damage caused by the one that had the interrupt hazard.

Full sources for this fix posted here:

https://community.freescale.com/message/597951#597951

Tom

alejandrolozan1 · ‎12-16-2015

Hi,

Sorry, the thread was marked as answered. Let me delve into this.

/Alejandro

TomE · ‎12-16-2015

alejandrolozano wrote:

> Sorry, the thread was marked as answered. Let me delve into this.

Yes, that was me that "Answered" it and also me that "Unanswered" it.

The driver can't send CAN messages in order. This has been essential for a long time.
I found that the simplest way to fix this is to set the number of transmit Message Buffers to "1". That problem fixed.
Then I found it goes horridly slow when pushed - it drops to one message/second.
I found that was because it calls "netif_start_queue()" when it should be calling "netif_wake_queue()". That problem fixed. So I posted a patch and marked it "Answered". That was a bit premature.
Then it locks up solid when pushed, so I marked it "Unanswered".
I then found that was due to a very basic interrupt hazard bug. When I fix that, test it and post a Patch for that one I'll mark this "Answered" again.

Tom

i.MX53 Linux FlexCAN Driver Can't Send Properly & other bugs.

i.MX53 Linux FlexCAN Driver Can't Send Properly & other bugs.

i.MX2x

i.MX53

Linux

Suspected Software Defect