
i.MX53 Linux FlexCAN Driver Can't Send Properly & other bugs.

Question asked by TomE on Dec 2, 2015
Latest reply on Dec 16, 2015 by TomE

The Linux 2.6.35 FlexCAN driver has all sorts of problems. The latest one I've run into is that it can't send data properly.

 

CAN on Linux has problems caused by it being bolted onto the network code. Normally when sending data you'd prefer flow control to work by blocking the call before you get ENOBUFS. With ordinary networking that is what happens, because you run into the SO_SNDBUF limit (which blocks) before you hit the "/sys/class/net/eth0/tx_queue_len" limit (which gives ENOBUFS). Because the CAN buffers are small, and the "tx_queue_len" for CAN defaults to FIVE, you get ENOBUFS all the time unless you push "tx_queue_len" up over 380 or so.

 

So if you run "cansequence" without any parameters you'll get this:

 

# /usr/local/bin/cansequence can1
interface = can1, family = 29, type = 3, proto = 1
write: No buffer space available

 

The writers of that program thought of that, so you can also do this, which has it polling for a POLLOUT condition:

 

# /usr/local/bin/cansequence -p can1
interface = can1, family = 29, type = 3, proto = 1

 

I can then use "cansequence -r -v" to have it print one line for every 256 incoming messages.

 

And nothing happens. And nothing KEEPS happening. So instead:

 

# /usr/local/bin/cansequence -r -v -v
interface = can0, family = 29, type = 3, proto = 1
received frame. sequence number: 146
received frame. sequence number: 147
received frame. sequence number: 148
received frame. sequence number: 149
received frame. sequence number: 150

 

What I can see is that the above is printing ONE LINE PER SECOND.


That is because "cansequence -p" on top of the supplied drivers is only able to send one CAN message per second on a 1 MBit/second CAN bus.

 

An oscilloscope bears this out. Running "cansequence" has it sending messages in a burst until the transmit buffers are full, and then it drops back to sending one per second.

 

One second is the timeout passed to the "poll()" call by cansequence.
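For anyone following along, the send-side pattern being described (try the write, and on ENOBUFS sit in poll() waiting for POLLOUT before retrying) looks roughly like the sketch below. This is not the cansequence source. "send_with_poll", its parameters and the retry limit are made up for illustration, and it works on any file descriptor rather than specifically on a CAN socket. If the driver never signals that a buffer has come free, each retry only happens after the poll() timeout expires, which is the one-per-second behaviour seen above.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from cansequence): write one frame and,
 * if the kernel reports ENOBUFS (or EAGAIN), wait in poll() for POLLOUT
 * before trying again. A poll() timeout simply falls through to another
 * write attempt. Returns 0 on success, -1 on error or when max_tries
 * attempts have been used up.
 */
static int send_with_poll(int fd, const void *frame, size_t len,
                          int timeout_ms, int max_tries)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t n = write(fd, frame, len);

        if (n == (ssize_t)len)
            return 0;                   /* whole frame queued */
        if (n >= 0)
            return -1;                  /* short write: treated as an error */
        if (errno != ENOBUFS && errno != EAGAIN)
            return -1;                  /* a real error, give up */

        /* Queue is full: wait until the socket claims to be writable. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) < 0 && errno != EINTR)
            return -1;
    }
    return -1;                          /* ran out of attempts */
}
```

With a one second timeout, e.g. send_with_poll(sock, &frame, sizeof(frame), 1000, 10), a driver that wakes the queue properly lets poll() return almost immediately, while one that doesn't makes every blocked send cost the full timeout.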

 

That tells me the driver can't be properly signalling when a buffer comes free. Here's the code that does that:

 

static void flexcan_mb_bottom(struct net_device *dev, int index)
{
...
        if (hwmb->mb_cs & (CAN_MB_TX_INACTIVE << MB_CS_CODE_OFFSET)) {
            if (netif_queue_stopped(dev))
                netif_start_queue(dev);
            return;


 

And the above would be wrong because the definition of "netif_start_queue()" basically comes down to:

 

include/linux/netdevice.h:
clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state);

 

Every other driver calls "netif_wake_queue()" in that place. That function, by way of "netif_tx_wake_queue()", does more than the above, as it actually causes a reschedule:

 

static inline void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
    if (test_and_clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state))
        __netif_schedule(dev_queue->qdisc);
}


 

Changing the call in flexcan_mb_bottom() to netif_wake_queue() fixes this serious bug.
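To make the difference concrete, here is a toy userspace model of the two calls. It is not kernel code. The struct and function names are invented for illustration, "xoff" stands in for __QUEUE_STATE_XOFF, and "scheduled" just counts how many times the qdisc would have been asked to run.

```c
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
struct model_queue {
    bool     xoff;        /* stands in for __QUEUE_STATE_XOFF */
    unsigned scheduled;   /* times the qdisc was asked to run again */
};

/* Models netif_stop_queue(): mark the queue as stopped. */
static void model_stop_queue(struct model_queue *q)
{
    q->xoff = true;
}

/* Models netif_start_queue(): only clears the flag, nothing is kicked. */
static void model_start_queue(struct model_queue *q)
{
    q->xoff = false;
}

/*
 * Models netif_wake_queue(): clears the flag and, if the queue really
 * was stopped, schedules the qdisc so that waiting packets get sent.
 */
static void model_wake_queue(struct model_queue *q)
{
    if (q->xoff) {
        q->xoff = false;
        q->scheduled++;
    }
}
```

After a stop, the "start" version leaves "scheduled" at zero, so anything already sitting in the queue waits until something else happens to kick it. The "wake" version bumps the counter straight away.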

 

Isn't anyone else using CAN on the i.MX28 and i.MX53? I know of one other, and he ported the mainstream driver back to his project to get it working:

 

https://community.freescale.com/thread/272930

 

Tom

 

p.s.

 

While I'm here I might as well list all of the other problems I've found with it so far:

 

  • FIFO code doesn't work: https://community.freescale.com/thread/381075
  • sysfs code forgets to add 1 when printing /sys/devices/platform/FlexCAN.0/rjw.
  • Code attempting to limit xmit_maxmb to less than maxmb does the reverse.
  • Code doesn't force DLC values above "8" to "8" to stop netif panics.
  • dump_rx_mb and dump_xmit_mb functions kill everything when the interface is down.
  • dump_*_mb() functions don't handle fifo mode.
  • Sysfs code printing rjw should add one to it.
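For the DLC item in the list above, the kind of clamp that's missing is tiny. A minimal sketch, assuming classic CAN with at most 8 data bytes; "SKETCH_CAN_MAX_DLC" and "clamp_dlc" are names invented here, not taken from the driver:

```c
#include <stdint.h>

/* Classic CAN frames carry at most 8 data bytes (hypothetical macro name). */
#define SKETCH_CAN_MAX_DLC 8u

/*
 * Hypothetical helper: force a DLC read from the hardware into the range
 * 0..8, so the length used for copying frame data can never exceed the
 * 8-byte payload.
 */
static inline uint8_t clamp_dlc(uint8_t dlc)
{
    return dlc > SKETCH_CAN_MAX_DLC ? (uint8_t)SKETCH_CAN_MAX_DLC : dlc;
}
```

A received frame with a DLC field of 9 to 15 would then be treated as 8 bytes long instead of overrunning the data array.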

 

Here are some more problems with it:

 

  • Sysfs supports "Clock Selection" but that only applies for i.MX35 and not any of the other ones.
  • The "flexcan_set_bitrate()" function starts with the misleading comment "TODO:: implement in future", and then implements the "future". This matches the provided Documentation ("mx53_linux.pdf"), which says that the bitrate setting doesn't work, when in fact it does, and did before that documentation was written.

 

These latter ones are documented here:

 

https://community.freescale.com/message/590070#590070

 

Message was edited by: Tom Evans, adding some more bugs.

 

Here are some more problems with it, detailed later on in this thread:

 

  • There's a seriously bad interrupt hazard in the transmit code. The transmit interrupt attempts to enable the queue, but if it interrupts the mainline transmit code, that disables the queue after it has been enabled by the interrupt. This gives a solid lockup, but you normally have to have it set to one TX MB to have this happen.
  • The lockup should be able to be cleared by taking the port down and up again, but the open() and stop() functions don't operate on the queue.
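The interrupt hazard in the list above can be shown with a deterministic toy model. This is a userspace sketch with invented names, not the driver code. The "irq_between" flag stands for the transmit interrupt landing between the mainline code claiming the last MB and stopping the queue. The second function shows one common remedy in network drivers, which is to re-check after stopping the queue; that is not necessarily the fix used later in this thread.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, single TX message buffer assumed. */
struct tx_model {
    int  free_mbs;        /* free transmit message buffers */
    bool queue_stopped;   /* true once the queue has been stopped */
};

/* Transmit-complete interrupt: frees an MB and re-enables the queue. */
static void model_tx_irq(struct tx_model *m)
{
    m->free_mbs++;
    m->queue_stopped = false;
}

/* Racy mainline transmit: stops the queue based on a stale decision. */
static void model_xmit_racy(struct tx_model *m, bool irq_between)
{
    m->free_mbs--;                        /* claim an MB for this frame */
    bool now_full = (m->free_mbs == 0);
    if (irq_between)
        model_tx_irq(m);                  /* frame completes right here */
    if (now_full)
        m->queue_stopped = true;          /* undoes what the IRQ just did */
}

/* Same path, but looks again after stopping and re-enables if possible. */
static void model_xmit_recheck(struct tx_model *m, bool irq_between)
{
    m->free_mbs--;
    bool now_full = (m->free_mbs == 0);
    if (irq_between)
        model_tx_irq(m);
    if (now_full) {
        m->queue_stopped = true;
        if (m->free_mbs > 0)              /* an MB came free in the meantime */
            m->queue_stopped = false;
    }
}
```

Starting from one free MB, the racy version with the interrupt in the gap ends with a free MB, a stopped queue and no further interrupt due to restart it, which is the solid lockup. The re-checking version ends with the queue running.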

 

Message was edited by: Tom Evans, adding some more bugs.

 

The Bus Off Recovery doesn't work at all. There are 5 or more separate bugs involved in this one.

 

https://community.freescale.com/message/599099#599099

 

Message was edited by: Tom Evans to add the Bus Off Recovery problem pointer.
