SocketCan using FlexCan on i.MX53: Longer packet sequence sent in wrong order

fpn · ‎04-02-2014

We're trying to use CAN with Linux 2.6.35.3 and the SocketCan interface. Preliminary tests, such as sending and receiving with the cansend and candump command-line tools, run fine.

When we try to send more packets (e.g. 48) in a loop, they are sent in a different order, and the order is not always the same. Receiving with candump on the same node and interface does not reveal the problem, because the local dump shows the correct order. The issue only becomes visible with an external sniffer (which can be another i.MX53 board running candump). Some packets are also occasionally lost.

If each packet is followed by a delay much longer than its transmission time, the sending order is preserved, but the throughput is then unacceptable. Sending 8 to 16 packets followed by a delay a little longer than their transmission time seemed to work around the problem in a test program, but it failed in the production code (the packets were again reordered or lost).
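For anyone reproducing this, one way to make the reordering and loss measurable on the sniffer side is to put a running counter in one data byte of every frame and count the places where the counter does not advance by exactly one. The following is only a minimal sketch under that assumption; count_sequence_breaks() is a hypothetical helper and is not taken from the attached test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count positions where the sequence byte is not previous + 1 (mod 256).
 * seq[i] is the counter byte copied from the i-th frame seen by the sniffer.
 * Both a reordered frame and a lost frame show up as one or more breaks. */
size_t count_sequence_breaks(const uint8_t *seq, size_t n)
{
    size_t breaks = 0;

    assert(seq != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (seq[i] != (uint8_t)(seq[i - 1] + 1))
            breaks++;
    }
    return breaks;
}
```

A result of zero on the sniffer means the frames arrived complete and in order; anything else indicates reordering or loss.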

I have attached a simple test program that exposes the problem.

Original Attachment has been moved to: i_mx53_FlexCan_bug.zip

compmas2 · ‎10-29-2015

The same kernel is used by the i.MX28 and it sounds like the same issue as found here: i.MX28 CAN frame transmission order problem

I took the i.MX28 patch for using the mainline flexcan driver and added i.MX53 support. The patch file is attached. In addition, your board file under arch/arm/mach-mx5/ will need something like the following in place of the previous flexcan initialization code (if you need to initialise your transceiver, set .transceiver_switch to the setup function):

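#include <linux/can/platform/flexcan.h>

/* One entry per FlexCAN controller. Set .transceiver_switch to your
   transceiver setup function if the transceiver needs initialising. */
struct flexcan_platform_data flexcan_data[] = {
    { .transceiver_switch = NULL },   /* registered with mxc_flexcan0_device */
    { .transceiver_switch = NULL },   /* registered with mxc_flexcan1_device */
};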

In the mxc_board_init() function:

mxc_register_device(&mxc_flexcan0_device, &flexcan_data[0]);

mxc_register_device(&mxc_flexcan1_device, &flexcan_data[1]);

TomE · ‎11-04-2015

> The same kernel is used by the i.MX28 and it sounds like the same issue as found here:

Yes, you're right. The 2.6.35 kernel uses a Freescale FlexCAN driver in /drivers/net/can/flexcan/. It uses 32 of the 64 message buffers as a "transmit ring" and the other 32 for receive in some way I haven't bothered to work out. It takes no notice whatsoever of the fact that the message buffers have a fixed transmit priority order, so it guarantees bad out-of-order transmission unless you only send one message at a time and then wait a long time until it has gone (somehow). The driver appears to support the hardware FIFO but defaults to it being off, in which case it looks like received messages won't be in the right order either.

The "mainline" driver is one written by Pengutronix. It only uses one message buffer for transmission, so it is less efficient, but gets the order right. It also only uses the FIFO for reception, which can overflow too easily as it uses NAPI to unload the buffers.

The two drivers seem to have very different interfaces for controlling them. I'm finding my Linux 3.4 Socketcan "canconfig" can't talk to the Freescale one.

The 2.6.38 kernel in the current i.MX53 Linux support package has (an old version of) the Pengutronix driver in it.

It might be worth changing to 2.6.38 to get around this problem if you can handle porting your board customisations to 2.6.38. That should also fix problems in the 2.6.35 SPI driver detailed here:

https://community.freescale.com/message/582264#582264

Tom

TomE · ‎11-30-2015

There are SO many things badly wrong with the Freescale FlexCAN driver.

Transmitting out of order is only one of them, but this is a "Design Feature and not a Bug". What is going wrong is that the sending code fills in the Transmit MBs from the first FREE one, and the hardware transmits from the first one too (depending on LBUF you can have transmits in priority order). So if you queued 10, it has sent 5 and you queue another 10, the first 5 new ones go into the low free slots and get transmitted next, and the second five get sent last. So you end up with the transmit order 1-5, 11-15, 6-10 and 16-20.
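The effect described above can be reproduced with a small simulation. This is only a sketch, not driver code: it assumes 32 transmit MBs, a driver that always fills the lowest-numbered free MB, and hardware that always sends the lowest-numbered occupied MB first. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TX_MBS 32   /* transmit MBs, the Freescale driver's default split */
#define EMPTY  0    /* frame numbers start at 1, so 0 marks a free MB */

/* Driver side: put each frame into the lowest-numbered free MB.
 * Frames that find no free MB are silently dropped in this model. */
void queue_first_free(int mb[TX_MBS], int first, int count)
{
    assert(count >= 0);
    for (int f = first; f < first + count; f++) {
        for (int i = 0; i < TX_MBS; i++) {
            if (mb[i] == EMPTY) {
                mb[i] = f;
                break;
            }
        }
    }
}

/* Hardware side: send up to `count` frames, lowest-numbered MB first.
 * Appends the sent frame numbers to out[] and returns the new length. */
size_t transmit_lowest_first(int mb[TX_MBS], int count, int *out, size_t len)
{
    for (int i = 0; i < TX_MBS && count > 0; i++) {
        if (mb[i] != EMPTY) {
            out[len++] = mb[i];
            mb[i] = EMPTY;
            count--;
        }
    }
    return len;
}
```

Queuing frames 1-10, letting the hardware send five, queuing 11-20 and then draining the rest leaves out[] holding 1-5, 11-15, 6-10, 16-20, the same order as in the description above.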

You could rewrite the driver so it would have the option of filling all the transmit buffers in the right order, and only resetting to the first one when the last one has been sent. If it never "backfilled behind pending ones" this wouldn't happen.
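A minimal sketch of that "never backfill" idea, again as a simulation rather than a driver patch. It assumes 32 transmit MBs and hardware that sends the lowest-numbered occupied MB first; struct tx_ring and both functions are hypothetical names. The fill position only moves forward and wraps to MB 0 only once every MB has been sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_MBS 32
#define EMPTY  0    /* frame numbers start at 1, so 0 marks a free MB */

struct tx_ring {
    int mb[TX_MBS]; /* frame number held by each MB, EMPTY if free */
    int next;       /* next MB to fill; only ever moves forward */
};

/* Queue one frame without backfilling. Returns false when the fill position
 * has reached the end and earlier frames are still pending; the caller would
 * then have to stop the queue until the hardware has drained everything. */
bool queue_in_order(struct tx_ring *r, int frame)
{
    assert(r != NULL && frame != EMPTY);
    if (r->next == TX_MBS) {
        for (int i = 0; i < TX_MBS; i++) {
            if (r->mb[i] != EMPTY)
                return false;
        }
        r->next = 0;
    }
    r->mb[r->next++] = frame;
    return true;
}

/* Hardware side: send up to `count` frames, lowest-numbered MB first. */
size_t transmit_lowest_first(struct tx_ring *r, int count, int *out, size_t len)
{
    for (int i = 0; i < TX_MBS && count > 0; i++) {
        if (r->mb[i] != EMPTY) {
            out[len++] = r->mb[i];
            r->mb[i] = EMPTY;
            count--;
        }
    }
    return len;
}
```

With the same scenario (queue 1-10, send five, queue 11-20, drain), the frames now leave in the order 1 to 20. The price is that the ring stalls briefly at the wrap point until the last MB has gone out.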

It also can't RECEIVE in order either and for the same reason, but you have to get it very busy to notice that. You can code around that by sorting the messages in timestamp order. That's what Freescale recommends in the manual, but doesn't seem to have any working code for. The good news is you can enable the FIFO. The bad news is it only has SIX slots, so can overflow. The other bad news is that the FIFO driver code handles the interrupts in completely the wrong way and the FIFO can get stuck. The Transmission locks up too when you push it (still working on that).
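The timestamp-sorting workaround mentioned above could look roughly like this. It is a sketch under assumptions: each received MB carries a 16-bit free-running timestamp that can wrap, all messages in one batch lie within one wrap period, and the caller knows a base timestamp that is not later than the oldest message in the batch. struct rx_msg and sort_by_timestamp() are illustrative names, not part of any driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rx_msg {
    uint16_t ts;    /* free-running timestamp captured with the frame */
    uint32_t id;    /* CAN identifier */
};

/* Distance of a timestamp from `base`, tolerant of 16-bit wrap-around. */
static uint16_t ts_age(uint16_t ts, uint16_t base)
{
    return (uint16_t)(ts - base);
}

/* Stable insertion sort by age relative to `base`. A batch is at most a few
 * dozen message buffers, so the quadratic worst case does not matter. */
void sort_by_timestamp(struct rx_msg *m, size_t n, uint16_t base)
{
    assert(m != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        struct rx_msg key = m[i];
        size_t j = i;

        while (j > 0 && ts_age(m[j - 1].ts, base) > ts_age(key.ts, base)) {
            m[j] = m[j - 1];
            j--;
        }
        m[j] = key;
    }
}
```

Sorting on the distance from a base value rather than on the raw timestamp is what keeps a batch that straddles the counter wrap in the right order.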

Firstly, it is very easy to get it sending packets in order:

ifconfig can0 down

echo 33 > /sys/devices/platform/FlexCAN.0/maxmb

echo 32 > /sys/devices/platform/FlexCAN.0/xmit_maxmb # Redundant but ESSENTIAL

ifconfig can0 up

Those are normally "64" and "32" and mean that you have 32 receive (MBs 0-31) and 32 transmit (MBs 32-63) Message Buffers. Weirdly, "xmit_maxmb" is the maximum RECEIVE MB, and the last one before the transmit range starts. You have to read the code for a long time to work that out. It is also ignored if you use the FIFO, but is still used in the sysfs "dump" commands, so you have to set it to "8" to make them work.
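To make the split concrete, here is a small helper that turns the two sysfs values into MB ranges. It is a sketch of one reading only: it assumes "maxmb" is the total number of MBs in use and "xmit_maxmb" is the number of receive MBs (so also the index of the first transmit MB), which reproduces the figures quoted above. struct mb_layout and mb_layout_from_sysfs() are hypothetical names, not driver code.

```c
#include <assert.h>

struct mb_layout {
    int rx_first, rx_last;  /* receive MB range, inclusive */
    int tx_first, tx_last;  /* transmit MB range, inclusive */
    int tx_count;           /* number of transmit MBs */
};

/* ASSUMPTION: maxmb = MBs in use, xmit_maxmb = number of receive MBs. */
struct mb_layout mb_layout_from_sysfs(int maxmb, int xmit_maxmb)
{
    struct mb_layout l;

    assert(xmit_maxmb > 0 && xmit_maxmb < maxmb);
    l.rx_first = 0;
    l.rx_last = xmit_maxmb - 1;
    l.tx_first = xmit_maxmb;
    l.tx_last = maxmb - 1;
    l.tx_count = maxmb - xmit_maxmb;
    return l;
}
```

Under this reading the defaults (64, 32) give 32 transmit MBs, and the (33, 32) setting from the commands above leaves a single transmit MB, number 32.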

After you've done the above you only have one transmit message buffer. That drops the maximum transmit rate by a lot, in my testing down to about 20% of the bus at 1 Mbit/s. I can't understand why, as the mainline CAN driver also has only one transmit MB and can saturate the bus to 100% without any trouble in the same tests, so there's something else badly wrong here.

You have to set the "xmit_maxmb" value above, as if you check its value you'll find it JUMPS to the value in "maxmb". Worse, if you change "maxmb" from "64" to "60" and then to "33", "xmit_maxmb" remains stuck at "60". There's a comparison-reversal bug in the code that tries to keep "xmit_maxmb" less than "maxmb" and instead keeps it the same or greater.
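The behaviour above is what a reversed comparison would produce. The following is a reconstruction from the observed sysfs values only, not a copy of the driver source; both function names are hypothetical, and the "intended" variant is just one plausible way to keep "xmit_maxmb" below "maxmb".

```c
#include <assert.h>

/* What the observed values suggest happens whenever "maxmb" is written:
 * the test is the wrong way round, so a smaller value is pulled UP. */
int clamp_xmit_maxmb_observed(int xmit_maxmb, int maxmb)
{
    assert(maxmb > 0);
    if (xmit_maxmb < maxmb)         /* reversed comparison */
        xmit_maxmb = maxmb;
    return xmit_maxmb;
}

/* One plausible intended version: only a too-large value is pulled DOWN. */
int clamp_xmit_maxmb_intended(int xmit_maxmb, int maxmb)
{
    assert(maxmb > 0);
    if (xmit_maxmb >= maxmb)
        xmit_maxmb = maxmb - 1;
    return xmit_maxmb;
}
```

The "observed" version turns (32, 60) into 60 and then leaves 60 untouched when "maxmb" drops to 33, matching the stuck value described above.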

Tom

TomE · ‎11-30-2014

The CAN setup under linux has a lot of problems. Here's my previous research on this.

I don't know if the problem you're having is related to the transmit queue block-and-drop limits, but it might be.

Even if it isn't, this information might be useful for someone else.

http://socket-can.996257.n3.nabble.com/Solving-ENOBUFS-returned-by-write-td2886.html

With Ethernet, the transmit queue length is 1000 (exceeding it would return ENOBUFS), but before that happens the socket hits SO_SNDBUF, which may be 108544 bytes. That limit counts data plus SKB overhead, and with an SKB size of about 200 bytes the socket blocks at about 500 packets, before it would return ENOBUFS at 1000.

With CAN, it would block at about 500 as well, but it returns ENOBUFS at 10 first!

The blocking limit could be introduced/reduced by setting the SO_SNDBUF socket option.

If we set that limit to a suitable multiple of the SKB size (say 5 × 200 = 1000 bytes), the socket would probably block before it returned ENOBUFS.

The recommended ENOBUFS recovery method is "retry after a short sleep", in this case at least 100us.
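A minimal sketch of that retry loop, using only standard POSIX calls. The wrapper name write_retry_enobufs() and the retry limit are my own choices; the 100 us pause is the figure quoted above. It works on any file descriptor, so the same code can be used with a CAN_RAW socket and a struct can_frame.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* write() that retries after a short sleep while the kernel reports ENOBUFS.
 * Returns the write() result; on failure errno is left as set by write(). */
ssize_t write_retry_enobufs(int fd, const void *buf, size_t len, int max_retries)
{
    const struct timespec pause = { .tv_sec = 0, .tv_nsec = 100 * 1000 }; /* 100 us */

    assert(max_retries >= 0);
    for (int attempt = 0; ; attempt++) {
        ssize_t n = write(fd, buf, len);

        /* Success, a different error, or out of retries: hand back as is. */
        if (n >= 0 || errno != ENOBUFS || attempt >= max_retries)
            return n;
        nanosleep(&pause, NULL);
    }
}
```

The retry limit keeps the caller from spinning forever if the interface is down or the bus is stuck.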

Here's another one:

http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf

Blocking the application when the queue is full

Many SocketCAN users experience a problem with write()/send() failing with -ENOBUFS error. Since this is related to the use of queueing disciplines, this section describes why it happens and what can be done against it. In the default configuration, CAN interfaces have attached pfifo_fast queuing discipline which, when enqueueing the packet, checks whether the number of queued packets is greater than dev->txqueuelen (which is 10 for CAN devices by default). If it is the case, it returns NET_XMIT_DROP which is translated to -ENOBUFS in net_xmit_errno() called from can_send().

The problem is that there is no way for the application to be blocked until the queue becomes empty again. How can the application be made to block when the queue is full instead of getting ENOBUFS error? In general there are two mechanisms that limit the number of queued packets. First, there is the already mentioned per-device tx_queue_len limit and second, the per-socket SO_SNDBUF limit. The application only blocks when the latter limit is reached. Therefore, the solution is to set SO_SNDBUF low enough that this limit is reached before tx_queue_len limit.

Now, the question is to which value set the SO_SNDBUF limit. First, the minimum value is SOCK_MIN_SNDBUF/2, i.e. 1024. When the user supplies a smaller value the minimum is used instead. The more tricky thing is how is the value interpreted. The value represents the maximum socket send buffer in bytes. The kernel always doubles the value supplied by user (i.e. for the kernel the minimum is 2048) and stores it in sk->sk_sndbuf. When a packet is sent, a per-socket counter is increased by sizeof(can_frame) + sizeof(skb) (which is a value around 200, depending on kernel configuration and architecture). When the counter is greater or equal to sk->sk_sndbuf, the application blocks.
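The arithmetic in that paragraph can be written down as two small helpers. This is only a sketch of the rules as the quoted text states them (minimum of 1024, doubling, blocking once the counter reaches sk->sk_sndbuf); it is not kernel code, the function names are made up, and the per-frame charge is left as a parameter because the text only gives it as "around 200".

```c
#include <assert.h>

#define MIN_USER_SNDBUF 1024    /* SOCK_MIN_SNDBUF / 2, per the quoted text */

/* Value stored in sk->sk_sndbuf for a requested SO_SNDBUF, per the text. */
int effective_sndbuf(int requested)
{
    if (requested < MIN_USER_SNDBUF)
        requested = MIN_USER_SNDBUF;
    return 2 * requested;
}

/* Number of queued frames at which the per-socket counter first becomes
 * greater than or equal to sk_sndbuf, i.e. where the sender blocks. */
int frames_until_block(int sk_sndbuf, int per_frame_bytes)
{
    assert(sk_sndbuf > 0 && per_frame_bytes > 0);
    return (sk_sndbuf + per_frame_bytes - 1) / per_frame_bytes;
}
```

For example, frames_until_block(effective_sndbuf(0), 200) is 11 and frames_until_block(108544, 200) is 543. The real threshold depends on the actual per-frame charge on the target kernel, so compare the result against txqueuelen on your own system to see which limit is reached first.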

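The following piece of code sets the SO_SNDBUF value to its minimum:

/* s is an already-open CAN socket; needs <sys/socket.h> and <stdio.h>. */
/* A request of 0 is below the minimum, so the minimum is used instead. */
int sndbuf = 0;
if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    perror("setsockopt");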

Typically, the minimum value causes the application to block when there are about 15 frames queued. If we want all CAN applications in the system to block instead of receiving ENOBUFS, it is necessary to set the txqueuelen (see Section 3.1.2) to the number of simultaneously used CAN sockets in the system multiplied by 15. If the application does not wish to block, it sets O_NONBLOCK flag on the socket by using fcntl() call. After that, when the SO_SNDBUF is reached, the application receives EAGAIN error instead of ENOBUFS.
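The non-blocking variant can be sketched with standard fcntl() calls. The helper names set_nonblocking() and try_write() are mine, not from the paper; the code works on any file descriptor, including a CAN_RAW socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the descriptor's flags. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if the data was written, 0 if the send buffer is full
 * (EAGAIN/EWOULDBLOCK), and -1 on any other error. */
int try_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    if (write(fd, buf, len) >= 0)
        return 1;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
}
```

A return of 0 from try_write() is the point where the application can do other work or poll() for writability instead of sleeping in write().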

Tom
