Missing error-passive state interrupt

andriyngvason · ‎10-28-2014

Hi,

Why is there no error-passive interrupt on FlexCAN?

There are interrupts for active, warning (which isn't even standard), and bus-off, but not passive. How come?

The current driver implementation in the linux kernel polls esr1 for state changes whenever an rx- or error interrupt is received. If the bus is open (not connected), the error state in the driver is not updated, so the state is out of sync with what's happening in the device.

Best regards,

Andri Yngvason

alejandrolozan1 · ‎11-05-2014

You mean that there is no an interrupt flag that indicates when the error passive state is reached?

If so, you are correct. But it can be implemented by checking the FLTCONF field.

Remember that the only difference between error active and error passive are the error frames sent when detecting an error.

If there is an error, you could always check the state in the interrupt.

Best Regards,

Alejandro

andriyngvason · ‎11-06-2014

No, I mean that it appears that FlexCAN does not generate an interrupt when the error passive state is reached, so even if I were to check FLTCONF on interrupt events, it wouldn't help.

It does generate interrupts for other errors if the error interrupts are turned on, but those errors are currently only turned on in SocketCAN if the berr-reporting flag is set. I think the rationale is that they don't want to be flooded with interrupts when errors occur, because that might put an rt-system into bad shape. I'm not sure if that's valid though.

I think that this is a design flaw in FlexCAN and it should be fixed in the next "version".

Regards,

Andri

TomE · ‎11-30-2014

> No, I mean that it appears that FlexCAN does not generate an interrupt when the error passive state is reached,

> I think that this is a design flaw in FlexCAN and it should be fixed in the next "version".

Yes, it doesn't work, but it won't ever be fixed. It isn't even documented in the Errata. There's no section in there on FlexCAN.

So this is a request for what is well known to some driver writers to be added to the Errata for the rest of us.

It seems to be "documented" here:

http://lxr.free-electrons.com/source/drivers/net/can/flexcan.c

Now let's see if this thing will let me paste in a few lines of text without messing it up completely - no, it took six tries and it STILL turned the last lot into a pseudo-spreadsheet!

Author: Wolfgang Grandegger <wg@grandegger.com> 2012-09-28 13:17:15

Committer: David S. Miller <davem@davemloft.net> 2012-10-02 07:18:21

Parent: 3d42a379b6fa5b46058e3302b1802b29f64865bb (can: flexcan: add 2nd clock to support imx53 and newer)

Child: bb698ca41ba574b3066ebbc5766e5980ae0051ca (can: flexcan: disable bus error interrupts for the i.MX6q)

Branch: remotes/origin/karo-tx28

Follows: v3.6-rc7

Precedes: next-20121002, next-20121003, next-20121004, next-20121005, next-20121008, next-20121009, next-20121010, next-20121011, next-20121012, v3.7-rc1

flexcan: disable bus error interrupts for the i.MX28

Due to a bug in most Flexcan cores, the bus error interrupt needs

to be enabled. Otherwise we don't get any error warning or passive

interrupts. This is _not_ necessary for the i.MX28 and this patch

disables bus error interrupts if "berr-reporting" is not requested.

This avoids bus error flooding, which might harm, especially on

low-end systems.

To handle such quirks of the Flexcan cores, a hardware feature flag

has been introduced, also replacing the "hw_ver" variable. So far

nobody could tell what Flexcan core version is available on what

Freescale SOC, apart from the i.MX6Q and P1010, and which bugs or

features are present on the various "hw_rev".

Author: Wolfgang Grandegger <wg@grandegger.com> 2012-10-11 06:10:42

Committer: Marc Kleine-Budde <mkl@pengutronix.de> 2012-10-24 03:43:17

Parent: 4f72e5f00dea3eca9139a23cf70fbf18d62fd1db (flexcan: disable bus error interrupts for the i.MX28)

Child: 4358a9dc94cd70fa88797c87ed911bc8d866f37d (can: flexcan: add MODULE_DEVICE_TABLE)

Branch: remotes/origin/karo-tx28

Follows: v3.7-rc1

Precedes: next-20121026, v3.7-rc3

can: flexcan: disable bus error interrupts for the i.MX6q

This patch adds some Flexcan version info and removes the feature flag

FLEXCAN_HAS_BROKEN_ERR_STATE for the i.MX6Q. It also has the line [TR]WRN_INT

properly connected.

162 /*

163 * FLEXCAN hardware feature flags

164 *

165 * Below is some version info we got:

166 * SOC Version IP-Version Glitch- [TR]WRN_INT

167 * Filter? connected?

168 * MX25 FlexCAN2 03.00.00.00 no no

169 * MX28 FlexCAN2 03.00.04.00 yes yes

170 * MX35 FlexCAN2 03.00.00.00 no no

171 * MX53 FlexCAN2 03.00.00.00 yes no

172 * MX6s FlexCAN3 10.00.12.00 yes yes

173 *

174 * Some SOCs do not have the RX_WARN & TX_WARN interrupt line connected.

175 */

176 #define FLEXCAN_HAS_V10_FEATURES BIT(1) /* For core version >= 10 */

177 #define FLEXCAN_HAS_BROKEN_ERR_STATE BIT(2) /* [TR]WRN_INT not connected */

...

875	reg_ctrl = flexcan_read(&regs->ctrl);
876	reg_ctrl &= ~FLEXCAN_CTRL_TSYN;
877	reg_ctrl \|= FLEXCAN_CTRL_BOFF_REC \| FLEXCAN_CTRL_LBUF \|
878	FLEXCAN_CTRL_ERR_STATE;
879	/*
880	* enable the "error interrupt" (FLEXCAN_CTRL_ERR_MSK),
881	* on most Flexcan cores, too. Otherwise we don't get
882	* any error warning or passive interrupts.
883	*/
884	if (priv->devtype_data->features & FLEXCAN_HAS_BROKEN_ERR_STATE \|\|
885	priv->can.ctrlmode & CAN_CTRLMODE_BERR_REPORTING)
886	reg_ctrl \|= FLEXCAN_CTRL_ERR_MSK;
887	else
888	reg_ctrl &= ~FLEXCAN_CTRL_ERR_MSK;

Tom

andriyngvason · ‎12-01-2014

Hi Tom,

The FLEXCAN_HAS_BROKEN_ERR_STATE flag is there because some cores don't have [TR]WRN_INT connected.

Although it helps to enable it, it doesn't cure the problem completely. See: http://article.gmane.org/gmane.linux.can/7041

You can follow my patch-set on the socket-can mailing-list if you're interested in how this is to be handled in the linux kernel.

This is clearly not entirely thought out by the hw-people. There is NO point in having interrupts for warnings if you don't also have interrupts for the rest of the state transitions.

Besides, warning isn't even a proper error state according to the standard. It boggles the mind that they only added the interrupt for the thing that is non-standard.

Andri

TomE · ‎12-01-2014

All CAN controllers are broken in some way or omeaning they seldom do what *I* want. They either don't have FIFOs when I need to receive an ordered byte stream or they don't have enough dedicated match registers when I just need the "shared memory model" of CAN.

Everyone was poised at the starting gate in 2001 waiting for Bosch's patent to run out, then every manufacturer rushed out a different design of controller. Or two, Motorola has FlexCAN and MSCAN. We're stuck with "backwards compatibility" back to those original mistakes. How long did it take until the FlexCAN controller got the very useful SRX_DIS bit?

But what I really want to know is how you Kernel Driver writers found out "some cores don't have [TR]WRN_INT connected". Did you have to reverse-engineer this yourself, did a FAE supply some internal document or is it in a Manual or Errata somewhere?

From the Linux FlexCan driver it looks like there are two versions of the i.MX28, one broken_err_state and one fixed, but only broken i/MX35s and i.MX53s.

My specific problem is that we're using 1MBit CAN and the driver "as supplied" generates a CPU-numbing 20,000 interrupts/second if a single packet is sent on a disconnected CAN bus due to ACK_ERR being enabled. That burns (from memory) about 20% of the entire CPU, and I lose 40% if both buses are disconnected!

While I'm here, in the current i.MX53 Reference Manual:

34.2.2.1 CAN Rx

Dominant state is represented by logic level. Recessive state is represented by logic level.

That at least is fixed in the i.MX6 manuals.

Tom

TomE · ‎12-01-2014

I wrote:

> the driver "as supplied" generates a CPU-numbing 20,000 interrupts/second

I worked that out from theory, assuming a zero-data-byte message having about 50 bits in it.

I measured it again today. It is VERY strange and a lot worse than that

If I send a stream of 8-data-byte messages and then disconnect the CAN bus the interrupt rate is 7380/second. That matches the expected 120 bits/message

With zero-data-byte messages the rate is 16,180/second. That's about right for a 50-60 bit message.

But if I have the bus disconnected and THEN send any message, the interrupt rate is a CPU-numbing 31,000 interrupts/second!

That chews up 27% of the CPU.

If this happens to both CAN ports, the rate goes up to 43,600 interrupts/second PER PORT or 87,200 interrupts/second.

The CPU load doesn't go up in that ratio fortunately, as it is now "only" 33% busy just ignoring these useless interrupts.

I'm measuring the interrupt rate with:

# while true; do cat /proc/interrupts | grep can1; sleep 1; done

83: 15314667 tzic can1

83: 15322051 tzic can1

83: 15329431 tzic can1

83: 15336806 tzic can1

I'm measuring the "CPU Busyness" with:

# while true; do time for (( i = 0 ; i < 100000 ; i++ )) do true; done; done

Baseline with no loading:
real    0m6.496s

With one CAN bus blasting at 31,000 interrupts/second
real    0m8.941s

With both blasting at 43,600/second

real    0m9.679s



6.49  Baseline
8.94  8.94 - 6.49 = 2.45 / 8.94 = 27%
9.68  9.68 - 6.49 = 3.19 / 9.68 = 33%

Tom

TomE · ‎12-17-2014

I've been confused by this problem for weeks. I finally think I've got it sorted.

This details what the Freescale FlexCAN core does (and doesn't) do, and what the mainstream Linux Flexcan driver code does.

This will all change "real soon now" as Andre has submitted a comprehensive set of patches to the CAN system in general, but I haven't seen these (applied and in context so I can understand them).

I'm ignoring the Data Receive and Transmit Interrupts in this discussion. That is assumed to work.

Driver writers like chips that interrupt on all "State Changes". That means the transitions TO and FROM Error-Active, Warning, Error-Passive and Bus-Off states. Freescale's MSCAN core does this.

The i.MX28 and i.MX6 FlexCAN cores provide maskable interrupts for Bus Off, Receive and Transmit Warning Levels (the >96 ones) and "Errors", which are six grouped error bits in the ESR register. The "Warning" interrupts only work on entry into the "Warning" state, and don't interrupt when it leaves, so they're not as useful as they might be.

The i.MX25, i.MX35 and i.MX53 cores do the same, except the Receive and Transmit Warning Level interrupts apparently don't work. This isn't documented anywhere, they just don't work.

Most older Coldfire FlexCAN cores didn't provide Warning Level interrupts, although some MCF54xxx ones did.

None of them interrupt on the Error-Active to-and-from Error-Passive state changes. That's what Andre was originally complaining about.

To summarise:

Interrupts MCF53xx i.MX53 i.MX6 MSCAN

------------------------------------------

Active->Warn No No Yes Yes

Warn->Active No No No Yes

Warn->Passive No No No Yes

Passive->Warn No No No Yes

Passive->BOFF Yes Yes Yes Yes

BOFF->Active No No No Yes

Errors Yes Yes Yes No

- Other Features -

Fifo Size 6 6 6 4

TX Messages 56* 56* 56* 3

[*] There are 64 Message Buffers with 8 used for the 6-entry FIFO. There are 56 buffers available for transmission, but the Linux Driver only uses ONE of them.

You don't want to enable the Error interrupts. They can really hammer the CPU in a lot of conditions. That is why canconfig can be called as "canconfig can0 ctrlmode berr-reporting on" or "canconfig can0 ctrlmode berr-reporting off" to enable or disable this interrupt depending on whether you need to have these errors reported up to your code. I'd guess you'd turn them on when running diagnostics and leave them off "in production".

The Linux 3.4 flexcan.c source code unconditionally enabled the error interrupts, the code comment stating "otherwise we don't get any warning or bus passive interrupts". That generates lots of interrupts on common error conditions (like cable unplugged). The code checks for state changes every interrupt, so those extra Error interrupts sort-of catch more state changes than just relying on the Data, Warning and Boff interrupts alone.

Linux Driver version 3.7 added a "FLEXCAN_HAS_BROKEN_ERR_STATE" flag, set on builds for i.MX25, 35 and 53 and not on the others. When set, that (like 3.4) enables the Error interrupts unconditionally. When not set, the Error interrupts are only enabled if canconfig tells it to via an IOCTL from "berr-reporting on".

There's a comment in the code saying:

* enable the "error interrupt" (FLEXCAN_CTRL_ERR_MSK),

* on most Flexcan cores, too. Otherwise we don't get

* any error warning or passive interrupts.

To me that implies:

1 - The "broken" chips don't provide error interrupts, but they do, and

2 - The "fixed" chips do provide "bus passive interrupts", but they don't.

The only difference between the chips is the "Warning" interrupt and so that comment has been confusing me badly.

The "broken" chips only miss providing the Warning interrupts, which isn't a big a deal as all cores miss providing the Passive interrupt. That's the one you worry about as it means that chip is running in a "degraded mode" with delayed transmits and no error signalling to the bus.

Tom

andriyngvason · ‎12-01-2014

I'm not sure how they figured out that [TR]WRN_INT thing, but I think I read somewhere that it's from an errata. MSCAN actually has perfect error state transitions, but it doesn't have any bus error reporting.

I'm actually quite new to the "kernel driver" scene. I'm just trying to make error handling consistent between platforms. I figured that the quickest way to do that would be to do it myself and post the patches.

See: Gmane -- PATCH v4 1 6 can: dev: Consolidate and unify state change handling.

This issue is just something that I ran into when testing my state transition code.

Do you get a lot of bus errors in error-active mode? Do you think that it might be a good compromise to turn on bus errors only when error-warning is reached and then turn them back off when it goes back to error-active?

Feel free to comment on the mailing list.

Andri

TomE · ‎12-01-2014

> Do you get a lot of bus errors in error-active mode?

I got solid hardware overruns until I rewrote the stupid NAPI stuff.

Errors? You shouldn't be getting any errors. Any error rate indicates there's something seriously wrong with the bus, hardware, a bad connection or some bad software doing something stupid to one of the other controllers. It could also indicate a bad noise source somewhere and a bad system design that is sensitive to that noise.

Since this stuff is meant be be in a car, and work reliably without any user involvement and minimum service involvement, the states are there for the benefit of the hardware automatically performing error limiting and recovery on its own. That's what the CAN standard requires. The fact that they're indicated to the user/programmer at all is more to allow the chip designer to verify they're following the standards rather than something a Linux Driver Designer should have to worry about. Is there software out there that makes any use of the state transitions?

If you're getting errors it could also be that as-shipped (and as mentioned elsewhere in this forum) the CAN controller doesn't generate standard baud rates. By default (with the mainstream kernel and with the port we got) it is fed from the 66.66MHz clock which can't be divided evenly to standard rates. We found this and eventually managed to get a patch for arch/arm/mach-imx/clock-mx51-mx53.c to switch it to the 24MHz one which can be divided properly. That problem made it run 1% high at 1MHz. If all other devices on the bus are programmed with small Quanta and a large SJW then they might be able to handle this, but if they're programmed to expect the baud rate to be within 0.5% or better they might get data-dependent receive errors.

As far as I can tell, the only thing the driver does with the states is to disable the as-specified-in-the-standard BUS_OFF recovery, and reset the bus using a timer, which has to be separately specified as a parameter to canconfig. I don't see the advantage in doing this. It seems to add another thing to get wrong without adding any functionality. Any other state handling seems to be something the User's code has to do by opening and monitoring the error socket.

> Do you think that it might be a good compromise to turn on bus errors only when error-warning is reached and then turn them back off when it goes back to error-active?

I think the simplest and most general solution is to write the driver for the "lowest common denominator" hardware. That means handling the state machine the way it is often done in "bare metal polling loop" code, which runs (I would guess) the majority of CAN controller hardware. You run a periodic timer and you poll the REC and TEC registers and monitor for BUS_OFF. That's about it. Forget these interrupts.

I'm afraid I agree with Wolfgang (on the gmane list) when he says "Puh, that's far too sophisticated."

Your proposal would fix my current problem though. So would simple dumb polling.

Tom

andriyngvason · ‎12-02-2014

Tom Evans wrote:

Since this stuff is meant be be in a car, and work reliably without any user involvement and minimum service involvement, the states are there for the benefit of the hardware automatically performing error limiting and recovery on its own. That's what the CAN standard requires. The fact that they're indicated to the user/programmer at all is more to allow the chip designer to verify they're following the standards rather than something a Linux Driver Designer should have to worry about. Is there software out there that makes any use of the state transitions?

Imagine that you're setting up a factory. You might be setting up a standard system, or maybe it's customised. You might even need to patch it to another system. The service technician setting up the bus might not be the most competent person in the world, maybe he's new, so he forgets to terminate somewhere with a resistor. There's a lot that can go wrong when setting up these factories. Quite often the system will function normally even though the CAN bus is in a sub-optimal state. When the bus is in such a state, we might sometimes see some intermittent failure that's really hard to track down because it happens infrequently. It saves money and effort to draw attention to the fact that the bus's state is sub-optimal before things start acting out.

As far as I can tell, the only thing the driver does with the states is to disable the as-specified-in-the-standard BUS_OFF recovery, and reset the bus using a timer, which has to be separately specified as a parameter to canconfig. I don't see the advantage in doing this. It seems to add another thing to get wrong without adding any functionality. Any other state handling seems to be something the User's code has to do by opening and monitoring the error socket.

You'd have to ask Wolfgang why that is. My guess is that it's intended to increase uniformity between platforms. I.e. every platform has the same BUS_OFF behaviour.

> Do you think that it might be a good compromise to turn on bus errors only when error-warning is reached and then turn them back off when it goes back to error-active?

I think the simplest and most general solution is to write the driver for the "lowest common denominator" hardware. That means handling the state machine the way it is often done in "bare metal polling loop" code, which runs (I would guess) the majority of CAN controller hardware. You run a periodic timer and you poll the REC and TEC registers and monitor for BUS_OFF. That's about it. Forget these interrupts.

I'm afraid I agree with Wolfgang (on the gmane list) when he says "Puh, that's far too sophisticated."

Your proposal would fix my current problem though. So would simple dumb polling.

I'm pretty sure that they would like "bare metal polling" even less. Maybe put it through NAPI? You'd lose some error messages, but you're drowning anyway, so what's the difference? The error handling is actually in the NAPI poll already, but the bus error interrupts aren't turned off when the polling kicks in.

Anyway, this issue won't be addressed in my patch-set. Please post patches if you find a proper solution.

Andri.

TomE · ‎12-02-2014

> You'd have to ask Wolfgang why that is.

I've email him asking how he found out and if there's any Freescale documentation on these problems.

I should be surprised that such an old CAN controller (10 years?) still needs active driver work. It is also obviously too late to change the basic driver architecture.

> When the bus is in such a state, we might sometimes see some intermittent failure that's really

> hard to track down because it happens infrequently. It saves money and effort to draw attention

> to the fact that the bus's state is sub-optimal before things start acting out.

In principal I agree, but where is the Embedded Linux system going to report this so it can be seen by anyone? It is unlikely the Application Programmer will have written code to monitor the Error Socket and do anything sensible with it. We shouldn't exclude that possibility of course. I doubt if the service technician is going to know to run the latest version of candump (or any other tool that reads and understands messages on the error socket) in order to investigate any CAN bus problems. It is unlikely they'll be able to log into the Linux system anyway.

The CAN bus is so weird and esoteric (so unlike serial protocols, so unlike Ethernet) that I think in practice diagnosis requires dedicated test tools. Anyone with a whole factory relying on this bus had better have the proper tools and should be actively monitoring the bus with separate hardware.

The better option (for Linux Drivers) is to simply not support broken hardware, and thereby persuade design engineers to select chips that actually work properly rather than trying to work around these things.

Yeah, right. Sorry about that. What was I thinking? :-)

The "Warning Interrupt" is a later addition to the FlexCAN module. The earlier ones built into some (most?) of the ColdFire parts didn't support it. The same old version made it into the MPC5554 and I'd guess other PPC parts. The "warning interrupt" was a bolt-on (you can see where spare bits have been allocated) and added later. It doesn't even work NOW in most of the i.MX parts.

Too bad it wasn't done like MSCAN which has always had a dedicated interrupt for state changes.

A "lower common denominator" might be to read and handle state transitions on Receive and Transmit interrupts. I assume the broken parts still deliver BUS_OFF interrupts (correct me if I'm wrong please). That would miss transitions that this device isn't involved in (isn't receiving or transmitting), but they could be considered to be "someone else's problem" anyway.

Tom

andriyngvason · ‎12-03-2014

In principal I agree, but where is the Embedded Linux system going to report this so it can be seen by anyone? It is unlikely the Application Programmer will have written code to monitor the Error Socket and do anything sensible with it. We shouldn't exclude that possibility of course. I doubt if the service technician is going to know to run the latest version of candump (or any other tool that reads and understands messages on the error socket) in order to investigate any CAN bus problems. It is unlikely they'll be able to log into the Linux system anyway.

I actually wrote software to monitor the CAN bus. The thing is that I figured out that it didn't work correctly on all our platforms (mscan, flexcan, sja1000).

It basically translates "active", "warning" and "passive" to "healthy", "weak" and "critical", respectively, and displays it on the service page. It also logs state transitions.

It even monitors traffic down to CANopen node-id level.

Andri.

Missing error-passive state interrupt

Missing error-passive state interrupt

i.MX6_All

Linux