AnsweredAssumed Answered

i.MX53 Linux FlexCAN Driver Bus Off Recovery Broken

Question asked by TomE on Dec 21, 2015
Latest reply on Jan 6, 2016 by gusarambula

I've been having problems with the FlexCAN drivers in the (only) supported Linux Release for the i.MX53.

 

https://community.freescale.com/thread/381075

https://community.freescale.com/message/597952#597952

https://community.freescale.com/message/597951#597951

 

It can't receive, can't send and there are lots of other problems.

 

Now I find it can't recover from a Bus Off condition.

 

The code is in drivers/net/can/flexcan/drv.c in the functions flexcan_hw_busoff() and flexcan_hw_watch().

 

flexcan_hw_busoff() starts a timer which is meant to trigger a call to flexcan_hw_watch(), but the timer never triggers.

 

This is what the calling sequence to use one of these timers is meant to look like:

 

drivers/net/can/dev.c:
init_timer(&priv->restart_timer);
setup_timer(&priv->restart_timer, can_restart, (unsigned long)dev);
mod_timer(&priv->restart_timer,
              jiffies + (priv->restart_ms * HZ) / 1000);

 

Here's what the one in the FlexCAN driver looks like:

 

drivers/net/can/flexcan/dev.c:
init_timer(&flexcan->timer);
drivers/net/can/flexcan/drv.c:
flexcan->timer.function = flexcan_hw_watch;
flexcan->timer.data = (unsigned long)dev;
mod_timer(&flexcan->timer, HZ / 20);

 

I don't like the direct assignment of the timer function and data. More checking indicates this is probably OK, but a call to setup_timer() in an appropriate place would be better. The big problem is missing the "jiffies" in the mod_timer() call. Without that the timer never triggers in my testing. Maybe with a different current "jiffies" value it might trigger immediately.

 

Which means this code can't have been tested.

 

Fixing the above Timer problem doesn't get it working, but reveals a lot of other bugs:

 

static void flexcan_hw_watch(unsigned long data)
{
    unsigned int reg, ecr;
    struct net_device *dev = (struct net_device *)data;
    struct flexcan_device *flexcan = dev ? netdev_priv(dev) : NULL;

    BUG_ON(!flexcan);

    reg = __raw_readl(flexcan->io_base + CAN_HW_REG_MCR);
4>  if (reg & __MCR_MDIS) {
        if (flexcan_hw_restart(dev))
2>          mod_timer(&flexcan->timer, HZ / 20);
        return;
    }
5>  ecr = __raw_readl(flexcan->io_base + CAN_HW_REG_ECR);
6>  if (flexcan->boff_rec) {
        if (((reg & __ESR_FLT_CONF_MASK) >> __ESR_FLT_CONF_OFF) > 1) {
            reg |= __MCR_SOFT_RST;
            __raw_writel(reg, flexcan->io_base + CAN_HW_REG_MCR);
2>          mod_timer(&flexcan->timer, HZ / 20);
            return;
        }
7>      netif_carrier_on(dev);
    }
}

static void flexcan_hw_busoff(struct net_device *dev)
{
    struct flexcan_device *flexcan = netdev_priv(dev);
    unsigned int reg;

7>  netif_carrier_off(dev);

1>  flexcan->timer.function = flexcan_hw_watch;
1>  flexcan->timer.data = (unsigned long)dev;

3>  if (flexcan->boff_rec) {
2>      mod_timer(&flexcan->timer, HZ / 10);
        return;
    }
    reg = __raw_readl(flexcan->io_base + CAN_HW_REG_MCR);
    __raw_writel(reg | __MCR_SOFT_RST, flexcan->io_base + CAN_HW_REG_MCR);
2>  mod_timer(&flexcan->timer, HZ / 20);
}


 

flexcan_hw_busoff() is called from flexcan_err_handler() for a BOFF condition, from the flexcan_irq_handler(). That's the only entry into this function, and it is the only caller to flexcan_hw_watch().

 

The Bus Off Recovery can be configured by setting /sys/devices/platform/FlexCAN.0/boff_rec to "0" or "1". The documentation doesn't say what these values mean and the code doesn't help either. The Reference Manual says FLEXCANx_CTRL[BOFF_REC] set to "0" is "Automatic Recovery" where you don't have to do anything, and "1" requires "manual recovery".

 

The obvious-by-inspection bugs in the above and marked by numbers in the first column are:

 

  1. Timer Initialisation should use the proper functions. Timer initialisation is done in flexcan/dev.c, and this should probably be setup_timer(), but not in this ISR-called function.
  2. Timer setting has to be "jiffies + something" or the timer never fires. Or maybe it fires immediately, depending on what "jiffies" currently is.
  3. Code is probably meant to be checking for "automatic recovery" to run the timer but the test is backwards.
  4. Code is testing MDIS which never sets. It should be checking HALT or NOT_READY.
  5. Code is reading ECR which does nothing, and then it doesn't use the result.
  6. This code only runs in Manual Mode, but bug (3) has reset the chip in Automatic mode, so it can't recover.
  7. netif_carrier_off() is called in Automatic and Manual modes, but netif_carrier_on() is only called in Manual Mode or via flexcan_hw_restart().

 

When set to "0" the driver SHOULD monitor the FLEXCANx_ESR[FLT_CONF] bits and make the "netif_carrier" track it. It looks like it might be written to try and do that, but what it actually does is to hit the controller with a SOFT_RST and fires off the timer (which doesn't work, see above). If the timer goes off it does nothing, but that looks like some of the braces in the function are in the wrong place or the tests are reversed, as it SHOULD be calling netif_carrier_on() eventually, but doesn't.

 

When set to "1" the driver has to do some sort of manual recovery. It should turn BOFF_REC off and on again, but instead the code tries to fire off the timer (which doesn't work...). If the timer goes off it looks to see if the hardware has AUTOMATICALLY recovered (remember this code is for the MANUAL operation, so it can't), and if it did, turns the carrier back on again. If it didn't (it can't) it hits it with a SOFT_RST and triggers the timer again.

 

The SOFT_RST resets the MCR, leaving the FlexCAN controller in HALT mode. Nothing brings it out of this, ever. The only recovery is "ifconfig can0 down ; ifconfig can0 up".

 

The driver looks like it has some code in flexcan_hw_watch() that is meant to handle restart after the SOFT_RST, but instead of checking the HALT bit, it checks the MDIS bit. It then calls flexcan_hw_restart() which would clear the HALT bit, but doesn't clear the MDIS bit (which can't ever get set anyway). So that doesn't work either, and looks like the MDIS test is another bug.

 

Does anyone know the history of this code? Did it ever work for some chip in the past? Is there an older version that might have worked somehow before some unfortunate edits?

 

Tom

Outcomes