@magnus3point here is absolutely correct. I did some investigation in the kernel regarding the bridge connection and gained some insights:
- Yes, the whole bridging / VLAN implementation is purely software and not really hardware dependent.
- However, at some point during bridging, the MAC driver (hardware dependent) can be called to apply a certain VLAN configuration. That's exactly where the error happens: the EQoS MAC driver behind eth1 misbehaves.
My kernel debugging showed the following:
- When running ip link set eth1 master br0, net/bridge/br_vlan.c calls net/8021q/vlan_core.c:vlan_vid_add, which internally calls vlan_add_rx_filter_info:
    static int vlan_add_rx_filter_info(struct net_device *dev, __be16 proto, u16 vid)
    {
        if (!vlan_hw_filter_capable(dev, proto))
            return 0;
        if (netif_device_present(dev))
            return dev->netdev_ops->ndo_vlan_rx_add_vid(dev, proto, vid);
        else
            return -ENODEV;
    }
- For eth0, the first if matches (eth0 has no VLAN hardware filter), so the function returns 0, i.e. success.
- For eth1 (EQoS), the MAC driver's implementation of ndo_vlan_rx_add_vid gets called, which is drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:stmmac_vlan_rx_add_vid()
- That's where it gets interesting: from this function the call chain continues down into the dwmac4 driver, specifically drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c:dwmac4_update_vlan_hash(). This function does a bunch of writel calls (I believe these write to the chip's MAC register addresses, but I might be wrong)
- Now here, there are 3 possibilities:
  - If the interface is down (ip l set eth1 down): the kernel hangs on the first writel call. It just hangs there forever and the whole system is frozen.
  - If the interface is up (but there is no link carrier): the writel operations go through, and dwmac4 eventually writes the hardware filter values in dwmac4_write_vlan_filter. However, the read-back never confirms the values just written, so it prints the line "Timeout accessing MAC_VLAN_Tag_Filter" and returns -EBUSY.
  - If the interface is up and there is link carrier: everything succeeds!
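To make the second case more concrete, here is a minimal userspace model of a write-then-poll pattern that ends in a timeout like that one. It is not the dwmac4 code: fake_reg, fake_readl, fake_writel, fake_write_filter, FAKE_BUSY_BIT and POLL_ATTEMPTS are all made-up names, and I'm assuming the driver waits for some kind of busy flag to clear before it reports success. The hang from the first case is not modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_BUSY_BIT (1u << 0)  /* hypothetical "operation busy" flag */
#define POLL_ATTEMPTS 10         /* hypothetical retry budget */

/* Stand-in for a memory-mapped register. busy_reads is the number of
 * reads that still report busy; a negative value means "never clears". */
struct fake_reg {
    uint32_t value;
    int busy_reads;
};

static void fake_writel(struct fake_reg *reg, uint32_t v)
{
    reg->value = v;
}

static uint32_t fake_readl(struct fake_reg *reg)
{
    if (reg->busy_reads < 0)
        return reg->value | FAKE_BUSY_BIT;   /* stuck busy forever */
    if (reg->busy_reads > 0) {
        reg->busy_reads--;
        return reg->value | FAKE_BUSY_BIT;   /* still busy this time */
    }
    return reg->value & ~FAKE_BUSY_BIT;      /* operation finished */
}

/* Write the data with the busy flag set, then poll until the (fake)
 * hardware clears the flag or the retry budget runs out. */
static int fake_write_filter(struct fake_reg *reg, uint32_t data)
{
    assert(reg != NULL);

    fake_writel(reg, data | FAKE_BUSY_BIT);

    for (int i = 0; i < POLL_ATTEMPTS; i++) {
        if (!(fake_readl(reg) & FAKE_BUSY_BIT))
            return 0;
    }
    return -EBUSY;  /* the point where a driver would log a timeout */
}
```

With a register that clears after a few reads the function returns 0; with one that never clears, it returns -EBUSY, which matches what I see with eth1 up and no carrier.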
Summing up, there's something very wrong in these write operations that's causing this weird freezing. I took a quick look at the writel implementation, but there are a bunch of different definitions in io.h.
If this freezing is not caused by the MAC driver itself, then I think the driver should at least avoid performing those writes while the link is down. It's a simple check, which I added on top of the Android patch mentioned.
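Roughly, the idea of such a check looks like the sketch below. This is a userspace model and not my actual patch: fake_netdev, fake_update_vlan_hash and guarded_vlan_rx_add_vid are made-up names, the carrier field stands in for what netif_carrier_ok() reports in the kernel, and returning 0 when the writes are skipped is just an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct fake_netdev {
    bool carrier;   /* models the in-kernel link carrier state */
    int hw_writes;  /* counts how often the "hardware" was touched */
};

/* Stand-in for the register writes done by the MAC driver. */
static int fake_update_vlan_hash(struct fake_netdev *dev)
{
    dev->hw_writes++;
    return 0;
}

/* Only touch the hardware when the link is up. Skipping silently with 0
 * is an assumption; a real driver might prefer an error code instead. */
static int guarded_vlan_rx_add_vid(struct fake_netdev *dev)
{
    assert(dev != NULL);

    if (!dev->carrier)
        return 0;   /* link down: do not perform the register writes */

    return fake_update_vlan_hash(dev);
}
```

With the link down, nothing is written and the caller still gets a success code, so the bridge setup can continue.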
Now regarding userspace: systemd-networkd bridges the interfaces while they are still down, which is why the system freezes early in systemd boot. NetworkManager, on the other hand, is a bit more clever: it doesn't bridge the interfaces right away, but waits until the interface gets carrier (connected link) before performing the master operation.
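As a userspace workaround in the same spirit, one could check the carrier state before enslaving the interface. Below is a small sketch that reads a sysfs-style carrier file; read_carrier is a made-up helper, and I'm assuming the usual /sys/class/net/<iface>/carrier attribute, which contains 1 or 0. As far as I know, reading it on an interface that is down fails, which this helper reports as -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the file reports carrier, 0 if it reports no carrier,
 * and -1 if the file cannot be opened, read or understood. */
static int read_carrier(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int c = fgetc(f);  /* the first character is enough: '0' or '1' */
    fclose(f);

    if (c == '1')
        return 1;
    if (c == '0')
        return 0;
    return -1;         /* read error, empty file or unexpected content */
}
```

A script or small daemon could poll read_carrier("/sys/class/net/eth1/carrier") and only run ip link set eth1 master br0 once it returns 1.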