Endpoint: incorrect error handling of rx error (causes WR switch's SW core to hang) (White Rabbit Switch, issue #228)

- We discovered a bug in the WR Switch (more precisely, in the Endpoint). It is described in:
  - Endpoint: incorrect error handling of rx error (causes WR switch's SW core to hang): https://ohwr.org/project/wr-cores/issues/90
- In short: if a WR Node is rebooted, or has its firmware reloaded, while it is in the middle of transmitting an Ethernet frame, the WR Switch connected to it experiences problems on the Rx port facing that node and on the Tx ports that were supposed to transmit the corrupted frame. The watchdog in the WR Switch detects the problem and resets the internal cores. This causes traffic on all ports of the WR Switch to stop. Thus, there is a temporary loss of any data that should be forwarded by the WR Switch.
- With low traffic (e.g. only PTP frames), the probability of this happening is extremely low (yet non-zero).
- With high traffic (e.g. BTrain transmitting at 250 kHz), the probability is non-negligible.
- In most deployments this bug is a corner case in operation: if a transmitting WR Node is reloaded, the entire system is usually non-operational by definition, so a temporary loss of communication is not a problem. BTrain is the exception because of its network configuration.
- In BTrain, the WR Switch is split into two halves using VLANs: operational and spare. The operational half forwards traffic used in operation, while the spare half is used for preparation and experiments. It is therefore likely that a WR Node connected to the spare half is reloaded or rebooted while operational traffic is going through the WR Switch, and none of that traffic may be lost.
- In v6.0 of the WR Switch firmware the problem affects only ports 13-18; in v5.0.1 it affects all ports.
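
To give a rough feel for why the probability depends so strongly on traffic load, the sketch below estimates the chance that a reboot lands in the middle of a frame. It is only a back-of-the-envelope model, not a measurement: it assumes the reboot happens at a uniformly random instant, that the link runs at 1 Gbit/s, and that the trigger window is the time the frame (plus preamble and SFD) occupies the wire. The frame sizes and the PTP frame rate are hypothetical values chosen for illustration; only the 250 kHz rate comes from the description above.

```python
LINE_RATE_BPS = 1_000_000_000  # assumed Gigabit Ethernet link
PREAMBLE_SFD_BYTES = 8         # preamble + start-of-frame delimiter


def mid_frame_probability(frames_per_second, frame_bytes):
    """Fraction of time the transmitter is mid-frame.

    Under the uniform-reboot-instant assumption this equals the
    probability that a reboot interrupts a frame in flight.
    """
    frame_time_s = (frame_bytes + PREAMBLE_SFD_BYTES) * 8 / LINE_RATE_BPS
    return min(1.0, frames_per_second * frame_time_s)


# Hypothetical low-traffic case: about 10 PTP frames/s of 100 bytes each.
low = mid_frame_probability(10, 100)

# High-traffic case: 250 kHz frame rate, assuming minimum-size 64-byte frames.
high = mid_frame_probability(250_000, 64)

print(f"low traffic:  {low:.2e}")
print(f"high traffic: {high:.2e}")
```

Under these assumptions the low-traffic case comes out on the order of 1e-5 per reboot, while the 250 kHz case is above 10 %, which is consistent with the qualitative statements in the list above.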

Edited Jan 27, 2021 by Maciej Lipinski