Endpoint: incorrect handling of RX errors (causes the WR Switch's SWcore to hang)
Brief description:
- The problem was observed on a WR Switch with GTP, but it is likely to occur on other nodes as well
- The problem concerns RX PCS/Path in the Endpoint
- The use case in which it occurs:
  - A WR Node is connected to a WR Switch
  - The WR Node transmits frames
  - The WR Node enters reset (e.g. its gateware is reloaded) while it is transmitting an Ethernet frame
  - The WR Switch PCS/RX path starts receiving this Ethernet frame and outputs it on the wrf_source outputs. In v5.0.1 of the WR Switch (and the respective Endpoint/wr-cores version) it never finishes the cycle. In v6.0 of the WR Switch (and the respective Endpoint/wr-cores version) it finishes the cycle, but the information about the RX error is not propagated.
The reason for this is as follows:
- The rdy_i input from the PHY is used to reset the PCS/RX path of the Endpoint.
- The err_i input from the PHY (indicating an incorrect signal) can come before or after rdy_i goes low (which resets the PCS/RX path). The PCS/RX path of the Endpoint is a pipe with a few stages, and the reset triggered by rdy_i prevents the err_i information from propagating through it. In v5.0.1 the reset also prevented the cycle output signal from going low. This was fixed for v6.0 of the switch, and the synchronizer of the rdy_i signal to the sys_clk domain added a delay to the reset. Still, the err_i signal often does not propagate outside the Endpoint (e.g. to the SWcore in the WRS), which causes problems.
In the switch, this causes the SWcore to hang on the input and/or output port (the port to which the corrupted frame is to be forwarded). It was also observed that the error propagates through the next layer(s) of switch(es) and likely causes problems in the receiving WR node (under investigation).
Note that the problem depends on the PHY (i.e. on the timing relation between the assertion of err_i and the deassertion of rdy_i). For example, the bug occurs with the "standard Virtex6 WR GTP" and does not occur with the "low phase drift Virtex6 GTP".
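The timing dependence can be illustrated with a small toy model. This is only a sketch written for this report, not the Endpoint gateware: the PCS/RX path is reduced to a shift register of error flags, and the pipe depth, the cycle numbers and the names error_reaches_output and reset_delay are all assumptions made for illustration. The reset_delay parameter loosely stands in for the extra reset delay added by the rdy_i synchronizer in v6.0.

```python
PIPE_DEPTH = 3  # assumed depth; the report only says "a few stages"


def error_reaches_output(err_cycle, rdy_low_cycle, depth=PIPE_DEPTH,
                         reset_delay=0, total_cycles=32):
    """Return True if an error flag injected at err_cycle leaves the pipe.

    Toy model: the pipe is a shift register of `depth` error flags.
    From cycle rdy_low_cycle + reset_delay onwards the pipe is held in
    reset, which clears every stage and ignores the input.
    """
    pipe = [False] * depth
    reset_from = rdy_low_cycle + reset_delay
    for cycle in range(total_cycles):
        if cycle >= reset_from:
            # Reset (driven by rdy going low) wins: flags still in flight are lost.
            pipe = [False] * depth
            continue
        if pipe[-1]:
            # The flag has reached the last stage and is visible downstream.
            return True
        # Shift by one stage; the error flag enters only at err_cycle.
        pipe = [cycle == err_cycle] + pipe[:-1]
    return False


if __name__ == "__main__":
    # Sweep the cycle at which rdy goes low, with the error fixed at cycle 2.
    for rdy_low in (3, 5, 6, 10):
        print(rdy_low, error_reaches_output(err_cycle=2, rdy_low_cycle=rdy_low))
```

In this model the error is reported only if the reset arrives after the flag has travelled through all stages. Delaying the reset widens that window but does not help when the error is asserted after the (delayed) reset has already started, which matches the observation that the v6.0 change reduces the problem without removing it.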