wrong timing after node reset (bitslide not updated at the node reset)

Description: After a node is reset, its timing is sometimes wrong by a few ns. The observed granularity of the time shift is 400ps. EDIT: The problem was observed only when more than one node was restarted at the same time.

Explanation: When a node is reset, the link goes down for a very short period of time. If the down time is shorter than HAL's poll period (or, to be precise, if HAL does not perform a check while the link is down), HAL does not notice the link down, although the kernel does, and the new bitslide is not read from the endpoint. As a result, the timing can be wrong by up to 7.6ns (with a granularity of 400ps).
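
The effect of the poll period can be illustrated with a minimal sketch. It assumes a simplified model in which HAL samples the link state at fixed intervals and notices an outage only if a sample lands inside it. The function name and the 1000 ms poll period used below are hypothetical and are not taken from the HAL sources; the 600 ms down time is the one visible in the kernel log in this report.

```python
def poll_sees_link_down(down_start_ms, down_len_ms, poll_period_ms):
    """Return True if a periodic poll lands inside the link-down window.

    Simplified model (an assumption, not the real HAL code): the poller
    samples the link state at t = 0, poll_period_ms, 2 * poll_period_ms, ...
    and notices the outage only if one of those samples falls in
    [down_start_ms, down_start_ms + down_len_ms).
    """
    # First poll tick at or after the moment the link went down
    # (integer ceiling division).
    first_tick = -(-down_start_ms // poll_period_ms) * poll_period_ms
    return first_tick < down_start_ms + down_len_ms


# 600 ms outage starting 132 ms after a poll, hypothetical 1000 ms period:
# the next poll happens after the link is already back up.
missed = not poll_sees_link_down(132, 600, 1000)
```

In this model an outage shorter than the poll period is caught only when it happens to straddle a poll tick, which would be consistent with the bug appearing only sometimes.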

Every version (HW and SW) of the WRS is affected by this bug. The only exception is ports running LPDC, which always have the same bitslide (0).

How to know if your board is affected?

Look in your syslog for link down/up notifications from the kernel, like:

2021-07-28T21:36:04.132290+00:00 wrs1 kernel: wri14: Link down.
2021-07-28T21:36:04.732285+00:00 wrs1 kernel: wri14: Link up, lpa 0x4020.

There should also be an entry from HAL like the one below for every port that went down and up:

2021-07-28T21:36:24.187741+00:00 wrs1 hald: <30>Info    (/wr/bin/wrsw_hal):port_fsm_state_link_down:wri14: bitslide= 4000 [ps]

If the entries from HAL are missing, your system/card is affected by this bug.
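
The syslog check above can be automated on a workstation. This is a minimal sketch: the regular expressions are modeled only on the sample log lines in this report, so other firmware versions may format the messages differently, and `ports_missing_bitslide` is a hypothetical helper, not part of the WRS tools.

```python
import re

# Patterns modeled on the sample log lines in this report (assumption:
# other releases may word these messages differently).
KERNEL_UP = re.compile(r"kernel: (wri\d+): Link up")
HAL_BITSLIDE = re.compile(
    r"hald: .*port_fsm_state_link_down:(wri\d+): bitslide="
)


def ports_missing_bitslide(lines):
    """Return ports whose kernel 'Link up' was never followed by a
    HAL bitslide entry later in the log."""
    pending = []
    for line in lines:
        m = KERNEL_UP.search(line)
        if m:
            if m.group(1) not in pending:
                pending.append(m.group(1))
            continue
        m = HAL_BITSLIDE.search(line)
        if m and m.group(1) in pending:
            pending.remove(m.group(1))
    return pending


SAMPLE = [
    "2021-07-28T21:36:04.132290+00:00 wrs1 kernel: wri14: Link down.",
    "2021-07-28T21:36:04.732285+00:00 wrs1 kernel: wri14: Link up, lpa 0x4020.",
    "2021-07-28T21:36:24.187741+00:00 wrs1 hald: <30>Info    "
    "(/wr/bin/wrsw_hal):port_fsm_state_link_down:wri14: bitslide= 4000 [ps]",
]

healthy = ports_missing_bitslide(SAMPLE)        # HAL entry present
affected = ports_missing_bitslide(SAMPLE[:2])   # HAL entry missing
```

With all three sample lines the list is empty; with only the two kernel lines it contains `wri14`, which is the pattern that indicates the bug.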

If you think that your system currently has wrong timing on port X (e.g. 14), log in to the switch (master) and check the semistaticLatency of the instance running on port X. If you use the standard configuration, the instance number is the port number minus 1 (here 14-1=13):

#wrs_dump_shmem | grep ppsi.inst.13.info.timestampCorrectionPortDS.semistaticLatency
ppsi.inst.13.info.timestampCorrectionPortDS.semistaticLatency: 3.200

The returned value is in ns.

Then check register R16 of the port's endpoint:

#wr_phytool 14 dump | grep R16
R16 = 0x00000040

Compute (R16 >> 4) * 800(ps) from the result, here (0x40 >> 4) * 800 = 3200(ps). If the result does not match the value of semistaticLatency read above, your system is affected by this bug. In this example the two values match (3200ps = 3.200ns), so the port is not affected.
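
The comparison can be scripted as follows. This is a minimal sketch that assumes the two tools print lines in exactly the format shown above; the function names are hypothetical.

```python
def r16_from_phytool_line(line):
    """Parse a line such as 'R16 = 0x00000040' into an integer."""
    return int(line.split("=", 1)[1].strip(), 16)


def latency_ns_from_shmem_line(line):
    """Parse the value (in ns) from a '<key>: <value>' shmem dump line."""
    return float(line.rsplit(":", 1)[1])


def bitslide_ps_from_r16(r16):
    """Apply the formula from this report: (R16 >> 4) * 800 ps."""
    return (r16 >> 4) * 800


def bitslide_matches(r16, semistatic_latency_ns):
    """True if the endpoint bitslide equals the latency used by PPSi."""
    return bitslide_ps_from_r16(r16) == round(semistatic_latency_ns * 1000)


r16 = r16_from_phytool_line("R16 = 0x00000040")
latency_ns = latency_ns_from_shmem_line(
    "ppsi.inst.13.info.timestampCorrectionPortDS.semistaticLatency: 3.200"
)
port_ok = bitslide_matches(r16, latency_ns)
```

For the values shown in this report `port_ok` is true; a register value of 0x50 (4000 ps) against the same 3.200 ns latency would indicate an affected port.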

Workaround:

Increase the PHY down time of your node.
Bringing the interface down and up with ifconfig triggers a re-read of the bitslide.

Solution:

Have the kernel notify HAL about link down/up events.

Edited Nov 16, 2022 by Adam Wujek
