Jumbo frames dropped after a particular rate
=============== REPORT from Antonio.
We are evaluating version 6.0.1 for the WR switches and we have an issue related to packet losses with jumbo frames.
With MTU >= 2000 we can see that the switches start to lose packets. You can find attached a document (WR_Tests_6.0.1_v1_.pdf) with experiments where version 5.0.1 works fine under different conditions, but version 6.0.1 does not behave the same way. You can see the wrsw_pstat outputs.
Activity
- Author Maintainer
Reported by Antonio:
Version wr-switch-sw-v5.0.1-20170825_binaries.tar works fine, but wr-switch-sw-v6.0.1-20210705_binaries.tar fails. With MTU <= 1800 it works fine, but with MTU >= 2000 it starts to fail (see the wrsw_pstat outputs).
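A minimal way to probe this MTU threshold from a host attached to the switch might look like the sketch below; the interface name and receiver address are placeholders, and the sizes assume IPv4 ICMP (IP packet size = payload + 28 B):

```sh
# Placeholder interface; raise the host MTU first so the host itself does not fragment.
ip link set dev eth0 mtu 9000

# Non-fragmenting echoes through the switch (placeholder receiver address):
ping -M do -c 1000 -s 1772 192.168.7.2   # 1772 B payload -> 1800 B IP packet, expected to pass
ping -M do -c 1000 -s 1972 192.168.7.2   # 1972 B payload -> 2000 B IP packet, losses expected
```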
Edited by Maciej Lipinski - Author Maintainer
Just for the record: the problem seems to be a brief loss of synchronization on the receiving link. In the pstats you see Rsyn-l, which tells you that the endpoint's RX lost synchronization with the incoming data; see wr-cores/modules/wr_endpoint/ep_rx_pcs_16bit.vhd:
```vhdl
U_SYNC_DET : ep_sync_detect_16bit
  port map (
    rst_n_i  => rst_n_rx,
    rbclk_i  => phy_rx_clk_i,
    en_i     => rx_sync_enable,
    data_i   => phy_rx_data_i,
    k_i      => phy_rx_k_i,
    err_i    => phy_rx_enc_err_i,
    synced_o => rx_synced,       -- -> Rsyn-l
    cal_i    => d_is_cal);
```
This seems to happen only on the ports that support LPDC (1-12). I do not think you would see this error when transferring data between two WR switches.
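To watch the counter increment live while traffic is flowing, something like the loop below (run on the switch) should work, assuming wrsw_pstats with no arguments dumps all per-port counters as in the attached outputs:

```sh
# Dump the counters once per second and show only the lines that changed;
# the Rsyn-l counter of the receiving port should grow while frames are lost.
wrsw_pstats > /tmp/pstats.prev
while true; do
  sleep 1
  wrsw_pstats > /tmp/pstats.now
  diff /tmp/pstats.prev /tmp/pstats.now
  mv /tmp/pstats.now /tmp/pstats.prev
done
```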
Edited by Maciej Lipinski - Author Maintainer
Reported by Antonio:
These are the results using Topology="2>1" (only 1 sender and 1 receiver) with version 6.0.1. The switch loses packets. 4 tests (pdf test_WR_6_20230405.pdf):
- MTU 8972 and bandwidth 90 MB/s: it loses > 91% of packets.
- MTU 8972 and bandwidth 43 MB/s: it loses > 89% of packets.
- MTU 8972 and bandwidth 17 MB/s: it loses > 83% of packets.
- MTU 2000 and bandwidth 4 MB/s: it loses > 16% of packets.
These are the measurements obtained with one WRS to check the ports (pdf with full wrsw_pstats: WR_Tests_FW6_20230406.pdf):
- You are right, the affected ports are 1-12. Ports 13-18 work fine.
- Full duplex loses more packets than half duplex.
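For reference, the four tests listed above could be approximated with plain iperf UDP streams; the host address is a placeholder and iperf2-style options are assumed (note iperf takes the rate in bit/s, so 90 MB/s is roughly 720M):

```sh
# receiver
iperf -s -u -i 1

# sender: 8972 B UDP datagrams at ~90 MB/s for 60 s
# (adjust -l so the resulting IP packet fits the configured MTU)
iperf -c 192.168.7.2 -u -l 8972 -b 720M -t 60
```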
Edited by Maciej Lipinski - Author Maintainer
Here is a proposed set of experiment scenarios: KM3Net_debugging_v2.pptx
The presentation includes commands to configure VLANs. I did not test this configuration and I might have made mistakes, so please review it carefully.
Please note that the WR switch is optimized for broadcast and multicast traffic within a VLAN. Broadcast optimization is enabled by default; multicast addresses need to be explicitly configured.
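In case it helps to script the broadcast part of these scenarios from a Linux host, here is a minimal sketch; the broadcast address, port and payload size are placeholders, and each socat invocation reads the whole payload in one chunk and sends it as a single UDP datagram:

```sh
# Send 20000 UDP broadcast datagrams with a 1972 B payload (a 2000 B IP packet).
# This forks socat once per packet, so it is only suitable for low rates.
for i in $(seq 1 20000); do
  head -c 1972 /dev/zero | socat -u - UDP-DATAGRAM:192.168.7.255:5000,broadcast
done
```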
- Maciej Lipinski mentioned in issue wr-switch-sw#277 (closed)
- Reporter
These are the results for the scenario with broadcast packets from WRS2.port18:
**MTU 1400: 20000 UDP broadcast packets from WRS2.port18**
WRS1: WRS2:
**MTU 2000: 20000 UDP broadcast packets from WRS2.port18**
WRS1: WRS2:
**MTU 4000: 20000 UDP broadcast packets from WRS2.port18**
WRS1: WRS2:
**MTU 8000: 20000 UDP broadcast packets from WRS2.port18**
WRS1: WRS2:
- Maintainer
It was possible to reproduce the frame loss using a Xena tester with pre-6.1 WRS firmware. VLANs were not used.
The generated traffic was from wri1 to wri18, frame size 4000 B. With jumbo frames enabled (EP_RFCR_A_GIANT set in the RFCR register) but the MRU not changed (0x800), at a rate of ~10 pkt/s it was possible to observe link down/link up every 10-20 seconds.
As the packet rate increases, the time between link restarts decreases.
With the MRU set to 9000 B (devmem 0x10030008 32 0x02328002), link down/up is not observed at 10 pkt/s and 100 pkt/s, but it is observed at 1000 pkt/s (values between 100 and 1000 were not tested). The conclusion is that changing the MRU helps a little, but does not solve the problem.
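For the record, the devmem value above is consistent with a 9000 B MRU shifted into the upper field of RFCR with the A_GIANT bit kept set; this decomposition is my reading of the value, not something checked against the wbgen register map:

```sh
# (9000 << 12) is the MRU field, 0x2 keeps EP_RFCR_A_GIANT set -> prints 0x02328002
printf '0x%08X\n' $(( (9000 << 12) | 0x2 ))
```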
Notifications about link down/up in syslog:
```
2023-05-09T11:11:10.611734+00:00 ctdwa-774-cxnatest1.cern.ch kernel: wri1: Link up, lpa 0x4020.
2023-05-09T11:11:10.616078+00:00 ctdwa-774-cxnatest1.cern.ch hald: <30>Info (/wr/bin/wrsw_hal):wri1: Link state change detected: was down, is up
2023-05-09T11:11:29.493004+00:00 ctdwa-774-cxnatest1.cern.ch hald: <30>Info (/wr/bin/wrsw_hal):port_fsm_state_link_down:wri1: bitslide= 0 [ps]
2023-05-09T11:11:35.210656+00:00 ctdwa-774-cxnatest1.cern.ch kernel: wri1: Link down.
```
Traffic between non-LPDC ports (e.g. from wri18 to wri17) shows no problems at 90% link utilization (4000 B * 27985 pkt/s, roughly 0.9 Gbit/s).
- Adam Wujek added Done label
- Author Maintainer
Fixed with commit 35e9419 to wr-cores: wr-cores@35e94190
- Maciej Lipinski closed
- Maciej Lipinski added v7.0 label