Make sure PCIe links are stable in the long term
This is something we at LNLS were not able to tackle before final production of 175 boards, probably because the failure has a low probability of occurring. During burn-in tests we saw issues in 5 units out of a batch of 175 AFC v3.1 boards.
One board could not establish a PCIe link on all lanes. The other 4 boards kept a stable PCIe link for many hours or even days, but eventually suffered a link drop, which was reestablished automatically a few seconds later. We replaced those boards and the new ones showed no noticeable issues at all, so we are fairly confident the problem does not come from the MCH itself.
This still happens on other boards, now over much longer periods and randomly across the units: it can take weeks or months for a failure to occur in a population of ~160 boards operating routinely 24/7.
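To correlate these sporadic drops with timestamps across many boards, the negotiated link state can be polled from Linux sysfs. This is only a sketch of one possible diagnostic, not something we have deployed: the BDF `0000:01:00.0` is a placeholder (find the real address of the AFC endpoint with `lspci`), and it relies on the standard `current_link_speed` / `current_link_width` sysfs attributes exposed by the kernel's PCI core.

```python
import time
from pathlib import Path

# Placeholder BDF of the AFC PCIe endpoint; replace with the real one
# reported by `lspci` for the slot under test.
DEVICE = "0000:01:00.0"
SYSFS = Path("/sys/bus/pci/devices") / DEVICE


def read_attr(name: str) -> str:
    """Read a sysfs attribute; 'absent' means the device is gone from the bus."""
    try:
        return (SYSFS / name).read_text().strip()
    except OSError:
        return "absent"


def link_state() -> tuple:
    """Currently negotiated speed and width, e.g. ('5.0 GT/s PCIe', '4')."""
    return (read_attr("current_link_speed"), read_attr("current_link_width"))


def monitor(polls: int, interval_s: float = 1.0) -> list:
    """Poll the link and return a timestamped log of every state change.

    A transition to ('absent', 'absent') or to a narrower width marks a
    link drop; a transition back marks the automatic retrain.
    """
    events = []
    last = link_state()
    for _ in range(polls):
        cur = link_state()
        if cur != last:
            events.append((time.strftime("%F %T"), last, cur))
            last = cur
        time.sleep(interval_s)
    return events
```

Run per board (e.g. from a cron-started service) and collect the logs centrally; that would at least tell whether drops cluster in time, per crate, or per slot.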
This may not be related to PCIe itself but rather to the AFC transceivers. The MGT signal pairs, clocks and power supplies (VTT) are the main suspects, but we are short of ideas on how to investigate them.
For AFC v4, HyperLynx simulations will be crucial to catch potential signal integrity issues on the MGT lines and reference clocks; detailed simulations of the VTT supply are also envisaged. Does anybody have other ideas on how to debug this issue, and/or perhaps provision AFC v4 with diagnostics to tackle it?
In the meantime we are trying to build a setup at LNLS to reproduce these issues (unfortunately, access to our lab is limited these days due to Covid-19). The current idea is to monitor the VTT rail with a scope.