Commit 7f4c52ad authored by Adam Wujek

docs/specifications/management: move wrs_failures to wr-switch-sw repo

Move done to keep document synced with source.
Signed-off-by: Adam Wujek <adam.wujek@cern.ch>
parent 0855671b
all : wrs_failures.pdf
.PHONY : all clean
wrs_failures.pdf : wrs_failures.tex fail.tex intro.tex snmp_exports.tex
	pdflatex -dPDFSETTINGS=/prepress -dSubsetFonts=true -dEmbedAllFonts=true -dMaxSubsetPct=100 -dCompatibilityLevel=1.4 $^
	pdflatex -dPDFSETTINGS=/prepress -dSubsetFonts=true -dEmbedAllFonts=true -dMaxSubsetPct=100 -dCompatibilityLevel=1.4 $^
clean :
	rm -f *.eps *.pdf *.dat *.log *.out *.aux *.dvi *.ps *.toc
\subsection{Timing error}
We define a timing error as the WR Switch being unable to provide its slave
nodes/switches with correct timing information consistent with the rest of the
WR network.\\
\noindent Faults leading to a timing error:
\begin{enumerate}
\item {\bf \emph{PTP/PPSi} went out of \texttt{TRACK\_PHASE}}
\label{fail:timing:ppsi_track_phase}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{Slave}
\item [] \underline{Description}:\\
If the \emph{PTP/PPSi} WR servo goes out of the \texttt{TRACK\_PHASE} state,
the switch has lost synchronization to its Master.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::ptpServoState}\\
\texttt{WR-SWITCH-MIB::ptpServoStateN}
%ppsiServoStateN shall contain the state as an integer taken from ppsi shm
\item [] \underline{Note}: we should also monitor PTP/PPSi state inside the
switch to build up the general WRS status word.
\end{packed_enum}
\item {\bf Offset jump not compensated by Slave}
\label{fail:timing:offset_jump}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{Slave}
\item [] \underline{Description}:\\
This may happen if the Master resets its WR time counters (e.g. because it
lost the link to its Master higher in the hierarchy or to an external
clock), but the Slave switch does not follow the jump.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::ptpClockOffsetPs}\\
\texttt{WR-SWITCH-MIB::ptpClockOffsetPsHR}
\item [] \underline{Note}: the HR version is a 32-bit signed value of the offset, saturating on overflow and underflow.
\end{packed_enum}
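The saturation mentioned in the note above can be sketched as follows. This is
only an illustration, not the code used by the SNMP agent: the function name
\texttt{saturate\_int32} is hypothetical, and we assume the full-range offset
is available as a wider integer before it is clamped.
\begin{lstlisting}[language=Python]
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def saturate_int32(offset_ps):
    """Clamp an offset in picoseconds to the signed 32-bit range.

    Values above INT32_MAX or below INT32_MIN are pinned to the limit
    instead of wrapping around (hypothetical helper, for illustration).
    """
    return max(INT32_MIN, min(INT32_MAX, offset_ps))
\end{lstlisting}
With saturation, a reading equal to either limit only tells the operator that
the real offset is at least that large.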
\item {\bf Detected jump in the RTT value calculated by \emph{PTP/PPSi}}
\label{fail:timing:rtt_jump}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{Slave}
\item [] \underline{Description}:\\
Once a WR link is established, the round-trip delay (RTT) can only change
smoothly due to temperature variations. If a sudden jump is detected, that
means an erroneous timestamp was generated on either the Master or the Slave
side. One cause of that could be a wrong value of the t24p transition point.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::ptpRTT}
\item [] \underline{Note}: we should also monitor RTT variations inside
the switch to build up the general WRS status word.
\end{packed_enum}
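A possible way to tell a sudden jump from smooth thermal drift is to compare
consecutive \texttt{ptpRTT} readings against a maximum allowed step. The sketch
below is an assumption about how a monitoring script could do this; the
function name and the \texttt{max\_step\_ps} threshold are hypothetical and
would have to be tuned to the polling period.
\begin{lstlisting}[language=Python]
def find_rtt_jumps(rtt_ps, max_step_ps):
    """Return the indices of samples that jumped by more than max_step_ps.

    rtt_ps      -- successive RTT readings in picoseconds
    max_step_ps -- largest change between two polls still treated as
                   smooth drift (hypothetical threshold)
    """
    return [i for i in range(1, len(rtt_ps))
            if abs(rtt_ps[i] - rtt_ps[i - 1]) > max_step_ps]
\end{lstlisting}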
\item {\bf Wrong $\Delta_{TXM}$, $\Delta_{RXM}$, $\Delta_{TXS}$,
$\Delta_{RXS}$ values are reported to the \emph{PTP/PPSi} daemon}
\label{fail:timing:deltas_report}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If \emph{PTP/PPSi} doesn't get the correct values of fixed hardware delays,
it won't be able to calculate a proper Master-to-Slave delay. Although
the estimated offset in \emph{PTP/PPSi} is close to 0, WRS won't be
synchronized to Master with the sub-nanosecond accuracy.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::ptpDeltaTxM.<n>}\\
\texttt{WR-SWITCH-MIB::ptpDeltaRxM.<n>}\\
\texttt{WR-SWITCH-MIB::ptpDeltaTxS.<n>}\\
\texttt{WR-SWITCH-MIB::ptpDeltaRxS.<n>}
\end{packed_enum}
\item {\bf \emph{SoftPLL} became unlocked}
\label{fail:timing:spll_unlock}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on SoftPLL mem read)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If \emph{SoftPLL} loses lock for any reason, a Slave or Grand Master
switch can no longer be syntonized and phase aligned with its time
source. A WRS in Free-running mode without a properly locked Helper PLL is
not able to perform the reliable phase measurements needed to enhance Rx
timestamp resolution. For a Grand Master, the reason for \emph{SoftPLL}
going out of lock might be disconnected 1-PPS/10MHz signals or the external
clock being down. In that case, the switch goes into Free-running mode and
resets WR time. Later we will have a holdover to keep the Grand Master
switch disciplined in case it loses its external reference.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}\\
\texttt{WR-SWITCH-MIB::spllMode}\\
\texttt{WR-SWITCH-MIB::spllSeqState}\\
\texttt{WR-SWITCH-MIB::spllAlignState}\\
\texttt{WR-SWITCH-MIB::spllHlock}\\
\texttt{WR-SWITCH-MIB::spllMlock}\\
\texttt{WR-SWITCH-MIB::spllDelCnt}
\item [] \underline{Note}: The idea to export the status from LM32 is to
place a structure with all these values under a fixed address in the
memory and read it from Linux.
\end{packed_enum}
\item {\bf \emph{SoftPLL} has crashed/restarted}
\label{fail:timing:spll_crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on SoftPLL mem read), (requires changes in LM32 software)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If the LM32 software crashes or restarts for some reason, its state may be
either reset or random (if for some reason variables were overwritten
with junk values). In such a case the PLL becomes unlocked and the switch is
not able to provide synchronization to other devices.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}\\
\texttt{WR-SWITCH-MIB::spllIrqCnt}
\item [] \underline{Note}: we need to have a similar mechanism as in the
\emph{wrpc-sw} to detect if the LM32 program has restarted because of
the CPU following a NULL pointer. If it occurs, we need to export this
information through Mini-IPC/HAL. In addition to that, we can detect if
\emph{SoftPLL} is hanging (but not restarted) based on irq counter.
\end{packed_enum}
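The hang detection based on the irq counter could look like the sketch below.
It assumes that \texttt{spllIrqCnt} increases between two polls whenever
\emph{SoftPLL} is alive; the function name and the \texttt{stalled\_polls}
parameter are hypothetical.
\begin{lstlisting}[language=Python]
def spll_is_hanging(irq_counts, stalled_polls=2):
    """Report a hang when the irq counter stopped increasing.

    irq_counts    -- successive readings of the SoftPLL irq counter
    stalled_polls -- number of consecutive polls without any change
                     required before a hang is reported (hypothetical)
    """
    if len(irq_counts) < stalled_polls + 1:
        return False  # not enough history to decide yet
    tail = irq_counts[-(stalled_polls + 1):]
    return all(count == tail[0] for count in tail)
\end{lstlisting}
Requiring more than one stalled poll avoids a false alarm when two reads happen
to fall very close together.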
\item {\bf Link to WR Master is down}
\label{fail:timing:master_down}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR (will become WARNING with the
switch-over)
\item [] \underline{Mode}: \emph{Slave}
\item [] \underline{Description}:\\
In that case, WR Switch loses timing reference, resets counters
responsible for keeping the WR time, and starts operating in a
Free-Running Master mode.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::portLink.<n>}\\
\texttt{WR-SWITCH-MIB::portMode.<n>}
\end{packed_enum}
\item {\bf PTP frames don't reach ARM}
\label{fail:timing:no_frames}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on ppsi shm?)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
In this case, \emph{PTP/PPSi} will fail to stay synchronized and to provide
synchronization. Even if the WR servo is in the \texttt{TRACK\_PHASE} state,
it calculates a new phase shift based on the Master-to-Slave delay
variations. To calculate these variations, it still needs timestamped
PTP frames flowing. There could be several causes of such a fault:
\begin{itemize}
\item HDL problem (e.g. SwCore or Endpoint hanging)
\item \emph{wr\_nic.ko} driver crash
\item wrong VLAN configuration
\end{itemize}
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::portPtpTxFrames.<n>} \emph{(not implemented)}\\
\texttt{WR-SWITCH-MIB::portPtpRxFrames.<n>} \emph{(not implemented)}\\
\texttt{WR-SWITCH-MIB::portLink.<n>} \emph{(implemented)}\\
\texttt{WR-SWITCH-MIB::portMode.<n>} \emph{(implemented)}
\item [] \underline{Note}: If the kernel driver crashes, there is not much
we can do. We end up with either our system frozen or a reboot. For
wrong VLAN configuration and HDL problems we can monitor if PTP frames
are flowing on Slave port(s) of WRS and raise an alarm (change status
word) if they don't flow anymore. We should combine this with the link
status (up/down). If VLANs are misconfigured, we don't receive PTP
frames, but the link is still up. This could let us distinguish from a
lack of frames due to the link down (which is a separate issue).
\end{packed_enum}
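The combination of the frame counter with the link status described in the
note above can be sketched as follows. The function and the returned labels
are hypothetical; the inputs stand for \texttt{portLink.<n>} and two
successive readings of the (not yet implemented)
\texttt{portPtpRxFrames.<n>} counter on a Slave port.
\begin{lstlisting}[language=Python]
def classify_slave_port(link_up, rx_prev, rx_curr):
    """Classify a Slave port from its link state and PTP Rx counter.

    link_up -- True if the link is reported as up
    rx_prev -- PTP Rx frame counter at the previous poll
    rx_curr -- PTP Rx frame counter at the current poll
    """
    if not link_up:
        return "link_down"      # separate issue, reported elsewhere
    if rx_curr == rx_prev:
        return "no_ptp_frames"  # link is up, but no PTP frames arrive
    return "ok"
\end{lstlisting}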
\item {\bf Detected SFP not supported for WR timing}
\label{fail:timing:wrong_sfp}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
By an SFP not supported for WR timing we mean a transceiver that doesn't
have the \emph{alpha} parameter and fixed hardware delays defined in the
SFP database (\emph{/wr/etc/sfp\_database.conf}). The consequence is that
\emph{PTP/PPSi} does not have the right values to estimate link asymmetry.
Despite the \emph{PTP/PPSi} offset being close to 0 \emph{ps}, the device won't
be properly synchronized.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::portSfpVN.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpPN.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpVS.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpInDB.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpGbE.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpError.<n>}
\item [] \underline{Note}: the WRS configuration allows this check to be disabled on some ports.
That is because ports may be used for regular (non-WR) PTP
synchronization or for data transfer only (no timing). In that case any
Gigabit SFP can be used (also copper). Detecting if a non-Gigabit
Ethernet SFP is plugged into the cage is covered in a separate issue
\ref{fail:other:sfp} in section \ref{sec:other_fail}.
\end{packed_enum}
\item {\bf \emph{PTP/PPSi} process has crashed/restarted}
\label{fail:timing:ppsi_crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on monit)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If the \emph{PTP/PPSi} daemon crashes we lose any synchronization
capabilities. If, in the future, we have another process that can
bring \emph{PTP/PPSi} back to life, such a restart would still create a time
jump and has to be reported.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::ptpRunCnt} \emph{(not implemented)}\\
\texttt{HOST-RESOURCES-MIB::hrSWRunName.<x>} \emph{(implemented)}
\item [] \underline{Note}: the list of processes has to be monitored to check
whether \emph{PTP/PPSi} is there and whether its PID has changed (it was restarted).
\end{packed_enum}
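The process check from the note above can be sketched as follows. This is a
simplified illustration, assuming the process list has already been read (e.g.
from \texttt{hrSWRunName}) into a mapping from process name to PID; the
function name and the returned labels are hypothetical.
\begin{lstlisting}[language=Python]
def check_daemon(name, prev_pid, processes):
    """Compare the current PID of a daemon with the last known one.

    name      -- process name to look for
    prev_pid  -- PID seen at the previous poll, or None on the first poll
    processes -- mapping of process name to PID for the current poll
    """
    pid = processes.get(name)
    if pid is None:
        return "missing"    # daemon is not running at all
    if prev_pid is not None and pid != prev_pid:
        return "restarted"  # same name, new PID
    return "running"
\end{lstlisting}
The same check applies to any other monitored daemon.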
\item {\bf \emph{HAL} process has crashed/restarted}
\label{fail:timing:hal_crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on monit)}
\item [] \underline{Severity}: WARNING (but only after we modify PTP/PPSi so
it reconnects to HAL, and HAL does not re-initialize SoftPLL after
crash)
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If \emph{HAL} crashes, \emph{PTP/PPSi} is not able to communicate with
hardware i.e. read phase shift, get timestamps, phase shift the clock
etc.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::halRunCnt} \emph{(not implemented)}\\
\texttt{HOST-RESOURCES-MIB::hrSWRunName.<x>} \emph{(implemented)}
\item [] \underline{Note}: the list of processes has to be monitored to check
whether \emph{wrsw\_hal} is there and whether its PID has changed (it was restarted).
\end{packed_enum}
\item {\bf Wrong configuration applied}
\label{fail:timing:wrong_config}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(to be done later)}
\item [] \underline{Severity}: WARNING
\item [] \underline{Mode}: \emph{all}
\item [] \underline{Description}:\\
If there is a wrong configuration applied to the \emph{PTP/PPSi} or HAL
(i.e. wrong fixed delays, mode of operation etc.) there is not much we
can do. The responsibility of WR experts (or person deploying the
system) is to make sure that all the devices have correct initial
configuration. Later we can only generate warnings, if the key
configuration options are changed remotely (e.g. Grand Master mode to
Free-running Master or updated fixed hardware delays values).\\
For misconfigured VLANs, we can monitor if PTP frames are flowing on
Slave port(s) of the switch.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: monitor remote updates of key configuration
options (PTP/WR mode, fixed hardware delays)
\end{packed_enum}
\item {\bf Switchover failed}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: ERROR
\item [] \underline{Mode}: \emph{Slave}, \emph{Grand Master}
\item [] \underline{Description}: \emph{(not yet implemented)}\\
In case the primary timing link breaks, switchover is responsible for
seamless switching to the backup one to keep the device in sync. If WRS
operates in a \emph{Slave} mode, switchover is about switching
between two (or more) WR links to one or multiple WR Masters. If it
operates in a \emph{Grand Master} mode, it is about broken/lost
connection to an external reference and switching to a backup WR link
(another WR Master). Regardless of the configuration, if we fail to
switch-over to a backup link (e.g. because it is down), WRS resets
the time counters and continues the operation as a Free-Running Master.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: we should probably use parameters reported by
the backup channel(s) of the SoftPLL and the backup PTP servo to be able
to detect and report that something went wrong.
\end{packed_enum}
\item {\bf Holdover for too long}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: WARNING
\item [] \underline{Mode}: \emph{Grand Master}
\item [] \underline{Description}: \emph{(not yet implemented)}\\
Signaling active holdover is one thing, but if a Grand Master switch is
kept in sync with holdover for too long, it might drift away from the
ideal external reference too much. All devices in the WR network will
still be synchronized to each other, but no longer in sync with the external reference.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\end{packed_enum}
\end{enumerate}
\newpage
\subsection{Data error}
We define a data error as the WR Switch being unable to forward Ethernet traffic
between devices connected to the ports.\\
\noindent Faults leading to a data error:
\begin{enumerate}
\item {\bf Link down}
\label{fail:data:link_down}
\begin{packed_enum}
\item [] \underline{Status}: DONE \emph{(to be changed later for switchover)}
\item [] \underline{Severity}: ERROR (will be WARNING with the
switch-over)
\item [] \underline{Description}:\\
This obviously stops the flow of frames on an Ethernet port and there is
not much we can do besides reporting an error. Topology redundancy is a
cure for that (if backup link is fine, and reconfiguration does not
fail). There might be several causes of a link down:
\begin{itemize}
\item unplugged fiber
\item broken fiber
\item broken SFP
\item wrong (non-complementary) pair of WDM SFPs is used
\end{itemize}
However, we are not able to distinguish between them inside the switch.
\item [] \underline{SNMP objects}:\\
\texttt{IF-MIB::ifOperStatus.<n>}\\
\texttt{WR-SWITCH-MIB::portLink.<n>}
\end{packed_enum}
\item {\bf Fault in the Endpoint's transmission/reception path}
\label{fail:data:ep_txrx}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
This fault covers various errors reported by the Endpoint, e.g. FIFO
underrun in the Tx PCS or FIFO overrun in the Rx PCS, receiving invalid
\emph{8b10b} code, CRC error etc.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.1} - Tx PCS FIFO underrun\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.2} - Rx PCS FIFO overrun\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.3} - Rx invalid \emph{8b10b} code\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.4} - Rx sync lost\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.6} - Rx frame dropped by PFilter\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.7} - Rx PCS Error\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.10} - Rx CRC Error
\end{packed_enum}
\item {\bf Problem with the \emph{SwCore} or Endpoint HDL module}
\label{fail:data:swcore_hang}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on HDL, then hal?)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
If any of these HDL modules hangs, there is usually not much the user
can do besides resetting the WR Switch so that the FPGA is reprogrammed.
It may happen that frames are lost only on one or two ports, but it may
also be that the whole SwCore refuses to forward traffic. In the current
firmware release we have a bug causing SwCore/Endpoint to hang after
forwarding traffic of a specific frame size and load. It will be fixed in
future releases.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.19} - Endpoint Tx frames\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.38} - RTU forward decisions to the
port
\item [] \underline{Note}: We should probably also export some SwCore events
for counting.\\
Two early ideas for checking if SwCore is hanging or not:
\begin{itemize}
\item Monitor the number of used and free pages in the MPM memory
\item Compare the per-port \emph{RTUfwd} counter with the \emph{Tx}
Endpoint counter for this port. \emph{RTUfwd} counts all forwarding
decisions from RTU to the port $<$n$>$ (including PTP frames from
NIC). If this number is equal to the number of frames actually
transmitted by the Endpoint, then everything works fine.
\end{itemize}
\end{packed_enum}
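The second idea above (comparing \emph{RTUfwd} with the Endpoint \emph{Tx}
counter) can be sketched as follows. The function name and the
\texttt{tolerance} parameter are hypothetical; a tolerance is assumed to be
useful because the two counters cannot be read at exactly the same instant,
so a few frames may still be in flight.
\begin{lstlisting}[language=Python]
def suspect_ports(rtu_fwd, ep_tx, tolerance=0):
    """Return ports where forwarding decisions and Tx frames disagree.

    rtu_fwd   -- mapping port -> RTU forward decisions to that port
    ep_tx     -- mapping port -> frames transmitted by the Endpoint
    tolerance -- allowed difference for frames still in flight
                 (hypothetical parameter)
    """
    return sorted(port for port in rtu_fwd
                  if abs(rtu_fwd[port] - ep_tx.get(port, 0)) > tolerance)
\end{lstlisting}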
\item {\bf RTU is full and cannot accept more requests}
\label{fail:data:rtu_full}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on HDL)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
If RTU is full for a given port, it's not able to accept more requests
and generate new responses. In such case frames are dropped in the
Rx path of the Endpoint.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.21} - Rx drop, RTU full
\item [] \underline{Note}: It turns out that the \emph{rtu\_port} HDL
component was changed and currently RTU full events are not generated
and therefore not counted by PSTATS.
\end{packed_enum}
\item {\bf Too much HP traffic / Per-priority queue full}
\label{fail:data:too_much_HP}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on HDL)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
If we get too much High Priority traffic, then SwCore will be busy all
the time forwarding HP frames. This way regular/best effort traffic
won't be flowing through the switch. In the extreme case, HP traffic
queue may become full and we start losing HP frames, which is
unacceptable.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.33} - HP frames on a port\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.20} - Total number of Rx frames on
the port\\
\texttt{WR-SWITCH-MIB::pstatsWR<n>.22-29} - Rx priorities 0-7
\item [] \underline{Note}: we need to get from SwCore the information
about per-priority queue utilization, or at least an event when it's
full.
\end{packed_enum}
\item {\bf \emph{RTUd} has crashed}
\label{fail:data:rtu_crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on monit)}
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
If RTUd crashes, traffic is still forwarded between the WRS ports, but
only based on already existing static and dynamic rules. There is
no learning or aging functionality. That means MAC addresses are not
removed from the RTU table if a device is disconnected from a port. Since
there is no learning, each frame with a yet unknown destination MAC
is broadcast to all ports (within a VLAN).
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::rtuRunCnt} \emph{(not implemented)}\\
\texttt{HOST-RESOURCES-MIB::hrSWRunName.<x>} \emph{(implemented)}
\item [] \underline{Note}: the list of processes has to be monitored to check
whether \emph{RTUd} is there and whether its PID has changed (it was restarted).
\end{packed_enum}
\item {\bf Network loop - two or more identical MACs on two or more ports}
\label{fail:data:net_loop}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(to be done later)}
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
In such case we have a ping-pong situation. If two ports receive frames
with the same source MAC, it is learned on one of these ports. Then if
it comes on a second port, it is learned on a second port, and removed
from the first one. Later, MAC is learned again on the first port, and
removed from the MAC table for the second port, and so on. This
situation is a network configuration problem or eRSTP failure.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: we need to monitor the \emph{rtu\_stat} to
diagnose ping-pong in the RTU table.
\end{packed_enum}
\item {\bf Wrong configuration applied (e.g. wrong VLAN config)}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(to be done later)}
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
The same problem as described in the timing fault
\ref{fail:timing:no_frames}.
\end{packed_enum}
\item {\bf Topology Redundancy failure}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}: \emph{(not yet implemented)}\\
Topology redundancy lets us avoid losing data when the primary
uplink is down for some reason. However, if a backup link is also down
or reconfiguration to the backup link fails, we start losing data and an
alarm should be raised.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: One thing we need to report is a backup link(s)
going down, but we should also think about how to determine if there is
some problem with eRSTP and if it may fail/has failed if the primary
link is down.
\end{packed_enum}
\end{enumerate}
\newpage
\subsection{Other errors}
\label{sec:other_fail}
\begin{enumerate}
\item {\bf WR Switch did not boot correctly}
\label{fail:other:boot}
\begin{packed_enum}
\item [] \underline{Status}: TODO
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:\\
That one is about making sure that everything is up and running after WR
switch boots. If any of the services fails, an alarm should be raised.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: we should have a flag somewhere reported
through the SNMP (e.g. in the main status word) saying that WRS has
booted correctly, FPGA is programmed, all kernel drivers are loaded and
all daemons are up and running. If it's not the case, we should report
what has happened:
\begin{itemize}
\item reading HW information from dataflash failed?
\item programming FPGA or LM32 failed?
\item loading any of the kernel modules failed?
\item starting any of the userspace daemons failed?
\end{itemize}
The idea for that is to reboot the system if it was not able to boot
correctly. Then we use the scratchpad registers of the processor to keep
the boot count. If the value of this counter is more than X we stop
rebooting and try to have a system running with at least \emph{dropbear}
for SSH and \emph{net-snmp} to allow remote diagnostics. If on the other
hand we have booted correctly we set the boot count to 0.
\end{packed_enum}
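The reboot policy described in the note above can be sketched as follows. This
is only an illustration of the decision logic: the function name and the
returned labels are hypothetical, \texttt{max\_attempts} stands for the value
X mentioned above, and access to the scratchpad register is left out.
\begin{lstlisting}[language=Python]
def boot_action(booted_ok, boot_count, max_attempts):
    """Decide what to do after a boot attempt.

    Returns a tuple (action, new_boot_count), where action is one of
    "run", "reboot" or "rescue" (hypothetical labels).
    """
    if booted_ok:
        return ("run", 0)                 # clean boot clears the counter
    if boot_count > max_attempts:
        return ("rescue", boot_count)     # stop rebooting, keep SSH/SNMP
    return ("reboot", boot_count + 1)     # try again and remember it
\end{lstlisting}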
\item {\bf Any userspace daemon has crashed/restarted}
\label{fail:other:daemon_crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(depends on monit)}
\item [] \underline{Severity}: ERROR / WARNING (depending on the process)
\item [] \underline{Description}:
\item [] \underline{SNMP objects}:\\
\texttt{HOST-RESOURCES-MIB::hrSWRunName.<x>}\\
\texttt{WR-SWITCH-MIB::ptpRunCnt}\\
\texttt{WR-SWITCH-MIB::halRunCnt}\\
\texttt{WR-SWITCH-MIB::rtuRunCnt}\\
\texttt{WR-SWITCH-MIB::sshRunCnt}\\
\texttt{WR-SWITCH-MIB::udhcpdRunCnt}\\
\texttt{WR-SWITCH-MIB::rsyslogRunCnt}\\
\texttt{WR-SWITCH-MIB::snmpdRunCnt}\\
\texttt{WR-SWITCH-MIB::httpdRunCnt}
\item [] \underline{Note}: We have to monitor the list of running
processes and their PIDs. We shall distinguish between crucial
processes - error should be reported if one of them crashes; and less
important processes which should just be restarted if they crash (and
warning should be reported). If any of the processes has crashed, we
need to restart it and increment a per-process counter reported through
the SNMP to indicate how many times each process has crashed.\\
Crucial processes (Error report if any of them crashes):
\begin{itemize}
\item \emph{PTP/PPSi}
\item \emph{WRSW\_RTUd} - after adding configuration preserving code
on restart, RTUd could be crossed out from this list
\item \emph{WRSW\_HAL}
\end{itemize}
Less critical processes (Restarting them and Warning generation is
enough):
\begin{itemize}
\item \emph{dropbear}
\item \emph{udhcpc}
\item \emph{rsyslogd}
\item \emph{snmpd}
\item \emph{lighttpd}
\item \emph{TRUd/eRSTPd} - not yet implemented
\end{itemize}
\emph{RTUd} - we need to set a flag that it has crashed so that when
it runs again it knows that the HDL is already configured. It should not
erase static entries in the RTU table (e.g. multicasts for PTP), and it
should either not erase or configure again the static entries set by hand
as well as VLANs. Dynamic entries are not a problem. RTUd will learn all
MACs after restarting. The only consequence will be increased network
traffic due to frames broadcast until all MACs are learned. In general
the source code has to be checked to make sure what is cleared on
startup and modified to preserve the configuration.\\
\emph{TRUd/eRSTPd} - topology reconfiguration is done in hardware if
needed, this daemon is used only to configure TRU/RTU HDL module.
However, the story is similar to that of RTUd. If eRSTPd crashes, we
need to store this information so that when it runs again, it does not
erase the whole configuration. Also if topology reconfiguration is done
in HDL while eRSTPd is down, HDL should keep the flag that it happened,
and eRSTPd should read this flag when starting, so that it's aware that
now, backup link is active.\\
\end{packed_enum}
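The distinction between crucial and less critical processes, together with the
per-process crash counter, can be sketched as follows. The process names used
as keys and the function name are hypothetical; the split into the two groups
follows the lists above.
\begin{lstlisting}[language=Python]
# Processes whose crash is reported as an error (names are assumptions).
CRUCIAL = {"ppsi", "wrsw_rtud", "wrsw_hal"}


def record_crash(name, crash_counts):
    """Count a crash of the given process and return its severity.

    crash_counts -- mapping process name -> number of crashes so far;
                    it is updated in place
    """
    crash_counts[name] = crash_counts.get(name, 0) + 1
    return "ERROR" if name in CRUCIAL else "WARNING"
\end{lstlisting}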
\item {\bf Kernel crash}
\begin{packed_enum}
\item [] \underline{Status}: TODO
\item [] \underline{Severity}: ERROR
\item [] \underline{Description}:
If the Linux kernel has crashed, the system reboots. We have
no synchronization and no SNMP to report the status. The FPGA may still be
forwarding Ethernet traffic, but based on the dynamic and static routing
rules from before the crash.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: On kernel crash, we should restart (it's done
already) but also be able to determine after the next boot what the
reason for the reboot was. There is a register in the processor that tells us
whether we rebooted after a crash or whether it is a ``clean'' boot:\\
\lstset{frame=single, captionpos=b, caption=, basicstyle=\scriptsize, backgroundcolor=\color{light-gray}, label= }
\begin{lstlisting}
After a power-on:
wrs-192.168.16.242# devmem 0xfffffd04
0x00010001
After reboot:
wrs-192.168.16.242# devmem 0xfffffd04
0x00010300
\end{lstlisting}
\end{packed_enum}
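The two readouts above differ only in bits 8 to 10 and in bit 0. A boot script
could therefore extract that field to tell the two cases apart, as in the
sketch below. Note that the field position is an assumption derived from these
two values only and should be checked against the processor datasheet; the
function name is hypothetical.
\begin{lstlisting}[language=Python]
def reset_type(status_reg):
    """Extract bits 8..10 of the reset status register value.

    With the two readouts shown above this gives 0 after a power-on
    and 3 after a reboot (field position is an assumption).
    """
    return (status_reg >> 8) & 0x7
\end{lstlisting}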
\item {\bf System nearly out of memory}
\label{fail:other:no_mem}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(DONE?, create new object to report if error?)}
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:
\item [] \underline{SNMP objects}:\\
\texttt{HOST-RESOURCES-MIB::hrStorageDescr.<x>}\\
\texttt{HOST-RESOURCES-MIB::hrStorageSize.<x>}\\
\texttt{HOST-RESOURCES-MIB::hrStorageUsed.<x>}
\item [] \underline{Note}: we need to monitor the amount of free memory,
report it through SNMP and raise an alarm if it's extremely
low (but still enough to keep the system running). In general we should
compare \texttt{hrStorageSize} with \texttt{hrStorageUsed} for each
chunk of memory and each partition.
\end{packed_enum}
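The comparison of \texttt{hrStorageSize} with \texttt{hrStorageUsed} can be
sketched as follows. The function name and the default limit of 90\% are
hypothetical; each entry is assumed to hold the description, size and used
amount of one storage area, in the same allocation units.
\begin{lstlisting}[language=Python]
def storage_alarms(entries, max_used_fraction=0.9):
    """Return descriptions of storage areas that are nearly full.

    entries           -- list of (descr, size, used) tuples
    max_used_fraction -- usage above this fraction raises an alarm
                         (hypothetical default)
    """
    return [descr for descr, size, used in entries
            if size > 0 and used / size > max_used_fraction]
\end{lstlisting}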
\item {\bf CPU load too high}
\label{fail:other:cpu}
\begin{packed_enum}
\item [] \underline{Status}: TODO \emph{(DONE?)}
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::cpuLoad} \emph{(not implemented)}\\
Can \texttt{HOST-RESOURCES-MIB::hrProcessorLoad} be used?
(``The average, over the last minute, of the percentage
of time that this processor was not idle.
Implementations may approximate this one minute
smoothing period if necessary.'')
\item [] \underline{Note}: the situation is similar to that of memory. We need
to monitor and report the CPU load and raise an alarm if it is close to 100\%
(but the system is still able to keep running).
\end{packed_enum}
\item {\bf Temperature inside the box too high}
\label{fail:other:temp}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
If the temperature rises too high we might break the electronics inside
the box. It also means that most probably one or both of the fans inside
the box are broken and should be replaced. There are 4 temperature
sensors monitored:
\begin{itemize}
\item \emph{IC19} - temperature below the FPGA
\item \emph{IC20}, \emph{IC17} - temperature near the SCB power supply
circuit
\item \emph{IC18} - temperature near the VCXO and PLLs (AD9516,
CDCM6100)
\end{itemize}
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::tempFPGA}\\
\texttt{WR-SWITCH-MIB::tempPSL}\\
\texttt{WR-SWITCH-MIB::tempPSR}\\
\texttt{WR-SWITCH-MIB::tempPLL}\\
\texttt{WR-SWITCH-MIB::tempTholdFPGA}\\
\texttt{WR-SWITCH-MIB::tempTholdPLL}\\
\texttt{WR-SWITCH-MIB::tempTholdPSL}\\
\texttt{WR-SWITCH-MIB::tempTholdPSR}\\
\texttt{WR-SWITCH-MIB::tempWarning}
\item [] \underline{Note}:
\texttt{tempWarning} is raised when the temperature read from any of these
sensors exceeds its individually set threshold in \emph{.config}. When at least
one threshold temperature is not set, \texttt{tempWarning} returns \emph{Threshold-not-set}.
Temperature is read by the HAL to drive PWM inside the FPGA. HAL reports
temperature to its area in the shared memory.
\end{packed_enum}
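The behaviour of \texttt{tempWarning} described in the note above can be
sketched as follows. This is an illustration only, not the agent's code: the
function name, the sensor keys and the labels \texttt{warning} and
\texttt{ok} are hypothetical; only \emph{Threshold-not-set} is taken from the
text above.
\begin{lstlisting}[language=Python]
def temp_warning(temps, thresholds):
    """Evaluate the temperature warning for all sensors.

    temps      -- mapping sensor name -> measured temperature
    thresholds -- mapping sensor name -> threshold, None if not set
    """
    if any(thresholds.get(name) is None for name in temps):
        return "Threshold-not-set"  # at least one threshold is missing
    if any(temps[name] > thresholds[name] for name in temps):
        return "warning"            # at least one sensor is too hot
    return "ok"
\end{lstlisting}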
\item {\bf Unsupported SFP plugged into the cage (especially non 1-Gb SFP)}
\label{fail:other:sfp}
\begin{packed_enum}
\item [] \underline{Status}: DONE
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
If an unsupported Gigabit fiber SFP is plugged into the cage, then it's a
timing issue \ref{fail:timing:wrong_sfp}. However, if a non 1-Gb SFP is
used, then no Ethernet traffic will flow on that port, because
10/100Mbit Ethernet is not implemented inside
the WRS.
\item [] \underline{SNMP objects}:\\
\texttt{WR-SWITCH-MIB::portSfpVN.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpPN.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpVS.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpGbE.<n>}\\
\texttt{WR-SWITCH-MIB::portSfpError.<n>}
\end{packed_enum}
\item {\bf File system / Memory corruption}
\label{fail:other:memory}
\begin{packed_enum}
\item [] \underline{Description}:\\
\item [] \underline{SNMP objects}: \emph{(none)}
\item [] \underline{Note}: how shall we detect this? Based on
\emph{dmesg} errors reported by UBI and the system in general?\\
This is bad, crazy things may happen, we can't do much about it.
\end{packed_enum}
\item {\bf Kernel freeze}
\begin{packed_enum}
\item [] \underline{Description}:
If the kernel freezes we can do nothing. It can freeze e.g. due to an
infinite loop in the IRQ handler. As with a power failure, somebody
has to go to the place where the WRS is installed and investigate/restart
the device.
\item [] \underline{SNMP objects}: \emph{(none)}
\end{packed_enum}
\item {\bf Power failure}
\begin{packed_enum}
\item [] \underline{Description}:\\
A power failure may be either a WRS problem (e.g. a broken power supply
inside the switch) or an external problem (i.e. with the voltage supplied to
the device). There is not much reporting we can do in such a case; it is up to
the Network Management Station to raise an alarm if the SNMP Agent does
not respond to SNMP requests.
\item [] \underline{SNMP objects}: \emph{(none)}
\end{packed_enum}
\item {\bf Hardware problem}
\begin{packed_enum}
\item [] \underline{Description}:\\
If any crucial hardware part breaks, we will most probably notice it as one
(or multiple) of the timing/data errors described in the previous sections.
Besides that, we don't have any self-diagnostics on-board. A few examples:
\begin{itemize}
\item DAC / VCO - problems with synchronization
\item cooling fans - rise of the temperature inside the WRS box
(failure \ref{fail:other:temp})
\item power supply, ARM, FPGA - booting problem (failure
\ref{fail:other:boot})
\item memory chip - data corruption (failure \ref{fail:other:memory})
\end{itemize}
\item [] \underline{SNMP objects}: \emph{(none)}
\end{packed_enum}
\item {\bf Management link down}
\label{fail:other:management_link}
\begin{packed_enum}
\item [] \underline{Description}:\\
For obvious reasons we are not able to report through SNMP that the
management link is down. This should be detected and reported by the NMS
if it does not receive SNMP and ICMP responses from the WRS.
\item [] \underline{SNMP objects}: \emph{(none)}
\end{packed_enum}
\item {\bf No static IP on the management port \& failed to DHCP}
\begin{packed_enum}
\item [] \underline{Description}:\\
From the operator's point of view, this is similar to issue
\ref{fail:other:management_link}. The WRS is not accessible through the
management port, so its status cannot be reported. This should be
detected and reported by the NMS if it does not receive SNMP and ICMP
responses from the WRS. In such a case, a WR expert should make a physical
connection to the management USB port of the WRS to diagnose the
problem.
\item [] \underline{SNMP objects}: \emph{(none)}
\end{packed_enum}
\item {\bf IP address on the management port has changed}
\begin{packed_enum}
\item [] \underline{Status}: TODO
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
I'm not yet sure how we should report this. SNMP polling is probably not
the best choice, because if the IP changes we are no longer able to poll SNMP
objects (until the IP is also updated in the Network Management Station). We
should either generate an SNMP trap to the NMS or send a Syslog message to a
central server.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\end{packed_enum}
\item {\bf Multiple unauthorized access attempts}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}:\\
If we observe many attempts to gain root access through ssh (or the
web interface), that might mean somebody is trying to do something nasty. We
should report such a situation as a Warning.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\item [] \underline{Note}: Bad password event is reported by syslog as
warning. We should probably use this information to add an SNMP object.
\end{packed_enum}
\item {\bf Network reconfiguration (RSTP)}
\label{fail:other:rstp}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}: \emph{(not yet implemented)}\\
If a topology reconfiguration occurs because of a primary link failure,
this fact should be reported through SNMP as a warning. It is not a
critical situation, since the WR network still works. However, further
investigation should be performed to repair the broken link.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\end{packed_enum}
\item {\bf Backup link down}
\begin{packed_enum}
\item [] \underline{Status}: for later
\item [] \underline{Severity}: WARNING
\item [] \underline{Description}: \emph{(not yet implemented)}\\
This is related to issue \ref{fail:other:rstp}. If the WRS uses the primary
uplink, but the backup one fails, it is not a critical fault. The WR network
still works, but the problem should be diagnosed and repaired to have
the backup link operational in case the primary one fails.
\item [] \underline{SNMP objects}: \emph{(not yet implemented)}
\end{packed_enum}
\end{enumerate}
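\noindent The \texttt{tempWarning} rule from the temperature entry above can be
sketched as follows. This is only an illustration, not the code of the SNMP
agent: the function name \texttt{temp\_warning}, the sensor keys and the
returned strings are assumptions made for this example.

```python
from typing import Mapping, Optional


def temp_warning(temps: Mapping[str, float],
                 thresholds: Mapping[str, Optional[float]]) -> str:
    """Return a tempWarning-like state for a set of temperature sensors.

    The state names used here are hypothetical, not the MIB values.
    """
    # If any sensor has no threshold configured, report that first.
    if any(thresholds.get(name) is None for name in temps):
        return "threshold-not-set"
    # Warn as soon as a single sensor exceeds its own threshold.
    if any(temps[name] > thresholds[name] for name in temps):
        return "too-high"
    return "ok"


# Example readings in degrees Celsius (made-up values).
readings = {"FPGA": 52.0, "PLL": 48.5, "PSL": 40.0, "PSR": 41.0}
limits = {"FPGA": 80.0, "PLL": 70.0, "PSL": 90.0, "PSR": 90.0}
print(temp_warning(readings, limits))
```

Checking for a missing threshold before comparing the values keeps an
unconfigured sensor from being silently treated as healthy.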
%\subsection{Switch out of sync to Master}
%
%\subsection{Switch made a big offset jump to follow Master}
%
%\subsection{Unsupported SFP plugged to one of the cages}
%
%\subsection{Lost lock to external 1-PPS \& 10 MHz}
%
%\subsection{Switch wasn't able to fetch initial time from NTP}
%
%\subsection{Suspicious value of any PTP parameter}
%e.g. bitslide > 16000; dTx/dRx = 0, etc.
%
%\subsection{PPSi/HAL/SNMP/any other userspace daemon has crashed}
%
%\subsection{LM32 software has crashed/restarted}
%
%\subsection{Cooling fan broken}
%
%\subsection{Power supply broken}
%
%\subsection{Switch not reachable after power cut}
%
%\subsection{Switch not reachable through SNMP}
%
%\subsection{One of the links went down}
%
%\subsection{Ethernet frames being dumped}
%
%\subsection{Linux is out of memory}
%
%\subsection{Filesystem error/corruption}
%
%\subsection{HW version not recognized, FPGA bitstream not loaded}
%
%\subsection{Frames storm coming from one or multiple ports to CPU}
\section{Introduction}
This document tries to list all possible ways the White Rabbit Switch can
break. It is my brain dump and should be a starting point to improve the SNMP
implementation and the generation of alarms (traps). The document also tries to
describe what the operator's action should be for each failure, i.e. whether
it is enough to reboot the switch or whether it should be replaced with a new
unit.
The document is organized in two parts. The first one (section
\ref{sec:failures}) tries to list all the possible failures that may disturb
synchronization and Ethernet switching. Each failure description has the
following structure:
\begin{itemize}[leftmargin=0pt]
\item [] \underline{Mode}: for timing failures, it says which modes are
affected. Possible values are:
\begin{itemize}
\item \emph{Slave} - WR Switch has at least one Slave port synchronized to
another WR device higher in the timing hierarchy (though it may be also
Master to other WR/PTP devices lower in the timing hierarchy).
\item \emph{Grand Master} - WR Switch at the top of the synchronization
hierarchy. It is synchronized to an external clock (e.g. GPSDO, Cesium)
and provides timing to other WR/PTP devices.
\item \emph{Free-Running Master} - WR Switch at the top of the
synchronization hierarchy. It provides timing to other WR/PTP devices
but runs from a local oscillator (not synchronized to external atomic
clock).
\end{itemize}
\item [] \underline{Description}: what the problem is about, how important it
is and what may go wrong if it occurs.
\item [] \underline{SNMP objects}: which SNMP objects should be monitored to
detect the failure. These may be objects from \texttt{WR-SWITCH-MIB} or one
of the standard MIBs used by \emph{net-snmp}.
\item [] \underline{Notes}: optional comment in case the required SNMP objects
are not yet exported by our current implementation of the SNMP agent. It
describes some preliminary ideas of what should be exported in the near future.
\end{itemize}
Section \ref{sec:snmp_exports} is documentation for people integrating the WR
switch into a control system, operators and WR experts. It describes all
essential SNMP objects exported by the device, divided into two groups:
\emph{Operator/basic objects} and \emph{Expert objects}.
\section{SNMP exports (WIP)}
\label{sec:snmp_exports}
\subsection{Operator/basic objects}
These objects provide the basic status of the WR Switch. They should be used by
control system operators and people without deep knowledge of the White Rabbit
internals. These values report the general status of the device and high-level
errors.\\
{\bf Note}: We will need to change the SNMP code. There should be something like
a loop periodically (e.g. every 5s) reading all information from the various SHM
areas (HAL, PPSi, SPLL), caching it and calculating general status information.
This way, when we receive an SNMP request we can serve the information from our
local SNMP cache. The same code could later be used to generate SNMP Traps.\\
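\noindent A minimal sketch of such a polling loop with a local cache is shown
below. It is not the actual SNMP agent code: \texttt{StatusCache},
\texttt{poll} and the reader callables standing in for the SHM areas are
hypothetical names, and the returned values are made up.

```python
import time
from typing import Callable, Dict, Optional


class StatusCache:
    """Holds the last snapshot read from each (hypothetical) SHM area."""

    def __init__(self, readers: Dict[str, Callable[[], dict]]) -> None:
        self._readers = readers
        self._data: Dict[str, dict] = {}
        self.updates = 0

    def refresh(self) -> None:
        # Read every area once and replace the cached snapshot.
        self._data = {area: read() for area, read in self._readers.items()}
        self.updates += 1

    def get(self, area: str) -> dict:
        # SNMP requests are answered from the cache, not from SHM directly.
        return self._data[area]


def poll(cache: StatusCache, period_s: float = 5.0,
         iterations: Optional[int] = None,
         sleep: Callable[[float], None] = time.sleep) -> None:
    """Refresh the cache every period_s seconds (forever if iterations is None)."""
    done = 0
    while iterations is None or done < iterations:
        cache.refresh()
        done += 1
        sleep(period_s)


# Demonstration with fake readers and no real sleeping.
cache = StatusCache({
    "hal": lambda: {"temp_fpga": 52.0},
    "ppsi": lambda: {"servo_state": "TRACK_PHASE"},
})
poll(cache, iterations=1, sleep=lambda _seconds: None)
print(cache.get("ppsi")["servo_state"])
```

Keeping the refresh step separate from \texttt{get} means the same snapshot
could later be inspected by code that generates traps.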
\noindent {\bf General Status}:
\begin{itemize}%[leftmargin=0pt]
\item WRS general status - OK / Warning / Error
\item Timing Status
\item Networking Status
\item System Status
\item Detailed status
\begin{itemize}
\item Timing
\begin{itemize}
\item PTP (TRACK\_PHASE, offset, RTT, fixed deltas, daemon crash,
servo\_update\_cnt)
\item SoftPLL (DelCnt = 0; mode, SeqState, AlignState)
\item Slave link down
\item PTP frames flowing ?
\item (placeholder for Switchover)
\item (placeholder for Holdover)
\end{itemize}
\item Networking
\begin{itemize}
\item (placeholder for Link down)
\item SFPs (portSfpError.<x> ?)
\item Endpoint status (2.2.2)
\item Swcore status (2.2.3, 2.2.5)
\item RTU status (2.2.4, 2.2.7)
\item (placeholder for TRU)
\item (placeholder for switchover or backup link state)
\end{itemize}
\item System
\begin{itemize}
\item Boot ok
\item Free memory too low
\item Temperature
\item CPU load too high
\item Disk space too low (?)
\end{itemize}
\end{itemize}
\item Version (rewrite existing)
\begin{itemize}
\item last date/time when firmware was updated\\
(save current time on restart, when new firmware is in /update so that it can be exported with SNMP)
\item contact info
\item build by
\item build date
\item hash, HW, SW,
\item (check what exists and add missing)
\end{itemize}
\end{itemize}
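One possible way to derive the WRS general status from the three partial
statuses is to take the worst of them. The list above does not specify the
rule, so the ``worst wins'' policy and the names \texttt{Status} and
\texttt{general\_status} are assumptions made for this sketch.

```python
from enum import IntEnum


class Status(IntEnum):
    """Severity levels ordered so that a larger value is worse."""
    OK = 0
    WARNING = 1
    ERROR = 2


def general_status(timing: Status, networking: Status, system: Status) -> Status:
    # Assumed policy: the general status is the worst partial status.
    return max(timing, networking, system)


print(general_status(Status.OK, Status.WARNING, Status.OK).name)
```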
\newpage
\subsection{Expert/extended status}
Expert objects can be used by White Rabbit experts for in-depth diagnosis of
switch failures. These values are verbose and should not be used by
operators.
\begin{itemize}
\item Operation Status
\begin{itemize}
\item CPU Load (\%)
\item current time
\begin{itemize}
\item TAI
\item date string
\end{itemize}
\item Boot status
\begin{itemize}
\item boot cnt
\item restart reason
\item boot status values\\
(1 object for each: hwinfo readout, FPGA, LM32, kernel modules, userspace daemons, config retrieved ok)
\item config source (tftp, flash, as string?)
\end{itemize}
\item Temperature
\begin{itemize}
\item temp 1..4
\item threshold 1..4
\end{itemize}
\end{itemize}
\item Restart Counters
\begin{itemize}
\item HAL
\item PPSi
\item RTUd
\item (..)
\item SPLL
\end{itemize}
\item SoftPLL state
\begin{itemize}
\item mode, irqcnt, seqstate, alignstate, Hlock, Mlock, Block[18], Err[18], HY, MY, delCnt, holdover, holdoverTime
\item spll version
\item spll build date
\item (...)
\end{itemize}
\item Networking
\begin{itemize}
\item VLAN table dump
\item RTU table dump (check if management sw uses snmpwalk)
\item SW core status
\begin{itemize}
\item Free pages
\end{itemize}
\end{itemize}
\item Pstats (pivot table, some of the counters should be used to fill
standard MIBs)
\item PtpData (make it an array for later switch-over needs)
\begin{itemize}
\item per instance/ which port
\end{itemize}
\item Ports status (per-port information)
\begin{itemize}
\item portEnable (enable/disable port via ifconfig)
\item ptpTxFrames (per port or per instance, depending on implementation)
\item ptpRxFrames (per port or per instance, depending on implementation)
\end{itemize}
\item Configuration
\begin{itemize}
\item PPS width
\end{itemize}
\end{itemize}
\newpage
\subsection{Expert objects (to be updated, was first draft)}
{\bf Note:} we will put the MIB file dump here later.
\subsubsection{PTP/WR parameters}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{WR-SWITCH-MIB::ptpGrandmasterID}\\ - is it really Grand
Master, so the same ID for the whole network ? or is it a Master higher in
the sync hierarchy for a given device ?
\item [] \texttt{WR-SWITCH-MIB::ptpOwnID}
\item [] \texttt{WR-SWITCH-MIB::ptpMode}
\item [] \texttt{WR-SWITCH-MIB::ptpSyncSource}\\ - port number
\emph{wr0}/\emph{wr1}/... for Slave mode or \emph{ext} for Grand Master mode
\item [] \texttt{WR-SWITCH-MIB::ptpServoState}\\ - string, WR servo state
(\emph{SYNC\_IDLE}, \emph{SYNC\_SEC}, \emph{SYNC\_NSEC}, \emph{SYNC\_PHASE},
\emph{OFFSET\_STABLE}, \emph{TRACK\_PHASE}) (timing:
\ref{fail:timing:ppsi_track_phase})
\item [] \texttt{WR-SWITCH-MIB::ptpServoStateN}\\ - would it be useful to
also report ptpServoState in a numeric form ? (timing:
\ref{fail:timing:ppsi_track_phase})
\item [] \texttt{WR-SWITCH-MIB::ptpRTT}\\ - Round-trip delay ($delay_{MM}$)
(timing: \ref{fail:timing:rtt_jump})
\item [] \texttt{WR-SWITCH-MIB::ptpDelayMS}\\ - one-way M-S delay
($delay_{MS}$)
\item [] \texttt{WR-SWITCH-MIB::ptpLinkLength}
\item [] \texttt{WR-SWITCH-MIB::ptpPhaseTracking}\\ - if phase tracking is
enabled (only for WR-demo purposes I think)
\item [] \texttt{WR-SWITCH-MIB::ptpClockOffsetPs}\\ (timing:
\ref{fail:timing:offset_jump})
\item [] \texttt{WR-SWITCH-MIB::ptpSkew}
\item [] \texttt{WR-SWITCH-MIB::ptpPhSetpoint}
\item [] \texttt{WR-SWITCH-MIB::ServoUpdates}
\item [] \texttt{WR-SWITCH-MIB::portLink.<n>}\\ (timing:
\ref{fail:timing:master_down}, \ref{fail:timing:no_frames}; data:
\ref{fail:data:link_down})
\item [] \texttt{WR-SWITCH-MIB::portMode.<n>}\\ (timing:
\ref{fail:timing:master_down}, \ref{fail:timing:no_frames})
\item [] \texttt{WR-SWITCH-MIB::portLocked.<n>}
\item [] \texttt{WR-SWITCH-MIB::portPeer.<n>}
\item [] \texttt{WR-SWITCH-MIB::portPtpState.<n>}\\ - does it make sense to
report PTP state for each port ? (regular PTP, not WR state)
\item [] \texttt{WR-SWITCH-MIB::portPtpTxFrames.<n>}\\ - how many PTP frames
were sent from the port (counted by PTP/PPSi) (timing:
\ref{fail:timing:no_frames})
\item [] \texttt{WR-SWITCH-MIB::portPtpRxFrames.<n>}\\ - how many PTP frames
were received on the port (counted by PTP/PPSi) (timing:
\ref{fail:timing:no_frames})
\item [] \texttt{WR-SWITCH-MIB::portActiveSlave.<n>}\\ - 0/1 to mark which one
is the active Slave (if there are also Backups and timing switchover)
\item [] \texttt{WR-SWITCH-MIB::portDeltaTxM.<n>}\\ - for each Slave and
Backup port (timing: \ref{fail:timing:deltas_report})
\item [] \texttt{WR-SWITCH-MIB::portDeltaRxM.<n>}\\ - for each Slave and
Backup port (timing: \ref{fail:timing:deltas_report})
\item [] \texttt{WR-SWITCH-MIB::portDeltaTxS.<n>}\\ - for each Slave and
Backup port (timing: \ref{fail:timing:deltas_report})
\item [] \texttt{WR-SWITCH-MIB::portDeltaRxS.<n>}\\ - for each Slave and
Backup port (timing: \ref{fail:timing:deltas_report})
\item [] \texttt{WR-SWITCH-MIB::}
\begin{itemize}[topsep=-12pt]
\item any other stuff from backup channels that is useful to report
\item holdover information (e.g. timestamp when it was activated)
\end{itemize}
\end{itemize}
\subsubsection{SoftPLL parameters}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{WR-SWITCH-MIB::spllMode}\\ - Grand Master / Free-running
Master / Slave / Disabled (timing: \ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllIrqCnt}\\ - IRQ counter
\item [] \texttt{WR-SWITCH-MIB::spllSeqState}\\ (timing:
\ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllAlignState}\\ (timing:
\ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllHlock}\\ (timing:
\ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllMlock}\\ (timing:
\ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllBlock}\\ - All backup channels locked
\item [] \texttt{WR-SWITCH-MIB::spllHY}\\ - Helper DAC setting (Helper PI.Y)
\item [] \texttt{WR-SWITCH-MIB::spllMY}\\ - Main DAC setting (Main PI.Y)
\item [] \texttt{WR-SWITCH-MIB::spllDelCnt}\\ - De-lock counter (timing:
\ref{fail:timing:spll_unlock})
\item [] \texttt{WR-SWITCH-MIB::spllCrashCnt}\\ - counter incremented when
LM32 SoftPLL software crash was detected (e.g. CPU has followed a NULL
pointer). Should this be a counter ? (timing: \ref{fail:timing:spll_crash})
\item [] \texttt{WR-SWITCH-MIB::}
\begin{itemize}[topsep=-12pt]
\item per-port stuff for active and backup channels related to timing
switchover
\end{itemize}
\end{itemize}
\subsubsection{Per-port statistics}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{WR-SWITCH-MIB::pstatsDescr.<x>} - string describing counter
$<$x$>$
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.1} - Tx PCS FIFO underruns (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.2} - Rx PCS FIFO overruns (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.3} - Rx invalid 8b10b codes (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.4} - Rx sync lost events (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.5} - received pause frames
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.6} - Packet Filter frame drops
(data: \ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.7} - Rx PCS Errors (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.8} - received giant frames
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.9} - received runt frames
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.10} - Rx CRC errors (data:
\ref{fail:data:ep_txrx})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.11 - 18} - Rx frames assigned by
Packet Filter to classes 0 to 7
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.19} - transmitted frames (data:
\ref{fail:data:swcore_hang})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.20} - received frames (data:
\ref{fail:data:too_much_HP})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.21} - Rx frames dropped due to RTU
being full and not accepting requests (data: \ref{fail:data:rtu_full})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.22 - 29} - received frames with
priority 0 - 7 (based on 802.1q tag priorities to traffic classes mapping)
(data: \ref{fail:data:too_much_HP})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.30} - valid RTU requests
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.31} - valid RTU responses
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.32} - dropped frames based on RTU
decision
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.33} - Fast Match high priority
frames (data: \ref{fail:data:too_much_HP})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.34} - Fast Match fast-forward
frames
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.35} - Fast Match non-forward
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.36} - Fast Match valid responses
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.37} - Full Match valid responses
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.38} - RTU forward decisions to the
port (data: \ref{fail:data:swcore_hang})
\item [] \texttt{WR-SWITCH-MIB::pstatsWR<n>.39} - TRU valid responses (not
supported)
\end{itemize}
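Since these objects are counters, a monitoring script would typically compare
two consecutive polls of one port. The sketch below reports which of the
counters annotated above with \ref{fail:data:ep_txrx} have increased. The
function name \texttt{increased\_counters} and the dictionary layout are
assumptions, and counter wrap-around is deliberately ignored.

```python
from typing import Dict

# Counter indices annotated with the endpoint Tx/Rx failure in the list above.
EP_TXRX_COUNTERS = {
    1: "Tx PCS FIFO underruns",
    2: "Rx PCS FIFO overruns",
    3: "Rx invalid 8b10b codes",
    4: "Rx sync lost",
    6: "Packet Filter frame drops",
    7: "Rx PCS errors",
    10: "Rx CRC errors",
}


def increased_counters(previous: Dict[int, int],
                       current: Dict[int, int]) -> Dict[int, int]:
    """Return {counter index: increment} for monitored counters that grew."""
    result = {}
    for idx in EP_TXRX_COUNTERS:
        delta = current.get(idx, 0) - previous.get(idx, 0)
        if delta > 0:
            result[idx] = delta
    return result


# Two made-up polls of one port.
first = {1: 0, 3: 12, 10: 5, 19: 1000}
second = {1: 0, 3: 15, 10: 5, 19: 2000}
for idx, delta in increased_counters(first, second).items():
    print(f"{EP_TXRX_COUNTERS[idx]}: +{delta}")
```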
\subsubsection{Other per-port status}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{IF-MIB::ifOperStatus.<n>}\\ - is link up or down (data:
\ref{fail:data:link_down})
\item [] \texttt{WR-SWITCH-MIB::portSfpID.<n>}\\ (timing:
\ref{fail:timing:wrong_sfp})
\item [] \texttt{WR-SWITCH-MIB::portSfpInDB.<n>}\\ - was SFP ID found in SFP
database with fixed delays and alpha ? (timing: \ref{fail:timing:wrong_sfp})
\item [] \texttt{WR-SWITCH-MIB::portSfpGbE.<n>}\\ - is there Gigabit Ethernet
SFP plugged ? (other: \ref{fail:other:sfp})
\item [] \texttt{WR-SWITCH-MIB::portNoTiming.<n>}\\ - port is used only for data
transfer, no timing, or non-WR timing; timing-unsupported SFP may be used
there (it's a signal for NMS not to raise an alarm if
\texttt{WR-SWITCH-MIB::portSfpInDB} is \emph{false})
\item [] \texttt{WR-SWITCH-MIB::confVLAN.<n>}\\ - per-port VLAN configuration
\item [] \texttt{WR-SWITCH-MIB::portEnabled.<n>}\\
- read/write value\\
- if the port is enabled / enable or disable the port; this may be useful if
a part of the network causing problems has to be remotely disconnected
\end{itemize}
\subsubsection{Other HDL info}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{WR-SWITCH-MIB::swcoreUsedPages}\\ - number of used pages in
the MPM memory (data: \ref{fail:data:swcore_hang})
\item [] \texttt{WR-SWITCH-MIB::swcoreFreePages}\\ - number of free pages in
the MPM memory (data: \ref{fail:data:swcore_hang})
\end{itemize}
\subsubsection{System status and configuration}
\begin{itemize}[leftmargin=0pt]
\item [] \texttt{WR-SWITCH-MIB::ppsWidth}\\ - configured width of 1-PPS signal
% info from wrs_version -t
\item [] \texttt{WR-SWITCH-MIB::swVer}\\ - version of the WRS software
\item [] \texttt{WR-SWITCH-MIB::swBuildBy}\\ - who compiled the firmware
\item [] \texttt{WR-SWITCH-MIB::swBuildDate}\\ - when the firmware was
compiled
\item [] \texttt{WR-SWITCH-MIB::swSpllVer}\\ - version of the LM32 software
(revision reported in rt\_cpu.elf)
\item [] \texttt{WR-SWITCH-MIB::swSpllBuildDate}\\ - when LM32 firmware was
compiled
\item [] \texttt{WR-SWITCH-MIB::gwVer}\\ - version of the WRS gateware
\item [] \texttt{WR-SWITCH-MIB::gwBuild}\\ - gateware build
\item [] \texttt{WR-SWITCH-MIB::gwHash.0}\\ - commit hash of the
\emph{wr-switch-hdl} repo
\item [] \texttt{WR-SWITCH-MIB::gwHash.1}\\ - commit hash of the
\emph{general-cores} repo
\item [] \texttt{WR-SWITCH-MIB::gwHash.2}\\ - commit hash of the
\emph{wr-cores} repo
\item [] \texttt{WR-SWITCH-MIB::hwVer.0}\\ - version of the scb
\item [] \texttt{WR-SWITCH-MIB::hwVer.1}\\ - version of the backplane
\item [] \texttt{WR-SWITCH-MIB::hwFpga}\\ - FPGA type
\item [] \texttt{WR-SWITCH-MIB::hwSN}\\ - serial number of the device
\item [] \texttt{WR-SWITCH-MIB::hwProd}\\ - manufacturer of the hardware
\item [] \texttt{HOST-RESOURCES-MIB::hrSWRunName.<x>}\\ - is a list of running
processes in the system. Each object \emph{x} is a string with process name,
and \emph{x} is PID of this process. We need to filter processes like:
\begin{packed_items}
\item \emph{ppsi}
\item \emph{wrsw\_hal}
\item \emph{wrsw\_rtud}
\item \emph{dropbear}
\item \emph{udhcpc}
\item \emph{rsyslogd}
\item \emph{snmpd}
\item \emph{lighttpd}
\end{packed_items}
\vspace{12pt}
(timing: \ref{fail:timing:ppsi_crash}, \ref{fail:timing:hal_crash}; data:
\ref{fail:data:rtu_crash}; other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::ptpRunCnt}\\ - how many times PTP/PPSi daemon
has crashed (timing: \ref{fail:timing:ppsi_crash})
\item [] \texttt{WR-SWITCH-MIB::halRunCnt}\\ - how many times HAL daemon
has crashed (timing: \ref{fail:timing:hal_crash})
\item [] \texttt{WR-SWITCH-MIB::rtuRunCnt}\\ - how many times RTU daemon
has crashed (data: \ref{fail:data:rtu_crash})
\item [] \texttt{WR-SWITCH-MIB::sshRunCnt}\\ - how many times Dropbear
daemon has crashed (other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::udhcpdRunCnt}\\ - how many times DHCP daemon
has crashed (other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::rsyslogRunCnt}\\ - how many times rsyslog
daemon has crashed (other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::snmpdRunCnt}\\ - how many times SNMP daemon
has crashed (other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::httpdRunCnt}\\ - how many times HTTPd daemon
has crashed (other: \ref{fail:other:daemon_crash})
\item [] \texttt{WR-SWITCH-MIB::sysCnfDate}\\ - time (in TAI seconds) of the
last configuration change
\item [] \texttt{WR-SWITCH-MIB::sysCnfCrit}\\ - is \emph{true} when any of
the critical configuration options was modified during the last
reconfiguration. Critical configuration options:
\begin{packed_items}
\item PTP/PPSi timing mode
\item fixed hardware delays
\end{packed_items}
(timing: \ref{fail:timing:wrong_config})
\item [] \texttt{WR-SWITCH-MIB::sysRst}\\ - if true, system had to auto-reboot
due to a serious fault (e.g. kernel crash)
\item [] \texttt{WR-SWITCH-MIB::rtuRules}\\ - RTU table with dynamic and
static entries (\emph{rtu\_stat}) (data: \ref{fail:data:net_loop})
\item [] \texttt{HOST-RESOURCES-MIB::hrStorageDescr.<x>}\\ - description of
the memory/partition. \emph{x} can be:
\begin{packed_items}
\item [] {\bf 1} - Physical memory
\item [] {\bf 3} - Virtual memory
\item [] {\bf 6} - Memory buffers
\item [] {\bf 7} - Cached memory
\item [] {\bf 10} - Swap space
\item [] {\bf 31} - /update partition
\item [] {\bf 32} - /boot partition
\item [] {\bf 33} - /usr partition
\end{packed_items}
(other: \ref{fail:other:no_mem})
\item [] \texttt{HOST-RESOURCES-MIB::hrStorageSize.<x>}\\ - size of the
memory/partition (other: \ref{fail:other:no_mem})
\item [] \texttt{HOST-RESOURCES-MIB::hrStorageUsed.<x>}\\ - utilization of the
memory/partition (other: \ref{fail:other:no_mem})
\item [] \texttt{WR-SWITCH-MIB::sysNoMem}\\ - if true, system is nearly out of
memory (other: \ref{fail:other:no_mem})
\item [] \texttt{WR-SWITCH-MIB::cpuLoad}\\ - current CPU utilization (\%)
(other: \ref{fail:other:cpu})
\item [] \texttt{WR-SWITCH-MIB::tempFPGA}\\ - SCB temperature below the FPGA
(other: \ref{fail:other:temp})
\item [] \texttt{WR-SWITCH-MIB::tempScbPsu.1}\\ - SCB temperature near the
power supply circuit (other: \ref{fail:other:temp})
\item [] \texttt{WR-SWITCH-MIB::tempScbPsu.2}\\ - SCB temperature near the
power supply circuit (other: \ref{fail:other:temp})
\item [] \texttt{WR-SWITCH-MIB::tempPLL}\\ - SCB temperature near the VCXO and
PLLs (other: \ref{fail:other:temp})
\end{itemize}
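\noindent As an illustration of how the \texttt{HOST-RESOURCES-MIB} storage
objects could be evaluated on the NMS side, the sketch below computes the
utilization of one storage entry from \texttt{hrStorageSize} and
\texttt{hrStorageUsed}. Both values are expressed in the same allocation units,
so their ratio needs no conversion. The function names and the 90\% limit are
assumptions chosen for this example.

```python
def storage_used_percent(size_units: int, used_units: int) -> float:
    """Utilization of one storage entry, from hrStorageSize and hrStorageUsed."""
    if size_units <= 0:
        raise ValueError("hrStorageSize must be positive")
    return 100.0 * used_units / size_units


def nearly_full(size_units: int, used_units: int,
                limit_percent: float = 90.0) -> bool:
    # The 90% default is an arbitrary example threshold.
    return storage_used_percent(size_units, used_units) >= limit_percent


# Made-up values for one storage entry.
print(storage_used_percent(1000, 250))
print(nearly_full(1000, 950))
```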
\noindent \rule{\textwidth}{2pt}
%%%%%%%%%%%%%%%%%%5
%% Other notes
%
% What else should be reported in the future
% Status of Primary Slave port and backup links
% For backup timing links, report parameters from Backup SPLL channels and PTP servo
% What can be reported regarding eRSTP ?
% % role of the bridge - root/designated
% % port role - root/designated/backup/alternate/disabled
% % number of exchanged BPDUs
%
% * we could use information from RSTP to visualize the topology of network made of switches
% * switches exchange BPDU messages to learn about other switches
% * RFC 2674 - Bridges with priority, multicast pruning and VLAN
\def\us{\char`\_}
\documentclass[a4paper, 12pt]{article}
%\documentclass{article}
\usepackage{fullpage}
\usepackage{pgf}
\usepackage{tikz}
\usetikzlibrary{arrows,automata,shapes}
\usepackage{multirow}
\usepackage{color}
\usepackage[latin1]{inputenc}
\usepackage{verbatim}
\usepackage{amsmath}
\usepackage{times,mathptmx}
\usepackage{chngcntr}
\usepackage{hyperref}
\usepackage{enumitem}
\usepackage{scrextend}
%\usepackage[table]{xcolor}
\usepackage{listings}
\definecolor{light-gray}{gray}{0.95}
%\usepackage[firstpage]{draftwatermark}
\usepackage{listings}
\usepackage{cancel}
\graphicspath{ {../../../../figures/} }
\newenvironment{packed_enum}{
\begin{itemize}[leftmargin=0pt,topsep=-12pt]
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}
}{\end{itemize}}
\newenvironment{packed_items}{
\begin{itemize}[topsep=-12pt]
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}
}{\end{itemize}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% creating subsubsubsection notation
% src: http://www.latex-community.org/forum/viewtopic.php?f=5&t=791
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setcounter{secnumdepth}{6}
\renewcommand\theparagraph{\Alph{paragraph}}
\makeatletter
\renewcommand\paragraph{\@startsection{paragraph}{4}{\z@}%
{-3.25ex\@plus -1ex \@minus -.2ex}%
{0.0001pt \@plus .2ex}%
{\normalfont\normalsize\bfseries}}
\renewcommand\subparagraph{\@startsection{subparagraph}{5}{\z@}%
{-3.25ex\@plus -1ex \@minus -.2ex}%
{0.0001pt \@plus .2ex}%
{\normalfont\normalsize\bfseries}}
\counterwithin{paragraph}{subsubsection}
\counterwithin{subparagraph}{paragraph}
\makeatother
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\eqoffset}[1]{%
{\ensuremath{%
{\text{offset}}_{#1}}%
}%
}
\newcommand{\eqdelay}[1]{{\text{delay}}_{#1}}
\newcommand{\eqasymm}{{\text{asymmetry}}}
\begin{document}
\title{White Rabbit Switch: Failures and Diagnostics}
\author{Grzegorz Daniluk\\ Adam Wujek\\[.5cm] CERN BE-CO-HT}
\maketitle
\thispagestyle{empty}
\begin{figure}[ht!]
\centering
\vspace{1.3cm}
\includegraphics[width=0.50\textwidth]{logo/WRlogo.pdf}
\end{figure}
\newpage
\newpage
\newpage
\tableofcontents
\newpage
\input{intro.tex}
\newpage
\section{Possible Errors}
\label{sec:failures}
\input{fail.tex}
\newpage
\input{snmp_exports.tex}
%\section{SNMP exports}
%\subsection{Operator/basic objects}
%\subsection{Expert objects}
%\newpage
%\bibliographystyle{unsrt}
%\bibliography{references}
\end{document}