Commit 515758d4 authored by Grzegorz Daniluk

doc/wrs_failures: adding first draft of procedures in case of errors

parent 59beed5d
\section{Repair procedures}
General rules:
\begin{itemize}
\item Linux inside the WR Switch enumerates WR interfaces starting from 0, so
the switch internally uses port indexes 0..17, whereas the port numbers
printed on the front panel are 1..18. Syslog messages generated by the
switch use the Linux port numbering. Consequently, whenever Syslog reports
a problem on port X, this refers to port X+1 on the front panel of the
switch.
\item If a procedure given for a specific SNMP object does not solve the
problem, please contact WR experts to perform a more in-depth analysis of
your network. For this, you should provide the complete dump of the WRS
status generated in the first step of each procedure.
\item If a repair procedure requires restarting or replacing a broken WR
Switch, please make sure that all other WR devices connected to the affected
switch are synchronized and do not report any problems.
\item If a procedure requires replacing the switch with a new unit, the broken
one should be handed over to WR experts so that they can investigate the
problem.
\end{itemize}
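The port-numbering rule above can be sketched as a small helper. This is only
an illustration in Python, assuming the 18-port layout described above; the
names \texttt{front\_panel\_port} and \texttt{linux\_port\_index} are
hypothetical and are not part of the WRS software.

```python
NUM_PORTS = 18  # Linux indexes 0..17, front-panel numbers 1..18


def front_panel_port(linux_index: int) -> int:
    """Convert a Linux port index (as used in Syslog) to the front-panel number."""
    if not 0 <= linux_index < NUM_PORTS:
        raise ValueError(f"Linux port index out of range: {linux_index}")
    return linux_index + 1


def linux_port_index(panel_port: int) -> int:
    """Convert a front-panel port number to the Linux port index."""
    if not 1 <= panel_port <= NUM_PORTS:
        raise ValueError(f"Front-panel port out of range: {panel_port}")
    return panel_port - 1


if __name__ == "__main__":
    # A Syslog report about port 0 refers to the first front-panel port.
    print(front_panel_port(0))
```

Rejecting out-of-range values is a design choice of this sketch; it makes a
mistyped port number fail loudly instead of pointing at a port that does not
exist.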
\begin{itemize}
\item \texttt{wrsBootSuccessful}
\begin{enumerate}
\item Dump state
\item Check \texttt{WR-SWITCH-MIB::wrsBootConfigStatus}. If it reports an
error, please verify your WRS configuration.
\item Restart the switch
\item Please consult WR experts if the problem persists.
\end{enumerate}
\item \texttt{wrsTemperatureWarning}
\begin{enumerate}
\item Dump state
\item Verify that the cooling of the rack in which the WR Switch is installed
works properly.
\item Verify that both cooling fans at the back of the WR Switch case are
working.
\item Replace the switch with a new unit and consult the WR Switch
manufacturer for a repair.
\end{enumerate}
\item \texttt{wrsMemoryFreeLow}
\begin{enumerate}
\item Dump state
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsCpuLoadHigh}
\begin{enumerate}
\item Dump state
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsDiskSpaceLow}
\begin{enumerate}
\item Dump state
\item Check the values of \emph{CONFIG\_WRS\_LOG\_*} configuration options
on the switch. These are the parameters describing where log messages
should be sent from various processes in the switch. Normally users
don't need to modify them, but if any of them is set to a file in the
WRS filesystem (e.g. /tmp/snmp.log) this may reduce the free space after
some time of operation.
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\end{itemize}
\begin{itemize}
\item \texttt{wrsPTPStatus}
\begin{enumerate}
\item Dump state
\item Check \texttt{wrsSoftPLLStatus} on the Master (the WR device one step
higher in the timing hierarchy). If it reports a problem, proceed to
investigate the problem on the Master switch. Otherwise, continue with the
primary WRS.
\item Verify that the link to the WR Master has not been lost by checking the
object \texttt{wrsSlaveLinksStatus}.
\item If the link is not the cause, restart the switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
\item \texttt{wrsSoftPLLStatus}\\
For GrandMaster WRS:
\begin{enumerate}
\item Dump state
\item Check the 1-PPS and 10 MHz signals coming from the external source.
Verify that they are properly connected and, in the case of a GPS receiver,
check that it is synchronized and locked.
\item Restart the GrandMaster switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
For Boundary Clock WRS:
\begin{enumerate}
\item Dump state
\item Check \texttt{wrsSoftPLLStatus} on the Master. If it reports a problem,
proceed to investigate the problem on the Master switch.
\item Verify that the link to the WR Master has not been lost by checking the
object \texttt{wrsSlaveLinksStatus}.
\item Restart the switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
\item \texttt{wrsSlaveLinksStatus}\\
For Master/GrandMaster WRS:
\begin{enumerate}
\item Check the configuration of the switch, especially whether the
\emph{Timing Mode} is set correctly (i.e. that it was not accidentally set
to \emph{Boundary Clock}).
\item Check the role of each port in the timing configuration. All of them
should be set to \emph{master}. If any of them is set to \emph{slave}, check
whether a WR Master is connected to that port.
\end{enumerate}
For Boundary Clock WRS:
\begin{enumerate}
\item Check the fiber connection on the slave port of the WRS.
\item Check the configuration of the switch, especially whether the
\emph{Timing Mode} is set correctly (i.e. that it was not accidentally set
to \emph{Grand-Master} or \emph{Free-Running Master}).
\item Check the status of the WR Master connected to the slave port of the
WRS.
\item Replace the faulty switch with a new unit. If this does not solve the
problem, make sure your fiber link is not broken.
\end{enumerate}
\item \texttt{wrsPTPFramesFlowing}
% non-WR device connected, but port not set to non-WR mode
% device on the other side has some problem
% HDL / kernel crash or another problem on WRS
\begin{enumerate}
\item Check Syslog messages to determine the WR port on which the
problem is reported. You should see a message similar to this one:\\
\texttt{SNMP: wrsPTPFramesFlowing failed for port 1}
\item Check your network layout and the WR Switch configuration. If you
have non-WR devices connected to ports of the WR Switch (e.g. a computer
that only sends and receives data and does not need synchronization),
these ports should have their role in the timing configuration set to
\emph{non-wr}.
\item Check the status of the WR device connected to the reported port.
\item Restart the switch.
\item If the problem persists, please contact WR experts for in-depth
investigation.
\end{enumerate}
\end{itemize}
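Extracting the port from such a Syslog message can be sketched as follows.
This Python fragment is only an illustration: the pattern is modelled on the
single example message quoted above, real messages may differ, and the name
\texttt{parse\_failed\_port} is hypothetical. The front-panel number is derived
with the X+1 rule from the general rules.

```python
import re

# Pattern modelled on the example "SNMP: <object> failed for port <N>".
# search() is used because real Syslog lines may carry a prefix
# (timestamp, host name) in front of the message.
FAILED_PORT_RE = re.compile(r"SNMP: (\w+) failed for port (\d+)")


def parse_failed_port(line: str):
    """Return (snmp_object, linux_port, front_panel_port), or None if no match."""
    match = FAILED_PORT_RE.search(line)
    if match is None:
        return None
    linux_port = int(match.group(2))
    # Syslog uses Linux numbering; the front panel is numbered one higher.
    return match.group(1), linux_port, linux_port + 1


if __name__ == "__main__":
    print(parse_failed_port("SNMP: wrsPTPFramesFlowing failed for port 1"))
```

The same pattern also covers the \texttt{wrsEndpointStatus} message, since it
only assumes the common ``failed for port'' wording.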
\begin{itemize}
\item \texttt{wrsSFPsStatus}
\begin{enumerate}
\item Check Syslog messages to determine the WR port on which the problem
is reported. You should see a message similar to this one:\\
\texttt{Unknown SFP vn="AVAGO" pn="ABCU-5710RZ" vs="AN1151PD8A" on port
wr1}
\item If the reported port is intended to be used to connect a device that
does not require WR synchronization (e.g. using a copper SFP module),
then you should verify whether the role in the timing configuration for
this port is set to \emph{non-wr}.
\item Otherwise, you should use a WR-supported SFP module and make sure it
is declared together with calibration values in the WRS configuration.
\end{enumerate}
\item \texttt{wrsEndpointStatus}
% link problem (e.g. broken SFP, fiber)
% gateware problem
\begin{enumerate}
\item Make several state dumps.
\item Restart the switch.
\item Check Syslog messages to determine the WR port on which the problem
is reported. You should see a message similar to this one:\\
\texttt{SNMP: wrsEndpointStatus failed for port 1}
\item Check the fiber link on the reported port, e.g. try replacing the SFP
transceivers on both sides of the link or using another fiber.
\item If the problem persists, please contact WR experts for in-depth
investigation.
\end{enumerate}
\item \texttt{wrsSwcoreStatus}
\begin{enumerate}
\item Dump state.
\item Restart the switch.
\item Please contact WR experts since this might mean that either there is
too much high priority traffic in your network, or there is some
internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsRTUStatus}
\begin{enumerate}
\item Dump state
\item Restart the switch.
\item If possible, try reducing the load of small Ethernet frames flowing
through your switch. If your application allows it, use larger Ethernet
frames at a lower load to transfer the same information.
\end{enumerate}
\end{itemize}
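The ``Unknown SFP'' message quoted for \texttt{wrsSFPsStatus} can be split into
its fields with a short script, for example to compare the reported module
against the SFPs declared in the WRS configuration. This is a sketch in
Python that assumes the message has exactly the shape of the example above;
the name \texttt{parse\_unknown\_sfp} is hypothetical, and the field keys
\texttt{vn}, \texttt{pn} and \texttt{vs} are copied from the message without
interpretation.

```python
import re

# Pattern modelled on the example message; the quoted values may be empty.
UNKNOWN_SFP_RE = re.compile(
    r'Unknown SFP vn="([^"]*)" pn="([^"]*)" vs="([^"]*)" on port (\S+)'
)


def parse_unknown_sfp(line: str):
    """Return the fields of an 'Unknown SFP' message as a dict, or None."""
    match = UNKNOWN_SFP_RE.search(line)
    if match is None:
        return None
    vn, pn, vs, port = match.groups()
    return {"vn": vn, "pn": pn, "vs": vs, "port": port}


if __name__ == "__main__":
    example = 'Unknown SFP vn="AVAGO" pn="ABCU-5710RZ" vs="AN1151PD8A" on port wr1'
    print(parse_unknown_sfp(example))
```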
\subsection{Replacing WR Switch with a new unit}
\label{cern:wrs_replacement}
This is just a placeholder pointing to the CERN wikis that describe how to
update the MAC address in the network database so that the same configuration
is used.
@@ -256,4 +256,7 @@
\ifglsused{\thislabel}{}{\glsadd[format=ignore]{\thislabel}}%
}
\newpage
\input{procedures.tex}
\end{document}