Commit 6ff70c97 authored by Grzegorz Daniluk

doc/wrs_failures: more cleanup and a/the fixes

parent c8593494
\section{Repair procedures}
General rules:
\begin{itemize}
\item Linux inside the WR Switch enumerates WR interfaces starting from 0.
Internally, the ports therefore have indexes 0..17, whereas the port numbers
printed on the front panel are 1..18. Syslog messages generated by the
switch use the Linux port numbering. Consequently, whenever Syslog reports
a problem on port X, this refers to port X+1 on the front panel of the
switch.
\item If a procedure given for a specific SNMP object does not solve the
problem, please contact WR experts to perform a more in-depth analysis of
the network. For this, you should provide a complete dump of the WRS status
generated in the first step of each procedure.
\item If a procedure requires restarting or replacing a broken WR
Switch, please make sure that, after the repair, all other WR devices
connected to the affected switch are synchronized and do not report any
problems.
\item If a procedure requires replacing a switch with a new unit, the broken
one should be handed over to WR experts to investigate the problem.
\end{itemize}
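The port-numbering rule above can be sketched in a few lines of Python. This is only an illustration and not part of the WRS tooling: the function names \texttt{front\_panel\_port} and \texttt{port\_from\_syslog} are hypothetical, and the message format is assumed to follow the Syslog examples quoted in the procedures below.

```python
import re

NUM_PORTS = 18  # front panel shows ports 1..18, Linux uses indexes 0..17


def front_panel_port(linux_index):
    """Convert a Linux port index (0..17) to the front-panel number (1..18)."""
    if not 0 <= linux_index < NUM_PORTS:
        raise ValueError(f"Linux port index out of range: {linux_index}")
    return linux_index + 1


# Assumed message shapes: "... failed for port 1" or "... on port wr1".
_PORT_RE = re.compile(r"\bport (?:wr)?(\d+)\b")


def port_from_syslog(message):
    """Extract the Linux port index from a message and map it to the panel."""
    match = _PORT_RE.search(message)
    if match is None:
        raise ValueError("no port number found in message")
    return front_panel_port(int(match.group(1)))
```

Under these assumptions, a message reporting port 1 points to the connector labelled 2 on the front panel.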
\begin{itemize}
\item \texttt{wrsBootSuccessful}
\begin{enumerate}
\item Dump state
\item Check \texttt{WR-SWITCH-MIB::wrsBootConfigStatus}. If it reports an
error, please verify your WRS configuration.
\item Restart the switch
\item Please consult WR experts if the problem persists.
\end{enumerate}
\item \texttt{wrsTemperatureWarning}
\begin{enumerate}
\item Dump state
\item Verify that the cooling of the rack in which the WR Switch is installed
works properly.
\item Verify that both cooling fans at the back of the WR Switch case are
working.
\item Replace the switch with a new unit and consult the WR Switch
manufacturer for a repair.
\end{enumerate}
\item \texttt{wrsMemoryFreeLow}
\begin{enumerate}
\item Dump state
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsCpuLoadHigh}
\begin{enumerate}
\item Dump state
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsDiskSpaceLow}
\begin{enumerate}
\item Dump state
\item Check the values of \emph{CONFIG\_WRS\_LOG\_*} configuration options
on the switch. These are the parameters describing where log messages
should be sent from various processes in the switch. Normally users
don't need to modify them, but if any of them is set to a file in the
WRS filesystem (e.g. /tmp/snmp.log) this may reduce the free space after
some time of operation.
\item Restart the switch
\item Send the dumped state of the switch to WR experts for analysis as
this might mean there is some internal problem in the WRS firmware.
\end{enumerate}
\end{itemize}
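The \emph{CONFIG\_WRS\_LOG\_*} check from the \texttt{wrsDiskSpaceLow} procedure above could be automated roughly as follows. This is a sketch under assumptions: it supposes that the configuration is available as text with one Kconfig-style \texttt{NAME="value"} entry per line, the function name \texttt{log\_options\_writing\_to\_files} is hypothetical, and the option names used when trying it out are placeholders rather than real WRS options.

```python
import re

# Matches Kconfig-style lines such as CONFIG_WRS_LOG_<NAME>="value".
_LOG_OPTION_RE = re.compile(r'^(CONFIG_WRS_LOG_\w+)="(.*)"\s*$')


def log_options_writing_to_files(config_text):
    """Return {option: path} for log options that point to a regular file path.

    Values that are not absolute paths are ignored, and so are device nodes
    under /dev/, since those do not consume space in the filesystem.
    """
    hits = {}
    for line in config_text.splitlines():
        match = _LOG_OPTION_RE.match(line.strip())
        if match is None:
            continue
        name, value = match.group(1), match.group(2)
        if value.startswith("/") and not value.startswith("/dev/"):
            hits[name] = value
    return hits
```

Any option returned by this function is a candidate for the log file that slowly fills the WRS filesystem.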
\begin{itemize}
\item \texttt{wrsPTPStatus}
\begin{enumerate}
\item Dump state
\item Check \texttt{wrsSoftPLLStatus} on the Master (the WR device one step
higher in the timing hierarchy). If it reports a problem, proceed to
investigate the Master switch. Otherwise, continue with the switch that
reported the problem.
\item Verify that the link to the WR Master has not been lost by checking
the object \texttt{wrsSlaveLinksStatus}.
\item If the link is up, restart the switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
\item \texttt{wrsSoftPLLStatus}\\
For GrandMaster WRS:
\begin{enumerate}
\item Dump state
\item Check the 1-PPS and 10 MHz signals coming from an external source.
Verify that they are properly connected and, in case of a GPS receiver,
check that it is synchronized and locked.
\item Restart the GrandMaster switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
For Boundary Clock WRS:
\begin{enumerate}
\item Dump state
\item Check \texttt{wrsSoftPLLStatus} on the Master. If it reports a
problem, proceed to investigate the Master switch.
\item Verify that the link to the WR Master has not been lost by checking
the object \texttt{wrsSlaveLinksStatus}.
\item Restart the switch.
\item If the problem persists, replace the switch with a new unit (see
\ref{cern:wrs_replacement}).
\end{enumerate}
\item \texttt{wrsSlaveLinksStatus}\\
For Master/GrandMaster WRS:
\begin{enumerate}
\item Check the configuration of the switch, especially whether the
\emph{Timing Mode} is set correctly (i.e. that it was not accidentally set
to \emph{Boundary Clock}).
\item Check the role of each port in the timing configuration. All ports
should be set to \emph{master}. If any of them is set to \emph{slave}, you
should verify whether a WR Master is connected to it.
\end{enumerate}
For Boundary Clock WRS:
\begin{enumerate}
\item Check the fiber connection on the slave port of the WRS.
\item Check the configuration of the switch, especially whether the
\emph{Timing Mode} is set correctly (i.e. that it was not accidentally set
to \emph{Grand-Master} or \emph{Free-Running Master}).
\item Check the status of the WR Master connected to the slave port of the
WRS.
\item Replace the faulty switch with a new unit. If this does not solve
the problem, make sure your fiber link is not broken.
\end{enumerate}
\item \texttt{wrsPTPFramesFlowing}
% non-WR device connected, but port not set to non-WR mode
% device on the other side has some problem
% HDL / kernel crash or another problem on WRS
\begin{enumerate}
\item Check Syslog messages to determine the WR port on which the
problem is reported. You should see a message similar to this one:\\
\texttt{SNMP: wrsPTPFramesFlowing failed for port 1}
\item Check your network layout and the WR Switch configuration. If you
have non-WR devices connected to ports of the WR Switch (e.g. a
computer that only sends and receives data and does not need
synchronization), these ports should have their role in the timing
configuration set to \emph{non-wr}.
\item Check the status of the WR device connected to the reported port.
\item Restart the switch.
\item If the problem persists, please contact WR experts for in-depth
investigation.
\end{enumerate}
\end{itemize}
\begin{itemize}
\item \texttt{wrsSFPsStatus}
\begin{enumerate}
\item Check Syslog messages to determine the WR port on which the problem
is reported. You should see a message similar to this one:\\
\texttt{Unknown SFP vn="AVAGO" pn="ABCU-5710RZ" vs="AN1151PD8A" on port
wr1}
\item If the reported port is intended to be used to connect a device that
does not require WR synchronization (e.g. using a copper SFP module),
then you should verify whether the role in the timing configuration for
this port is set to \emph{non-wr}.
\item Otherwise, you should use a WR-supported SFP module and make sure it
is declared together with calibration values in the WRS configuration.
\end{enumerate}
\item \texttt{wrsEndpointStatus}
% link problem (e.g. broken SFP, fiber)
% gateware problem
\begin{enumerate}
\item Make several state dumps.
\item Restart the switch.
\item Check Syslog messages to determine the WR port on which the problem
is reported. You should see a message similar to this one:\\
\texttt{SNMP: wrsEndpointStatus failed for port 1}
\item Check the fiber link on the reported port, e.g. try replacing the SFP
transceivers on both sides of the link or try using another fiber.
\item If the problem persists, please contact WR experts for in-depth
investigation.
\end{enumerate}
\item \texttt{wrsSwcoreStatus}
\begin{enumerate}
\item Dump state.
\item Restart the switch.
\item Please contact WR experts, since this might mean that either there is
too much high-priority traffic in your network or there is some
internal problem in the WRS firmware.
\end{enumerate}
\item \texttt{wrsRTUStatus}
\begin{enumerate}
\item Dump state
\item Restart the switch.
\item If possible, try reducing the load of small Ethernet frames flowing
through your switch. If your application allows it, try using larger
Ethernet frames with a lower load to transfer the information.
\end{enumerate}
\end{itemize}
\subsection{Replacing WR Switch with a new unit}
\label{cern:wrs_replacement}
This is just a placeholder pointing to the CERN wikis that describe how to
update the MAC address in the network database so that the same configuration
is used.
......@@ -30,21 +30,21 @@ are some common remarks that apply to all situations:
that every time Syslog says there is a problem on port X, this refers to
port index X+1 on the front panel of the switch.
\item If a procedure given for a specific SNMP object does not solve the
problem. Please contact WR experts to perform more in-depth analysis of your
network. For this, you should provide a complete dump of the WRS status
problem, please contact WR experts to perform a more in-depth analysis of
the network. For this, you should provide a complete dump of the WRS status
generated in the first step of each procedure.
\item First action in most of the procedures below named \emph{Dump state}
\item The first action in most of the procedures below named \emph{Dump state}
requires simply calling a tool provided by WR developers that reads all the
detailed information from the switch and writes it to a single file that can
be later analyzed by the experts.\\
{\bf TODO: point to the tool once it's done}
\item If solving procedure requires restarting or replacing a broken WR
Switch, please make sure that after the repair, all other WR devices
\item If a problem solving procedure requires restarting or replacing a broken
WR Switch, please make sure that after the repair, all other WR devices
connected to the affected switch are synchronized and do not report any
problems.
\item If a procedure requires replacing switch with a new unit, the broken one
should be handled to WR experts or the switch manufacturer to investigate
the problem.
\item If a procedure requires replacing a switch with a new unit, the broken
one should be handled to WR experts or the switch manufacturer to
investigate the problem.
\end{itemize}
\subsection{General status objects for operators}
......@@ -52,7 +52,7 @@ are some common remarks that apply to all situations:
This section describes the general status MIB objects that represent the overall
status of a device and its subsystems. They are organized in a tree structure
(fig.\ref{fig:snmp_oper}) where each object reports a problem based on the
status of its child objects. SNMP object in the third layer of this tree are
status of its child objects. SNMP objects in the third layer of this tree are
calculated based on the SNMP expert objects. Most of the status objects
described in this section can have one of the following values:
\begin{figure}[ht]
......@@ -69,12 +69,12 @@ described in this section can have one of the following values:
\item \texttt{Warning} -- objects used to calculate this value are outside the
proper values, but the problem is not critical enough to report \texttt{Error}.
\item \texttt{WarningNA} -- at least one of the objects used to calculate the
status has a value \texttt{NA} or \texttt{WarningNA}.
status has a value \texttt{NA} (or \texttt{WarningNA}).
\item \texttt{Error} -- error in values used to calculate the particular
object.
\item \texttt{FirstRead} -- the value of the object cannot be calculated
because at least one condition uses deltas between the current and previous
value. This value should appear only at first SNMP read. Threated as a
value. This value should appear only at first SNMP read. To be treated as a
correct value.
\item \texttt{Bug} -- Something wrong has happened while calculating the
object. If you see this please report to WR developers.
......
......@@ -8,24 +8,32 @@
subsystems.}
\snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsMainSystemStatus}{
\underline{Description:}
WRS general status of a switch can be \texttt{OK}, \texttt{Warning} or
\texttt{Error}. In case of an error or warning, please check the values of
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatus}},
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatus}} and
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatus}} to find out which
subsystem causes the problem.}
subsystem causes the problem.
\glspar \underline{Related problems:}}
\snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsOSStatus}{
\underline{Description:}
Collective status of the operating system running on WR switch. In case of
an error or warning, please check status objects in the
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatusGroup}}.}
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatusGroup}}.
\glspar \underline{Related problems:}}
\snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsTimingStatus}{
\underline{Description:}
Collective status of the synchronization subsystem. In case of an
error or warning, please check status objects in the
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatusGroup}}.}
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatusGroup}}.
\glspar \underline{Related problems:}}
\snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsNetworkingStatus}{
\underline{Description:}
Collective status of the Ethernet switching subsystem. In case of an error
or warning, please check status objects in the
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatusGroup}}.}
\texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatusGroup}}.
\glspar \underline{Related problems:}}
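The drill-down described in the four entries above can be illustrated with a short Python sketch. It assumes the status values have already been read from the switch and are passed in as plain strings. The function name \texttt{groups\_to\_inspect} is hypothetical, and how the values are fetched over SNMP is outside the scope of this example.

```python
# General status objects and the group that holds their detailed objects,
# as named in the descriptions above.
DETAIL_GROUP = {
    "wrsOSStatus": "wrsOSStatusGroup",
    "wrsTimingStatus": "wrsTimingStatusGroup",
    "wrsNetworkingStatus": "wrsNetworkingStatusGroup",
}


def groups_to_inspect(statuses):
    """Return the status groups worth checking for the given status values.

    `statuses` maps object names to "OK", "Warning" or "Error".
    """
    if statuses.get("wrsMainSystemStatus") == "OK":
        return []  # nothing to investigate
    return [
        group
        for obj, group in DETAIL_GROUP.items()
        if statuses.get(obj) in ("Warning", "Error")
    ]
```

For example, if \texttt{wrsMainSystemStatus} and \texttt{wrsTimingStatus} both report \texttt{Error} while the other two objects are \texttt{OK}, only \texttt{wrsTimingStatusGroup} is returned.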
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\snmpentrys{WR-SWITCH-MIB}{}{wrsDetailedStatusesGroup}{
......@@ -174,7 +182,7 @@
\begin{pck_proc}
\item Dump state
\item Check 1-PPS and 10 MHz signals coming from an external source.
Verify if they are properly connected and, in case of GPS receiver,
Verify if they are properly connected and, in case of a GPS receiver,
check if it is synchronized and locked.
\item Restart the GrandMaster switch.
\item If the problem persists, replace the switch with a new unit.
......@@ -388,7 +396,7 @@
\snmpentrys{WR-SWITCH-MIB}{wrsVersionGroup}{wrsVersionLastUpdateDate}{
\underline{Description:}
Date and time of the last firmware update, this information may not be
Date and time of the last firmware update. This information may not be
accurate, due to hard restarts or lack of the proper time during the
upgrade.}
......