White Rabbit Switch - Software
Commit 515758d4, authored Jan 20, 2016 by Grzegorz Daniluk
doc/wrs_failures: adding first draft of procedures in case of errors
parent 59beed5d
Showing 2 changed files with 211 additions and 0 deletions
doc/wrs_failures/procedures.tex (+208, -0)
doc/wrs_failures/wrs_failures.tex (+3, -0)
doc/wrs_failures/procedures.tex (0 → 100644)
\section{Repair procedures}
General rules:
\begin{itemize}
\item Linux inside the WR Switch enumerates WR interfaces starting from 0,
  so internally the ports are indexed 0..17. The port numbers printed on
  the front panel, however, are 1..18. Syslog messages generated by the
  switch use the Linux port numbering. Consequently, every time Syslog
  reports a problem on port X, this refers to port X+1 on the front panel
  of the switch.
\item If a procedure given for a specific SNMP object does not solve the
  problem, please contact WR experts to perform a more in-depth analysis of
  your network. For this, you should provide the complete dump of the WRS
  status generated in the first step of each procedure.
\item If a procedure requires restarting or replacing a broken WR
  Switch, please make sure that all other WR devices connected to the affected
  switch are synchronized and do not report any problems.
\item If a procedure requires replacing the switch with a new unit, the broken
  one should be handed over to WR experts so that they can investigate the
  problem.
\end{itemize}
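The port-numbering rule above can be sketched in Python. This is only an illustration and not part of the WRS software: the function names are hypothetical, and the pattern assumes Syslog messages ending in \texttt{port N}, as in the examples quoted in these procedures.

```python
import re

NUM_PORTS = 18  # front panel ports are labelled 1..18


def front_panel_port(linux_index: int) -> int:
    """Convert a Linux port index (0..17) to the front panel number (1..18)."""
    if not 0 <= linux_index < NUM_PORTS:
        raise ValueError(f"Linux port index out of range: {linux_index}")
    return linux_index + 1


# Assumed message shape: the text ends with "port <N>", where N is the
# Linux port index. Other message formats are not handled here.
_PORT_RE = re.compile(r"\bport (\d+)\s*$")


def front_panel_port_from_syslog(message: str) -> int:
    """Extract the Linux port index from a Syslog line and convert it."""
    match = _PORT_RE.search(message)
    if match is None:
        raise ValueError(f"no port number found in: {message!r}")
    return front_panel_port(int(match.group(1)))
```

For example, a message reporting a failure on port 1 points to the connector labelled 2 on the front panel.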
\begin{itemize}
\item \texttt{wrsBootSuccessful}
  \begin{enumerate}
    \item Dump state.
    \item Check \texttt{WR-SWITCH-MIB::wrsBootConfigStatus}. If it reports an
      error, please verify your WRS configuration.
    \item Restart the switch.
    \item Please consult WR experts if the problem persists.
  \end{enumerate}
\item \texttt{wrsTemperatureWarning}
  \begin{enumerate}
    \item Dump state.
    \item Verify that the cooling of the rack where the WR Switch is installed
      works properly.
    \item Verify that both cooling fans at the back of the WR Switch case are
      working.
    \item Replace the switch with a new unit and consult the WR Switch
      manufacturer for a repair.
  \end{enumerate}
\item \texttt{wrsMemoryFreeLow}
  \begin{enumerate}
    \item Dump state.
    \item Restart the switch.
    \item Send the dumped state of the switch to WR experts for analysis, as
      this might indicate an internal problem in the WRS firmware.
  \end{enumerate}
\item \texttt{wrsCpuLoadHigh}
  \begin{enumerate}
    \item Dump state.
    \item Restart the switch.
    \item Send the dumped state of the switch to WR experts for analysis, as
      this might indicate an internal problem in the WRS firmware.
  \end{enumerate}
\item \texttt{wrsDiskSpaceLow}
  \begin{enumerate}
    \item Dump state.
    \item Check the values of the \emph{CONFIG\_WRS\_LOG\_*} configuration
      options on the switch. These parameters describe where the log messages
      from the various processes in the switch should be sent. Normally users
      don't need to modify them, but if any of them is set to a file in the
      WRS filesystem (e.g. /tmp/snmp.log), this may reduce the free space
      after some time of operation.
    \item Restart the switch.
    \item Send the dumped state of the switch to WR experts for analysis, as
      this might indicate an internal problem in the WRS firmware.
  \end{enumerate}
\end{itemize}
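The check of the CONFIG\_WRS\_LOG\_* options described for \texttt{wrsDiskSpaceLow} could be automated roughly as follows. This is a sketch under stated assumptions: it assumes the configuration is available as text with one \texttt{NAME="value"} entry per line, the function name is hypothetical, and the option names in the usage note are made up for illustration.

```python
import re

# Matches lines such as CONFIG_WRS_LOG_XYZ="value". The option suffix is
# not checked against any list of real option names.
_LOG_OPTION_RE = re.compile(r'^(CONFIG_WRS_LOG_\w+)="?([^"\n]*)"?\s*$')


def log_options_writing_to_files(config_text: str) -> dict[str, str]:
    """Return the CONFIG_WRS_LOG_* options whose value looks like a file path.

    Heuristic: a value starting with "/" is treated as a path in the WRS
    filesystem; anything else is assumed not to consume disk space.
    """
    suspects = {}
    for line in config_text.splitlines():
        match = _LOG_OPTION_RE.match(line.strip())
        if match and match.group(2).startswith("/"):
            suspects[match.group(1)] = match.group(2)
    return suspects
```

Given a configuration in which a hypothetical option CONFIG\_WRS\_LOG\_FOO is set to /tmp/snmp.log, the function returns that single entry. Values pointing at device files would also be flagged by this heuristic and need to be judged by hand.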
\begin{itemize}
\item \texttt{wrsPTPStatus}
  \begin{enumerate}
    \item Dump state.
    \item Check \texttt{wrsSoftPLLStatus} on the Master (the WR device one step
      higher in the timing hierarchy). If it reports an error, proceed to
      investigate the problem on the Master switch. Otherwise, continue with
      the WRS that originally reported the problem.
    \item Verify that the link to the WR Master has not been lost by checking
      the object \texttt{wrsSlaveLinksStatus}.
    \item If the link has not been lost, restart the switch.
    \item If the problem persists, replace the switch with a new unit (see
      \ref{cern:wrs_replacement}).
  \end{enumerate}
\item \texttt{wrsSoftPLLStatus}\\
  For GrandMaster WRS:
  \begin{enumerate}
    \item Dump state.
    \item Check the 1-PPS and 10 MHz signals coming from the external source.
      Verify that they are properly connected and, in the case of a GPS
      receiver, check that it is synchronized and locked.
    \item Restart the GrandMaster switch.
    \item If the problem persists, replace the switch with a new unit (see
      \ref{cern:wrs_replacement}).
  \end{enumerate}
  For Boundary Clock WRS:
  \begin{enumerate}
    \item Dump state.
    \item Check \texttt{wrsSoftPLLStatus} on the Master. If it reports an
      error, proceed to investigate the problem on the Master switch.
    \item Verify that the link to the WR Master has not been lost by checking
      the object \texttt{wrsSlaveLinksStatus}.
    \item Restart the switch.
    \item If the problem persists, replace the switch with a new unit (see
      \ref{cern:wrs_replacement}).
  \end{enumerate}
\item \texttt{wrsSlaveLinksStatus}\\
  For Master/GrandMaster WRS:
  \begin{enumerate}
    \item Check the configuration of the switch, especially whether the
      \emph{Timing Mode} is set correctly (i.e. that it was not accidentally
      set to \emph{Boundary Clock}).
    \item Check the role of each port in the timing configuration. They should
      all be set to \emph{master}. If any of them is set to \emph{slave}, you
      should check whether a WR Master is connected to it.
  \end{enumerate}
  For Boundary Clock WRS:
  \begin{enumerate}
    \item Check the fiber connection on the slave port of the WRS.
    \item Check the configuration of the switch, especially whether the
      \emph{Timing Mode} is set correctly (i.e. that it was not accidentally
      set to \emph{Grand-Master} or \emph{Free-Running Master}).
    \item Check the status of the WR Master connected to the slave port of the
      WRS.
    \item Replace the faulty switch with a new unit. If this does not solve
      the problem, make sure your fiber link is not broken.
  \end{enumerate}
\item \texttt{wrsPTPFramesFlowing}
  % non-WR device connected, but port not set to non-WR mode
  % device on the other side has some problem
  % HDL / kernel crash or another problem on WRS
  \begin{enumerate}
    \item Check the Syslog messages to determine the WR port on which the
      problem is reported. You should see a message similar to this one:\\
      \texttt{SNMP: wrsPTPFramesFlowing failed for port 1}
    \item Check your network layout and the WR Switch configuration. If you
      have non-WR devices connected to ports of the WR Switch (e.g. a
      computer sending/receiving only data, with no need for
      synchronization), these ports should have their role in the timing
      configuration set to \emph{non-wr}.
    \item Check the status of the WR device connected to the reported port.
    \item Restart the switch.
    \item If the problem persists, please contact WR experts for an in-depth
      investigation.
  \end{enumerate}
\end{itemize}
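The configuration checks listed for \texttt{wrsSlaveLinksStatus} can be written down as two small predicates. This is a minimal sketch, assuming the timing mode and port roles have already been read from the switch configuration. The function names are hypothetical, and the strings are the names used in the text above, which may differ from the literal values stored in a configuration file.

```python
# Timing mode names as they appear in the procedures above (assumed labels).
MASTER_MODES = {"Grand-Master", "Free-Running Master"}
BOUNDARY_MODE = "Boundary Clock"


def timing_mode_is_plausible(intended_as_master: bool, timing_mode: str) -> bool:
    """Check that the configured Timing Mode matches the intended role.

    A switch meant to act as Master/GrandMaster must not be configured as a
    Boundary Clock, and a Boundary Clock must not be configured as a master.
    """
    if intended_as_master:
        return timing_mode in MASTER_MODES
    return timing_mode == BOUNDARY_MODE


def unexpected_slave_ports(port_roles: dict[int, str]) -> list[int]:
    """On a Master/GrandMaster, list the ports whose role is set to 'slave'.

    Keys are port numbers in whatever numbering the caller uses; the
    returned ports are the ones to inspect for a connected WR Master.
    """
    return sorted(port for port, role in port_roles.items() if role == "slave")
```

An empty list from \texttt{unexpected\_slave\_ports} means that no port on a Master/GrandMaster switch is configured as \emph{slave}.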
\begin{itemize}
\item \texttt{wrsSFPsStatus}
  \begin{enumerate}
    \item Check the Syslog messages to determine the WR port on which the
      problem is reported. You should see a message similar to this one:\\
      \texttt{Unknown SFP vn="AVAGO" pn="ABCU-5710RZ" vs="AN1151PD8A" on port
      wr1}
    \item If the reported port is intended to be used to connect a device that
      does not require WR synchronization (e.g. using a copper SFP module),
      then you should verify whether the role in the timing configuration for
      this port is set to \emph{non-wr}.
    \item Otherwise, you should use a WR-supported SFP module and make sure it
      is declared together with its calibration values in the WRS
      configuration.
  \end{enumerate}
\item \texttt{wrsEndpointStatus}
  % link problem (e.g. broken SFP, fiber)
  % gateware problem
  \begin{enumerate}
    \item Make several state dumps.
    \item Restart the switch.
    \item Check the Syslog messages to determine the WR port on which the
      problem is reported. You should see a message similar to this one:\\
      \texttt{SNMP: wrsEndpointStatus failed for port 1}
    \item Check the fiber link on the reported port, i.e. try replacing the SFP
      transceivers on both sides of the link and try using another fiber.
    \item If the problem persists, please contact WR experts for an in-depth
      investigation.
  \end{enumerate}
\item \texttt{wrsSwcoreStatus}
  \begin{enumerate}
    \item Dump state.
    \item Restart the switch.
    \item Please contact WR experts, since this might mean either that there
      is too much high-priority traffic in your network or that there is an
      internal problem in the WRS firmware.
  \end{enumerate}
\item \texttt{wrsRTUStatus}
  \begin{enumerate}
    \item Dump state.
    \item Restart the switch.
    \item If possible, try reducing the load of small Ethernet frames flowing
      through your switch. If your application allows it, try using larger
      Ethernet frames with a lower load to transfer the information.
  \end{enumerate}
\end{itemize}
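When many switches are monitored, the Syslog line quoted for \texttt{wrsSFPsStatus} can be split into its fields automatically. The sketch below assumes the message has exactly the shape of the example above; the function name is hypothetical, and the field names \texttt{vn}, \texttt{pn} and \texttt{vs} are kept as they appear in the message rather than interpreted.

```python
import re
from typing import Optional

# Pattern modelled on the example message only; other variants of the
# message, if they exist, are not covered.
_UNKNOWN_SFP_RE = re.compile(
    r'Unknown SFP vn="(?P<vn>[^"]*)" pn="(?P<pn>[^"]*)" '
    r'vs="(?P<vs>[^"]*)" on port (?P<port>\S+)'
)


def parse_unknown_sfp(message: str) -> Optional[dict[str, str]]:
    """Return the vn/pn/vs/port fields of an 'Unknown SFP' line, or None."""
    match = _UNKNOWN_SFP_RE.search(message)
    return match.groupdict() if match else None
```

The \texttt{port} field holds the interface name as reported by the switch (\texttt{wr1} in the example), which follows the Linux port numbering described in the general rules.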
\subsection{Replacing WR Switch with a new unit}
\label{cern:wrs_replacement}
This is just a placeholder pointing to the CERN wikis that describe how to
update the MAC address in the network database so that the same configuration
is used.
doc/wrs_failures/wrs_failures.tex
@@ -256,4 +256,7 @@
\ifglsused{\thislabel}{}{\glsadd[format=ignore]{\thislabel}}%
}
\newpage
\input{procedures.tex}
\end{document}