White Rabbit Switch - Software
Commit 6ff70c97, authored Feb 01, 2016 by Grzegorz Daniluk
doc/wrs_failures: more cleanup and a/the fixes
parent c8593494
Showing 4 changed files with 235 additions and 390 deletions
doc/wrs_failures/fail.tex (+210, -165)
doc/wrs_failures/procedures.tex (+0, -208)
doc/wrs_failures/snmp_exports.tex (+11, -11)
doc/wrs_failures/snmp_objects.tex (+14, -6)
doc/wrs_failures/fail.tex
This diff is collapsed.
doc/wrs_failures/procedures.tex
deleted, 100644 → 0
\section{Repair procedures}
General rules:
\begin{itemize}
  \item Linux inside the WR Switch enumerates WR interfaces starting from 0.
    This means we have to use internally port indexes 0..17. However, the
    port numbers printed on the front panel are 1..18. Syslog messages
    generated from the switch use the Linux port numbering. The consequence is
    that every time Syslog says there is a problem on port X, this refers to
    port index X+1 on the front panel of the switch.
  \item If a procedure given for a specific SNMP object does not solve the
    problem. Please contact WR experts to perform more in-depth analysis of your
    network. For this, you should provide a complete dump of the WRS status
    generated in the first step of each procedure.
  \item If a solving procedure requires restarting or replacing a broken WR
    Switch, please make sure that all other WR devices connected to the affected
    switch are synchronized and do not report any problems.
  \item If procedure requires replacing switch with a new unit, the broken one
    should be handled to WR experts to investigate the problem.
\end{itemize}
\begin{itemize}
  \item \texttt{wrsBootSuccessful}
    \begin{enumerate}
      \item Dump state
      \item Check \texttt{WR-SWITCH-MIB::wrsBootConfigStatus}, if it reports an
        error, please verify your WRS configuration.
      \item Restart the switch
      \item Please consult WR experts if the problem persists.
    \end{enumerate}
  \item \texttt{wrsTemperatureWarning}
    \begin{enumerate}
      \item Dump state
      \item Verify if cooling of the rack where WR Switch is installed works
        properly.
      \item Verify if both cooling fans in the back of the WR Switch case are
        working.
      \item Replace the switch with a new unit and consult the WR Switch
        manufacturer for a repair.
    \end{enumerate}
  \item \texttt{wrsMemoryFreeLow}
    \begin{enumerate}
      \item Dump state
      \item Restart the switch
      \item Send the dumped state of the switch to WR experts for analysis as
        this might mean there is some internal problem in the WRS firmware.
    \end{enumerate}
  \item \texttt{wrsCpuLoadHigh}
    \begin{enumerate}
      \item Dump state
      \item Restart the switch
      \item Send the dumped state of the switch to WR experts for analysis as
        this might mean there is some internal problem in the WRS firmware.
    \end{enumerate}
  \item \texttt{wrsDiskSpaceLow}
    \begin{enumerate}
      \item Dump state
      \item Check the values of \emph{CONFIG\_WRS\_LOG\_*} configuration options
        on the switch. These are the parameters describing where log messages
        should be sent from various processes in the switch. Normally users
        don't need to modify them, but if any of them is set to a file in the
        WRS filesystem (e.g. /tmp/snmp.log) this may reduce the free space after
        some time of operation.
      \item Restart the switch
      \item Send the dumped state of the switch to WR experts for analysis as
        this might mean there is some internal problem in the WRS firmware.
    \end{enumerate}
\end{itemize}
\begin{itemize}
  \item \texttt{wrsPTPStatus}
    \begin{enumerate}
      \item Dump state
      \item Check \texttt{wrsSoftPLLStatus} on the Master (WR device one step
        higher in a timing hierarchy). Eventually proceed to investigate the
        problem on the Master switch. Otherwise, continue with the primary WRS.
      \item Verify if the link to WR Master was not lost by checking the object
        \texttt{wrsSlaveLinksStatus}.
      \item If this is not the case, restart the switch.
      \item If the problem persists replace the switch with a new unit (see
        \ref{cern:wrs_replacement}).
    \end{enumerate}
  \item \texttt{wrsSoftPLLStatus}\\
    For GrandMaster WRS:
    \begin{enumerate}
      \item Dump state
      \item Check 1-PPS and 10 MHz signals coming from an external source.
        Verify if they are properly connected and, in case of GPS receiver,
        check if it is synchronized and locked.
      \item Restart the GrandMaster switch.
      \item If the problem persists, replace the switch with a new unit (see
        \ref{cern:wrs_replacement}).
    \end{enumerate}
    For Boundary Clock WRS:
    \begin{enumerate}
      \item Dump state
      \item Check \texttt{wrsSoftPLLStatus} on the Master. Eventually proceed to
        investigate the problem on the Master switch.
      \item Verify if the link to WR Master was not lost by checking the object
        \texttt{wrsSlaveLinksStatus}.
      \item Restart the switch.
      \item If the problem persists, replace the switch with a new unit (see
        \ref{cern:wrs_replacement}).
    \end{enumerate}
  \item \texttt{wrsSlaveLinksStatus}\\
    For Master/GrandMaster WRS:
    \begin{enumerate}
      \item Check the configuration of the switch. Especially if the
        \emph{Timing Mode} is correctly set (i.e. if it was not accidentally set
        to \emph{Boundary Clock}).
      \item Check the role of each port timing configuration. They should be all
        set to \emph{master}. If any of them is set to \emph{slave} you should
        verify if there is no WR Master connected to it.
    \end{enumerate}
    For Boundary Clock WRS:
    \begin{enumerate}
      \item Check the fiber connection on the slave port of the WRS.
      \item Check the configuration of the switch. Especially if the
        \emph{Timing Mode} is correctly set (i.e. if it was not accidentally set
        to \emph{Grand-Master} or \emph{Free-Running Master}).
      \item Check the status of the WR Master connected to the slave port of the
        WRS.
      \item Replace the faulty switch with a new unit, if this does not solve
        the problem, make sure your fiber link is not broken.
    \end{enumerate}
  \item \texttt{wrsPTPFramesFlowing}
    % non-WR device connected, but port not set to non-WR mode
    % device on the other side has some problem
    % HDL / kernel crash or another problem on WRS
    \begin{enumerate}
      \item Check Syslog message to determine the WR port on which the
        problem is reported. You should see a message similar to this one:\\
        \texttt{SNMP: wrsPTPFramesFlowing failed for port 1}
      \item Check your network layout and the WR Switch configuration. If you
        have some non-WR devices connected to ports of the WR Switch (e.g.
        computer sending/receiving only data, without the need of
        synchronization), these ports should have their role in the timing
        configuration set to \emph{non-wr}.
      \item Check the status of a WR device connected to the reported port.
      \item Restart the switch.
      \item If the problem persists, please contact WR experts for in-depth
        investigation.
    \end{enumerate}
\end{itemize}
\begin{itemize}
  \item \texttt{wrsSFPsStatus}
    \begin{enumerate}
      \item Check Syslog messages to determine the WR port on which the problem
        is reported. You should see a message similar to this one:\\
        \texttt{Unknown SFP vn="AVAGO" pn="ABCU-5710RZ" vs="AN1151PD8A" on port
        wr1}
      \item If the reported port is intended to be used to connect a device that
        does not require WR synchronization (e.g. using a copper SFP module),
        then you should verify whether the role in the timing configuration for
        this port is set to \emph{non-wr}.
      \item Otherwise, you should use a WR-supported SFP module and make sure it
        is declared together with calibration values in the WRS configuration.
    \end{enumerate}
  \item \texttt{wrsEndpointStatus}
    % link problem (e.g. broken SFP, fiber)
    % gateware problem
    \begin{enumerate}
      \item Make several state dumps.
      \item Restart the switch.
      \item Check Syslog messages to determine the WR port on which the problem
        is reported. You should see a message similar to this one:\\
        \texttt{SNMP: wrsEndpointStatus failed for port 1}
      \item Check the fiber link on a reported port, i.e. try replacing SFP
        transceivers on both sides of the link, try using another fiber.
      \item If the problem persists, please contact WR experts for in-depth
        investigation.
    \end{enumerate}
  \item \texttt{wrsSwcoreStatus}
    \begin{enumerate}
      \item Dump state.
      \item Restart the switch.
      \item Please contact WR experts since this might mean that either there is
        too much high priority traffic in your network, or there is some
        internal problem in the WRS firmware.
    \end{enumerate}
  \item \texttt{wrsRTUStatus}
    \begin{enumerate}
      \item Dump state
      \item Restart the switch.
      \item If possible, try reducing the load of small Ethernet frames flowing
        through your switch. If possible in your application, try using larger
        Ethernet frames with lower load to transfer information.
    \end{enumerate}
\end{itemize}
\subsection{Replacing WR Switch with a new unit}
\label{cern:wrs_replacement}
This just a reference holder to point to the CERN wikis with the description of
updating MAC in network database so that the same configuration is used.
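The port-numbering rule from the general rules in procedures.tex (Syslog reports Linux port index X, the front panel label is X+1) can be sketched as a small helper. This is only an illustration, not part of the WRS tooling: the function name `front_panel_port` and the two regular expressions are assumptions modelled on the sample Syslog lines quoted in the procedures above, and real firmware may emit other message formats.

```python
import re

# The switch has 18 WR ports: Linux indexes 0..17, front-panel labels 1..18.
WRS_PORT_COUNT = 18

# Assumption: these patterns mirror only the sample messages quoted in
# procedures.tex ("... failed for port 1" and "... on port wr1").
_PORT_PATTERNS = (
    re.compile(r"failed for port\s+(\d+)"),
    re.compile(r"on port\s+wr(\d+)"),
)


def front_panel_port(syslog_line):
    """Return the front-panel port number named in a Syslog line, or None.

    Hypothetical helper: applies the "port X in Syslog is port X+1 on the
    front panel" rule to the first port reference found in the line.
    """
    for pattern in _PORT_PATTERNS:
        match = pattern.search(syslog_line)
        if match:
            linux_index = int(match.group(1))
            if not 0 <= linux_index < WRS_PORT_COUNT:
                raise ValueError(
                    f"port index {linux_index} outside 0..{WRS_PORT_COUNT - 1}"
                )
            return linux_index + 1
    return None


if __name__ == "__main__":
    print(front_panel_port("SNMP: wrsPTPFramesFlowing failed for port 1"))
```

Lines that contain no recognised port reference yield `None`, so the helper can be run over a whole log without filtering it first.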
doc/wrs_failures/snmp_exports.tex
...
...
@@ -30,21 +30,21 @@ are some common remarks that apply to all situations:
   that every time Syslog says there is a problem on port X, this refers to
   port index X+1 on the front panel of the switch.
   \item If a procedure given for a specific SNMP object does not solve the
-  problem. Please contact WR experts to perform more in-depth analysis of your
-  network. For this, you should provide a complete dump of the WRS status
+  problem, please contact WR experts to perform a more in-depth analysis of
+  the network. For this, you should provide a complete dump of the WRS status
   generated in the first step of each procedure.
-  \item First action in most of the procedures below named \emph{Dump state}
+  \item The first action in most of the procedures below named \emph{Dump state}
   requires simply calling a tool provided by WR developers that reads all the
   detailed information from the switch and writes it to a single file that can
   be later analyzed by the experts.\\
   {\bf TODO: point to the tool once it's done}
-  \item If solving procedure requires restarting or replacing a broken WR
-  Switch, please make sure that after the repair, all other WR devices
+  \item If a problem solving procedure requires restarting or replacing a broken
+  WR Switch, please make sure that after the repair, all other WR devices
   connected to the affected switch are synchronized and do not report any
   problems.
-  \item If a procedure requires replacing switch with a new unit, the broken one
-  should be handled to WR experts or the switch manufacturer to investigate
-  the problem.
+  \item If a procedure requires replacing a switch with a new unit, the broken
+  one should be handled to WR experts or the switch manufacturer to
+  investigate the problem.
 \end{itemize}
 \subsection{General status objects for operators}
...
...
@@ -52,7 +52,7 @@ are some common remarks that apply to all situations:
 This section describes the general status MIB objects that represent the overall
 status of a device and its subsystems. They are organized in a tree structure
 (fig. \ref{fig:snmp_oper}) where each object reports a problem based on the
-status of its child objects. SNMP object in the third layer of this tree are
+status of its child objects. SNMP objects in the third layer of this tree are
 calculated based on the SNMP expert objects. Most of the status objects
 described in this section can have one of the following values:
 \begin{figure}[ht]
...
...
@@ -69,12 +69,12 @@ described in this section can have one of the following values:
   \item \texttt{Warning} -- objects used to calculate this value are outside the
   proper values, but problem in not critical enough to report \texttt{Error}.
   \item \texttt{WarningNA} -- at least one of the objects used to calculate the
-  status has a value \texttt{NA} or \texttt{WarningNA}.
+  status has a value \texttt{NA} (or \texttt{WarningNA}).
   \item \texttt{Error} -- error in values used to calculate the particular
   object.
   \item \texttt{FirstRead} -- the value of the object cannot be calculated
   because at least one condition uses deltas between the current and previous
-  value. This value should appear only at first SNMP read. Threated as a
+  value. This value should appear only at first SNMP read. To be treated as a
   correct value.
   \item \texttt{Bug} -- Something wrong has happened while calculating the
   object. If you see this please report to WR developers.
...
...
doc/wrs_failures/snmp_objects.tex
...
...
@@ -8,24 +8,32 @@
   subsystems.}
 \snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsMainSystemStatus}{
   \underline{Description:}
   WRS general status of a switch can be \texttt{OK}, \texttt{Warning} or
   \texttt{Error}. In case of an error or warning, please check the values of
   \texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatus}},
   \texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatus}} and
   \texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatus}} to find out which
-  subsystem causes the problem.}
+  subsystem causes the problem.
+  \glspar
+  \underline{Related problems:}}
 \snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsOSStatus}{
   \underline{Description:}
   Collective status of the operating system running on WR switch. In case of
   an error or warning, please check status objects in the
-  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatusGroup}}.}
+  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsOSStatusGroup}}.
+  \glspar
+  \underline{Related problems:}}
 \snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsTimingStatus}{
   \underline{Description:}
   Collective status of the synchronization subsystem. In case of an
   error or warning, please check status objects in the
-  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatusGroup}}.}
+  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsTimingStatusGroup}}.
+  \glspar
+  \underline{Related problems:}}
 \snmpentrys{WR-SWITCH-MIB}{wrsGeneralStatusGroup}{wrsNetworkingStatus}{
   \underline{Description:}
   Collective status of the Ethernet switching subsystem. In case of an error
   or warning, please check status objects in the
-  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatusGroup}}.}
+  \texttt{\glshyperlink{WR-SWITCH-MIB::wrsNetworkingStatusGroup}}.
+  \glspar
+  \underline{Related problems:}}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \snmpentrys{WR-SWITCH-MIB}{}{wrsDetailedStatusesGroup}{
...
...
@@ -174,7 +182,7 @@
 \begin{pck_proc}
   \item Dump state
   \item Check 1-PPS and 10 MHz signals coming from an external source.
-  Verify if they are properly connected and, in case of GPS receiver,
+  Verify if they are properly connected and, in case of a GPS receiver,
   check if it is synchronized and locked.
   \item Restart the GrandMaster switch.
   \item If the problem persists, replace the switch with a new unit.
...
...
@@ -388,7 +396,7 @@
 \snmpentrys{WR-SWITCH-MIB}{wrsVersionGroup}{wrsVersionLastUpdateDate}{
   \underline{Description:}
-  Date and time of the last firmware update, this information may not be
+  Date and time of the last firmware update. This information may not be
   accurate, due to hard restarts or lack of the proper time during the
   upgrade.}
...
...
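The drill-down described in the `wrsMainSystemStatus` entry of snmp_objects.tex (on a warning or error, look at `wrsOSStatus`, `wrsTimingStatus` and `wrsNetworkingStatus`, then at the matching status group) can be sketched as follows. This is a minimal illustration under stated assumptions: the status values are taken to have been read already and passed in as plain strings, the function name `groups_to_inspect` is hypothetical, and treating `FirstRead` as healthy follows the "to be treated as a correct value" remark in snmp_exports.tex.

```python
# Values that need no follow-up. "FirstRead" is included because
# snmp_exports.tex says it is to be treated as a correct value.
HEALTHY_VALUES = {"OK", "FirstRead"}

# Subsystem status object -> status group to inspect next, as named in the
# wrsOSStatus, wrsTimingStatus and wrsNetworkingStatus descriptions.
SUBSYSTEM_GROUPS = {
    "wrsOSStatus": "wrsOSStatusGroup",
    "wrsTimingStatus": "wrsTimingStatusGroup",
    "wrsNetworkingStatus": "wrsNetworkingStatusGroup",
}


def groups_to_inspect(statuses):
    """Return the status groups an operator should look at next.

    `statuses` maps object names to their textual values, for example
    {"wrsMainSystemStatus": "Error", "wrsTimingStatus": "Error"}.
    Assumption of this sketch: a subsystem whose value is missing is
    reported as well, since its state is unknown.
    """
    if statuses.get("wrsMainSystemStatus") in HEALTHY_VALUES:
        return []
    return [
        group
        for name, group in SUBSYSTEM_GROUPS.items()
        if statuses.get(name) not in HEALTHY_VALUES
    ]


if __name__ == "__main__":
    sample = {
        "wrsMainSystemStatus": "Error",
        "wrsOSStatus": "OK",
        "wrsTimingStatus": "Error",
        "wrsNetworkingStatus": "OK",
    }
    print(groups_to_inspect(sample))
```

Keeping the mapping in a dictionary means a change to the status tree only requires a new entry, not new control flow.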