Commit 5bd39806 authored by Grzegorz Daniluk

papers/ICALEPCS2011_PEER_REVIEWED: use common figures

parent 6d8a7fa8
\section{Clock Distribution}
% The resilience of the Clock Distribution translates into continuous and stable
% synchronization of all the nodes and switches in the WRN (Table~\ref{tab:requirements}).
% A loss of time notion in a node can be caused by a link or switch failure - break of clock path
% between the TM and the node. In order to prevent synchronization break, redundancy of network
% elements (switches, cables) can be introduced ensuring redundant clock paths. However,
% the switch-over between redundant elements might introduce instability and render the network
% unreliable despite the costly redundancy. Therefore, the seamless switch-over between redundant
% clock paths is one of the design-goals to enable network topology redundancy and, as a consequence,
% offer robust and stable synchronization. The other reasons for the deterioration of synchronization
% accuracy are the variation of external conditions (e.g. temperature) and loss of Ethernet frames with
% timing information (PTP).
%\subsection{Switch-over}
A seamless switch-over between redundant sources of timing (uplink ports) is heavily supported by
the Clock Recovery System (CRS) \cite{biblio:TomekMSc} of the switch and the WR extension to PTP
(WRPTP)\cite{biblio:WRPTP}.
Figure~\ref{fig:switch-over} presents an example where a switch (timing slave) is connected
(by its uplinks 1 \& 2)
to two other switches (primary and secondary masters) -- the sources of timing. On both
uplinks the frequency is recovered from the signal and provided to the CRS. Similarly, WRPTP
measures delay and offset on each of the links and provides this data to the CRS.
The modified Best Master Clock (mBMC) algorithm \cite{biblio:WRPTP} decides which of the
timing masters is ``better'' and elects it as the primary; the other is considered secondary (backup).
The information from {\it uplink 1} (primary) is used to control
the CRS and adjust the local time. However, at any time all the necessary information from the
{\it uplink 2} is available and a seamless switch-over can be performed in case of
primary master failure \cite{biblio:TomekMSc}.
\begin{figure}[t]
\centering
\includegraphics[width=3.2in]{../../figures/robustness/clockDistribution.eps}
\caption{Seamless switch-over.}
\label{fig:switch-over}
\end{figure}
%\subsection{Variable conditions and loss of PTP messages}
In addition to the switch-over-related synchronization instability, variation of the external temperature
can degrade the accuracy. This problem, however, is already addressed by the PTP standard itself:
the link delay is measured frequently, so the fluctuation is compensated.
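
For orientation, the delay and offset that plain PTP \cite{biblio:IEEE1588} derives from one two-way
exchange are sketched below. This is the textbook relation under the assumption of a symmetric link,
not the complete WRPTP link model of \cite{biblio:WRPTP}; the symbols $t_1 \dots t_4$ are introduced
here for illustration only. Let $t_1$ and $t_4$ be the master timestamps of sending the Sync and
receiving the Delay\_Req message, and $t_2$ and $t_3$ the slave timestamps of receiving the Sync and
sending the Delay\_Req message. Then
\begin{equation}
\mathit{delay} = \frac{(t_2 - t_1) + (t_4 - t_3)}{2},
\end{equation}
\begin{equation}
\mathit{offset} = \frac{(t_2 - t_1) - (t_4 - t_3)}{2}.
\end{equation}
Since both values are recomputed at every exchange, a slow change of the link delay
(e.g. with temperature) is tracked in $\mathit{delay}$ and does not accumulate in $\mathit{offset}$.
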
\section{Data Resilience}
\subsection{Forward Error Correction}
\label{sec:fec}
The objective of the FEC scheme is to decrease the loss rate of the CMs, preferably
to less than one per year. WR uses optical fiber and CAT-5 cable as physical media. The number
of received corrupted bits compared to the total number of received bits is called Bit Error Rate
(BER). The value of BER characterizes a physical medium and can be used to characterize the entire
switched network.
% if the following factors are taken into account:
% (1) {\it type of cabling} (fiber/twisted pair),
% (2) {\it logic topology},
% (3) {\it network address} (broadcast/unicast).
A WRN can be seen as a Packet Erasure Channel (PEC) or as a Binary Erasure Channel (BEC) depending
on the effect of a bit error on the frame. If the frame is lost (e.g. dropped by the switch due to
a corrupted header or lost during switch-over between redundant components), the WRN is a PEC.
If the bit error happens in the link between a switch and node, a corrupted frame
\modified{can be used (optional)}
%is used
to attempt frame recovery. In such a case, the channel is called a BEC. Each type of channel requires
a different FEC solution. Therefore, two concatenated FECs are used in WR.
Reed-Solomon (R-S) %\cite{biblio:r-s}
%\cite{biblio:coding}
coding is used for the PEC and
allows $k$ original frames to be encoded into $n$ encoded frames ($n>k$).
Reception of any $k$ encoded frames is sufficient to decode the original frames.
\modified{Hamming coding with additional parity (SEC-DED)}
%Hamming coding
is used for the BEC and allows up to two simultaneous bit errors to be detected and
a single error to be corrected.
These two schemes (R-S and Hamming) are combined to encode each CM -- it is
split into two and encoded using R-S into four messages (two original and two
with redundant data). Each of the four messages is then encoded using Hamming. Such encoded
messages are sent in a burst of 4 Ethernet frames. Reception of any two of these frames is sufficient
to decode the original Control Message.
A systematic analysis, using the BER characteristic of the WRN, shows that the presented FEC scheme
keeps the loss due to physical medium imperfection below a single CM per year,
as can be seen from Table~\ref{tab:gsi_cern_fec}.
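
To illustrate why a burst of four frames is robust against frame erasures, consider the simplified
case in which each of the $n=4$ frames is lost independently with probability $p$. This is
an assumption made here for illustration only: it ignores the bit-error correction of the Hamming code
as well as correlated losses, and it is not the analysis that produced Table~\ref{tab:gsi_cern_fec}.
Since any $k=2$ frames suffice, the CM is lost only if three or four frames of the burst are lost:
\begin{equation}
P_{CM} = \sum_{i=3}^{4} {4 \choose i}\, p^{i} (1-p)^{4-i} = 4p^{3}(1-p) + p^{4} = 4p^{3} - 3p^{4}.
\end{equation}
For small $p$ the loss probability of a CM therefore scales as $4p^{3}$, i.e. with the third power
of the single-frame loss probability.
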
\begin{table}[ht]
\begin{center}
\caption{GSI and CERN FEC characteristics.}
\begin{tabular}{|p{4cm}|c|c|} \hline
% \cline{2-3}
% & \multicolumn{2}{|c|}{Use Case} \\ \cline{2-3}
\rowcolor{gray!35}{}
{\bf Parameter} & {\bf GSI} & {\bf CERN} \\ \hline
\multicolumn{1}{|p{4cm}|}{Control Message length} & 500 bytes & 1500 bytes \\ \hline
\multicolumn{1}{|p{4cm}|}{Control Messages per year} & $3.145*10^{11}$ & $3.145*10^{8}$ \\ \hline
\multicolumn{1}{|p{4cm}|}{Max Bit Correct.} & 1 & 1 \\ \hline
% \multicolumn{1}{|p{4cm}|}{Parity-Check Bits} & 13 & 13 \\ \hline
% \multicolumn{1}{|p{4cm}|}{PEC Code Overhead} & 3 & 2 \\ \hline
% \multicolumn{1}{|p{4cm}|}{Payload Length} & 400 b & 800b \\ \hline
\multicolumn{1}{|p{4cm}|}{Payload Length} & \modified{294 bytes} & \modified{854 bytes} \\ \hline
\multicolumn{1}{|p{4cm}|}{Num Encoded Frames} & 4 & 4 \\ \hline
\multicolumn{1}{|p{4cm}|}{Frames Needed at Receiver} & 2 & 2 \\ \hline
\multicolumn{1}{|p{4cm}|}{Probability of Losing a CM} & $10^{-14}$ & $10^{-13}$\\ \hline
\end{tabular}
\label{tab:gsi_cern_fec}
\end{center}
\end{table}
\subsection{Rapid Spanning Tree Protocol (RSTP)}
In an Ethernet network with redundant topology, the problem of loops (causing ``broadcast storms'')
is handled by the Rapid Spanning Tree Protocol (RSTP).
% \cite{biblio:IEEE8021D}
It creates a loop-free
logical topology by blocking appropriate ports in switches, and unblocks them in case of topology
break (due to element failure).
The functionality provided by the RSTP is essential for the WRN. However, the convergence speed
provided by the standard implementation of the RSTP (milliseconds
%\cite{biblio:RSTPperf}
at best)
would cause many CMs to be lost during the process. This is not acceptable: we need
a solution which is fast enough to prevent losing any CM. Since we know the
size range of the CMs (Table~\ref{tab:requirements}) and how they are FEC-encoded into Ethernet frames,
we can calculate the maximum value of the convergence time: 3$\mu s$. This time is smaller than
the duration of transmitting a single frame with FEC-encoded CM, which ensures that no more than
two frames with FEC-encoded CM are lost and thus the CM can be recovered.
In order to achieve a convergence time of 3$\mu s$, the switch-over between active
and backup connections needs to be performed in the hardware as soon as the link-down is detected.
It can only be done if the alternative topology is known in advance. The knowledge of alternative
topology is translated into an RSTP-assignment of alternative and backup roles of switch ports,
i.e. at least one port with alternative role must be identified in every switch
(except the topology-root switch).
%\modified{, i.e at least one port with each of these roles must be identified in every switch}.
%
%If at least one port of a switch is assigned an alternative role, it means that
%the RSTP algorithm establishes more than one path to the topology-root switch and therefore
%the alternative topology is know in advance.
%Such ports are identified when the RSTP algorithm establishes more than one path to the
%topology-root switch and all paths can be used simultaneously,
%
If we ensure, by restricting the topology, that RSTP identifies the alternative links,
we can use its data to feed the hardware, consequently achieving the required convergence time
and staying standard-compatible:
the hardware switch-over is just a faster RSTP-driven convergence. The required topology
restrictions, described in \cite{biblio:robustness}, largely overlap with those imposed by
the Time Distribution.
\section{Determinism}
% The delivery latency of an Ethernet frame varies with cable length and the number of hops (switches)
% it has to traverse to reach its destination, the traffic load on the way and
% the assigned Class of Service (CoS, \cite{bilbio:vlan}).
A carefully configured and properly used WRN offers deterministic Ethernet frame delivery
thanks to the implementation of CoS and the fact that the delay introduced by the switch can be
verified by analysis of {\bf publicly available source code} \cite{biblio:whiteRabbit}.
Such analyses were performed to verify the worst-case upper bound
delivery latency of a CM against the requirements listed in Table~\ref{tab:requirements}.
The results, presented in Table~\ref{tab:CMlatency} ({\it Store-and-forward} column),
take into account the fact that a CM is encoded into 4 Ethernet frames (as required by the FEC
and described in the next Section), it is sent with the highest priority (CoS) and it always
traverses 3 hops.
\begin{table}[ht]
\caption{Control Message (CM) delivery latency estimates.}
\centering
\begin{tabular}{| c | c | c | c | c |} \hline
\rowcolor{gray!35}{}
& \multicolumn{4}{|>{\columncolor{gray!35}}c|}{\textbf{CM delivery latency}} \\ \cline{2-5}
\rowcolor{gray!35}{}
\textbf{CM size}& \multicolumn{2}{|>{\columncolor{gray!35}}c|}{\textbf{Store-and-forward}}
&\multicolumn{2}{|>{\columncolor{gray!35}}c|}{\textbf{Cut-through}} \\\cline{2-5}
\multicolumn{1}{|>{\columncolor{gray!35}}c|}{} &
\multicolumn{1}{|>{\columncolor{gray!35}}c|}{GSI} &
\multicolumn{1}{|>{\columncolor{gray!35}}c|}{CERN} &
\multicolumn{1}{|>{\columncolor{gray!35}}c|}{GSI} &
\multicolumn{1}{|>{\columncolor{gray!35}}c|}{CERN} \\ \hline
% & GSI & CERN & GSI & CERN \\ \hline
%200 bytes & ???$\mu s$ & ???$\mu s$ & ??$\mu s$ & ???$\mu s$ \\ \hline
500 bytes & 221$\mu s$ & 283$\mu s$ & 76$\mu s$ & 118$\mu s$ \\ \hline
1500 bytes & 285$\mu s$ & 325$\mu s$ & 102$\mu s$ & 142$\mu s$ \\ \hline
5000 bytes & 324$\mu s$ & 364$\mu s$ & 162$\mu s$ & 202$\mu s$ \\ \hline
\end{tabular}
\label{tab:CMlatency}
\end{table}
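
A quick plausibility check on Table~\ref{tab:CMlatency} (our own reading of the numbers, not taken
from the original analysis): assuming a propagation delay of about $5\mu s$ per km of fiber,
the difference between the maximum distances of CERN (10km) and GSI (2km) listed
in Table~\ref{tab:requirements} corresponds to
\begin{equation}
\Delta t \approx (10 - 2)\,\mathrm{km} * 5\,\mu s/\mathrm{km} = 40\,\mu s,
\end{equation}
which equals the difference between the CERN and GSI columns in the 1500-byte and 5000-byte rows,
for both forwarding methods.
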
The analysis revealed that GSI's requirements are not fulfilled: the upper-bound delivery latency
for the required size of CM and maximum distance of 2km is greater than 100$\mu s$.
The proposed solution to decrease the delivery latency targets the CD only and
takes advantage of its characteristics (broadcast within a VLAN, sent by privileged node).
We propose to break the highest priority of
the CoS into two (unicast and broadcast) and use the highest priority broadcast Ethernet traffic only for
the CD. Moreover, this particular traffic shall be forwarded using the cut-through method
(unlike the store-and-forward method used normally in the switch) which can be effectively fast
for the broadcast traffic with a single source (DM). The results,
presented in Table~\ref{tab:CMlatency} ({\it Cut-through} column), show a significant improvement.
The solution requires hardware-supported cut-through forwarding in the switch as described
in \cite{biblio:robustness}.
\section{Failure Study}
One of the main possible reasons for WRN failure, which affects both Timing and Data Distribution, is
a malfunction of its elements (switches or links). Since the distribution of information
in the WRN is of one-to-all character (Data/Timing Master to all nodes), all the elements of the WRN are
considered Single Points of Failure (SPoF)\cite{biblio:mtbf}. Malfunction of any SPoF
results in failure of the entire system.
SPoFs can be eliminated by introducing redundancy of the system components. Due to its special features
(distribution of frequency over physical layer) and strict requirements (determinism, low data loss),
the number of possible redundant topologies of the WRN is restricted, as explained in the
following sections.
Imperfections of the physical medium as well as switching between redundant elements of the network
(which takes time) can cause loss or corruption of data. The deterministic and \modified{mostly} broadcast character
of the data distribution in the WRN enforces application of the Forward Error Correction (FEC)
%\cite{biblio:coding}
-- adding redundant information on transmission to enable recovery of lost or corrupted data
on reception. This brings constant data overhead and the probability that the added redundancy is
not sufficient to recover the data. However, it is the price to pay for ensuring low latency
and determinism of data delivery in the WRN.
The delivery latency of an Ethernet frame varies with cable length and the number of hops (switches)
it has to traverse to reach its destination, the traffic load on the way and
the assigned Class of Service (CoS). Therefore, to ensure the required determinism
of the CD delivery, we need to make sure that there is no congestion of Ethernet frames
carrying CMs. Moreover, the number of hops (the latency introduced by them) needs to be
sufficiently small, which can be done by restricting the topology.
The resilience of the Clock Distribution translates into continuous and stable
synchronization of all the nodes and switches in the WRN (Table~\ref{tab:requirements}). Although
network redundancy eliminates SPoFs, the switch-over between redundant elements might introduce
instability and render the network unreliable despite the costly redundancy.
Therefore, a seamless switch-over between redundant clock paths needs to be ensured.
Another reason for the deterioration of the synchronization
accuracy is the variation of external conditions (e.g. temperature) which needs to be compensated.
% In terms of the Data Distribution reliability, the topology redundancy can turn out to be
% useless, if the switch-over between redundant elements causes more data to be lost then the
% capabilities of FEC scheme.
% {\it [add here, change the rest]}
% In summary, we need investigate how to :
% \begin{Itemize}
% \item eliminate/decrease data loss due to :
% \begin{Itemize}
% \item physical medium imperfection,
% \item switch over between redundant elements,
% \item traffic congestion,
% \end{Itemize}
% \item eliminate synchronization instability due to:
% \begin{Itemize}
% \item switch over between redundant data paths,
% \item external condition variations,
% \item Ethernet frame loss (PTP),
% \end{Itemize}
% \item ensure required upper-bound delivery latency of Control Data.
% \end{Itemize}
all : WhiteRabbit.pdf
.PHONY : all clean
WhiteRabbit.pdf : WhiteRabbit.tex
	latex $^
	bibtex WhiteRabbit
	latex $^
	latex $^
	dvips -j0 WhiteRabbit
	ps2pdf -dPDFX -dEmbedAllFonts=true -dSubsetFonts=true -dEPSCrop=true WhiteRabbit.ps
clean :
	rm -f *.eps *.pdf *.dat *.log *.out *.aux *.dvi *.ps *~ *.bbl *.blg
\section{Overall Reliability}
The final equation of the WRN reliability is a sum of the data and clock distribution reliabilities.
The clock distribution is assumed to be sufficiently accurate as long as there is a connection
between the TM and all the nodes. The same applies to the CD distribution:
as long as there is a valid connection, the FEC makes sure that the data is delivered with
a sufficient reliability and the latency calculations prove it to be deterministic while the
congestion is prevented by CoS and limited number of data sources (DM). Consequently, the overall
reliability is strongly dependent on the WRN topology, which needs to be appropriate for the proposed
solutions (SyncE, H/W-supported RSTP, upper-bound latency).
For the comparison of different network topologies, we consider the reliability of a network of
switches.
%with M inputs (connected to DM/TM).
Each node is connected to such a network with M links
(each to a separate switch). The value of M reflects the level of redundancy
(M=1 for no redundancy, M=2 for double redundancy, etc).
In the calculations of the network reliability we used the concept of Mean Time Between Failures (MTBF)
and its relation with the failure probability presented in \cite{biblio:mtbf}
(a very simplified mathematical model). In order to calculate the MTBF of the entire network, we need the
MTBFs of each network component: switches and links. Since the WR switches are still under
development (no MTBF measured), we used representative values for CISCO switches
({2, 10 and 100}$*10^4$[h]). Two estimation methods were used: ``Fault Tree analysis''
\cite{biblio:faultTree} and analytic. Both provide just rough estimations of the reliability.
The former allowed us to estimate the two-terminal reliability (DM to single node)
%\cite{biblio:INF_TECH}
of simple non/double/triple-redundancy topologies ($P_f$). The most desired value is the
all-terminal network reliability ($P_{f\_Network}$), where $P_f < P_{f\_Network} < N_{nodes}*P_f$.
Table~\ref{tab:2000nodesReliability}
presents rough estimations of $P_{f\_Network}$ using analytic calculations for the three considered
topologies ($MTBF_{Switch}$=200 000[h]). However, to meet the requirement of $\approx$2000 nodes and
only three network layers (hops),
\modified{the Data Master node is connected to more separate switches than
the level of redundancy (M).}
% the topologies are of the type M-inputs/N-outputs, where
% $N \geq M$.
The estimations show that a triple redundancy topology can barely satisfy the requirements by CERN
(Table~\ref{tab:requirements}).
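
The bounds on $P_{f\_Network}$ quoted above can be motivated as follows, under the simplifying
assumption (ours) that each of the $N_{nodes}$ nodes has the same two-terminal failure
probability $P_f$. Let $A_i$ denote the event that node $i$ is disconnected from the DM; the symbol
is introduced here for illustration. The network fails if any $A_i$ occurs, so
\begin{equation}
P_f = P(A_1) \le P\Big(\bigcup_{i=1}^{N_{nodes}} A_i\Big) \le \sum_{i=1}^{N_{nodes}} P(A_i) = N_{nodes}*P_f .
\end{equation}
The upper bound would be reached only if the events were mutually exclusive and the lower bound only
if they always occurred together. In general neither holds for a switched topology, in which
nodes share some switches but not all, which gives the strict inequalities.
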
% \begin{figure}[t]
% \centering
% \includegraphics[width=3.4in]{fig/threeTopologies.ps}
% \caption{Examples of topologies with different level of redundancy.}
% \label{fig:threeTopology}
% \end{figure}
\begin{table}[ht]
%\caption{Different topologies ($\approx 2000$ nodes).}
\caption{Reliability of WRN topologies.}
\centering
%\rowcolors {0}{gray!35}{}
\begin{tabular}{| c | c | c | c |} \hline
%{\bf Redundancy}& \textbf{Switches} & \multicolumn{2}{| c |}{\textbf{$MTBF_{Switch}$= 20 000[h] }} \\
% & & $P_f$ & MTBF[h] \\ \hline
\rowcolor{gray!35}{}
{\bf Redundancy}& \textbf{Switches} & $P_f$ & MTBF[h] \\ \hline
No & 127 & $ 2.08*10^{-3}$ & $ 5.77*10^{3}$ \\ \hline
Double & 292 & $ 4.71*10^{-7}$ & $ 2.55*10^{7}$ \\ \hline
Triple & 495 & $ 3.06*10^{-11}$ & $ 4.08*10^{11}$ \\ \hline
\end{tabular}
\label{tab:2000nodesReliability}
\end{table}
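
As a consistency check on Table~\ref{tab:2000nodesReliability} (an observation from the listed
numbers, not a statement of the model of \cite{biblio:mtbf}), the product of the two result columns
lies between 12.0 and 12.5 hours in every row, e.g. for the topology without redundancy
\begin{equation}
P_f * MTBF = (2.08*10^{-3}) * (5.77*10^{3}\,\mathrm{h}) \approx 12.0\,\mathrm{h}.
\end{equation}
This would be consistent with $P_f$ being the probability of a failure within one day,
approximated as $P_f \approx 24\,\mathrm{h}/(2*MTBF)$. The one-day time window is our assumption
and should be checked against \cite{biblio:robustness}.
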
\section{Definition of reliability in a WRN}
A WRN, consisting of White Rabbit Switches (switches) connected by fiber
or copper, is meant to transport information among White Rabbit Nodes (nodes). We distinguish
two types of information distributed over the WRN:
%(1) {\bf Timing} (frequency and International Atomic Time) and
(1) {\it Timing} (frequency and Coordinated Universal Time) and
(2) {\it Data} (the Ethernet traffic).
This translates into two types of services provided by the WRN which have their own requirements and
can be handled separately. The requirements are defined by GSI and CERN as the prospective
users of WR to control their accelerators.
\subsection{Timing Distribution}
Timing is distributed in the WRN from a switch/node called Timing Master (TM)
to all the other nodes/switches in the network.
% The TM is usually connected
% to the external source, such as Global Positioning System (GPS) receiver.
All the devices in the
WRN lock their frequency (syntonize) and adjust their local clocks (synchronize) to that of the TM.
The deviation between the clock of the TM and that of any other node/switch is called {\bf accuracy}.
A stable and continuous synchronization of all the nodes with an appropriate accuracy is the key
requirement of the Timing Distribution in the WRN.
\subsection{Data Distribution}
The critical data distributed over the WRN is the one carrying sets of commands (events) which are
organized into Control Messages (CM). The CMs are sent by a privileged node (Data Master, DM) in the
payload of the Ethernet frame(s). Therefore, the Data Distribution in the WRN is broken into
(1) {\it Control Data (CD)} -- the Ethernet frames carrying CMs, critical, and
(2) {\it Standard Data (SD)} -- the Ethernet frames which do not carry CMs, non-critical.
The reliability of the WRN depends on the successful delivery of the CD to all
the designated nodes. The CMs are always broadcast within a VLAN
% \cite{bilbio:vlan}
, which can span
the entire network. The worst-case upper bound of their delivery latency from the DM to any node in
the network, regardless of its location ({\bf maximum distance from the DM}), is required to be
guaranteed by the network -- this is {\bf a determinism} requirement.
\subsection{Reliability of the WRN}
The reliability of the WRN relies on the {\bf deterministic} delivery of the CD
to all the designated nodes and their sufficiently {\bf accurate and stable synchronization}.
This means that the WRN is considered non-functional if one or more of the following occur:
\begin{itemize}
\item A node is synchronized with insufficient accuracy.
\item A designated node receives corrupted CD or no CD.
\item The upper-bound delivery latency has been exceeded.
\end{itemize}
% (1) A node is synchronized with insufficient accuracy;
% (2) A designated node receives corrupted CD or no CD;
% (3) The upper-bound delivery latency has been exceeded.
Unreliability is translated into the number of CMs considered lost (not delivered, delivered
corrupted or in a non-deterministic way) in a given period of time. During this time,
the synchronization must be always of the required quality.
Quantitative requirements of the accelerator facilities are listed in Table~\ref{tab:requirements}.
\begin{table}[ht]
\caption{GSI's and CERN's requirements summary.}
\centering
\begin{tabular}{| l | c | c |} \hline
%\textbf{Requirement}& \multicolumn{2}{|c|}{\textbf{Value(s)}} \\
\rowcolor{gray!35}{}
\textbf{Requirement} & {\bf GSI} & {\bf CERN} \\ \hline
Max latency & 100$\mu s$ & 1000$\mu s$ \\ \hline
CM failure rate & $3.17*10^{-12}$ & $3.17*10^{-11}$ \\ \hline
CMs lost per year & 1 & 1 \\ \hline
$d_{max}$ from DM & 2km & 10km \\ \hline
CM size & 200-500 bytes & 1200-5000 bytes \\ \hline
Accuracy & probably 8ns & 1$\mu s$ to $\sim$2ns \\
%accuracy & & few nodes ~2ns \\
\hline
\end{tabular}
\label{tab:requirements}
\end{table}
%\documentclass{JAC2003} % A4
%\documentclass[acus]{JAC2003} % US
\documentclass[reprint, superscriptaddress,aps,prstab]{revtex4-1}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{color}
\usepackage{multirow}
%\usepackage{multicol}
\usepackage[table]{xcolor}
\usepackage{colortbl}
\usepackage{array}
%\setlength{\titleblockheight}{27mm}
\hyphenation{op-tical net-works semi-conduc-tor}
%\newcommand \modified[1]{{\textcolor{red}{#1}}}
\newcommand \modified[1]{{\textcolor{black}{#1}}}
\begin{document}
\title{RELIABILITY IN A WHITE RABBIT NETWORK}
\input{authors}
\input{abstract}
\maketitle
\input{introduction}
\input{ReliabilityDefinition}
\input{FailureStudy}
\input{Determinism}
\input{ControlDataDistribution}
\input{ClockDistribution}
\input{OverallReliability}
\input{conclusion}
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,./biblio}
\end{document}
\begin{abstract}
White Rabbit (WR) is a time-deterministic, low-latency Ethernet-based network which enables
transparent, sub-ns accuracy timing distribution. It is being developed to replace
the General Machine Timing (GMT)
%\cite{biblio:GMT}
system currently used at CERN and will become
the foundation for the control system of the Facility for Antiproton and Ion Research (FAIR)
at GSI. High reliability is an important issue in WR's design,
since unavailability of the accelerator's
control system will directly translate into expensive downtime of the machine.
A typical WR network is required to lose not more than a single message per year.
Due to WR's complexity, the translation of this real-world-requirement into
a reliability-requirement constitutes an interesting issue on its own -- a WR network
is considered functional only if it provides all its services to all its clients at any time.
This paper defines reliability in WR and describes how it was addressed by dividing it into
sub-domains: deterministic packet delivery, data
%redundancy,
resilience,
topology redundancy and clock
resilience. The studies show that the Mean Time Between Failures (MTBF) of the WR Network
is the main factor affecting its reliability. Therefore, probability calculations for
different topologies were performed using the ``Fault Tree analysis'' and analytic estimations.
Results of the study show that the requirements of WR are demanding. Design changes might be needed
and further in-depth studies required, e.g. Monte Carlo simulations. Therefore, a direction
for further investigations is proposed.
\end{abstract}
%\author
%{
% Maciej Lipi\'{n}ski, Javier Serrano, Tomasz W\l{}ostowski, CERN, Geneva, Switzerland\\
% Cesar Prados, GSI, Darmstadt, Germany
%}
\author{C. Prados}
\affiliation{GSI Helmholtz Centre for Heavy Ion Research, Darmstadt, Germany}
\author{M. Lipi\'{n}ski}
\affiliation{CERN, Geneva, Switzerland}
\author{J. Serrano}
\affiliation{CERN, Geneva, Switzerland}
\author{T. W\l{}ostowski}
\affiliation{CERN, Geneva, Switzerland}
@standard{biblio:IEEE8021D,
title = "IEEE Standard for Local and metropolitan area networks
Media Access Control (MAC) Bridges",
organization = "IEEE",
address = "New York",
number = "802.1D",
year = "2004",
}
@standard{biblio:IEEE1588,
title = "IEEE Standard for a Precision
Clock Synchronization Protocol for Networked Measurement and Control Systems",
organization = "IEEE",
address = "New York",
number = "1588-2008",
year = "2008",
}
@standard{biblio:IEEE8023,
title = "IEEE Standard for
Information Technology--Telecommunications and Information Exchange Between
Systems--Local and Metropolitan Area Networks--Specific Requirements Part 3:
Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method
and Physical Layer Specifications - Section Three",
year = "2008",
organization = "IEEE",
address = "New York",
number = "802.3-2008",
}
@standard{bilbio:vlan,
title = "{IEEE Standard for Local and metropolitan area networks
Virtual Bridged Local Area Networks}",
year = "2005",
organization = "IEEE",
address = "New York",
number = "802.1Q-2005"
}
@standard{biblio:SynchE,
title = "Timing characteristics of a synchronous Ethernet equipment slave clock {(EEC)}",
year = "2007",
number = "G.8262",
organization = "ITU-T",
}
@inproceedings{biblio:ISPCS2011,
author = "M.Lipinski and T.Wlostowski and J.Serrano and P.Alvarez",
title = "White Rabbit: a {PTP} Application for robust sub-nanosecond synchronization",
booktitle = "Proceedings of ISPCS2011",
address = "Munich, Germany",
year = "2011",
}
@inproceedings{biblio:GMT,
author = "J.Serrano and P.Alvarez and D.Dominguez and J.Lewis",
title = "Nanosecond Level {UTC} Timing Generation and Stamping in {CERN}'s {LHC}",
booktitle = "Proceedings of ICALEPCS2003",
address = "Gyeongju, Korea",
year = "2003",
}
@techreport{biblio:FAIRtimingSystem,
author = "T. Fleck and C. Prados and S. Rauch and M. Kreider",
title = "{FAIR} Timing System",
institution = "GSI",
address = "Darmstadt, Germany",
year = "2009",
note = "v1.2",
}
@inproceedings{biblio:distOscilloscope,
author = "S. Deghaye and D. Jacquet and I. Kozsar and J. Serrano",
title = "{OASIS}: A NEW SYSTEM TO ACQUIRE AND DISPLAY THE ANALOG SIGNALS FOR {LHC}",
booktitle = "Proceedings of ICALEPCS2003",
address = "Gyeongju, Korea",
year = "2003",
}
@inproceedings{biblio:PAC11,
author = "J.Serrano and P.Alvarez and M.Lipinski and T.Wlostowski",
title = "Accelerator Timing Systems Overview",
booktitle = "Proceedings of PAC11",
address = "New York, USA",
year = "2011",
}
@Inproceedings{biblio:WRproject,
author = "J. Serrano and P. Alvarez and M. Cattin and E. G. Cota and others",
title = "{The White Rabbit Project}",
booktitle = "ICALEPCS",
address = "Kobe, Japan",
year = "2009",
}
@Misc{biblio:WRPTP,
author = "E.G. Cota and M. Lipinski and T. Wlostowski and E.V.D. Bij and J. Serrano",
title = "{White Rabbit Specification: Draft for Comments}",
note = "v2.0",
month = "july",
year = "2011",
howpublished = {\url{http://www.ohwr.org/documents/21}}
}
@Misc{biblio:CERNwrControlAndTiming,
author = "J-C.Bau and M.Lipinski",
title = "{White Rabbit CERN Control and Timing Network}",
month = "July",
year = "2011",
howpublished = {\url{http://www.ohwr.org/documents/85}}
}
@Misc{biblio:robustness,
author = "C.Prados and M.Lipinski",
title = "{White Rabbit and Robustness}",
month = "March",
year = "2011",
howpublished = {\url{http://www.ohwr.org/documents/103}}
}
@mastersthesis{biblio:TomekMSc,
author = "T.Wlostowski",
title = "Precise time and frequency transfer in a {White} {Rabbit} network",
month = "may",
year = "2011",
school = "Warsaw University of Technology",
howpublished = {\url{http://www.ohwr.org/documents/80}}
}
@Inproceedings{biblio:Takahide,
author = "Takahide Murakami and Yukio Horiuchi",
title = "{A Master Redundancy Technique in IEEE 1588 Synchronization with a Link Congestion
Estimation}",
booktitle = "Proceedings of ISPCS",
year = "2010",
}
@electronic{biblio:whiteRabbit,
title = "{White Rabbit}",
howpublished = {\url{http://www.ohwr.org/projects/white-rabbit}}
}
@article{biblio:ohl,
author = "M.Giampietro",
title = "Hardware joins the open movement",
journal = "CERN Courier",
address = "CERN, Geneva",
year = "2011",
howpublished = {\url{http://cerncourier.com/cws/article/cern/46054}},
}
@article{biblio:RSTPperf,
author = "R. Pallos and J. Farkas and I. Moldov{\'a}n and C. Lukovszki",
title = "Performance of Rapid Spanning Tree Protocol in Access and Metro Networks",
journal = "2nd International ICST Conference on Access Networks",
year = "2007",
}
@article{biblio:r-s,
author = "I.S.Reed and G.Solomon",
title = "{Polynomial Codes Over Certain Finite Fields}",
journal = "SIAM Journal of Applied Math",
address = "USA",
year = "1960",
}
@book{biblio:mtbf,
author = "K.Dooley",
title = "Designing Large-Scale LANs",
publisher = "O'Reilly",
year = "2002",
}
@book{biblio:coding,
author = "S.Lin and D.J.Costello",
title = "Error Control Coding",
publisher = "Pearson Prentice Hall",
year = "2004",
}
@book{biblio:INF_TECH,
author = "D.J.C. MacKay",
title = "Information Theory, Inference, and Learning Algorithms",
publisher = "Cambridge University Press",
year = "2005",
}
@misc{biblio:faultTree,
title = "Reliability Workbench, Fault Tree",
publisher = "Isograph",
howpublished = {\url{www.isograph.com}},
}
\section{Conclusions}
A WRN must be considered as an ordinary Ethernet network with extra optional built-in features
which, when properly used, can make it robust and more reliable. This, however, comes at the price
of topology restrictions and costly redundant elements. The reliability study described in this
article and detailed in \cite{biblio:robustness} presents areas which need to be addressed to
increase the reliability of a WRN. The development of WR is an on-going effort and some of the
suggested solutions have been already properly investigated or developed (FEC, clock distribution)
while the others need further verification (RSTP, cut-through forwarding).
The suggested solutions make it possible to fulfill the requirements set by CERN and GSI.
However, the costs might trigger double-checking and re-justifying of at least two of them:
the upper-bound latency required by GSI and the number of CMs lost per year.
The former requires additional development efforts to achieve the required 100$\mu s$.
The latter requires a high level of network redundancy (triple or more) which is very costly.
Since the network topology and its reliability calculations turned out to be the dominant factor in
the overall system reliability, it is necessary to perform more precise calculations and
simulations to verify the rough estimations. This might include different techniques (e.g. Monte Carlo simulations)
but also more real-life use cases (e.g. the network layout suggested in
\cite{biblio:CERNwrControlAndTiming}, which was not available at the time of the described study).
\modified{In particular, the calculations need to take into account the fact that
not all the nodes connected to the WRN are equally critical in real-life applications.}
\section{Introduction}
The WR project is a multi-laboratory,
multi-company, international effort to create a universal fieldbus for control and timing systems
to be used at CERN, GSI and possibly other such facilities. The rationale behind WR,
the choice of the technologies and technical details of its functioning have been already
described in a number of papers \cite{biblio:WRproject}, \cite{biblio:TomekMSc},
\cite{biblio:WRPTP}.
%, \cite{biblio:ISPCS2011}.
Resilience and robustness are among the key features of any fieldbus.
This article presents a study on the reliability of a White Rabbit Network (WRN)
assuming a basic knowledge about WR.
Reliability is defined as the ability of a system to provide its services to clients under both
routine and abnormal circumstances. It can be estimated by calculating the probability of
the system's failure ($P_f$).
% \begin{equation}
% \label{eq:reliability}
% R =1 - P_f
% \end{equation}
The lower the probability of WRN failure, the higher its reliability. Thus, in this article we
identify critical services of a WRN based on the study of WR's requirements.
Then, we analyze each critical service to identify possible
reasons for their failure and propose targeted counter-measures to increase reliability.
Finally, their impact on the overall system reliability is studied to
identify the highest contributor and the focus for further studies.