Commit 0f08e51e authored by Alessandro Rubini

doc: update radiusvlan, describing the 'multiple servers' implementation

Signed-off-by: Alessandro Rubini <rubini@gnudd.com>
parent 8684bfa2
......@@ -120,8 +120,8 @@ of them has effects on the firmware build, they are only used at runtime.
@item RVLAN_RADIUS_SERVERS
A string listing the IP addresses of a set of Radius servers.
Currently, only the first address (or the only one) is used.
A comma-separated list of the names or IP addresses of a set
of Radius servers. See @ref{Multiple Servers}.
@item CONFIG_RVLAN_RADIUS_SECRET
......@@ -207,14 +207,23 @@ on @i{select()}).
itself, it is ignored. Any address configured in the switch is
considered as @i{self} (i.e., also the @i{eth0} mac address, used
by @i{ppsi} as sender address, is ignored). When a foreign
frame is received, the tool runs @i{radclient}, feeding data
to its @i{stdin} and collecting its @i{stdout}. The @i{pid}
frame is received, the tool saves the MAC address of the peer,
it closes the sniffing socket and it moves to the next state.
@item RVLAN_RADCLIENT
The tool selects a radius server (see @ref{Multiple Servers}),
and runs @i{radclient}, feeding data
to its @i{stdin} and collecting its @i{stdout+stderr}. The @i{pid}
of the child process is retained for later cleanup.
@item RVLAN_AUTH
@i{radclient} returned some data. This state collects it until EOF.
When the reply is complete, the tool looks for ``@t{Framed-User}''
If @i{radclient} reports an error in communication, the server
is marked as ``recently faulty'' and @i{radiusvlan} moves back
to @t{RVLAN_RADCLIENT}, where a different server will be selected.
If the server replied, the tool looks for ``@t{Framed-User}''
and ``@t{Tunnel-Private-Group-Id}''. If both exist, authentication
succeeded. The @i{chosen_vlan} is either the one returned by the
Radius Server or the one set forth in dot-config, according to
......@@ -240,6 +249,105 @@ on @i{select()}).
@end table
@c ##########################################################################
@node Multiple Servers
@chapter Multiple Servers
To support multiple Radius servers, @i{radiusvlan} creates an internal
list of possible servers (by splitting the comma-separated list it gets from
dot-config).
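
The splitting step can be illustrated with a minimal Python sketch. It is not the code of @i{radiusvlan}: the function name @t{split_servers} is hypothetical, and dropping empty fields is an assumption made for the example.

```python
def split_servers(config_value):
    """Split a comma-separated dot-config string into a list of server names."""
    # Assumption for this sketch: surrounding blanks are trimmed and empty
    # fields (for example after a trailing comma) are silently dropped.
    fields = [field.strip() for field in config_value.split(",")]
    return [field for field in fields if field]


# The example string used later in this chapter.
servers = split_servers("tornado,gsi.de,192.168.16.201,192.168.16.200")
```

The order of the resulting list matters, because the first entry is the one queried when no failure has been recorded yet.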
When a port needs to call @i{radclient}, it asks for the ``best'' server.
At the beginning, this is the first server listed in the dot-config line.
Then, whenever @i{radclient} returns, if it exited in error @b{and}
the reply begins with ``@t{radclient:}'', then this is considered a
communication error (which is different from an authorization denial).
Please note that @i{radiusvlan} merges @i{stdout} and @i{stderr} to
the same file descriptor, due to developer laziness.
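
As an illustration of what merging the two streams means, here is a small Python sketch that runs a child process with @i{stderr} redirected onto the @i{stdout} pipe. The helper name @t{run_merged} is hypothetical, and the child used below is a stand-in script with a made-up message, not the real @i{radclient}.

```python
import subprocess
import sys


def run_merged(argv, stdin_data):
    """Run a child, feed its stdin, and read stdout+stderr as one stream."""
    child = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # stderr shares the stdout pipe
    )
    output, _ = child.communicate(stdin_data)
    return child.returncode, output


# Stand-in child: writes a made-up message to stderr and exits with code 1.
fake_child = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('radclient: simulated failure'); sys.exit(1)",
]
status, reply = run_merged(fake_child, b"")
```

With a single merged stream the caller cannot tell which descriptor a line came from, which is why the tool has to look at the text of the reply.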
Communication errors can happen because the server name cannot
be resolved by @sc{dns}, because the shared secret is wrong, or because
the radius server is not running at the selected address.
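
The rule that separates a communication error from an authorization denial can be written as a one-line test. The sketch below assumes the reply has already been collected as text; the function name and the sample strings are invented for the example and are not real @i{radclient} output.

```python
def is_communication_error(exit_status, reply):
    """Apply the rule from the text: error exit AND reply starts with 'radclient:'."""
    return exit_status != 0 and reply.startswith("radclient:")


# Placeholder replies, made up for this sketch.
comm_failure = is_communication_error(1, "radclient: placeholder error text")
denial = is_communication_error(1, "placeholder reply from the server")
success = is_communication_error(0, "placeholder reply from the server")
```

Only the first case marks the server as ``recently faulty''; the other two are treated as real answers from the server.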
When such an error happens, the server name is marked as ``recently
faulty'', and another (or the same) server is selected, to resend the
same query. The selection process returns the radius server whose
failure is oldest, among the list of known servers; this ensures that
if two servers are out of order at different times, @i{radiusvlan} sticks
to the one that is currently working, but can get back to the other server
when needed.
Thus, in a dynamic environment where ports feature ``link up'' and
``link down'' over time, we always query the right server, even if some
server in the list is off-service. However, when several ports go up
at the same time (e.g. at power on time), we may query the wrong server
for all ports, because no failure is yet known to the system when we
see the ``link up'' events.
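
The selection policy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation: the class name @t{ServerList}, its methods, and the use of 0 as the ``never failed'' timestamp are all hypothetical.

```python
class ServerList:
    """Keep, for each server, the time of its most recent failure."""

    def __init__(self, names):
        self.names = list(names)
        # Assumption: 0 means "never failed", older than any real timestamp.
        self.last_failure = dict.fromkeys(self.names, 0)

    def best(self):
        # Pick the server whose failure is oldest; on a tie, min() returns
        # the first candidate, so the dot-config order is preserved.
        return min(self.names, key=lambda name: self.last_failure[name])

    def mark_faulty(self, name, now):
        self.last_failure[name] = now


# Replay the scenario of this chapter: the first three servers fail in turn.
servers = ServerList(["tornado", "gsi.de", "192.168.16.201", "192.168.16.200"])
queried = []
for now in (1, 2, 3):
    name = servers.best()
    queried.append(name)
    servers.mark_faulty(name, now)
queried.append(servers.best())
```

After the three failures, every later call to @t{best()} keeps returning the fourth server until it fails too; at that point the server with the oldest recorded failure (the first one in this scenario) is tried again.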
What follows is an example, in verbose mode, using the string
@t{tornado,gsi.de,192.168.16.201,192.168.16.200} as
@t{CONFIG_RVLAN_RADIUS_SERVERS}. Here, ``@t{tornado}'' is a host name
but it can't be resolved by @sc{dns} (fast error reporting);
``@t{gsi.de}'' is not responding (slow error); host 201 does not
respond either (slow error); host 200 replies with an authorization error
-- i.e. it does ``@i{exit 1}'', but for a different reason.
The log is augmented with timestamps (minutes:seconds), so we see
that the timeout of @i{radtest} is 16 seconds.
@smallexample
13:49: Pmask = 0xffffffff
13:49: Radius server: "tornado"
13:49: Radius server: "gsi.de"
13:49: Radius server: "192.168.16.201"
13:49: Radius server: "192.168.16.200"
13:49: Interface "wri1": not access mode
13:49: Check wri2: down
[...]
13:50: FSM: wri3: justup -> sniff
13:51: recvfrom(wri3): 0800-90e2ba456c6b
13:51: dev wri3 queries server tornado
13:51: FSM: wri3: sniff -> auth
13:52: dev wri3, got 63 bytes so far
13:52: wri3: reaped radclient: 0x00000100
13:52: wri3: server failed
13:52: FSM: wri3: auth -> radclient
13:52: dev wri3 queries server gsi.de
13:52: FSM: wri3: radclient -> auth
14:08: dev wri3, got 55 bytes so far
14:08: wri3: reaped radclient: 0x00000100
14:08: wri3: server failed
14:08: FSM: wri3: auth -> radclient
14:09: dev wri3 queries server 192.168.16.201
14:09: FSM: wri3: radclient -> auth
14:25: dev wri3, got 55 bytes so far
14:25: wri3: reaped radclient: 0x00000100
14:25: wri3: server failed
14:25: FSM: wri3: auth -> radclient
14:25: dev wri3 queries server 192.168.16.200
14:25: FSM: wri3: radclient -> auth
14:27: dev wri3, got 46 bytes so far
14:27: wri3: reaped radclient: 0x00000100
14:27: dev wri3: vlan 4094
14:27: FSM: wri3: auth -> config
14:28: FSM: wri3: config -> configured
@end smallexample
After the above events, if we plug @i{wri2}, only the right server is
queried:
@smallexample
14:34: FSM: wri2: down -> sniff
14:34: recvfrom(wri2): 0026-0008546f9863
14:34: dev wri2 queries server 192.168.16.200
14:34: FSM: wri2: sniff -> auth
14:36: dev wri2, got 46 bytes so far
14:36: wri2: reaped radclient: 0x00000100
14:36: dev wri2: vlan 4094
14:36: FSM: wri2: auth -> config
14:36: FSM: wri2: config -> configured
@end smallexample
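
In both logs every ``reaped radclient'' line prints @t{0x00000100}. Assuming this number is the raw status word returned by @i{wait()} on Linux (an assumption, the text does not say so), it decodes to exit code 1 both for communication errors and for the final reply. The following Python sketch shows the decoding; the function name is hypothetical.

```python
import os


def describe_wait_status(status):
    """Decode a raw POSIX wait status into a short description."""
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "other status 0x%08x" % status


# The value printed in the logs above.
logged = describe_wait_status(0x00000100)
```

Since the decoded exit code is the same in all cases, the status alone cannot distinguish a failing server from a server that replied.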
@c ##########################################################################
@node Robustness
@chapter Robustness
......@@ -254,12 +362,17 @@ reading replies from @i{radclient} turn the FSM to @i{GODOWN}, so
the procedure is started again -- because the port is up.
If the Radius server is not reachable @i{radclient} will time out,
so the state machine is not stuck.
as shown, so the state machine is not stuck.
The startup script uses @t{CONFIG_WRS_LOG_OTHER} as a destination for
its own output, and it is verified to work with my local @i{rsyslog}
server.
The only weak point is in understanding @i{radclient}'s replies.
A sane tool would @i{exit(1)} or @i{exit(2)} to mean different things,
but @i{radclient} always does @i{exit(1)}, so we are forced to rely
on the output strings, which might change from one version to the next.
@c ##########################################################################
@node Diagnostic Tools
@chapter Diagnostic Tools
......@@ -417,10 +530,6 @@ A few, unfortunately
@itemize @bullet
@item Only one Radius server is queried. The tools should use all
the servers found in @t{RVLAN_RADIUS_SERVERS}, moving to the next one
when a server is not replying (like in the @i{wri3} example above).
@item It is not expected that MAC addresses change. Both identification
of self frames and blessing of peers (for authorization) have an
ever-lasting effect. Clearly, if you change client in a port, the
......