Fault isolation and recovery
RFC 816

Document Type RFC - Historic (July 1982; No errata)
Obsoleted by RFC 7805
Last updated 2016-04-08
Stream Legacy
RFC:  816

                      FAULT ISOLATION AND RECOVERY

                             David D. Clark
                  MIT Laboratory for Computer Science
               Computer Systems and Communications Group
                               July, 1982

     1.  Introduction

     Occasionally, a network or a gateway will go down, and the sequence

of  hops  which the packet takes from source to destination must change.

Fault isolation is that action which  hosts  and  gateways  collectively

take  to  determine  that  something  is  wrong;  fault  recovery is the

identification and selection of an alternative route which will serve to

reconnect the source to the destination.  In fact, the gateways  perform

most  of  the  functions  of  fault  isolation and recovery.  There are,

however, a few actions which hosts must take if they wish to  provide  a

reasonable  level  of  service.   This document describes the portion of

fault isolation and recovery which is the responsibility of the host.

     2.  What Gateways Do

     Gateways collectively implement an algorithm which  identifies  the

best  route  between  all pairs of networks.  They do this by exchanging

packets  which  contain  each  gateway's  latest   opinion   about   the

operational status of its neighbor networks and gateways.  Assuming that

this  algorithm is operating properly, one can expect the gateways to go

through a period of confusion immediately after some network or  gateway

has  failed,  but  one  can assume that once a period of negotiation has

passed, the gateways are equipped with a consistent and correct model of

the connectivity of the internet.  At present this period of negotiation

may actually take several minutes, and many TCP implementations time out

within that period, but it is a design goal of  the  eventual  algorithm

that  the  gateway  should  be  able to reconstruct the topology quickly

enough that a TCP connection should be able to survive a failure of  the

route.

     3.  Host Algorithm for Fault Recovery

     Since  the gateways always attempt to have a consistent and correct

model of the internetwork topology, the host strategy for fault recovery

is very simple.  Whenever the host feels that  something  is  wrong,  it

asks the gateway for advice, and, assuming the advice is forthcoming, it

believes  the  advice  completely.  The advice will be wrong only during

the transient  period  of  negotiation,  which  immediately  follows  an

outage, but will otherwise be reliably correct.

     In  fact,  it  is  never  necessary  for a host to explicitly ask a

gateway for advice, because the gateway will provide it as  appropriate.

When  a  host  sends  a datagram to some distant net, the host should be

prepared to receive back either  of  two  advisory  messages  which  the

gateway  may  send.    The  ICMP  "redirect"  message indicates that the

gateway to which the host sent the  datagram  is  no  longer  the  best

gateway  to  reach the net in question.  The gateway will have forwarded

the datagram, but the host should revise its routing  table  to  have  a

different  immediate  address  for  this  net.    The  ICMP "destination

unreachable"  message  indicates  that  as  a result of an outage, it is

currently impossible to reach the addressed net or host in  any  manner.

On  receipt  of  this  message, a host can either abandon the connection

immediately without any further retransmission, or resend slowly to  see

if the fault is corrected in reasonable time.
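
     The handling of these two advisory messages can be sketched as
follows.  This is a minimal illustration, not part of the original
specification: the names HostRoutes, Advice, gateway_for and on_advice
are hypothetical, gateways and nets are represented by plain strings,
and parsing of the actual ICMP messages is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Advice(Enum):
    """The two advisory ICMP messages a host should be prepared to receive."""
    REDIRECT = "redirect"
    DEST_UNREACHABLE = "destination unreachable"


@dataclass
class HostRoutes:
    """Hypothetical host table mapping a destination net to a first-hop gateway."""
    default_gateway: str
    first_hop: dict = field(default_factory=dict)
    unreachable: set = field(default_factory=set)

    def gateway_for(self, net):
        # Use the learned first hop if one exists, else the default gateway.
        return self.first_hop.get(net, self.default_gateway)

    def on_advice(self, kind, net, better_gateway=None):
        if kind is Advice.REDIRECT:
            # The gateway has already forwarded the datagram, so nothing is
            # resent; only the table entry for this net is revised.
            if better_gateway is None:
                raise ValueError("a redirect must name a better gateway")
            self.first_hop[net] = better_gateway
        elif kind is Advice.DEST_UNREACHABLE:
            # The caller then chooses: abandon the connection at once, or
            # keep resending slowly in case the fault is corrected.
            self.unreachable.add(net)
        else:
            raise ValueError("unknown advice: %r" % (kind,))


routes = HostRoutes(default_gateway="gw-a")
routes.on_advice(Advice.REDIRECT, "net-9", better_gateway="gw-b")
routes.on_advice(Advice.DEST_UNREACHABLE, "net-7")
```

     After these two calls, datagrams for "net-9" go to "gw-b", while
all other nets still use "gw-a"; "net-7" is merely recorded as
unreachable, and the choice between abandoning and resending slowly
stays with the connection that received the message.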

     If  a  host  could assume that these two ICMP messages would always

arrive when something was amiss in the network, then no other action  on

the part of the host would be required in order to maintain its tables in

an optimal condition.  Unfortunately, there are two circumstances  under

which  the  messages  will  not  arrive  properly.    First,  during the

transient following a failure, error messages may  arrive  that  do  not

correctly  represent  the  state of the world.  Thus, hosts must take an

isolated error message with some scepticism.  (This transient period  is

discussed  more  fully  below.)    Second,  if the host has been sending

datagrams to a particular gateway, and that gateway itself crashes, then

all the other gateways in the internet will  reconstruct  the  topology,

but  the  gateway  in  question will still be down, and therefore cannot

provide any advice back to the host.  As long as the host  continues  to

direct  datagrams at this dead gateway, the datagrams will simply vanish

off the face of the earth, and nothing will come back in return.   Hosts

must detect this failure.
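
     One way a host might notice such silence is sketched below.  The
text above says only that the failure must be detected; the timer
approach, the GatewayWatch class, its method names and the 30 second
limit are all assumptions made for illustration.  "Heard" here means
anything that came back through the gateway, whether a reply datagram
or an ICMP message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GatewayWatch:
    """Hypothetical per-gateway record used to suspect a dead first hop."""
    silence_limit: float = 30.0               # assumed value, in seconds
    first_unanswered: Optional[float] = None  # time of oldest unanswered send

    def note_sent(self, now):
        # Start the clock at the first datagram sent since we last heard back.
        if self.first_unanswered is None:
            self.first_unanswered = now

    def note_heard(self, now):
        # Any traffic returned via this gateway shows that it is alive.
        self.first_unanswered = None

    def suspect_dead(self, now):
        # Suspect the gateway only if we have been sending and nothing
        # at all has come back for at least silence_limit seconds.
        if self.first_unanswered is None:
            return False
        return now - self.first_unanswered >= self.silence_limit


watch = GatewayWatch(silence_limit=30.0)
watch.note_sent(0.0)
watch.note_sent(10.0)
still_trusted = not watch.suspect_dead(20.0)   # only 20 seconds of silence
now_suspect = watch.suspect_dead(30.0)         # limit reached
```

     A host that is not sending through a gateway learns nothing
about it from this record, so the sketch reports suspicion only while
datagrams are outstanding.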

     If some gateway many hops away fails, this is not of concern to the

host, for then the discovery of the failure is the responsibility of the

immediate  neighbor gateways, which will perform this action in a manner