Portions derived from

G22.3023 Special Topics of Industrial Interest

Network Design and Implementation

Lecture Notes Spring 84

© Copyright 1984, 1991

Herbert J. Bernstein & Max Goldstein

In this section we will consider first the selection of a suitable network topology, and then the design of algorithms to provide routing and congestion control. The tools of queuing theory and graph theory provide a rigorous basis for network topology design. Unfortunately, however, the complexity of real problems is so great that we will be obliged to fall back on crude approximations and heuristics. That will also be the case in handling routing and congestion control questions.

The tools we have introduced allow us to do two conflicting things. With queuing theory we can optimize a network for minimal delay times. As we have seen, this will require us to provide servers with service rates significantly faster than the customer arrival rate to avoid saturating network capacities. Indeed, to keep the expected number of customers in the system below one, we would have to stay below half the capacity of the servers.

With graph theory, we can allocate traffic among paths to make maximal use of capacity and to move as much data as possible. Ideally we would have no reserve capacity, but that implies infinite queue lengths. This conflict between response delays and throughput optimization is real and has no simple solution. There are customers who require short response delays because they have many independent transactions, and there are customers who require maximal data rates because they are pumping very large streams of data through the system. About all one can do to provide both kinds of service is to logically divide a network into two networks, one optimized for minimal delay times by having reserve capacity, and one optimized for throughput by saturating lines.

As long as one is not guaranteeing the high-throughput service, it is possible to cheat a little and use some of the reserve capacity intended to ensure short delays as capacity available for continuous traffic, provided one can steal back that capacity at will. This is similar to the idea of running batch production jobs at low priority in the background of a time-sharing system. The danger is that, when the time comes to take back the capacity loaned to the high throughput system, it will not be economically or politically feasible to do so.

To make this inherent conflict clear, consider the plight of the telephone company. For call establishment, it must maintain a moderate delay signalling network. For actual voice traffic, it must maintain a close to zero delay voice network. To make best use of its equipment, avoiding idle capacity and attracting digital traffic, it is prudent to convert voice and signalling traffic into digital packets sharing very high bandwidth trunks. Yet, once a conversation is broken into packets, the time between packets cannot be allowed to stretch, or conversations will get out of synchronization, so some capacity, while available most of the time, must be kept idle, just in case there is a burst in the information content of a conversation.

Once one has made the basic design decisions of what characteristics one is trying to optimize and with what relative weights, it is necessary to organize the problem into manageable tasks. In large networks this organizational problem can be reduced by imposing a hierarchical structure, similar to the telephone system of local exchanges and long distance exchanges.

The idea is to consider the traffic in local areas as a separate design problem, and then treat each local area as if it were a single node on a larger network. In the telephone system, each instrument is connected to a local exchange. Traffic among subscribers connected to the same exchange is switched within that exchange. Local exchanges are then grouped into areas. Within an area, exchanges are connected by local trunks. When a subscriber in the area calls another subscriber in the same area, but within a different exchange, the call is switched onto a local trunk to the target exchange by the calling exchange. This might be identified by the first three digits dialed. The remaining, say four, digits are ignored by the calling exchange and used by the target exchange to complete the call. In order to call between areas, long distance exchanges are used, with a similar division of labor, where an area code of, say, three digits is used to start the call on its way to the foreign area, which then handles the remaining digits of the number at that end.

By adhering to such a hierarchy, we can greatly reduce the complexity of the networks to be handled. The cost is that we do not allow arbitrary one-step interconnections among terminal nodes throughout the network, introducing extra switching and forwarding delays. Thus it might seem that a hierarchical design is only suitable when traffic tends to be localized. In many networks, as in the telephone system, it is the case that much traffic is localized. However, even in networks with heavy non-local traffic, modern hardware allows the costs of efficient hierarchical structures to be low enough to make non-hierarchical schemes unattractive in almost all cases.

            *---------------*
           /|\             /|\
          / | \           / | \
         /  |  \         /  |  \
        /   |   \       /   |   \
       *    *    *     *    *    *
      /|\  /|\  /|\  /|\  /|\  /|\

In communications network design, the major division in the hierarchy is usually between the design of the local area networks and some wide area networks joining the local area networks. Some node(s) on the local area networks serve as gateways to the wide area networks.

Let us look at the local area network problem first, since many network designs never have to go beyond that point. In its simplest form, a local area network might attempt to connect a set of terminal nodes to a single exchange point or message concentrator. Wire routings would be determined by local geography, and the only questions would concern optimal placement of the exchange and grouping of wire bundles to minimize total wire costs. Consider, again, the telephone company example. For each residence with a single telephone, a pair of wires will suffice. For a block of, say, 25 houses, one might run a cable of 25 pairs to the block, and then branch out with single pair wires for the houses. If there are more houses in the next block, however, it might well be more economical to run a much larger cable past the first block.

In the fine detail of the wiring, each house is a terminal node of the network. However, in planning the major cable routings, one might ignore the inexpensive single pair cables to the houses, and work only in terms of the drop-off points from the major cables.

-------*---------*----------*--------*------
      / \       /          /        /
     T   \     T          T        T
          \
           \-----*---------*---------*------
                /         /         /
               T         T         T

becomes

-------T---------T----------T--------T------
       \
        \
         \------T---------T---------T------

As a crude first approximation to solving such a problem, it is often useful to find a minimal spanning tree from a tentative exchange location to the terminal nodes. The parameter to be minimized will vary with the application. It might be wire lengths, pole counts, repeaters, or any appropriate cost factors.

Algorithms to find spanning trees of minimal total cost have much the flavor of Dijkstra's path finding algorithm, since essentially one is finding minimal paths within the graph formed by totally interconnecting all vertices. In Prim's algorithm, edges are added in order of increasing path cost from the exchange, always taking the costs from unattached vertices to the tree built thus far. In Kruskal's algorithm, edges are added in order of increasing edge cost, skipping those that make circuits.
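As a concrete sketch of Prim's algorithm (in modern Python, which of course postdates these notes; the complete cost matrix and the choice of vertex 0 as the exchange are assumptions for the example):

```python
import heapq

def prim_mst(n, cost):
    """Minimal spanning tree of a complete graph on vertices 0..n-1.

    cost[i][j] is a symmetric edge-cost matrix (wire length, pole
    count, or any other cost factor); vertex 0 plays the exchange.
    Returns a list of (i, j) tree edges.
    """
    in_tree = [False] * n
    in_tree[0] = True
    # heap of (edge cost, tree vertex, outside vertex)
    heap = [(cost[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while len(edges) < n - 1:
        c, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue              # j was already attached by a cheaper edge
        in_tree[j] = True
        edges.append((i, j))
        for k in range(n):        # offer j's edges to the heap
            if not in_tree[k]:
                heapq.heappush(heap, (cost[j][k], j, k))
    return edges
```

Kruskal's algorithm would instead sort all edges once by increasing cost and add each edge that does not close a circuit, typically using a union-find structure to detect circuits.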

These algorithms are only approximately correct when external constraints are added. For example, if 25 pair cables are used, paths must be limited to 25 terminal vertices each. In such cases, one takes a less than perfect tree and tries to make small changes to improve it.

One of the simplest improvement techniques, which we will look at again, is the "branch exchange" technique. One finds an edge which was not used and which seems to have an attractively low cost. That new edge is inserted into the tree, closing some circuit, then other edges are removed until we have a tree again. If the total cost is actually lower, we can retain the new tree, otherwise we undo our changes and try another exchange.
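One step of the branch exchange might be sketched as follows (illustrative Python, not from the original notes; the tree is represented as an edge list and the costs as a matrix, and a real implementation would try many candidate edges):

```python
def branch_exchange(n, cost, edges, new_edge):
    """One step of the branch-exchange improvement heuristic.

    Insert new_edge (an unused edge) into the spanning tree given by
    edges, find the circuit it closes, and drop the costliest other
    edge on that circuit.  The change is kept only if it lowers the
    total cost; otherwise the original tree is returned.
    """
    adj = {v: [] for v in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    # find the tree path between the endpoints of new_edge (DFS)
    a, b = new_edge
    def path(u, target, seen):
        if u == target:
            return [u]
        seen.add(u)
        for w in adj[u]:
            if w not in seen:
                p = path(w, target, seen)
                if p:
                    return [u] + p
        return None

    circuit = path(a, b, set())
    # costliest edge on the circuit closed by new_edge
    worst = max(zip(circuit, circuit[1:]), key=lambda e: cost[e[0]][e[1]])
    if cost[worst[0]][worst[1]] > cost[a][b]:
        new_tree = [e for e in edges
                    if {e[0], e[1]} != {worst[0], worst[1]}]
        return new_tree + [new_edge]      # improvement: keep the change
    return edges                          # no gain: undo
```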

At some point, the number of terminal nodes becomes too large for one local exchange, and some division among local exchanges must be made. Equivalently, one may think of grouping lines with message concentrators. There are two interrelated decisions to make: where to locate the exchanges and which terminals to assign to each exchange. Fortunately, we usually have only a few alternate exchange locations available. In the more general case we would have a constrained continuous multiparameter minimization to do. In the more restricted case at hand we have a combinatorial minimization to do, either starting with a skeletal set of assignments and working up, or with excessive assignments and working down. Depending on the nature of the constraints and the couplings of the cost factors, an optimal or suboptimal solution may be feasible. An optimal solution is always possible by exhaustive enumeration of cases, but may require years of calculations.

If the previous discussion seems vague, it is because there is no clearly superior approach to the optimization involved. This is even more the case on the higher levels of the design problem, establishing the topology of wide area networks, i.e. the pattern of interconnections and line capacities of a network of long distance exchanges.

The costs involved in long-line trunks are large. This is true both for the system which runs its own lines and for the system which rents capacity from a common carrier. It is worth considerable effort to optimize the topology.

Since many message paths can be disconnected by the loss of one long distance trunk, a reasonable starting point in designing a wide area network is an analysis of the degree of redundancy required for each exchange. Using an estimate of the reliability of the equipment involved and some limit on the number of exchanges through which traffic must pass, one can estimate the number of vertex-disjoint paths required for communications between various regions, and start with a topology with exchanges having at least the required number of alternate lines.

Steiglitz, Weiner and Kleitman proposed sequentially connecting vertices which are most lacking in the required degree of redundancy, choosing to add the edge of lowest cost when several alternative choices are available. This provides a starting point which may or may not be sufficiently connected.

The resulting network may or may not have the required node-disjoint path connectivity. To achieve it, we must take each vertex pair, compute the actual degree of connectivity, and add sufficient paths to achieve the required degree. If no suitable intermediate vertices are available for such paths, extra vertices or extra direct edges may have to be added.

Eventually one achieves a network with the required degree of connectivity. However, this network may well have unnecessarily high connectivity for some vertex pairs and will probably have many uneconomically long paths. Long paths raise reliability and response time questions. Each extra vertex in a path increases the chance of a path disconnection and adds to the total transit delays. Thus we wish both to prune extra edges and to add new ones which bypass long paths, until we have a network of the required connectivity with sufficiently short paths.

Now we are ready to match edges to traffic, line capacities and costs. The traffic along edges is implied by the required flow between vertex pairs. It is tempting to use just the shortest path between two vertices to imply the flow. However, in highly connected networks this is not realistic, since traffic may be expected to flow along many equivalent routes. If one used just one such route, and it happened to share edges with a shortest path between two other vertices, one would force an unnecessarily high flow onto the shared edges, instead of making an economical distribution. Therefore, it is necessary to distribute the flows among alternate paths, either equally, or with weights inversely proportional to some cost function.

Since line capacities have not yet been assigned, the second choice would require some preliminary cost assumption, say a linear relationship between cost and distance.

Having flows, we can take a given limit on average delay to compute line capacities.

This requires us to establish a relationship between service rates, rs[i], and line capacities, C[i]. The simplest assumption is that the relationship is linear with the same constant factor of proportionality, u, for all lines:

rs[i] = u*C[i].

If C[i] is in, say, bits per second, and rs[i] is in messages per second, then u must be interpreted as the reciprocal of bits per message. Since the physical bit rate of a line is usually a constant, and most systems run with a few discrete packet sizes in a very limited range of sizes, we do not seem to have a Poisson distribution for the servers involved. Some writers avoid this issue by assuming a Poisson distribution of packet sizes. That just begs the question. The primary reason for using the M/M/1 queuing model is that, in practice, the predictions made from it are in reasonable agreement with reality. One might suspect that this is the case because variations in message handling times at the ends of the lines are sufficiently random to fit a Poisson distribution with C[i] suitably scaled down from the raw line bit rate.

As we saw earlier, the expected delay for a network with line traffic rates r[i] and service rates u*C[i] is given by:

W = sum( r[i]/(u*C[i] - r[i]) ) / sum( r[j] )
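As a sketch (illustrative Python; the function name and the choice of units are assumptions), the formula can be evaluated directly:

```python
def expected_delay(r, C, u):
    """Expected network delay W for line traffic rates r[i] (messages
    per second), line capacities C[i] (bits per second), and messages
    averaging 1/u bits, per the M/M/1 formula above."""
    assert all(u * Ci > ri for ri, Ci in zip(r, C)), "a line is saturated"
    return sum(ri / (u * Ci - ri) for ri, Ci in zip(r, C)) / sum(r)
```

For example, with 1000-bit messages (u = 0.001), flows of 5 and 10 messages per second, and capacities of 10000 and 20000 bits per second, each line runs at half capacity and W comes to 2/15 of a second.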

If the cost is some function of the capacities, g(C[1],C[2],..), then our task is to minimize g subject to some performance constraints. A common choice of constraint is a bound, W[0], on W. The calculation can be simplified by doing the minimization of g for W = W[0], on the assumption that reductions in delay time usually increase cost.

In order to solve such a constrained minimization problem, we can apply the technique of "Lagrange multipliers". Suppose we are trying to minimize a function of many variables, g, subject to the constraint equation, h = h[0]. Suppose both g and h are differentiable. Introduce a new variable, λ, and form

f = g + λ*(h-h[0])

Lagrange showed, for reasonable constraints, that, if g has a minimum at some point subject to the constraint h=h[0], then there is a value of lambda for which the partial derivatives of f all vanish at the same point. (We are actually assuming that the constraint implies a smooth surface of dimensionality one less than the number of free variables). In this case the form of f is

f = g + lambda*( sum( r[i]/(u*C[i] - r[i]) ) / sum( r[j] ) - W[0] )

The equations which result from setting the partial derivatives of f with respect to C[i] equal to zero are:

0 = df/dC[i] = dg/dC[i] - ( u*lambda*r[i]/sum(r[j]) ) * ( u*C[i] - r[i] )**(-2)

If we assume that the cost, g, is linear in the capacities, C[i], with coefficients d[i] (for distance along the edge), we get

1/( u*C[i] - r[i] ) = sqrt( ( d[i]*sum(r[j]) ) / ( u*lambda*r[i] ) )

which allows us to recompute W[0] without the u*C[i]-r[i] terms as

W[0] = sum( sqrt(r[i]*d[i]) ) / sqrt( u*lambda*sum(r[j]) )

from which we can derive lambda in terms of W[0], u, r, and d.

W[0]**2 = ( sum( sqrt(r[i]*d[i]) ) )**2 / ( u*lambda*sum(r[j]) )

lambda  = ( sum( sqrt(r[i]*d[i]) ) )**2 / ( u*W[0]**2*sum(r[j]) )

This then gives

C[i] = (1/u)*(r[i] + sqrt((u*lambda*r[i])/(d[i]*sum(r[j]))))

a definition of C[i] in terms of known quantities.
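The whole capacity assignment can be sketched numerically (illustrative Python; the variable names are assumptions). Substituting the resulting C[i] back into the delay formula reproduces W[0] exactly:

```python
from math import sqrt

def assign_capacities(r, d, u, W0):
    """Square-root capacity assignment: given line flows r[i]
    (messages per second), linear cost coefficients d[i], messages of
    1/u bits, and a delay bound W0, return capacities C[i] minimizing
    the linear cost sum(d[i]*C[i]) subject to W = W0."""
    R = sum(r)
    S = sum(sqrt(ri * di) for ri, di in zip(r, d))
    lam = S * S / (u * W0 * W0 * R)        # lambda in terms of W[0]
    return [(ri + sqrt(u * lam * ri / (di * R))) / u
            for ri, di in zip(r, d)]
```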

The difficulty with this approach, due to Kleinrock, is that the differentiability assumption rarely applies in practice. Usually, cost is a step function in terms of capacity, with only discrete capacities available. However, this approach does provide a quick and simple calculation of a starting point for a combinatorial cost minimization using more realistic cost functions.

Once we have a cost for a topology, we can compare it to the costs of alternative topologies. To get alternative topologies, we can take a starting topology, remove edges of low utilization or high cost and restore connectivity between vertices without direct paths but with high flows. Gerla, Frank and Eckl proposed a more efficient heuristic, the "Saturated Cut Heuristic", in which one first finds a minimal cut with maximal flow per unit capacity. The vertices incident on the cut are called primary vertices. The non-primary vertices within one edge of the primary vertices are called secondary vertices. Call the rest tertiary.

Tertiary vertices which have traffic to cross the cut face multiple edge paths for that traffic, and thus are likely candidates for a new edge to bypass the cut, especially if an inexpensive edge is possible and path delays are high. After reconnecting the network, edges with low utilization can be deleted and the process continued.

It is sometimes tempting to solve network cost optimization problems by adding more and more direct paths. In order to understand the cases in which direct connection is more or less cost effective than indirect routing, consider a simple network of three nodes, A, B, C, in which A and B each have traffic, rc[a] and rc[b], for C. Let the distance from A to B be d1, the distance from B to C be d2, and the distance from A to C be d3. Suppose we either route the traffic from A directly to C or via B.

case 1:   A ---------->-------- C --------<---------- B
          traffic  = rc[a]      traffic  = rc[b]
          distance = d3         distance = d2

case 2:   A ---------->-------- B ---------->-------- C
          traffic  = rc[a]      traffic  = rc[a]+rc[b]
          distance = d1         distance = d2

In case 1, the expected node to node delay for traffic is the same as the end to end delay, while in case 2, the expected delays differ. Let W[e] be the end to end delay in case 2. Let W[n] be the node to node delay in case 2. Since multiple hops are involved in case 2, W[e] is scaled up from W[n] by the expected number of hops:

W[e] = ( (2*rc[a] + rc[b]) / (rc[a] + rc[b]) ) * W[n]

With this in mind, we can calculate the optimal cost in each case for a given end to end delay, W, from the formulae above which use the node to node delay.

In case 1:

W = ( sqrt(rc[a]*d3) + sqrt(rc[b]*d2) ) / sqrt( u*lambda*(rc[a]+rc[b]) )

cost = (1/u)*( rc[a]*d3 + rc[b]*d2
               + sqrt(u*lambda)*( sqrt(rc[a]*d3) + sqrt(rc[b]*d2) )
                 / sqrt(rc[a]+rc[b]) )

     = (1/u)*( rc[a]*d3 + rc[b]*d2
               + (1/W)*( sqrt(rc[a]*d3) + sqrt(rc[b]*d2) )**2
                 / (rc[a]+rc[b]) )

In case 2:

W[n] = ( sqrt(rc[a]*d1) + sqrt((rc[a]+rc[b])*d2) )
       / sqrt( u*lambda*(2*rc[a]+rc[b]) )

cost = (1/u)*( rc[a]*d1 + (rc[a]+rc[b])*d2
               + (1/W[n])*( sqrt(rc[a]*d1) + sqrt((rc[a]+rc[b])*d2) )**2
                 / (2*rc[a]+rc[b]) )

     = (1/u)*( rc[a]*d1 + (rc[a]+rc[b])*d2
               + (1/W)*( sqrt(rc[a]*d1) + sqrt((rc[a]+rc[b])*d2) )**2
                 / (rc[a]+rc[b]) )

The difference, delta, of the cost in case 1 minus the cost in case 2 is then given by:

delta = (1/u)*( rc[a]*(d3-d1-d2)
                + (1/(W*(rc[a]+rc[b])))*( rc[a]*(d3-d1-d2)
                    + 2*sqrt( rc[a]*rc[b]*d3*d2 )
                    - 2*sqrt( rc[a]*(rc[a]+rc[b])*d1*d2 ) ) )

     >= (1/u)*( -2*d1*rc[a]
                + (1/(W*(rc[a]+rc[b])))*( -2*d1*rc[a]
                    + 2*sqrt( rc[a]*rc[b]*(d2-d1)*d2 )
                    - 2*sqrt( rc[a]*(rc[a]+rc[b])*d1*d2 ) ) )

by the triangle inequality, d2 ≤ d1+d3. For example, if d1 is small compared to d2, and W*(rc[a]+rc[b]) is small, then this lower bound for delta is approximately

(1/(u*W*(rc[a]+rc[b])))*(2*d2*sqrt(rc[a]*rc[b])).

Thus we have cases in which, since delta is positive, it is more cost effective to route A-C traffic through B than to send it directly from A to C.

In general, however, there will be a range of both positive and negative delta. Consider the straight line case, d3 = d1+d2. Then

delta = (1/(u*W*(rc[a]+rc[b])))*
        ( 2*sqrt( rc[a]*rc[b]*(d1+d2)*d2 ) - 2*sqrt( rc[a]*(rc[a]+rc[b])*d1*d2 ) )

is positive if and only if

rc[b]*d2 >= rc[a]*d1.

This is reasonable, in that it pays to go directly from A to C when A is far from B or has a great deal of traffic for C. That the cross-over point is the one given may not be so obvious.
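The cross-over can be checked numerically with the optimal-cost formulae above (illustrative Python; the particular rates and distances in the example are arbitrary):

```python
from math import sqrt

def case_costs(ra, rb, d1, d2, d3, u, W):
    """Optimal costs of case 1 (A sends directly to C) and case 2
    (A routes via B) for a given end-to-end delay bound W, per the
    cost formulae above."""
    c1 = (ra * d3 + rb * d2
          + (sqrt(ra * d3) + sqrt(rb * d2)) ** 2 / (W * (ra + rb))) / u
    c2 = (ra * d1 + (ra + rb) * d2
          + (sqrt(ra * d1) + sqrt((ra + rb) * d2)) ** 2 / (W * (ra + rb))) / u
    return c1, c2
```

On a straight line (d3 = d1+d2) with rc[a] = 2, rc[b] = 1 and d2 = 4, the two costs balance exactly at d1 = 2, where rc[b]*d2 = rc[a]*d1; for smaller d1, routing through B is cheaper.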

In the general case, we can make a rough estimate for a lower bound on delta by using the inequality

sqrt(v) - sqrt(w) >= sqrt(v/2 - w), for v >= 4*w >= 0,

to obtain

delta >= (1/u)*( -2*d1*rc[a]
                 + (1/(W*(rc[a]+rc[b])))*( -2*d1*rc[a]
                     + sqrt( 2*rc[a]*rc[b]*(d2-d1)*d2
                             - 4*rc[a]*(rc[a]+rc[b])*d1*d2 ) ) )

      = (1/u)*( -2*d1*rc[a]
                + (1/(W*(rc[a]+rc[b])))*( -2*d1*rc[a]
                    + sqrt( 2*rc[a]*( rc[b]*d2**2 - 2*rc[a]*d1*d2
                                      - 3*rc[b]*d1*d2 ) ) ) )

     >= (1/u)*( -2*d1*rc[a]
                + (1/(W*(rc[a]+rc[b])))*
                    sqrt( rc[a]*( rc[b]*d2**2 - 4*rc[a]*d1**2
                                  - 2*rc[a]*d1*d2 - 3*rc[b]*d1*d2 ) ) )

     >= (1/(W*u*(rc[a]+rc[b])))*
            sqrt( .5*rc[a]*( rc[b]*d2**2 - 4*rc[a]*d1**2
                             - 2*rc[a]*d1*d2 - 3*rc[b]*d1*d2
                             - 8*rc[a]*d1**2*W**2*(rc[a]+rc[b])**2 ) )

Suppose d1 ≤ d2/6, and W*(rc[a]+rc[b]) is less than 1, then delta will be positive if rc[b] > 5*rc[a].

Plotting delta in other common cases is left as an exercise for the reader.

Network topology design problems can be addressed by combining the tools of queuing theory and graph theory with reasonable heuristics. The scope of the problem can be greatly reduced by adopting a hierarchical approach, laying out local area networks and placing local exchanges as problems separate from the design of a backbone wide area network. In local area networks, line paths, line capacities, and exchange locations are usually sufficiently constrained to make the problem combinatorial rather than continuous. We may start with Prim's or Kruskal's algorithm to find a spanning tree of minimal cost and improve it with branch exchange.

Wide area network topology design usually raises additional questions of redundant paths and continuous variations in traffic and line capacities. Starting with a topology of sufficiently high connectivity, we allocate traffic flows to the lines and use Kleinrock's approach to allocate capacities which will minimize costs subject to the constraint of some maximum traffic delay. Having such a starting topology, we can attempt to improve it by branch exchange or the saturated cut heuristic. While perfect solutions are unlikely, and many different starting topologies may have to be tried before one is satisfied, such an approach does produce economically viable solutions.

1. Design a network for New York State, Connecticut, New Jersey, Pennsylvania, and Rhode Island. Assume one network node at the geographic location of each state capitol. Assume that each state has traffic with each other state in direct proportion to the product of their populations, at a rate of 1 bit per second per 1000000 people**2. Assume that wire costs are directly proportional to the product of capacity and distance at a rate of $.02 per foot per megabit per second. Assume that wire is available in all capacities, instead of in discrete units of capacity. Assume that service rate is equivalent to outgoing capacity and Poisson distributed. Limit average service delay to 100 microseconds per bit. Assume that every state must have two alternate paths to every other state.
2. Design a local area network for a community consisting of two sets of 21 houses each, each set of houses uniformly distributed on a circular road of diameter one mile. Do the problem in three cases: first when the two circular roads touch at one point, sharing one house, second when the two circular roads are two miles apart (center to center), and third when the two circular roads are ten miles apart. Assume a uniform distribution of traffic among pairs of houses and a wire cost proportional to distance. Assume local exchanges are free. First assume that house-to-house or house-to-exchange traffic cannot share wire capacity. Then assume that party-lines are acceptable. In all cases, assume that exchange-to-exchange traffic can share a common wire.
3. Suppose A, B and C are three nodes along a straight line, in which A and B each have traffic for C of 100 bits per second. Assume cost is proportional to the product of traffic and distance. Assume the B-C distance is fixed. Compute the range of A-C distances for which it is more cost-effective to send A-C traffic through B, rather than send it directly.
4. Change the cost assumption to be that cost is proportional only to distance and independent of traffic. Redo problem 3.

Having defined the major characteristics of network topology, we will now consider the techniques for moving data within the constraints of a given topology. We assume that we have the necessary physical connections (the physical layer) and have taken whatever steps are necessary to achieve a sufficiently low error rate in those connections (the data link layer). We are about to address the design of network layer processes. In the network layer, we are concerned with the problems which arise from the general interconnection of communicating nodes with possible intermediate nodes in the message paths and possible multiple paths for the same message.

A network layer which provides a "virtual circuit" service is expected to act like a perfect wire for its end-point users, while a network layer which provides "datagram" service is responsible only for the integrity of individual messages, not for ensuring their order, or even their arrival.

The extreme example of a virtual circuit service is a network, such as the older analogue telephone system, which provides physical circuits. Before a call is placed, a phone is connected only as far as its local exchange. When the call is placed, a sequence of trunk lines to the target exchange is reserved for the life of the call, creating a temporary direct connection between communicating phones. The only involvement of the network during the call is to monitor the traffic for a disconnect signal (i.e. the calling party hangs up) and to do accounting. Newer phone systems may use channel sharing on trunk lines, analogue to digital conversions, etc., but the effect of a physical circuit is still provided.

The extreme example of a datagram service is, as the name suggests, the old telegram system, where each message was sent to its destination with no attempt to relate it to any other message. For many years, businesses made effective use of large "tear-tape" shops, which could take in messages on paper tape from some source, tear them off of the incoming punch, and manually transfer them to the appropriate outgoing reader. These shops were the prototypes of what is now done electronically in packet switching networks providing datagram service.

For most applications, if the network layer does not provide virtual circuits, some higher layer will have to do the job. However, there are cases where datagrams are all that are required. In particular, when interconnecting heterogeneous networks, each using internal datagrams, there is little value in forcing sequencing at the interface. Also, in applications, such as military command and control, where the network topology may be very unstable, datagrams may be all that can be provided.

A related distinction is between "circuit switching" and "packet switching" networks. In circuit switching, a complete sequence of lines is provided to a data stream for some period of time, as in the telephone example above. The data stream can be quite arbitrary, since it does not have to share the lines. In packet switching, all data is broken into packets of limited size, and intermediate nodes may share a given line among the packets of many data streams. Clearly a circuit switched network is well suited to providing virtual circuits, and a packet switched network is well suited to providing datagram service, but the distinction is not rigid. As the necessary hardware becomes cheaper, packet switching has found application as the internal mechanism in more and more formerly circuit switched networks. When the applications require it, a circuit switched network can be used for datagrams.

In any case, either during circuit establishment for a virtual circuit or physical circuit, or with every message in getting a datagram to its destination, one must provide a routing mechanism. Since many messages will be handled by a single virtual circuit, one can usually accept a higher routing overhead than with datagrams. One should not, however, take this as a license for unnecessary delays.

In addition to providing a routing mechanism, the network layer must also provide means to regulate message flow and to avoid congestion.

Let us first consider routing. In designing the network topology we made flow assignments for pairs of communicating nodes. When the topology allowed alternate routes, we distributed the traffic among those routes. One of the simplest techniques to handle routing is to provide each of the communicating nodes with precisely the same information used in designing the network, so that messages may be sent along the anticipated routes. This is called static routing.

Each node needs a table with a row of alternate next nodes for each ultimate destination, possibly with an estimate of the desirability of using that route. Each message must contain its destination address. When a node gets a message, it takes the destination address, looks up the appropriate row of the routing table, and, using a suitable algorithm, sends the message on to one of the acceptable next nodes.

One possible algorithm is to use a random number generator weighted by the desirability of each route and randomly select from among the row entries. An alternative is to use the best route to saturation, then the next best, etc. In a network with high connectivity, the random selection corresponds best to the original design criteria and reduces the chances of saturating a route needed for other traffic. If connectivity is low, or the best route is dramatically better than the alternates, saturating the best route first may be the best choice. In all cases, there is an important distinction between the first hop a message is to take and all later hops. On the first hop, the message is being started on one of many alternate routes, and the design analysis holds. On later hops, the message is already on one of the node disjoint paths selected, and should be diverted only if the primary path is not available. Usually, that primary path to the destination is just the continuation of the best paths to that destination from earlier nodes (the optimality principle), making a tree, called the sink tree, leading from all other nodes to the destination.
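The weighted random selection might be sketched as follows (illustrative Python; the node names, weights, and table layout are invented for the example):

```python
import random

# Static routing table: for each ultimate destination, a row of
# (next node, desirability weight) pairs taken from the network design.
ROUTING_TABLE = {
    "C": [("B", 0.7), ("D", 0.2), ("E", 0.1)],
}

def next_hop(destination, rng=random):
    """Pick the next node for a message by a random draw weighted by
    the desirability of each alternate route."""
    row = ROUTING_TABLE[destination]
    nodes = [node for node, _ in row]
    weights = [weight for _, weight in row]
    return rng.choices(nodes, weights=weights)[0]
```

Saturating the best route first would instead always return the first entry of the row until that line fills, then fall back to the next entry.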

When a message simply must get through at all costs, one might go to the boundary case of static routing, flooding, i.e. sending a message on all alternate routes simultaneously. This, however, creates the risk of having a single message saturate the entire system forever. To prevent that, one adds a hop-counter to the message, which is incremented each time the message moves one link. When the hop counter reaches some limit, the message is discarded. In networks of modest size, an alternative to the hop counter is a bit map, with a bit position for each node. The originating node sets only its own bit. Each node that gets the message sends it only to nodes whose bits are not set. This limits the number of copies that one node will see to the number of incoming lines, and is considerably more efficient than a hop counter.
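Both safeguards can be sketched as follows (illustrative Python; the graph is an adjacency map of node names, and the bit map is represented as a set):

```python
def flood_hop_count(graph, origin, max_hops):
    """Flooding with a hop counter: every copy is forwarded on all
    lines, and discarded once its counter reaches max_hops.
    Returns the set of nodes reached."""
    reached = set()
    pending = [(origin, 0)]               # (node, hops so far)
    while pending:
        node, hops = pending.pop()
        reached.add(node)
        if hops >= max_hops:
            continue                      # this copy is discarded
        for nbr in graph[node]:
            pending.append((nbr, hops + 1))
    return reached

def flood_bitmap(graph, origin):
    """Flooding with a bit map: the message carries the set of nodes
    already scheduled to see it; each node forwards only to
    neighbours whose bit is not yet set, marking them as it does."""
    reached = set()
    pending = [(origin, frozenset([origin]))]
    while pending:
        node, seen = pending.pop()
        reached.add(node)
        targets = [n for n in graph[node] if n not in seen]
        marked = seen | set(targets)      # set the forwarded nodes' bits
        for nbr in targets:
            pending.append((nbr, marked))
    return reached
```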

When one moves beyond static routing and flooding to routing algorithms which adapt to changes in topology and traffic, it is important to maintain a clear view of the design criteria of the routing algorithm chosen. As we saw above, there is an inherent conflict between optimizing for minimal delay and optimizing for maximal utilization. Further, once one is doing network design on the fly, it is difficult to guarantee the correctness of the routings, which places a premium on simplicity and clarity. In networks of high connectivity, one would also like the algorithm to make the best use of that connectivity, i.e. to be robust in the presence of line and node failures.

The primary choice in adaptive routing is between centralized routing control and distributed routing control. In centralized routing control, one or a few routing control centers make the routing decisions for the network. The routing control centers periodically sample the state of the network and distribute new routing tables to the nodes. To avoid conflicts among nodes as to the state of their tables, each distributed table must have an "effective time" stamp. Then each node will start using the newly updated tables at the same time. The primary advantage of this technique is that massive computing resources can be brought to bear on the routing problem, achieving near optimal solutions. The disadvantages are that the centralized routing adds its own sampling and update traffic to the network load, and, to preserve synchronization, cannot respond very rapidly to load and topology changes. Further, the routing control centers and the links to them are potential points of failure of the system.
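The effective-time mechanism might be sketched as follows; the class, the table representation, and the shared-clock comparison are assumptions, not an actual protocol:

```python
# Sketch of "effective time" stamped table updates: each node holds the
# table in use plus at most one pending update, and all nodes switch to
# the pending table at the same moment of a common clock.
class Node:
    def __init__(self, table):
        self.table = table        # routing table currently in use
        self.pending = None       # (effective_time, new_table) or None

    def receive_update(self, effective_time, new_table):
        self.pending = (effective_time, new_table)

    def route(self, dest, now):
        # All nodes compare against the same clock, so they begin using
        # the newly distributed tables at the same time, avoiding
        # conflicts between old and new routings.
        if self.pending and now >= self.pending[0]:
            self.table = self.pending[1]
            self.pending = None
        return self.table[dest]
```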

In distributed routing, each node makes its own decisions about modifications to its routing tables. This allows very rapid response to changes in load and topology, but can lead to uncoordinated and conflicting decisions. The result can be an effective loss of connectivity, even though the network is physically connected. The extreme case of distributed routing is isolated routing, in which no information is exchanged among nodes in making routing decisions, other than noticing how well traffic is moving on various lines. The probability of poor routing choices can be very large. In more general distributed routing schemes, nodes explicitly probe their neighbors for routing information, causing major routing errors to gradually be corrected. Still, in all its forms, because distributed routing lacks time synchronized topology updates, it is more prone to routing errors and gross inefficiencies than centralized routing.

It seems the best one can do is to combine the two approaches, by having some centralized source(s) inform the nodes about the basic network topology, while the individual nodes bias that information with locally gleaned information about which lines don't seem to be working well and are producing large backlogs. The French Transpac system uses such a scheme. ARPANET uses a variation in which all nodes start with a centrally distributed topology and then periodically broadcast line delay information to all other nodes.

Most of these routing problems are greatly simplified if the network is hierarchically structured, so that only a small number of nodes have to be considered on each level. This keeps the table sizes small and allows rather thorough optimization of routes. The penalty paid is in ignoring possible better routes which violate the hierarchy. In practice, that penalty is small.
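A two-level version of such hierarchical tables might look like this; the "region.node" address format and the table contents are hypothetical. The point is that a node keeps one entry per foreign region plus one per local node, rather than one per node in the whole network:

```python
# Sketch of two-level hierarchical routing for a node in region "1".
MY_REGION = "1"
REGION_TABLE = {"2": "gateway-a", "3": "gateway-b"}  # next hop per foreign region
LOCAL_TABLE = {"1.1": "line-1", "1.2": "line-2"}     # next hop per local node

def next_hop(dest):
    region, _, _node = dest.partition(".")
    if region == MY_REGION:
        return LOCAL_TABLE[dest]     # route within our own region
    return REGION_TABLE[region]      # all foreign traffic goes via a gateway
```

The penalty mentioned above shows up here directly: every message for region "3" is funneled through `gateway-b`, even if some individual node in region "3" happens to be reachable by a shorter hierarchy-violating path.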

The following material is abstracted from Some Comments on Highly Dynamic Network Routing, by H. J. Bernstein, Technical Report 371, New York University, Department of Computer Science, CIMS, NY, NY, May 1988.

In a large network there can be a significant benefit in avoiding poor choices of routing. When a high speed path of few hops is available, it would seem to be sheer folly to send any traffic along slow multi-hop back-door paths. However, the addition of traffic to a route removes some of its capacity, and the next set of messages might well be better sent along an alternate path. The assignment of a set of alternate paths with an allocation of portions of the offered load among them gives rise to the bifurcated routing problem. Under reasonable constraints it is possible to find routings which optimize an appropriate payoff function, e.g. average end-to-end delay [M. Gerla, The Design of Store and Forward Networks for Computer Communications, Ph.D. Thesis, Dept. of Computer Science, UCLA, 1973, and L. Fratta, M. Gerla and L. Kleinrock, The Flow Deviation Method: An Approach to Store and Forward Communication Network Design, Networks, vol. 3, pp. 97-133, 1973].

Assuming we have some acceptable routing mechanism, the remaining network layer questions are the regulation of message flow and the avoidance of congestion.

The message flow control question arises from the finite resources of receivers. At some point, no further buffers are available for incoming messages, and the sending of more messages must be stopped. The simplest approach is to design the network so that sender and line capacities are sufficiently below receiver capacities to prevent the problem from ever arising. An example is a 30 character per second printer used on a 300 Baud line. However, the desire to avoid wasted capacity usually makes this an unacceptable solution. The next level of control is provided by having the receiver inform the sender of the currently acceptable rate of transfer. As long as the estimates are conservative, this will work, but one cannot use all the capacity because one must allow for the time delay between the receiver's perception of its capacity and the arrival of that information at the sender. To go beyond such rate limiting, one must introduce a positive acknowledgement scheme, in which no more than a certain number of messages may be sent before the sender waits for an acknowledgement of the earliest messages. Such handshaking is common in high speed data communications, and also provides the basis for reliable error recovery.
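The positive-acknowledgement scheme can be sketched as a window on outstanding sequence numbers; the window size of 4 and the cumulative-acknowledgement convention are assumptions:

```python
# Sketch of positive-acknowledgement flow control: the sender may have
# at most WINDOW unacknowledged messages outstanding, and must then
# wait for the acknowledgement of the earliest one.
WINDOW = 4

class Sender:
    def __init__(self):
        self.next_seq = 0        # sequence number of the next message to send
        self.oldest_unacked = 0  # earliest message not yet acknowledged

    def can_send(self):
        return self.next_seq - self.oldest_unacked < WINDOW

    def send(self):
        if not self.can_send():
            raise RuntimeError("window full: wait for an acknowledgement")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, seq):
        # Cumulative acknowledgement of everything up to and including seq.
        self.oldest_unacked = max(self.oldest_unacked, seq + 1)
```

Simple rate limiting corresponds to the degenerate case of pacing `send` calls by a timer with no `ack` at all; the window adds the handshaking on which reliable error recovery is also built.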

Congestion is a more difficult problem to handle than simple flow control. In congestion, an increasing number of messages are moving among nodes, but a decreasing number are arriving at their final destinations. Virtual circuits avoid the problem by preallocating the necessary resources to keep traffic moving to its destination. In a datagram system, since arrival is not guaranteed, one can back off from congestion by simply discarding packets to reduce the load. There are, however, intermediate solutions, which amount to restricting the total load on the network below the point of congestion.

Let us consider packet discarding. If congestion is moderate and localized, discarding a few packets might clear the problem. Clearly, it is most desirable to discard packets which are involved with the congested lines. I. M. Irland ("Buffer Management in a Packet Switch", IEEE Trans. Commun., vol. COM-26, pp. 328-337, Mar. 1978) suggested that the designation of a line as congested be based on its queue using more than (number of outgoing lines)**-0.5 of the buffer capacity of the node. Kamoun ("Design Considerations for Large Computer Communications Networks", Ph.D. thesis, Computer Science Dept., UCLA, 1976) suggested also leaving each uncongested line with some capacity, even in the absence of traffic. This keeps congestion on a few lines from stopping traffic on the others.

For example, suppose we have a node with 16 outgoing lines and buffer space for 160 messages. We might decide to reserve 2 message buffers per line in all cases. If we run short on buffers, any line using more than 40 buffers will be considered congested, and have further messages discarded, rather than taking more buffers. As long as only three lines are building large queues, we can go even higher than 40 without cutting into the reserve. When many lines are building large queues, we may never reach 40, since there are only an average of 10 buffers per line. In that case, we discard messages for all lines which don't have a reserved buffer free.
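One reading of this admission rule can be sketched as follows, using the numbers from the example (16 lines, 160 buffers, 2 reserved per line, threshold 160/sqrt(16) = 40). The precise test for "running short on buffers" is an assumption here:

```python
# Sketch of buffer admission combining Irland's square-root threshold
# with Kamoun's per-line reserve, using the figures from the example.
LINES = 16
BUFFERS = 160
RESERVE = 2
THRESHOLD = BUFFERS * LINES ** -0.5   # = 40 buffers

def accept(queue_lens, line):
    """Decide whether to buffer one more message for `line`.
    queue_lens maps each line to its current queue length."""
    free = BUFFERS - sum(queue_lens.values())
    # Buffers still owed to other lines' untouched reserves.
    reserves_due = sum(RESERVE - q for l, q in queue_lens.items()
                       if q < RESERVE and l != line)
    if free <= reserves_due:
        # Only reserved buffers remain; a line may still use its own.
        return queue_lens[line] < RESERVE and free > 0
    # "Running short" is modeled here, as an assumption, as less than a
    # quarter of the pool free; only then does the threshold bite.
    buffers_short = free < BUFFERS // 4
    if buffers_short and queue_lens[line] >= THRESHOLD:
        return False                  # Irland: this line is congested
    return True
```

With only a few lines queuing heavily the pool stays comfortable and a queue may grow past 40, as noted above; once many lines contend, the threshold and the reserves cap each one.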

This last situation can have serious consequences. Extensive discarding of packets simply generates more traffic which has to be discarded since the discarded packets must eventually be retransmitted. If the only option available is packet discarding, one should at least discard the packets as early in their lives as possible. It would have been better never to have accepted too many packets to begin with.

In the special case of ring networks, it is possible to achieve this result by filling the ring with a constant number of circulating dummy packets. A message may be introduced into the ring only in place of a dummy message. When a real message is finally delivered, a dummy packet must be reintroduced into the ring. This "isarithmic" approach can be extended to networks of linked rings. As long as no packets are lost, the system can be taken to full load without congestion. If packets are lost, as they must eventually be, a loss of capacity similar to that of congestion occurs.
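A minimal sketch of the rule for a single ring follows; the slot count and representation are assumptions. The invariant is that real messages plus dummies always total the slot count, so the load can never exceed it:

```python
# Sketch of isarithmic control on one ring: a fixed number of slots
# circulate, each holding either a dummy (None) or a real message.
class Ring:
    def __init__(self, slots=8):
        self.slots = [None] * slots   # None marks a circulating dummy

    def try_send(self, message):
        """A message may enter only in place of a dummy."""
        for i, s in enumerate(self.slots):
            if s is None:
                self.slots[i] = message
                return True
        return False                  # ring full: the sender must wait

    def deliver(self, i):
        """On delivery, a dummy is reintroduced in the vacated slot."""
        message, self.slots[i] = self.slots[i], None
        return message
```

A lost packet in this scheme is a slot that never reverts to a dummy, which is exactly the permanent loss of capacity described above.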

In a simple ring, or a network built of linked simple rings, lost packets can be reinserted on the basis of open time slots, so networks made of linked rings can use this approach effectively. Attempts to apply the idea to general networks without a uniform circulation of packets fail, because there is no simple way to decide when to reinsert a lost packet.

In the general case, one must fall back on the original design of the network, and limit flow to the network capacity. In the absence of traffic variations and node or line failures, this could be done by table lookups. Traffic does, however, vary, and components do fail, so a more dynamic approach is required. One approach is to use the flow control mechanisms considered above to ensure that no sender is allowed to swamp any receiver. The triggers can be similar to those resorted to in packet discarding, but with thresholds set so that, if the sender responds as required, buffers will be available for all traffic. If the sender being slowed is an intermediate node in a message flow, the action of slowing down as a sender will probably trigger its own flow control mechanisms to slow down the prior nodes in the chain sending the traffic. Thus, unlike packet discarding, use of flow control does not tend to generate extra traffic.

Both with simple flow control, and with packet discarding, it is important to design to avoid deadlocks. Deadlocks occur when two customers each need service and each of them is holding resources needed to serve the other. There are two major approaches to avoiding this situation. First, one can avoid giving any customer part of his required resources until all of his required resources are free. Second, one can include a timeout, after which an unserviced customer holding scarce resources is forced to release them. In networks, the resource most prone to deadlocks is the buffer pool. Preallocation, as in virtual circuit design and in the line reservation scheme above, prevents deadlocks at the expense of idle capacity. Resource release after a timeout can be made to work with less wasted capacity, but requires a careful case analysis to avoid infinite loops among states.

Network layer routing is required either for circuit establishment in a virtual circuit network, or for each packet in a datagram service. If the network traffic and topology are stable, static routing can be used. However, if the traffic changes or lines and nodes drop out of the network, dynamic routing becomes essential. The rerouting decisions could be made by a central facility or by each node in a distributed fashion. The first choice ensures correct choices. The second choice saves considerable time. In practice, a combination of both centralized and distributed routing works well.

Unless we are dealing with a very deterministic message pattern, some form of congestion control is needed. We can discard packets in a datagram service. We can reserve some capacity for each essential message route. In ring networks, we can fill the system with dummy traffic and require that new traffic wait to replace a dummy message. We can use explicit flow control mechanisms to keep the message traffic below network saturation.

- Consider a network made up of a 5 by 5 rectangular grid of 25 nodes
connected by lines of equal capacity:
```
*-----*-----*-----*-----*
|     |     |     |     |
|     |     |     |     |
*-----A-----*-----*-----*
|     |     |     |     |
|     |     |     |     |
*-----*-----C-----*-----*
|     |     |     |     |
|     |     |     |     |
*-----*-----*-----B-----*
|     |     |     |     |
|     |     |     |     |
*-----*-----*-----*-----*
```
Provide routing tables for all nodes involved in traffic from A to B.

- Now suppose node C wishes to broadcast a message to all nodes in the network
above, except to A and B. Compare the amount of traffic generated by the following:
- a. Send an individual message to each target node.
- b. Send the message by flooding with a minimal hop counter.
- c. Send the message to each appropriate neighbor with a 25-bit wide bit map indicating which destinations are to be reached.

- Suppose we wished to apply isarithmetic congestion control to the network in
problem 1, above. Assume each line can be fed at most one packet per second.
Treat each small square of the network as a ring in which the dummy packets are to
circulate. How fast can packets circulate in each square? How long will it take
to get a packet from A to B? How long would it take without the dummy packets in
circulation?