NNeettwwoorrkkiinngg IImmpplleemmeennttaattiioonn NNootteess 44..44BBSSDD EEddiittiioonn _S_a_m_u_e_l _J_. _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _N_. _J_o_y_, _R_o_b_e_r_t _S_. _F_a_b_r_y_, _a_n_d _M_i_c_h_a_e_l _J_. _K_a_r_e_l_s Computer Systems Research Group Computer Science Division Department of Electrical Engineering and Computer Science University of California, Berkeley Berkeley, CA 94720 _A_B_S_T_R_A_C_T This report describes the internal structure of the networking facilities developed for the 4.4BSD version of the UNIX* operating system for the VAX|-. These facilities are based on several central abstractions which structure the external (user) view of network communication as well as the internal (system) implementation. The report documents the internal structure of the networking system. The ``Berkeley Software Architecture Manual, 4.4BSD Edition'' (PSD:5) pro- vides a description of the user interface to the networking facilities. Revised June 10, 1993 ----------- * UNIX is a trademark of Bell Laboratories. |- DEC, VAX, DECnet, and UNIBUS are trademarks of Digital Equipment Corporation. SMM:18-2 Networking Implementation Notes TTAABBLLEE OOFF CCOONNTTEENNTTSS 11.. IInnttrroodduuccttiioonn 22.. OOvveerrvviieeww 33.. GGooaallss 44.. IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn 55.. MMeemmoorryy mmaannaaggeemmeenntt 66.. IInntteerrnnaall llaayyeerriinngg 6.1. Socket layer 6.1.1. Socket state 6.1.2. Socket data queues 6.1.3. Socket connection queuing 6.2. Protocol layer(s) 6.3. Network-interface layer 6.3.1. UNIBUS interfaces 77.. SSoocckkeett//pprroottooccooll iinntteerrffaaccee 88.. PPrroottooccooll//pprroottooccooll iinntteerrffaaccee 8.1. pr_output 8.2. pr_input 8.3. pr_ctlinput 8.4. pr_ctloutput 99.. PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee 9.1. Packet transmission 9.2. Packet reception 1100.. GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess 10.1. Routing tables 10.2. Routing table interface 10.3. User level routing policies 1111.. RRaaww ssoocckkeettss 11.1. Control blocks 11.2. Input processing 11.3. Output processing 1122.. BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll 12.1. Memory management 12.2. Protocol buffering policies 12.3. Queue limiting 12.4. Packet forwarding 1133.. OOuutt ooff bbaanndd ddaattaa 1144.. TTrraaiilleerr pprroottooccoollss Networking Implementation Notes SMM:18-3 AAcckknnoowwlleeddggeemmeennttss RReeffeerreenncceess SMM:18-4 Networking Implementation Notes 11.. IInnttrroodduuccttiioonn This report describes the internal structure of facili- ties added to the 4.2BSD version of the UNIX operating sys- tem for the VAX, as modified in the 4.4BSD release. The system facilities provide a uniform user interface to net- working within UNIX. In addition, the implementation intro- duces a structure for network communications which may be used by system implementors in adding new networking facili- ties. The internal structure is not visible to the user, rather it is intended to aid implementors of communication protocols and network services by providing a framework which promotes code sharing and minimizes implementation effort. The reader is expected to be familiar with the C pro- gramming language and system interface, as described in the _B_e_r_k_e_l_e_y _S_o_f_t_w_a_r_e _A_r_c_h_i_t_e_c_t_u_r_e _M_a_n_u_a_l_, _4_._4_B_S_D _E_d_i_t_i_o_n [Joy86]. Basic understanding of network communication con- cepts is assumed; where required any additional ideas are introduced. The remainder of this document provides a description of the system internals, avoiding, when possible, those por- tions which are utilized only by the interprocess communica- tion facilities. 22.. OOvveerrvviieeww If we consider the International Standards Organiza- tion's (ISO) Open System Interconnection (OSI) model of net- work communication [ISO81] [Zimmermann80], the networking facilities described here correspond to a portion of the session layer (layer 3) and all of the transport and network layers (layers 2 and 1, respectively). The network layer provides possibly imperfect data transport services with minimal addressing structure. Addressing at this level is normally host to host, with implicit or explicit routing optionally supported by the communicating agents. At the transport layer the notions of reliable trans- fer, data sequencing, flow control, and service addressing are normally included. Reliability is usually managed by explicit acknowledgement of data delivered. Failure to acknowledge a transfer results in retransmission of the data. Sequencing may be handled by tagging each message handed to the network layer by a _s_e_q_u_e_n_c_e _n_u_m_b_e_r and main- taining state at the endpoints of communication to utilize received sequence numbers in reordering data which arrives out of order. Networking Implementation Notes SMM:18-5 The session layer facilities may provide forms of addressing which are mapped into formats required by the transport layer, service authentication and client authenti- cation, etc. Various systems also provide services such as data encryption and address and protocol translation. The following sections begin by describing some of the common data structures and utility routines, then examine the internal layering. The contents of each layer and its interface are considered. Certain of the interfaces are protocol implementation specific. For these cases examples have been drawn from the Internet [Cerf78] protocol family. Later sections cover routing issues, the design of the raw socket interface and other miscellaneous topics. 33.. GGooaallss The networking system was designed with the goal of supporting multiple _p_r_o_t_o_c_o_l _f_a_m_i_l_i_e_s and addressing styles. This required information to be ``hidden'' in common data structures which could be manipulated by all the pieces of the system, but which required interpretation only by the protocols which ``controlled'' it. The system described here attempts to minimize the use of shared data structures to those kept by a suite of protocols (a _p_r_o_t_o_c_o_l _f_a_m_i_l_y), and those used for rendezvous between ``synchronous'' and ``asynchronous'' portions of the system (e.g. queues of data packets are filled at interrupt time and emptied based on user requests). A major goal of the system was to provide a framework within which new protocols and hardware could be easily be supported. To this end, a great deal of effort has been extended to create utility routines which hide many of the more complex and/or hardware dependent chores of networking. Later sections describe the utility routines and the under- lying data structures they manipulate. 44.. IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn Common to all portions of the system are two data structures. These structures are used to represent addresses and various data objects. Addresses, internally are described by the _s_o_c_k_a_d_d_r structure, struct sockaddr { short sa_family; /* data format identifier */ char sa_data[14]; /* address */ }; All addresses belong to one or more _a_d_d_r_e_s_s _f_a_m_i_l_i_e_s which define their format and interpretation. The _s_a___f_a_m_i_l_y field indicates the address family to which the address belongs, and the _s_a___d_a_t_a field contains the actual data value. The SMM:18-6 Networking Implementation Notes size of the data field, 14 bytes, was selected based on a study of current address formats.* Specific address formats use private structure definitions that define the format of the data field. The system interface supports larger address structures, although address-family-independent sup- port facilities, for example routing and raw socket inter- faces, provide only 14 bytes for address storage. Protocols that do not use those facilities (e.g, the current Unix domain) may use larger data areas. 55.. MMeemmoorryy mmaannaaggeemmeenntt A single mechanism is used for data storage: memory buffers, or _m_b_u_f's. An mbuf is a structure of the form: struct mbuf { struct mbuf *m_next; /* next buffer in chain */ u_long m_off; /* offset of data */ short m_len; /* amount of data in this mbuf */ short m_type; /* mbuf type (accounting) */ u_char m_dat[MLEN]; /* data storage */ struct mbuf *m_act; /* link in higher-level mbuf list */ }; The _m___n_e_x_t field is used to chain mbufs together on linked lists, while the _m___a_c_t field allows lists of mbuf chains to be accumulated. By convention, the mbufs common to a single object (for example, a packet) are chained together with the _m___n_e_x_t field, while groups of objects are linked via the _m___a_c_t field (possibly when in a queue). Each mbuf has a small data area for storing informa- tion, _m___d_a_t. The _m___l_e_n field indicates the amount of data, while the _m___o_f_f field is an offset to the beginning of the data from the base of the mbuf. Thus, for example, the macro _m_t_o_d, which converts a pointer to an mbuf to a pointer to the data stored in the mbuf, has the form #define mtod(_x,_t) ((_t)((int)(_x) + (_x)->m_off)) (note the _t parameter, a C type cast, which is used to cast the resultant pointer for proper assignment). In addition to storing data directly in the mbuf's data area, data of page size may be also be stored in a separate area of memory. The mbuf utility routines maintain a pool of pages for this purpose and manipulate a private page map for such pages. An mbuf with an external data area may be recognized by the larger offset to the data area; this is formalized by the macro M_HASCL(_m), which is true if the mbuf whose address is _m has an external page cluster. An ----------- * Later versions of the system may support vari- able length addresses. Networking Implementation Notes SMM:18-7 array of reference counts on pages is also maintained so that copies of pages may be made without core to core copy- ing (copies are created simply by duplicating the reference to the data and incrementing the associated reference counts for the pages). Separate data pages are currently used only when copying data from a user process into the kernel, and when bringing data in at the hardware level. Routines which manipulate mbufs are not normally aware whether data is stored directly in the mbuf data array, or if it is kept in separate pages. The following may be used to allocate and free mbufs: m = m_get(wait, type); MGET(m, wait, type); The subroutine _m___g_e_t and the macro _M_G_E_T each allocate an mbuf, placing its address in _m. The argument _w_a_i_t is either M_WAIT or M_DONTWAIT according to whether allocation should block or fail if no mbuf is avail- able. The _t_y_p_e is one of the predefined mbuf types for use in accounting of mbuf allocation. MCLGET(m); This macro attempts to allocate an mbuf page cluster to associate with the mbuf _m. If successful, the length of the mbuf is set to CLSIZE, the size of the page cluster. n = m_free(m); MFREE(m,n); The routine _m___f_r_e_e and the macro _M_F_R_E_E each free a sin- gle mbuf, _m, and any associated external storage area, placing a pointer to its successor in the chain it heads, if any, in _n. m_freem(m); This routine frees an mbuf chain headed by _m. The following utility routines are available for manip- ulating mbuf chains: m = m_copy(m0, off, len); The _m___c_o_p_y routine create a copy of all, or part, of a list of the mbufs in _m_0. _L_e_n bytes of data, starting _o_f_f bytes from the front of the chain, are copied. Where possible, reference counts on pages are used instead of core to core copies. The original mbuf chain must have at least _o_f_f + _l_e_n bytes of data. If _l_e_n is specified as M_COPYALL, all the data present, offset as before, is copied. SMM:18-8 Networking Implementation Notes m_cat(m, n); The mbuf chain, _n, is appended to the end of _m. Where possible, compaction is performed. m_adj(m, diff); The mbuf chain, _m is adjusted in size by _d_i_f_f bytes. If _d_i_f_f is non-negative, _d_i_f_f bytes are shaved off the front of the mbuf chain. If _d_i_f_f is negative, the alteration is performed from back to front. No space is reclaimed in this operation; alterations are accom- plished by changing the _m___l_e_n and _m___o_f_f fields of mbufs. m = m_pullup(m0, size); After a successful call to _m___p_u_l_l_u_p, the mbuf at the head of the returned list, _m, is guaranteed to have at least _s_i_z_e bytes of data in contiguous memory within the data area of the mbuf (allowing access via a pointer, obtained using the _m_t_o_d macro, and allowing the mbuf to be located from a pointer to the data area using _d_t_o_m, defined below). If the original data was less than _s_i_z_e bytes long, _l_e_n was greater than the size of an mbuf data area (112 bytes), or required resources were unavailable, _m is 0 and the original mbuf chain is deallocated. This routine is particularly useful when verifying packet header lengths on reception. For example, if a packet is received and only 8 of the necessary 16 bytes required for a valid packet header are present at the head of the list of mbufs representing the packet, the remaining 8 bytes may be ``pulled up'' with a single _m___p_u_l_l_u_p call. If the call fails the invalid packet will have been discarded. By insuring that mbufs always reside on 128 byte bound- aries, it is always possible to locate the mbuf associated with a data area by masking off the low bits of the virtual address. This allows modules to store data structures in mbufs and pass them around without concern for locating the original mbuf when it comes time to free the structure. Note that this works only with objects stored in the inter- nal data buffer of the mbuf. The _d_t_o_m macro is used to con- vert a pointer into an mbuf's data area to a pointer to the mbuf, #define dtom(x) ((struct mbuf *)((int)x & ~(MSIZE-1))) Mbufs are used for dynamically allocated data struc- tures such as sockets as well as memory allocated for pack- ets and headers. Statistics are maintained on mbuf usage and can be viewed by users using the _n_e_t_s_t_a_t(1) program. Networking Implementation Notes SMM:18-9 66.. IInntteerrnnaall llaayyeerriinngg The internal structure of the network system is divided into three layers. These layers correspond to the services provided by the socket abstraction, those provided by the communication protocols, and those provided by the hardware interfaces. The communication protocols are normally lay- ered into two or more individual cooperating layers, though they are collectively viewed in the system as one layer pro- viding services supportive of the appropriate socket abstraction. The following sections describe the properties of each layer in the system and the interfaces to which each must conform. 66..11.. SSoocckkeett llaayyeerr The socket layer deals with the interprocess communica- tion facilities provided by the system. A socket is a bidi- rectional endpoint of communication which is ``typed'' by the semantics of communication it supports. The system calls described in the _B_e_r_k_e_l_e_y _S_o_f_t_w_a_r_e _A_r_c_h_i_t_e_c_t_u_r_e _M_a_n_u_a_l [Joy86] are used to manipulate sockets. A socket consists of the following data structure: struct socket { short so_type; /* generic type */ short so_options; /* from socket call */ short so_linger; /* time to linger while closing */ short so_state; /* internal state flags */ caddr_t so_pcb; /* protocol control block */ struct protosw *so_proto; /* protocol handle */ struct socket *so_head; /* back pointer to accept socket */ struct socket *so_q0; /* queue of partial connections */ short so_q0len; /* partials on so_q0 */ struct socket *so_q; /* queue of incoming connections */ short so_qlen; /* number of connections on so_q */ short so_qlimit; /* max number queued connections */ struct sockbuf so_rcv; /* receive queue */ struct sockbuf so_snd; /* send queue */ short so_timeo; /* connection timeout */ u_short so_error; /* error affecting connection */ u_short so_oobmark; /* chars to oob mark */ short so_pgrp; /* pgrp for signals */ }; Each socket contains two data queues, _s_o___r_c_v and _s_o___s_n_d, and a pointer to routines which provide supporting services. The type of the socket, _s_o___t_y_p_e is defined at socket creation time and used in selecting those services which are appropriate to support it. The supporting SMM:18-10 Networking Implementation Notes protocol is selected at socket creation time and recorded in the socket data structure for later use. Protocols are defined by a table of procedures, the _p_r_o_t_o_s_w structure, which will be described in detail later. A pointer to a protocol-specific data structure, the ``protocol control block,'' is also present in the socket structure. Protocols control this data structure, which normally includes a back pointer to the parent socket structure to allow easy lookup when returning information to a user (for example, placing an error number in the _s_o___e_r_r_o_r field). The other entries in the socket structure are used in queuing connection requests, validating user requests, storing socket charac- teristics (e.g. options supplied at the time a socket is created), and maintaining a socket's state. Processes ``rendezvous at a socket'' in many instances. For instance, when a process wishes to extract data from a socket's receive queue and it is empty, or lacks sufficient data to satisfy the request, the process blocks, supplying the address of the receive queue as a ``wait channel' to be used in notification. When data arrives for the process and is placed in the socket's queue, the blocked process is identified by the fact it is waiting ``on the queue.'' 66..11..11.. SSoocckkeett ssttaattee A socket's state is defined from the following: #define SS_NOFDREF 0x001 /* no file table ref any more */ #define SS_ISCONNECTED 0x002 /* socket connected to a peer */ #define SS_ISCONNECTING 0x004 /* in process of connecting to peer */ #define SS_ISDISCONNECTING 0x008 /* in process of disconnecting */ #define SS_CANTSENDMORE 0x010 /* can't send more data to peer */ #define SS_CANTRCVMORE 0x020 /* can't receive more data from peer */ #define SS_RCVATMARK 0x040 /* at mark on input */ #define SS_PRIV 0x080 /* privileged */ #define SS_NBIO 0x100 /* non-blocking ops */ #define SS_ASYNC 0x200 /* async i/o notify */ The state of a socket is manipulated both by the proto- cols and the user (through system calls). When a socket is created, the state is defined based on the type of socket. It may change as control actions are performed, for example connection establishment. It may also change according to the type of input/output the user wishes to perform, as indicated by options set with _f_c_n_t_l. ``Non-blocking'' I/O implies that a process should never be blocked to await resources. Instead, any call which would block returns pre- maturely with the error EWOULDBLOCK, or the service request may be partially fulfilled, e.g. a request for more data than is present. Networking Implementation Notes SMM:18-11 If a process requested ``asynchronous'' notification of events related to the socket, the SIGIO signal is posted to the process when such events occur. An event is a change in the socket's state; examples of such occurrences are: space becoming available in the send queue, new data available in the receive queue, connection establishment or disestablish- ment, etc. A socket may be marked ``privileged'' if it was created by the super-user. Only privileged sockets may bind addresses in privileged portions of an address space or use ``raw'' sockets to access lower levels of the network. 66..11..22.. SSoocckkeett ddaattaa qquueeuueess A socket's data queue contains a pointer to the data stored in the queue and other entries related to the manage- ment of the data. The following structure defines a data queue: struct sockbuf { u_short sb_cc; /* actual chars in buffer */ u_short sb_hiwat; /* max actual char count */ u_short sb_mbcnt; /* chars of mbufs used */ u_short sb_mbmax; /* max chars of mbufs to use */ u_short sb_lowat; /* low water mark */ short sb_timeo; /* timeout */ struct mbuf *sb_mb; /* the mbuf chain */ struct proc *sb_sel; /* process selecting read/write */ short sb_flags; /* flags, see below */ }; Data is stored in a queue as a chain of mbufs. The actual count of data characters as well as high and low water marks are used by the protocols in controlling the flow of data. The amount of buffer space (characters of mbufs and associated data pages) is also recorded along with the limit on buffer allocation. The socket routines cooper- ate in implementing the flow control policy by blocking a process when it requests to send data and the high water mark has been reached, or when it requests to receive data and less than the low water mark is present (assuming non- blocking I/O has not been specified).* When a socket is created, the supporting protocol ``reserves'' space for the send and receive queues of the socket. The limit on buffer allocation is set somewhat higher than the limit on data characters to account for the granularity of buffer allocation. The actual storage asso- ciated with a socket queue may fluctuate during a socket's ----------- * The low-water mark is always presumed to be 0 in the current implementation. SMM:18-12 Networking Implementation Notes lifetime, but it is assumed that this reservation will always allow a protocol to acquire enough memory to satisfy the high water marks. The timeout and select values are manipulated by the socket routines in implementing various portions of the interprocess communications facilities and will not be described here. Data queued at a socket is stored in one of two styles. Stream-oriented sockets queue data with no addresses, head- ers or record boundaries. The data are in mbufs linked through the _m___n_e_x_t field. Buffers containing access rights may be present within the chain if the underlying protocol supports passage of access rights. Record-oriented sockets, including datagram sockets, queue data as a list of packets; the sections of packets are distinguished by the types of the mbufs containing them. The mbufs which comprise a record are linked through the _m___n_e_x_t field; records are linked from the _m___a_c_t field of the first mbuf of one packet to the first mbuf of the next. Each packet begins with an mbuf containing the ``from'' address if the protocol pro- vides it, then any buffers containing access rights, and finally any buffers containing data. If a record contains no data, no data buffers are required unless neither address nor access rights are present. A socket queue has a number of flags used in synchro- nizing access to the data and in acquiring resources: #define SB_LOCK 0x01 /* lock on data queue (so_rcv only) */ #define SB_WANT 0x02 /* someone is waiting to lock */ #define SB_WAIT 0x04 /* someone is waiting for data/space */ #define SB_SEL 0x08 /* buffer is selected */ #define SB_COLL 0x10 /* collision selecting */ The last two flags are manipulated by the system in imple- menting the select mechanism. 66..11..33.. SSoocckkeett ccoonnnneeccttiioonn qquueeuuiinngg In dealing with connection oriented sockets (e.g. SOCK_STREAM) the two ends are considered distinct. One end is termed _a_c_t_i_v_e, and generates connection requests. The other end is called _p_a_s_s_i_v_e and accepts connection requests. From the passive side, a socket is marked with SO_ACCEPTCONN when a _l_i_s_t_e_n call is made, creating two queues of sockets: _s_o___q_0 for connections in progress and _s_o___q for connections already made and awaiting user accep- tance. As a protocol is preparing incoming connections, it creates a socket structure queued on _s_o___q_0 by calling the routine _s_o_n_e_w_c_o_n_n(). When the connection is established, the socket structure is then transferred to _s_o___q, making it Networking Implementation Notes SMM:18-13 available for an _a_c_c_e_p_t. If an SO_ACCEPTCONN socket is closed with sockets on either _s_o___q_0 or _s_o___q, these sockets are dropped, with noti- fication to the peers as appropriate. 66..22.. PPrroottooccooll llaayyeerr((ss)) Each socket is created in a communications domain, which usually implies both an addressing structure (address family) and a set of protocols which implement various socket types within the domain (protocol family). Each domain is defined by the following structure: struct domain { int dom_family; /* PF_xxx */ char *dom_name; int (*dom_init)(); /* initialize domain data structures */ int (*dom_externalize)(); /* externalize access rights */ int (*dom_dispose)(); /* dispose of internalized rights */ struct protosw *dom_protosw, *dom_protoswNPROTOSW; struct domain *dom_next; }; At boot time, each domain configured into the kernel is added to a linked list of domain. The initialization proce- dure of each domain is then called. After that time, the domain structure is used to locate protocols within the pro- tocol family. It may also contain procedure references for externalization of access rights at the receiving socket and the disposal of access rights that are not received. Protocols are described by a set of entry points and certain socket-visible characteristics, some of which are used in deciding which socket type(s) they may support. An entry in the ``protocol switch'' table exists for each protocol module configured into the system. It has the following form: SMM:18-14 Networking Implementation Notes struct protosw { short pr_type; /* socket type used for */ struct domain *pr_domain; /* domain protocol a member of */ short pr_protocol; /* protocol number */ short pr_flags; /* socket visible attributes */ /* protocol-protocol hooks */ int (*pr_input)(); /* input to protocol (from below) */ int (*pr_output)(); /* output to protocol (from above) */ int (*pr_ctlinput)(); /* control input (from below) */ int (*pr_ctloutput)(); /* control output (from above) */ /* user-protocol hook */ int (*pr_usrreq)(); /* user request */ /* utility hooks */ int (*pr_init)(); /* initialization routine */ int (*pr_fasttimo)(); /* fast timeout (200ms) */ int (*pr_slowtimo)(); /* slow timeout (500ms) */ int (*pr_drain)(); /* flush any excess space possible */ }; A protocol is called through the _p_r___i_n_i_t entry before any other. Thereafter it is called every 200 milliseconds through the _p_r___f_a_s_t_t_i_m_o entry and every 500 milliseconds through the _p_r___s_l_o_w_t_i_m_o for timer based actions. The system will call the _p_r___d_r_a_i_n entry if it is low on space and this should throw away any non-critical data. Protocols pass data between themselves as chains of mbufs using the _p_r___i_n_p_u_t and _p_r___o_u_t_p_u_t routines. _P_r___i_n_p_u_t passes data up (towards the user) and _p_r___o_u_t_p_u_t passes it down (towards the network); control information passes up and down on _p_r___c_t_l_i_n_p_u_t and _p_r___c_t_l_o_u_t_p_u_t. The protocol is responsible for the space occupied by any of the arguments to these entries and must either pass it onward or dispose of it. (On output, the lowest level reached must free buffers storing the arguments; on input, the highest level is responsible for freeing buffers.) The _p_r___u_s_r_r_e_q routine interfaces protocols to the socket code and is described below. The _p_r___f_l_a_g_s field is constructed from the following values: #define PR_ATOMIC 0x01 /* exchange atomic messages only */ #define PR_ADDR 0x02 /* addresses given with messages */ #define PR_CONNREQUIRED 0x04 /* connection required by protocol */ #define PR_WANTRCVD 0x08 /* want PRU_RCVD calls */ #define PR_RIGHTS 0x10 /* passes capabilities */ Protocols which are connection-based specify the PR_CONNRE- QUIRED flag so that the socket routines will never attempt to send data before a connection has been established. If the PR_WANTRCVD flag is set, the socket routines will notify Networking Implementation Notes SMM:18-15 the protocol when the user has removed data from the socket's receive queue. This allows the protocol to imple- ment acknowledgement on user receipt, and also update win- dowing information based on the amount of space available in the receive queue. The PR_ADDR field indicates that any data placed in the socket's receive queue will be preceded by the address of the sender. The PR_ATOMIC flag specifies that each _u_s_e_r request to send data must be performed in a single _p_r_o_t_o_c_o_l send request; it is the protocol's responsi- bility to maintain record boundaries on data to be sent. The PR_RIGHTS flag indicates that the protocol supports the passing of capabilities; this is currently used only by the protocols in the UNIX protocol family. When a socket is created, the socket routines scan the protocol table for the domain looking for an appropriate protocol to support the type of socket being created. The _p_r___t_y_p_e field contains one of the possible socket types (e.g. SOCK_STREAM), while the _p_r___d_o_m_a_i_n is a back pointer to the domain structure. The _p_r___p_r_o_t_o_c_o_l field contains the protocol number of the protocol, normally a well-known value. 66..33.. NNeettwwoorrkk--iinntteerrffaaccee llaayyeerr Each network-interface configured into a system defines a path through which packets may be sent and received. Nor- mally a hardware device is associated with this interface, though there is no requirement for this (for example, all systems have a software ``loopback'' interface used for debugging and performance analysis). In addition to manipu- lating the hardware device, an interface module is responsi- ble for encapsulation and decapsulation of any link-layer header information required to deliver a message to its des- tination. The selection of which interface to use in deliv- ering packets is a routing decision carried out at a higher level than the network-interface layer. An interface may have addresses in one or more address families. The address is set at boot time using an _i_o_c_t_l on a socket in the appro- priate domain; this operation is implemented by the protocol family, after verifying the operation through the device _i_o_c_t_l entry. An interface is defined by the following structure, SMM:18-16 Networking Implementation Notes struct ifnet { char *if_name; /* name, e.g. ``en'' or ``lo'' */ short if_unit; /* sub-unit for lower level driver */ short if_mtu; /* maximum transmission unit */ short if_flags; /* up/down, broadcast, etc. */ short if_timer; /* time 'til if_watchdog called */ struct ifaddr *if_addrlist; /* list of addresses of interface */ struct ifqueue if_snd; /* output queue */ int (*if_init)(); /* init routine */ int (*if_output)(); /* output routine */ int (*if_ioctl)(); /* ioctl routine */ int (*if_reset)(); /* bus reset routine */ int (*if_watchdog)(); /* timer routine */ int if_ipackets; /* packets received on interface */ int if_ierrors; /* input errors on interface */ int if_opackets; /* packets sent on interface */ int if_oerrors; /* output errors on interface */ int if_collisions; /* collisions on csma interfaces */ struct ifnet *if_next; }; Each interface address has the following form: struct ifaddr { struct sockaddr ifa_addr; /* address of interface */ union { struct sockaddr ifu_broadaddr; struct sockaddr ifu_dstaddr; } ifa_ifu; struct ifnet *ifa_ifp; /* back-pointer to interface */ struct ifaddr *ifa_next; /* next address for interface */ }; #define ifa_broadaddr ifa_ifu.ifu_broadaddr /* broadcast address */ #define ifa_dstaddr ifa_ifu.ifu_dstaddr /* other end of p-to-p link */ The protocol generally maintains this structure as part of a larger structure containing additional information concern- ing the address. Each interface has a send queue and routines used for initialization, _i_f___i_n_i_t, and output, _i_f___o_u_t_p_u_t. If the interface resides on a system bus, the routine _i_f___r_e_s_e_t will be called after a bus reset has been performed. An inter- face may also specify a timer routine, _i_f___w_a_t_c_h_d_o_g; if _i_f___t_i_m_e_r is non-zero, it is decremented once per second until it reaches zero, at which time the watchdog routine is called. The state of an interface and certain characteristics are stored in the _i_f___f_l_a_g_s field. The following values are possible: Networking Implementation Notes SMM:18-17 #define IFF_UP 0x1 /* interface is up */ #define IFF_BROADCAST 0x2 /* broadcast is possible */ #define IFF_DEBUG 0x4 /* turn on debugging */ #define IFF_LOOPBACK 0x8 /* is a loopback net */ #define IFF_POINTOPOINT 0x10 /* interface is point-to-point link */ #define IFF_NOTRAILERS 0x20 /* avoid use of trailers */ #define IFF_RUNNING 0x40 /* resources allocated */ #define IFF_NOARP 0x80 /* no address resolution protocol */ If the interface is connected to a network which supports transmission of _b_r_o_a_d_c_a_s_t packets, the IFF_BROADCAST flag will be set and the _i_f_a___b_r_o_a_d_a_d_d_r field will contain the address to be used in sending or accepting a broadcast packet. If the interface is associated with a point-to- point hardware link (for example, a DEC DMR-11), the IFF_POINTOPOINT flag will be set and _i_f_a___d_s_t_a_d_d_r will con- tain the address of the host on the other side of the con- nection. These addresses and the local address of the interface, _i_f___a_d_d_r, are used in filtering incoming packets. The interface sets IFF_RUNNING after it has allocated system resources and posted an initial read on the device it man- ages. This state bit is used to avoid multiple allocation requests when an interface's address is changed. The IFF_NOTRAILERS flag indicates the interface should refrain from using a _t_r_a_i_l_e_r encapsulation on outgoing packets, or (where per-host negotiation of trailers is possible) that trailer encapsulations should not be requested; _t_r_a_i_l_e_r pro- tocols are described in section 14. The IFF_NOARP flag indicates the interface should not use an ``address resolu- tion protocol'' in mapping internetwork addresses to local network addresses. Various statistics are also stored in the interface structure. These may be viewed by users using the _n_e_t_- _s_t_a_t(1) program. The interface address and flags may be set with the SIOCSIFADDR and SIOCSIFFLAGS _i_o_c_t_ls. SIOCSIFADDR is used initially to define each interface's address; SIOGSIFFLAGS can be used to mark an interface down and perform site-spe- cific configuration. The destination address of a point-to- point link is set with SIOCSIFDSTADDR. Corresponding opera- tions exist to read each value. Protocol families may also support operations to set and read the broadcast address. In addition, the SIOCGIFCONF _i_o_c_t_l retrieves a list of interface names and addresses for all interfaces and proto- cols on the host. 66..33..11.. UUNNIIBBUUSS iinntteerrffaacceess All hardware related interfaces currently reside on the UNIBUS. Consequently a common set of utility routines for dealing with the UNIBUS has been developed. Each UNIBUS interface utilizes a structure of the following form: SMM:18-18 Networking Implementation Notes struct ifubinfo { short iff_uban; /* uba number */ short iff_hlen; /* local net header length */ struct uba_regs *iff_uba; /* uba regs, in vm */ short iff_flags; /* used during uballoc's */ }; Additional structures are associated with each receive and transmit buffer, normally one each per interface; for read, struct ifrw { caddr_t ifrw_addr; /* virt addr of header */ short ifrw_bdp; /* unibus bdp */ short ifrw_flags; /* type, etc. */ #define IFRW_W 0x01 /* is a transmit buffer */ int ifrw_info; /* value from ubaalloc */ int ifrw_proto; /* map register prototype */ struct pte *ifrw_mr; /* base of map registers */ }; and for write, struct ifxmt { struct ifrw ifrw; caddr_t ifw_base; /* virt addr of buffer */ struct pte ifw_wmap[IF_MAXNUBAMR]; /* base pages for output */ struct mbuf *ifw_xtofree; /* pages being dma'd out */ short ifw_xswapd; /* mask of clusters swapped */ short ifw_nmr; /* number of entries in wmap */ }; #define ifw_addr ifrw.ifrw_addr #define ifw_bdp ifrw.ifrw_bdp #define ifw_flags ifrw.ifrw_flags #define ifw_info ifrw.ifrw_info #define ifw_proto ifrw.ifrw_proto #define ifw_mr ifrw.ifrw_mr One of each of these structures is conveniently packaged for interfaces with single buffers for each direction, as fol- lows: struct ifuba { struct ifubinfo ifu_info; struct ifrw ifu_r; struct ifxmt ifu_xmt; }; #define ifu_uban ifu_info.iff_uban #define ifu_hlen ifu_info.iff_hlen #define ifu_uba ifu_info.iff_uba #define ifu_flags ifu_info.iff_flags #define ifu_w ifu_xmt.ifrw #define ifu_xtofree ifu_xmt.ifw_xtofree Networking Implementation Notes SMM:18-19 The _i_f___u_b_i_n_f_o structure contains the general informa- tion needed to characterize the I/O-mapped buffers for the device. In addition, there is a structure describing each buffer, including UNIBUS resources held by the interface. Sufficient memory pages and bus map registers are allocated to each buffer upon initialization according to the maximum packet size and header length. The kernel virtual address of the buffer is held in _i_f_r_w___a_d_d_r, and the map registers begin at _i_f_r_w___m_r. UNIBUS map register _i_f_r_w___m_r[-1] maps the local network header ending on a page boundary. UNIBUS data paths are reserved for read and for write, given by _i_f_r_w___b_d_p. The prototype of the map registers for read and for write is saved in _i_f_r_w___p_r_o_t_o. When write transfers are not at least half-full pages on page boundaries, the data are just copied into the pages mapped on the UNIBUS and the transfer is started. If a write transfer is at least half a page long and on a page boundary, UNIBUS page table entries are swapped to reference the pages, and then the initial pages are remapped from _i_f_w___w_m_a_p when the transfer completes. The mbufs containing the mapped pages are placed on the _i_f_w___x_t_o_f_r_e_e queue to be freed after transmission. When read transfers give at least half a page of data to be input, page frames are allocated from a network page list and traded with the pages already containing the data, mapping the allocated pages to replace the input pages for the next UNIBUS data input. The following utility routines are available for use in writing network interface drivers; all use the structures described above. if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx); if_ubainit(ifuba, uban, hlen, nmr); _i_f___u_b_a_m_i_n_i_t allocates resources on UNIBUS adapter _u_b_a_n, storing the information in the _i_f_u_b_i_n_f_o, _i_f_r_w and _i_f_x_m_t structures referenced. The _i_f_r and _i_f_x parameters are pointers to arrays of _i_f_r_w and _i_f_x_m_t structures whose dimensions are _n_r and _n_x, respectively. _i_f___u_b_a_i_n_i_t is a simpler, backwards-compatible interface used for hardware with single buffers of each type. They are called only at boot time or after a UNIBUS reset. One data path (buffered or unbuffered, depending on the _i_f_u___f_l_a_g_s field) is allocated for each buffer. The _n_m_r parameter indicates the number of UNIBUS mapping regis- ters required to map a maximal sized packet onto the UNIBUS, while _h_l_e_n specifies the size of a local net- work header, if any, which should be mapped separately from the data (see the description of trailer protocols in chapter 14). Sufficient UNIBUS mapping registers and pages of memory are allocated to initialize the SMM:18-20 Networking Implementation Notes input data path for an initial read. For the output data path, mapping registers and pages of memory are also allocated and mapped onto the UNIBUS. The pages associated with the output data path are held in reserve in the event a write requires copying non-page- aligned data (see _i_f___w_u_b_a_p_u_t below). If _i_f___u_b_a_i_n_i_t is called with memory pages already allocated, they will be used instead of allocating new ones (this normally occurs after a UNIBUS reset). A 1 is returned when allocation and initialization are successful, 0 other- wise. m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp); m = if_rubaget(ifuba, totlen, off0, ifp); _i_f___u_b_a_g_e_t and _i_f___r_u_b_a_g_e_t pull input data out of an interface receive buffer and into an mbuf chain. The first interface passes pointers to the _i_f_u_b_i_n_f_o struc- ture for the interface and the _i_f_r_w structure for the receive buffer; the second call may be used for single- buffered devices. _t_o_t_l_e_n specifies the length of data to be obtained, not counting the local network header. If _o_f_f_0 is non-zero, it indicates a byte offset to a trailing local network header which should be copied into a separate mbuf and prepended to the front of the resultant mbuf chain. When the data amount to at least a half a page, the previously mapped data pages are remapped into the mbufs and swapped with fresh pages, thus avoiding any copy. The receiving interface is recorded as _i_f_p, a pointer to an _i_f_n_e_t structure, for the use of the receiving network protocol. A 0 return value indicates a failure to allocate resources. if_wubaput(ifubinfo, ifx, m); if_wubaput(ifuba, m); _i_f___u_b_a_p_u_t and _i_f___w_u_b_a_p_u_t map a chain of mbufs onto a network interface in preparation for output. The first interface is used by devices with multiple transmit buffers. The chain includes any local network header, which is copied so that it resides in the mapped and aligned I/O space. Page-aligned data that are page- aligned in the output buffer are mapped to the UNIBUS in place of the normal buffer page, and the correspond- ing mbuf is placed on a queue to be freed after trans- mission. Any other mbufs which contained non-page- sized data portions are copied to the I/O space and then freed. Pages mapped from a previous output opera- tion (no longer needed) are unmapped. Networking Implementation Notes SMM:18-21 77.. SSoocckkeett//pprroottooccooll iinntteerrffaaccee The interface between the socket routines and the com- munication protocols is through the _p_r___u_s_r_r_e_q routine defined in the protocol switch table. The following requests to a protocol module are possible: #define PRU_ATTACH 0 /* attach protocol */ #define PRU_DETACH 1 /* detach protocol */ #define PRU_BIND 2 /* bind socket to address */ #define PRU_LISTEN 3 /* listen for connection */ #define PRU_CONNECT 4 /* establish connection to peer */ #define PRU_ACCEPT 5 /* accept connection from peer */ #define PRU_DISCONNECT 6 /* disconnect from peer */ #define PRU_SHUTDOWN 7 /* won't send any more data */ #define PRU_RCVD 8 /* have taken data; more room now */ #define PRU_SEND 9 /* send this data */ #define PRU_ABORT 10 /* abort (fast DISCONNECT, DETATCH) */ #define PRU_CONTROL 11 /* control operations on protocol */ #define PRU_SENSE 12 /* return status into m */ #define PRU_RCVOOB 13 /* retrieve out of band data */ #define PRU_SENDOOB 14 /* send out of band data */ #define PRU_SOCKADDR 15 /* fetch socket's address */ #define PRU_PEERADDR 16 /* fetch peer's address */ #define PRU_CONNECT2 17 /* connect two sockets */ /* begin for protocols internal use */ #define PRU_FASTTIMO 18 /* 200ms timeout */ #define PRU_SLOWTIMO 19 /* 500ms timeout */ #define PRU_PROTORCV 20 /* receive from below */ #define PRU_PROTOSEND 21 /* send to below */ A call on the user request routine is of the form, error = (*protosw[].pr_usrreq)(so, req, m, addr, rights); int error; struct socket *so; int req; struct mbuf *m, *addr, *rights; The mbuf data chain _m is supplied for output operations and for certain other operations where it is to receive a result. The address _a_d_d_r is supplied for address-oriented requests such as PRU_BIND and PRU_CONNECT. The _r_i_g_h_t_s parameter is an optional pointer to an mbuf chain containing user-specified capabilities (see the _s_e_n_d_m_s_g and _r_e_c_v_m_s_g system calls). The protocol is responsible for disposal of the data mbuf chains on output operations. A non-zero return value gives a UNIX error number which should be passed to higher level software. The following paragraphs describe each of the requests possible. PRU_ATTACH When a protocol is bound to a socket (with the _s_o_c_k_e_t system call) the protocol module is called with this request. It is the responsibility of the protocol mod- ule to allocate any resources necessary. The ``attach'' request will always precede any of the other SMM:18-22 Networking Implementation Notes requests, and should not occur more than once. PRU_DETACH This is the antithesis of the attach request, and is used at the time a socket is deleted. The protocol module may deallocate any resources assigned to the socket. PRU_BIND When a socket is initially created it has no address bound to it. This request indicates that an address should be bound to an existing socket. The protocol module must verify that the requested address is valid and available for use. PRU_LISTEN The ``listen'' request indicates the user wishes to listen for incoming connection requests on the associ- ated socket. The protocol module should perform any state changes needed to carry out this request (if pos- sible). A ``listen'' request always precedes a request to accept a connection. PRU_CONNECT The ``connect'' request indicates the user wants to a establish an association. The _a_d_d_r parameter supplied describes the peer to be connected to. The effect of a connect request may vary depending on the protocol. Virtual circuit protocols, such as TCP [Postel81b], use this request to initiate establishment of a TCP connec- tion. Datagram protocols, such as UDP [Postel80], sim- ply record the peer's address in a private data struc- ture and use it to tag all outgoing packets. There are no restrictions on how many times a connect request may be used after an attach. If a protocol supports the notion of _m_u_l_t_i_-_c_a_s_t_i_n_g, it is possible to use multiple connects to establish a multi-cast group. Alterna- tively, an association may be broken by a PRU_DISCON- NECT request, and a new association created with a sub- sequent connect request; all without destroying and creating a new socket. PRU_ACCEPT Following a successful PRU_LISTEN request and the arrival of one or more connections, this request is made to indicate the user has accepted the first con- nection on the queue of pending connections. The pro- tocol module should fill in the supplied address buffer with the address of the connected party. PRU_DISCONNECT Eliminate an association created with a PRU_CONNECT request. Networking Implementation Notes SMM:18-23 PRU_SHUTDOWN This call is used to indicate no more data will be sent and/or received (the _a_d_d_r parameter indicates the direction of the shutdown, as encoded in the _s_o_s_h_u_t_d_o_w_n system call). The protocol may, at its discretion, deallocate any data structures related to the shutdown and/or notify a connected peer of the shutdown. PRU_RCVD This request is made only if the protocol entry in the protocol switch table includes the PR_WANTRCVD flag. When a user removes data from the receive queue this request will be sent to the protocol module. It may be used to trigger acknowledgements, refresh windowing information, initiate data transfer, etc. PRU_SEND Each user request to send data is translated into one or more PRU_SEND requests (a protocol may indicate that a single user send request must be translated into a single PRU_SEND request by specifying the PR_ATOMIC flag in its protocol description). The data to be sent is presented to the protocol as a list of mbufs and an address is, optionally, supplied in the _a_d_d_r parameter. The protocol is responsible for preserving the data in the socket's send queue if it is not able to send it immediately, or if it may need it at some later time (e.g. for retransmission). PRU_ABORT This request indicates an abnormal termination of ser- vice. The protocol should delete any existing associa- tion(s). PRU_CONTROL The ``control'' request is generated when a user per- forms a UNIX _i_o_c_t_l system call on a socket (and the ioctl is not intercepted by the socket routines). It allows protocol-specific operations to be provided out- side the scope of the common socket interface. The _a_d_d_r parameter contains a pointer to a static kernel data area where relevant information may be obtained or returned. The _m parameter contains the actual _i_o_c_t_l request code (note the non-standard calling conven- tion). The _r_i_g_h_t_s parameter contains a pointer to an _i_f_n_e_t structure if the _i_o_c_t_l operation pertains to a particular network interface. PRU_SENSE The ``sense'' request is generated when the user makes an _f_s_t_a_t system call on a socket; it requests status of the associated socket. This currently returns a stan- dard _s_t_a_t structure. It typically contains only the optimal transfer size for the connection (based on SMM:18-24 Networking Implementation Notes buffer size, windowing information and maximum packet size). The _m parameter contains a pointer to a static kernel data area where the status buffer should be placed. PRU_RCVOOB Any ``out-of-band'' data presently available is to be returned. An mbuf is passed to the protocol module, and the protocol should either place data in the mbuf or attach new mbufs to the one supplied if there is insufficient space in the single mbuf. An error may be returned if out-of-band data is not (yet) available or has already been consumed. The _a_d_d_r parameter contains any options such as MSG_PEEK to examine data without consuming it. PRU_SENDOOB Like PRU_SEND, but for out-of-band data. PRU_SOCKADDR The local address of the socket is returned, if any is currently bound to it. The address (with protocol spe- cific format) is returned in the _a_d_d_r parameter. PRU_PEERADDR The address of the peer to which the socket is con- nected is returned. The socket must be in a SS_ISCON- NECTED state for this request to be made to the proto- col. The address format (protocol specific) is returned in the _a_d_d_r parameter. PRU_CONNECT2 The protocol module is supplied two sockets and requested to establish a connection between the two without binding any addresses, if possible. This call is used in implementing the _s_o_c_k_e_t_p_a_i_r(2) system call. The following requests are used internally by the pro- tocol modules and are never generated by the socket rou- tines. In certain instances, they are handed to the _p_r___u_s_r_- _r_e_q routine solely for convenience in tracing a protocol's operation (e.g. PRU_SLOWTIMO). PRU_FASTTIMO A ``fast timeout'' has occurred. This request is made when a timeout occurs in the protocol's _p_r___f_a_s_t_i_m_o rou- tine. The _a_d_d_r parameter indicates which timer expired. PRU_SLOWTIMO A ``slow timeout'' has occurred. This request is made when a timeout occurs in the protocol's _p_r___s_l_o_w_t_i_m_o routine. The _a_d_d_r parameter indicates which timer expired. Networking Implementation Notes SMM:18-25 PRU_PROTORCV This request is used in the protocol-protocol inter- face, not by the routines. It requests reception of data destined for the protocol and not the user. No protocols currently use this facility. PRU_PROTOSEND This request allows a protocol to send data destined for another protocol module, not a user. The details of how data is marked ``addressed to protocol'' instead of ``addressed to user'' are left to the protocol mod- ules. No protocols currently use this facility. 88.. PPrroottooccooll//pprroottooccooll iinntteerrffaaccee The interface between protocol modules is through the _p_r___u_s_r_r_e_q, _p_r___i_n_p_u_t, _p_r___o_u_t_p_u_t, _p_r___c_t_l_i_n_p_u_t, and _p_r___c_t_l_o_u_t_- _p_u_t routines. The calling conventions for all but the _p_r___u_s_r_r_e_q routine are expected to be specific to the proto- col modules and are not guaranteed to be consistent across protocol families. We will examine the conventions used for some of the Internet protocols in this section as an exam- ple. 88..11.. pprr__oouuttppuutt The Internet protocol UDP uses the convention, error = udp_output(inp, m); int error; struct inpcb *inp; struct mbuf *m; where the _i_n_p, ``_i_nternet _protocol _control _block'', passed between modules conveys per connection state information, and the mbuf chain contains the data to be sent. UDP per- forms consistency checks, appends its header, calculates a checksum, etc. before passing the packet on. UDP is based on the Internet Protocol, IP [Postel81a], as its transport. UDP passes a packet to the IP module for output as follows: error = ip_output(m, opt, ro, flags); int error; struct mbuf *m, *opt; struct route *ro; int flags; The call to IP's output routine is more complicated than that for UDP, as befits the additional work the IP mod- ule must do. The _m parameter is the data to be sent, and the _o_p_t parameter is an optional list of IP options which should be placed in the IP packet header. The _r_o parameter is is used in making routing decisions (and passing them back to the caller for use in subsequent calls). The final parameter, _f_l_a_g_s contains flags indicating whether the user is allowed to transmit a broadcast packet and if routing is to be performed. The broadcast flag may be inconsequential if the underlying hardware does not support the notion of SMM:18-26 Networking Implementation Notes broadcasting. All output routines return 0 on success and a UNIX error number if a failure occurred which could be detected immediately (no buffer space available, no route to destina- tion, etc.). 88..22.. pprr__iinnppuutt Both UDP and TCP use the following calling convention, (void) (*protosw[].pr_input)(m, ifp); struct mbuf *m; struct ifnet *ifp; Each mbuf list passed is a single packet to be processed by the protocol module. The interface from which the packet was received is passed as the second parameter. The IP input routine is a VAX software interrupt level routine, and so is not called with any parameters. It instead communicates with network interfaces through a queue, _i_p_i_n_t_r_q, which is identical in structure to the queues used by the network interfaces for storing packets awaiting transmission. The software interrupt is enabled by the network interfaces when they place input data on the input queue. 88..33.. pprr__ccttlliinnppuutt This routine is used to convey ``control'' information to a protocol module (i.e. information which might be passed to the user, but is not data). The common calling convention for this routine is, (void) (*protosw[].pr_ctlinput)(req, addr); int req; struct sockaddr *addr; The _r_e_q parameter is one of the following, Networking Implementation Notes SMM:18-27 #define PRC_IFDOWN 0 /* interface transition */ #define PRC_ROUTEDEAD 1 /* select new route if possible */ #define PRC_QUENCH 4 /* some said to slow down */ #define PRC_MSGSIZE 5 /* message size forced drop */ #define PRC_HOSTDEAD 6 /* normally from IMP */ #define PRC_HOSTUNREACH 7 /* ditto */ #define PRC_UNREACH_NET 8 /* no route to network */ #define PRC_UNREACH_HOST 9 /* no route to host */ #define PRC_UNREACH_PROTOCOL 10 /* dst says bad protocol */ #define PRC_UNREACH_PORT 11 /* bad port # */ #define PRC_UNREACH_NEEDFRAG 12 /* IP_DF caused drop */ #define PRC_UNREACH_SRCFAIL 13 /* source route failed */ #define PRC_REDIRECT_NET 14 /* net routing redirect */ #define PRC_REDIRECT_HOST 15 /* host routing redirect */ #define PRC_REDIRECT_TOSNET 14 /* redirect for type of service & net */ #define PRC_REDIRECT_TOSHOST 15 /* redirect for tos & host */ #define PRC_TIMXCEED_INTRANS 18 /* packet lifetime expired in transit */ #define PRC_TIMXCEED_REASS 19 /* lifetime expired on reass q */ #define PRC_PARAMPROB 20 /* header incorrect */ while the _a_d_d_r parameter is the address to which the condi- tion applies. Many of the requests have obviously been derived from ICMP (the Internet Control Message Protocol [Postel81c]), and from error messages defined in the 1822 host/IMP convention [BBN78]. Mapping tables exist to con- vert control requests to UNIX error codes which are deliv- ered to a user. 88..44.. pprr__ccttlloouuttppuutt This is the routine that implements per-socket options at the protocol level for _g_e_t_s_o_c_k_o_p_t and _s_e_t_s_o_c_k_o_p_t. The calling convention is, error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp); int op; struct socket *so; int level, optname; struct mbuf **mp; where _o_p is one of PRCO_SETOPT or PRCO_GETOPT, _s_o is the socket from whence the call originated, and _l_e_v_e_l and _o_p_t_- _n_a_m_e are the protocol level and option name supplied by the user. The results of a PRCO_GETOPT call are returned in an mbuf whose address is placed in _m_p before return. On a PRCO_SETOPT call, _m_p contains the address of an mbuf con- taining the option data; the mbuf should be freed before return. SMM:18-28 Networking Implementation Notes 99.. PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee The lowest layer in the set of protocols which comprise a protocol family must interface itself to one or more net- work interfaces in order to transmit and receive packets. It is assumed that any routing decisions have been made before handing a packet to a network interface, in fact this is absolutely necessary in order to locate any interface at all (unless, of course, one uses a single ``hardwired'' interface). There are two cases with which to be concerned, transmission of a packet and receipt of a packet; each will be considered separately. 99..11.. PPaacckkeett ttrraannssmmiissssiioonn Assuming a protocol has a handle on an interface, _i_f_p, a (struct ifnet *), it transmits a fully formatted packet with the following call, error = (*ifp->if_output)(ifp, m, dst) int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst; The output routine for the network interface transmits the packet _m to the _d_s_t address, or returns an error indication (a UNIX error number). In reality transmission may not be immediate or successful; normally the output routine simply queues the packet on its send queue and primes an interrupt driven routine to actually transmit the packet. For unreli- able media, such as the Ethernet, ``successful'' transmis- sion simply means that the packet has been placed on the cable without a collision. On the other hand, an 1822 interface guarantees proper delivery or an error indication for each message transmitted. The model employed in the networking system attaches no promises of delivery to the packets handed to a network interface, and thus corresponds more closely to the Ethernet. Errors returned by the output routine are only those that can be detected immediately, and are normally trivial in nature (no buffer space, address format not handled, etc.). No indication is received if errors are detected after the call has returned. 99..22.. PPaacckkeett rreecceeppttiioonn Each protocol family must have one or more ``lowest level'' protocols. These protocols deal with internetwork addressing and are responsible for the delivery of incoming packets to the proper protocol processing modules. In the PUP model [Boggs78] these protocols are termed Level 1 pro- tocols, in the ISO model, network layer protocols. In this system each such protocol module has an input packet queue assigned to it. Incoming packets received by a network interface are queued for the protocol module, and a VAX software interrupt is posted to initiate processing. Networking Implementation Notes SMM:18-29 Three macros are available for queuing and dequeuing packets: IF_ENQUEUE(ifq, m) This places the packet _m at the tail of the queue _i_f_q. IF_DEQUEUE(ifq, m) This places a pointer to the packet at the head of queue _i_f_q in _m and removes the packet from the queue. A zero value will be returned in _m if the queue is empty. IF_DEQUEUEIF(ifq, m, ifp) Like IF_DEQUEUE, this removes the next packet from the head of a queue and returns it in _m. A pointer to the interface on which the packet was received is placed in _i_f_p, a (struct ifnet *). IF_PREPEND(ifq, m) This places the packet _m at the head of the queue _i_f_q. Each queue has a maximum length associated with it as a simple form of congestion control. The macro IF_QFULL(ifq) returns 1 if the queue is filled, in which case the macro IF_DROP(ifq) should be used to increment the count of the number of packets dropped, and the offending packet is dropped. For example, the following code fragment is com- monly found in a network interface's input routine, if (IF_QFULL(inq)) { IF_DROP(inq); m_freem(m); } else IF_ENQUEUE(inq, m); 1100.. GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess The system has been designed with the expectation that it will be used in an internetwork environment. The ``canonical'' environment was envisioned to be a collection of local area networks connected at one or more points through hosts with multiple network interfaces (one on each local area network), and possibly a connection to a long haul network (for example, the ARPANET). In such an envi- ronment, issues of gatewaying and packet routing become very important. Certain of these issues, such as congestion con- trol, have been handled in a simplistic manner or specifi- cally not addressed. Instead, where possible, the network system attempts to provide simple mechanisms upon which more involved policies may be implemented. As some of these problems become better understood, the solutions developed will be incorporated into the system. SMM:18-30 Networking Implementation Notes This section will describe the facilities provided for packet routing. The simplistic mechanisms provided for con- gestion control are described in chapter 12. 1100..11.. RRoouuttiinngg ttaabblleess The network system maintains a set of routing tables for selecting a network interface to use in delivering a packet to its destination. These tables are of the form: struct rtentry { u_long rt_hash; /* hash key for lookups */ struct sockaddr rt_dst; /* destination net or host */ struct sockaddr rt_gateway; /* forwarding agent */ short rt_flags; /* see below */ short rt_refcnt; /* no. of references to structure */ u_long rt_use; /* packets sent using route */ struct ifnet *rt_ifp; /* interface to give packet to */ }; The routing information is organized in two separate tables, one for routes to a host and one for routes to a network. The distinction between hosts and networks is nec- essary so that a single mechanism may be used for both broadcast and multi-drop type networks, and also for net- works built from point-to-point links (e.g DECnet [DEC80]). Each table is organized as a hashed set of linked lists. Two 32-bit hash values are calculated by routines defined for each address family; one based on the destina- tion being a host, and one assuming the target is the net- work portion of the address. Each hash value is used to locate a hash chain to search (by taking the value modulo the hash table size) and the entire 32-bit value is then used as a key in scanning the list of routes. Lookups are applied first to the routing table for hosts, then to the routing table for networks. If both lookups fail, a final lookup is made for a ``wildcard'' route (by convention, net- work 0). The first appropriate route discovered is used. By doing this, routes to a specific host on a network may be present as well as routes to the network. This also allows a ``fall back'' network route to be defined to a ``smart'' gateway which may then perform more intelligent routing. Each routing table entry contains a destination (the desired final destination), a gateway to which to send the packet, and various flags which indicate the route's status and type (host or network). A count of the number of pack- ets sent using the route is kept, along with a count of ``held references'' to the dynamically allocated structure to insure that memory reclamation occurs only when the route is not in use. Finally, a pointer to the a network inter- face is kept; packets sent using the route should be handed Networking Implementation Notes SMM:18-31 to this interface. Routes are typed in two ways: either as host or net- work, and as ``direct'' or ``indirect''. The host/network distinction determines how to compare the _r_t___d_s_t field dur- ing lookup. If the route is to a network, only a packet's destination network is compared to the _r_t___d_s_t entry stored in the table. If the route is to a host, the addresses must match bit for bit. The distinction between ``direct'' and ``indirect'' routes indicates whether the destination is directly con- nected to the source. This is needed when performing local network encapsulation. If a packet is destined for a peer at a host or network which is not directly connected to the source, the internetwork packet header will contain the address of the eventual destination, while the local network header will address the intervening gateway. Should the destination be directly connected, these addresses are likely to be identical, or a mapping between the two exists. The RTF_GATEWAY flag indicates that the route is to an ``indirect'' gateway agent, and that the local network header should be filled in from the _r_t___g_a_t_e_w_a_y field instead of from the final internetwork destination address. It is assumed that multiple routes to the same destina- tion will not be present; only one of multiple routes, that most recently installed, will be used. Routing redirect control messages are used to dynami- cally modify existing routing table entries as well as dynamically create new routing table entries. On hosts where exhaustive routing information is too expensive to maintain (e.g. work stations), the combination of wildcard routing entries and routing redirect messages can be used to provide a simple routing management scheme without the use of a higher level policy process. Current connections may be rerouted after notification of the protocols by means of their _p_r___c_t_l_i_n_p_u_t entries. Statistics are kept by the rout- ing table routines on the use of routing redirect messages and their affect on the routing tables. These statistics may be viewed using _n_e_t_s_t_a_t(1). Status information other than routing redirect control messages may be used in the future, but at present they are ignored. Likewise, more intelligent ``metrics'' may be used to describe routes in the future, possibly based on band- width and monetary costs. 1100..22.. RRoouuttiinngg ttaabbllee iinntteerrffaaccee A protocol accesses the routing tables through three routines, one to allocate a route, one to free a route, and one to process a routing redirect control message. The SMM:18-32 Networking Implementation Notes routine _r_t_a_l_l_o_c performs route allocation; it is called with a pointer to the following structure containing the desired destination: struct route { struct rtentry *ro_rt; struct sockaddr ro_dst; }; The route returned is assumed ``held'' by the caller until released with an _r_t_f_r_e_e call. Protocols which implement virtual circuits, such as TCP, hold onto routes for the duration of the circuit's lifetime, while connection-less protocols, such as UDP, allocate and free routes whenever their destination address changes. The routine _r_t_r_e_d_i_r_e_c_t is called to process a routing redirect control message. It is called with a destination address, the new gateway to that destination, and the source of the redirect. Redirects are accepted only from the cur- rent router for the destination. If a non-wildcard route exists to the destination, the gateway entry in the route is modified to point at the new gateway supplied. Otherwise, a new routing table entry is inserted reflecting the informa- tion supplied. Routes to interfaces and routes to gateways which are not directly accessible from the host are ignored. 1100..33.. UUsseerr lleevveell rroouuttiinngg ppoolliicciieess Routing policies implemented in user processes manipu- late the kernel routing tables through two _i_o_c_t_l calls. The commands SIOCADDRT and SIOCDELRT add and delete routing entries, respectively; the tables are read through the /dev/kmem device. The decision to place policy decisions in a user process implies that routing table updates may lag a bit behind the identification of new routes, or the failure of existing routes, but this period of instability is nor- mally very small with proper implementation of the routing process. Advisory information, such as ICMP error messages and IMP diagnostic messages, may be read from raw sockets (described in the next section). Several routing policy processes have already been implemented. The system standard ``routing daemon'' uses a variant of the Xerox NS Routing Information Protocol [Xerox82] to maintain up-to-date routing tables in our local environment. Interaction with other existing routing proto- cols, such as the Internet EGP (Exterior Gateway Protocol), has been accomplished using a similar process. Networking Implementation Notes SMM:18-33 1111.. RRaaww ssoocckkeettss A raw socket is an object which allows users direct access to a lower-level protocol. Raw sockets are intended for knowledgeable processes which wish to take advantage of some protocol feature not directly accessible through the normal interface, or for the development of new protocols built atop existing lower level protocols. For example, a new version of TCP might be developed at the user level by utilizing a raw IP socket for delivery of packets. The raw IP socket interface attempts to provide an identical inter- face to the one a protocol would have if it were resident in the kernel. The raw socket support is built around a generic raw socket interface, (possibly) augmented by protocol-specific processing routines. This section will describe the core of the raw socket interface. 1111..11.. CCoonnttrrooll bblloocckkss Every raw socket has a protocol control block of the following form: struct rawcb { struct rawcb *rcb_next; /* doubly linked list */ struct rawcb *rcb_prev; struct socket *rcb_socket; /* back pointer to socket */ struct sockaddr rcb_faddr; /* destination address */ struct sockaddr rcb_laddr; /* socket's address */ struct sockproto rcb_proto; /* protocol family, protocol */ caddr_t rcb_pcb; /* protocol specific stuff */ struct mbuf *rcb_options; /* protocol specific options */ struct route rcb_route; /* routing information */ short rcb_flags; }; All the control blocks are kept on a doubly linked list for performing lookups during packet dispatch. Associations may be recorded in the control block and used by the output rou- tine in preparing packets for transmission. The _r_c_b___p_r_o_t_o structure contains the protocol family and protocol number with which the raw socket is associated. The protocol, fam- ily and addresses are used to filter packets on input; this will be described in more detail shortly. If any protocol- specific information is required, it may be attached to the control block using the _r_c_b___p_c_b field. Protocol-specific options for transmission in outgoing packets may be stored in _r_c_b___o_p_t_i_o_n_s. A raw socket interface is datagram oriented. That is, each send or receive on the socket requires a destination address. This address may be supplied by the user or stored in the control block and automatically installed in the SMM:18-34 Networking Implementation Notes outgoing packet by the output routine. Since it is not pos- sible to determine whether an address is present or not in the control block, two flags, RAW_LADDR and RAW_FADDR, indi- cate if a local and foreign address are present. Routing is expected to be performed by the underlying protocol if nec- essary. 1111..22.. IInnppuutt pprroocceessssiinngg Input packets are ``assigned'' to raw sockets based on a simple pattern matching scheme. Each network interface or protocol gives unassigned packets to the raw input routine with the call: raw_input(m, proto, src, dst) struct mbuf *m; struct sockproto *proto, struct sockaddr *src, *dst; The data packet then has a generic header prepended to it of the form struct raw_header { struct sockproto raw_proto; struct sockaddr raw_dst; struct sockaddr raw_src; }; and it is placed in a packet queue for the ``raw input pro- tocol'' module. Packets taken from this queue are copied into any raw sockets that match the header according to the following rules, 1) The protocol family of the socket and header agree. 2) If the protocol number in the socket is non-zero, then it agrees with that found in the packet header. 3) If a local address is defined for the socket, the address format of the local address is the same as the destination address's and the two addresses agree bit for bit. 4) The rules of 3) are applied to the socket's foreign address and the packet's source address. A basic assumption is that addresses present in the control block and packet header (as constructed by the network interface and any raw input protocol module) are in a canon- ical form which may be ``block compared''. 1111..33.. OOuuttppuutt pprroocceessssiinngg On output the raw _p_r___u_s_r_r_e_q routine passes the packet and a pointer to the raw control block to the raw protocol output routine for any processing required before it is Networking Implementation Notes SMM:18-35 delivered to the appropriate network interface. The output routine is normally the only code required to implement a raw socket interface. 1122.. BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll One of the major factors in the performance of a proto- col is the buffering policy used. Lack of a proper buffer- ing policy can force packets to be dropped, cause falsified windowing information to be emitted by protocols, fragment host memory, degrade the overall host performance, etc. Due to problems such as these, most systems allocate a fixed pool of memory to the networking system and impose a policy optimized for ``normal'' network operation. The networking system developed for UNIX is little dif- ferent in this respect. At boot time a fixed amount of mem- ory is allocated by the networking system. At later times more system memory may be requested as the need arises, but at no time is memory ever returned to the system. It is possible to garbage collect memory from the network, but difficult. In order to perform this garbage collection properly, some portion of the network will have to be ``turned off'' as data structures are updated. The interval over which this occurs must kept small compared to the aver- age inter-packet arrival time, or too much traffic may be lost, impacting other hosts on the network, as well as increasing load on the interconnecting mediums. In our environment we have not experienced a need for such com- paction, and thus have left the problem unresolved. The mbuf structure was introduced in chapter 5. In this section a brief description will be given of the allo- cation mechanisms, and policies used by the protocols in performing connection level buffering. 1122..11.. MMeemmoorryy mmaannaaggeemmeenntt The basic memory allocation routines manage a private page map, the size of which determines the maximum amount of memory that may be allocated by the network. A small amount of memory is allocated at boot time to initialize the mbuf and mbuf page cluster free lists. When the free lists are exhausted, more memory is requested from the system memory allocator if space remains in the map. If memory cannot be allocated, callers may block awaiting free memory, or the failure may be reflected to the caller immediately. The allocator will not block awaiting free map entries, however, as exhaustion of the page map usually indicates that buffers have been lost due to a ``leak.'' The private page table is used by the network buffer management routines in remapping pages to be logically contiguous as the need arises. In addition, an array of reference counts parallels the page table and is used when multiple references to a page are SMM:18-36 Networking Implementation Notes present. Mbufs are 128 byte structures, 8 fitting in a 1Kbyte page of memory. When data is placed in mbufs, it is copied or remapped into logically contiguous pages of memory from the network page pool if possible. Data smaller than half of the size of a page is copied into one or more 112 byte mbuf data areas. 1122..22.. PPrroottooccooll bbuuffffeerriinngg ppoolliicciieess Protocols reserve fixed amounts of buffering for send and receive queues at socket creation time. These amounts define the high and low water marks used by the socket rou- tines in deciding when to block and unblock a process. The reservation of space does not currently result in any action by the memory management routines. Protocols which provide connection level flow control do this based on the amount of space in the associated socket queues. That is, send windows are calculated based on the amount of free space in the socket's receive queue, while receive windows are adjusted based on the amount of data awaiting transmission in the send queue. Care has been taken to avoid the ``silly window syndrome'' described in [Clark82] at both the sending and receiving ends. 1122..33.. QQuueeuuee lliimmiittiinngg Incoming packets from the network are always received unless memory allocation fails. However, each Level 1 pro- tocol input queue has an upper bound on the queue's length, and any packets exceeding that bound are discarded. It is possible for a host to be overwhelmed by excessive network traffic (for instance a host acting as a gateway from a high bandwidth network to a low bandwidth network). As a ``defensive'' mechanism the queue limits may be adjusted to throttle network traffic load on a host. Consider a host willing to devote some percentage of its machine to handling network traffic. If the cost of handling an incoming packet can be calculated so that an acceptable ``packet handling rate'' can be determined, then input queue lengths may be dynamically adjusted based on a host's network load and the number of packets awaiting processing. Obviously, discard- ing packets is not a satisfactory solution to a problem such as this (simply dropping packets is likely to increase the load on a network); the queue lengths were incorporated mainly as a safeguard mechanism. 1122..44.. PPaacckkeett ffoorrwwaarrddiinngg When packets can not be forwarded because of memory limitations, the system attempts to generate a ``source quench'' message. In addition, any other problems Networking Implementation Notes SMM:18-37 encountered during packet forwarding are also reflected back to the sender in the form of ICMP packets. This helps hosts avoid unneeded retransmissions. Broadcast packets are never forwarded due to possible dire consequences. In an early stage of network develop- ment, broadcast packets were forwarded and a ``routing loop'' resulted in network saturation and every host on the network crashing. 1133.. OOuutt ooff bbaanndd ddaattaa Out of band data is a facility peculiar to the stream socket abstraction defined. Little agreement appears to exist as to what its semantics should be. TCP defines the notion of ``urgent data'' as in-line, while the NBS proto- cols [Burruss81] and numerous others provide a fully inde- pendent logical transmission channel along which out of band data is to be sent. In addition, the amount of the data which may be sent as an out of band message varies from pro- tocol to protocol; everything from 1 bit to 16 bytes or more. A stream socket's notion of out of band data has been defined as the lowest reasonable common denominator (at least reasonable in our minds); clearly this is subject to debate. Out of band data is expected to be transmitted out of the normal sequencing and flow control constraints of the data stream. A minimum of 1 byte of out of band data and one outstanding out of band message are expected to be sup- ported by the protocol supporting a stream socket. It is a protocol's prerogative to support larger-sized messages, or more than one outstanding out of band message at a time. Out of band data is maintained by the protocol and is usually not stored in the socket's receive queue. A socket- level option, SO_OOBINLINE, is provided to force out-of-band data to be placed in the normal receive queue when urgent data is received; this sometimes amelioriates problems due to loss of data when multiple out-of-band segments are received before the first has been passed to the user. The PRU_SENDOOB and PRU_RCVOOB requests to the _p_r___u_s_r_r_e_q routine are used in sending and receiving data. SMM:18-38 Networking Implementation Notes 1144.. TTrraaiilleerr pprroottooccoollss Core to core copies can be expensive. Consequently, a great deal of effort was spent in minimizing such opera- tions. The VAX architecture provides virtual memory hard- ware organized in page units. To cut down on copy opera- tions, data is kept in page-sized units on page-aligned boundaries whenever possible. This allows data to be moved in memory simply by remapping the page instead of copying. The mbuf and network interface routines perform page table manipulations where needed, hiding the complexities of the VAX virtual memory hardware from higher level code. Data enters the system in two ways: from the user, or from the network (hardware interface). When data is copied from the user's address space into the system it is deposited in pages (if sufficient data is present). This encourages the user to transmit information in messages which are a multiple of the system page size. Unfortunately, performing a similar operation when tak- ing data from the network is very difficult. Consider the format of an incoming packet. A packet usually contains a local network header followed by one or more headers used by the high level protocols. Finally, the data, if any, fol- lows these headers. Since the header information may be variable length, DMA'ing the eventual data for the user into a page aligned area of memory is impossible without _a _p_r_i_o_r_i knowledge of the format (e.g., by supporting only a single protocol header format). To allow variable length header information to be pre- sent and still ensure page alignment of data, a special local network encapsulation may be used. This encapsula- tion, termed a _t_r_a_i_l_e_r _p_r_o_t_o_c_o_l [Leffler84], places the variable length header information after the data. A fixed size local network header is then prepended to the resultant packet. The local network header contains the size of the data portion (in units of 512 bytes), and a new _t_r_a_i_l_e_r _p_r_o_- _t_o_c_o_l _h_e_a_d_e_r, inserted before the variable length informa- tion, contains the size of the variable length header infor- mation. The following trailer protocol header is used to store information regarding the variable length protocol header: struct { short protocol; /* original protocol no. */ short length; /* length of trailer */ }; The processing of the trailer protocol is very simple. On output, the local network header indicates that a trailer encapsulation is being used. The header also includes an Networking Implementation Notes SMM:18-39 indication of the number of data pages present before the trailer protocol header. The trailer protocol header is initialized to contain the actual protocol identifier and the variable length header size, and is appended to the data along with the variable length header information. On input, the interface routines identify the trailer encapsulation by the protocol type stored in the local net- work header, then calculate the number of pages of data to find the beginning of the trailer. The trailing information is copied into a separate mbuf and linked to the front of the resultant packet. Clearly, trailer protocols require cooperation between source and destination. In addition, they are normally cost effective only when sizable packets are used. The current scheme works because the local network encapsulation header is a fixed size, allowing DMA operations to be performed at a known offset from the first data page being received. Should the local network header be variable length this scheme fails. Statistics collected indicate that as much as 200Kb/s can be gained by using a trailer protocol with 1Kbyte pack- ets. The average size of the variable length header was 40 bytes (the size of a minimal TCP/IP packet header). If hardware supports larger sized packets, even greater gains may be realized. AAcckknnoowwlleeddggeemmeennttss The internal structure of the system is patterned after the Xerox PUP architecture [Boggs79], while in certain places the Internet protocol family has had a great deal of influence in the design. The use of software interrupts for process invocation is based on similar facilities found in the VMS operating system. Many of the ideas related to pro- tocol modularity, memory management, and network interfaces are based on Rob Gurwitz's TCP/IP implementation for the 4.1BSD version of UNIX on the VAX [Gurwitz81]. Greg Chesson explained his use of trailer encapsulations in Datakit, instigating their use in our system. SMM:18-40 Networking Implementation Notes RReeffeerreenncceess [Boggs79] Boggs, D. R., J. F. Shoch, E. A. Taft, and R. M. Metcalfe; _P_U_P_: _A_n _I_n_t_e_r_n_e_t_w_o_r_k _A_r_c_h_i_t_e_c_t_u_r_e. Report CSL-79-10. XEROX Palo Alto Research Center, July 1979. [BBN78] Bolt Beranek and Newman; Specification for the Interconnection of Host and IMP. BBN Technical Report 1822. May 1978. [Cerf78] Cerf, V. G.; The Catenet Model for Internetworking. Internet Working Group, IEN 48. July 1978. [Clark82] Clark, D. D.; Window and Acknowledge- ment Strategy in TCP, RFC-813. Network Information Center, SRI International. July 1982. [DEC80] Digital Equipment Corporation; _D_E_C_n_e_t _D_I_G_I_T_A_L _N_e_t_w_o_r_k _A_r_c_h_i_t_e_c_t_u_r_e _- _G_e_n_e_r_a_l _D_e_s_c_r_i_p_t_i_o_n. Order No. AA-K179A-TK. October 1980. [Gurwitz81] Gurwitz, R. F.; VAX-UNIX Networking Support Project - Implementation Description. Internetwork Working Group, IEN 168. January 1981. [ISO81] International Organization for Standard- ization. _I_S_O _O_p_e_n _S_y_s_t_e_m_s _I_n_t_e_r_c_o_n_n_e_c_- _t_i_o_n _- _B_a_s_i_c _R_e_f_e_r_e_n_c_e _M_o_d_e_l. ISO/TC 97/SC 16 N 719. August 1981. [Joy86] Joy, W.; Fabry, R.; Leffler, S.; McKu- sick, M.; and Karels, M.; Berkeley Soft- ware Architecture Manual, 4.4BSD Edi- tion. _U_N_I_X _P_r_o_g_r_a_m_m_e_r_'_s _S_u_p_p_l_e_m_e_n_t_a_r_y _D_o_c_u_m_e_n_t_s, Vol. 1 (PSD:5). Computer Systems Research Group, University of California, Berkeley. May, 1986. [Leffler84] Leffler, S.J. and Karels, M.J.; Trailer Encapsulations, RFC-893. Network Infor- mation Center, SRI International. April 1984. [Postel80] Postel, J. User Datagram Protocol, RFC-768. Network Information Center, SRI International. May 1980. Networking Implementation Notes SMM:18-41 [Postel81a] Postel, J., ed. Internet Protocol, RFC-791. Network Information Center, SRI International. September 1981. [Postel81b] Postel, J., ed. Transmission Control Protocol, RFC-793. Network Information Center, SRI International. September 1981. [Postel81c] Postel, J. Internet Control Message Protocol, RFC-792. Network Information Center, SRI International. September 1981. [Xerox81] Xerox Corporation. _I_n_t_e_r_n_e_t _T_r_a_n_s_p_o_r_t _P_r_o_t_o_c_o_l_s. Xerox System Integration Standard 028112. December 1981. [Zimmermann80] Zimmermann, H. OSI Reference Model - The ISO Model of Architecture for Open Systems Interconnection. _I_E_E_E _T_r_a_n_s_a_c_- _t_i_o_n_s _o_n _C_o_m_m_u_n_i_c_a_t_i_o_n_s. Com-28(4); 425-432. April 1980.