Berkeley Software Architecture Manual
4.4BSD Edition

William Joy, Robert Fabry,
Samuel Leffler, M. Kirk McKusick,
Michael Karels

Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720

ABSTRACT

This document summarizes the facilities provided by the 4.4BSD
version of the UNIX* operating system.  It does not attempt to act as
a tutorial for use of the system nor does it attempt to explain or
justify the design of the system facilities.  It gives neither
motivation nor implementation details, in favor of brevity.

The first section describes the basic kernel functions provided to a
UNIX process: process naming and protection, memory management,
software interrupts, object references (descriptors), time and
statistics functions, and resource controls.  These facilities, as
well as facilities for bootstrap, shutdown and process accounting,
are provided solely by the kernel.

The second section describes the standard system abstractions for
files and file systems, communication, terminal handling, and process
control and debugging.  These facilities are implemented by the
operating system or by network server processes.

-----------
* UNIX is a trademark of Bell Laboratories.

PSD:5-2                                  4.4BSD Architecture Manual

TABLE OF CONTENTS

Introduction.

0. Notation and types

1. Kernel primitives

1.1. Processes and protection
     1.1.1. Host and process identifiers
     1.1.2. Process creation and termination
     1.1.3. User and group ids
     1.1.4. Process groups

1.2. Memory management
     1.2.1. Text, data and stack
     1.2.2. Mapping pages
     1.2.3. Page protection control
     1.2.4. Giving and getting advice
     1.2.5. Synchronization primitives

1.3. Signals
     1.3.1. Overview
     1.3.2. Signal types
     1.3.3. Signal handlers
     1.3.4. Sending signals
     1.3.5. Protecting critical sections
     1.3.6. Signal stacks

1.4. Timing and statistics
     1.4.1. Real time
     1.4.2. Interval time

1.5. Descriptors
     1.5.1. The reference table
     1.5.2. Descriptor properties
     1.5.3. Managing descriptor references
     1.5.4. Multiplexing requests
     1.5.5. Descriptor wrapping

1.6. Resource controls
     1.6.1. Process priorities
     1.6.2. Resource utilization
     1.6.3. Resource limits

1.7. System operation support
     1.7.1. Bootstrap operations
     1.7.2. Shutdown operations
     1.7.3. Accounting

2. System facilities

2.1. Generic operations
     2.1.1. Read and write
     2.1.2. Input/output control
     2.1.3. Non-blocking and asynchronous operations

2.2. File system
     2.2.1. Overview
     2.2.2. Naming
     2.2.3. Creation and removal
          2.2.3.1. Directory creation and removal
          2.2.3.2. File creation
          2.2.3.3. Creating references to devices
          2.2.3.4. Portal creation
          2.2.3.6. File, device, and portal removal
     2.2.4. Reading and modifying file attributes
     2.2.5. Links and renaming
     2.2.6. Extension and truncation
     2.2.7. Checking accessibility
     2.2.8. Locking
     2.2.9. Disc quotas

2.3. Interprocess communication
     2.3.1. Interprocess communication primitives
          2.3.1.1. Communication domains
          2.3.1.2. Socket types and protocols
          2.3.1.3. Socket creation, naming and service establishment
          2.3.1.4. Accepting connections
          2.3.1.5. Making connections
          2.3.1.6. Sending and receiving data
          2.3.1.7. Scatter/gather and exchanging access rights
          2.3.1.8. Using read and write with sockets
          2.3.1.9. Shutting down halves of full-duplex connections
          2.3.1.10. Socket and protocol options
     2.3.2. UNIX domain
          2.3.2.1. Types of sockets
          2.3.2.2. Naming
          2.3.2.3. Access rights transmission
     2.3.3. INTERNET domain
          2.3.3.1. Socket types and protocols
          2.3.3.2. Socket naming
          2.3.3.3. Access rights transmission
          2.3.3.4. Raw access

2.4. Terminals and devices
     2.4.1. Terminals
          2.4.1.1. Terminal input
               2.4.1.1.1 Input modes
               2.4.1.1.2 Interrupt characters
               2.4.1.1.3 Line editing
          2.4.1.2. Terminal output
          2.4.1.3. Terminal control operations
          2.4.1.4. Terminal hardware support
     2.4.2. Structured devices
     2.4.3. Unstructured devices

2.5. Process control and debugging

I. Summary of facilities

0. Notation and types

The notation used to describe system calls is a variant of a C
language call, consisting of a prototype call followed by declaration
of parameters and results.  An additional keyword result, not part of
the normal C language, is used to indicate which of the declared
entities receive results.  As an example, consider the read call, as
described in section 2.1:

     cc = read(fd, buf, nbytes);
     result int cc; int fd; result char *buf; int nbytes;

The first line shows how the read routine is called, with three
parameters.  As shown on the second line, cc is an integer and read
also returns information in the parameter buf.

Description of all error conditions arising from each system call is
not provided here; they appear in the programmer's manual.  In
particular, when accessed from the C language, many calls return a
characteristic -1 value when an error occurs, returning the error
code in the global variable errno.  Other languages may present
errors in different ways.

A number of system standard types are defined in the include file
<sys/types.h> and used in the specifications here and in many C
programs.
These include caddr_t giving a memory address (typically as a
character pointer), off_t giving a file offset (typically as a long
integer), and a set of unsigned types u_char, u_short, u_int and
u_long, shorthand names for unsigned char, unsigned short, etc.

1. Kernel primitives

The facilities available to a UNIX user process are logically divided
into two parts: kernel facilities directly implemented by UNIX code
running in the operating system, and system facilities implemented
either by the system, or in cooperation with a server process.  The
kernel facilities are described in this section, section 1.

The facilities implemented in the kernel are those which define the
UNIX virtual machine in which each process runs.  Like many real
machines, this virtual machine has memory management hardware, an
interrupt facility, timers and counters.  The UNIX virtual machine
also allows access to files and other objects through a set of
descriptors.  Each descriptor resembles a device controller, and
supports a set of operations.  Like devices on real machines, some of
which are internal to the machine and some of which are external,
parts of the descriptor machinery are built-in to the operating
system, while other parts are often implemented in server processes
on other machines.  The facilities provided through the descriptor
machinery are described in section 2.

1.1. Processes and protection

1.1.1. Host and process identifiers

Each UNIX host has associated with it a 32-bit host id, and a host
name of up to 64 characters (as defined by MAXHOSTNAMELEN in
<sys/param.h>).
These are set (by a privileged user) and returned by the calls:

     sethostid(hostid)
     long hostid;

     hostid = gethostid();
     result long hostid;

     sethostname(name, len)
     char *name; int len;

     len = gethostname(buf, buflen)
     result int len; result char *buf; int buflen;

On each host runs a set of processes.  Each process is largely
independent of other processes, having its own protection domain,
address space, timers, and an independent set of references to system
or user implemented objects.

Each process in a host is named by an integer called the process id.
This number is in the range 1-30000 and is returned by the getpid
routine:

     pid = getpid();
     result int pid;

On each UNIX host this identifier is guaranteed to be unique; in a
multi-host environment, the (hostid, process id) pairs are guaranteed
unique.

1.1.2. Process creation and termination

A new process is created by making a logical duplicate of an existing
process:

     pid = fork();
     result int pid;

The fork call returns twice, once in the parent process, where pid is
the process identifier of the child, and once in the child process
where pid is 0.  The parent-child relationship induces a hierarchical
structure on the set of processes in the system.

A process may terminate by executing an exit call:

     exit(status)
     int status;

returning 8 bits of exit status to its parent.

When a child process exits or terminates abnormally, the parent
process receives information about any event which caused termination
of the child process.  A second call provides a non-blocking
interface and may also be used to retrieve information about
resources consumed by the process during its lifetime.

     #include <sys/wait.h>

     pid = wait(astatus);
     result int pid; result union wait *astatus;

     pid = wait3(astatus, options, arusage);
     result int pid; result union wait *astatus;
     int options; result struct rusage *arusage;

A process can overlay itself with the memory image of another
process, passing the newly created process a set of parameters, using
the call:

     execve(name, argv, envp)
     char *name, **argv, **envp;

The specified name must be a file which is in a format recognized by
the system, either a binary executable file or a file which causes
the execution of a specified interpreter program to process its
contents.

1.1.3. User and group ids

Each process in the system has associated with it two user ids: a
real user id and an effective user id, both 16 bit unsigned integers
(type uid_t).  Each process has a real accounting group id and an
effective accounting group id and a set of access group ids.  The
group ids are 16 bit unsigned integers (type gid_t).  Each process
may be in several different access groups, with the maximum
concurrent number of access groups a system compilation parameter,
the constant NGROUPS in the file <sys/param.h>, guaranteed to be at
least 8.

The real and effective user ids associated with a process are
returned by:

     ruid = getuid();
     result uid_t ruid;

     euid = geteuid();
     result uid_t euid;

the real and effective accounting group ids by:

     rgid = getgid();
     result gid_t rgid;

     egid = getegid();
     result gid_t egid;

The access group id set is returned by a getgroups call*:

     ngroups = getgroups(gidsetsize, gidset);
     result int ngroups; int gidsetsize; result int gidset[gidsetsize];

-----------
* The type of the gidset array in getgroups and setgroups remains
integer for compatibility with 4.2BSD.  It may change to gid_t in
future releases.

The user and group ids are assigned at login time using the setreuid,
setregid, and setgroups calls:

     setreuid(ruid, euid);
     int ruid, euid;

     setregid(rgid, egid);
     int rgid, egid;

     setgroups(gidsetsize, gidset)
     int gidsetsize; int gidset[gidsetsize];

The setreuid call sets both the real and effective user ids, while
the setregid call sets both the real and effective accounting group
ids.  Unless the caller is the super-user, ruid must be equal to
either the current real or effective user id, and rgid equal to
either the current real or effective accounting group id.  The
setgroups call is restricted to the super-user.

1.1.4. Process groups

Each process in the system is also normally associated with a process
group.  The group of processes in a process group is sometimes
referred to as a job and manipulated by high-level system software
(such as the shell).  The current process group of a process is
returned by the getpgrp call:

     pgrp = getpgrp(pid);
     result int pgrp; int pid;

When a process is in a specific process group it may receive software
interrupts affecting the group, causing the group to suspend or
resume execution or to be interrupted or terminated.
In particular, a system terminal has a process group and only
processes which are in the process group of the terminal may read
from the terminal, allowing arbitration of terminals among several
different jobs.

The process group associated with a process may be changed by the
setpgrp call:

     setpgrp(pid, pgrp);
     int pid, pgrp;

Newly created processes are assigned process ids distinct from all
processes and process groups, and the same process group as their
parent.  A normal (unprivileged) process may set its process group
equal to its process id.  A privileged process may set the process
group of any process to any value.

1.2. Memory management|-

1.2.1. Text, data and stack

Each process begins execution with three logical areas of memory
called text, data and stack.  The text area is read-only and shared,
while the data and stack areas are private to the process.  Both the
data and stack areas may be extended and contracted on program
request.  The call

     addr = sbrk(incr);
     result caddr_t addr; int incr;

changes the size of the data area by incr bytes and returns the new
end of the data area, while

     addr = sstk(incr);
     result caddr_t addr; int incr;

changes the size of the stack area.  The stack area is also
automatically extended as needed.  On the VAX the text and data areas
are adjacent in the P0 region, while the stack section is in the P1
region, and grows downward.

1.2.2. Mapping pages

The system supports sharing of data between processes by allowing
pages to be mapped into memory.  These mapped pages may be shared
with other processes or private to the process.
Protection and sharing options are defined in <sys/mman.h> as:

     /* protections are chosen from these bits, or-ed together */
     #define PROT_READ        0x04    /* pages can be read */
     #define PROT_WRITE       0x02    /* pages can be written */
     #define PROT_EXEC        0x01    /* pages can be executed */

     /* flags contain mapping type, sharing type and options */
     /* mapping type; choose one */
     #define MAP_FILE         0x0001  /* mapped from a file or device */
     #define MAP_ANON         0x0002  /* allocated from memory, swap space */
     #define MAP_TYPE         0x000f  /* mask for type field */

     /* sharing types; choose one */
     #define MAP_SHARED       0x0010  /* share changes */
     #define MAP_PRIVATE      0x0000  /* changes are private */

     /* other flags */
     #define MAP_FIXED        0x0020  /* map addr must be exactly as requested */
     #define MAP_INHERIT      0x0040  /* region is retained after exec */
     #define MAP_HASSEMAPHORE 0x0080  /* region may contain semaphores */
     #define MAP_NOPREALLOC   0x0100  /* do not preallocate space */

-----------
|- This section represents the interface planned for later releases
of the system.  Of the calls described in this section, only sbrk and
getpagesize are included in 4.3BSD.

The cpu-dependent size of a page is returned by the getpagesize
system call:

     pagesize = getpagesize();
     result int pagesize;

The call:

     maddr = mmap(addr, len, prot, flags, fd, pos);
     result caddr_t maddr; caddr_t addr; int *len, prot, flags, fd;
     off_t pos;

causes the pages starting at addr and continuing for at most len
bytes to be mapped from the object represented by descriptor fd,
starting at byte offset pos.  The starting address of the region is
returned; for the convenience of the system, it may differ from that
supplied unless the MAP_FIXED flag is given, in which case the exact
address will be used or the call will fail.  The actual amount mapped
is returned in len.  The addr, len, and pos parameters must all be
multiples of the pagesize.
A successful mmap will delete any previous mapping in the allocated
address range.  The parameter prot specifies the accessibility of the
mapped pages.  The parameter flags specifies the type of object to be
mapped, mapping options, and whether modifications made to this
mapped copy of the page are to be kept private, or are to be shared
with other references.  Possible types include MAP_FILE, mapping a
regular file or character-special device memory, and MAP_ANON, which
maps memory not associated with any specific file.  The file
descriptor used for creating MAP_ANON regions is used only for
naming, and may be given as -1 if no name is associated with the
region.|=  The MAP_INHERIT flag allows a region to be inherited after
an exec.  The MAP_HASSEMAPHORE flag allows special handling for
regions that may contain semaphores.  The MAP_NOPREALLOC flag allows
processes to allocate regions whose virtual address space, if fully
allocated, would exceed the available memory plus swap resources.
Such regions may get a SIGSEGV signal if they page fault and
resources are not available to service their request; typically they
would free up some resources via munmap so that when they return from
the signal the page fault could be successfully completed.

-----------
|= The current design does not allow a process to specify the
location of swap space.  In the future we may define an additional
mapping type, MAP_SWAP, in which the file descriptor argument
specifies a file or device to which swapping should be done.

A facility is provided to synchronize a mapped region with the file
it maps; the call

     msync(addr, len);
     caddr_t addr; int len;

writes any modified pages back to the filesystem and updates the file
modification time.
If len is 0, all modified pages within the region containing addr
will be flushed; if len is non-zero, only the pages containing addr
and len succeeding locations will be examined.  Any required
synchronization of memory caches will also take place at this time.
Filesystem operations on a file that is mapped for shared
modifications are unpredictable except after an msync.

A mapping can be removed by the call

     munmap(addr, len);
     caddr_t addr; int len;

This call deletes the mappings for the specified address range, and
causes further references to addresses within the range to generate
invalid memory references.

1.2.3. Page protection control

A process can control the protection of pages using the call

     mprotect(addr, len, prot);
     caddr_t addr; int len, prot;

This call changes the specified pages to have protection prot.  Not
all implementations will guarantee protection on a page basis; the
granularity of protection changes may be as large as an entire
region.

1.2.4. Giving and getting advice

A process that has knowledge of its memory behavior may use the
madvise call:

     madvise(addr, len, behav);
     caddr_t addr; int len, behav;

Behav describes expected behavior, as given in <sys/mman.h>:

     #define MADV_NORMAL      0  /* no further special treatment */
     #define MADV_RANDOM      1  /* expect random page references */
     #define MADV_SEQUENTIAL  2  /* expect sequential references */
     #define MADV_WILLNEED    3  /* will need these pages */
     #define MADV_DONTNEED    4  /* don't need these pages */
     #define MADV_SPACEAVAIL  5  /* insure that resources are reserved */

Finally, a process may obtain information about whether pages are
core resident by using the call

     mincore(addr, len, vec)
     caddr_t addr; int len; result char *vec;

Here the current core residency of the pages is returned in the
character array vec, with a value of 1 meaning that the page is
in-core.
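
The mapping calls described above can be sketched in C roughly as
follows.  This is an illustration only, written against the POSIX
form of the interface found on current systems, which differs from
the planned interface in this section: len is passed by value rather
than through a pointer, the page size is obtained with sysconf rather
than getpagesize, and MAP_ANON may be spelled MAP_ANONYMOUS.  The
helper name scratch_page_demo is hypothetical.

```c
#define _DEFAULT_SOURCE          /* assumption: asks glibc to expose MAP_ANON */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS   /* some systems provide only the longer name */
#endif

/*
 * Hypothetical helper (not part of the manual): map one private
 * anonymous page, write to it, make it read-only, then unmap it.
 * Returns 0 on success and -1 on any failure.
 */
int scratch_page_demo(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);  /* stand-in for getpagesize() */
    if (pagesize <= 0)
        return -1;

    /* A null addr lets the system choose the address; fd is -1
     * because no name is associated with the region. */
    char *p = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    strcpy(p, "hello");                     /* page is readable and writable */

    /* Drop write permission; a later store through p would fault. */
    if (mprotect(p, (size_t)pagesize, PROT_READ) == -1) {
        munmap(p, (size_t)pagesize);
        return -1;
    }
    int ok = (strcmp(p, "hello") == 0);     /* reads are still permitted */

    /* Remove the mapping; further references to p would be invalid. */
    if (munmap(p, (size_t)pagesize) == -1)
        return -1;
    return ok ? 0 : -1;
}
```

A caller would simply test the return value of scratch_page_demo();
a shared mapping of a file would pass a file descriptor and
MAP_SHARED instead of -1 and MAP_PRIVATE.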

1.2.5. Synchronization primitives

Primitives are provided for synchronization using semaphores in
shared memory.  Semaphores must lie within a MAP_SHARED region with
at least modes PROT_READ and PROT_WRITE.  The MAP_HASSEMAPHORE flag
must have been specified when the region was created.  To acquire a
lock a process calls:

     value = mset(sem, wait)
     result int value; semaphore *sem; int wait;

Mset indivisibly tests and sets the semaphore sem.  If the previous
value is zero, the process has acquired the lock and mset returns
true immediately.  Otherwise, if the wait flag is zero, failure is
returned.  If wait is true and the previous value is non-zero, mset
relinquishes the processor until notified that it should retry.

To release a lock a process calls:

     mclear(sem)
     semaphore *sem;

Mclear indivisibly tests and clears the semaphore sem.  If the
``WANT'' flag is zero in the previous value, mclear returns
immediately.  If the ``WANT'' flag is non-zero in the previous value,
mclear arranges for waiting processes to retry before returning.

Two routines provide services analogous to the kernel sleep and
wakeup functions interpreted in the domain of shared memory.  A
process may relinquish the processor by calling msleep with a set
semaphore:

     msleep(sem)
     semaphore *sem;

If the semaphore is still set when it is checked by the kernel, the
process will be put in a sleeping state until some other process
issues an mwakeup for the same semaphore within the region using the
call:

     mwakeup(sem)
     semaphore *sem;

An mwakeup may awaken all sleepers on the semaphore, or may awaken
only the next sleeper on a queue.

1.3. Signals

1.3.1. Overview

The system defines a set of signals that may be delivered to a
process.
Signal delivery resembles the occurrence of a hardware interrupt: the
signal is blocked from further occurrence, the current process
context is saved, and a new one is built.  A process may specify the
handler to which a signal is delivered, or specify that the signal is
to be blocked or ignored.  A process may also specify that a default
action is to be taken when signals occur.

Some signals will cause a process to exit when they are not caught.
This may be accompanied by creation of a core image file, containing
the current memory image of the process for use in post-mortem
debugging.  A process may choose to have signals delivered on a
special stack, so that sophisticated software stack manipulations are
possible.

All signals have the same priority.  If multiple signals are pending
simultaneously, the order in which they are delivered to a process is
implementation specific.  Signal routines execute with the signal
that caused their invocation blocked, but other signals may yet
occur.  Mechanisms are provided whereby critical sections of code may
protect themselves against the occurrence of specified signals.

1.3.2. Signal types

The signals defined by the system fall into one of five classes:
hardware conditions, software conditions, input/output notification,
process control, or resource control.  The set of signals is defined
in the file <signal.h>.

Hardware signals are derived from exceptional conditions which may
occur during execution.  Such signals include SIGFPE representing
floating point and other arithmetic exceptions, SIGILL for illegal
instruction execution, SIGSEGV for addresses outside the currently
assigned area of memory, and SIGBUS for accesses that violate memory
protection constraints.  Other, more cpu-specific hardware signals
exist, such as those for the various customer-reserved instructions
on the VAX (SIGIOT, SIGEMT, and SIGTRAP).

Software signals reflect interrupts generated by user request: SIGINT
for the normal interrupt signal; SIGQUIT for the more powerful quit
signal, that normally causes a core image to be generated; SIGHUP and
SIGTERM that cause graceful process termination, either because a
user has ``hung up'', or by user or program request; and SIGKILL, a
more powerful termination signal which a process cannot catch or
ignore.  Programs may define their own asynchronous events using
SIGUSR1 and SIGUSR2.  Other software signals (SIGALRM, SIGVTALRM,
SIGPROF) indicate the expiration of interval timers.

A process can request notification via a SIGIO signal when input or
output is possible on a descriptor, or when a non-blocking operation
completes.  A process may request to receive a SIGURG signal when an
urgent condition arises.

A process may be stopped by a signal sent to it or the members of its
process group.  The SIGSTOP signal is a powerful stop signal, because
it cannot be caught.  Other stop signals SIGTSTP, SIGTTIN, and
SIGTTOU are used when a user request, input request, or output
request respectively is the reason for stopping the process.  A
SIGCONT signal is sent to a process when it is continued from a
stopped state.  Processes may receive notification with a SIGCHLD
signal when a child process changes state, either by stopping or by
terminating.

Exceeding resource limits may cause signals to be generated.  SIGXCPU
occurs when a process nears its CPU time limit and SIGXFSZ warns that
the limit on file size creation has been reached.

1.3.3. Signal handlers

A process has a handler associated with each signal.  The handler
controls the way the signal is delivered.
The call

     #include <signal.h>

     struct sigvec {
          int  (*sv_handler)();
          int  sv_mask;
          int  sv_flags;
     };

     sigvec(signo, sv, osv)
     int signo; struct sigvec *sv; result struct sigvec *osv;

assigns interrupt handler address sv_handler to signal signo.  Each
handler address specifies either an interrupt routine for the signal,
that the signal is to be ignored, or that a default action (usually
process termination) is to occur if the signal occurs.  The constants
SIG_IGN and SIG_DFL used as values for sv_handler cause ignoring or
defaulting of a condition.  The sv_mask value specifies the signal
mask to be used when the handler is invoked; it implicitly includes
the signal which invoked the handler.  Signal masks include one bit
for each signal; the mask for a signal signo is provided by the macro
sigmask(signo), from <signal.h>.  Sv_flags specifies whether system
calls should be restarted if the signal handler returns and whether
the handler should operate on the normal run-time stack or a special
signal stack (see below).  If osv is non-zero, the previous signal
vector is returned.

When a signal condition arises for a process, the signal is added to
a set of signals pending for the process.  If the signal is not
currently blocked by the process then it will be delivered.  The
process of signal delivery adds the signal to be delivered and those
signals specified in the associated signal handler's sv_mask to a set
of those masked for the process, saves the current process context,
and places the process in the context of the signal handling routine.
The call is arranged so that if the signal handling routine exits
normally the signal mask will be restored and the process will resume
execution in the original context.  If the process wishes to resume
in a different context, then it must arrange to restore the signal
mask itself.
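
Installing a handler along these lines might look like the sketch
below.  It is an assumption-laden illustration rather than the
interface defined here: it uses the POSIX sigaction call, whose
sa_handler, sa_mask and sa_flags fields play the roles of sv_handler,
sv_mask and sv_flags, because sigvec is not assumed to be available
on current systems.  The names on_usr1 and catch_usr1_once are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: request the POSIX signal API */
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_usr1 = 0;  /* set by the handler */

/* Hypothetical handler: only records that the signal arrived. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/*
 * Install on_usr1 for SIGUSR1, send the signal to the calling
 * process, and put the previous disposition back.
 * Returns 1 if the handler ran, 0 if it did not, -1 on error.
 */
int catch_usr1_once(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;          /* corresponds to sv_handler */
    sigemptyset(&sa.sa_mask);         /* corresponds to sv_mask */
    sigaddset(&sa.sa_mask, SIGUSR2);  /* also hold off SIGUSR2 in the handler */
    sa.sa_flags = SA_RESTART;         /* restart interrupted system calls */

    /* old receives the previous vector, like the osv result. */
    if (sigaction(SIGUSR1, &sa, &old) == -1)
        return -1;

    got_usr1 = 0;
    if (raise(SIGUSR1) != 0)          /* returns after the handler has run */
        return -1;

    if (sigaction(SIGUSR1, &old, NULL) == -1)
        return -1;
    return got_usr1 ? 1 : 0;
}
```

SIGUSR1 itself need not be added to sa_mask: as with sv_mask, the
signal being delivered is blocked implicitly while its handler runs.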

The mask of blocked signals is independent of handlers for signals.
It delays signals from being delivered much as a raised hardware
interrupt priority level delays hardware interrupts.  Preventing an
interrupt from occurring by changing the handler is analogous to
disabling a device from further interrupts.

The signal handling routine sv_handler is called by a C call of the
form

     (*sv_handler)(signo, code, scp);
     int signo; long code; struct sigcontext *scp;

The signo gives the number of the signal that occurred, and the code,
a word of information supplied by the hardware.  The scp parameter is
a pointer to a machine-dependent structure containing the information
for restoring the context before the signal.

1.3.4. Sending signals

A process can send a signal to another process or group of processes
with the calls:

     kill(pid, signo)
     int pid, signo;

     killpgrp(pgrp, signo)
     int pgrp, signo;

Unless the process sending the signal is privileged, it must have the
same effective user id as the process receiving the signal.

Signals are also sent implicitly from a terminal device to the
process group associated with the terminal when certain input
characters are typed.

1.3.5. Protecting critical sections

To block a section of code against one or more signals, a sigblock
call may be used to add a set of signals to the existing mask,
returning the old mask:

     oldmask = sigblock(mask);
     result long oldmask; long mask;

The old mask can then be restored later with sigsetmask,

     oldmask = sigsetmask(mask);
     result long oldmask; long mask;

The sigblock call can be used to read the current mask by specifying
an empty mask.

It is possible to check conditions with some signals blocked, and
then to pause waiting for a signal and restoring the mask, by using:

     sigpause(mask);
     long mask;

1.3.6. Signal stacks

Applications that maintain complex or fixed size stacks can use the
call

     struct sigstack {
          caddr_t  ss_sp;
          int      ss_onstack;
     };

     sigstack(ss, oss)
     struct sigstack *ss; result struct sigstack *oss;

to provide the system with a stack based at ss_sp for delivery of
signals.  The value ss_onstack indicates whether the process is
currently on the signal stack, a notion maintained in software by the
system.

When a signal is to be delivered, the system checks whether the
process is on a signal stack.  If not, then the process is switched
to the signal stack for delivery, with the return from the signal
arranged to restore the previous stack.

If the process wishes to take a non-local exit from the signal
routine, or run code from the signal stack that uses a different
stack, a sigstack call should be used to reset the signal stack.

1.4. Timers

1.4.1. Real time

The system's notion of the current Greenwich time and the current
time zone is set and returned by the calls:

     #include <sys/time.h>

     settimeofday(tp, tzp);
     struct timeval *tp; struct timezone *tzp;

     gettimeofday(tp, tzp);
     result struct timeval *tp; result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:

     struct timeval {
          long  tv_sec;          /* seconds since Jan 1, 1970 */
          long  tv_usec;         /* and microseconds */
     };

     struct timezone {
          int   tz_minuteswest;  /* of Greenwich */
          int   tz_dsttime;      /* type of dst correction to apply */
     };

The precision of the system clock is hardware dependent.  Earlier
versions of UNIX contained only a 1-second resolution version of this
call, which remains as a library routine:

     time(tvsec)
     result long *tvsec;

returning only the tv_sec field from the gettimeofday call.

1.4.2. Interval time

The system provides each process with three interval timers, defined
in <sys/time.h>:

     #define ITIMER_REAL     0  /* real time intervals */
     #define ITIMER_VIRTUAL  1  /* virtual time intervals */
     #define ITIMER_PROF     2  /* user and system virtual time */

The ITIMER_REAL timer decrements in real time.  It could be used by a
library routine to maintain a wakeup service queue.  A SIGALRM signal
is delivered when this timer expires.

The ITIMER_VIRTUAL timer decrements in process virtual time.  It runs
only when the process is executing.  A SIGVTALRM signal is delivered
when it expires.

The ITIMER_PROF timer decrements both in process virtual time and
when the system is running on behalf of the process.  It is designed
to be used by processes to statistically profile their execution.  A
SIGPROF signal is delivered when it expires.

A timer value is defined by the itimerval structure:

     struct itimerval {
          struct timeval  it_interval;  /* timer interval */
          struct timeval  it_value;     /* current value */
     };

and a timer is set or read by the call:

     getitimer(which, value);
     int which; result struct itimerval *value;

     setitimer(which, value, ovalue);
     int which; struct itimerval *value;
     result struct itimerval *ovalue;

The third argument to setitimer specifies an optional structure to
receive the previous contents of the interval timer.  A timer can be
disabled by specifying a timer value of 0.

The system rounds argument timer intervals to be not less than the
resolution of its clock.  This clock resolution can be determined by
loading a very small value into a timer and reading the timer back to
see what value resulted.

The alarm system call of earlier versions of UNIX is provided as a
library routine using the ITIMER_REAL timer.
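
As a small illustration of setting, reading and disabling an interval
timer, the following C sketch arms ITIMER_REAL for 60 seconds, reads
it back with getitimer, and then disables it by loading a zero value
before it can expire, so no SIGALRM is delivered.  The helper name
arm_and_cancel_real_timer and the 60 second figure are hypothetical
choices for the example.

```c
#define _DEFAULT_SOURCE          /* assumption: harmless feature request for glibc */
#include <assert.h>
#include <string.h>
#include <sys/time.h>

/*
 * Hypothetical helper: arm ITIMER_REAL as a one-shot 60 second timer,
 * read it back, then disarm it.  Returns the whole seconds that were
 * still remaining when the timer was disarmed, or -1 on error.
 */
long arm_and_cancel_real_timer(void)
{
    struct itimerval tv, cur, old;

    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_sec = 60;     /* time until the first expiry */
    /* it_interval is left at 0, so the timer is not reloaded. */
    if (setitimer(ITIMER_REAL, &tv, NULL) == -1)
        return -1;

    /* Read the running timer back. */
    if (getitimer(ITIMER_REAL, &cur) == -1)
        return -1;
    if (cur.it_value.tv_sec == 0 && cur.it_value.tv_usec == 0)
        return -1;               /* timer unexpectedly not armed */

    /* A zero value disables the timer; old receives what was left. */
    memset(&tv, 0, sizeof tv);
    if (setitimer(ITIMER_REAL, &tv, &old) == -1)
        return -1;

    return (long)old.it_value.tv_sec;
}
```

A periodic timer would set it_interval as well, and the process would
normally install a SIGALRM handler before arming it.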
The process profiling facilities of earlier versions of UNIX remain
because it is not always possible to guarantee the automatic restart
of system calls after receipt of a signal. The profil call arranges
for the kernel to begin gathering execution statistics for a process:

    profil(buf, bufsize, offset, scale);
    result char *buf; int bufsize, offset, scale;

This begins sampling of the program counter, with statistics
maintained in the user-provided buffer.

2.5. Descriptors

2.5.1. The reference table

Each process has access to resources through descriptors. Each
descriptor is a handle allowing the process to reference objects such
as files, devices and communications links.

Rather than allowing processes direct access to descriptors, the
system introduces a level of indirection, so that descriptors may be
shared between processes. Each process has a descriptor reference
table, containing pointers to the actual descriptors. The descriptors
themselves thus have multiple references, and are reference counted
by the system.

Each process has a fixed size descriptor reference table, where the
size is returned by the getdtablesize call:

    nds = getdtablesize();
    result int nds;

and guaranteed to be at least 20. The entries in the descriptor
reference table are referred to by small integers; for example if
there are 20 slots they are numbered 0 to 19.

2.5.2. Descriptor properties

Each descriptor has a logical set of properties maintained by the
system and defined by its type. Each type supports a set of
operations; some operations, such as reading and writing, are common
to several abstractions, while others are unique. The generic
operations applying to many of these types are described in section
2.1. Naming contexts, files and directories are described in section
2.2.
Section 2.3 describes communications domains and sockets. Terminals
and (structured and unstructured) devices are described in section
2.4.

2.5.3. Managing descriptor references

A duplicate of a descriptor reference may be made by doing

    new = dup(old);
    result int new; int old;

returning a copy of descriptor reference old indistinguishable from
the original. The new chosen by the system will be the smallest
unused descriptor reference slot. A copy of a descriptor reference
may be made in a specific slot by doing

    dup2(old, new);
    int old, new;

The dup2 call causes the system to deallocate the descriptor
reference currently occupying slot new, if any, replacing it with a
reference to the same descriptor as old. This deallocation is also
performed by:

    close(old);
    int old;

2.5.4. Multiplexing requests

The system provides a standard way to do synchronous and asynchronous
multiplexing of operations.

Synchronous multiplexing is performed by using the select call to
examine the state of multiple descriptors simultaneously, and to wait
for state changes on those descriptors. Sets of descriptors of
interest are specified as bit masks, as follows:

    #include

    nds = select(nd, in, out, except, tvp);
    result int nds; int nd; result fd_set *in, *out, *except;
    struct timeval *tvp;

    FD_ZERO(&fdset);
    FD_SET(fd, &fdset);
    FD_CLR(fd, &fdset);
    FD_ISSET(fd, &fdset);
    int fd; fd_set fdset;

The select call examines the descriptors specified by the sets in,
out and except, replacing the specified bit masks by the subsets that
select true for input, output, and exceptional conditions
respectively (nd indicates the number of file descriptors specified
by the bit masks). If any descriptors meet the following criteria,
then the number of such descriptors is returned in nds and the bit
masks are updated.
*  A descriptor selects for input if an input oriented operation such
   as read or receive is possible, or if a connection request may be
   accepted (see section 2.3.1.4).

*  A descriptor selects for output if an output oriented operation
   such as write or send is possible, or if an operation that was
   ``in progress'', such as connection establishment, has completed
   (see section 2.1.3).

*  A descriptor selects for an exceptional condition if a condition
   that would cause a SIGURG signal to be generated exists (see
   section 1.3.2), or other device-specific events have occurred.

If none of the specified conditions is true, the operation waits for
one of the conditions to arise, blocking at most the amount of time
specified by tvp. If tvp is given as 0, the select waits
indefinitely.

Options affecting I/O on a descriptor may be read and set by the
call:

    dopt = fcntl(d, cmd, arg)
    result int dopt; int d, cmd, arg;

    /* interesting values for cmd */
    #define F_SETFL   3   /* set descriptor options */
    #define F_GETFL   4   /* get descriptor options */
    #define F_SETOWN  5   /* set descriptor owner (pid/pgrp) */
    #define F_GETOWN  6   /* get descriptor owner (pid/pgrp) */

The F_SETFL cmd may be used to set a descriptor in non-blocking I/O
mode and/or enable signaling when I/O is possible. F_SETOWN may be
used to specify a process or process group to be signaled when using
the latter mode of operation or when urgent indications arise.

Operations on non-blocking descriptors will either complete
immediately, note an error EWOULDBLOCK, partially complete an input
or output operation returning a partial count, or return an error
EINPROGRESS noting that the requested operation is in progress.
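
The interplay of select and non-blocking descriptors described above
can be sketched as follows. This is an illustrative sketch under
stated assumptions, not the system's own code: the helper names
(set_nonblocking, wait_readable, drain_available) are hypothetical,
and the sketch assumes the POSIX flag name O_NONBLOCK and the header
<sys/select.h>, neither of which is named in this manual.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Put a descriptor in non-blocking mode, keeping its other options. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);   /* O_NONBLOCK: assumed POSIX name */
}

/* Wait up to timeout_ms for input; 1 = readable, 0 = timed out, -1 = error. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set in;
    struct timeval tv;
    int nds;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&in);
    FD_SET(fd, &in);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    nds = select(fd + 1, &in, NULL, NULL, &tv);   /* nd covers descriptors 0..fd */
    if (nds < 0)
        return -1;
    return (nds > 0 && FD_ISSET(fd, &in)) ? 1 : 0;
}

/* Read and discard whatever is available now; returns bytes read or -1. */
static long drain_available(int fd)
{
    char buf[256];
    long total = 0;

    for (;;) {
        ssize_t cc = read(fd, buf, sizeof buf);
        if (cc > 0) {
            total += cc;
            continue;
        }
        if (cc == 0)
            return total;                 /* end of file */
        if (errno == EINTR)
            continue;                     /* interrupted, try again */
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return total;                 /* nothing more for now */
        return -1;
    }
}
```

A typical loop would call wait_readable, then drain_available (or a
real consumer), and repeat. Passing a null timeout pointer to select
instead of tv gives the indefinite wait mentioned in the text.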
A descriptor which has signaling enabled will cause the specified
process and/or process group to be signaled, with a SIGIO for input,
output, or in-progress operation complete, or a SIGURG for
exceptional conditions.

For example, when writing to a terminal using non-blocking output,
the system will accept only as much data as there is buffer space for
and return; when making a connection on a socket, the operation may
return indicating that the connection establishment is ``in
progress''. The select facility can be used to determine when further
output is possible on the terminal, or when the connection
establishment attempt is complete.

2.5.5. Descriptor wrapping.|-

A user process may build descriptors of a specified type by wrapping
a communications channel with a system supplied protocol translator:

    new = wrap(old, proto)
    result int new; int old; struct dprop *proto;

Operations on the descriptor new are then translated by the system
provided protocol translator into requests on the underlying object
old in a way defined by the protocol. The protocols supported by the
kernel may vary from system to system and are described in the
programmers manual.

Protocols may be based on communications multiplexing or a
rights-passing style of handling multiple requests made on the same
object. For instance, a protocol for implementing a file abstraction
may or may not include locally generated ``read-ahead'' requests. A
protocol that provides for read-ahead may provide higher performance
but have a more difficult implementation.

Another example is the terminal driving facilities. Normally a
terminal is associated with a communications line, and the terminal
type and standard terminal access protocol are wrapped around a
synchronous communications line and given to the user.
If a virtual terminal is required, the terminal driver can be wrapped
around a communications link, the other end of which is held by a
virtual terminal protocol interpreter.

-----------
|- The facilities described in this section are not included in
4.3BSD.

2.6. Resource controls

2.6.1. Process priorities

The system gives CPU scheduling priority to processes that have not
used CPU time recently. This tends to favor interactive processes and
processes that execute only for short periods. It is possible to
determine the priority currently assigned to a process, process
group, or the processes of a specified user, or to alter this
priority using the calls:

    #define PRIO_PROCESS  0   /* process */
    #define PRIO_PGRP     1   /* process group */
    #define PRIO_USER     2   /* user id */

    prio = getpriority(which, who);
    result int prio; int which, who;

    setpriority(which, who, prio);
    int which, who, prio;

The value prio is in the range -20 to 20. The default priority is 0;
lower priorities cause more favorable execution. The getpriority call
returns the highest priority (lowest numerical value) enjoyed by any
of the specified processes. The setpriority call sets the priorities
of all of the specified processes to the specified value. Only the
super-user may lower priorities.

2.6.2.
Resource utilization

The resources used by a process are returned by a getrusage call,
returning information in a structure defined in <sys/resource.h>:

    #define RUSAGE_SELF       0   /* usage by this process */
    #define RUSAGE_CHILDREN  -1   /* usage by all children */

    getrusage(who, rusage)
    int who; result struct rusage *rusage;

    struct rusage {
        struct timeval ru_utime;  /* user time used */
        struct timeval ru_stime;  /* system time used */
        int ru_maxrss;    /* maximum core resident set size: kbytes */
        int ru_ixrss;     /* integral shared memory size (kbytes*sec) */
        int ru_idrss;     /* unshared data memory size */
        int ru_isrss;     /* unshared stack memory size */
        int ru_minflt;    /* page-reclaims */
        int ru_majflt;    /* page faults */
        int ru_nswap;     /* swaps */
        int ru_inblock;   /* block input operations */
        int ru_oublock;   /* block output operations */
        int ru_msgsnd;    /* messages sent */
        int ru_msgrcv;    /* messages received */
        int ru_nsignals;  /* signals received */
        int ru_nvcsw;     /* voluntary context switches */
        int ru_nivcsw;    /* involuntary context switches */
    };

The who parameter specifies whose resource usage is to be returned.
The resources used by the current process, or by all the terminated
children of the current process may be requested.

2.6.3.
Resource limits

The resources of a process for which limits are controlled by the
kernel are defined in <sys/resource.h>, and controlled by the
getrlimit and setrlimit calls:

    #define RLIMIT_CPU    0   /* cpu time in milliseconds */
    #define RLIMIT_FSIZE  1   /* maximum file size */
    #define RLIMIT_DATA   2   /* maximum data segment size */
    #define RLIMIT_STACK  3   /* maximum stack segment size */
    #define RLIMIT_CORE   4   /* maximum core file size */
    #define RLIMIT_RSS    5   /* maximum resident set size */

    #define RLIM_NLIMITS  6

    #define RLIM_INFINITY 0x7fffffff

    struct rlimit {
        int rlim_cur;   /* current (soft) limit */
        int rlim_max;   /* hard limit */
    };

    getrlimit(resource, rlp)
    int resource; result struct rlimit *rlp;

    setrlimit(resource, rlp)
    int resource; struct rlimit *rlp;

Only the super-user can raise the maximum limits. Other users may
only alter rlim_cur within the range from 0 to rlim_max or
(irreversibly) lower rlim_max.

2.7. System operation support

Unless noted otherwise, the calls in this section are permitted only
to a privileged user.

2.7.1. Bootstrap operations

The call

    mount(blkdev, dir, ronly);
    char *blkdev, *dir; int ronly;

extends the UNIX name space. The mount call specifies a block device
blkdev containing a UNIX file system to be made available starting at
dir. If ronly is set then the file system is read-only; writes to the
file system will not be permitted and access times will not be
updated when files are referenced. Dir is normally a name in the root
directory.

The call

    swapon(blkdev, size);
    char *blkdev; int size;

specifies a device to be made available for paging and swapping.

2.7.2. Shutdown operations

The call

    unmount(dir);
    char *dir;

unmounts the file system mounted on dir.
This call will succeed only if the file system is not currently being
used.

The call

    sync();

schedules input/output to clean all system buffer caches. (This call
does not require privileged status.)

The call

    reboot(how)
    int how;

causes a machine halt or reboot. The call may request a reboot by
specifying how as RB_AUTOBOOT, or that the machine be halted with
RB_HALT. These constants are defined in <sys/reboot.h>.

2.7.3. Accounting

The system optionally keeps an accounting record in a file for each
process that exits on the system. The format of this record is beyond
the scope of this document. The accounting may be enabled to a file
name by doing

    acct(path);
    char *path;

If path is null, then accounting is disabled. Otherwise, the named
file becomes the accounting file.

3. System facilities

This section discusses the system facilities that are not considered
part of the kernel.

The system abstractions described are:

Directory contexts
    A directory context is a position in the UNIX file system name
    space. Operations on files and other named objects in a file
    system are always specified relative to such a context.

Files
    Files are used to store uninterpreted sequences of bytes on which
    random access reads and writes may occur. Pages from files may
    also be mapped into process address space.|- A directory may be
    read as a file.

Communications domains
    A communications domain represents an interprocess communications
    environment, such as the communications facilities of the UNIX
    system, communications in the INTERNET, or the resource sharing
    protocols and access rights of a resource sharing system on a
    local network.

Sockets
    A socket is an endpoint of communication and the focal point for
    IPC in a communications domain.
    Sockets may be created in pairs, or given names and used to
    rendezvous with other sockets in a communications domain,
    accepting connections from these sockets or exchanging messages
    with them. These operations model a labeled or unlabeled
    communications graph, and can be used in a wide variety of
    communications domains. Sockets can have different types to
    provide different semantics of communication, increasing the
    flexibility of the model.

Terminals and other devices
    Devices include terminals, providing input editing and interrupt
    generation and output flow control and editing, magnetic tapes,
    disks and other peripherals. They often support the generic read
    and write operations as well as a number of ioctls.

Processes
    Process descriptors provide facilities for control and debugging
    of other processes.

-----------
|- Support for mapping files is not included in the 4.3 release.

3.1. Generic operations

Many system abstractions support the operations read, write and
ioctl. We describe the basics of these common primitives here.
Similarly, the mechanisms whereby normally synchronous operations may
occur in a non-blocking or asynchronous fashion are common to all
system-defined abstractions and are described here.

3.1.1. Read and write

The read and write system calls can be applied to communications
channels, files, terminals and devices. They have the form:

    cc = read(fd, buf, nbytes);
    result int cc; int fd; result caddr_t buf; int nbytes;

    cc = write(fd, buf, nbytes);
    result int cc; int fd; caddr_t buf; int nbytes;

The read call transfers as much data as possible from the object
defined by fd to the buffer at address buf of size nbytes. The number
of bytes transferred is returned in cc, which is -1 if a return
occurred before any data was transferred because of an error or use
of non-blocking operations.
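
Because read returns only as much data as is available, a caller that
needs a fixed amount must loop. The following is a minimal sketch of
such a loop, assuming a blocking descriptor. The helper name
read_full is hypothetical and is not part of the system interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read until nbytes have arrived or the data ends.
 * Returns the number of bytes placed in buf, or -1 on error.
 */
static int read_full(int fd, char *buf, int nbytes)
{
    int total = 0;

    assert(buf != NULL && nbytes >= 0);
    while (total < nbytes) {
        int cc = (int)read(fd, buf + total, (size_t)(nbytes - total));
        if (cc < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before any transfer: retry */
            return -1;
        }
        if (cc == 0)
            break;               /* no more data will arrive */
        total += cc;             /* partial count: ask for the remainder */
    }
    return total;
}
```

On a non-blocking descriptor the same loop would also have to treat
EWOULDBLOCK as "try again later", for example after a select call.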
The write call transfers data from the buffer to the object defined
by fd. Depending on the type of fd, it is possible that the write
call will accept some portion of the provided bytes; the user should
resubmit the other bytes in a later request in this case. Error
returns because of interrupted or otherwise incomplete operations are
possible.

Scattering of data on input or gathering of data for output is also
possible using an array of input/output vector descriptors. The type
for the descriptors is defined in <sys/uio.h> as:

    struct iovec {
        caddr_t iov_msg;   /* base of a component */
        int     iov_len;   /* length of a component */
    };

The calls using an array of descriptors are:

    cc = readv(fd, iov, iovlen);
    result int cc; int fd; struct iovec *iov; int iovlen;

    cc = writev(fd, iov, iovlen);
    result int cc; int fd; struct iovec *iov; int iovlen;

Here iovlen is the count of elements in the iov array.

3.1.2. Input/output control

Control operations on an object are performed by the ioctl operation:

    ioctl(fd, request, buffer);
    int fd, request; caddr_t buffer;

This operation causes the specified request to be performed on the
object fd. The request parameter specifies whether the argument
buffer is to be read, written, read and written, or is not needed,
and also the size of the buffer, as well as the request. Different
descriptor types and subtypes within descriptor types may use
distinct ioctl requests. For example, operations on terminals control
flushing of input and output queues and setting of terminal
parameters; operations on disks cause formatting operations to occur;
operations on tapes control tape positioning.

The names for basic control operations are defined in <sys/ioctl.h>.

3.1.3.
Non-blocking and asynchronous operations

A process that wishes to do non-blocking operations on one of its
descriptors sets the descriptor in non-blocking mode as described in
section 1.5.4. Thereafter the read call will return a specific
EWOULDBLOCK error indication if there is no data to be read. The
process may select the associated descriptor to determine when a read
is possible.

Output attempted when a descriptor can accept less than is requested
will either accept some of the provided data, returning a shorter
than normal length, or return an error indicating that the operation
would block. More output can be performed as soon as a select call
indicates the object is writeable.

Operations other than data input or output may be performed on a
descriptor in a non-blocking fashion. These operations will return
with a characteristic error indicating that they are in progress if
they cannot complete immediately. The descriptor may then be selected
for write to find out when the operation has been completed. When
select indicates the descriptor is writeable, the operation has
completed. Depending on the nature of the descriptor and the
operation, additional activity may be started or the new state may be
tested.

3.2. File system

3.2.1. Overview

The file system abstraction provides access to a hierarchical file
system structure. The file system contains directories (each of which
may contain other sub-directories) as well as files and references to
other objects such as devices and inter-process communications
sockets.

Each file is organized as a linear array of bytes. No record
boundaries or system related information is present in a file. Files
may be read and written in a random-access fashion.
The user may read the data in a directory as though it were an
ordinary file to determine the names of the contained files, but only
the system may write into the directories. The file system stores
only a small amount of ownership, protection and usage information
with a file.

3.2.2. Naming

The file system calls take path name arguments. These consist of zero
or more component file names separated by ``/'' characters, where
each file name is up to 255 ASCII characters excluding null and
``/''.

Each process always has two naming contexts: one for the root
directory of the file system and one for the current working
directory. These are used by the system in the filename translation
process. If a path name begins with a ``/'', it is called a full path
name and interpreted relative to the root directory context. If the
path name does not begin with a ``/'' it is called a relative path
name and interpreted relative to the current directory context.

The system limits the total length of a path name to 1024 characters.

The file name ``..'' in each directory refers to the parent directory
of that directory. The parent directory of the root of the file
system is always that directory.

The calls

    chdir(path);
    char *path;

    chroot(path)
    char *path;

change the current working directory and root directory context of a
process. Only the super-user can change the root directory context of
a process.

3.2.3. Creation and removal

The file system allows directories, files, special devices, and
``portals'' to be created and removed from the file system.

3.2.3.1. Directory creation and removal

A directory is created with the mkdir system call:

    mkdir(path, mode);
    char *path; int mode;

where the mode is defined as for files (see below).
Directories are removed with the rmdir system call:

    rmdir(path);
    char *path;

A directory must be empty if it is to be deleted.

3.2.3.2. File creation

Files are created with the open system call,

    fd = open(path, oflag, mode);
    result int fd; char *path; int oflag, mode;

The path parameter specifies the name of the file to be created. The
oflag parameter must include O_CREAT from below to cause the file to
be created. Bits for oflag are defined in <sys/file.h>:

    #define O_RDONLY  000     /* open for reading */
    #define O_WRONLY  001     /* open for writing */
    #define O_RDWR    002     /* open for read & write */
    #define O_NDELAY  004     /* non-blocking open */
    #define O_APPEND  010     /* append on each write */
    #define O_CREAT   01000   /* open with file create */
    #define O_TRUNC   02000   /* open with truncation */
    #define O_EXCL    04000   /* error on create if file exists */

One of O_RDONLY, O_WRONLY and O_RDWR should be specified, indicating
what types of operations are desired to be performed on the open
file. The operations will be checked against the user's access rights
to the file before allowing the open to succeed. Specifying O_APPEND
causes writes to automatically append to the file. The flag O_CREAT
causes the file to be created if it does not exist, owned by the
current user and the group of the containing directory. The
protection for the new file is specified in mode. The file mode is
used as a three digit octal number. Each digit encodes read access as
4, write access as 2 and execute access as 1, or'ed together. The
0700 bits describe owner access, the 070 bits describe the access
rights for processes in the same group as the file, and the 07 bits
describe the access rights for other processes.

If the open specifies to create the file with O_EXCL and the file
already exists, then the open will fail without affecting the file in
any way.
This provides a simple exclusive access facility. If the file exists
but is a symbolic link, the open will fail regardless of the
existence of the file specified by the link.

3.2.3.3. Creating references to devices

The file system allows entries which reference peripheral devices.
Peripherals are distinguished as block or character devices according
to their ability to support block-oriented operations. Devices are
identified by their ``major'' and ``minor'' device numbers. The major
device number determines the kind of peripheral it is, while the
minor device number indicates one of possibly many peripherals of
that kind. Structured devices have all operations performed
internally in ``block'' quantities while unstructured devices often
have a number of special ioctl operations, and may have input and
output performed in varying units. The mknod call creates special
entries:

    mknod(path, mode, dev);
    char *path; int mode, dev;

where mode is formed from the object type and access permissions. The
parameter dev is a configuration dependent parameter used to identify
specific character or block I/O devices.

3.2.3.4. Portal creation|-

The call

    fd = portal(name, server, param, dtype, protocol, domain, socktype)
    result int fd; char *name, *server, *param; int dtype, protocol;
    int domain, socktype;

places a name in the file system name space that causes connection to
a server process when the name is used. The portal call returns an
active portal in fd as though an access had occurred to activate an
inactive portal, as now described.

-----------
|- The portal call is not implemented in 4.3BSD.
When an inactive portal is accessed, the system sets up a socket of
the specified socktype in the specified communications domain (see
section 2.3), and creates the server process, giving it the specified
param as argument to help it identify the portal, and also giving it
the newly created socket as descriptor number 0. The accessor of the
portal will create a socket in the same domain and connect to the
server. The user will then wrap the socket in the specified protocol
to create an object of the required descriptor type dtype and proceed
with the operation which was in progress before the portal was
encountered.

While the server process holds the socket (which it received as fd
from the portal call on descriptor 0 at activation) further
references will result in connections being made to the same socket.

3.2.3.5. File, device, and portal removal

A reference to a file, special device or portal may be removed with
the unlink call,

    unlink(path);
    char *path;

The caller must have write access to the directory in which the file
is located for this call to be successful.

3.2.4. Reading and modifying file attributes

Detailed information about the attributes of a file may be obtained
with the calls:

    #include

    stat(path, stb);
    char *path; result struct stat *stb;

    fstat(fd, stb);
    int fd; result struct stat *stb;

The stat structure includes the file type, protection, ownership,
access times, size, and a count of hard links. If the file is a
symbolic link, then the status of the link itself (rather than the
file the link references) may be found using the lstat call:

    lstat(path, stb);
    char *path; result struct stat *stb;

A newly created file is assigned the user id of the process that
created it and the group id of the directory in which it was created.
The ownership of a file may be changed by either of the calls

    chown(path, owner, group);
    char *path; int owner, group;

    fchown(fd, owner, group);
    int fd, owner, group;

In addition to ownership, each file has three levels of access
protection associated with it. These levels are owner relative, group
relative, and global (all users and groups). Each level of access has
separate indicators for read permission, write permission, and
execute permission. The protection bits associated with a file may be
set by either of the calls:

    chmod(path, mode);
    char *path; int mode;

    fchmod(fd, mode);
    int fd, mode;

where mode is a value indicating the new protection of the file, as
listed in section 2.2.3.2.

Finally, the access and modify times on a file may be set by the
call:

    utimes(path, tvp)
    char *path; struct timeval *tvp[2];

This is particularly useful when moving files between media, to
preserve relationships between the times the file was modified.

3.2.5. Links and renaming

Links allow multiple names for a file to exist. Links exist
independently of the file linked to.

Two types of links exist, hard links and symbolic links. A hard link
is a reference counting mechanism that allows a file to have multiple
names within the same file system. Symbolic links cause string
substitution during the pathname interpretation process.

Hard links and symbolic links have different properties. A hard link
ensures the target file will always be accessible, even after its
original directory entry is removed; no such guarantee exists for a
symbolic link. Symbolic links can span file system boundaries.

The following calls create a new link, named path2, to path1:

    link(path1, path2);
    char *path1, *path2;

    symlink(path1, path2);
    char *path1, *path2;

The unlink primitive may be used to remove either type of link.
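
The difference between the two kinds of link after the original name
is removed can be demonstrated with a short sketch. This is an
illustration only. The function name link_survives_unlink and the
scratch location under /tmp are hypothetical choices, and the sketch
assumes the POSIX mkdtemp routine is available for creating a private
scratch directory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file, give it a second name (hard or symbolic link),
 * remove the original name, and test whether the second name still
 * resolves.  Returns 1 if it does, 0 if it does not, -1 on error.
 */
static int link_survives_unlink(int use_symlink)
{
    char dir[] = "/tmp/linkdemoXXXXXX";   /* hypothetical scratch location */
    char orig[64], alias[64];
    struct stat stb;
    int fd, rc, result = -1;

    assert(use_symlink == 0 || use_symlink == 1);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(orig, sizeof orig, "%s/orig", dir);
    snprintf(alias, sizeof alias, "%s/alias", dir);

    fd = open(orig, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        rc = use_symlink ? symlink(orig, alias) : link(orig, alias);
        if (rc == 0 && unlink(orig) == 0)
            result = (stat(alias, &stb) == 0) ? 1 : 0;   /* stat follows symbolic links */
    }

    /* Clean up whatever is left; failures here are ignored. */
    unlink(orig);
    unlink(alias);
    rmdir(dir);
    return result;
}
```

With a hard link the file remains reachable through the second name,
so the function is expected to return 1. With a symbolic link the
second name is left dangling and the function is expected to return
0.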
If a file is a symbolic link, the ``value'' of the link may be read
with the readlink call,

    len = readlink(path, buf, bufsize);
    result int len; char *path; result char *buf; int bufsize;

This call returns, in buf, the null-terminated string substituted
into pathnames passing through path.

Atomic renaming of file system resident objects is possible with the
rename call:

    rename(oldname, newname);
    char *oldname, *newname;

where both oldname and newname must be in the same file system. If
newname exists and is a directory, then it must be empty.

3.2.6. Extension and truncation

Files are created with zero length and may be extended simply by
writing or appending to them. While a file is open the system
maintains a pointer into the file indicating the current location in
the file associated with the descriptor. This pointer may be moved
about in the file in a random access fashion. To set the current
offset into a file, the lseek call may be used,

    oldoffset = lseek(fd, offset, type);
    result off_t oldoffset; int fd; off_t offset; int type;

where type is given in <sys/file.h> as one of:

    #define L_SET   0   /* set absolute file offset */
    #define L_INCR  1   /* set file offset relative to current position */
    #define L_XTND  2   /* set offset relative to end-of-file */

The call ``lseek(fd, 0, L_INCR)'' returns the current offset into the
file.

Files may have ``holes'' in them. Holes are void areas in the linear
extent of the file where data has never been written. These may be
created by seeking to a location in a file past the current
end-of-file and writing. Holes are treated by the system as zero
valued bytes.

A file may be truncated with either of the calls:

    truncate(path, length);
    char *path; int length;

    ftruncate(fd, length);
    int fd, length;

reducing the size of the specified file to length bytes.

3.2.7.
Checking accessibility

A process running with different real and effective user ids may
interrogate the accessibility of a file to the real user by using the
access call:

    accessible = access(path, how);
    result int accessible; char *path; int how;

Here how is constructed by or'ing the following bits, defined in
<sys/file.h>:

    #define F_OK  0   /* file exists */
    #define X_OK  1   /* file is executable */
    #define W_OK  2   /* file is writable */
    #define R_OK  4   /* file is readable */

The presence or absence of advisory locks does not affect the result
of access.

3.2.8. Locking

The file system provides basic facilities that allow cooperating
processes to synchronize their access to shared files. A process may
place an advisory read or write lock on a file, so that other
cooperating processes may avoid interfering with the process' access.
This simple mechanism provides locking with file granularity. More
granular locking can be built using the IPC facilities to provide a
lock manager. The system does not force processes to obey the locks;
they are of an advisory nature only.

Locking is performed after an open call by applying the flock
primitive,

    flock(fd, how);
    int fd, how;

where the how parameter is formed from bits defined in <sys/file.h>:

    #define LOCK_SH  1   /* shared lock */
    #define LOCK_EX  2   /* exclusive lock */
    #define LOCK_NB  4   /* don't block when locking */
    #define LOCK_UN  8   /* unlock */

Successive lock calls may be used to increase or decrease the level
of locking. If an object is currently locked by another process when
a flock call is made, the caller will be blocked until the current
lock owner releases the lock; this may be avoided by including
LOCK_NB in the how parameter. Specifying LOCK_UN removes all locks
associated with the descriptor.
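
The use of LOCK_NB and LOCK_UN described above can be sketched as
follows. The sketch is illustrative only: the function names and the
scratch file under /tmp are hypothetical, and it assumes that two
separate open calls on the same file behave as two independent lock
holders, as they do on current BSD-derived and Linux systems.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Try for an exclusive lock without blocking: 1 = held, 0 = busy, -1 = error. */
static int try_lock_exclusive(int fd)
{
    assert(fd >= 0);
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return (errno == EWOULDBLOCK) ? 0 : -1;
}

/*
 * Open one scratch file twice and contend for its lock.
 * Returns 1 if the second opener is refused while the first holds
 * the lock and is granted it after LOCK_UN; 0 otherwise; -1 on error.
 */
static int demo_exclusive_lock(void)
{
    char path[] = "/tmp/flockdemoXXXXXX";   /* hypothetical scratch file */
    int a, b, held, refused, granted;

    a = mkstemp(path);
    if (a < 0)
        return -1;
    b = open(path, O_RDWR);
    if (b < 0) {
        close(a);
        unlink(path);
        return -1;
    }

    held = try_lock_exclusive(a);      /* first opener takes the lock */
    refused = try_lock_exclusive(b);   /* second opener finds it busy */
    flock(a, LOCK_UN);                 /* first opener releases */
    granted = try_lock_exclusive(b);   /* second opener now succeeds */

    close(a);
    close(b);
    unlink(path);
    return held == 1 && refused == 0 && granted == 1;
}
```

Because the locks are advisory, nothing stops either opener from
reading or writing the file without holding the lock; the sketch only
shows how cooperating callers would coordinate.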
Advisory locks held by a process are automatically deleted when the
process terminates.

3.2.9. Disk quotas

As an optional facility, each file system may be requested to
impose limits on a user's disk usage. Two quantities are limited:
the total amount of disk space that a user may allocate in a file
system and the total number of files a user may create in a file
system. Quotas are expressed as hard limits and soft limits. A hard
limit is always imposed; if a user would exceed a hard limit, the
operation that caused the resource request will fail. A soft limit
results in the user receiving a warning message, but with
allocation succeeding. Facilities are provided to turn soft limits
into hard limits if a user has exceeded a soft limit for an
unreasonable period of time.

To enable disk quotas on a file system the setquota call is used:

     setquota(special, file);
     char *special, *file;

where special refers to a structured device file where a mounted
file system exists, and file refers to a disk quota file (residing
on the file system associated with special) from which user quotas
should be obtained. The format of the disk quota file is
implementation dependent.

To manipulate disk quotas the quota call is provided:

     #include <sys/quota.h>

     quota(cmd, uid, arg, addr);
     int cmd, uid, arg; caddr_t addr;

The indicated cmd is applied to the user ID uid. The parameters arg
and addr are command specific. The file <sys/quota.h> contains
definitions pertinent to the use of this call.

3.3. Interprocess communications

3.3.1. Interprocess communication primitives

3.3.1.1. Communication domains

The system provides access to an extensible set of communication
domains.
A communication domain is identified by a manifest constant defined
in the file <sys/socket.h>. Important standard domains supported by
the system are the ``unix'' domain, AF_UNIX, for communication
within the system, the ``Internet'' domain, AF_INET, for
communication in the DARPA Internet, and the ``NS'' domain, AF_NS,
for communication using the Xerox Network Systems protocols. Other
domains can be added to the system.

3.3.1.2. Socket types and protocols

Within a domain, communication takes place between communication
endpoints known as sockets. Each socket has the potential to
exchange information with other sockets of an appropriate type
within the domain.

Each socket has an associated abstract type, which describes the
semantics of communication using that socket. Properties such as
reliability, ordering, and prevention of duplication of messages
are determined by the type. The basic set of socket types is
defined in <sys/socket.h>:

     /* Standard socket types */
     #define SOCK_STREAM     1   /* virtual circuit */
     #define SOCK_DGRAM      2   /* datagram */
     #define SOCK_RAW        3   /* raw socket */
     #define SOCK_RDM        4   /* reliably-delivered message */
     #define SOCK_SEQPACKET  5   /* sequenced packets */

The SOCK_DGRAM type models the semantics of datagrams in network
communication: messages may be lost or duplicated and may arrive
out-of-order. A datagram socket may send messages to and receive
messages from multiple peers. The SOCK_RDM type models the
semantics of reliable datagrams: messages arrive unduplicated and
in-order, and the sender is notified if messages are lost. The send
and receive operations (described below) generate reliable or
unreliable datagrams. The SOCK_STREAM type models connection-based
virtual circuits: two-way byte streams with no record boundaries.
Connection setup is required before data communication may begin.
The SOCK_SEQPACKET type models a connection-based, full-duplex,
reliable, sequenced packet exchange; the sender is notified if
messages are lost, and messages are never duplicated or presented
out-of-order. Users of the last two abstractions may use the
facilities for out-of-band transmission to send out-of-band data.

SOCK_RAW is used for unprocessed access to internal network layers
and interfaces; it has no specific semantics.

Other socket types can be defined.

Each socket may have a specific protocol associated with it. This
protocol is used within the domain to provide the semantics
required by the socket type. Not all socket types are supported by
each domain; support depends on the existence and the
implementation of a suitable protocol within the domain. For
example, within the ``Internet'' domain, the SOCK_DGRAM type may be
implemented by the UDP user datagram protocol, and the SOCK_STREAM
type may be implemented by the TCP transmission control protocol,
while no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET
sockets exist.

3.3.1.3. Socket creation, naming and service establishment

Sockets may be connected or unconnected. An unconnected socket
descriptor is obtained by the socket call:

     s = socket(domain, type, protocol);
     result int s; int domain, type, protocol;

The socket domain and type are as described above, and are
specified using the definitions from <sys/socket.h>. The protocol
may be given as 0, meaning any suitable protocol. One of several
possible protocols may be selected using identifiers obtained from
a library routine, getprotobyname.
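
As an illustration of the socket call, the following sketch creates
an unconnected Internet-domain socket, either letting the system
choose the protocol (0) or looking one up by name with
getprotobyname. The helper name open_inet_socket is hypothetical,
and the sketch assumes a system that provides <netdb.h> and a
protocol database for the lookup.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unconnected AF_INET socket of the
 * given type. If proto_name is NULL, protocol 0 asks the system for
 * any suitable protocol; otherwise the protocol number is obtained
 * from getprotobyname. Returns the descriptor, or -1 on failure.
 */
int
open_inet_socket(int type, const char *proto_name)
{
    int protocol = 0;

    /* Socket types in <sys/socket.h> are positive constants. */
    assert(type > 0);
    if (proto_name != NULL) {
        struct protoent *pe = getprotobyname(proto_name);

        if (pe == NULL)
            return -1;      /* name not in the protocol database */
        protocol = pe->p_proto;
    }
    return socket(AF_INET, type, protocol);
}
```

For example, open_inet_socket(SOCK_DGRAM, NULL) requests a datagram
socket with the default protocol, while passing a name such as
"udp" selects the protocol explicitly when the database lists it.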
An unconnected socket descriptor of a connection-oriented type may
yield a connected socket descriptor in one of two ways: either by
actively connecting to another socket, or by becoming associated
with a name in the communications domain and accepting a connection
from another socket. Datagram sockets need not establish
connections before use.

To accept connections or to receive datagrams, a socket must first
have a binding to a name (or address) within the communications
domain. Such a binding may be established by a bind call:

     bind(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Datagram sockets may have default bindings established when first
sending data if not explicitly bound earlier. In either case, a
socket's bound name may be retrieved with a getsockname call:

     getsockname(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

while the peer's name can be retrieved with getpeername:

     getpeername(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

Domains may support sockets with several names.

3.3.1.4. Accepting connections

Once a binding is made to a connection-oriented socket, it is
possible to listen for connections:

     listen(s, backlog);
     int s, backlog;

The backlog specifies the maximum count of connections that can be
simultaneously queued awaiting acceptance.

An accept call:

     t = accept(s, name, anamelen);
     result int t; int s; result struct sockaddr *name;
     result int *anamelen;

returns a descriptor for a new, connected, socket from the queue of
pending connections on s. If no new connections are queued for
acceptance, the call will wait for a connection unless non-blocking
I/O has been enabled.

3.3.1.5.
Making connections

An active connection to a named socket is made by the connect call:

     connect(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Although datagram sockets do not establish connections, the connect
call may be used with such sockets to create an association with
the foreign address. The address is recorded for use in future send
calls, which then need not supply destination addresses. Datagrams
will be received only from that peer, and asynchronous error
reports may be received.

It is also possible to create connected pairs of sockets without
using the domain's name space to rendezvous; this is done with the
socketpair call:†

     socketpair(domain, type, protocol, sv);
     int domain, type, protocol; result int sv[2];

Here the returned sv descriptors correspond to those obtained with
accept and connect.

The call

     pipe(pv)
     result int pv[2];

creates a pair of SOCK_STREAM sockets in the UNIX domain, with
pv[0] only readable and pv[1] only writable.

-----------
† 4.3BSD supports socketpair creation only in the ``unix''
communication domain.

3.3.1.6. Sending and receiving data

Messages may be sent from a socket by:

     cc = sendto(s, buf, len, flags, to, tolen);
     result int cc; int s; caddr_t buf; int len, flags;
     caddr_t to; int tolen;

if the socket is not connected or:

     cc = send(s, buf, len, flags);
     result int cc; int s; caddr_t buf; int len, flags;

if the socket is connected.
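
A connected pair obtained from socketpair can be exercised with
send on one end and recv, the receive counterpart of send, on the
other. The following is a minimal sketch under stated assumptions:
the helper name echo_through_pair is hypothetical, the pair is
created in the UNIX domain, and the message is assumed to be short
enough to arrive in a single recv.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send `msg` through one end of a connected
 * UNIX-domain stream pair and read it back from the other end into
 * `out`, which has room for `outlen` bytes.
 * Returns the number of bytes received, or -1 on error.
 */
int
echo_through_pair(const char *msg, char *out, int outlen)
{
    int sv[2];
    int n = -1;
    size_t len;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* The pair is already connected, so no destination is given. */
    if (send(sv[0], msg, len, 0) == (ssize_t)len)
        n = (int)recv(sv[1], out, (size_t)outlen, 0);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

No bind, listen, accept, or connect call appears here: socketpair
returns both descriptors already connected, which is what allows
the rendezvous to bypass the domain's name space.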
The corresponding receive primitives are:

     msglen = recvfrom(s, buf, len, flags, from, fromlenaddr);
     result int msglen; int s; result caddr_t buf; int len, flags;
     result caddr_t from; result int *fromlenaddr;

and

     msglen = recv(s, buf, len, flags);
     result int msglen; int s; result caddr_t buf; int len, flags;

In the unconnected case, the parameters to and tolen specify the
destination of the message, while the from parameter stores the
source of the message, and *fromlenaddr initially gives the size of
the from buffer and is updated to reflect the true length of the
from address.

All calls cause the message to be received in or sent from the
message buffer of length len bytes, starting at address buf. The
flags specify peeking at a message without reading it or sending or
receiving high-priority out-of-band messages, as follows:

     #define MSG_OOB     0x1   /* process out-of-band data */
     #define MSG_PEEK    0x2   /* peek at incoming message */

3.3.1.7. Scatter/gather and exchanging access rights

It is possible to scatter and gather data and to exchange access
rights with messages. When either of these operations is involved,
the number of parameters to the call becomes large. Thus the system
defines a message header structure, in <sys/socket.h>, which can be
used to conveniently contain the parameters to the calls:

     struct msghdr {
          caddr_t msg_name;          /* optional address */
          int     msg_namelen;       /* size of address */
          struct  iovec *msg_iov;    /* scatter/gather array */
          int     msg_iovlen;        /* # elements in msg_iov */
          caddr_t msg_accrights;     /* access rights sent/received */
          int     msg_accrightslen;  /* size of msg_accrights */
     };

Here msg_name and msg_namelen specify the source or destination
address if the socket is unconnected; msg_name may be given as a
null pointer if no names are desired or required.
The msg_iov and msg_iovlen describe the scatter/gather locations,
as described in section 2.1.3. Access rights to be sent along with
the message are specified in msg_accrights, which has length
msg_accrightslen. In the ``unix'' domain these are an array of
integer descriptors, taken from the sending process and duplicated
in the receiver.

This structure is used in the operations sendmsg and recvmsg:

     sendmsg(s, msg, flags);
     int s; struct msghdr *msg; int flags;

     msglen = recvmsg(s, msg, flags);
     result int msglen; int s; result struct msghdr *msg; int flags;

3.3.1.8. Using read and write with sockets

The normal UNIX read and write calls may be applied to connected
sockets; they are translated into send and receive calls to or from
a single area of memory, discarding any rights received. A process
may operate on a virtual circuit socket, a terminal or a file with
blocking or non-blocking input/output operations without
distinguishing the descriptor type.

3.3.1.9. Shutting down halves of full-duplex connections

A process that has a full-duplex socket such as a virtual circuit
and no longer wishes to read from or write to this socket can give
the call:

     shutdown(s, direction);
     int s, direction;

where direction is 0 to not read further, 1 to not write further,
or 2 to completely shut the connection down. If the underlying
protocol supports unidirectional or bidirectional shutdown, this
indication will be passed to the peer. For example, a shutdown for
writing might produce an end-of-file condition at the remote end.

3.3.1.10. Socket and protocol options

Sockets, and their underlying communication protocols, may support
options. These options may be used to manipulate implementation- or
protocol-specific facilities.
The getsockopt and setsockopt calls are used to control options:

     getsockopt(s, level, optname, optval, optlen);
     int s, level, optname; result caddr_t optval;
     result int *optlen;

     setsockopt(s, level, optname, optval, optlen);
     int s, level, optname; caddr_t optval; int optlen;

The option optname is interpreted at the indicated protocol level
for socket s. If a value is specified with optval and optlen, it is
interpreted by the software operating at the specified level. The
level SOL_SOCKET is reserved to indicate options maintained by the
socket facilities. Other level values indicate a particular
protocol which is to act on the option request; these values are
normally interpreted as a ``protocol number''.

3.3.2. UNIX domain

This section describes briefly the properties of the UNIX
communications domain.

3.3.2.1. Types of sockets

In the UNIX domain, the SOCK_STREAM abstraction provides pipe-like
facilities, while SOCK_DGRAM provides (usually) reliable
message-style communications.

3.3.2.2. Naming

Socket names are strings and may appear in the UNIX file system
name space through portals.†

-----------
† The 4.3BSD implementation of the UNIX domain embeds bound sockets
in the UNIX file system name space; this may change in future
releases.

3.3.2.3. Access rights transmission

The ability to pass UNIX descriptors with messages in this domain
allows migration of service within the system and allows user
processes to be used in building system facilities.

3.3.3. INTERNET domain

This section describes briefly how the Internet domain is mapped to
the model described in this section. More information will be found
in the document describing the network implementation in 4.3BSD.

3.3.3.1.
Socket types and protocols

SOCK_STREAM is supported by the Internet TCP protocol; SOCK_DGRAM
by the UDP protocol. Each is layered atop the network-level
Internet Protocol (IP). The Internet Control Message Protocol is
implemented atop/beside IP and is accessible via a raw socket. The
SOCK_SEQPACKET type has no direct Internet family analogue; a
protocol based on one from the XEROX NS family and layered on top
of IP could be implemented to fill this gap.

3.3.3.2. Socket naming

Sockets in the Internet domain have names composed of a 32-bit
Internet address and a 16-bit port number. Options may be used to
provide IP source routing or security options. The 32-bit address
is composed of network and host parts; the network part is variable
in size and is frequency encoded. The host part may optionally be
interpreted as a subnet field plus the host on subnet; this is
enabled by setting a network address mask at boot time.

3.3.3.3. Access rights transmission

No access rights transmission facilities are provided in the
Internet domain.

3.3.3.4. Raw access

The Internet domain allows the super-user access to the raw
facilities of IP. These interfaces are modeled as SOCK_RAW sockets.
Each raw socket is associated with one IP protocol number, and
receives all traffic received for that protocol. This allows
administrative and debugging functions to occur, and enables
user-level implementations of special-purpose protocols such as
inter-gateway routing protocols.

3.4. Terminals and Devices

3.4.1. Terminals

Terminals support read and write I/O operations, as well as a
collection of terminal specific ioctl operations, to control input
character interpretation and editing, and output format and delays.

3.4.1.1.
Terminal input

Terminals are handled according to the underlying communication
characteristics such as baud rate and required delays, and a set of
software parameters.

3.4.1.1.1. Input modes

A terminal is in one of three possible modes: raw, cbreak, or
cooked. In raw mode all input is passed through to the reading
process immediately and without interpretation. In cbreak mode, the
handler interprets input only by looking for characters that cause
interrupts or output flow control; all other characters are made
available as in raw mode. In cooked mode, input is processed to
provide standard line-oriented local editing functions, and input
is presented on a line-by-line basis.

3.4.1.1.2. Interrupt characters

Interrupt characters are interpreted by the terminal handler only
in cbreak and cooked modes, and cause a software interrupt to be
sent to all processes in the process group associated with the
terminal. Interrupt characters exist to send SIGINT and SIGQUIT
signals, and to stop a process group with the SIGTSTP signal either
immediately, or when all input up to the stop character has been
read.

3.4.1.1.3. Line editing

When the terminal is in cooked mode, editing of an input line is
performed. Editing facilities allow deletion of the previous
character or word, or deletion of the current input line. In
addition, a special character may be used to reprint the current
input line after some number of editing operations have been
applied.

Certain other characters are interpreted specially when a terminal
is in cooked mode. The end of line character determines the end of
an input record. The end of file character simulates an end of file
occurrence on terminal input. Flow control is provided by stop
output and start output control characters.
Output may be flushed with the flush output character; and a
literal character may be used to force literal input of the
immediately following character in the input line.

Input characters may be echoed to the terminal as they are
received. Non-graphic ASCII input characters may be echoed as a
two-character printable representation, ``^character.''

3.4.1.2. Terminal output

On output, the terminal handler provides some simple formatting
services. These include converting the newline character to the
two-character return-linefeed sequence, inserting delays after
certain standard control characters, expanding tabs, and providing
translations for upper-case only terminals.

3.4.1.3. Terminal control operations

When a terminal is first opened it is initialized to a standard
state and configured with a set of standard control, editing, and
interrupt characters. A process may alter this configuration with
certain control operations, specifying parameters in a standard
structure:†

     struct ttymode {
          short   tt_ispeed;   /* input speed */
          int     tt_iflags;   /* input flags */
          short   tt_ospeed;   /* output speed */
          int     tt_oflags;   /* output flags */
     };

-----------
† The control interface described here is an internal interface
only in 4.3BSD. Future releases will probably use a modified
interface based on currently-proposed standards.

``Special characters'' are specified with the ttychars structure:
     struct ttychars {
          char    tc_erasec;   /* erase char */
          char    tc_killc;    /* erase line */
          char    tc_intrc;    /* interrupt */
          char    tc_quitc;    /* quit */
          char    tc_startc;   /* start output */
          char    tc_stopc;    /* stop output */
          char    tc_eofc;     /* end-of-file */
          char    tc_brkc;     /* input delimiter (like nl) */
          char    tc_suspc;    /* stop process signal */
          char    tc_dsuspc;   /* delayed stop process signal */
          char    tc_rprntc;   /* reprint line */
          char    tc_flushc;   /* flush output (toggles) */
          char    tc_werasc;   /* word erase */
          char    tc_lnextc;   /* literal next character */
     };

3.4.1.4. Terminal hardware support

The terminal handler allows a user to access basic hardware related
functions; e.g., line speed, modem control, parity, and stop bits.
A special signal, SIGHUP, is automatically sent to processes in a
terminal's process group when a carrier transition is detected.
This is normally associated with a user hanging up on a modem
controlled terminal line.

3.4.2. Structured devices

Structured devices are typified by disks and magnetic tapes, but
may represent any random-access device. The system performs
read-modify-write type buffering actions on block devices to allow
them to be read and written in a totally random access fashion like
ordinary files. File systems are normally created on block devices.

3.4.3. Unstructured devices

Unstructured devices are those devices that do not support block
structure. Familiar unstructured devices are raw communications
lines (with no terminal handler), raster plotters, and magnetic
tapes and disks unfettered by buffering and permitting large block
input/output and positioning and formatting commands.

3.5. Process and kernel descriptors

The status of the facilities in this section is still under
discussion. The ptrace facility of earlier UNIX systems remains in
4.3BSD.
Planned enhancements would allow a descriptor-based process control
facility.

I. Summary of facilities

1. Kernel primitives

1.1. Process naming and protection

     sethostid       set UNIX host id
     gethostid       get UNIX host id
     sethostname     set UNIX host name
     gethostname     get UNIX host name
     getpid          get process id
     fork            create new process
     exit            terminate a process
     execve          execute a different process
     getuid          get user id
     geteuid         get effective user id
     setreuid        set real and effective user id's
     getgid          get accounting group id
     getegid         get effective accounting group id
     getgroups       get access group set
     setregid        set real and effective group id's
     setgroups       set access group set
     getpgrp         get process group
     setpgrp         set process group

1.2. Memory management

                     memory management definitions
     sbrk            change data section size
     sstk†           change stack section size
     getpagesize     get memory page size
     mmap†           map pages of memory
     msync†          flush modified mapped pages to filesystem
     munmap†         unmap memory
     mprotect†       change protection of pages
     madvise†        give memory management advice
     mincore†        determine core residency of pages
     msleep†         sleep on a lock
     mwakeup†        wakeup process sleeping on a lock

1.3. Signals

                     signal definitions
     sigvec          set handler for signal
     kill            send signal to process
     killpgrp        send signal to process group
     sigblock        block set of signals
     sigsetmask      restore set of blocked signals
     sigpause        wait for signals
     sigstack        set software stack for signals

-----------
† Not supported in 4.3BSD.
1.4. Timing and statistics

                     time-related definitions
     gettimeofday    get current time and timezone
     settimeofday    set current time and timezone
     getitimer       read an interval timer
     setitimer       get and set an interval timer
     profil          profile process

1.5. Descriptors

     getdtablesize   descriptor reference table size
     dup             duplicate descriptor
     dup2            duplicate to specified index
     close           close descriptor
     select          multiplex input/output
     fcntl           control descriptor options
     wrap†           wrap descriptor with protocol

1.6. Resource controls

                     resource-related definitions
     getpriority     get process priority
     setpriority     set process priority
     getrusage       get resource usage
     getrlimit       get resource limitations
     setrlimit       set resource limitations

1.7. System operation support

     mount           mount a device file system
     swapon          add a swap device
     umount          unmount a file system
     sync            flush system caches
     reboot          reboot a machine
     acct            specify accounting file

2. System facilities

2.1. Generic operations

     read            read data
     write           write data
                     scatter-gather related definitions
     readv           scattered data input
     writev          gathered data output
                     standard control operations
     ioctl           device control operation

-----------
† Not supported in 4.3BSD.

2.2. File system

Operations marked with a * exist in two forms: as shown, operating
on a file name, and operating on a file descriptor, when the name
is preceded with a ``f''.
                     file system definitions
     chdir           change directory
     chroot          change root directory
     mkdir           make a directory
     rmdir           remove a directory
     open            open a new or existing file
     mknod           make a special file
     portal†         make a portal entry
     unlink          remove a link
     stat*           return status for a file
     lstat           return status of link
     chown*          change owner
     chmod*          change mode
     utimes          change access/modify times
     link            make a hard link
     symlink         make a symbolic link
     readlink        read contents of symbolic link
     rename          change name of file
     lseek           reposition within file
     truncate*       truncate file
     access          determine accessibility
     flock           lock a file

2.3. Communications

                     standard definitions
     socket          create socket
     bind            bind socket to name
     getsockname     get socket name
     listen          allow queuing of connections
     accept          accept a connection
     connect         connect to peer socket
     socketpair      create pair of connected sockets
     sendto          send data to named socket
     send            send data to connected socket
     recvfrom        receive data on unconnected socket
     recv            receive data on connected socket
     sendmsg         send gathered data and/or rights
     recvmsg         receive scattered data and/or rights
     shutdown        partially close full-duplex connection
     getsockopt      get socket option
     setsockopt      set socket option

2.4. Terminals, block and character devices

2.5. Processes and kernel hooks