A Pageable Memory Based Filesystem

Marshall Kirk McKusick
Michael J. Karels
Keith Bostic

Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, California 94720

email: mckusick@cs.Berkeley.EDU
telephone: 415-642-4948

ABSTRACT

This paper describes the motivations for memory-based filesystems.  It compares techniques used to implement them and describes the drawbacks of using dedicated memory to support such filesystems.  To avoid the drawbacks of using dedicated memory, it discusses building a simple memory-based filesystem in pageable memory.  It details the performance characteristics of this filesystem and concludes with areas for future work.

1. Introduction

This paper describes the motivation for and implementation of a memory-based filesystem.  Memory-based filesystems have existed for a long time; they have generally been marketed as RAM disks or sometimes as software packages that use the machine's general purpose memory. [White1980a]

A RAM disk is designed to appear like any other disk peripheral connected to a machine.  It is normally interfaced to the processor through the I/O bus and is accessed through a device driver similar or sometimes identical to the device driver used for a normal magnetic disk.  The device driver sends requests for blocks of data to the device and the requested data is then DMA'ed to or from the requested block.  Instead of storing its data on a rotating magnetic disk, the RAM disk stores its data in a large array of random access memory or bubble memory.  Thus, the latency of accessing the RAM disk is nearly zero compared to the 15-50 milliseconds of latency incurred when accessing rotating magnetic media.  RAM disks also have the benefit of being able to transfer data at the maximum DMA rate of the system, while disks are typically limited by the rate at which the data passes under the disk head.

Software packages simulating RAM disks operate by allocating a fixed partition of the system memory.  The software then provides a device driver interface similar to the one described for hardware RAM disks, except that it uses memory-to-memory copy instead of DMA to move the data between the RAM disk and the system buffers, or it maps the contents of the RAM disk into the system buffers.  Because the memory used by the RAM disk is not available for other purposes, software RAM-disk solutions are used primarily for machines with limited addressing capabilities such as PC's that do not have an effective way of using the extra memory anyway.

Most software RAM disks lose their contents when the system is powered down or rebooted.  The contents can be saved by using battery backed-up memory, by storing critical filesystem data structures in the filesystem, and by running a consistency check program after each reboot.  These conditions increase the hardware cost and potentially slow down the speed of the disk.  Thus, RAM-disk filesystems are not typically designed to survive power failures; because of their volatility, their usefulness is limited to transient or easily recreated information such as might be found in /tmp.  Their primary benefit is that they have higher throughput than disk-based filesystems. [Smith1981a]  This improved throughput is particularly useful for utilities that make heavy use of temporary files, such as compilers.
On fast processors, nearly half of the elapsed time for a compilation is spent waiting for synchronous operations required for file creation and deletion.  The use of the memory-based filesystem nearly eliminates this waiting time.

Using dedicated memory to exclusively support a RAM disk is a poor use of resources.  The overall throughput of the system can be improved by using the memory where it is getting the highest access rate.  These needs may shift between supporting process virtual address spaces and caching frequently used disk blocks.  If the memory is dedicated to the filesystem, it is better used in a buffer cache.  The buffer cache permits faster access to the data because it requires only a single memory-to-memory copy from the kernel to the user process.  The memory used in a RAM-disk configuration may require two memory-to-memory copies, one from the RAM disk to the buffer cache, then another copy from the buffer cache to the user process.

The new work being presented in this paper is building a prototype RAM-disk filesystem in pageable memory instead of dedicated memory.  The goal is to provide the speed benefits of a RAM disk without paying the performance penalty inherent in dedicating part of the physical memory on the machine to the RAM disk.  By building the filesystem in pageable memory, it competes with other processes for the available memory.  When memory runs short, the paging system pushes its least-recently-used pages to backing store.  Being pageable also allows the filesystem to be much larger than would be practical if it were limited by the amount of physical memory that could be dedicated to that purpose.  We typically operate our /tmp with 30 to 60 megabytes of space, which is larger than the amount of memory on the machine.  This configuration allows small files to be accessed quickly, while still allowing /tmp to be used for big files, although at a speed more typical of normal, disk-based filesystems.

An alternative to building a memory-based filesystem would be to have a filesystem that never did operations synchronously and never flushed its dirty buffers to disk.  However, we believe that such a filesystem would either use a disproportionately large percentage of the buffer cache space, to the detriment of other filesystems, or would require the paging system to flush its dirty pages.  Waiting for other filesystems to push dirty pages subjects them to delays while waiting for the pages to be written.  We await the results of others trying this approach. [Ohta1990a]

2. Implementation

The current implementation took less time to write than did this paper.  It consists of 560 lines of kernel code (1.7K text + data) and some minor modifications to the program that builds disk-based filesystems, newfs.  A condensed version of the kernel code for the memory-based filesystem is reproduced in Appendix 1.

A filesystem is created by invoking the modified newfs with an option telling it to create a memory-based filesystem.  It allocates a section of virtual address space of the requested size and builds a filesystem in the memory instead of on a disk partition.  When built, it does a mount system call specifying a filesystem type of MFS (Memory File System).  The auxiliary data parameter to the mount call specifies a pointer to the base of the memory in which it has built the filesystem.  (The auxiliary data parameter used by the local filesystem, ufs, specifies the block device containing the filesystem.)
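The following sketch shows what the supporting-process side of such a mount might look like, using the mfs_args structure from Appendix 1.  It is illustrative only: the mount system call convention and the MOUNT_MFS type constant are assumptions, and the real modified newfs also lays out the superblock, inodes, and data blocks in the allocated memory before mounting.

#include <sys/types.h>
#include <sys/mount.h>

/*
 * Illustrative sketch (not the actual newfs code) of mounting a
 * memory-based filesystem from the supporting process.  The struct
 * mfs_args is the one shown in Appendix 1; the mount(2) interface
 * and the MOUNT_MFS constant are assumed here for illustration.
 */
int
mount_mfs(dir, size)
        char *dir;
        u_long size;
{
        struct mfs_args args;
        extern char *malloc();

        /* Allocate pageable virtual address space to hold the filesystem. */
        args.base = (caddr_t)malloc(size);
        if (args.base == 0)
                return (-1);
        args.size = size;
        args.name = "mfs:";     /* name to export for statfs */

        /* ... build the filesystem image at args.base, as newfs does ... */

        /*
         * The MFS mount does not return until the filesystem is
         * unmounted; this process remains in the kernel servicing
         * the I/O requests posted for the filesystem.
         */
        return (mount(MOUNT_MFS, dir, 0, (caddr_t)&args));
}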
The mount system call allocates and initializes a mount table entry and then calls the filesystem-specific mount routine.  The filesystem-specific routine is responsible for doing the mount and initializing the filesystem-specific portion of the mount table entry.  The memory-based filesystem-specific mount routine, mfs_mount(), is shown in Appendix 1.  It allocates a block-device vnode to represent the memory disk device.  In the private area of this vnode it stores the base address of the filesystem and the process identifier of the newfs process for later reference when doing I/O.  It also initializes an I/O list that it uses to record outstanding I/O requests.  It can then call the ufs filesystem mount routine, passing the special block-device vnode that it has created instead of the usual disk block-device vnode.  The mount proceeds just as any other local mount, except that requests to read from the block device are vectored through mfs_strategy() (described below) instead of the usual spec_strategy() block device I/O function.  When the mount is completed, mfs_mount() does not return as most other filesystem mount functions do; instead it sleeps in the kernel awaiting I/O requests.  Each time an I/O request is posted for the filesystem, a wakeup is issued for the corresponding newfs process.  When awakened, the process checks for requests on its buffer list.  A read request is serviced by copying data from the section of the newfs address space corresponding to the requested disk block to the kernel buffer.  Similarly, a write request is serviced by copying data to the section of the newfs address space corresponding to the requested disk block from the kernel buffer.  When all the requests have been serviced, the newfs process returns to sleep to await more requests.

Once mounted, all operations on files in the memory-based filesystem are handled by the ufs filesystem code until they get to the point where the filesystem needs to do I/O on the device.  Here, the filesystem encounters the second piece of the memory-based filesystem.  Instead of calling the special-device strategy routine, it calls the memory-based strategy routine, mfs_strategy().  Usually, the request is serviced by linking the buffer onto the I/O list for the memory-based filesystem vnode and sending a wakeup to the newfs process.  This wakeup results in a context switch to the newfs process, which does a copyin or copyout as described above.  The strategy routine must be careful to check whether the I/O request is coming from the newfs process itself, however.  Such requests happen during mount and unmount operations, when the kernel is reading and writing the superblock.  Here, mfs_strategy() must do the I/O itself to avoid deadlock.

The final piece of kernel code to support the memory-based filesystem is the close routine.  After the filesystem has been successfully unmounted, the device close routine is called.  For a memory-based filesystem, the device close routine is mfs_close().  This routine flushes any pending I/O requests, then sets the I/O list head to a special value that is recognized by the I/O servicing loop in mfs_mount() as an indication that the filesystem is unmounted.  The mfs_mount() routine exits, in turn causing the newfs process to exit, resulting in the filesystem vanishing in a cloud of dirty pages.
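The per-block copy just described is also needed when the close routine drains its pending requests; the condensed listing in Appendix 1 performs it there through a small helper, mfs_doio(), that is called but not reproduced.  A minimal sketch of such a helper, assuming the same buffer fields and the biodone() completion used by the inline copy code in mfs_mount() and mfs_strategy(), might look like this:

/*
 * Sketch of the I/O helper referenced from mfs_close() in Appendix 1;
 * this version simply mirrors the inline copy code in mfs_mount() and
 * mfs_strategy().  Whether the real helper also calls biodone() is an
 * assumption here.
 */
mfs_doio(bp, base)
        struct buf *bp;
        caddr_t base;
{
        caddr_t offset;

        /* Locate the requested block within the memory filesystem. */
        offset = base + (bp->b_blkno * DEV_BSIZE);
        if (bp->b_flags & B_READ)
                /* read: copy from the filesystem memory to the kernel buffer */
                copyin(offset, bp->b_un.b_addr, bp->b_bcount);
        else    /* write: copy from the kernel buffer to the filesystem memory */
                copyout(bp->b_un.b_addr, offset, bp->b_bcount);
        biodone(bp);
}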
The paging of the filesystem does not require any additional code beyond that already in the kernel to support virtual memory.  The newfs process competes with other processes on an equal basis for the machine's available memory.  Data pages of the filesystem that have not yet been used are zero-fill-on-demand pages that do not occupy memory, although they currently allocate space in backing store.  As long as memory is plentiful, the entire contents of the filesystem remain memory resident.  When memory runs short, the oldest pages of newfs will be pushed to backing store as part of the normal paging activity.  The pages that are pushed usually hold the contents of files that have been created in the memory-based filesystem but have not been recently accessed (or have been deleted). [Leffler1989a]

3. Performance

The performance of the current memory-based filesystem is determined by the memory-to-memory copy speed of the processor.  Empirically we find that the throughput is about 45% of this memory-to-memory copy speed.  The basic set of steps for each block written is:

1) memory-to-memory copy from the user process doing the write to a kernel buffer

2) context switch to the newfs process

3) memory-to-memory copy from the kernel buffer to the newfs address space

4) context switch back to the writing process

Thus each write requires at least two memory-to-memory copies, accounting for about 90% of the CPU time.  The remaining 10% is consumed in the context switches and the filesystem allocation and block location code.  The actual context switch count is really only about half of the worst case outlined above because read-ahead and write-behind allow multiple blocks to be handled with each context switch.

On the six-MIPS CCI Power 6/32 machine, the raw reading and writing speed is only about twice that of a regular disk-based filesystem.  However, for processes that create and delete many files, the speedup is considerably greater.  The reason for the speedup is that the filesystem must do two synchronous operations to create a file, first writing the allocated inode to disk, then creating the directory entry.  Deleting a file similarly requires at least two synchronous operations.  Here, the low latency of the memory-based filesystem is noticeable compared to the disk-based filesystem, as a synchronous operation can be done with just two context switches instead of incurring the disk latency.

4. Future Work

The most obvious shortcoming of the current implementation is that filesystem blocks are copied twice, once between the newfs process' address space and the kernel buffer cache, and once between the kernel buffer and the requesting process.  These copies are done in different process contexts, necessitating two context switches per group of I/O requests.  These problems arise because of the current inability of the kernel to do page-in operations for an address space other than that of the currently-running process, and the current inconvenience of mapping process-owned pages into the kernel buffer cache.  Both of these problems are expected to be solved in the next version of the virtual memory system, and thus we chose not to address them in the current implementation.  With the new version of the virtual memory system, we expect to use any part of physical memory as part of the buffer cache, even though it will not be entirely addressable at once within the kernel.
In that system, the implementation of a memory-based filesystem that avoids the double copy and context switches will be much easier.

Ideally part of the kernel's address space would reside in pageable memory.  Once such a facility is available, it would be most efficient to build a memory-based filesystem within the kernel.  One potential problem with such a scheme is that many kernels are limited to a small address space (usually a few megabytes).  This restriction limits the size of the memory-based filesystem that such a machine can support.  On such a machine, the kernel can describe a memory-based filesystem that is larger than its address space and use a ``window'' to map the larger filesystem address space into its limited address space.  The window would maintain a cache of recently accessed pages.  The problem with this scheme is that if the working set of active pages is greater than the size of the window, then much time is spent remapping pages and invalidating translation buffers.  Alternatively, a separate address space could be constructed for each memory-based filesystem as in the current implementation, and the memory-resident pages of that address space could be mapped exactly as other cached pages are accessed.

The current system uses the existing local filesystem structures and code to implement the memory-based filesystem.  The major advantages of this approach are the sharing of code and the simplicity of the approach.  There are several disadvantages, however.  One is that the size of the filesystem is fixed at mount time.  This means that a fixed number of inodes (files) and data blocks can be supported.  Currently, this approach requires enough swap space for the entire filesystem, and prevents expansion and contraction of the filesystem on demand.  The current design also prevents the filesystem from taking advantage of the memory-resident character of the filesystem.  It would be interesting to explore other filesystem implementations that would be less expensive to execute and that would make better use of the space.  For example, the current filesystem structure is optimized for magnetic disks.  It includes replicated control structures, ``cylinder groups'' with separate allocation maps and control structures, and data structures that optimize rotational layout of files.  None of this is useful in a memory-based filesystem (at least when the backing store for the filesystem is dynamically allocated and not contiguous on a single disk type).  On the other hand, directories could be implemented using dynamically-allocated memory organized as linked lists or trees rather than as files stored in ``disk'' blocks.  Allocation and location of pages for file data might use virtual memory primitives and data structures rather than direct and indirect blocks.  A reimplementation along these lines will be considered when the virtual memory system in the current system has been replaced.

5. References

Leffler1989a.
     S. J. Leffler, M. K. McKusick, M. J. Karels, and J. S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, Reading, MA (1989).

Ohta1990a.
     Masataka Ohta and Hiroshi Tezuka, "A Fast /tmp File System by Async Mount Option," USENIX Association Conference Proceedings, p. ???-??? (June 1990).

Smith1981a.
     A. J. Smith, "Bibliography on file and I/O system optimizations and related topics," Operating Systems Review, Vol. 14(4), p. 39-54 (October 1981).
Smith, "Bibliography on file and I/O system opti- mizations and related topics," _O_p_e_r_a_t_i_n_g _S_y_s_t_e_m_s _R_e_v_i_e_w Vol. 1144(4), p. 39-54 (October 1981). White1980a. R. M. White, "Disk Storage Technology," _S_c_i_e_n_t_i_f_i_c _A_m_e_r_i_c_a_n Vol. 224433(2), p. 138-148 (August 1980). 7 6666.... AAAAppppppppeeeennnnddddiiiixxxx AAAA ---- IIIImmmmpppplllleeeemmmmeeeennnnttttaaaattttiiiioooonnnn DDDDeeeettttaaaaiiiillllssss _/_* _* _T_h_i_s _s_t_r_u_c_t_u_r_e _d_e_f_i_n_e_s _t_h_e _c_o_n_t_r_o_l _d_a_t_a _f_o_r _t_h_e _m_e_m_o_r_y _* _b_a_s_e_d _f_i_l_e _s_y_s_t_e_m_. _*_/ ssssttttrrrruuuucccctttt mfsnode {{{{ ssssttttrrrruuuucccctttt vnode _*mfs-vnode; _/_* _v_n_o_d_e _a_s_s_o_c_i_a_t_e_d _w_i_t_h _t_h_i_s _m_f_s_n_o_d_e _*_/ caddr-t mfs-baseoff; _/_* _b_a_s_e _o_f _f_i_l_e _s_y_s_t_e_m _i_n _m_e_m_o_r_y _*_/ lllloooonnnngggg mfs-size; _/_* _s_i_z_e _o_f _m_e_m_o_r_y _f_i_l_e _s_y_s_t_e_m _*_/ pid-t mfs-pid; _/_* _s_u_p_p_o_r_t_i_n_g _p_r_o_c_e_s_s _p_i_d _*_/ ssssttttrrrruuuucccctttt buf _*mfs-buflist; _/_* _l_i_s_t _o_f _I_/_O _r_e_q_u_e_s_t_s _*_/ }}}}; _/_* _* _C_o_n_v_e_r_t _b_e_t_w_e_e_n _m_f_s_n_o_d_e _p_o_i_n_t_e_r_s _a_n_d _v_n_o_d_e _p_o_i_n_t_e_r_s _*_/ ####ddddeeeeffffiiiinnnneeee VTOMFS(vp) ((ssssttttrrrruuuucccctttt mfsnode _*)(vp)_V-_T>_Ov_M-_Fd_Sata) ####ddddeeeeffffiiiinnnneeee MFSTOV(mfsp) ((mfsp)->mfs-vnode) _M_F_S_T_O_V ####ddddeeeeffffiiiinnnneeee MFS-EXIT (ssssttttrrrruuuucccctttt buf _*)-1 _/_* _* _A_r_g_u_m_e_n_t_s _t_o _m_o_u_n_t _M_F_S _*_/ ssssttttrrrruuuucccctttt mfs-args {{{{ cccchhhhaaaarrrr _*name; _/_* _n_a_m_e _t_o _e_x_p_o_r_t _f_o_r _s_t_a_t_f_s _*_/ caddr-t base; _/_* _b_a_s_e _a_d_d_r_e_s_s _o_f _f_i_l_e _s_y_s_t_e_m _i_n _m_e_m_o_r_y _*_/ u-long size; _/_* _s_i_z_e _o_f _f_i_l_e _s_y_s_t_e_m _*_/ }}}}; 8 _/_* _* _M_o_u_n_t _a_n _M_F_S _f_i_l_e_s_y_s_t_e_m_. _*_/ mfs-mount(mp, path, data) _m_f_s___m_o_u_n_t ssssttttrrrruuuucccctttt mount _*mp; cccchhhhaaaarrrr _*path; caddr-t data; {{{{ ssssttttrrrruuuucccctttt vnode _*devvp; ssssttttrrrruuuucccctttt mfsnode _*mfsp; ssssttttrrrruuuucccctttt buf _*bp; ssssttttrrrruuuucccctttt mfs-args args; _/_* _* _C_r_e_a_t_e _a _b_l_o_c_k _d_e_v_i_c_e _t_o _r_e_p_r_e_s_e_n_t _t_h_e _d_i_s_k_. _*_/ devvp = getnewvnode(VT-MFS, VBLK, &mfs-vnodeops); mfsp = VTOMFS(devvp); _/_* _* _S_a_v_e _b_a_s_e _a_d_d_r_e_s_s _o_f _t_h_e _f_i_l_e_s_y_s_t_e_m _f_r_o_m _t_h_e _s_u_p_p_o_r_t_i_n_g _p_r_o_c_e_s_s_. _*_/ copyin(data, &args, (ssssiiiizzzzeeeeooooffff mfs-args)); mfsp->mfs-baseoff = args.base; mfsp->mfs-size = args.size; _/_* _* _R_e_c_o_r_d _t_h_e _p_r_o_c_e_s_s _i_d_e_n_t_i_f_i_e_r _o_f _t_h_e _s_u_p_p_o_r_t_i_n_g _p_r_o_c_e_s_s_. _*_/ mfsp->mfs-pid = u.u-procp->p-pid; _/_* _* _M_o_u_n_t _t_h_e _f_i_l_e_s_y_s_t_e_m_. _*_/ mfsp->mfs-buflist = NULL; mountfs(devvp, mp); _/_* _* _L_o_o_p _p_r_o_c_e_s_s_i_n_g _I_/_O _r_e_q_u_e_s_t_s_. _*_/ wwwwhhhhiiiilllleeee (mfsp->mfs-buflist != MFS-EXIT) {{{{ wwwwhhhhiiiilllleeee (mfsp->mfs-buflist != NULL) {{{{ bp = mfsp->mfs-buflist; mfsp->mfs-buflist = bp->av-forw; offset = mfsp->mfs-baseoff + (bp->b-blkno _* DEV-BSIZE); iiiiffff (bp->b-flags & B-READ) copyin(offset, bp->b-un.b-addr, bp->b-bcount); eeeellllsssseeee _/_* _w_r_i_t_e_-_r_e_q_u_e_s_t _*_/ copyout(bp->b-un.b-addr, offset, bp->b-bcount); biodone(bp); }}}} sleep((caddr-t)devvp, PWAIT); }}}} }}}} 9 _/_* _* _I_f _t_h_e _M_F_S _p_r_o_c_e_s_s _r_e_q_u_e_s_t_s _t_h_e _I_/_O _t_h_e_n _w_e _m_u_s_t _d_o _i_t _d_i_r_e_c_t_l_y_. 
 * Otherwise put the request on the list and request the MFS process
 * to be run.
 */
mfs_strategy(bp)
        struct buf *bp;
{
        struct vnode *devvp;
        struct mfsnode *mfsp;
        caddr_t offset;

        devvp = bp->b_vp;
        mfsp = VTOMFS(devvp);
        if (mfsp->mfs_pid == u.u_procp->p_pid) {
                offset = mfsp->mfs_baseoff + (bp->b_blkno * DEV_BSIZE);
                if (bp->b_flags & B_READ)
                        copyin(offset, bp->b_un.b_addr, bp->b_bcount);
                else    /* write request */
                        copyout(bp->b_un.b_addr, offset, bp->b_bcount);
                biodone(bp);
        } else {
                bp->av_forw = mfsp->mfs_buflist;
                mfsp->mfs_buflist = bp;
                wakeup((caddr_t)bp->b_vp);
        }
}

/*
 * The close routine is called by unmount after the filesystem
 * has been successfully unmounted.
 */
mfs_close(devvp)
        struct vnode *devvp;
{
        struct mfsnode *mfsp = VTOMFS(devvp);
        struct buf *bp;

        /*
         * Finish any pending I/O requests.
         */
        while (bp = mfsp->mfs_buflist) {
                mfsp->mfs_buflist = bp->av_forw;
                mfs_doio(bp, mfsp->mfs_baseoff);
                wakeup((caddr_t)bp);
        }
        /*
         * Send a request to the filesystem server to exit.
         */
        mfsp->mfs_buflist = MFS_EXIT;
        wakeup((caddr_t)devvp);
}