Implementing Directories

Implementing Directories

Before a file can be read, it must be opened. When a file is opened, the operating system uses the path name supplied by the user to locate the directory entry. The directory entry provides the information required to find the disk blocks. Depending on the system, this information may be the disk address of the entire file (with contiguous allocation), the number of the first block (both linked list schemes), or the number of the i-node. In all cases, the main function of the directory system is to map the ASCII name of the file onto the information required to locate the data. A closely related issue is where the attributes should be stored. Every file system maintains file attributes, such as each file's owner and creation time, and they must be stored somewhere. One obvious possibility is to store them directly in the directory entry. Many systems do precisely that. This option is illustrated in Figure 1(a). In this simple design, a directory consists of a list of fixed-size entries, one per file, containing a (fixed-length) file name, a structure of the file attributes, and one or more disk addresses (up to some maximum) telling where the disk blocks are.

A simple directory containing fixed-size entries with the disk addresses

For systems that use i-nodes, another possibility for storing the attributes is in the i-nodes, rather than in the directory entries. In that case, the directory entry can be shorter: just a file name and an i-node number. This approach is shown in Figure 1(b). As we shall see later, this method has some advantages over putting them in the directory entry. The two approaches shown in Figure 1 correspond to Windows and UNIX, respectively, as we will see later. So far we have made the assumption that files have short, fixed-length names. In MS-DOS files have a 1-8 character base name and an optional extension  of 1-3 characters. In UNIX Version 7, file names were 1-14 characters, including any extensions. Though, nearly all modern operating systems support longer, variable-length file names. How can these be implemented?

The simplest approach is to set a limit on file name length, usually 255 characters, and then use one of the designs of Figure 1 with 255 characters reserved for each file name. This approach is simple, but wastes a great deal of directory space, since few files have such long names. For efficiency reasons, a different structure is desirable. One alternative is to give up the idea that all directory entries are the same size. With this method, each directory entry contains a fixed portion, usually starting with the length of the entry, and then followed by data with a fixed format, usually including the owner, creation time, protection information, and other attributes. This fixed-length header is followed by the actual file name, however long it may be, as shown in Figure 2(a) in big-endian format (e.g.,  SPARC). In this example we have three files, project-budget, personnel, and foo. Each file name is terminated by a special character (usually 0), which is represented in the figure by a box with a cross in it. To allow each directory entry to begin on a word boundary, each file name is filled out to an integral number of words, shown by shaded boxes in the figure.

A disadvantage of this method is that when a file is removed, a variable-sized gap is introduced into the directory into which the next file to be entered may not fit. This problem is the same one we saw with contiguous disk files, only now compacting the directory is feasible because it is entirely in memory. Another problem is that a single directory entry may span multiple pages, so a page fault may occur while reading a file name. Another way to handle variable-length names is to make the directory entries themselves all fixed length and keep the file names together in a heap at the end of the directory, as shown in Figure 2(b). This method has the advantage that when an entry is removed, the next file entered will always fit there. Certainly, the heap must be managed and page faults can still happen while processing file names. One minor win here is that there is no longer any real need for file names to begin at word boundaries, so no filler characters are required after file names in Figure 2(b) as they are in Figure 2(a). In all of the designs so far, directories are searched linearly from beginning to end when a file name has to be looked up. For very long directories, linear searching can be slow. One way to speed up the search is to use a hash table in each directory. Call the size of the table n. To enter a file name, the name is hashed onto a value between 0 and n - 1, for example, by dividing it by n and

Two ways of handling long file names in a directory

taking the remainder. Alternatively, the words comprising the file name can be added up and this quantity divided by n, or something similar.

Either way, the table entry corresponding to the hash code is inspected. If it is unused, a pointer is placed there to the file entry. File entries follow the hash table. If that slot is already in use, a linked list is constructed, headed at the table entry and threading through all entries with the same hash value. Looking up a file follows the same procedure. The file name is hashed to select a hash table entry. All the entries on the chain headed at that slot are checked to see if the file name is present. If the name is not on the chain, the file is not present in the directory. Using a hash table has the advantage of much faster lookup, but the disadvantage of more complicated administration. It is only really a serious candidate in systems where it is expected that directories will routinely contain hundreds or thousands of files. A different way to speed up searching large directories is to cache the results of searches. Before starting a search, a check is first made to see if the file name is in the cache. If so, it can be located immediately. Certainly, caching only works if a relatively small number of files comprise the majority of the lookups.


Tags

operating system, file system, attributes, i-nodes, disk blocks