FILES

In the following articles we will consider files from the user's viewpoint, that is, how they are used and what properties they possess.

File Naming

Files are an abstraction mechanism. They provide a way to store information on the disk and read it back later. This must be done in such a way as to shield the user from the details of how and where the information is stored, and how the disks really work.

Perhaps the most important feature of any abstraction mechanism is the way the objects being managed are named, so we will start our examination of file systems with the subject of file naming. When a process creates a file, it gives the file a name. When the process terminates, the file continues to exist and can be accessed by other processes using its name.

The exact rules for file naming vary somewhat from system to system, but all current operating systems allow strings of one to eight letters as legal file names. Thus andrea, bruce, and cathy are possible file names. Often digits and special characters are also permitted, so names like 2, urgent!, and Fig.2-14 are often valid as well. Many file systems support names as long as 255 characters.

Some file systems differentiate between upper and lower case letters, whereas others do not. UNIX falls in the first category; MS-DOS falls in the second. Thus a UNIX system can have all of the following as three distinct files: maria, Maria, and MARIA. In MS-DOS, all these names refer to the same file.
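The difference between the two categories can be sketched in a few lines of Python. This is only an illustration, not how any real file system is implemented: the function name count_distinct is made up for the example, and casefold() stands in for whatever case folding a case-insensitive system applies.

```python
def count_distinct(names, case_sensitive):
    """Count how many separate files a list of names would denote."""
    if case_sensitive:
        # UNIX-style: names that differ only in case are different files.
        return len(set(names))
    # MS-DOS-style: fold case first, so such names collapse into one.
    return len({name.casefold() for name in names})


names = ["maria", "Maria", "MARIA"]
print(count_distinct(names, case_sensitive=True))   # 3
print(count_distinct(names, case_sensitive=False))  # 1
```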

An aside on file systems is probably in order here. Windows 95 and Windows 98 both use the MS-DOS file system, called FAT-16, and thus inherit many of its properties, such as how file names are constructed. Windows 98 introduced some extensions to FAT-16, leading to FAT-32, but these two are quite similar. In addition, Windows NT, Windows 2000, Windows XP, and Windows Vista support both FAT file systems, which are really obsolete now. These four NT-based operating systems have a native file system (NTFS) that has different properties (such as file names in Unicode). In this section, when we refer to the MS-DOS or FAT file systems, we mean FAT-16 and FAT-32 as used on Windows unless specified otherwise. We will discuss the FAT file systems later in this section.

Many operating systems support two-part file names, with the two parts separated by a period, as in prog.c. The part following the period is called the file extension and generally indicates something about the file. In MS-DOS, for instance, file names are 1 to 8 characters, plus an optional extension of 1 to 3 characters. In UNIX, the size of the extension, if any, is up to the user, and a file may even have two or more extensions, as in homepage.html.zip, where .html indicates a Web page in HTML and .zip indicates that the file (homepage.html) has been compressed using the zip program. Some of the more common file extensions and their meanings are shown in Figure 1.

Figure 1. Some typical file extensions.
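As a rough illustration of two-part names, the sketch below splits a name into its extensions with Python's pathlib and checks whether a name fits the MS-DOS 8-plus-3 length limits. The helper names extensions and fits_8_3 are invented for this example, and the check is deliberately simplified: it accepts only letters and digits, whereas the real MS-DOS rules also allow some special characters.

```python
import re
from pathlib import PurePosixPath


def extensions(name):
    """Return every extension of a file name, in order."""
    return PurePosixPath(name).suffixes


# Simplified 8.3 pattern (assumption: letters and digits only).
_DOS_8_3 = re.compile(r"[A-Za-z0-9]{1,8}(\.[A-Za-z0-9]{1,3})?")


def fits_8_3(name):
    """True if the name has 1-8 characters plus an optional 1-3 character extension."""
    return _DOS_8_3.fullmatch(name) is not None


print(extensions("prog.c"))             # ['.c']
print(extensions("homepage.html.zip"))  # ['.html', '.zip']
print(fits_8_3("prog.c"))               # True
print(fits_8_3("homepage.html.zip"))    # False
```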

In some systems (e.g., UNIX), file extensions are just conventions and are not enforced by the operating system. A file named file.txt might be some kind of text file, but that name is more to remind the owner than to convey any actual information to the computer. On the other hand, a C compiler may in fact insist that files it is to compile end in .c, and it may refuse to compile them if they do not.

Conventions like this are particularly useful when the same program can handle a number of different kinds of files. The C compiler, for instance, can be given a list of several files to compile and link together, some of them C files and some of them assembly language files. The extension then becomes essential for the compiler to tell which are C files, which are assembly files, and which are other files.
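A minimal sketch of how a compiler driver might sort its arguments by extension follows. It is not taken from any real compiler: the function classify is hypothetical, and treating .s as the assembly language extension is an assumption based on common UNIX convention.

```python
import os


def classify(filenames):
    """Group file names into C files, assembly files, and everything else."""
    kinds = {".c": "c", ".s": "assembly"}  # assumed extension conventions
    groups = {"c": [], "assembly": [], "other": []}
    for name in filenames:
        _, ext = os.path.splitext(name)  # e.g. "main.c" -> ("main", ".c")
        groups[kinds.get(ext, "other")].append(name)
    return groups


print(classify(["main.c", "util.s", "lib.o", "parse.c"]))
# {'c': ['main.c', 'parse.c'], 'assembly': ['util.s'], 'other': ['lib.o']}
```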

In contrast, Windows is aware of the extensions and assigns meaning to them. Users (or processes) can register extensions with the operating system and specify for each one which program "owns" that extension. When a user double-clicks on a file name, the program assigned to its file extension is launched with the file as a parameter. For instance, double-clicking on file.doc starts Microsoft Word with file.doc as the initial file to edit.
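The registration idea can be modeled with a small lookup table. The class below is a toy model only and does not use or imitate the actual Windows registration mechanism; ExtensionRegistry, register, and open_command are hypothetical names, and "winword.exe" is just a placeholder for the program that owns .doc.

```python
import os


class ExtensionRegistry:
    """Toy table mapping a file extension to the program that owns it."""

    def __init__(self):
        self._owners = {}

    def register(self, ext, program):
        # Store extensions in lower case so lookups ignore case.
        self._owners[ext.lower()] = program

    def open_command(self, filename):
        """Return [program, filename] for a double-click, or None if unregistered."""
        _, ext = os.path.splitext(filename)
        program = self._owners.get(ext.lower())
        if program is None:
            return None
        return [program, filename]


registry = ExtensionRegistry()
registry.register(".doc", "winword.exe")  # placeholder program name
print(registry.open_command("file.doc"))  # ['winword.exe', 'file.doc']
print(registry.open_command("file.xyz"))  # None
```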

File Structure

Files can be structured in any of various ways. Three common possibilities are shown in Figure 2. The file in Figure 2(a) is an unstructured sequence of bytes. In effect, the operating system does not know or care what is in the file. All it sees are bytes. Any meaning must be imposed by user-level programs. Both UNIX and Windows use this approach.

Figure 2. Three kinds of files.
   
Having the operating system regard files as nothing more than byte sequences provides the maximum flexibility. User programs can put anything they want in their files and name them any way that is convenient. The operating system does not help, but it also does not get in the way. For users who want to do unusual things, the latter can be very important. All versions of UNIX, MS-DOS, and Windows use this file model.
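To illustrate that meaning is imposed by the program rather than by the file, the short sketch below interprets the same four bytes in three ways. The byte values and the little-endian layouts are arbitrary choices made for the example.

```python
import struct

data = b"ABCD"  # four raw bytes, as the operating system sees them

as_text = data.decode("ascii")           # read as ASCII text
as_int = struct.unpack("<I", data)[0]    # read as one 32-bit little-endian integer
as_shorts = struct.unpack("<2H", data)   # read as two 16-bit little-endian integers

print(as_text)    # ABCD
print(as_int)     # 1145258561
print(as_shorts)  # (16961, 17475)
```

The bytes never change; only the reading program's interpretation does.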

The first step up in structure is shown in Figure 2(b). In this model, a file is a sequence of fixed-length records, each with some internal structure. Central to the idea of a file being a sequence of records is the idea that the read operation returns one record and the write operation overwrites or appends one record. As a historical note, in decades gone by, when the 80-column punched card was king, many (mainframe) operating systems based their file systems on files consisting of 80-character records, in effect, card images. These systems also supported files of 132-character records, which were intended for the line printer (in those days a big chain printer with 132 columns). Programs read input in units of 80 characters and wrote it in units of 132 characters, although the final 52 could be spaces, of course. No current general-purpose system uses this model as its primary file system any more, but back in the days of 80-column punched cards and 132-character line printer paper this was a common model on mainframe computers.
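The record model can be imitated on top of a byte stream, as in the sketch below, which assumes 80-character ASCII records padded with spaces. The functions write_record and read_record are hypothetical helpers, and an in-memory io.BytesIO object stands in for a file on disk.

```python
import io

RECORD_SIZE = 80  # one card image


def write_record(f, index, text):
    """Overwrite or append record number `index`, padding it with spaces."""
    data = text.encode("ascii")
    if len(data) > RECORD_SIZE:
        raise ValueError("record longer than 80 characters")
    f.seek(index * RECORD_SIZE)
    f.write(data.ljust(RECORD_SIZE, b" "))


def read_record(f, index):
    """Return record number `index` with its trailing padding removed."""
    f.seek(index * RECORD_SIZE)
    data = f.read(RECORD_SIZE)
    if len(data) < RECORD_SIZE:
        raise IndexError("no such record")
    return data.decode("ascii").rstrip(" ")


cards = io.BytesIO()
write_record(cards, 0, "FIRST CARD")
write_record(cards, 1, "SECOND CARD")
print(read_record(cards, 1))  # SECOND CARD
```

Because every record has the same length, record number i always starts at byte offset i times 80, so no index is needed to find it.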

The third kind of file structure is shown in Figure 2(c). In this organization, a file consists of a tree of records, not necessarily all the same length, each containing a key field in a fixed position in the record. The tree is sorted on the key field, to allow rapid searching for a particular key.

The basic operation here is not to get the "next" record, although that is also possible, but to get the record with a specific key. For the zoo file of Figure 2(c), one could ask the system to get the record whose key is pony, for instance, without worrying about its exact position in the file. In addition, new records can be added to the file, with the operating system, and not the user, deciding where to place them. This type of file is clearly quite different from the unstructured byte streams used in UNIX and Windows but is extensively used on the large mainframe computers still used in some commercial data processing.
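Access by key can be sketched as follows. This is a simplification under stated assumptions: a sorted list searched with binary search stands in for the tree of records, the class name KeyedFile and its methods are made up, and the record contents are invented for the example (only the key pony comes from the text).

```python
import bisect


class KeyedFile:
    """Records kept sorted on a key; the 'system' decides where each one goes."""

    def __init__(self):
        self._keys = []
        self._records = []

    def put(self, key, record):
        # Find the sorted position; the caller never chooses the placement.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._records[i] = record  # replace the existing record
        else:
            self._keys.insert(i, key)
            self._records.insert(i, record)

    def get(self, key):
        """Fetch the record with this key, wherever it is stored."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._records[i]
        raise KeyError(key)

    def keys(self):
        return list(self._keys)


zoo = KeyedFile()
zoo.put("pony", "small horse")  # hypothetical record contents
zoo.put("ant", "insect")
zoo.put("fox", "wild canine")
print(zoo.get("pony"))  # small horse
print(zoo.keys())       # ['ant', 'fox', 'pony']
```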
