The performance and scalability of large Perforce installations depend heavily on the file system provided by the operating system. We will investigate the design and implementation of selected file systems and perform some simple benchmarks to determine their relative advantages. For this discussion, file system scalability is defined as the ability to support very large numbers of files and directories while providing good I/O performance.
The file system is one of the most important parts of an operating system. It stores and manages data on disk drives and ensures that what is read from storage is identical to what was originally written. In addition to storing files, the file system also creates and manages information about the files and about itself. In this paper, we will look at first generation file systems such as UFS in Solaris, second generation UFS derivatives such as WAFL (Write Anywhere File Layout) from Network Appliance, and new implementations of second generation file systems, namely XFS on Linux and NTFS on Windows 2000.
File system architecture
The way file systems manage their metadata is the key to their performance and scalability. A file system stores data on the disk by determining where each file's content blocks should be placed and by maintaining a record of the locations of those blocks. It must also keep track of free versus allocated space and provide mechanisms for creating and deleting files and for allocating and freeing disk space as files grow, shrink, or are deleted. In UNIX file systems, the data structure that describes a file and records the locations of its blocks is called an inode. The particular methods for allocating and retrieving file blocks determine the overall performance of reads and writes to disk and the reliability of the file system itself.
Most first generation UNIX file systems are based on the Berkeley Fast File System, commonly known as UFS or FFS. (The first generation Microsoft file systems, FAT16 and FAT32, will not be discussed here due to their primitive nature.) In the FFS design, the file system maintains a map of inodes, which point to data and directory blocks; each inode has a number that uniquely identifies it. Directory blocks contain a table associating inode numbers with the names of the files and other directories stored in that directory. A file's inode contains the information describing the file: its inode number, metadata such as ownership and permissions, the size of the file, the dates the file was last accessed and modified, and a list of the file's data blocks. For large files, this list can also contain indirect blocks, which are blocks that themselves contain lists of blocks allocated to the file. In turn, indirect blocks can point to double-indirect blocks, and double-indirect blocks can point to triple-indirect blocks. For very large files, this method becomes extremely inefficient.

UFS uses traditional bitmaps to keep track of free space, and in general the block size is fixed by the implementation. File data is allocated essentially at random, based on the availability of free space. UFS divides its file systems into cylinder groups, each containing bookkeeping information, including inodes and the bitmaps for managing free space. The major purpose of cylinder groups is to reduce disk head movement by keeping inodes close to their associated data blocks; clustering the inode metadata in cylinder groups improves its locality. Because blocks are placed when a file is created, read performance also tends to be a function of the initial layout.
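To make the FFS inode structure concrete, the following sketch shows what an FFS-style on-disk inode might look like in C. The field names are invented for illustration, though the counts of direct and indirect pointers follow the classic FFS layout; this is not the actual Solaris UFS definition.

```c
#include <stdint.h>

#define NDIRECT 12  /* classic FFS keeps 12 direct block pointers */

/* A hypothetical FFS-style on-disk inode. Field names are
 * illustrative; real implementations differ in detail. */
struct ffs_inode {
    uint16_t mode;              /* file type and permission bits */
    uint16_t uid, gid;          /* ownership metadata */
    uint64_t size;              /* file size in bytes */
    uint32_t atime, mtime;      /* last access / modification times */
    uint32_t direct[NDIRECT];   /* block numbers of the first 12 data blocks */
    uint32_t indirect;          /* block holding a list of further data blocks */
    uint32_t double_indirect;   /* block holding a list of indirect blocks */
    uint32_t triple_indirect;   /* block holding a list of double-indirects */
};
```

With 4 KB blocks and 32-bit block numbers, a single indirect block covers only 1024 blocks (4 MB) of file data, so locating a block near the end of a multi-gigabyte file can require following two or three levels of indirection, each a potential extra disk read. This is the inefficiency for very large files noted above.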
The UFS file system was designed at a time when 32-bit computing was the norm. Originally it supported only 2^31 bytes (2 GB), but current implementations have been extended to support up to 2 TB.
The techniques developed to allow UFS to survive the inevitable crash have had a major impact on its performance characteristics. To minimize the possibility of damage, it would be desirable to write all data synchronously, but the performance penalty for doing so is so large that UFS resorts to asynchronous writes for file data and synchronous writes for the metadata.
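The difference is visible at the system call level. The following minimal sketch (POSIX C; the file names are invented) contrasts the buffered writes UFS uses for file data with the synchronous writes it demands for metadata:

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    const char buf[] = "example record\n";

    /* Asynchronous (buffered) write: write() returns once the data is
     * in the kernel's cache; the on-disk image catches up later. This
     * is how UFS treats file data. */
    int fd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, buf, sizeof buf - 1);   /* fast, but not yet durable */
    fsync(fd);                        /* an explicit flush makes it durable */
    close(fd);

    /* Synchronous write: O_SYNC forces every write() to reach stable
     * storage before returning -- the guarantee UFS requires for its
     * metadata, and the source of the performance penalty described
     * above. */
    int sfd = open("meta.tmp", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    write(sfd, buf, sizeof buf - 1);  /* durable, but far slower */
    close(sfd);
    return 0;
}
```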
The next generation
Second generation file systems have been designed to support large file systems, large files, sparse files, large directories, large numbers of files, rapid crash recovery, high I/O throughput, and RAID.
RAID is an acronym for Redundant Array of Inexpensive Disks. A RAID array is a collection of drives that operate independently of each other but collectively act as a single storage system, one that (at most levels) can tolerate the failure of a drive without losing data. There are several different RAID levels, but we will consider only those most important for performance and scalability.
RAID Level 0 is not redundant, but data is split across drives, resulting in higher data throughput. Since no redundant data is stored, performance is very good, but the failure of any disk in the array results in data loss. This level is commonly referred to as striping.
RAID Level 1 provides redundancy by duplicating all data from one drive on another drive. The performance of a level 1 array is only slightly better than a single drive but the cost per megabyte is doubled. This level is commonly referred to as mirroring.
RAID Level 4 stripes data at a block level across several drives, with parity stored on one dedicated drive. The parity information allows recovery from the failure of any single drive. The performance of a level 4 array is very good for reads. Writes, however, require that parity data be updated each time. This slows small random writes in particular, though large or sequential writes are fairly fast. Because only one drive in the array stores redundant data, the cost per megabyte of a level 4 array can be fairly low. (A small sketch of parity computation and recovery follows these level descriptions.)
RAID Level 10 is implemented as a striped array whose segments are RAID Level 1 arrays. High I/O rates are achieved, but it is expensive, since it requires a minimum of four drives.
RAID Level 0+1 is implemented as a mirrored array whose segments are RAID Level 0 arrays. High I/O rates are achieved thanks to multiple stripe segments, but a single drive failure causes the array to become, in essence, a RAID Level 0 array.
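To illustrate how a dedicated parity drive allows recovery, here is a toy RAID Level 4 sketch in C. The drive count, block size, and sample data are invented; real implementations operate on whole disk blocks inside the driver or controller.

```c
#include <stdio.h>
#include <string.h>

#define NDATA 4   /* data drives in a hypothetical RAID 4 set */
#define BLOCK 8   /* tiny block size, just for illustration   */

/* RAID 4 mapping: logical block b lives on data drive b % NDATA at
 * stripe b / NDATA; the parity for each stripe lives on one dedicated
 * parity drive. Parity is the byte-wise XOR of the stripe's blocks. */
static void compute_parity(unsigned char stripe[NDATA][BLOCK],
                           unsigned char parity[BLOCK]) {
    memset(parity, 0, BLOCK);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= stripe[d][i];
}

/* After a single drive failure, its block is recovered by XOR-ing the
 * parity with all surviving blocks in the stripe. */
static void recover(unsigned char stripe[NDATA][BLOCK],
                    const unsigned char parity[BLOCK], int failed) {
    memcpy(stripe[failed], parity, BLOCK);
    for (int d = 0; d < NDATA; d++)
        if (d != failed)
            for (int i = 0; i < BLOCK; i++)
                stripe[failed][i] ^= stripe[d][i];
}

int main(void) {
    unsigned char stripe[NDATA][BLOCK] =
        {"drive0!", "drive1!", "drive2!", "drive3!"};
    unsigned char parity[BLOCK];
    compute_parity(stripe, parity);

    memset(stripe[2], 0, BLOCK);           /* simulate losing drive 2 */
    recover(stripe, parity, 2);
    printf("recovered: %s\n", (char *)stripe[2]);  /* prints "drive2!" */
    return 0;
}
```

The small-write penalty mentioned above falls out of this arithmetic: updating a single data block requires reading the old data and old parity and writing new data and new parity (new parity = old parity XOR old data XOR new data), turning one logical write into two reads and two writes, all funneled through the single parity drive.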
XFS was designed from scratch by SGI in the 1990s to meet its customers' needs for file system scalability and performance; it was ported from IRIX to Linux in 2000. The design of NTFS began in the late 1980s. While NTFS could be considered a second generation file system on the grounds that it too was developed from scratch, many of the scalability lessons that could have been learned from first generation UNIX file systems seem to have been ignored.
Network Appliance designed WAFL to work specifically in an NFS appliance, but it can trace its roots to UFS. The primary focus of the design was to implement a scheme for Snapshots, which are read-only clones of the active file system. In addition, since Network Appliance chose a RAID Level 4 architecture, WAFL contains enhancements to mitigate the write penalty of parity computation, as well as extended crash recovery features.
The slowest part of the I/O system is the disk drive, since its performance depends in part on mechanical rather than electronic components. To keep disk drives from becoming a bottleneck for overall I/O performance, file systems cache important parts of the file system's disk image in memory, and this cached information is periodically flushed to disk to synchronize the two.
This caching, so essential to system performance, has serious consequences in the event of an unexpected system crash or power outage. Since the latest updates to the file system may not yet have been transferred from memory to disk, the on-disk file system will not be in a consistent state. Traditional file systems use a checking program (fsck on UNIX) to examine the file system structures and return them to consistency.
As file systems have grown larger, and servers have grown to support more and more of them, the time these checking programs take to recover from a crash has become significant. On the largest servers, the process can literally take many hours as the program sweeps through all of the file system's data structures. File systems that depend on checking programs must also keep their internal data structures simple just to keep the check reasonably efficient. Having critical data offline for hours during the repair process is an unacceptable cost in many environments.
Most modern file systems use journaling techniques borrowed from the database world to improve crash recovery. Disk transactions are written sequentially to an area of disk called the journal or log before being written to their final locations within the file system. These journal writes are generally performed synchronously, but they gain some speed from being contiguous. If a failure occurs, the transactions can then be replayed from the journal to bring the file system up to date. Implementations vary in what they log: some write only the file system metadata to the journal, while others log all writes. The journaling file systems discussed in this paper log only metadata during normal operation. Depending on the implementation, journaling may have significant consequences for I/O performance.
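The core ordering rule is write-ahead logging: the journal record must be durable before the in-place update begins. The following sketch (hypothetical record format and names; real file systems batch many updates into a single transaction and checkpoint the log asynchronously) shows that ordering in POSIX C:

```c
#include <fcntl.h>
#include <unistd.h>

/* A toy metadata journal record. The layout is invented for
 * illustration only. */
struct log_record {
    long txn_id;     /* transaction identifier               */
    long block_no;   /* final location of the metadata block */
    char data[512];  /* new contents of that block           */
};

static int journal_write(int log_fd, int fs_fd, const struct log_record *rec) {
    /* 1. Append the record to the journal and force it to disk. These
     *    writes are sequential, so the synchronous flush is relatively
     *    cheap. */
    if (write(log_fd, rec, sizeof *rec) != sizeof *rec) return -1;
    if (fsync(log_fd) != 0) return -1;

    /* 2. Only now is it safe to update the block in its final location.
     *    If we crash here, recovery replays the journal record instead
     *    of scanning the whole file system. */
    if (pwrite(fs_fd, rec->data, sizeof rec->data,
               rec->block_no * 512L) != sizeof rec->data) return -1;

    /* 3. Once the in-place write is known durable, the journal space
     *    can be reclaimed (omitted here). */
    return 0;
}
```

Because crash recovery then replays only a bounded journal instead of sweeping the entire file system, remount time after a crash is typically measured in seconds and is independent of file system size.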
It is also important to note that a transaction log does not entirely eliminate the need for file system checking programs. Hardware and software errors that corrupt random blocks in the file system are generally not recoverable with the transaction log, yet these errors can make the contents of the file system inaccessible. This type of event is relatively rare, but still important.
Filesystem Feature Comparisons