Short Notes: Virtual File Systems (VFS)

Deeper Dive into the Virtual File Systems

Krishanu Konar


Dentries

The dentry cache (dcache) is the part of the Linux kernel's Virtual File System (VFS) layer that stores directory entries. A dentry is the glue that holds inodes and files together: it relates a file name to an inode. By keeping recently used entries in memory, the dcache records where files and directories sit in the directory tree, which makes path name resolution faster.

To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of data structures:

  • A set of dentry objects in the in-use, unused, or negative state.
  • A hash table to derive the dentry object associated with a given filename and a given directory quickly.

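From userspace, directory entries can be read with getdents(), which takes a file descriptor referring to an open directory. Each call fills a buffer with as many entries as fit, much as read() fills a buffer with bytes, and returns 0 once the end of the directory is reached.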
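
As a minimal sketch of reading directory entries from userspace, the function below uses the portable opendir()/readdir() interface, which glibc implements on top of the getdents family of system calls on Linux. The name list_dir is hypothetical and exists only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: print every entry of `path` with its inode number.
 * Returns the number of entries seen, or -1 if the directory cannot be opened. */
int list_dir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    /* readdir() hands back one entry at a time from a buffer that the C
     * library refills from the kernel as needed. */
    while ((ent = readdir(dir)) != NULL) {
        printf("%-24s inode=%llu\n", ent->d_name,
               (unsigned long long)ent->d_ino);
        count++;
    }

    closedir(dir);
    return count;
}
```

Calling list_dir(".") from a small main() lists the current directory, normally including the "." and ".." entries.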

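  • Each dentry is a data structure holding the entry's name, a pointer to its inode, and a pointer to its parent directory; attributes such as type (file, dir) and size live in the inode. The dentry cache also acts as a controller for an inode cache.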

  • The inodes in kernel memory that are associated with unused dentries are not discarded, since the dentry cache is still using them. Thus, the inode objects are kept in RAM and can be quickly referenced by means of the corresponding dentries.

  • All the “unused” dentries are included in a doubly linked “Least Recently Used” list sorted by time of insertion. In other words, the dentry object that was last released is put in front of the list, so the least recently used dentry objects are always near the end of the list.

  • When the dentry cache has to shrink, the kernel removes elements from the tail of this list so that the most recently used objects are preserved. Dentry cache statistics, such as the total and unused dentry counts, are exposed in /proc/sys/fs/dentry-state.
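
The LRU behaviour described above can be modelled in a few lines of userspace C. This is a toy sketch, not the kernel's implementation: the toy_dentry and toy_lru types and both functions are hypothetical names invented for the illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of an unused dentry sitting on the LRU list. */
struct toy_dentry {
    char name[32];
    struct toy_dentry *prev, *next;
};

struct toy_lru {
    struct toy_dentry *head;  /* most recently released */
    struct toy_dentry *tail;  /* least recently used */
    size_t count;
};

/* A dentry that was just released goes to the front of the list.
 * Returns 0 on success, -1 if memory allocation fails. */
int toy_lru_release(struct toy_lru *lru, const char *name)
{
    struct toy_dentry *d = calloc(1, sizeof *d);
    if (d == NULL)
        return -1;
    strncpy(d->name, name, sizeof d->name - 1);  /* calloc keeps it NUL-terminated */

    d->next = lru->head;
    if (lru->head != NULL)
        lru->head->prev = d;
    else
        lru->tail = d;        /* first element is both head and tail */
    lru->head = d;
    lru->count++;
    return 0;
}

/* Shrinking frees up to `n` entries from the tail, so the most recently
 * used entries survive. Returns how many entries were freed. */
size_t toy_lru_shrink(struct toy_lru *lru, size_t n)
{
    size_t freed = 0;
    while (freed < n && lru->tail != NULL) {
        struct toy_dentry *victim = lru->tail;
        lru->tail = victim->prev;
        if (lru->tail != NULL)
            lru->tail->next = NULL;
        else
            lru->head = NULL; /* list is now empty */
        free(victim);
        lru->count--;
        freed++;
    }
    return freed;
}
```

Releasing "a", "b" and "c" in that order and then shrinking by two leaves only "c" on the list, because "a" and "b" sit closest to the tail.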

How is the Linux dcache created?

The Linux dcache (directory cache) is created during the initialization of the Linux kernel.

  • Initialization: During the kernel boot process, the dcache is created as part of the initialization of the virtual file system (VFS) layer.
  • Allocation: The dcache is allocated dynamically using memory management functions in the kernel. The size of the dcache depends on various factors, including system configuration and available memory.
  • Populating: The dcache is populated on demand. Mounting a file system creates the dentry for its root directory; further directory entries are read from the file system and cached as paths are looked up.
  • Caching: The dcache acts as a cache for directory entries, storing frequently accessed directory information in memory. This helps to speed up file system operations such as file lookup and traversal.
  • Invalidation and Updates: The dcache is constantly updated and invalidated as file system operations occur. When a directory entry is accessed, modified, or deleted, the corresponding entry in the dcache is updated accordingly.

The dcache in Linux is a crucial component for efficient file system access and provides a faster lookup mechanism for directory entries compared to directly accessing the file system.

A directory entry (dentry) is read into the directory cache (dcache) when a file or directory is accessed. The kernel first checks whether the corresponding dentry is already present in the dcache. If it is not, the kernel reads the entry from the file system, creates a new dentry object, and populates it with the relevant information, such as the file or directory name and a reference to the inode that holds the remaining metadata.

Inodes

An individual dentry usually has a pointer to an inode. An inode exists in, or on, a file system and represents metadata about a file. A single inode can be pointed to by multiple dentries (hard links, for example, do this).

Looking up an inode requires the VFS to call the lookup() method of the parent directory's inode. The stat(2) operation, once the VFS has the dentry, peeks at the inode data and passes some of it back to userspace.
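
The relationship between names and inodes is visible from userspace through stat(2). The sketch below creates a file, adds a hard link to it, and checks that both names report the same device and inode number. The function name hard_link_shares_inode is hypothetical, and the code assumes the target directory is on a file system that supports hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, hard-link it as `link_path`, and
 * compare what stat() reports for the two names.
 * Returns 1 if both names refer to one inode with a link count of 2,
 * 0 if they do not, and -1 on any system call error. */
int hard_link_shares_inode(const char *path, const char *link_path)
{
    /* Start from a clean slate; errors are ignored because the files
     * may simply not exist yet. */
    unlink(path);
    unlink(link_path);

    int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* A hard link is a second directory entry pointing at the same inode. */
    if (link(path, link_path) != 0)
        return -1;

    struct stat a, b;
    if (stat(path, &a) != 0 || stat(link_path, &b) != 0)
        return -1;

    return a.st_dev == b.st_dev && a.st_ino == b.st_ino && a.st_nlink == 2;
}
```

Both files are left in place so they can be inspected afterwards, for example with ls -li; remove them with unlink() or rm when done.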

File Object

Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions, which are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work, and the file structure is placed into the file descriptor table for the process.

Reading, writing and closing files (and other assorted VFS operations) are done by using the userspace file descriptor to look up the appropriate file structure, and then calling the required file structure method.
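
One way to observe the file structure from userspace is through the file offset, which is stored in it. In the sketch below, a descriptor duplicated with dup() refers to the same file structure and therefore sees the offset left by the earlier write, while a second open() of the same path gets a fresh file structure starting at offset 0. The function name dup_shares_file_object is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a dup()ed descriptor shares the file
 * offset of the original while an independent open() does not,
 * 0 otherwise, and -1 on any system call error. */
int dup_shares_file_object(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int result = -1;
    int dup_fd = -1, other_fd = -1;

    /* Writing 11 bytes advances the offset stored in the file structure. */
    if (write(fd, "hello world", 11) == 11) {
        dup_fd = dup(fd);                 /* same file structure */
        other_fd = open(path, O_RDONLY);  /* new file structure */
        if (dup_fd >= 0 && other_fd >= 0) {
            off_t dup_off = lseek(dup_fd, 0, SEEK_CUR);
            off_t other_off = lseek(other_fd, 0, SEEK_CUR);
            result = (dup_off == 11 && other_off == 0);
        }
    }

    if (other_fd >= 0)
        close(other_fd);
    if (dup_fd >= 0)
        close(dup_fd);
    close(fd);
    return result;
}
```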

Mounting

To register and unregister a filesystem, use the following API functions:

#include <linux/fs.h>

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);

  • The passed struct file_system_type describes the filesystem.

  • When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. We can see all filesystems that are registered to the kernel in the file /proc/filesystems.

  • The mount() method must return the root dentry of the tree requested by the caller. An active reference to its superblock must be grabbed and the superblock must be locked. Mounting attaches the root directory of a file system (for example, a new partition) to the mount point. If the mount point directory already has contents, they are not overwritten; they are hidden from view and become visible again once the overlaying file system is unmounted.
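
To check from userspace whether a given filesystem type is registered, the list in /proc/filesystems can be scanned. The sketch below assumes the usual layout of that file, where each line holds an optional "nodev" marker, a tab, and the type name. The function name fs_is_registered is hypothetical, and the path is a parameter so the parser can also be run against a copy of the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a /proc/filesystems-style list for `fs_name`.
 * Returns 1 if found, 0 if not found, -1 if the list cannot be opened. */
int fs_is_registered(const char *list_path, const char *fs_name)
{
    FILE *f = fopen(list_path, "r");
    if (f == NULL)
        return -1;

    char line[256];
    int found = 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */

        /* The type name is the last tab-separated field on the line. */
        const char *name = strrchr(line, '\t');
        name = (name != NULL) ? name + 1 : line;

        if (strcmp(name, fs_name) == 0)
            found = 1;
    }

    fclose(f);
    return found;
}
```

On a Linux system, fs_is_registered("/proc/filesystems", "ext4") reports whether the kernel currently has ext4 registered, which may depend on whether the corresponding module is loaded.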

Superblock

  • The superblock is essentially file system metadata and defines the file system type, size, status, and information about other metadata structures (metadata of metadata).
  • The superblock is a structure that represents a file system. It includes the necessary information to manage the file system during operation. The superblock is so critical that many file systems (the ext family, for example) store multiple redundant copies of it.
  • It includes the file system type (such as ext3), the size of the file system and its state, a reference to the block device, and metadata information (such as free lists and so on).
  • The superblock is a very “high level” metadata structure for the file system. For example, if the superblock of the file system mounted at /var becomes corrupt, the operating system cannot mount that file system. In that case you run fsck, which can use an alternate, backup copy of the superblock to attempt to recover the file system (with e2fsck, a backup copy can also be chosen explicitly using the -b option).

Sparse Files

A sparse file is a file that is mostly empty, i.e. it contains large ranges of bytes whose value is 0 (zero). On the disk, the content of a file is stored in blocks of fixed size (usually 4 KiB or more). When a whole block has never been written (a “hole”, created for example by seeking past the end of the file before writing), a file system that implements sparse files does not allocate the block on disk; instead it records the gap in the file metadata.

Advantages of using sparse files:

  • Empty blocks of data do not occupy disk space. They are not stored as regular data blocks; the file metadata simply records that the range is unallocated, which takes only a few bytes. This saves 4 KiB of disk space (or more) for each empty block.
  • Reading an empty block of data from a sparse file is cheap because no data is read from disk. The file system knows the block is a hole, so it just fills the input buffer with zeros; there is no need to access the slow storage device.
  • Leaving a block empty costs nothing on write: if a program seeks over a range instead of writing zeros to it, no data is written to the disk for that range. Tools such as cp --sparse=always and dd conv=sparse detect all-zero blocks and skip them in this way. Most Linux file systems do not detect zeros themselves, so explicitly written zeros are normally stored like any other data.
  • When you read a sparse file, the file system generates the zero bytes on the fly; they are not physically stored on disk. The size reported by stat is the logical size, while the allocated (physical) size can be much smaller, and is zero for a file that consists only of a hole.
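
The difference between logical and physical size can be checked with a short experiment. The sketch below seeks past the start of a new file before writing a single byte, then compares st_size with st_blocks from fstat(). The function name make_sparse_file is hypothetical, and the code assumes a file system with sparse file support and that st_blocks counts 512-byte units, as it does on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` as a hole of `hole_bytes` bytes
 * followed by one byte of real data. On success, stores the logical size
 * and the allocated size (both in bytes) and returns 0; returns -1 on error. */
int make_sparse_file(const char *path, off_t hole_bytes,
                     off_t *logical, off_t *physical)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Seeking past the end and then writing leaves a hole behind:
     * the skipped range is never written, so no blocks are needed for it. */
    int ok = lseek(fd, hole_bytes, SEEK_SET) == hole_bytes
          && write(fd, "x", 1) == 1;

    struct stat st;
    ok = ok && fstat(fd, &st) == 0;
    close(fd);
    if (!ok)
        return -1;

    *logical = st.st_size;                   /* what applications see */
    *physical = (off_t)st.st_blocks * 512;   /* what is allocated (512-byte units assumed) */
    return 0;
}
```

With a 16 MiB hole, the logical size is 16 MiB plus one byte, while the allocated size is typically at most a few kilobytes. Running du and du --apparent-size on the resulting file shows the same gap.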

Block Suballocation

Block suballocation is a feature of some file systems which allows large blocks or allocation units to be used while still making efficient use of the slack space at the end of large files, space which would otherwise be lost to internal fragmentation. As of 2015, the most widely used read-write file systems with support for block suballocation are Btrfs and UFS2.

Some Useful Commands

  • dd: Its primary purpose is to convert and copy files.

  • stat: Displays information about a file or file system. It is built on the stat() family of system calls. lstat() is identical to stat(), except that if pathname is a symbolic link, it returns information about the link itself, not the file that it refers to. fstat() is also identical to stat(), except that the file about which information is to be retrieved is specified by a file descriptor (instead of a file name).

  • du: Reports the combined space usage of all the files in the directory tree rooted at the directory where the command is issued. Because the value also includes the data blocks used by the directories themselves, it is higher than the value displayed by a quota report. du uses stat(2) to find the number of blocks allocated to each file. du -b is equivalent to --apparent-size --block-size=1, i.e. it prints the number of bytes your applications think are in the file.

  • df: Short for “disk free”, df is a utility that provides information about total space and available space on a file system.
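
The numbers these commands print come from a small set of system calls. As a rough sketch, statvfs() supplies the kind of totals that df reports, and lstat() shows the link-versus-target distinction mentioned for stat. The function names fs_space and is_symlink are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Hypothetical helper: total and available bytes on the file system that
 * contains `path`, similar to what df reports. Returns 0 on success, -1 on error. */
int fs_space(const char *path, unsigned long long *total,
             unsigned long long *avail)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0)
        return -1;

    /* Block counts are expressed in units of f_frsize bytes. */
    *total = (unsigned long long)vfs.f_blocks * vfs.f_frsize;
    *avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;  /* for unprivileged users */
    return 0;
}

/* Hypothetical helper: 1 if `path` itself is a symbolic link, 0 if it is
 * not, -1 on error. lstat() does not follow the link, unlike stat(). */
int is_symlink(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

Calling fs_space("/", &total, &avail) fills in byte counts that should roughly match the output of df for the root file system, allowing for unit differences.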

