Chapter 11: Swap Management

  • Just as linux uses free memory for buffering data from a disk, the reverse is true as well - eventually there's a need to free up private or anonymous pages used by a process. The pages are copied to backing storage, sometimes called the 'swap area'.

  • Strictly speaking, linux doesn't swap because 'swapping' refers to copying an entire process address space to disk and 'paging' to copying out individual pages, however it's referred to as 'swapping' in discussion and documentation so we'll call it swapping regardless.

  • There are 2 principal reasons for the existence of swap space:

  1. It expands the amount of memory a process may use - virtual memory and swap space allow a large process to run even if the process is only partially resident. Because old pages may be swapped out, the amount of memory addressed may easily exceed RAM, with demand paging ensuring the pages are reloaded if necessary.

  2. Even if there is sufficient memory, swap is useful - a significant number of pages referenced by a process early on in its life may only be used for initialisation and then never used again. It's better to swap out these pages and create more disk buffers than leave them resident and unused.

  • Swap is slow. Very slow, as disks are slow (relative to memory) - if processes are frequently addressing a large amount of memory, no amount of swap or fast disks will make it run within a reasonable time; only more RAM can help.

  • It's very important that the correct page be swapped out (as discussed in chapter 10), and also that related pages be stored close together in the swap space so they are likely to be swapped in at the same time while reading ahead.

11.1 Describing the Swap Area

/*
 * The in-memory structure used to track swap areas.
 */
struct swap_info_struct {
        unsigned int flags;
        kdev_t swap_device;
        spinlock_t sdev_lock;
        struct dentry * swap_file;
        struct vfsmount *swap_vfsmnt;
        unsigned short * swap_map;
        unsigned int lowest_bit;
        unsigned int highest_bit;
        unsigned int cluster_next;
        unsigned int cluster_nr;
        int prio;                       /* swap priority */
        int pages;
        unsigned long max;
        int next;                       /* next entry on swap list */
};
  • All the swap_info_structs in the running system are stored in a statically declared array, swap_info which holds MAX_SWAPFILES (defined as 32) - this means that at most 32 swap areas can exist on a running system.

  • Looking at each field:

  1. flags - A bit field with 2 possible values - SWP_USED (0b01) and SWP_WRITEOK (0b11). SWP_USED implies the swap area is currently active, and SWP_WRITEOK is set when linux is ready to write to the area. SWP_WRITEOK contains SWP_USED because an area being writable implies it must also be in use.

  2. swap_device - The device corresponding to the partition for this swap area. If the swap area is a file, this is set to NULL.

  3. sdev_lock - Spinlock protecting the struct, most pertinently swap_map. It's locked and unlocked via swap_device_lock() and swap_device_unlock() (a brief usage sketch follows this list.)

  4. swap_file - The struct dentry for the actual special file that is mounted as a swap area, for example this file may exist in /dev in the case that a partition is mounted. This field is needed to identify the correct swap_info_struct when deactivating a swap area.

  5. swap_vfsmnt - The struct vfsmount object corresponding to where the device or file for this swap area is located.

  6. swap_map - A large array containing one entry for every swap entry or page-sized slot in the area. An 'entry' is a reference count of the number of users of the page slot, with the swap cache counting as one user and every PTE that has been paged out to the slot as the other users. If the entry is equal to SWAP_MAP_MAX, the slot is allocated permanently. If it's equal to SWAP_MAP_BAD, the slot will never be used.

  7. lowest_bit - Lowest possible free slot available in the swap area and is used to start from when linearly scanning to reduce the search space - there are definitely no free slots below this mark.

  8. highest_bit - Highest possible free slot available. Similar to lowest_bit, there are definitely no free slots above this mark.

  9. cluster_next - Offset of the next cluster of blocks to use. The swap area tries to have pages allocated in cluster blocks to increase the chance related pages will be stored together.

  10. cluster_nr - Number of pages left to allocate in this cluster.

  11. prio - The 'priority' of the swap area - this determines how likely the area is to be used. By default the priorities are arranged in order of activation, but the sysadmin may also specify it using the -p flag of swapon.

  12. pages - Because some slots on the swap file may be unusable, this field stores the number of usable pages in the swap area. This differs from max in that slots marked SWAP_MAP_BAD are not counted.

  13. max - Total number of slots in this swap area.

  14. next - The index in the swap_info array of the next swap area in the system.
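
  • To make the sdev_lock field above concrete, here is a minimal usage sketch (not verbatim kernel code - si and offset are illustrative) showing the per-area spinlock guarding swap_map while a slot's reference count is bumped, using the swap_device_lock()/swap_device_unlock() helpers mentioned above:
swap_device_lock(si);                    /* spin_lock(&si->sdev_lock) */
if (si->swap_map[offset] < SWAP_MAP_MAX - 1)
        si->swap_map[offset]++;          /* another user now refers to this slot */
swap_device_unlock(si);                  /* spin_unlock(&si->sdev_lock) */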

  • The areas are not only stored in an array, they are also kept in a struct swap_list_t 'pseudolist' swap_list. swap_list_t is a simple type:
struct swap_list_t {
        int head;       /* head of priority-ordered swapfile list */
        int next;       /* swapfile to be used next */
};
  • head is the index of the swap area of the highest priority swap area in use, and next is the index of the next swap area that should be used.

  • This list enables areas to be looked up in order of priority when necessary but still remain easily looked up in the swap_info array.
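
  • As a minimal sketch (assuming the swap_list and swap_info globals described above, and that a next value of -1 terminates the chain), the areas can be walked in priority order like so:
int i;

for (i = swap_list.head; i >= 0; i = swap_info[i].next)
        printk("swap area %d: prio %d, %d usable pages\n",
               i, swap_info[i].prio, swap_info[i].pages);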

  • Each swap area is divided into a number of page-sized slots on disk (e.g. 4KiB each on i386.)

  • The first slot is always reserved because it contains information about the swap area that must not be overwritten, including the first 1KiB which stores a disk label for the partition that can be retrieved via userspace tools.

  • The remaining space in this initial slot is used for information about the swap area which is filled when the swap area is created with mkswap. This is represented by union swap_header:

/*
 * Magic header for a swap area. The first part of the union is
 * what the swap magic looks like for the old (limited to 128MB)
 * swap area format, the second part of the union adds - in the
 * old reserved area - some extra information. Note that the first
 * kilobyte is reserved for boot loader or disk label stuff...
 *
 * Having the magic at the end of the PAGE_SIZE makes detecting swap
 * areas somewhat tricky on machines that support multiple page sizes.
 * For 2.5 we'll probably want to move the magic to just beyond the
 * bootbits...
 */
union swap_header {
        struct
        {
                char reserved[PAGE_SIZE - 10];
                char magic[10];                 /* SWAP-SPACE or SWAPSPACE2 */
        } magic;
        struct
        {
                char         bootbits[1024];    /* Space for disklabel etc. */
                unsigned int version;
                unsigned int last_page;
                unsigned int nr_badpages;
                unsigned int padding[125];
                unsigned int badpages[1];
        } info;
};
  • Looking at each of the fields:
  1. reserved - Dummy field used to position magic correctly at the end of the page.

  2. magic - The magic string that identifies a swap header. This is in place to ensure that a partition that is not a swap area will never be used by mistake, and to determine which version of the swap area format is in use - if the string is "SWAP-SPACE", it's version 1 of the swap file format; if it's "SWAPSPACE2", version 2 will be used (a short sketch of this check follows the list.)

  3. bootbits - Reserved area containing information about the partition such as the disk label, retrievable via userspace tools.

  4. version - Version of the swap area layout.

  5. last_page - Last usable page in the area.

  6. nr_badpages - Known number of bad pages that exist in the swap area.

  7. padding - Disk sectors are usually 512 bytes in size. version, last_page and nr_badpages take up 12 bytes, so this field takes up the remaining 500 bytes to sector-align badpages.

  8. badpages - The remainder of the page is used to store the indices of up to MAX_SWAP_BADPAGES number of bad page slots. These are filled in by the mkswap userland tool if the -c switch is specified to check the area.
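
  • The version check described for the magic field above is performed by sys_swapon() when the area is activated - roughly like the following sketch, where swap_header_page is assumed to be the page holding the first page-sized slot of the area:
union swap_header *swap_header = (union swap_header *) page_address(swap_header_page);
int swap_header_version = 0;

if (!memcmp("SWAP-SPACE", swap_header->magic.magic, 10))
        swap_header_version = 1;                /* old, 128MiB-limited format */
else if (!memcmp("SWAPSPACE2", swap_header->magic.magic, 10))
        swap_header_version = 2;
else
        printk("Unable to find swap-space signature\n");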

  • MAX_SWAP_BADPAGES is a compile-time constant that varies if the struct changes, but is 637 entries in its current form, determined by:
MAX_SWAP_BADPAGES = (PAGE_SIZE - <bootblock size = 1024> - <padding size = 512> -
                     <magic string size = 10>)/sizeof(long)

  • On i386 with a 4KiB page size this works out as (4096 - 1024 - 512 - 10)/4 = 2550/4 = 637 (rounding down.)

11.2 Mapping Page Table Entries to Swap Entries

  • When a page is swapped out, linux uses the corresponding PTE to store enough information to locate the page on disk again. Rather than storing the physical address of the page, the PTE contains this information and has the appropriate flags set to indicate that it is not an address.

  • Clearly, a PTE is not large enough to store precisely where on the disk the page is located, but it's more than enough to store an index into the swap_info array and an offset within the swap_map.

  • Each PTE, regardless of architecture, is large enough to store a swp_entry_t:

/*
 * A swap entry has to fit into a "unsigned long", as
 * the entry is hidden in the "index" field of the
 * swapper address space.
 *
 * We have to move it here, since not every user of fs.h is including
 * mm.h, but mm.h is including fs.h via sched .h :-/
 */
typedef struct {
        unsigned long val;
} swp_entry_t;
  • pte_to_swp_entry() and swp_entry_to_pte() are used to translate between PTEs and swap entries.

  • As always we'll focus on i386, however all architectures have to be able to determine whether a PTE is present or swapped out. In the swp_entry_t two bits are always kept free - on i386, bit 0 is reserved for the _PAGE_PRESENT flag and bit 7 is reserved for _PAGE_PROTNONE (both discussed in 3.2.)

  • Bits 1 to 6 are used for the 'type' which is the index within the swap_info array and is returned via SWP_TYPE().

  • Bits 8 through 31 (base-0, so all remaining bits) are used to store the offset within the swap_map from the swp_entry_t. This means there are 24 bits available, limiting the swap area to 64GiB (2^24 * 4096 bytes.)

  • The offset is extracted via SWP_OFFSET(). To encode a type and offset into a swp_entry_t, SWP_ENTRY() is available which does all needed bit-shifting.
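
  • On i386 these macros are defined roughly as follows - the type lives in bits 1-6 and the offset in bits 8 and above, leaving bits 0 and 7 clear for _PAGE_PRESENT and _PAGE_PROTNONE:
#define SWP_TYPE(x)             (((x).val >> 1) & 0x3f)
#define SWP_OFFSET(x)           ((x).val >> 8)
#define SWP_ENTRY(type, offset) ((swp_entry_t) { ((type) << 1) | ((offset) << 8) })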

  • Looking at the relationship between the various macros diagrammatically:

       -------          ---------- --------
       | PTE |          | Offset | | Type |
       -------          ---------- --------
          |                 |         |
          |                 |         |
          v                 v         v
----------------------   ---------------
| pte_to_swp_entry() |   | SWP_ENTRY() |         Bits reserved for
----------------------   ---------------   _PAGE_PROTNONE  _PAGE_PRESENT
          |                     |                  |              |
          /---------------------/                  |              |
          | BITS_PER_LONG                         8|7            1|0
          |       ---------------------------------v--------------v-
          \------>|            Offset             | |    Type    | |
                  --------------------------------------------------
             swp_entry_t        |                           |
                                |                           |
                                v                           v
                        ----------------             --------------
                        | SWP_OFFSET() |             | SWP_TYPE() |
                        ----------------             --------------
                                |                           |
                                |                           |
                                v                           v
                           ----------                   --------
                           | Offset |                   | Type |
                           ----------                   --------
  • The six bits for type should allow up to 64 (2^6) swap areas to exist in a 32-bit architecture, however MAX_SWAPFILES is set at 32. This is due to the consumption of the vmalloc address space (see chapter 7 for more on vmalloc.)

  • If a swap area is the maximum possible size, 32MiB is required for the swap_map (2^24 * sizeof(short)) - remembering that each page uses one short for the reference count. This means that if MAX_SWAPFILES = 32 swaps exist, 1GiB of vmalloc address space is required, which is simply impossible given the user/kernel linear address space split.

  • You'd think this would mean that supporting 64 swap areas is not worth the additional complexity, but this is not the case - if a system has multiple disks, it's worth having this complexity in order to distribute swap across all disks to allow for maximal parallelism and hence performance.

11.3 Allocating a Swap Slot

  • As discussed previously, all page-sized slots are tracked by the struct swap_info_struct->swap_map unsigned short array, each entry of which acts as a reference count (number of 'users' of the slot, with the swap cache acting as the first 'user' - a shared page can have a lot of these of course.)

  • If the entry is SWAP_MAP_MAX the page is permanently reserved for that slot. This is unlikely but not impossible - it's designed to ensure the reference count does not overflow.

  • If the entry is SWAP_MAP_BAD, the slot is unusable.

  • Finding and allocating a swap entry is divided into two major tasks - the first is performed by get_swap_page() (a simplified sketch follows these steps):

  1. Starting with swap_list->next, search swap areas for a suitable slot.

  2. Once a slot has been found, record what the next swap to be used will be and return the allocated entry.
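
  • A simplified sketch of this control flow, loosely based on the 2.4 implementation (the list locking and the 'wrapped' retry logic of the real function are omitted), looks like the following:
swp_entry_t get_swap_page_sketch(void)
{
        swp_entry_t entry;
        int type = swap_list.next;              /* area to try first */

        entry.val = 0;                          /* 0 means 'no slot found' */
        while (type >= 0) {
                struct swap_info_struct *p = &swap_info[type];

                if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
                        unsigned long offset = scan_swap_map(p);
                        if (offset) {
                                entry = SWP_ENTRY(type, offset);
                                /* Record which area to use next: round-robin
                                 * among areas of equal priority, otherwise
                                 * start again from the head of the list. */
                                if (p->next >= 0 &&
                                    swap_info[p->next].prio == p->prio)
                                        swap_list.next = p->next;
                                else
                                        swap_list.next = swap_list.head;
                                return entry;
                        }
                }
                type = p->next;                 /* fall back to lower priority */
        }
        return entry;
}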

  • The second major task is the searching itself, which is performed by scan_swap_map(). In principle it's very simple because it linearly scans the array for a free slot, however the implementation is a little more involved than that:
static inline int scan_swap_map(struct swap_info_struct *si)
{
        unsigned long offset;
        /*
         * We try to cluster swap pages by allocating them
         * sequentially in swap.  Once we've allocated
         * SWAPFILE_CLUSTER pages this way, however, we resort to
         * first-free allocation, starting a new cluster.  This
         * prevents us from scattering swap pages all over the entire
         * swap partition, so that we reduce overall disk seek times
         * between swap pages.  -- sct */
        if (si->cluster_nr) {
                while (si->cluster_next <= si->highest_bit) {
                        offset = si->cluster_next++;
                        if (si->swap_map[offset])
                                continue;
                        si->cluster_nr--;
                        goto got_page;
                }
        }
        si->cluster_nr = SWAPFILE_CLUSTER;

        /* try to find an empty (even not aligned) cluster. */
        offset = si->lowest_bit;
 check_next_cluster:
        if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
        {
                int nr;
                for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
                        if (si->swap_map[nr])
                        {
                                offset = nr+1;
                                goto check_next_cluster;
                        }
                /* We found a completly empty cluster, so start
                 * using it.
                 */
                goto got_page;
        }
        /* No luck, so now go finegrined as usual. -Andrea */
        for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) {
                if (si->swap_map[offset])
                        continue;
                si->lowest_bit = offset+1;
        got_page:
                if (offset == si->lowest_bit)
                        si->lowest_bit++;
                if (offset == si->highest_bit)
                        si->highest_bit--;
                if (si->lowest_bit > si->highest_bit) {
                        si->lowest_bit = si->max;
                        si->highest_bit = 0;
                }
                si->swap_map[offset] = 1;
                nr_swap_pages--;
                si->cluster_next = offset+1;
                return offset;
        }
        si->lowest_bit = si->max;
        si->highest_bit = 0;
        return 0;
}
  • Linux tries to organise pages into 'clusters' on disk of size SWAPFILE_CLUSTER. It allocates SWAPFILE_CLUSTER pages sequentially in swap, keeps count of the number of sequentially allocated pages in struct swap_info_struct->cluster_nr and records the current offset in swap_info_struct->cluster_next.

  • After a sequential block has been allocated, it searches for a block of free entries of size SWAPFILE_CLUSTER. If a large enough block can be found, it will be used as another cluster-sized sequence.

  • If no free clusters large enough can be found in the swap area, a simple first-free search is performed starting from swap_info_struct->lowest_bit.

  • The aim is to have pages swapped out at the same time close together, on the premise that pages swapped out together are likely to be related. This makes sense because the page replacement algorithm will use swap space most when linearly scanning the process address space.

  • Without scanning for large free blocks and using them, the allocation would likely degenerate into first-free searches and never improve; exiting processes are likely to free up large blocks of slots, and the cluster search takes advantage of this.

11.4 Swap Cache

  • Pages that are shared between many processes cannot be easily swapped out because there's no quick way to map a struct page to every PTE that references it.

  • If special care wasn't taken this fact could lead to the rare condition where a page that is present for one PTE and swapped out for another gets updated without being synced to disk, thereby losing the update.

  • To address the problem, shared pages that have a reserved slot in backing storage are considered to be part of the 'swap cache'.

  • The swap cache is simply a specialisation of the page cache with the main difference between pages in the swap cache and the page cache being that pages in the swap cache always use swapper_space as their struct address_space in page->mapping.

  • Another difference is that pages are added to the swap cache with add_to_swap_cache() instead of add_to_page_cache().

  • Taking a look at the swap cache API:

  1. get_swap_page() - Allocates a slot in a swap_map by searching active swap areas. Covered in more detail in 11.3.

  2. add_to_swap_cache() - Adds a page to the swap cache - first checking to see whether it already exists by calling swap_duplicate() and if not adding it to the swap cache via add_to_page_cache_unique().

  3. lookup_swap_cache() - Searches the swap cache and returns the struct page corresponding to the specified swap entry. It works by searching the normal page cache based on swapper_space and the swap_map offset.

  4. swap_duplicate() - Verifies a swap entry is valid, and if so, increments its swap map count.

  5. swap_free() - The complement to swap_duplicate(). Decrements the relevant counter in the swap_map. Once this count reaches zero, the slot is effectively free.

  • Anonymous pages are not part of the swap cache until an attempt is made to swap them out.

  • Taking a look at swapper_space:

struct address_space swapper_space = {
        LIST_HEAD_INIT(swapper_space.clean_pages),
        LIST_HEAD_INIT(swapper_space.dirty_pages),
        LIST_HEAD_INIT(swapper_space.locked_pages),
        0,                              /* nrpages      */
        &swap_aops,
};
  • A page is defined as being part of the swap cache when its struct page->mapping field has been set to swapper_space. This is determined by PageSwapCache().

  • Linux uses the exact same code for keeping swap-backed pages and memory in sync as it uses for keeping file-backed pages and memory in sync - they both share the page cache code, the differences being in the functions used.

  • The address space for backing storage, swapper_space, uses swap_aops in its struct address_space->a_ops.

  • The page->index field is then used to store the swp_entry_t structure instead of a file offset (its usual purpose.)

  • The struct address_space_operations for the swap cache, swap_aops, is defined as follows, using swap_writepage() and block_sync_page():

static struct address_space_operations swap_aops = {
        writepage: swap_writepage,
        sync_page: block_sync_page,
};
  • When a page is being added to the swap cache, a slot is allocated with get_swap_page(), added to the page cache with add_to_swap_cache() and then marked dirty.
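
  • A hedged sketch of that path, loosely following what the 2.4 try_to_swap_out() does for a dirty anonymous page (locking, PTE handling and the retry loop are omitted; the function name here is illustrative):
static int add_anon_page_to_swap(struct page *page)
{
        swp_entry_t entry = get_swap_page();    /* reserve a slot, see 11.3 */

        if (!entry.val)
                return 0;                       /* no swap space left */

        if (add_to_swap_cache(page, entry) == 0) {
                /* page->mapping is now swapper_space, page->index the entry */
                SetPageUptodate(page);
                set_page_dirty(page);           /* will be laundered to swap later */
                return 1;
        }

        /* Raced with a speculative read_swap_cache_async(): give the slot back */
        swap_free(entry);
        return 0;
}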

  • When the page is next laundered, it will actually be written to backing storage on disk as the normal page cache would operate. Diagrammatically:

    Anonymous
   struct page
-----------------
/               /
\      ...      \
/               /
|---------------|
|  mapping=NULL |
|---------------|       ---------------------
/               /       | try_to_swap_out() |
\      ...      \------>| attempts to swap  |
/               /       |  pages out from   |
|---------------|       |   process space   |
|    count=5    |       ---------------------
|---------------|                 |
/               /                 |
\      ...      \                 v
/               /       ---------------------
-----------------       |  get_swap_page()  |
                        | allocates slot in |-----------------\
                        |  backing storage  |                 |
                        ---------------------                 |   -----
                                  |                           |   |   |
                                  |                           |   |---|
                                  v                           \-> | 1 |
                 -----------------------------------              |---|
                 |       add_to_swap_cache()       |              /   /
                 | adds the page to the page cache |              \   \
                 |    using swapper_space as a     |              /   /
                 |  struct address_space. A dirty  |              -----
                 |   page will now sync to swap    |            Swap map
                 -----------------------------------
                                  |                       /---------\
                                  |                       |         |
                                  v                       |         v
                              Anonymous                   |     ---------
                             struct page                  |     |   |   |
                          -----------------               |     ----|----
                          /               /               |         v
                          \      ...      \               |     ---------
                          /               /               |     |   |   |
                          |---------------|               |     ----|----
                          |   mapping =   |               |         v
                          | swapper_space |               |     ---------
                          |---------------|               |     |       |
                          /               /               |     ---------
                          \      ...      \---------------/    Page Cache
                          /               /                  (inactive_list)
                          |---------------|
                          |    count=4    |
                          |---------------|
                          /               /
                          \      ...      \
                          /               /
                          -----------------
  • Subsequent swapping of the page from shared PTEs results in a call to swap_duplicate() which simply increments the reference to the slot in the swap_map.

  • If the PTE is marked dirty by the hardware as the result of a write, the bit is cleared and its struct page is marked dirty via set_page_dirty() so the on-disk copy will be synced before the page is dropped. This ensures that until all references to a page have been dropped, a check will be made to ensure the data on disk matches the data in the page frame.

  • When the reference count to the page reaches 0, the page is eligible to be dropped from the page cache. The swap map count will equal the number of PTEs that refer to the on-disk slot, so the slot won't be freed prematurely. The page is laundered and finally dropped with the same LRU ageing and logic described in chapter 10.

  • If, on the other hand, a page fault occurs for a page that is swapped out, do_swap_page() checks to see if the page exists in the swap cache by calling lookup_swap_cache() - if it exists, the PTE is updated to point to the page frame, the page reference count is incremented and the swap slot is decremented with swap_free().
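
  • A minimal sketch of that fault path, loosely based on the 2.4 do_swap_page() (the function name is illustrative; locking, waiting for the read to complete and write-access/dirty handling are omitted):
static int swap_in_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
                          pte_t *page_table, pte_t orig_pte)
{
        swp_entry_t entry = pte_to_swp_entry(orig_pte);
        struct page *page = lookup_swap_cache(entry);

        if (!page) {
                /* Not in the swap cache: allocate a page and start the read
                 * from backing storage (the real code then waits for the I/O
                 * to complete before proceeding). */
                page = read_swap_cache_async(entry);
                if (!page)
                        return -1;              /* OOM, or the entry was freed */
        }

        swap_free(entry);                       /* this PTE no longer uses the slot */
        mm->rss++;
        set_pte(page_table, mk_pte(page, vma->vm_page_prot));
        return 1;                               /* minor fault serviced */
}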

11.5 Reading Pages From Backing Storage

  • The principal function used for reading in pages is read_swap_cache_async() which is mainly called during page faulting:
/*
 * Locate a page of swap in physical memory, reserving swap cache space
 * and reading the disk if it is not already cached.
 * A failure return means that either the page allocation failed or that
 * the swap entry is no longer in use.
 */
struct page * read_swap_cache_async(swp_entry_t entry)
{
        struct page *found_page, *new_page = NULL;
        int err;

        do {
                /*
                 * First check the swap cache.  Since this is normally
                 * called after lookup_swap_cache() failed, re-calling
                 * that would confuse statistics: use find_get_page()
                 * directly.
                 */
                found_page = find_get_page(&swapper_space, entry.val);
                if (found_page)
                        break;

                /*
                 * Get a new page to read into from swap.
                 */
                if (!new_page) {
                        new_page = alloc_page(GFP_HIGHUSER);
                        if (!new_page)
                                break;          /* Out of memory */
                }

                /*
                 * Associate the page with swap entry in the swap cache.
                 * May fail (-ENOENT) if swap entry has been freed since
                 * our caller observed it.  May fail (-EEXIST) if there
                 * is already a page associated with this entry in the
                 * swap cache: added by a racing read_swap_cache_async,
                 * or by try_to_swap_out (or shmem_writepage) re-using
                 * the just freed swap entry for an existing page.
                 */
                err = add_to_swap_cache(new_page, entry);
                if (!err) {
                        /*
                         * Initiate read into locked page and return.
                         */
                        rw_swap_page(READ, new_page);
                        return new_page;
                }
        } while (err != -ENOENT);

        if (new_page)
                page_cache_release(new_page);
        return found_page;
}
  • This does the following:
  1. The function starts by searching the swap cache via find_get_page(). It doesn't use the usual lookup_swap_cache() swap cache search function, as that updates statistics on the number of searches performed; because we are likely to search multiple times, find_get_page() makes more sense here.

  2. If the page isn't already in the swap cache, allocate one using alloc_page() and add it to swap cache using add_to_swap_cache(). If, however, a page is found, skip to step 5.

  3. If the page cannot be added to the swap cache, it'll be searched again in case another process put the data there in the meantime.

  4. The data is read from backing storage via rw_swap_page() (discussed in 11.7), and returned to the user.

  5. If a new page was allocated with alloc_page() but ended up not being needed (because the page was found in the swap cache on a later pass, or the swap entry had been freed), page_cache_release() is called to free it before returning.

11.6 Writing Pages to Backing Storage

  • When any page is being written to disk the struct address_space->a_ops field is used to determine the appropriate 'write-out' function.

  • In the case of backing storage, the address_space is swapper_space and the swap operations are contained in swap_aops which uses swap_writepage() for its write-out function.

  • swap_writepage() behaves differently depending on whether the writing process is the last user of the swap cache page or not - it determines this via remove_exclusive_swap_page() which simply checks the page count with pagecache_lock held - if no other process is mapping the page it is removed from the swap cache and freed.

  • If remove_exclusive_swap_page() removed the page from the swap cache and freed it, swap_writepage() unlocks the page because it is no longer in use.

  • If the page still exists in the swap cache, rw_swap_page() is called which writes the data to the backing storage.
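
  • Roughly, the 2.4 swap_writepage() looks like the following sketch:
static int swap_writepage(struct page *page)
{
        /* Last user: the page has been removed from the swap cache and
         * freed, so there is nothing to write. */
        if (remove_exclusive_swap_page(page)) {
                UnlockPage(page);
                return 0;
        }

        /* Still in the swap cache: write the data to backing storage. */
        rw_swap_page(WRITE, page);
        return 0;
}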

11.7 Reading/Writing Swap Area Blocks

  • The top-level function for reading/writing to the swap area is rw_swap_page() which ensures that all operations are performed through the swap cache to prevent lost updates.

  • rw_swap_page() invokes rw_swap_page_base() in turn that does the heavy lifting:

/*
 * Reads or writes a swap page.
 * wait=1: start I/O and wait for completion. wait=0: start asynchronous I/O.
 *
 * Important prevention of race condition: the caller *must* atomically
 * create a unique swap cache entry for this swap page before calling
 * rw_swap_page, and must lock that page.  By ensuring that there is a
 * single page of memory reserved for the swap entry, the normal VM page
 * lock on that page also doubles as a lock on swap entries.  Having only
 * one lock to deal with per swap entry (rather than locking swap and memory
 * independently) also makes it easier to make certain swapping operations
 * atomic, which is particularly important when we are trying to ensure
 * that shared pages stay shared while being swapped.
 */

static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page)
{
        unsigned long offset;
        int zones[PAGE_SIZE/512];
        int zones_used;
        kdev_t dev = 0;
        int block_size;
        struct inode *swapf = 0;

        if (rw == READ) {
                ClearPageUptodate(page);
                kstat.pswpin++;
        } else
                kstat.pswpout++;

        get_swaphandle_info(entry, &offset, &dev, &swapf);
        if (dev) {
                zones[0] = offset;
                zones_used = 1;
                block_size = PAGE_SIZE;
        } else if (swapf) {
                int i, j;
                unsigned int block = offset
                        << (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);

                block_size = swapf->i_sb->s_blocksize;
                for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
                        if (!(zones[i] = bmap(swapf,block++))) {
                                printk("rw_swap_page: bad swap file\n");
                                return 0;
                        }
                zones_used = i;
                dev = swapf->i_dev;
        } else {
                return 0;
        }

         /* block_size == PAGE_SIZE/zones_used */
         brw_page(rw, page, dev, zones, block_size);
        return 1;
}
  • Looking at how the function operates:
  1. It checks if the operation is a read - if so, it clears the struct page->uptodate flag via ClearPageUptodate() because the page is clearly not up to date if I/O is required to fill it with data (the flag is set again if the page is successfully read from disk.)

  2. The device for the swap partition, or the inode for the swap file, is acquired via get_swaphandle_info() - this information is needed by the block layer which will be performing the actual I/O.

  3. If the swap area is a file bmap() is used to fill a local array with a list of all blocks in the filesystem that contain the page data. If the backing storage is a partition, only one page-sized block requires I/O and since no filesystem is involved bmap() is not required.

  4. Either a swap partition or a swap file can be used because rw_swap_page_base() uses brw_page() to perform the actual disk I/O, which handles both cases generically.

  5. All I/O that is performed is asynchronous so the function returns quickly - after the I/O is complete the block layer will unlock the page and any waiting process will wake up.

11.8 Activating a Swap Area

  • Activating a swap area is conceptually quite simple - open the file, load the header information from disk, populate a struct swap_info_struct, and add that to the swap list.

  • sys_swapon() is the function that activates the swap via a syscall - long sys_swapon(const char * specialfile, int swap_flags). It takes the path to the special file (quite possibly a device) and some swap flags.

  • Since this is 2.4.22 :) the 'Big Kernel Lock' (BKL) is held during the process, preventing any application from entering kernel space while the operation is being performed.

  • The function operates as follows:

  1. Find a free swap_info_struct in the swap_info array and initialise it with default values.

  2. Call user_path_walk() (and subsequently __user_walk()) to traverse the directory tree for the specified specialfile and populate a struct nameidata with the available data on the file, such as the struct dentry data and struct vfsmount data on where the file is stored.

  3. Populates the swap_info_struct fields relating to the dimensions of the swap area and how to find it. If the swap area is a partition, the block size will be aligned to the PAGE_SIZE before calculating the size. If it is a file, the information is taken directly from the struct inode.

  4. Ensures the area is not already activated. If it isn't, it allocates a page from memory and reads the first page-sized slot from the swap area which contains information about the number of good slots and how to populate the swap map with bad entries (see 11.1 for more details.)

  5. Allocate memory with vmalloc() for swap_info_struct->swap_map and initialise each entry with 0 for good slots and SWAP_MAP_BAD otherwise - see the sketch after this list (ideally the header will indicate v2, as v1 was limited to swap areas of just under 128MiB on architectures with a 4KiB page size like x86.)

  6. After ensuring the header is valid fill in the remaining swap_info_struct fields like the maximum number of pages and the available good pages, then update the global statistics for nr_swap_pages and total_swap_pages.

  7. The swap area is now fully active and initialised, so insert it into the swap list in the correct position based on the priority of the newly activated area.

  8. Release the BKL.
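
  • A hedged sketch of step 5 for a v2 header - a fragment only, with p, maxpages, swap_header and i as illustrative local names; the real code lives in sys_swapon():
p->swap_map = vmalloc(maxpages * sizeof(short));
if (!p->swap_map)
        goto bad_swap;                          /* -ENOMEM: abort activation */

memset(p->swap_map, 0, maxpages * sizeof(short));       /* 0 == free, good slot */

for (i = 0; i < swap_header->info.nr_badpages; i++) {
        unsigned int page = swap_header->info.badpages[i];
        if (page <= 0 || page >= swap_header->info.last_page)
                goto bad_swap;                           /* corrupt header */
        p->swap_map[page] = SWAP_MAP_BAD;
}
p->swap_map[0] = SWAP_MAP_BAD;                           /* slot 0 holds the header */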

11.9 Deactivating a Swap Area

  • In comparison to activating a swap area, deactivation is incredibly expensive. The main reason for this is that the swap can't simply be removed - every page that was swapped out has to now be swapped back in again.

  • Just as there's no quick way to map a struct page to every PTE that references it, there is no quick way to map a swap entry to a PTE either - it requires all process page tables be traversed to find PTEs that reference the swap area that is being deactivated and swap them all in.

  • Of course all this means swap deactivation will fail if the physical memory is not available.

  • The function concerned with deactivating an area is sys_swapoff(). The function is mostly focused on updating the struct swap_info_struct, leaving the heavy lifting of paging in each paged-out page to try_to_unuse(), which is extremely expensive.

  • For each slot used in the swap_info_struct's swap_map field, the page tables have to be traversed searching for PTEs that reference it, and in the worst case all page tables belonging to all struct mm_structs have to be traversed.

  • Broadly speaking sys_swapoff() performs the following:

  1. Call user_path_walk() to retrieve information about the special file to be deactivated, then take the BKL.

  2. Remove the swap_info_struct from the swap list and update the global statistics on the number of swap pages available (nr_swap_pages) and the total number of swap entries (total_swap_pages.) Once complete, the BKL can be released.

  3. Call try_to_unuse() which will page in all pages from the swap area to be deactivated. The function loops through the swap map using find_next_to_unuse() to locate the next used swap slot (see below for more details on what try_to_unuse() does.)

  4. If there was not enough available memory to page in all the entries, the swap area is reinserted back into the running system because it cannot simply be dropped. If, on the other hand, the process succeeded, the swap_info_struct is placed into an uninitialised state and the swap_map memory is freed with vfree().

  • For each used slot it finds, try_to_unuse() broadly does the following (a rough sketch follows this list):

  1. Call read_swap_cache_async() to allocate a page for the slot saved on disk. Ideally, it exists in the swap cache already, but the page allocator will be called if it isn't.

  2. Wait on the page to be fully paged in and lock it. Once locked, call unuse_process() for every process that has a PTE referencing the page. This function traverses the page table searching for the relevant PTE and then updates it to point to the correct struct page. If the page is a shared memory page with no remaining reference, shmem_unuse() is called instead.

  3. Free all slots that were permanently mapped. It's believed that slots will never become permanently reserved, so a risk is taken here.

  4. Delete the page from the swap cache to prevent try_to_swap_out() from referencing a page in the event it still somehow has a reference in swap from a swap map.
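
  • Putting the steps above together, a very rough sketch of the try_to_unuse() loop looks like the following - the real function also walks every mm_struct on the mmlist (not just current->mm), handles SWAP_MAP_MAX slots, shmem_unuse() and numerous races, and the exact prototype of unuse_process() is assumed here:
static int try_to_unuse_sketch(unsigned int type)
{
        struct swap_info_struct *si = &swap_info[type];
        unsigned int i = 0;

        while ((i = find_next_to_unuse(si, i)) != 0) {
                swp_entry_t entry = SWP_ENTRY(type, i);

                /* Step 1: get the data back into memory. */
                struct page *page = read_swap_cache_async(entry);
                if (!page)
                        return -ENOMEM;         /* cannot page everything in */

                /* Step 2: wait for the read to complete, lock the page and
                 * repoint every PTE that references this entry at it. */
                wait_on_page(page);
                lock_page(page);
                unuse_process(current->mm, entry, page);

                /* Steps 3/4: release the slot and drop the swap cache entry. */
                swap_free(entry);
                delete_from_swap_cache(page);
                UnlockPage(page);
                page_cache_release(page);
        }
        return 0;
}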