Chapter 3: Page Table Management

(LS) Overview

In Linux 2.4.22 a 32-bit virtual address consists of 4 offsets, indexing into the PGD, PMD and PTE tables and finally into the address's actual physical page of memory (the PGD, PMD and PTE are defined below.)

Each process has a known PGD address. A lookup then proceeds as follows (noting these are all physical addresses):

pmd_table = pgd_table[pgd_offset] & PAGE_MASK
pte_table = pmd_table[pmd_offset] & PAGE_MASK
phys_page = pte_table[pte_offset] & PAGE_MASK
phys_addr = phys_page + page_offset

NOTE: This is pseudo-code.

NOTE: PAGE_MASK = ~((1 << 12) - 1) masks off the least significant 12 bits (used for flags), since page tables and pages are page-aligned.

This indirection exists so that each process need only dedicate a single physical page of memory to its top-level page directory - entries can be empty and lower-level tables are allocated only as needed (the layout is 'sparse'.)

A virtual address therefore looks like:

<-                    BITS_PER_LONG                      ->
<- PGD bits -><- PMD bits -><- PTE bits -><- Offset bits ->
[ PGD Offset ][ PMD Offset ][ PTE Offset ][  Page Offset  ]
                                          <----- PAGE_SHIFT
                            <-------------------- PMD_SHIFT
              <-------------------------------- PGDIR_SHIFT

As discussed below, it's possible for an offset to be 0 bits long, meaning the directory in question is folded back onto the nearest higher directory.

In 32-bit non-PAE we have:

<-  10 bits -><-  0 bits  -><- 10 bits  -><-   12 bits   ->
[ PGD Offset ][ PMD Offset ][ PTE Offset ][  Page Offset  ]

In 32-bit PAE we have:

<-  2 bits  -><-  9 bits  -><-  9 bits  -><-   12 bits   ->
[ PGD Offset ][ PMD Offset ][ PTE Offset ][  Page Offset  ]

Importantly, under PAE all page table entries are stored as 64-bit values. A 32-bit Intel PAE configuration can physically address 36 bits (64GiB) of memory, so only 4 bits of the upper 32 bits of each entry are used for addressing (some of the other bits are used for flags, see the PAE wikipedia article for more details.)
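
As a rough illustration of how these offsets are extracted, here is a minimal userspace sketch (not kernel code - the constants are the non-PAE i386 values defined later in this chapter):

#include <stdio.h>

/* Non-PAE i386 values, purely illustrative. */
#define PAGE_SHIFT   12
#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024
#define PTRS_PER_PTE 1024

int main(void)
{
        unsigned long address = 0xc0101abcUL;

        /* The top 10 bits index the PGD, the next 10 index the PTE table
         * (the PMD is folded away), and the low 12 bits are the page offset. */
        unsigned long pgd_index = (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
        unsigned long pte_index = (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
        unsigned long page_off  = address & ((1UL << PAGE_SHIFT) - 1);

        /* Prints: pgd=768 pte=257 offset=0xabc */
        printf("pgd=%lu pte=%lu offset=0x%lx\n", pgd_index, pte_index, page_off);
        return 0;
}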

(LS) Page Table Function Cheat Sheet

  • pgd_offset(mm, address) - Given a process's struct mm_struct (e.g. for the current process this would be current->mm) and a virtual address, returns the virtual address of the corresponding PGD entry (i.e. a pgd_t *.)

  • pmd_offset(dir, address) - Given dir, a virtual address of type pgd_t *, i.e. a pointer to the entry which contains the PMD table's physical address, and a virtual address, returns the virtual address of the corresponding PMD entry of type pmd_t *.

  • pte_offset(dir, address) - Given dir, a virtual address of type pmd_t *, i.e. a pointer to the entry which contains the PTE table's physical address, and a virtual address, returns the virtual address of the corresponding PTE entry of type pte_t *.

  • __[pgd,pmd,pte]_offset(address) - Given a virtual address, returns the index of the [pgd,pmd,pte]_t entry for this address within the PGD, PMD, PTE.

  • pgd_index(address) - Given the virtual address, pgd_index() returns the index of the pgd_t entry for this address within the PGD. Equivalent to __pgd_offset(address).

  • [pgd,pmd]_page(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd]_t, returns the virtual address of the page referenced by this entry. So pgd_page() returns pmd_t *pmd_table and pmd_page() returns pte_t *pte_table.

  • pte_page(entry) - Given entry, a page table entry containing a physical address of type pte_t, pte_page() returns the struct page entry (also typedef'd to mem_map_t) for the referenced page.

  • [pgd,pmd,pte]_val(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd,pte]_t, returns the raw value of the entry as an unsigned integer (i.e. unsigned long, or unsigned long long when using PAE, in which case it is a 64-bit value.)

  • [pgd,pmd,pte]_none(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd,pte]_t, returns true if it does not exist, e.g. pmd_none().

  • [pgd,pmd,pte]_present(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd,pte]_t, returns true if it has the present bit set e.g. pmd_present().

  • [pgd,pmd,pte]_clear(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd,pte]_t, clears the entry e.g. pmd_clear().

  • [pgd,pmd]_bad(entry) - Given entry, a page table entry containing a physical address of type [pgd,pmd]_t, returns true if the entry is not in a state where the contents can be safely modified, e.g. pmd_bad().

  • pte_dirty(entry) - Given entry, a page table entry containing a physical address of type pte_t, returns true if the PTE has the dirty bit set, e.g. pte_dirty(). Important: pte_present() must return true for this to be useable.

  • pte_mkdirty(entry), pte_mkclean(entry) - Given entry, a page table entry containing a physical address of type pte_t, sets or clears the dirty bit respectively - pte_mkdirty(), pte_mkclean().

  • pte_young(entry) - Given entry, a page table entry containing a physical address of type pte_t, returns true if the PTE has the accessed flag set (on x86), e.g. pte_young(). Important: pte_present() must return true for this to be useable.

  • pte_mkyoung(entry), pte_mkold(entry) - Given entry, a page table entry containing a physical address of type pte_t, sets or clears the accessed bit respectively - pte_mkyoung(), pte_mkold().

  • pte_read(entry) - Given entry, a page table entry containing a physical address of type pte_t, returns true if the PTE is readable, which is implemented by the user flag on x86, e.g. pte_read().

  • pte_mkread(entry), pte_rdprotect(entry) - Given entry, a page table entry containing a physical address of type pte_t, sets or clears the user flag (on x86) respectively - pte_mkread(), pte_rdprotect().

  • pte_write(entry) - Given entry, a page table entry containing a physical address of type pte_t, returns true if the PTE is writeable, e.g. pte_write().

  • pte_mkwrite(entry), pte_wrprotect(entry) - Given entry, a page table entry containing a physical address of type pte_t, sets or clears the R/W flag respectively - pte_mkwrite(), pte_wrprotect().

  • pte_exec(entry) - Given entry, a page table entry containing a physical address of type pte_t, returns true if the PTE is executable. Note that on x86 at the time of 2.4.22 there was no NX bit, in fact it was only added circa pentium 4, and only in 32-bit mode if PAE is enabled, as it is set at the 63rd (most significant) bit. In i386, it's equivalent to pte_read() and checks for the user bit.

  • pte_mkexec(entry), pte_exprotect(entry) - Given entry, a page table entry containing a physical address of the type pte_t, sets or clears the exec bit (or rather user bit in i386) respectively - pte_mkexec(), pte_exprotect().

  • mk_pte(page, pgprot) - Given page, a struct page and pgprot, a set of protection bits of type pgprot_t, returns a pte_t ready to be inserted into the page table. This macro in turn calls __mk_pte which uses a page number (i.e. the address shifted to the right by PAGE_SHIFT bits) to determine how to populate the pte_t - mk_pte().

  • mk_pte_phys(physpage, pgprot) - Given physpage, a physical address and pgprot, a set of protection bits of type pgprot_t, returns a pte_t ready to be inserted into the page table, similar to mk_pte() - mk_pte_phys().

  • set_pte(ptep, pte) - Given ptep, a pointer to a pte_t in the page table, and a PTE entry of type pte_t, this function sets it in place. This is a function rather than simply *ptep = pte; because in some architectures (including PAE x86) additional work needs to be done. This is where the PTE returned by mk_pte() or mk_pte_phys() is assigned to the actual page table - set_pte() for 2-level (non-PAE) x86, set_pte() for 3-level (PAE) x86.

  • pte_clear(xp) - Given xp, a pointer to a pte_t in the page table, clears the entry altogether - pte_clear().

  • ptep_get_and_clear(ptep) - Given ptep, a pointer to a pte_t in the page table, returns the existing PTE entry and clears it - this is useful when modifications need to be made to either the PTE protection or the struct page itself - ptep_get_and_clear() for 2-level x86 and ptep_get_and_clear() for 3-level (PAE) x86.

  • pgd_alloc(mm) - Given a process's mm, a struct mm_struct, allocate a physical page for a PGD table, zero USER_PTRS_PER_PGD entries and copy over entries from swapper_pg_dir to populate the rest of the entries (in x86 at least) - pgd_alloc(). Typically this calls get_pgd_fast() in turn, resorting to get_pgd_slow() if necessary - pgd_alloc(). Note this allocates a PGD table page.

  • pmd_alloc(mm, pgd, address) - Given a process's mm, a struct mm_struct, pgd a pointer to an entry in a PGD table which is to contain the PMD, and address, a virtual address, simply returns pmd_offset() on i386 - i.e. since all PMDs are either folded into the PGD (2-level), or are allocated along with the PGD, the function simply returns the virtual address of the associated pmd_t - pmd_alloc(). Note this allocates a PMD table page.

  • pte_alloc(mm, pmd, address) - Given a process's mm, a struct mm_struct, pmd a pointer to an entry in a PMD table which is to contain the PTE, and address, a virtual address, allocates a new PTE table if the PMD entry is empty, via pte_alloc_one_fast() if possible (which uses pte_quicklist in a way very similar to get_pgd_fast()), otherwise it resorts to pte_alloc_one() to allocate the slow way - pte_alloc(). Note this allocates a PTE table page.

  • pgd_free(pgd) - Given pgd a pointer to a PGD table, frees the associated page. In the PAE case, PMD entries are also freed. In i386 this is just an alias for free_pgd_slow() - pgd_free().

  • pmd_free(pmd) - In (2.4.22 :) i386 this is a no-op. In the 2-level page table case PMDs don't exist; in the 3-level (PAE) case, PMDs are freed via pgd_free() - pmd_free().

  • pte_free(pte) - Given pte, a pointer to a PTE table, frees the associated page. In i386 this is just an alias for pte_free_fast(), which simply adds the PTE entry back to the cache's free list - pte_free().

Introduction

  • Linux has a unique means of abstracting the architecture-specific details of physical pages. It maintains a concept of a three-level page table (note: this is relevant to 2.4.22 of course, in later kernels it is 4 levels.) The 3 levels are:
  1. Page Global Directory (PGD)

  2. Page Middle Directory (PMD)

  3. Page Table Entry (PTE)

  • This three-level page table structure is implemented even if the underlying architecture does not support it.

  • Though this is conceptually easy to understand, it means the distinction between different types of pages is really blurry, and page types are identified by their flags or what lists they reside in rather than the objects they belong to.

  • Architectures that manage their Memory Management Unit (MMU) differently are expected to emulate the three-level page tables. E.g. on x86 without PAE, only two page table levels are available, so the PMD is defined to be of size 1 (2^0) and 'folds back' directly onto the PGD, and this is optimised out at compile time.

  • For architectures that do not manage their cache or Translation lookaside buffer (TLB) automatically, architecture-dependent hooks have to be left in the code for when the TLB and CPU caches need to be altered and flushed, even if they are null operations on some architectures like the x86 (discussed further in section 3.8.)

  • Virtual memory is divided into separate directories so we can have a 'sparse' set of data structures for each process. Each PGD takes up a page of memory (4KiB on i386), rather than the 4MiB it would take to map the whole 4GiB address space if it were only a single list.

3.1 Describing the Page Directory

  • Each process has a pointer mm_struct->pgd to its own PGD, which is a physical page frame containing an array of type pgd_t, an architecture specific type defined in asm/page.h:
struct mm_struct {
    struct vm_area_struct * mmap;           /* list of VMAs */
    rb_root_t mm_rb;
    struct vm_area_struct * mmap_cache;     /* last find_vma result */
    pgd_t * pgd;
    atomic_t mm_users;                      /* How many users with user space? */
    atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;                          /* number of VMAs */
    struct rw_semaphore mmap_sem;
    spinlock_t page_table_lock;             /* Protects task page tables and mm->rss */

    struct list_head mmlist;                /* List of all active mm's.  These are globally strung
                                             * together off init_mm.mmlist, and are protected
                                             * by mmlist_lock
                                             */

    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    unsigned long cpu_vm_mask;
    unsigned long swap_address;

    unsigned dumpable:1;

    /* Architecture-specific MM context */
    mm_context_t context;
};
  • Page tables are loaded differently depending on architecture. On x86 the process page table is loaded by copying mm_struct->pgd into the cr3 register, which has the side effect of flushing the TLB.

  • In fact, on x86, __flush_tlb() is implemented by copying cr3 into a temporary register, then copying that result back in again.
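
  • A rough sketch of the idea (this mirrors the shape of the 2.4-era macro in include/asm-i386/pgtable.h - treat the exact form as illustrative):

/* Illustrative sketch: reloading cr3 is what flushes the (non-global) TLB
 * entries, so the 'flush' is just a read-then-write of the register. */
#define sketch_flush_tlb()                                      \
        do {                                                    \
                unsigned long tmpreg;                           \
                                                                \
                __asm__ __volatile__(                           \
                        "movl %%cr3, %0\n\t"                    \
                        "movl %0, %%cr3"                        \
                        : "=r" (tmpreg)                         \
                        :                                       \
                        : "memory");                            \
        } while (0)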

  • Each active entry in the PGD table points to a page frame (i.e. physical page of memory) containing an array of PMD entries of type pmd_t, which in turn point to page frames containing PTEs of type pte_t, which finally point to page frames containing the actual user data.

  • The following code shows how a physical address can be determined manually (of course, in the majority of cases translation is performed transparently by the MMU):

/* Shared between 2-level and 3-level i386: */

#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
#define pgd_index(address) ((address >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
#define pgd_page(pgd) \
	((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
/*
 * Note that pgd_t is a struct containing just the pgd pointer, used to take
 * advantage of C type checking.
 */
#define pgd_val(x) ((x).pgd)

#define PAGE_SHIFT   12 /* 4 KiB. */
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PAGE_MASK    (~(PAGE_SIZE-1))

#define __pte_offset(address) \
		((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
#define pte_offset(dir, address) ((pte_t *) pmd_page(*(dir)) + \
			__pte_offset(address))

#define pmd_page(pmd) \
	((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))

/* In 2-level (non-PAE) configuration: */

#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024

#define PMD_SHIFT    22
#define PTRS_PER_PMD 1

#define PTRS_PER_PTE 1024

static inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
{
	return (pmd_t *) dir;
}

/* In 3-level (PAE) configuration: */

#define PGDIR_SHIFT  30
#define PTRS_PER_PGD 4

#define PMD_SHIFT    21
#define PTRS_PER_PMD 512

#define PTRS_PER_PTE 512

#define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
		__pmd_offset(address))
#define __pmd_offset(address) \
		(((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
  • pgd_offset() returns a pointer to the PGD entry for a given virtual address, given that address and the process's mm_struct, using pgd_index() to compute the index. We shift the address right by PGDIR_SHIFT bits before masking against the number of possible pointers, PTRS_PER_PGD. This number is a power of 2, so subtracting 1 from it yields a mask of every possible index into the PGD, e.g. 1000 gives all permutations of [0000,0111]. Finally we add this index to the PGD's base address, mm->pgd.
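
  • A quick hedged illustration of the mask trick (userspace snippet, using the non-PAE values shown above):

#include <assert.h>

int main(void)
{
        unsigned long address = 0xc0101abcUL;

        /* For a power-of-two table size N, '& (N - 1)' is equivalent to '% N',
         * which is why PTRS_PER_PGD - 1 works as an index mask. */
        unsigned long index_mask = (address >> 22) & (1024 - 1);
        unsigned long index_mod  = (address >> 22) % 1024;

        assert(index_mask == index_mod);        /* both equal 768 here */
        return 0;
}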

2-level i386

  • The structure is a PGD table with 1024 entries -> PTE table with 1024 entries. The PMD is just an alias for the PGD.

  • In a 2-level i386 configuration, PGDIR_SHIFT is 22 and PTRS_PER_PGD is 1024, so we use the upper 10 bits to offset into 1024 entries. pmd_offset() simply returns the PGD entry pointer cast to pmd_t *, so when we translate to the PMD, we are just looking at the same PGD entry again. The compiler removes this unnecessary intermediate step, but we retain the same abstraction in generic code. Very clever!

  • Note that the input dir value is a virtual address of type pgd_t *, i.e. a pointer to a PGD entry whose contents is a physical address, so the returned pmd_t * is exactly the same thing - a virtual address pointing at an entry which contains a physical address value.

  • Now that we have a pointer to our 'PMD' entry (i.e. the same PGD entry), we can use this to determine our PTE and finally get hold of the address of the physical page we want, exactly as we would on a 3-level system.

3-level i386 (PAE)

  • The structure is a PGD table with only 4 entries -> PMD table with 512 64-bit entries (to store more address bits) -> PTE table with 512 entries.

  • In a 3-level i386 PAE configuration, PGDIR_SHIFT is 30, PTRS_PER_PGD is 4 and we have an actual, real PMD :) To get at it we use pmd_offset(), which starts by calling pgd_page() to obtain the virtual address of the PMD table whose physical address is held in the PGD entry. pgd_page() in turn starts by calling pgd_val():

  • pgd_val() retrieves the physical address stored in the specified PGD entry - pgd_t is actually a struct containing a physical page address to make use of C type checking, so it does this by simply accessing the member.

  • Now we come back to pgd_page(), which uses PAGE_MASK (this masks out the lower 12 bits, i.e. 4KiB's worth) to get the physical page of the PMD, ignoring any flags contained in the PGD entry, then uses __va to convert this physical kernel address into a virtual kernel address, i.e. simply adding PAGE_OFFSET (kernel direct-mapped addresses are offset like this.)

  • Finally, now that we have a virtual address for the start of the PMD table, pmd_offset() needs the offset of the specific pmd_t we're after in order to return a pmd_t * virtual address to it. It does this via __pmd_offset(), which does the same job as pgd_index(), only using PMD_SHIFT (21) and PTRS_PER_PMD (512.) The shift is 21 bits because the page offset (12 bits) and the PTE index (9 bits) sit below the PMD index; the 9 bits above them address the 512 PMD entries, leaving the top 2 bits for the PGD.

(LS) Looking up the PTE

  • Similar to pmd_offset(), pte_offset() grabs the contents of the PMD entry via pmd_page() (in a 2-level system this behaves just like pgd_page()), using pmd_val() to grab the pmd field from the pmd_t, then adds the PTE index to return a pte_t * - a virtual address pointing at the PTE entry.

  • In a 3-level (PAE) system, the pte_t is a 64-bit value, meaning pte_val() returns a 64-bit physical address.

(LS) Some additional functions/values

  • PAGE_SIZE - Size of each page of memory.

  • PMD_SIZE - The size of the address range mapped by a single PMD entry, derived from PMD_SHIFT, which determines the number of address bits mapped below a PMD entry (PMD_SIZE = 1 << PMD_SHIFT.)

  • PMD_MASK - The mask for the upper bits of a PMD address, i.e. PGD+PMD.

  • PGDIR_SIZE, PGDIR_MASK - Similar to PMD equivalents for the PGD.

  • If an address needs to be aligned on a page boundary, PAGE_ALIGN() adds PAGE_SIZE - 1 to the address then bitwise ANDs it with PAGE_MASK - since PAGE_MASK masks off the lower page offset bits, an address that is already page-aligned remains unchanged (because only PAGE_SIZE - 1 was added), while any other address is rounded up to the start of the next page (see the sketch after this list.)

  • Somewhat discussed above, but PTRS_PER_PGD, PTRS_PER_PMD and PTRS_PER_PTE each describe the number of pointers provided in each table.
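
  • A minimal sketch of how these derived values fit together (the shapes mirror the i386 header definitions, but treat the block as illustrative):

/* Illustrative, non-PAE i386 values. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PAGE_MASK    (~(PAGE_SIZE - 1))

#define PMD_SHIFT    22                     /* non-PAE: the PMD folds onto the PGD */
#define PMD_SIZE     (1UL << PMD_SHIFT)     /* 4MiB mapped per PMD entry */
#define PMD_MASK     (~(PMD_SIZE - 1))

#define PGDIR_SHIFT  22
#define PGDIR_SIZE   (1UL << PGDIR_SHIFT)
#define PGDIR_MASK   (~(PGDIR_SIZE - 1))

/* Round an address up to the next page boundary (a no-op if already aligned.) */
#define PAGE_ALIGN(addr)  (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

/* e.g. PAGE_ALIGN(0x4000) == 0x4000, PAGE_ALIGN(0x4001) == 0x5000 */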

3.2 Describing a Page Table Entry

  • Each entry in the page tables is described by pgd_t, pmd_t and pte_t. Though the values the entries contain are unsigned integers, they are defined as structs. This is firstly for type safety, to ensure these are not used incorrectly, and secondly to handle e.g. PAE, which uses an additional 4 bits for addressing.

  • As seen above, type casting from these structs to unsigned integers is done via pgd_val(), pmd_val() and pte_val().

  • Reversing the process, i.e. converting from an unsigned integer to the appropriate struct is done via __pte(), __pmd() and __pgd().

  • Each PTE entry has protection bits, for which pgprot_t is used.

  • Converting to and from pgprot_t is similar to the page table entries - pgprot_val() and __pgprot().

  • Where the protection bits are actually stored is architecture-dependent, however on 32-bit x86 non-PAE, pte_t is just a 32-bit integer in a struct. Since each pte_t holds the address of a page frame, which is guaranteed to be page-aligned, we are left with PAGE_SHIFT bits free for status bits.

  • Some of the potential status values are:

  1. _PAGE_PRESENT - The page is resident in memory and not swapped out.
  2. _PAGE_RW - Page can be written to.
  3. _PAGE_USER - Page is accessible from user space.
  4. _PAGE_ACCESSED - Page has been accessed.
  5. _PAGE_DIRTY - Page has been written to.
  6. _PAGE_PROTNONE - The page is resident, but not accessible.
  • The bit referenced by _PAGE_PROTNONE is, in pentium III processors and higher, referred to as the 'Page Attribute Table' (PAT) bit and used to indicate the size of the page that the PTE is referencing. Equivalently, this bit is referred to as the 'Page Size Extension' (PSE) bit.

  • Linux doesn't use either PAT or PSE in userspace, so this bit is free for other uses. It's used to fulfil cases like an mprotect()'d region of memory set to PROT_NONE where the memory needs to be resident but also inaccessible to userland. This is achieved by clearing _PAGE_PRESENT and setting _PAGE_PROTNONE.

  • Since _PAGE_PRESENT is clear, the hardware will raise a fault if userland tries to access the page, but the pte_present() macro will see that _PAGE_PROTNONE is set, so the kernel knows to keep this page resident (a sketch of this dual check is shown below.)
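
  • A minimal sketch of that dual check, assuming illustrative flag values (the real definitions live in include/asm-i386/pgtable.h):

/* Illustrative bit values - the real ones are architecture-defined. */
#define _PAGE_PRESENT   0x001
#define _PAGE_PROTNONE  0x080   /* assumed bit position for this sketch */

/* A page counts as 'present' to the kernel if either the hardware present
 * bit is set, or it is a resident-but-inaccessible PROT_NONE page. */
#define pte_present(x)  (pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))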

3.3 Using Page Table Entries

  • (See the cheat sheet for details on useful functions.)

  • There is a lot of page table walk code in the VM, and it's important to be able to recognise it. For example, follow_page() from mm/memory.c:

/*
 * Do a quick page-table lookup for a single page.
 */
static struct page * follow_page(struct mm_struct *mm, unsigned long address, int write)
{
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *ptep, pte;

	pgd = pgd_offset(mm, address);
	if (pgd_none(*pgd) || pgd_bad(*pgd))
		goto out;

	pmd = pmd_offset(pgd, address);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		goto out;

	ptep = pte_offset(pmd, address);
	if (!ptep)
		goto out;

	pte = *ptep;
	if (pte_present(pte)) {
		if (!write ||
		    (pte_write(pte) && pte_dirty(pte)))
			return pte_page(pte);
	}

out:
	return 0;
}
  • We first grab a pointer to the PGD entry, check to make sure it isn't empty and isn't bad, use this to grab a pointer to the PMD entry, again checking for validity before finally grabbing a pointer to the PTE entry which we dereference and place a copy in local variable pte.

  • After this we are done with our walking code - if the present bit is set, we use pte_page() to return the PTE's associated struct page, but only if either write access wasn't requested, or the PTE is both writeable and dirty.

3.4 Translating and Setting Page Table Entries

  • (See the cheat sheet for details on useful functions.)
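
  • As a hedged sketch of how a few of those helpers combine (illustrative only - locking via mm->page_table_lock and TLB flushing are deliberately omitted):

/*
 * Illustrative: write-protect an existing PTE in place. ptep_get_and_clear()
 * atomically reads and clears the entry, pte_wrprotect() modifies the copy,
 * and set_pte() installs the result back into the page table.
 */
static void sketch_wrprotect_pte(pte_t *ptep)
{
        pte_t pte = ptep_get_and_clear(ptep);
        set_pte(ptep, pte_wrprotect(pte));
}

/*
 * Illustrative: install a brand new mapping for a struct page with the given
 * protection bits, using mk_pte() to build the entry.
 */
static void sketch_map_page(pte_t *ptep, struct page *page, pgprot_t prot)
{
        set_pte(ptep, mk_pte(page, prot));
}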

3.5 Allocating and Freeing Page Tables

  • (See the cheat sheet for details on useful functions.)

  • The allocation and freeing of page tables is a relatively expensive operation, both in terms of time and the fact that interrupts are disabled during page allocation.

  • In addition, the allocation and deletion of page tables at any of the 3 levels is very frequent, so it's important that the operation is as quick as possible.

  • The pages used for page tables are therefore cached in a number of different lists known as quicklists.

  • The implementation of these differs between architectures, e.g. not all architectures cache PGDs, because allocating and freeing them only happens during process creation/exit; since those are expensive operations anyway, the cost of allocating another page is negligible.

  • Allocation of pages is achieved via pgd_alloc(), pmd_alloc(), and pte_alloc().

  • Freeing of pages is achieved via pgd_free(), pmd_free(), and pte_free().

  • Generally speaking, caching is implemented for each of the page tables using pgd_quicklist, pmd_quicklist, and pte_quicklist.

  • These lists are defined in an architecture-dependent way, but one way this is achieved is via a LIFO list (i.e. a stack.) When a page is added back to the cache (i.e. is freed), it becomes the head of the list and the cache size count is incremented. A pointer to the next page in the list is stored at the start of the page, e.g.:

static inline void pte_free_fast(pte_t *pte)
{
        *(unsigned long *)pte = (unsigned long) pte_quicklist;
        pte_quicklist = (unsigned long *) pte;
        pgtable_cache_size++;
}
  • The quick allocation function from pgd_quicklist is not necessarily the same for each architecture, however commonly, get_pgd_fast() is used, which pops an entry off the free list if possible, otherwise allocates via get_pgd_slow():
static inline pgd_t *get_pgd_fast(void)
{
        unsigned long *ret;

        if ((ret = pgd_quicklist) != NULL) {
                pgd_quicklist = (unsigned long *)(*ret);
                ret[0] = 0;
                pgtable_cache_size--;
        } else
                ret = (unsigned long *)get_pgd_slow();
        return (pgd_t *)ret;
}
  • Here you can see that the next element in the list is read from the start of the returned page, that pointer is then cleared from the page, and the count is decremented.

  • pmd_alloc_one and pmd_alloc_one_fast are not implemented in i386 as PMD handling is wrapped inside of the PGD.

  • On the slow path, get_pgd_slow() and pte_alloc_one() are used to allocate these table pages when the cache doesn't have any free entries.

  • Obviously over time these caches can grow rather large. As the caches grow and shrink, a counter (pgtable_cache_size) is incremented and decremented, and there are high and low watermarks for this counter.

  • check_pgt_cache() is called from two different places to check these watermarks - the system idle task and after clear_page_tables() is run. If the high watermark is reached, pages are freed until the size returns to the low watermark.
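
  • A hedged sketch of the shape of this trimming logic (on i386 the real logic lives in do_check_pgt_cache(); names such as pte_free_slow() are the slow-path counterparts and should be treated as illustrative here):

/*
 * Illustrative sketch: once the page table cache exceeds the high watermark,
 * pop entries off the quicklists and hand them back to the normal allocator
 * until the cache shrinks to the low watermark.
 */
static int sketch_check_pgt_cache(int low, int high)
{
        int freed = 0;

        if (pgtable_cache_size > high) {
                do {
                        if (pgd_quicklist) {
                                free_pgd_slow(get_pgd_fast());
                                freed++;
                        }
                        if (pte_quicklist) {
                                pte_free_slow(pte_alloc_one_fast(NULL, 0));
                                freed++;
                        }
                } while (pgtable_cache_size > low);
        }
        return freed;
}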

3.6 Kernel Page Tables

  • When the system first starts, paging is not enabled and needs to be initialised before turning paging on. This is handled differently by each architecture, but again we'll be focusing on i386.

  • Page table initialisation is split into two phases - the bootstrap phase sets up the page tables up to 8MiB so the paging unit can be enabled, the second phase initialises the rest of the page tables.

3.6.1 Bootstrapping

  • The assembler function startup_32() enables the paging unit.

  • While all normal kernel code in vmlinuz is compiled with base address PAGE_OFFSET (== __PAGE_OFFSET) + 1MiB, the kernel is loaded beginning at the first megabyte of memory (0x00100000.)

  • This is because the first megabyte of memory is (sometimes, potentially) used by the BIOS and some devices, and is therefore skipped.

  • The bootstrap code deals with this by simply subtracting __PAGE_OFFSET from any address until the paging unit is enabled.

  • Before the paging unit is enabled, a page table mapping has to be established that translates 8MiB of physical memory to the virtual address PAGE_OFFSET.

  • Initialisation begins at compile time by statically defining swapper_pg_dir as a PGD and using a linker directive to place it at physical address 0x101000 (the .org 0x1000 directive places it at offset 0x1000 within the kernel image, which is loaded at 1MiB):

.org 0x1000
ENTRY(swapper_pg_dir)
        .long 0x00102007
        .long 0x00103007
        .fill BOOT_USER_PGD_PTRS-2,4,0
        /* default: 766 entries */
        .long 0x00102007
        .long 0x00103007
        /* default: 254 entries */
        .fill BOOT_KERNEL_PGD_PTRS-2,4,0
  • Note that the '7' (0b111) in the referenced addresses is a set of flags marking the pages 'present' (i.e. in memory), read/write-able, and user-accessible (these bits are decoded in the sketch at the end of this section.)

  • The two subsequent pages of memory (i.e. 0x102000 and 0x103000) are then statically defined as pg0 and pg1.

  • If the processor supports the Page Size Extension (PSE) feature, the pages referenced will be 4MiB in size rather than 4KiB (LS - I can't see how this is achieved in the code, confused :/)

  • The first pointers to pg0 and pg1 in swapper_pg_dir are placed to cover the region 1-9MiB; the second pointers to them cover PAGE_OFFSET + 1MiB to PAGE_OFFSET + 9MiB. This is done so that when paging is enabled, both physical and virtual addressing will work.

  • Once the mapping has been established the paging unit is turned on by setting a bit in the cr0 register and the code jumps immediately to ensure EIP is set correctly.

  • Once the bootstrap paging configuration is set, the rest of the kernel page tables are initialised ('finalised') via paging_init() (and subsequently pagetable_init().)
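
  • As a hedged illustration of what an entry such as 0x00102007 encodes (userspace snippet; the flag macros use the standard i386 bit positions but are defined locally for the example):

#include <stdio.h>

/* Illustrative flag bits for an i386 page table entry. */
#define _PAGE_PRESENT  0x001
#define _PAGE_RW       0x002
#define _PAGE_USER     0x004

int main(void)
{
        unsigned long entry = 0x00102007UL;     /* first swapper_pg_dir entry */

        /* The upper bits are the page frame address, the low 12 bits are flags. */
        printf("frame = 0x%08lx\n", entry & ~0xfffUL);      /* 0x00102000 -> pg0 */
        printf("present=%d rw=%d user=%d\n",
               !!(entry & _PAGE_PRESENT),
               !!(entry & _PAGE_RW),
               !!(entry & _PAGE_USER));
        return 0;
}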

3.6.2 Finalising

  • pagetable_init() initialises page tables necessary to access ZONE_DMA and ZONE_NORMAL.

  • As discussed previously, high memory in ZONE_HIGHMEM cannot be directly referenced and temporary mappings are set up for it as needed.

  • For each pgd_t used by the kernel, the boot memory allocator (more on this in chapter 5) is called to allocate a page for the PMD (the PGD itself, swapper_pg_dir, already exists statically), and the PSE bit is set if available to use 4MiB TLB entries instead of 4KiB.

  • If the PSE bit is not supported, a page for PTEs will be allocated for each pmd_t.

  • If the Page Global Enable (PGE) flag is available (setting it for a page prevents its TLB entry from being automatically invalidated), it will be set for each page to make them globally visible to all processes.

  • Next, fixrange_init() is called to set up fixed address space mappings at the end of the virtual address space starting at FIXADDR_START.

  • These mappings are used for special purposes such as the local Advanced Programmable Interrupt Controller (APIC), and atomic kmappings between FIX_KMAP_BEGIN and FIX_KMAP_END required by kmap_atomic() (see the highmem kernel doc for more on this.)

  • Finally, fixrange_init() is called again to initialise page table entries required for normal high memory mapping with kmap().

  • After pagetable_init() returns, the page tables for kernel space are now fully initialised, so the static PGD (swapper_pg_dir) is loaded into the cr3 register so the static table is used by the paging unit.

  • The next task of paging_init() is to call kmap_init() to initialise each of the PTEs with the PAGE_KERNEL protection flags.

  • Finally zone_sizes_init() is called to initialise zone structures.

3.7 Mapping Addresses to a struct page

  • Linux needs to be able to quickly map virtual addresses to physical addresses, and to map struct pages to their physical addresses.

  • It achieves this by knowing where, in both physical and virtual memory, the global mem_map array is located - this array holds the struct pages representing all physical page frames in the system.

  • All architectures do this using a similar mechanism, however as always we'll focus on i386.

3.7.1 Mapping Physical to Virtual Kernel Addresses

  • As discussed in 3.6, linux sets up a direct mapping from physical address 0 to virtual address PAGE_OFFSET at 3GiB on i386.

  • This means that any virtual address can be translated to a physical address by simply subtracting PAGE_OFFSET, which is essentially what virt_to_phys() does, via the __pa() macro, and vice-versa with phys_to_virt() and __va():

#define __pa(x)	                ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x)                 ((void *)((unsigned long)(x)+PAGE_OFFSET))

/**
 *	virt_to_phys	-	map virtual addresses to physical
 *	@address: address to remap
 *
 *	The returned physical address is the physical (CPU) mapping for
 *	the memory address given. It is only valid to use this function on
 *	addresses directly mapped or allocated via kmalloc.
 *
 *	This function does not give bus mappings for DMA transfers. In
 *	almost all conceivable cases a device driver should not be using
 *	this function
 */

static inline unsigned long virt_to_phys(volatile void * address)
{
        return __pa(address);
}

/**
 *	phys_to_virt	-	map physical address to virtual
 *	@address: address to remap
 *
 *	The returned virtual address is a current CPU mapping for
 *	the memory address given. It is only valid to use this function on
 *	addresses that have a kernel mapping
 *
 *	This function does not handle bus mappings for DMA transfers. In
 *	almost all conceivable cases a device driver should not be using
 *	this function
 */

static inline void * phys_to_virt(unsigned long address)
{
        return __va(address);
}
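
  • A small usage sketch (hypothetical addresses - the identity only holds for the kernel's directly-mapped region, not for vmalloc or user mappings):

/* Hypothetical example - valid only for directly-mapped kernel addresses. */
static void sketch_round_trip(void)
{
        void *vaddr = (void *)0xc0200000;            /* PAGE_OFFSET + 2MiB */
        unsigned long paddr = virt_to_phys(vaddr);   /* == 0x00200000 */
        void *again = phys_to_virt(paddr);           /* == vaddr once more */

        (void)again;
}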

3.7.2 Mapping struct pages to Physical Addresses

  • As we saw in 3.6.1, the kernel image is located at physical address 1MiB, translating to virtual address PAGE_OFFSET + 0x00100000, and a virtual region of 8MiB is reserved for the image (the maximum the 2 PGD entries allow for.)

  • This seems to imply that the first available memory would be located at 0xc0800000 (given PAGE_OFFSET is 0xc0000000, and we're using the first 8MiB), however this is not the case.

  • Linux tries to reserve the first 16MiB of memory for ZONE_DMA so the first virtual area used for kernel allocations is actually 0xc1000000, and is where the global mem_map is usually allocated.

  • ZONE_DMA will still get used, but only when absolutely necessary.

  • Physical addresses get translated into struct pages by treating them as an index into the mem_map array.

  • Shifting a physical address PAGE_SHIFT bits to the right will convert them into a 'Page Frame Number' (PFN) from physical address 0, which is also an index within the mem_map array.

  • This is precisely what the virt_to_page() macro does:

#define virt_to_page(kaddr)     (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
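
  • A hedged sketch of the arithmetic in both directions (illustrative macro names; the reverse direction relies on mem_map being a simple flat array, as it is on i386):

/* Kernel virtual address -> struct page: strip PAGE_OFFSET via __pa(), shift
 * down to a page frame number, then index into the flat mem_map array. */
#define sketch_virt_to_page(kaddr) \
        (mem_map + (__pa(kaddr) >> PAGE_SHIFT))

/* struct page -> physical address: the page's index within mem_map is its
 * PFN, so shift it back up by PAGE_SHIFT. */
#define sketch_page_to_phys(page) \
        (((unsigned long)((page) - mem_map)) << PAGE_SHIFT)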

3.8 Translation Lookaside Buffer (TLB)

  • Initially, when a process needs to map a virtual address to a physical one, it has to traverse the full page directory structure to find the relevant PTE.

  • In theory at least, this implies that each assembly instruction that references memory actually requires a number of memory references to conduct this traversal.

  • This is a serious overhead, however the issues are mitigated somewhat by the fact that most processes exhibit a 'locality of reference', i.e. a lot of accesses tend to be for a small number of pages.

  • This 'reference locality' can be taken advantage of by providing a 'Translation Lookaside Buffer' (TLB), which is a small area of memory caching references from virtual to physical addresses.

  • Linux assumes that most architectures will have a TLB, and exposes hooks where an architecture might need to provide a TLB-related operation, e.g. after a page fault has completed, the TLB may need to be updated for that virtual address mapping.

  • Not all architectures need all of these hooks, but since some do, they need to be present. The beauty of using pre-processor variables to determine the kernel configuration is that architectures which don't need the hooks define them as stubs and have the code completely removed at compile time, so there's no performance impact.

  • There is a fairly large set of TLB API hooks available, described in the cachetlb documentation.

  • It is possible to just have a single TLB flush function, but since TLB flushes/refills are potentially very expensive operations, unnecessary flushes need to be avoided if at all possible. For example, Linux will often avoid loading new page tables via 'Lazy TLB Flushing', which will be discussed further in section 4.3.

3.9 Level 1 CPU Cache Management

  • Linux manages the CPU cache in a very similar fashion to the TLB. CPU caches, like TLB caches, take advantage of locality of reference - to avoid expensive fetches from main memory, very small but fast caches are used where possible.

  • There are usually (at least) 2 levels of cache - level 1 (L1) and level 2 (L2). L2 is larger but slower than L1, however linux (at 2.4.22) concerns itself with L1 only anyway.

  • CPU caches are organised into 'lines'. Each line is small, typically 32 bytes, and aligned to its boundary size, i.e. a 32-byte line will be aligned on a 32-byte address.

  • On Linux the size of the line is L1_CACHE_BYTES and architecture-defined.

  • How addresses map to cache lines varies between architectures, but generally can be categorised into direct, associative and set associative mappings.

  • Direct mapping is the simplest approach - each block of memory maps to only one possible cache line.

  • Associative mapping means that any block of memory can map to any cache line.

  • Finally, set associative mapping is a hybrid approach where any block of memory can map to any line, but only within a subset of available lines.

  • Regardless of how the mappings are performed, addresses that are close together and aligned to the cache size are likely to use different lines.

  • Given that this is the case, linux uses some simple tricks to improve cache 'hits':

  1. Frequently accessed structure fields are placed at the start of the structure to increase the chance that only 1 line is needed to access commonly used fields.

  2. Unrelated items in a structure should be at least cache-size bytes apart to avoid 'false sharing' between CPUs.

  3. Objects in the general caches such as the mm_struct cache are aligned to the L1 CPU cache to avoid 'false sharing'.
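
  • A hedged sketch of trick 3 in practice (the ____cacheline_aligned annotation comes from linux/cache.h; the struct itself is purely illustrative):

#include <linux/cache.h>

/*
 * Illustrative: keep a hot, frequently-written counter away from unrelated
 * read-mostly fields so that writes from one CPU don't keep invalidating the
 * cache line other CPUs are reading ('false sharing').
 */
struct sketch_stats {
        unsigned long hits;                          /* written constantly */
        unsigned long misses;

        /* read-mostly configuration, pushed onto its own cache line */
        unsigned long threshold ____cacheline_aligned;
        unsigned long flags;
};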

  • If the CPU references an address that is not in the cache, a 'cache miss' occurs and data is fetched from main memory. Roughly (in the 2.4.22 era) a reference to a cache can typically be performed in less than 10ns whereas a main memory access can take 100-200ns.

  • Obviously the main objective is to have as many cache hits and as few cache misses as possible.

  • Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches, and as with TLBs, hooks are provided.

  • The hooks are placed where virtual to physical mappings change, such as during a page table update.

  • CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when virtual memory is flushed - ordering matters.

  • The functions that require careful ordering are:

  1. flush_cache_mm() - Change all page tables.

  2. flush_cache_range() - Change page table range.

  3. flush_cache_page() - Change single PTE.

  • On i386 these are no-ops. There is a pleasing comment - 'Caches aren't brain-dead on the intel.' :)

  • There are additional functions for dealing with cache coherency problems on architectures that index their cache by virtual address, where a single physical page mapped at several virtual addresses can occupy multiple cache lines (aliasing). We are focusing on x86 here, and since it's not 'brain-dead' these aren't needed either!