Physical and virtual memory
Traditionally, one has physical memory, that is, memory that is actually present in the machine, and virtual memory, that is, address space. Usually the virtual memory is much larger than the physical memory, and some hardware or software mechanism makes sure that a program can transparently use this much larger virtual space while in fact only the physical memory is available. Nowadays things are reversed: on a Pentium II one can have 64 GB physical memory, while addresses have 32 bits, so that the virtual memory has a size of 4 GB. We’ll have to wait for a 64-bit architecture to get large amounts of virtual memory again. The present situation on a Pentium with more than 4 GB is that using the PAE (Physical Address Extension) it is possible to place the addressable 4 GB anywhere in the available memory, but it is impossible to have access to more than 4 GB at once.
Kinds of memory
Kernel and user space work with virtual addresses (also called linear addresses) that are mapped to physical addresses by the memory management hardware. This mapping is defined by page tables, set up by the operating system.

DMA devices use bus addresses. On an i386 PC, bus addresses are the same as physical addresses, but other architectures may have special address mapping hardware to convert bus addresses to physical addresses.

Under Linux one has, from <asm/io.h>:

    phys_addr = virt_to_phys(virt_addr);
    virt_addr = phys_to_virt(phys_addr);
    bus_addr  = virt_to_bus(virt_addr);
    virt_addr = bus_to_virt(bus_addr);

All this is about accessing ordinary memory. There is also "shared memory" on the PCI or ISA bus. It can be mapped inside a 32-bit address space using ioremap(), and then used via the readb(), writeb() (etc.) functions.

Life is complicated by the fact that there are various caches around, so that different ways to access the same physical address need not give the same result. See asm/io.h and Documentation/{IO-mapping.txt,DMA-mapping.txt,DMA-API.txt}.
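As a small sketch, not taken from any real driver: the device address and length below are invented, the exact return types and annotations differ between kernel versions, and virt_to_bus() is deprecated on newer kernels in favour of the DMA mapping API.

    #include <linux/kernel.h>
    #include <asm/io.h>

    /* Invented values standing in for a real device's PCI memory region. */
    #define MYDEV_MEM_PHYS  0xfebf0000UL
    #define MYDEV_MEM_LEN   0x1000

    static void addr_demo(void *virt_addr)
    {
        unsigned long phys_addr = virt_to_phys(virt_addr);  /* CPU physical address */
        unsigned long bus_addr  = virt_to_bus(virt_addr);   /* what the device sees; equals phys on i386 */
        void *regs;                                         /* "void __iomem *" in later kernels */

        printk(KERN_INFO "virt %p -> phys %#lx, bus %#lx\n",
               virt_addr, phys_addr, bus_addr);

        /* PCI/ISA shared memory must be mapped before the CPU may touch it ... */
        regs = ioremap(MYDEV_MEM_PHYS, MYDEV_MEM_LEN);
        if (regs) {
            writeb(0x01, regs);     /* ... and is then accessed via writeb()/readb() etc. */
            (void) readb(regs);
            iounmap(regs);
        }
    }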
Kernel memory handling

Pages
The basic unit of memory is the page. Nobody knows how large a page is (that is why the Linux swapspace structure, with a signature at the end of a page, is so unfortunate); this is architecture-dependent, but typically PAGE_SIZE = 4096. (PAGE_SIZE equals 1 << PAGE_SHIFT, and PAGE_SHIFT is 12, 13, 14, 15 or 16 on the various architectures.) If one is lucky, the getpagesize() system call returns the page size.

Usually the page size is determined by the hardware: the relation between virtual addresses and physical addresses is given by page tables, and when a virtual address is referenced that does not (yet) correspond to a physical address, a page fault occurs, and the operating system can take appropriate action. Most hardware allows a very limited choice of page sizes. (For example, a Pentium II knows about 4 KiB and 4 MiB pages.)

Kernel memory allocation

Buddy system
The kernel uses a buddy system with power-of-two sizes. For order 0, 1, 2, ..., 9 it has lists of free areas containing 2^order pages. If a small area is needed and only a larger area is available, the larger area is split into two halves (buddies), possibly repeatedly. In this way the waste is at most 50%. When an area is freed, it is checked whether its buddy is free as well, and if so they are merged. Since 2.5.40, the number of free areas of each order can be seen in /proc/buddyinfo. Read the code in mm/page_alloc.c.

get_free_page
The routine __get_free_page() will give us a page. The routine __get_free_pages() will give a number of consecutive pages (a power of two, from 1 to 512 or so; the above buddy system is used).
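A minimal sketch of asking the buddy allocator directly for pages; the order of 2 (four pages) and GFP_KERNEL are just example choices:

    #include <linux/mm.h>

    static void pages_demo(void)
    {
        unsigned int order = 2;      /* ask for 2^2 = 4 contiguous pages */
        unsigned long addr = __get_free_pages(GFP_KERNEL, order);

        if (!addr)
            return;                  /* no free area of that order (or a larger one to split) */

        /* ... use the PAGE_SIZE << order bytes at addr ... */

        free_pages(addr, order);     /* return them; free buddies may be merged again */
    }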
kmalloc
The routine kmalloc() is good for an area of unknown, arbitrary, smallish length, in the range 32-131072 bytes (more precisely: 1/128 of a page up to 32 pages), preferably below 4096. For the sizes, see the kernel source. Because of fragmentation, it will be difficult to get large consecutive areas from kmalloc(). These days kmalloc() returns memory from one of a series of slab caches (see below) with names like "size-32", ..., "size-131072".

Priority
Each of the above routines has a flags parameter (a bit mask) indicating what behaviour is allowed. Deadlock is possible when, in order to free memory, some pages must be swapped out or some I/O must be completed, and the driver needs memory for that; the flags say whether the allocator may sleep and start such activity (GFP_KERNEL) or not (GFP_ATOMIC, e.g. in interrupt context). The flags also indicate where we want the memory (below 16 MB for ISA DMA, ordinary memory, or high memory). Finally, there is a bit specifying whether we would like a hot or a cold page (that is, a page likely to be in the CPU cache, or a page not likely to be there). If the page will be used by the CPU, a hot page will be faster. If the page will be used for device DMA, the CPU cache would be invalidated anyway, and a cold page does not waste precious cache contents.
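For example (a sketch; the 512-byte size and the variable names are arbitrary), the flags might be chosen like this:

    #include <linux/slab.h>
    #include <linux/errno.h>

    static int flags_demo(void)
    {
        /* Process context: the allocator may sleep, swap pages out, do I/O. */
        char *buf = kmalloc(512, GFP_KERNEL);

        /* When sleeping is forbidden (e.g. interrupt context) one uses GFP_ATOMIC. */
        char *atomic_buf = kmalloc(512, GFP_ATOMIC);

        /* Buffer for an ISA DMA device: must lie in memory below 16 MB. */
        char *dma_buf = kmalloc(512, GFP_KERNEL | GFP_DMA);

        if (!buf || !atomic_buf || !dma_buf) {
            kfree(buf);              /* kfree(NULL) is a no-op */
            kfree(atomic_buf);
            kfree(dma_buf);
            return -ENOMEM;
        }

        /* ... use the buffers ... */

        kfree(dma_buf);
        kfree(atomic_buf);
        kfree(buf);
        return 0;
    }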
vmalloc
The routine vmalloc() has a similar purpose, but has a better chance of being able to return larger consecutive areas, and is more expensive. It uses page table manipulation to create an area of memory that is consecutive in virtual memory, but not necessarily in physical memory. Device I/O to such an area is a bad idea. It uses the above calls with GFP_KERNEL to get its memory, so it cannot be used in interrupt context.
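A sketch (the 1 MB size and the names are arbitrary):

    #include <linux/vmalloc.h>
    #include <linux/errno.h>

    static void *big_buf;

    static int big_buf_init(void)
    {
        big_buf = vmalloc(1024 * 1024);   /* contiguous in virtual, not physical, memory */
        if (!big_buf)
            return -ENOMEM;
        return 0;
    }

    static void big_buf_exit(void)
    {
        vfree(big_buf);                   /* fine for CPU use; not for device DMA */
    }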
bigphysarea
There is a patch around, the "bigphysarea" patch, allowing one to reserve a large area of physical memory at boot time. Some devices need a lot of room (for video frame grabbing, scanned images, wavetable synthesis, etc.). See, e.g., video/zr36067.c.
The slab cache
The routine kmalloc() is general purpose, and may waste up to 50% of the space. If certain size areas are needed very frequently, it makes sense to have separate pools for them. For example, Linux has separate pools for inodes, dentries and buffer heads. The pool is created using kmem_cache_create(), and allocation is by kmem_cache_alloc().
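A sketch of such a private pool; struct foo and the cache name are invented, and the prototype shown is the 2.4-era one (later kernels drop the dtor argument and use "struct kmem_cache *" instead of kmem_cache_t):

    #include <linux/slab.h>
    #include <linux/errno.h>

    struct foo {
        int id;
        char payload[60];
    };

    static kmem_cache_t *foo_cache;

    static int foo_cache_init(void)
    {
        /* name, object size, alignment, flags, constructor, destructor */
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, 0, NULL, NULL);
        return foo_cache ? 0 : -ENOMEM;
    }

    static void foo_cache_demo(void)
    {
        struct foo *p = kmem_cache_alloc(foo_cache, GFP_KERNEL);
        if (p) {
            p->id = 1;
            kmem_cache_free(foo_cache, p);
        }
    }

    static void foo_cache_exit(void)
    {
        kmem_cache_destroy(foo_cache);
    }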
The number of special purpose caches is increasing quickly. I have here a 2.4.18 system with 63 slab caches, and a 2.5.56 one with 149 slab caches. Probably this number should be reduced a bit again.