The SLUB Allocator
An animated field guide to how the Linux kernel parcels out small objects — without thrashing, without locking, and without claiming more memory than it needs.
A million tiny objects, every second.
A running kernel is a blizzard of small allocations — file descriptors, network buffers, dentries, task structures. Asking the page allocator (which deals in 4 KB units) for each one would shred memory, cache lines, and CPU cycles. The slab allocator's job is to make those allocations cheap and dense.
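For scale, this is the whole client-side API most kernel code ever sees. The struct below (flow_entry) is a made-up example, but the shape is typical: a few dozen bytes, allocated and freed constantly, far too small to deserve a page of its own.

#include <linux/types.h>
#include <linux/slab.h>              /* kmalloc, kzalloc, kfree */

/* Hypothetical example struct, ~24 bytes of payload. */
struct flow_entry {
    u64 last_seen;
    u32 src_ip, dst_ip;
    u16 src_port, dst_port;
};

static int track_flow(void)
{
    /* Served from one of the small kmalloc-N caches, not from whole pages. */
    struct flow_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);

    if (!e)
        return -ENOMEM;
    /* ... use e, then hand it straight back to its slab ... */
    kfree(e);
    return 0;
}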
A cache, per CPU, per node.
Each kind of object lives in its own kmem_cache. Inside the cache, every CPU gets its own private kmem_cache_cpu with an active slab and a small partial-slab cache that no other CPU may touch.
One step out, every NUMA node gets a kmem_cache_node that holds a longer list of partial slabs — shared, but rarely contended.
A slab itself is one or more contiguous pages, sliced into N equally-sized objects.
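A trimmed-down sketch of those three levels. The field names follow mm/slub.c, but the real structs carry many more members than shown here:

struct kmem_cache_cpu {              /* one per CPU, touched without locks */
    void *freelist;                  /* head of the active slab's free chain */
    struct slab *slab;               /* the active slab */
    struct slab *partial;            /* small private stash of partial slabs */
};

struct kmem_cache_node {             /* one per NUMA node */
    spinlock_t list_lock;            /* the only lock in the whole story */
    unsigned long nr_partial;
    struct list_head partial;        /* longer, shared list of partial slabs */
};

struct kmem_cache {                  /* one per object type, e.g. "dentry" */
    struct kmem_cache_cpu __percpu *cpu_slab;
    unsigned int size;               /* object size, metadata included */
    unsigned int object_size;        /* what the caller asked for */
    struct kmem_cache_node *node[MAX_NUMNODES];
};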
Free objects are the metadata.
The first machine word of every free object holds a pointer to the next free object in the same slab. The chain ends in NULL. The cache stores only the head.
Allocate? Pop the head. Free? Push the head. No bitmaps. No external bookkeeping. The freelist threads through memory the CPU was about to touch anyway, so the next-pointer is hot in cache by the time we need it.
(Modern kernels XOR-hash the pointer with a per-cache cookie — CONFIG_SLAB_FREELIST_HARDENED — to break heap-overflow exploits. Same data structure, defanged.)
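In code, the chain is nothing more than a load and a store at a fixed offset inside the free object. This is a simplified rendering of the kernel's get_freepointer()/set_freepointer() helpers:

/* s->offset says where the next-pointer hides inside a free object
 * (0 in the simplest case, i.e. the first machine word). */
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
    return *(void **)((char *)object + s->offset);
}

static inline void set_freepointer(struct kmem_cache *s, void *object, void *next)
{
    *(void **)((char *)object + s->offset) = next;
}

/* With CONFIG_SLAB_FREELIST_HARDENED the stored word is obfuscated, roughly
 * next ^ s->random ^ swab(address-of-the-word), so an overflow that clobbers
 * it cannot plant a pointer the allocator will later trust. */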
kmalloc() in four loads.
The overwhelming majority of allocations resolve here. There is no spinlock and no lock-prefixed atomic, just a CPU-local pointer dance that this_cpu_cmpxchg_double keeps consistent against preemption and migration.
void *slab_alloc(struct kmem_cache *s)
{
    /* Per-CPU state; the real kernel pairs this with a transaction id (tid)
     * and commits via this_cpu_cmpxchg_double() so migration can't tear it. */
    struct kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
    void *obj = c->freelist;

    if (likely(obj)) {
        c->freelist = get_freepointer(s, obj);  /* pop the head */
        return obj;                             /* done. */
    }
    return __slab_alloc(s, c);                  /* slow path */
}
Demote, promote, repeat.
If the active slab's freelist is NULL, the slow path takes over. The active slab is simply dropped — SLUB doesn't track full slabs, which is half the reason it's leaner than SLAB.
A slab is then promoted from the per-CPU partial list. Promotion is still lockless: a cmpxchg pulls the head off the partial chain.
If the per-CPU partial list is empty, we drop into the per-node partial list (with one spinlock — but contention here is rare). Only if every list is empty do we ask the page allocator for fresh memory.
- Active slab's freelist hits NULL
- Active slab is detached (no longer tracked)
- Head of CPU partial list is promoted to active
- Allocation proceeds against the new active slab
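Condensed into code, the demote/promote dance looks roughly like this. It is a sketch, not the real function: get_partial_node_or_new_slab() is a stand-in name for logic that mm/slub.c spreads across several routines.

static void *__slab_alloc(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
    struct slab *slab = c->partial;

    c->slab = NULL;                       /* drop the exhausted active slab;
                                             there is no full-slab list to put it on */
    if (slab) {
        c->partial     = slab->next;      /* promote: a cmpxchg in the real kernel */
        c->slab        = slab;
        c->freelist    = slab->freelist;  /* this CPU now owns the slab's objects */
        slab->freelist = NULL;
    } else {
        /* Per-node partial list under node->list_lock, else the page allocator. */
        c->slab     = get_partial_node_or_new_slab(s, c);
        c->freelist = c->slab->freelist;
        c->slab->freelist = NULL;
    }
    return slab_alloc(s);                 /* retry: the fast path now succeeds */
}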
A blank page, carved.
When neither the per-CPU nor the per-node partial lists can spare a slab, the cache asks the underlying page allocator (the buddy allocator) for fresh contiguous pages.
The pages are sliced into equally sized objects (the cache's object size, rounded up to its alignment). The first word of each future object is wired into a freelist chain — obj[0] → obj[1] → obj[2] → … → NULL — and the result is installed as the new active slab.
The cost is real but rare; once the slab exists, all subsequent allocations from it are back on the lockless fast path.
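A sketch of the carving step, reusing the set_freepointer() helper from earlier. allocate_pages_for_slab() is a stand-in for the buddy-allocator call, and the object-count and size fields are simplified from the real layout:

static struct slab *new_slab(struct kmem_cache *s, gfp_t flags)
{
    struct slab *slab = allocate_pages_for_slab(s, flags);   /* buddy allocator */
    char *start = slab_address(slab);                        /* first byte of the slab */
    char *p;

    /* Wire obj[0] -> obj[1] -> ... -> NULL through the fresh memory. */
    for (p = start; p < start + (slab->objects - 1) * s->size; p += s->size)
        set_freepointer(s, p, p + s->size);
    set_freepointer(s, p, NULL);                             /* last object ends the chain */

    slab->freelist = start;                                  /* head of the new chain */
    return slab;
}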
Push the head, mind the page.
Given a pointer, the kernel can find the slab it belongs to in constant time — the page descriptor for any address points back to its kmem_cache.
If the object lives on the current per-CPU active slab, the free is the mirror of the alloc: write the current head into *obj, then store obj as the new head. Lockless.
If the object belongs to some other slab, the per-slab freelist gets the push — and the slab may transition from full to partial (joining the partial list) or even partial to empty (potentially returning to the page allocator).
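The same two cases in code, again simplified: virt_to_slab() is the real constant-time lookup through the page descriptor, while the call to __slab_free() abbreviates a function that takes more arguments and handles the full→partial and partial→empty transitions.

void slab_free(struct kmem_cache *s, void *obj)
{
    struct kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
    struct slab *slab = virt_to_slab(obj);       /* page descriptor -> owning slab */

    if (slab == c->slab) {
        /* Mirror of the fast alloc: push obj onto the local freelist head. */
        set_freepointer(s, obj, c->freelist);
        c->freelist = obj;
    } else {
        /* Remote slab: push onto that slab's own freelist (cmpxchg in the kernel);
         * may move it full -> partial, or partial -> empty -> back to the buddy. */
        __slab_free(s, slab, obj);
    }
}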
Try it yourself.
A full cache with one active slab and a partial list. kmalloc and kfree at will. Watch full slabs disappear from tracking and empty slabs return to the page allocator when the partial list overflows.
Where to go from here.
The source
mm/slub.c in the Linux kernel is famously dense but readable. Start at __slab_alloc() and work outward. About 4,500 lines.
The original paper
Christoph Lameter, SLUB allocator (2007). The design rationale, written by the author. Short, lucid.
Adjacent allocators
SLAB (removed in 2024), SLOB (aimed at tiny systems, removed in 2023), and the page-level buddy allocator beneath them all.
What we glossed
NUMA-aware promotion, CPU partial-list overflow rules, freelist-pointer hardening, and the kmalloc-N general-purpose caches.
Tracing it live
/sys/kernel/slab/<cache>/ exposes per-cache stats. slabtop, perf trace -e 'kmem:*', and bpftrace are your friends.
Why it matters
Every syscall on the system passes through this code. Saving four cycles here saves them everywhere, on every machine.