We Rebuilt 38 Syscalls From Scratch

We went through every system call in the NONOS kernel and built real implementations for everything that needed one. Memory locking that actually pins pages. File locks that actually block. Thread joins that actually wait. This is the story of what we built and how it works.

eK at nonos.systems

28 minute read


This started as a quick sanity check: trace through a few critical paths and make sure the security-sensitive syscalls were doing what they claimed. Two hours in, I realized we needed proper implementations for far more than expected. What followed was weeks of building syscall handlers that do real work.

The kernel now has proper implementations across the board. Memory locking tracks pages. File locks track holders and block when needed. Thread joins wait on futexes. Extended attributes serialize to storage. Every syscall that returns success actually succeeds.


Contents

  1. The Implementation Philosophy
  2. What We Built
  3. Thread Synchronization
  4. Memory Locking
  5. File Locking
  6. Extended Attributes for Mounted Storage
  7. Memory Proof System
  8. The 75 Line Rule
  9. Lessons

312 Syscalls Reviewed. 38 Full Implementations. 89 New Modules. 75 Max Lines Per File.

The Implementation Philosophy

Every syscall needs to do real work or return ENOSYS. No middle ground. No "return 0 and hope nobody notices." If a syscall returns success, the operation must have actually happened.

This sounds obvious but it requires discipline. During early development you need syscall numbers to exist so other code can compile. The temptation is to write something minimal that validates parameters and returns success. The system boots. Tests pass. You move on.

The problem is that minimal implementations survive. Other code starts depending on them. They pass tests because most tests check for errors, not effects. By the time someone actually relies on the behavior, the implementation is load-bearing and any failures are mysterious.

If a syscall returns success it must actually succeed. Every implementation must do real work. ENOSYS is always better than false success.

We went through every syscall handler and verified the operation actually happens. Memory locking tracks pages. File locks track holders. Thread joins wait for termination. If it returns zero, it worked.

What We Built

The work touched 38 syscalls across every major subsystem. Here's what each area now has:

Memory locking - mlock, munlock, mlockall, munlockall all track pages in a BTreeSet. The memory manager checks this set before any page operation. Locked pages stay pinned at fixed physical addresses. Applications handling cryptographic keys and security-critical data get real guarantees.

File locking - flock tracks lock holders with proper conflict detection. Shared locks allow multiple readers. Exclusive locks block everyone else. The non-blocking flag returns EAGAIN on conflict. Blocking requests use wait queues and wake on release. Full POSIX semantics.

Thread synchronization - pthread_join waits on the tid futex using CLONE_CHILD_CLEARTID integration. When a thread exits, the kernel clears the futex and wakes waiters. Join blocks until it sees zero. No more racing with worker threads.

Mutex and condvar lifecycle - pthread_mutex_destroy and pthread_cond_destroy validate state before destruction. You cannot destroy a held mutex or a condvar with waiters. EBUSY tells you what went wrong.

Extended attributes - ext4 setxattr and friends serialize attribute data and write blocks to mounted storage. XATTR_CREATE and XATTR_REPLACE semantics work correctly. Attributes persist across unmount and remount.

Memory proofs - Initialization creates real capsules for kernel text, data, and heap regions. Text is sealed immediately with a BLAKE3 commitment. Verification computes fresh hashes and compares against commitments. The integrity system actually proves things.


Thread Synchronization

pthread_join has a simple contract. You give it a thread handle; it blocks until that thread terminates, then returns the thread's exit value. This is how you synchronize with worker threads. This is how you know it is safe to read results that a thread was computing. It is foundational to multithreaded programming.

The implementation uses the CLONE_CHILD_CLEARTID mechanism that Linux provides. When you create a thread with this flag, you designate a futex location. The kernel promises that when the thread exits, it will atomically clear that location to zero and wake any waiters. pthread_join just needs to wait for that to happen.
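The contract can be modeled in a few lines, with std atomics standing in for the kernel's futex. This is a simplified sketch, not the kernel code: the real joiner sleeps in futex_wait instead of spinning, and `spawn_with_cleartid` and `join_on_tid` are illustrative names.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

// Model of CLONE_CHILD_CLEARTID: the "kernel" (here, the exiting thread
// itself) atomically clears the tid word; the joiner waits for zero.
fn spawn_with_cleartid(tid_word: Arc<AtomicI32>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... thread body runs here ...
        tid_word.store(0, Ordering::Release); // cleared on exit, waiters woken
    })
}

fn join_on_tid(tid_word: &AtomicI32) {
    // The kernel version sleeps in futex_wait; a spin loop models the wait.
    while tid_word.load(Ordering::Acquire) != 0 {
        std::hint::spin_loop();
    }
}
```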

pthread_join blocks on the tid futex until the kernel clears it. The kernel guarantees this happens atomically when the thread exits. No polling. No race conditions. Proper synchronization.

The Implementation

The join implementation looks up the thread entry and clones it, releasing the table lock immediately. Holding the lock while sleeping would deadlock any other thread trying to join or create or exit.

Then it enters a loop reading the futex value using a volatile read. This is critical because without volatile the compiler might optimize the loop into a single read, deciding the value cannot change since nothing in visible code modifies it. The kernel modifies it from another context so volatile forces an actual memory read every iteration.

pub unsafe fn pthread_join(thread: pthread_t, retval: *mut *mut c_void) -> c_int {
    let threads = THREAD_TABLE.lock();
    let entry = match threads.get(&thread) {
        Some(e) => e.clone(),
        None => return ESRCH,
    };
    drop(threads);

    if entry.detached {
        return EINVAL;
    }

    if entry.tid_futex != 0 {
        loop {
            let current = core::ptr::read_volatile(&entry.tid_futex as *const _ as *const i32);
            if current == 0 { break; }
            crate::syscall::futex::futex_wait(&entry.tid_futex as *const _ as *const i32, current, core::ptr::null());
        }
    }

    if !retval.is_null() {
        *retval = entry.retval;
    }

    THREAD_TABLE.lock().remove(&thread);
    0
}

Once we observe zero we know the thread has terminated. We can safely retrieve the return value and clean up the thread table entry.

pthread_detach marks a thread as self-cleaning so no one needs to join it. The implementation checks if the thread already exited and cleans up immediately if so. No leaking thread entries.
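The detach logic can be sketched as follows. This is a simplified model: the thread-table shape and field names (`detached`, `exited`) are assumptions, and a std Mutex stands in for the kernel's spinlock.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

const ESRCH: i32 = 3;
const EINVAL: i32 = 22;

// Illustrative thread-table entry; field names are assumptions.
struct ThreadEntry {
    detached: bool,
    exited: bool,
}

static THREAD_TABLE: Mutex<BTreeMap<u64, ThreadEntry>> = Mutex::new(BTreeMap::new());

fn pthread_detach(thread: u64) -> i32 {
    let mut table = THREAD_TABLE.lock().unwrap();
    let exited = match table.get(&thread) {
        Some(e) if e.detached => return EINVAL, // already detached
        Some(e) => e.exited,
        None => return ESRCH,
    };
    if exited {
        // Thread already terminated: reclaim its entry right now.
        table.remove(&thread);
    } else {
        // Mark it self-cleaning; the exit path drops the entry later.
        table.get_mut(&thread).unwrap().detached = true;
    }
    0
}
```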

pthread_mutex_destroy and pthread_cond_destroy validate state before destruction. You cannot destroy a held mutex (returns EBUSY) or a condvar with waiters. These checks prevent subtle memory corruption from destroying synchronization objects that are still in use.
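The destroy-time checks reduce to a couple of guards. A minimal sketch, assuming per-object state with `locked` and `waiters` fields (illustrative names, not the kernel's actual layout):

```rust
const EBUSY: i32 = 16;

// Simplified state; the kernel tracks this per synchronization object.
struct KMutex { locked: bool, waiters: u32 }
struct KCondvar { waiters: u32 }

fn mutex_destroy(m: &KMutex) -> i32 {
    // A held mutex, or one with queued waiters, is still in use:
    // refuse with EBUSY instead of freeing state out from under someone.
    if m.locked || m.waiters > 0 { return EBUSY; }
    0
}

fn cond_destroy(c: &KCondvar) -> i32 {
    if c.waiters > 0 { return EBUSY; }
    0
}
```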


Memory Locking

mlock guarantees that specific memory pages stay pinned at fixed physical addresses. For NONOS this means locked pages cannot be relocated during memory compaction or reclaimed under pressure. Applications handling cryptographic keys, authentication tokens, or security-critical data use mlock to ensure their memory regions remain stable and protected.

The Implementation

Memory locking needs two things. First, tracking: a data structure that records which pages are locked. This needs to be fast to query because the memory manager checks it before any page operation. Second, integration: the memory manager must consult this tracking before any operation that might affect a page.

I used a BTreeSet for tracking. We are in a no_std environment without a random number generator so HashMap is not available. BTreeSet gives O(log n) operations which is fine since the set of locked pages is typically small and sparse.

static LOCKED_PAGES: Mutex<BTreeSet<u64>> = Mutex::new(BTreeSet::new());

pub fn handle_mlock(addr: u64, len: u64) -> SyscallResult {
    if addr & (PAGE_SIZE - 1) != 0 {
        return errno(22);  // EINVAL - must be page aligned
    }
    if len == 0 {
        return SyscallResult { value: 0, capability_consumed: false, audit_required: false };
    }

    let end = addr.saturating_add(len);
    let pages = ((end - addr) + PAGE_SIZE - 1) / PAGE_SIZE;
    if pages > 1024 * 1024 {
        return errno(12);  // ENOMEM - too many pages
    }

    let mut locked = LOCKED_PAGES.lock();
    let mut page_addr = addr;
    while page_addr < end {
        locked.insert(page_addr);
        page_addr = page_addr.saturating_add(PAGE_SIZE);
    }

    SyscallResult { value: 0, capability_consumed: false, audit_required: true }
}

pub fn is_page_locked(addr: u64) -> bool {
    let page_addr = addr & !(PAGE_SIZE - 1);
    LOCKED_PAGES.lock().contains(&page_addr)
}

The implementation validates alignment per POSIX requirements. Zero length succeeds as a no-op per the spec. We calculate the page range using saturating arithmetic to handle overflow safely - if someone passes an address near u64::MAX with a large length, saturating_add clamps at the maximum instead of wrapping around to zero.

We limit to roughly one million pages (about 4GB) to prevent resource exhaustion. The memory manager calls is_page_locked before any page operation. If the page is in the set, it stays exactly where it is.
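The integration point looks roughly like this on the memory-manager side. This is a sketch using std types in place of the kernel's spin lock; `try_relocate` and its caller are illustrative, not the actual compaction code.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

const PAGE_SIZE: u64 = 4096;
static LOCKED_PAGES: Mutex<BTreeSet<u64>> = Mutex::new(BTreeSet::new());

fn is_page_locked(addr: u64) -> bool {
    // Round down to the page base before the lookup.
    let page = addr & !(PAGE_SIZE - 1);
    LOCKED_PAGES.lock().unwrap().contains(&page)
}

// Illustrative compaction step: a pinned page is skipped, never moved.
fn try_relocate(addr: u64) -> bool {
    if is_page_locked(addr) {
        return false; // mlock'd: must keep its physical address
    }
    // ... copy the frame and rewrite the mapping ...
    true
}
```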


File Locking

Advisory file locking through flock lets processes coordinate access to shared files. Database engines use it. Package managers use it. Build systems use it. The "advisory" part means the kernel does not enforce locks - processes check voluntarily. But the kernel absolutely must track which locks exist and block conflicting requests.

The Semantics

flock supports three operations. LOCK_SH for shared access - multiple processes can hold shared locks simultaneously, but shared locks conflict with exclusive locks. LOCK_EX for exclusive access - only one process can hold an exclusive lock, and it conflicts with everything. LOCK_UN releases your lock.

The optional LOCK_NB flag makes the operation non-blocking. Return EAGAIN immediately instead of waiting for a conflicting lock to release.

The tricky part is file identification. Normal Unix systems use inode numbers but our ramfs doesn't have real inodes. Files are identified by path, so I hash the path using FNV-1a.

fn hash_path(path: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;  // FNV offset basis
    for byte in path.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);  // FNV prime
    }
    hash
}

Hash collisions mean two different files might share lock state. For advisory locks this is acceptable - you get slightly broader mutual exclusion than intended, which is conservative but safe.

The Implementation

#[derive(Debug, Clone, Copy, PartialEq)]
enum LockType { None, Shared, Exclusive }

struct FileLock {
    lock_type: LockType,
    holder_count: u32,
}

static FILE_LOCKS: Mutex<BTreeMap<u64, FileLock>> = Mutex::new(BTreeMap::new());

pub fn handle_flock(fd: i32, operation: i32) -> SyscallResult {
    if !crate::fs::fd::fd_is_valid(fd) {
        return errno(9);
    }

    let path = match crate::fs::fd::fd_get_path(fd) {
        Ok(p) => p,
        Err(_) => return errno(9),
    };

    let file_id = hash_path(&path);
    let op = operation & !LOCK_NB;
    let non_blocking = (operation & LOCK_NB) != 0;
    let mut locks = FILE_LOCKS.lock();

    match op {
        LOCK_UN => {
            if let Some(lock) = locks.get_mut(&file_id) {
                if lock.holder_count > 1 { lock.holder_count -= 1; }
                else { locks.remove(&file_id); wake_lock_waiters(file_id); }
            }
            SyscallResult { value: 0, capability_consumed: false, audit_required: false }
        }
        LOCK_SH => {
            loop {
                if let Some(lock) = locks.get_mut(&file_id) {
                    if lock.lock_type == LockType::Exclusive {
                        if non_blocking { return errno(11); }
                        drop(locks); wait_for_lock(file_id); locks = FILE_LOCKS.lock(); continue;
                    }
                    lock.holder_count += 1;
                } else {
                    locks.insert(file_id, FileLock { lock_type: LockType::Shared, holder_count: 1 });
                }
                break;
            }
            SyscallResult { value: 0, capability_consumed: false, audit_required: false }
        }
        LOCK_EX => {
            loop {
                if let Some(lock) = locks.get(&file_id) {
                    if lock.lock_type != LockType::None {
                        if non_blocking { return errno(11); }
                        drop(locks); wait_for_lock(file_id); locks = FILE_LOCKS.lock(); continue;
                    }
                }
                locks.insert(file_id, FileLock { lock_type: LockType::Exclusive, holder_count: 1 });
                break;
            }
            SyscallResult { value: 0, capability_consumed: false, audit_required: false }
        }
        _ => errno(22),
    }
}

The implementation uses a loop-based state machine with proper blocking. For unlock we decrement the holder count if there are multiple shared holders, or remove the lock entirely and wake any waiting threads. Unlocking a non-locked file is not an error.

For shared and exclusive lock requests we loop when blocked. If a conflict exists and LOCK_NB is set we return EAGAIN immediately. Otherwise we drop the global lock, call wait_for_lock to sleep on the wait queue, re-acquire the lock, and retry. This ensures proper blocking semantics while avoiding deadlocks from holding the lock while sleeping.

Blocking uses the same wait queue infrastructure we built for futexes. When a lock cannot be acquired and LOCK_NB is not set, the calling thread sleeps on a dedicated wait queue. Lock release wakes all waiting threads to race for acquisition.
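The two helpers the flock handler calls can be sketched with a per-file generation counter. Std's Mutex and Condvar stand in for the kernel's wait queues; the names match the snippet above but the bodies are illustrative, not the kernel's implementation.

```rust
use std::collections::BTreeMap;
use std::sync::{Condvar, Mutex};

// One generation counter per file id; a sleeper waits until the
// generation changes, meaning "a lock was released, go retry".
static WAITQ: Mutex<BTreeMap<u64, u64>> = Mutex::new(BTreeMap::new());
static WAKE: Condvar = Condvar::new();

fn wait_for_lock(file_id: u64) {
    let mut gens = WAITQ.lock().unwrap();
    let seen = *gens.entry(file_id).or_insert(0);
    while *gens.get(&file_id).unwrap() == seen {
        gens = WAKE.wait(gens).unwrap();
    }
}

fn wake_lock_waiters(file_id: u64) {
    let mut gens = WAITQ.lock().unwrap();
    *gens.entry(file_id).or_insert(0) += 1;
    // Wake everyone; flock semantics let waiters race for acquisition.
    WAKE.notify_all();
}
```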


Extended Attributes for Mounted Storage

Extended attributes store metadata beyond the standard inode fields - security contexts, ACLs, user-defined data, and capabilities. For NONOS, ext4 support handles mounted external storage like USB drives or network shares.

The implementation handles the full xattr lifecycle: parsing attribute names to extract namespaces, validating XATTR_CREATE and XATTR_REPLACE flags, looking up inodes, reading existing xattr blocks, preparing new entries, serializing the updated data, and writing blocks back to the storage device.
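The flag semantics at the heart of that lifecycle can be sketched against an in-memory map standing in for one inode's deserialized xattr block (the serialization and device write are elided; `set_xattr` here is a model, not the kernel's handler):

```rust
use std::collections::BTreeMap;

const EINVAL: i32 = 22;
const EEXIST: i32 = 17;
const ENODATA: i32 = 61;
const XATTR_CREATE: i32 = 0x1;
const XATTR_REPLACE: i32 = 0x2;

// In-memory stand-in for one inode's xattr block.
fn set_xattr(
    attrs: &mut BTreeMap<String, Vec<u8>>,
    name: &str,
    value: &[u8],
    flags: i32,
) -> i32 {
    // Names carry a namespace prefix: "user.", "security.", "trusted."...
    if !name.contains('.') {
        return -EINVAL;
    }
    let exists = attrs.contains_key(name);
    if flags & XATTR_CREATE != 0 && exists {
        return -EEXIST; // XATTR_CREATE requires the attribute not exist
    }
    if flags & XATTR_REPLACE != 0 && !exists {
        return -ENODATA; // XATTR_REPLACE requires it to already exist
    }
    attrs.insert(name.to_string(), value.to_vec());
    // ... serialize the block and write it back to the device ...
    0
}
```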


Memory Proof System

The kernel has a cryptographic memory integrity system designed to detect unauthorized modifications. You seal a memory region into a "capsule" with a BLAKE3 commitment hash, then later verify the contents haven't changed. This catches memory corruption, unauthorized modification, and certain attack classes that modify kernel code or data.

Initialization creates capsules for the kernel's critical memory regions: text (executable code), data (global variables), and heap (dynamic allocations). The text capsule is sealed immediately since kernel code shouldn't change after boot - any modification indicates corruption or attack.

pub fn init() -> Result<(), &'static str> {
    let mut manager = PROOF_MANAGER.lock();
    if manager.initialized {
        return Err("Proof manager already initialized");
    }

    let text_capsule = Capsule::new(
        layout::KTEXT_BASE,
        layout::KTEXT_SIZE,
        CapsuleFlags::READ | CapsuleFlags::EXECUTE,
    )?;
    manager.register_capsule("kernel_text", text_capsule)?;
    manager.seal_capsule("kernel_text")?;

    // ... data and heap capsules ...
    manager.initialized = true;
    Ok(())
}

Verification computes a fresh BLAKE3 hash and compares against the stored commitment using constant-time comparison. If they match, the memory is intact.
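The comparison side of verification can be sketched like this. The hashing itself is elided (BLAKE3 in the kernel); `Capsule` and `verify` are illustrative names, and the point is the constant-time equality check:

```rust
// Constant-time digest comparison: XOR-accumulate every byte so timing
// does not leak the position of the first mismatch.
fn ct_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}

struct Capsule {
    commitment: [u8; 32], // hash recorded at seal time
}

fn verify(capsule: &Capsule, fresh_hash: &[u8; 32]) -> bool {
    // fresh_hash is computed over the capsule's memory region;
    // the comparison is what this sketch demonstrates.
    ct_eq(&capsule.commitment, fresh_hash)
}
```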


The 75 Line Rule

While building these implementations I kept running into the same problem. Files that were impossible to navigate. The pthread implementation was spread across three files but each file was doing too many things. Finding pthread_join meant scrolling through 200 lines of unrelated thread creation and attribute management code.

So I made a rule. No file over 75 lines. One file, one logical operation. If a file is doing multiple things, split it into a directory with separate modules.

75 lines is arbitrary but practical. It fits on one screen without scrolling. It is enough for a complete function with setup, logic, error handling, and cleanup. Most functions naturally fall between 50 and 70 lines if you're not cramming multiple responsibilities into them.

23 oversized files became 89 focused modules. The total line count is similar but finding anything is now trivial. Need to understand pthread_join? Open pthread/thread/join.rs. That's it.


Lessons

Building all of this taught me things about kernel development that I couldn't have learned any other way.

Correctness first. A correct but slow implementation is infinitely better than a fast one that doesn't work. You can optimize later. You can't un-corrupt data.

Test effects, not errors. Every syscall test should verify the operation actually happened. Did mlock actually pin the pages? Did flock actually track the lock? Did setxattr actually write the data? Tests that only check for error returns miss everything.

Small files force focus. A 75-line file can only do one thing. That constraint makes the code easier to find, understand, review, and test.

ENOSYS is honest. If you can't implement something properly, return ENOSYS. Applications can handle "not supported." They cannot handle "silently did nothing."


The NONOS kernel now does what it claims. Memory locking pins pages. Thread joins wait for termination. File locks track holders and block. Extended attributes persist to storage. The memory proof system computes real commitments.

312 syscalls reviewed. 38 full implementations. 89 new modules. As always with alpha software, this work continues - but these foundations are solid.

The full technical report is available in the security audit documentation section.


NONOS Kernel Project

nonos.software
