Reading notes for Graduate Operating Systems at UCSD. The notes are based on the readings and discussions from the course.
Historical Perspective
The Structure of the "THE"-Multiprogramming System
- Layered design: (1) processes & scheduler (2) memory/storage (paging) (3) I/O for peripherals (4) human operator (5) user applications
- Synchronization via semaphores: `P` (wait), `V` (signal); two ways to use them: (1) mutual exclusion (2) coordination (see the sketch below)
- Summary: layered OS design; abstraction of processes; semaphores for synchronization; correctness
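A minimal sketch of the two uses mentioned above, using POSIX semaphores (`sem_wait`/`sem_post` playing the role of P/V); the shared counter and producer/consumer pairing are illustrative only.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t mutex;  /* initialized to 1: mutual exclusion (P before, V after the critical section) */
sem_t ready;  /* initialized to 0: coordination (consumer blocks until the producer signals) */
int shared_counter = 0;
int item = 0;

void *producer(void *arg) {
    sem_wait(&mutex);           /* P: enter critical section */
    shared_counter++;
    sem_post(&mutex);           /* V: leave critical section */

    item = 42;                  /* produce something */
    sem_post(&ready);           /* V: signal the waiting consumer */
    return NULL;
}

void *consumer(void *arg) {
    sem_wait(&ready);           /* P: block until the producer has signaled */
    printf("consumed %d\n", item);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&mutex, 0, 1);
    sem_init(&ready, 0, 0);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```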
The Nucleus of a Multiprogramming System
- Goals: extensibility, flexible OS design
- Nucleus: small, supports multiple OSes
- Components: IPC via messages; processes (create, stop, remove; internal vs external); similar to microkernel
- OSes on top of Nucleus: memory & process management (hierarchical approach)
- Economy of abstraction: processes and messages
- External processes (replaced by files today), Clock process
- Synchronization with messages (buggy processes, dynamic)
- Messages: overhead, buffer management (partitioned pool); API with separate send message/wait message and send answer/wait answer
- Summary: enable flexible OS design; nucleus with layered OSes; synchronization via message passing; clean abstractions
TENEX, a Paged Time Sharing System for the PDP-10
- TENEX Goals: virtual memory, good engineering (CLI), easy debugging/modification/maintenance, performance, effective file system, backwards compatibility
- Virtual Memory: address spaces (process, OS), memory maps (page table), BBN pager (MMU, TLB)
- Virtual Address Translation: virtual address → page table → physical address; with TLB: virtual address → TLB → (hit) physical address or (miss) page table walk (see the sketch after this list)
- BBN Pager: 9 input bits → 11 output bits; physical memory >> virtual memory
- Sharing: pages of memory, page table entries (direct, indirect, shared); copy-on-write
- Backward Compatibility: library/compatibility layer for binary compatibility; `jsys` instruction for old syscalls
- Scheduling: more tolerant of page faults
- File System: hierarchical `C\dir\file ext.version` (device, directory, name, extension, version number)
- Summary: virtual memory; sharing, copy-on-write; time-sharing system; backwards compatibility
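A rough sketch of the translation path in the bullet above (TLB hit vs. miss followed by a page-table walk). The 9-bit page fields follow the notes, but the structures, TLB size, and refill policy are made up for illustration and are not the BBN pager's actual format.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   9                      /* 9-bit page offset (512-word pages) */
#define PAGE_SIZE   (1u << PAGE_BITS)
#define NUM_VPAGES  (1u << 9)              /* 9-bit virtual page number (per the notes) */

typedef struct { bool valid; uint32_t pframe; } pte_t;               /* 11-bit frame number in practice */
typedef struct { bool valid; uint32_t vpage, pframe; } tlb_entry_t;

pte_t page_table[NUM_VPAGES];
tlb_entry_t tlb[16];

/* Translate a virtual address: consult the TLB first, fall back to the page table on a miss. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpage  = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    for (int i = 0; i < 16; i++)                        /* TLB hit: no extra memory access */
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            *paddr = (tlb[i].pframe << PAGE_BITS) | offset;
            return true;
        }

    if (!page_table[vpage].valid)                       /* miss + invalid PTE => page fault */
        return false;

    tlb[vpage % 16] = (tlb_entry_t){ true, vpage, page_table[vpage].pframe };  /* refill TLB */
    *paddr = (page_table[vpage].pframe << PAGE_BITS) | offset;
    return true;
}
```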
HYDRA
- Goal: decompose OS into subsystems (file system, scheduler, virtual memory); provide protection between subsystems; user-defined subsystems
- Key terms:
- Protection domains: e.g., process vs kernel; same rights/capabilities in one region
- Protected control transfer: syscall
- Rights augmentation
- Hydra:
- Resources: objects (type, interface, methods, data, pointers)
- Execution domain: local name space (LNS), stack frame
- Protection mechanisms: capability (pointer | rights bits); kernel rights (standard), auxiliary rights (user-defined)
- Procedures: `write(F, buffer, length) { local rights, local vars }`, where `F` is a capability; rights checking; templates for changing rights (caller vs callee, protected control transfer, rights augmentation) (see the sketch after this list)
- Summary: capability-based protection; nucleus-like system for building OSes on top
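A toy sketch of the capability-as-pointer-plus-rights-bits idea and the rights check a procedure such as `write(F, ...)` implies; the rights constants, object layout, and helper names are invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical rights bits (kernel rights would be standard; auxiliary rights user-defined). */
#define R_READ   0x1
#define R_WRITE  0x2

typedef struct { char data[64]; } file_obj_t;                        /* the protected object */
typedef struct { file_obj_t *obj; unsigned rights; } capability_t;   /* pointer | rights bits */

/* A procedure receives capabilities, not raw pointers, and checks rights before acting. */
bool cap_write(capability_t cap, const char *buf, size_t len) {
    if (!(cap.rights & R_WRITE) || len > sizeof(cap.obj->data))
        return false;                                   /* rights check failed */
    memcpy(cap.obj->data, buf, len);
    return true;
}

int main(void) {
    file_obj_t f = {0};
    capability_t rw = { &f, R_READ | R_WRITE };         /* caller's capability */
    capability_t ro = { &f, R_READ };                   /* same object, fewer rights */
    printf("%d %d\n", cap_write(rw, "hi", 2), cap_write(ro, "hi", 2));  /* prints: 1 0 */
    return 0;
}
```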
Structure
Protection (Lampson)
- Goals: abstract model; unify discussion of protection concepts
- Object: things to protect (Hydra: object, Multics: segments)
- Domains: equivalence class with same rights (Hydra: LNS, Multics: principal identifier (user), ring)
- Attributes: read, write, execute
- Access Matrix: rows (capability lists), columns (access control lists) (see the sketch after this list)
- Summary: objects, domains, access matrix, ACLs (columns), capability lists (rows)
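A small sketch contrasting the two ways of storing the access matrix: by column (an ACL attached to each object) vs. by row (a capability list held by each domain). Types and limits are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>

enum { A_READ = 1, A_WRITE = 2, A_EXEC = 4 };            /* attributes (rights) */

/* Column view of the access matrix: each object carries an ACL of (domain, rights) pairs. */
typedef struct { int domain; unsigned rights; } acl_entry_t;
typedef struct { acl_entry_t entries[8]; size_t n; } acl_t;

/* Row view of the access matrix: each domain holds a capability list of (object, rights) pairs. */
typedef struct { int object; unsigned rights; } cap_t;
typedef struct { cap_t caps[8]; size_t n; } cap_list_t;

/* ACL check: object-centric -- scan the object's list for the requesting domain. */
bool acl_allows(const acl_t *acl, int domain, unsigned attr) {
    for (size_t i = 0; i < acl->n; i++)
        if (acl->entries[i].domain == domain)
            return (acl->entries[i].rights & attr) != 0;
    return false;
}

/* Capability check: domain-centric -- possession of a capability with the right bits suffices. */
bool cap_allows(const cap_list_t *cl, int object, unsigned attr) {
    for (size_t i = 0; i < cl->n; i++)
        if (cl->caps[i].object == object)
            return (cl->caps[i].rights & attr) != 0;
    return false;
}
```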
Multics
- Protection Model Principles: permission rather than exclusion; check every access; design is not secret; least privilege; easy to use
- Segmented Memory Model: descriptors mapping to app, file, lib, fs, etc.; virtual address: `(segment number, offset)`; segments + paging; special registers (cs, ds, ss); descriptor (capability): addr, RWX, call, ring (rights)
- Protection in Multics: check protection efficiently; login; principal identifier (user ID)
- File System Protection: derived from the file system; ACLs (had to store, inefficient) → capabilities; `open("file.txt", permissions)` returns a file descriptor; `read(fd, buffer, length)`
- Protected Subsystems (e.g., File System): gates: entry points; rings: generalize user level vs kernel level; least to most privilege; (apps, lib, fs, drivers, core OS) use different ring levels
- Summary: Uniform data protection using ACL; memory access protection using descriptors (capability)
Unix
- Goals: Easy to use, general purpose, inexpensive, efficient use of hardware, provide protection, unifying abstraction (files, byte streams)
- File System:
- Data model: Before Unix (imposed structure), Unix (uninterpreted bytes, variable-length)
- API: Before Unix (no buffer cache, separate & async APIs), Unix (buffer cache, simple APIs, e.g., `write(fd, buffer, length)`)
- File names & dirs: hierarchical, mount, dirs stored like files
- Devices: device independent I/O, streaming interface, driver, ioctl for corner cases
- Protection Model: 7-bit R/W/X, user/other; `set-user-id` → execute w/ owner's permissions, enables protected subsystems
- Processes: `fork()` & `exec()` vs `CreateProcess()`. Pros of `fork()`: easy concurrency, `fork`+`exec` combo, simple; pros of `CreateProcess()`: flexible, efficient; cons of `fork()`: implementation complexity, not thread safe, insecure (see the sketch after this list)
- Summary: unifying abstractions (files), careful decomposition
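A minimal sketch of the `fork()` + `exec()` pattern from the Processes bullet: the child can adjust its own state (here, redirecting stdout, a made-up example) between `fork()` and `exec()`, which is exactly the flexibility a single `CreateProcess()`-style call must expose through parameters.

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();                       /* duplicate the current process */
    if (pid == 0) {
        /* child: tweak its own state before exec, e.g., redirect stdout to a file */
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        dup2(fd, STDOUT_FILENO);
        close(fd);
        execlp("echo", "echo", "hello from the child", (char *)NULL);  /* replace the image */
        _exit(127);                           /* only reached if exec failed */
    }
    waitpid(pid, NULL, 0);                    /* parent: wait for the child to finish */
    return 0;
}
```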
Singularity
- Motivation: software-based isolation, reconsider OS design, security vulnerabilities, failures, robustness
- Singularity Approach: language features (e.g. bounds checking), verification, apply to an OS
- Design of Singularity: SIPs (domains communicate with messages), one address space, CBCs, Manifest-Based Program (MBP)
- CBC: contract-based channels; client ← channel → server; state machine
- Summary: Software-based isolation
Synchronization
Monitors
- Monitor: sync by the compiler, enforced at runtime. Guarantees mutual exclusion; one thread at a time. Java: `synchronized` keyword. Monitor invariant: safety property; holds on thread entry/exit. Condition variable: for thread progress (liveness) in a monitor
- Hoare semantics: `wait`: block & release lock; `signal`: wake up a waiter, context switch; invariant `I`: `{I} wait {I & B}`, `{B & I} signal {I}`; condition `B`: the situation that needs a wait
- Mesa semantics: `wait`: same as Hoare; `signal/notify`: the waiter must recheck with `while (!condition) wait()`; `{I} wait {I}`, `{B & I} signal {I & B}` (see the sketch after this list)
- Trends: Hoare → compiler; Java: compiler handles locks, explicit CV; C++/Rust: hybrid
- Summary: language support for sync monitors; Hoare vs Mesa semantics; challenges (aborts, nesting)
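A minimal sketch of Mesa semantics with pthreads: since a signaled waiter only runs eventually, it must recheck the condition B in a `while` loop. The `items` counter stands in for arbitrary monitor state.

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* the monitor lock */
pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;    /* condition variable */
int items = 0;                                      /* monitor state; invariant I: items >= 0 */

void consume(void) {
    pthread_mutex_lock(&lock);
    while (items == 0)                  /* Mesa: recheck B after every wakeup */
        pthread_cond_wait(&cv, &lock);  /* atomically release the lock and block */
    items--;                            /* B holds here, so this is safe */
    pthread_mutex_unlock(&lock);
}

void produce(void) {
    pthread_mutex_lock(&lock);
    items++;
    pthread_cond_signal(&cv);           /* Mesa: just a hint; no immediate context switch */
    pthread_mutex_unlock(&lock);
}
```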
Scalability
RCU
- Multicore CPUs (~2000: first CPUs with 2 cores). Scalable: perf increases linearly w/ # cores
- Goals: Concurrent reads (even during updates); Low overhead; Deterministic completion time
- Approaches: (1) spin lock (2) read-write lock: multi readers OR 1 writer. Drawbacks: instructions, readers wait for writer, space & execution overhead
- RCU ideas: (1) writes make a copy (2) Memory barriers to enforce ordering (problem: use-after-free) (3) grace period: writers wait until all CPUs context switch; readers can’t hold pointer through context switch
- Linux's RCU API (see the sketch after this list):
- readers: rcu_read_lock (disable preemption), rcu_read_unlock (enable preemption), rcu_dereference(pointer)
- writer: synchronize_rcu (wait for all cores to context switch), call_rcu(call_back, argument) (async), rcu_assign_pointer(pointer_addr, pointer)
- RCU vs read-write locks:
- Reader: RCU imposes almost no cost, allows reads while writes occur
- Writer: RCU can take longer, better with fewer writes (not latency-sensitive)
- Drawbacks: complex, space overheads from copy, hard to use outside kernel, confusing semantics, relies on frequent context switches, hard to apply to some structures, reads ≫ writes
- Summary: multiple CPU cores → scalable synchronization; no waiting for readers, grace period for writers
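A sketch of how the RCU API above is typically used to replace a shared, read-mostly structure (kernel-style code; the `config` struct and update path are made up, and this is not a complete module).

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config { int timeout_ms; };          /* hypothetical shared, read-mostly data */
static struct config __rcu *cur_config;

int read_timeout(void)
{
	struct config *c;
	int t;

	rcu_read_lock();                    /* cheap: may just disable preemption */
	c = rcu_dereference(cur_config);    /* safe snapshot of the pointer */
	t = c ? c->timeout_ms : -1;
	rcu_read_unlock();
	return t;                           /* must not hold c past this point */
}

void update_timeout(int new_timeout_ms)
{
	/* concurrent writers would serialize with a lock (omitted in this sketch) */
	struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
	struct config *oldc;

	newc->timeout_ms = new_timeout_ms;  /* write on a private copy */
	oldc = rcu_dereference_protected(cur_config, 1);
	rcu_assign_pointer(cur_config, newc);  /* publish, with the needed memory barrier */
	synchronize_rcu();                  /* grace period: wait for pre-existing readers */
	kfree(oldc);                        /* now no reader can still see the old copy */
}
```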
Scalable Commutativity Rule
- Scalable commutativity rule: if interface ops commute → exists conflict-free (scalable) implementation
- Memory accesses: read-only or write on 1 core: scalable; write a cache line read/written by another core: not scalable
- Commutativity: `getpid()` always commutes; `open("/a/new.txt", O_CREAT)` & `open("/b/new.txt", O_CREAT)` commute
- SIM-commutativity: State-dependent, Interface-based, Monotonic (any prefix should be commutative)
- Commutative interfaces: POSIX; decompose complex ops (fork() → posix_spawn, CreateProcess); allow non-determinism (file descriptors)
- Caveats: failing rule ≠ not scalable; all ops passing ≠ scalable OS (implementation could conflict)
- Summary: Scalability at interface level, ops commute → scalable implementation exists
Virtual Memory
Virtual Memory Management in the VAX/VMS Operating System
- Goals: support diff hardware & apps; efficiency (limited mem, slow I/O)
- Address space: virtual → physical addresses
- HW support: base+length registers (exceed → segfault, updated at context switch; security issues like Meltdown → mitigation: KPTI)
- Multiple page tables for P1 & P0 to save space
- Space efficiency: 32-bit addr space; small 512B pages; allow page table swap out
- OS page table: in phys mem to avoid faulting; modern OS: pin root page table only
- TLB: cache of PTEs to avoid 2 mem accesses per access; split b/w user & kernel
- Local page replacement: simple FIFO (today: LRU variants)
- Page caching: 2nd chance; proactively reclaim pages→free list; maintains modified list
- Clustering: bring/write multiple virtually contiguous 512B pages in phys contig mem
- Summary: address space; HW support for mem mgmt
Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures
- Virtual memory for Mach (Richard Rashid → founded MSR, Avadis Tevanian → Apple CTO)
- IBM RTPC (RISC PC) → PowerPC
- Goals: machine independence, evaluate hardware platforms, support sparse address spaces
- Resident page table: inverted page table, physical pages mapping to address space items
- Memory object: backing store (executable, disk file, paging file, etc.); `userfaultfd` for user-controlled swapping
- pmap: hardware-dependent VM data & code (page frames, address bits, protection, paging vs segmentation, page table structure)
- Summary: machine independence of VM
Distributed OSes
The Distributed V Kernel and its Performance for Diskless Workstations
- Diskless workstation claims: diskless ≈ local disk performance; manageability; message IPC enables high-perf file I/O
- V IPC: inherited from Thoth, extended to distributed networks; synchronous; small fixed-size messages (32 bytes, send/reply); separate data transfer mechanism (`moveTo`/`moveFrom`, `receiveWithSegment`/`replyWithSegment`); perf optimizations (raw Ethernet)
- Network Penalty: minimum transfer time
- Kernel Primitives: same machine vs over network; clock speed matters; remote adds ms of overhead (benefits of offloading, file access takes 20ms)
- File access: streaming (network, disk IO no prefetching); I/O bound → blocked on disk no matter what (needs strong assumption that sequential prefetching doesn’t change perf); not I/O bound → doesn’t matter
- Benchmarks range from micro-level to larger, end-to-end ones
- Summary: diskless workstations; arguments based on performance
Sprite
- Larger memories → caching
- Networking: connected workstations; management/administration; transparency; sharing (single global namespace, potentially idle machines → process migration (today: migration done on container/VM level), home machine (if incompatible b/c of different kernel env, RPC back to home machine))
- Multiprocessor: sharing & parallelism → shared fork
- OS adaptation to multicore: monitor at fine-grained level
- Caching: sharing is rare; sequential (use version #); concurrent (disable caching on file-by-file basis); strategy: optimize for common case
- Summary: file caching protocol; technology trends
Plan 9
- UNIX not a good fit: workstations instead of mainframes (interconnected vs connecting to central mainframe); distributed nature; admin/management; cost; graphics, networking
- Plan 9 Approach: Terminal, CPU, FS, connected by network
- Benefit: cost; longevity (can replace some components only)
- Unifying abstraction – files: everything is a file (`/dev/mouse` can be local or remote in a distributed setting, `/proc` is a clean way to examine & control running processes, `/net`, `/dev/bitblt`, `/dev/cons`)
- Namespaces: local namespaces; easy to customize; implemented with `mount`, `umount`, `bind`; union directories (e.g., search a single `/bin` into which all `/bin`s are mounted)
- 9P: network protocol unifying all file operations (cf. NFS, SSH, X11); applications are responsible for interpreting semantics; similar to HTTP, where applications interpret the bytes
- Storage hierarchy: memory - cache; disk - file server; WORM (Write Once Read Many) - drawback: space, scalability, unreliability, slow to repopulate
- Process: `rfork` – fine-grained sharing (similar to `clone` in Linux); the concurrency-oriented language Alef influenced Go
- Impact: cost of disk not really important; privacy
LegoOS
- Building OS for disaggregated hardware
- Monolithic server vs Hardware disaggregation
- Benefits: fine-grained failure domains; scalability (independently); resource efficiency (solves bin-packing problem); adding specialized hardware; non-goal: performance; taking early ideas to the extreme → disaggregate everything
- LegoOS: Splitkernel (monitors: pComp, mComp, sComp)
- Evaluation: small working sets (performance not that bad); large working sets (actually better); resource utilization?
- Limitations: assume no sharing of memory; security; policy
- Distributed OS adopted? Disaggregated storage, but nothing more aggressive
The Performance of μ-Kernel-Based Systems
- Goals: is microkernel performance overhead fundamental? (5-10%); extensibility – new features (e.g., cache), specialization – existing services (e.g., RPC); do we need colocation? No
- Abstractions: threads; IPC (messages); address spaces
- L4 architecture:
- Syscall: standard: `read()` → libc → syscall; L4: a modified libc sends a message to L4, which routes it to the Linux server process (more overhead b/c more boundary crossings)
- Page faults → similar; Linux split into arch-dependent vs independent parts; the arch-dependent part was modified to use L4 as the platform, the other part left unmodified
- Memory management: hierarchical address spaces; page tables
- Summary: design of modern microkernel; performance = monolithic OS; specialization, extension
Extending the OS
Exokernel
- Goals: expose hardware; push everything → user level
- Benefits: specialization; extensibility; lower overhead
- Track resource ownerships → tables
- Protection via secure bindings (decouple authorization from access)
- Resource revocation
- Memory management: TLB miss → trap to exokernel (downside: overhead; solution: software TLB)
- Summary: move the OS to user level; untrusted library OSes manage HW resources directly?
Virtualization
The Origin of the VM/370 Time-Sharing System
- Why VM? Continuous operation; isolation; full access to hardware; OS development; emulate new hardware
- Trap and emulate: trap to hypervisor which emulates the behavior
Xen
- AWS built upon Xen; hence Xen became popular
- Why VMs? Data centers/cloud computing
- Requirements: isolation (security & performance); heterogeneous OSes; unmodified applications; performance; scalability; accounting
- Challenge: virtualizing x86 (hardware-managed TLB (x86, ARM); doesn't trap on every privileged instruction; instructions with different side effects in user mode, e.g., popf)
- CPU: paravirtualization (modify the guest OS); privileged instructions → hypercall; VMware → binary rewriting (scan the next basic block, and if there are any privileged instructions, call into the hypervisor); syscall optimization (insert into hypervisor)
- Memory: Xen – similar to Exokernel: page table writes → trap to the hypervisor; the guest OS can still walk page tables as on normal x86
- VMware approach: shadow page tables; use binary rewriting to copy updates into the shadow page table
- I/O: disk, network, mouse; too many device drivers
- Hardware support for Virtualization
- Summary: trap & emulate; x86 hard to virtualize; paravirtualization; binary rewriting; hardware support
Container-Based Operating System Virtualization (VServer)
- Benefits vs VMs: lower overhead, more efficiency & scalability (overcommit)
- Drawbacks vs VMs: less isolation, same OS, need to modify OS
- Isolation: fault (buggy, malicious), resource (CPU, memory), security (access to info)
- Contexts: Namespaces (Process PIDs, Users UIDs/GIDs, Networks IP ports, Mount files), Filters (local objects only)
- Resource Allocations:
- CPU: token bucket for CPU scheduling & I/O (see the sketch after this list)
- Memory, Network, Disk
- File System: chroot (part of FS container can see), CoW for sharing common files (1 container 500MB, 10 containers 700MB)
- Evaluation: Xen has higher overhead on microbenchmarks; for CPU- & memory-intensive workloads, VServer is similar to Xen
- Summary: Containers virtualize at OS level; implemented w/ namespaces & filters
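A minimal sketch of the token-bucket idea behind the CPU/I/O scheduling above: tokens accrue at rate R up to a burst size B, and work is admitted only while tokens remain. The fields and time source are illustrative.

```c
#include <stdbool.h>
#include <time.h>

typedef struct {
    double tokens;        /* current fill level */
    double rate;          /* tokens added per second (R) */
    double burst;         /* bucket capacity (B) */
    struct timespec last; /* last refill time */
} token_bucket_t;

void tb_init(token_bucket_t *tb, double rate, double burst) {
    tb->tokens = burst;                                   /* start full */
    tb->rate = rate;
    tb->burst = burst;
    clock_gettime(CLOCK_MONOTONIC, &tb->last);
}

static double elapsed_sec(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Returns true if the caller may consume `cost` units of work right now. */
bool tb_admit(token_bucket_t *tb, double cost) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    tb->tokens += tb->rate * elapsed_sec(tb->last, now);  /* refill since last check */
    if (tb->tokens > tb->burst) tb->tokens = tb->burst;   /* cap at burst size */
    tb->last = now;

    if (tb->tokens < cost) return false;                  /* not enough tokens: throttle */
    tb->tokens -= cost;
    return true;
}
```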
Firecracker
- Serverless apps: goals (performance & utilization), simple function/script, storage usually elsewhere (e.g., s3)
- Use cases: data streaming, web apps, edge settings (IoT), event-driven/on-demand, short-lived (≤15 min)
- Benefits: auto-scaling (0→∞), manageability, pay-per-use, fault tolerance, simplicity
- Motivating requirements:
- Fast switching (~100ms) for lower latency & charging
- Density (1000s of functions on single machine)
- Before Firecracker: language runtime in containers
- Firecracker: hardware-based virtualization per function to minimize overhead
- Evaluation:
- Boot-time 125ms (remove devices…)
- Memory 3MB per VMM
- Summary: Hardware-based virtualization & optimization for practical requirements (main benefits, nothing else new)
Scheduling
The Linux Scheduler: A Decade of Wasted Cores
- Goals: Work conserving (no idle core if runnable threads), efficient (cache behavior), policies (fairness, prioritization), power, general purpose
- Single-core scheduling: weighted fair sharing (vruntime), preemptive; red-black tree (pick the lowest vruntime to run next) (see the sketch after this list)
- Multi-core:
- Avoid synchronized access
- Challenges: load imbalance, complex cache hierarchies
- Load-balancing triggers: periodic (based on load factor); "emergency" (idle core); thread creation/wakeup
- Metrics: # of threads, sum of weights, load factor (weights & CPU utilization)
- Algorithm: hierarchical approach
- Bugs:
- Group imbalance bug (fix: replace average w/ minimum)
- Summary: Scalable schedule (multiple runqueues), load imbalance, CPU scheduling is tricky
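A toy sketch of the per-core policy above: threads accumulate weight-scaled vruntime and the scheduler runs the one with the minimum. CFS keeps the runqueue in a red-black tree keyed by vruntime; a plain array scan is used here for brevity.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      tid;
    uint64_t vruntime;   /* virtual runtime: actual runtime scaled by 1/weight */
    unsigned weight;     /* derived from the nice value; higher weight => slower vruntime growth */
} thread_t;

/* Charge a thread for `delta_ns` of actual CPU time. */
void account(thread_t *t, uint64_t delta_ns) {
    t->vruntime += delta_ns * 1024 / t->weight;   /* 1024 = weight of nice 0, as in CFS */
}

/* Pick the runnable thread with the lowest vruntime (CFS uses a red-black tree for O(log n)). */
thread_t *pick_next(thread_t *runqueue, size_t n) {
    thread_t *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || runqueue[i].vruntime < best->vruntime)
            best = &runqueue[i];
    return best;
}
```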
Arachne
- Motivating apps for Arachne: low-latency key-value stores, storage (RAMCloud), web services (care about p99 tail latency)
- User-level threads: less overhead, kernel-level events hidden, customization, portable
- Kernel threads: hardware parallelism, OS awareness, better isolation
- Arachne: all user-level threading; the number of cores N is constantly changing; 2-level scheduling
- Threading library:
- cooperative (no preemption), call `yield()`, blocking kernel calls (async I/O, rare page faults)
- load balancing: at thread creation, power-of-2 choices (see the sketch after this list)
- Core Arbiter: ask then revoke, blocking read
- Summary: 2-level scheduling (user & kernel), microseconds-scale app
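A small sketch of power-of-2-choices placement at thread creation: sample two cores at random and put the new thread on the less loaded one. The load metric and core count are made up.

```c
#include <stdlib.h>

enum { NCORES = 8 };
static int core_load[NCORES];        /* e.g., number of runnable user threads per core */

/* Power-of-two choices: pick two random cores, place the new thread on the less loaded one. */
int pick_core_for_new_thread(void) {
    int a = rand() % NCORES;
    int b = rand() % NCORES;
    int chosen = (core_load[a] <= core_load[b]) ? a : b;
    core_load[chosen]++;             /* the new thread will run there */
    return chosen;
}
```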
Implementing Remote Procedure Calls
- Goals of Remote Procedure Call (RPC): efficiency, security, identical semantics as local procedure calls (e.g., timeouts), generality
- Optimizations:
- Transport: piggy-backing ACKs with data → synchronous (one thread, one outstanding RPC)
- Processes: pool of idle processes, process hints
- Connections: no setup/teardown, few states for inactive RPC
- Binding: at runtime (Grapevine)
- Exception & errors: shutdown? retry?
- RPC today: datacenters (Thrift, gRPC), browser
- Summary: make remote functionality appear local, optimize design for RPC
Networking
Eliminating Receive Livelock in an Interrupt-Driven Kernel
- Applications: routing (firewall, NAT, VPN, load balancer), monitoring (passive), services (file service, web server). High traffic volumes, not rate controlled
- Network stack – the receive path has many queues: absorb bursts, batching at different layers, organizational reasons
- Problem: receive livelock → overload, spend all time on interrupt processing.
- Polling: check for incoming packets. Polling cons: waste CPU cycles at low load, pros: efficient at high load. Interrupt cons: high overhead at high load, pros: lower latency (high priority), proportional work
- Solutions: hybrid approach (interrupts & polling), round robin, remove a layer of queueing (see the sketch after this list)
- Summary: hybrid approach (polling + interrupts), on overloads → drop tasks as early as possible
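A schematic sketch of the hybrid scheme above: the receive interrupt disables further RX interrupts and defers to a bounded polling loop, which re-enables interrupts once the queue drains (Linux's NAPI works along these lines). The `nic_*` helpers and queue are toy stand-ins, not a real driver API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the NIC and deferred-work machinery (hypothetical, for illustration). */
static int  rx_queue_len = 0;                 /* packets waiting in the receive queue */
static bool rx_irq_enabled = true;

static void nic_disable_rx_interrupts(void) { rx_irq_enabled = false; }
static void nic_enable_rx_interrupts(void)  { rx_irq_enabled = true; }
static bool nic_rx_queue_empty(void)        { return rx_queue_len == 0; }
static void nic_process_one_packet(void)    { rx_queue_len--; }

#define POLL_BUDGET 64                        /* bound work per poll invocation */

void poll_rx(void);
static void schedule_poll(void)             { poll_rx(); }  /* stand-in for a softirq/thread */

/* Interrupt handler: do almost nothing, switch to polling mode. */
void rx_interrupt_handler(void) {
    nic_disable_rx_interrupts();              /* avoid an interrupt storm under load */
    schedule_poll();
}

/* Polling loop: process up to POLL_BUDGET packets, then decide how to continue. */
void poll_rx(void) {
    int done = 0;
    while (!nic_rx_queue_empty() && done < POLL_BUDGET) {
        nic_process_one_packet();
        done++;
    }
    if (nic_rx_queue_empty())
        nic_enable_rx_interrupts();           /* low load: back to interrupt mode */
    else
        schedule_poll();                      /* still overloaded: keep polling */
}

int main(void) {
    rx_queue_len = 100;                       /* pretend a burst of 100 packets arrived */
    rx_interrupt_handler();
    printf("remaining=%d irq_enabled=%d\n", rx_queue_len, rx_irq_enabled);  /* 0 1 */
    return 0;
}
```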
Demikernel
- Portability: heterogeneous hardware (DPDK, RDMA), uniform interface for applications
- ns-scale overhead, simplifying development
- Library OS: optimized libOSes
- General API: POSIX variant (queues instead of files, async interfaces)
- Memory Management: zero copy, DMA capable memory (pinned, registered), use-after-free (reference counts). No concerns for regular OS b/c copy buffers
- CPU management: coroutines (similar to user-level threads & Arachne), 12 cycles to switch, non-preemptive
- Summary: General purpose API for Kernel-bypass I/O. Techniques: zero-copy I/O, memory management, scheduling
Snap
- Goals: easy deployment, efficient CPU scheduling, high-performance networking, easy development & optimization
- LibOS (Demikernel) vs Microkernel (Snap):
- LibOS: lower latency, better isolation, more tailored
- Microkernel: centralized management, easy deployment, can have higher overhead, handles scheduling of network processing (polling), explicit control
- Both: userspace, decoupled from the OS
- Threading: kernel-level threads
- IPC: multicore to optimize
- Snap Microkernel:
- control & data plane
- engines: dedicated cores, spreading engines (scales CPU with load), compacting engines (fewer cores, higher tail latency)
- Upgrades:
- start new Snap, for each engine: write state to memory, read state in new Snap, brownout, pause old snap, copy changes, start new engine, blackout, 250ms median
- kill old snap
- Summary: microkernel-based approach (easier upgrades), scheduling modes for engines
I/O and File Systems
Fast File System
- Unix Fast File System (FFS)
- Disk: cylinder group (tracks that can be read together relatively quickly), access time (seek in ms, rotation at 3600-15K RPM, transfer – read a sector), disk interface (FFS – specify platter, surface, etc.; block interface – modern)
- Unix File System: Disk blocks 4KB. Optimize for common case of small files
- FFS Objectives: Performance, Reliability
- Limitations: layout (lots of seeks between inodes and data), small block size (512-byte blocks → 1024 bytes)
- FFS Solutions:
- random block placement → cylinder group (related blocks together, same file has blocks in same group)
- inodes far from data blocks → cylinder groups
- low bandwidth utilization → increase block size to 4KB
- waste with large block sizes (internal fragmentation) → fragments
- small max file size → increase block size to 4KB (indirect blocks also bigger)
- recovery → replicated to multiple known locations
- device oblivious → parameterize the file system to the device (place blocks in rotationally optimal ways; not needed today b/c of DRAM caches on disks)
- Allocation: global (which cylinder groups), local (which blocks in a cylinder? leave 10% unused)
- Summary: aware of device characteristics
Log-Structured File System
- Trends: large memory (larger file buffer cache), faster CPUs, write-heavy disk workloads, disk (increased capacity, seek time still slow, increase transfer bandwidth), small files
- Problems with FFS: layout (seek back and forth still a problem), synchronous writes for metadata
- LFS Approach: think of disk as log (1MB segment)
- Challenge: finding data inodes (inode map), free space (garbage collection)
- Cleaning:
- copy in-use blocks to log (cost: wasted bandwidth)
- policy questions: cold (clean more aggressively) and hot segments (wait longer), cost-benefit policy
- Used in SSDs (flash wears out if written repeatedly): Flash Translation Layer (FTL)
- Summary: optimize for trends, writing data sequentially in a log
Soft Updates
- Metadata: inode, directories, bitmaps (need to always be consistent)
- Ordering constraints:
- no dangling pointers
- never reset old ptr until new ptr persists (e.g., rename)
- no reuse until old pointers are removed
- Soft updates:
- track dependencies, write in a safe order (asynchronous, less disk write, consistency)
- circular dependencies solution: fine-grained dependency tracking, undo/redo
- memory overhead (keep track of versions), cpu overhead
- Optimization: write many blocks at once, temporary files (don’t need to write)
- Evaluation: compare FFS "Conventional", FFS async writes "No Order" (~5% difference), Soft Updates, write-ahead logging. Pro: no recovery needed; con: added complexity
- Summary: metadata consistency & ordering constraints, soft updates (fine-grained dependency tracking, rollback)
SplitFS
- SplitFS: hybrid approach – data (user space), metadata (kernel)
- Appends:
- needs to update metadata
- staging file → relink
- Summary: hybrid design, staging file & relink operation
The Google File System
- GFS Setting: scale, failure is common (scale & cheap machines, automated recovery), write (sequential & append), performance > consistency, throughput >> latency, large files (100 MB - 100 GB), reads (sequential and random), concurrent writes (MapReduce)
- GFS: user-level file system, no POSIX interface
- Files: chunks (64 MB), stored as Linux files
- Namespaces: hierarchical (`a/b/foo.txt`)
- Directories: an illusion; map file path → files
- GFS Architecture: Master, Chunkserver, Clients
- Read: version number, no explicit caching
- Write: pick a primary & 3 chunk servers, let the client know where to write; the primary writes to the next chunk server, and so on (uses full network bandwidth); chunk servers reply, and the primary says commit when all are done
- Concurrent appends: record append vs `write()` (at a specific offset) → the primary orders record appends; at-least-once semantics; inconsistency exposed to apps (duplicate records → record IDs, padding & fragments → checksums); assumed GFS is only used internally by experts
- Summary: first large-scale distributed file system; challenges (scale, failures, consistency)
Push-Button Verification of File Systems via Crash Refinement
- Goals: low effort, deterministic, insightful (counter examples), high performance
- User’s Perspective: Specification ((dir inode #, name) → child inode #), Implementation, Consistency Invariant
- Automated Verification:
- crash-free executions vs crashes (non-determinism)
- crash refinement
- recovery function
- Tools for verification: symbolic execution → SMT encoding → SMT solver (e.g., Z3)
- Summary: correctness for implementation, notion of crash refinement