CS 350 - Operating Systems

Lesley Istead

Estimated reading time: 71 minutes


miscellaneous notes… Also, check my help page

Note: Lesley’s lecture recordings start from lecture 7.

Intro

OS:

  • manages resources
  • creates execution environments.

Three views:

  1. Application view: provide services
    • part cop: provides protection from program errors; part facilitator: provides an abstract interface to the underlying system
    • What services?
      • resources
      • interfaces
      • isolates running programs
  2. System view: solve problems
    • manages the hardware resources
    • allocates resources
    • controls the access to or sharing of resources among programs.
  3. Implementation view: how is it built?
    • concurrent: multiple programs or sequences of instructions running, or appearing to run, at the same time.
    • real-time: must respond to an event in a specific amount of time.

kernel: The operating system kernel is the part of the operating system that responds to system calls, interrupts and exceptions.

operating system: The operating system as a whole includes the kernel, and may include other related programs that provide services for applications. This may include things like:

  • utility programs (e.g. disk defragmenter, task manager)
  • command interpreters (e.g. cmd.exe in Windows, bash in Linux)
  • programming libraries (e.g. POSIX threads in Linux)

monolithic kernel: the entire operating system (which includes the device drivers, file system, and the application IPC) is working in kernel space. “everything and the kitchen sink” is a part of the kernel. This includes device drivers, file system, virtual memory, etc.

microkernel: only absolutely necessary components are a part of the kernel. All other elements are user programs.

real-time OS: an OS with stringent event response times, guarantees, and preemptive scheduling.

The execution environment provided by the OS includes a variety of abstract entities that can be manipulated by a running program.

Threads

A thread is a sequence of instructions.

a single program can have

  • different threads responsible for different roles
  • multiple threads with the same role

Reasons we use threads:

  • Efficient Use of Resources
  • parallelism
  • responsiveness
  • priority
  • modularization

A thread blocks when it ceases execution for a period of time, or until some condition has been met (e.g. a website responds or a user moves the mouse). While blocked, the thread does not use the CPU.

Key ideas:

  • A thread can create new threads using thread_fork
  • new threads start execution in a function specified as a param to thread_fork
  • the original thread (the one that called thread_fork) and the new thread proceed concurrently
  • All threads share access to program’s global vars and heap
  • each thread has its own stack (private to that thread)

In the OS, a thread is represented as a struct or object

An example of using thread_fork to create new threads that run the function vehicle_simulation:

int i, error;

for (i = 0; i < NumThreads; i++) {
    error = thread_fork("vehicle simulation thread", NULL,
        vehicle_simulation, NULL, i);
    if (error) {
        panic("traffic sim: thread_fork failed: %s",
            strerror(error));
    }
}

Interface

  • create a new thread:
    int thread_fork(
      const char *name, // name of new thread
      struct proc *proc, // thread’s process
      void (*func) // new thread’s function
          (void *, unsigned long),
      void *data1, // function’s first param
      unsigned long data2 // function’s second param
    );
    
  • terminate the calling thread: void thread_exit(void);
  • voluntarily yield execution: void thread_yield(void);
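
For reference, a hypothetical sketch of a thread entry function (names made up here, not from the slides): it must match the signature void (*func)(void *, unsigned long) that thread_fork expects.

static void
vehicle_simulation(void *data1, unsigned long threadnum)
{
    (void)data1;   /* first parameter unused in this sketch */
    kprintf("vehicle %lu entering simulation\n", threadnum);
    thread_exit(); /* terminate this thread explicitly */
}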

Review from whatever course

Review: Sequential Program Execution

  • fetch instr (from code segment) that the Program counter points to
  • decode and execute the instr and
  • increment the PC

Note: In CS241 we named the MIPS registers $0, $1, $2, etc. In CS350 we refer to them by their default function, e.g. v0, a0, t0.

t0-t7: temps (caller-save). if the calling subroutine has a value in one of these registers that it wants preserved, it should store the value on the stack before calling any subroutines and restore it afterwards.

s0-s7: saved (callee-save). If the called subroutine uses one of these registers, it must first store its current value on the stack and restore it before exiting.

This strategy tries to minimize situations where the callee saves values that the caller is not using.

Review: The Stack: grows downward (toward lower addresses) in the slides.

Conceptually, each thread executes sequentially using its private register contents and stack.

Concurrent Program Execution

Key Question: What is shared between threads?

  • They are both executing (perhaps at the same time on different cores) the same program.
  • Both threads share the same code, read-only data, global variables and heap.
  • Each thread has its own stack and PC.

Three options:

  • Hardware support (P processors, C cores, M multithreading per core)
  • Timesharing: multiple threads take turns on the same hardware, rapidly switching.
  • Hardware support + Timesharing

When timesharing, the switch from one thread to another is called a context switch.

Context Switch

  • decide which thread will run next (scheduling)
  • save register contents of current thread
  • load register contents of next thread

Thread context (register values) must be saved/restored carefully, since thread execution continuously changes the context.

The threads share the CPU, giving the user the illusion of multiple programs running at the same time.

On MIPS

switchframe_switch:
    /* a0: address of switchframe pointer of old thread. */
    /* a1: address of switchframe pointer of new thread. */

    /* Allocate stack space for saving 10 registers. 10*4 = 40 */
    addi sp, sp, -40
    sw ra, 36(sp)   /* Save the registers */
    sw gp, 32(sp)
    sw s8, 28(sp)   /* a.k.a. frame pointer */
    sw s6, 24(sp)
    sw s5, 20(sp)
    sw s4, 16(sp)
    sw s3, 12(sp)
    sw s2, 8(sp)
    sw s1, 4(sp)
    sw s0, 0(sp)

    /* Store the old stack pointer in the old thread */
    sw sp, 0(a0)

    /* Get the new stack pointer from the new thread */
    lw sp, 0(a1)
    nop             /* delay slot for load */

    /* Now, restore the registers */
    lw s0, 0(sp)
    lw s1, 4(sp)
    lw s2, 8(sp)
    lw s3, 12(sp)
    lw s4, 16(sp)
    lw s5, 20(sp)
    lw s6, 24(sp)
    lw s8, 28(sp)   /* a.k.a. frame pointer */
    lw gp, 32(sp)
    lw ra, 36(sp)
    nop             /* delay slot for load */

    /* and return. */
    j ra
    addi sp, sp, 40 /* in delay slot */
    .end switchframe_switch

The C function thread_switch calls the assembly language subroutine switchframe_switch. (Note that s7 is not saved above: OS/161 reserves s7 to hold the current thread pointer, so it is not treated as a general callee-save register here.)

thread_switch caller: save/restore the caller-save regs, including return address (ra)

switchframe_switch callee: save/restore callee-save regs. In OS/161 the frame pointer is callee saved.

switchframe_switch saves the callee-save registers to the old thread’s stack and restores them from the new thread’s stack.

MIPS R3000 is pipelined; delay slots are used to protect against load-use hazards and control hazards.

What Causes Context Switches?

  1. thread_yield: voluntarily allows other threads to run
  2. thread_exit: the calling thread terminates
  3. blocks, via a call to wchan_sleep: it is waiting for some resource (such as network access) or for some event to happen
  4. is preempted: it involuntarily stops running (because the thread scheduler stopped it)

Thread States

Running: currently executing. Ready: ready to execute. Blocked: waiting for something, so not ready to execute.

thread_yield (a C function) yields the CPU. thread_yield calls thread_switch (a C function, the high-level context switch). thread_switch chooses a new thread, then calls switchframe_switch (an assembly subroutine, the low-level context switch).

The scheduling quantum imposes a limit on processor time: an upper bound on how long a thread can run before it must yield the CPU.

An interrupt is an event that occurs during the execution of a program.

  • caused by system devices (hardware): timer, disk controller, network interface card.
  • when interrupt occurs, the hardware automatically transfers control to a fixed location in memory (specified by the designer of the CPU).
  • At that location, the thread library must place a procedure called an interrupt handler.
  • the interrupt handler normally:
    • creates a trap frame to record the thread context at the time of the interrupt,
    • determines which device caused the interrupt and performs device-specific processing,
    • restores the saved thread context from the trap frame and resumes execution of the thread.

Thread context is all the information (i.e. register values) needed to resume executing a thread after it has been suspended; on a context switch it is stored in a switchframe.

When a thread is interrupted all the register values are stored in a trap frame.

Preemptive Scheduling

  • Threads may block or yield before their quantum has expired.
  • If a thread has run too long, the timer interrupt handler preempts the thread by calling thread_yield.
  • The preempted thread changes state from running to ready, and it is placed on the ready queue.
  • OS161 threads use preemptive round-robin scheduling:
    • scheduler maintains a queue of threads, often called the ready queue.
    • On a context switch, the running thread is moved to the end of the ready queue, and the first thread on the queue is allowed to run.
    • Newly created threads are placed at the end of the ready queue.
  • Threads can be migrated to other processors or interrupted by devices, so the order in which they run is nondeterministic.

Then see the two examples in the slides (one thread and two threads): involuntary/voluntary context switches.

Synchronization

Concurrent threads interact with each other in a variety of ways:

  • All threads in a concurrent program share access to the program’s global variables and the heap.
  • Threads share access, through the operating system, to system devices, such as the hard drive, the display, etc.

A common synchronization problem is to enforce mutual exclusion.

The part of a concurrent program in which a shared object is accessed (e.g. the intersection) is called a critical section.

This coordination of access to a shared resource is called synchronization.

What happens if several threads try to access the same global variable or heap object at the same time?

We require that concurrent programs will run correctly regardless of which order the threads are executed.

A race condition is when the program result depends on the order of execution.

To find the critical sections…

  • Inspect each variable and ask is it possible for multiple threads to read and write it concurrently?
  • Constants and memory that all threads only read do not cause race conditions.

The course text defines a critical section as

A critical section is a piece of code that accesses a shared variable (or more generally, a shared resource) and must not be concurrently executed by more than one thread.

Acquire(bool *lock) {
    while (*lock == true) {}; // spin until lock is free
    *lock = true;             // grab the lock
}

Release(bool *lock) {
    *lock = false;            // give up the lock
}

This approach fails because, between the time a thread breaks out of the while loop (i.e. it observes *lock == false) and the time it sets *lock to true, it could be preempted, and another thread could come along and set *lock to true first.

We want a function such that, between the time we test *lock and the time we set it, no other thread can change its value.

We will call this approach test-and-set and we will look at three approaches to implementing it.

  1. x86-64’s xchg. Atomic
Acquire(bool *lock) {
    while (xchg(lock,true) == true) {}; // test and set
}

Release(bool *lock) {
    *lock = false;                      // give up lock
}

This construct is known as a spinlock, since a thread busy-waits (loops or spins) in Acquire until the lock is free.

  2. ARM’s LDREX and STREX
  3. MIPS’s ll and sc

More on MIPS Synchronization Instructions

  • ll (load linked): load the value at address addr.
  • sc (store conditional): store new value at addr if the value at addr has not changed since the instruction ll was executed.
  • sc returns SUCCESS if the value stored at the address has not changed since ll was executed.
  • sc does not check what that value at the address is. It only checks if it has changed.
MIPSTestAndSet(addr, new_val) {
    old_val = ll addr           // test
    status = sc addr, new_val   // set
    if ( status == SUCCESS ) return old_val
    else return true            // failed; behave as if the lock is held, caller retries
}

Acquire(bool *lock) {           // spin until hold lock
    while( MIPSTestAndSet(lock, true) == true ) {};
}
   lock value at ll   lock value before sc   lock value after sc   status    Lock State
1. false              false                  true                  succeed   lock acquired
2. true               true                   true                  succeed   keep spinning, no lock
3. false              true                   true                  fail      keep spinning, no lock
4. true               false                  false                 fail      keep spinning, no lock

Spinlocks in OS/161

A spinlock is a lock that spins, i.e. it repeatedly tests the lock’s availability in a loop until the lock is obtained.

When a thread uses a spinlock it busy-waits, i.e. it uses the CPU while it waits for the lock.

struct spinlock {
    volatile spinlock_data_t lk_lock;
    struct cpu *lk_holder;
};

void spinlock_init(struct spinlock *lk);
void spinlock_acquire(struct spinlock *lk);
void spinlock_release(struct spinlock *lk);

spinlock_acquire calls spinlock_data_testandset in a loop until the lock is acquired.

OS/161 uses ll and sc to implement test-and-set.

OS/161 Locks

In addition to spinlocks, OS/161 also has locks.

Like spinlocks, locks are used to enforce mutual exclusion, i.e. they are a type of mutex.

struct lock *mylock = lock_create("LockName");

lock_acquire(mylock);
    // critical section
lock_release(mylock);

Spinlocks spin whereas locks block:

  • spinlock_acquire spins until the lock can be acquired
  • lock_acquire blocks until the lock can be acquired

Locks can be used to protect larger critical sections without being a burden on the CPU.

Spinlocks and Locks: Similarities and Differences

For both you would

  • acquire it, access the critical section, then
  • release it when you have finished accessing the critical section.

Both have owners and can only be released by their owner.

  • A spinlock is owned by a CPU.
  • A lock is owned by a thread.

In OS/161, and most modern OSs, interrupts are disabled on the CPU that holds the spinlock. Preemption is therefore disabled on that CPU (hence the spinlock is owned by that CPU) but not on other CPUs, which minimizes spinning. So don’t use spinlocks to protect large critical sections.

Spin Locks

  • void spinlock_init(struct spinlock *lk)
  • void spinlock_acquire(struct spinlock *lk)
  • void spinlock_release(struct spinlock *lk)
  • bool spinlock_do_i_hold(struct spinlock *lk)
  • void spinlock_cleanup(struct spinlock *lk)

Locks

  • struct lock *lock_create(const char *name)
  • void lock_acquire(struct lock *lk)
  • void lock_release(struct lock *lk)
  • bool lock_do_i_hold(struct lock *lk)
  • void lock_destroy(struct lock *lk)

Wait Channels

When a thread blocks, it stops running, i.e. stops using the processor.

void wchan_sleep(struct wchan *wc);     // blocks calling thread on wait channel wc

void wchan_wakeall(struct wchan *wc);   // unblock all threads sleeping on wait channel wc

void wchan_wakeone(struct wchan *wc);   // unblock one thread sleeping on wait channel wc

void wchan_lock(struct wchan *wc);      // prevent operations on wait channel wc

Semaphore

is a synchronization primitive that can be used to enforce mutual exclusion requirements. It can also be used to solve other kinds of synchronization problems.

It’s an object with an integer value that supports two operations:

  • P: if the semaphore value is greater than 0, decrement it. Otherwise, wait until the value is greater than 0 and then decrement it.
  • V: increment the value of the semaphore.

By defn, P and V operations of a semaphore are atomic.

Difference between a lock and a semaphore

  • V does not have to follow P
  • A semaphore can start with a count of 0
  • Calling V increments the semaphore by 1.
  • Semaphores do not have owners.

Producer/Consumer Synchronization with Bounded Buffer

  1. producers: threads that (create and) add items to a buffer
  2. consumers: threads that remove (and process) items from the buffer
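
A minimal sketch of the classic semaphore solution (the buffer and its add/remove steps are left abstract here; with multiple producers or consumers the buffer itself would also need a lock). Using the OS/161 interface, the semaphores would be created with sem_create:

struct semaphore *items; /* counts items in the buffer; starts at 0 */
struct semaphore *slots; /* counts free slots; starts at the buffer capacity */

void producer(void) {
    P(slots);           /* wait for a free slot */
    /* add an item to the buffer */
    V(items);           /* announce one more item */
}

void consumer(void) {
    P(items);           /* wait for an item */
    /* remove an item from the buffer */
    V(slots);           /* announce one more free slot */
}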

Semaphore Implementation

struct semaphore {
    char *sem_name;             // for debug purposes
    struct wchan *sem_wchan;    // queue where threads wait
    struct spinlock sem_lock;   // to synch access to this struct
    volatile int sem_count;     // value of the semaphore
};

void
P(struct semaphore *sem)
{
    KASSERT(sem != NULL);
    KASSERT(curthread->t_in_interrupt == false);

    spinlock_acquire(&sem->sem_lock);
    while (sem->sem_count == 0) {       // prepare to sleep
        wchan_lock(sem->sem_wchan);
        spinlock_release(&sem->sem_lock);
        wchan_sleep(sem->sem_wchan);    // sleep
        spinlock_acquire(&sem->sem_lock);
    }
    KASSERT(sem->sem_count > 0);
    sem->sem_count--;                   // claim one item
    spinlock_release(&sem->sem_lock);
}

void
V(struct semaphore *sem)
{
    KASSERT(sem != NULL);

    spinlock_acquire(&sem->sem_lock);

    sem->sem_count++;
    KASSERT(sem->sem_count > 0);
    wchan_wakeone(sem->sem_wchan);

    spinlock_release(&sem->sem_lock);
}

Incorrect Semaphore Implementation: Suppose spinlock_release preceded wchan_lock. Then, between releasing the spinlock and going to sleep, another thread could run V, and its wchan_wakeone could fire before thread 1 has actually gone to sleep. That wakeup is lost, and thread 1 ends up blocked on a semaphore that has resources.

Instead, wchan_lock precedes spinlock_release: no wakeup can be delivered until the thread is actually asleep, so the wakeup cannot be lost.

Condition Variables

OS/161 supports another common synchronization primitive: condition variables.

Each CV is intended to work together with a lock: the CV should only be used from within the critical section that is protected by that lock.

  1. wait: This causes the calling thread to block, and it releases the lock associated with the condition variable. Once the thread is unblocked it reacquires the lock.
  2. signal: If threads are blocked on the signaled condition variable, then one of those threads is unblocked.
  3. broadcast: Like signal, but unblocks all threads that are blocked on the condition variable.
// solution 1: repeatedly release and reacquire the lock (busy polling)
void SafeToWalk() {
    lock_acquire(geeseMutex);
    while (numberOfGeese > 0) {
        lock_release(geeseMutex); // allow access
        // a context switch may occur here, and another thread
        //   might then acquire the lock and modify numberOfGeese
        lock_acquire(geeseMutex); // restrict access
    }
    GoToClass();
    lock_release(geeseMutex);
}

// solution 2: sleep on a condition variable instead of polling
int volatile numberOfGeese = 100;
lock geeseMutex;
cv zeroGeese;
void SafeToWalk() {
    lock_acquire(geeseMutex);
    while (numberOfGeese > 0) {
        cv_wait(zeroGeese, geeseMutex);
    }
    GoToClass();
    lock_release(geeseMutex);
}
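
For completeness, a hypothetical sketch of the signalling side (same style as the solutions above): the thread that changes the condition does so while holding the same lock, then signals the CV.

void ShooGoose() {
    lock_acquire(geeseMutex);
    numberOfGeese--;
    if (numberOfGeese == 0) {
        cv_broadcast(zeroGeese, geeseMutex); // wake all waiting walkers
    }
    lock_release(geeseMutex);
}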

Lesley’s recordings start here.

Where we left off last class: semaphores and CVs, different primitives for concurrency problems.

  • Semaphore:
    • any number of resources. P acquires, V releases. You can use one as a lock, but keep in mind it’s not a lock: there is no built-in protection, no tracking of ownership. If someone else tries to release a binary semaphore I hold, they can, and we have a race condition.
    • counting: keep counting # of resources
    • barrier: force one thread to wait for others to complete before proceeding. Not a separate implementation.
    • keep in mind do not touch the counter directly. It’s a shared resource.
  • CV: you are inside a critical section and need some condition to be true before you can proceed, but the lock you own is the lock protecting the shared variable the condition depends on. A CV lets us safely and atomically release the lock and fall asleep, so another thread can make the condition true and wake us back up.
    Most of the time, when we are woken back up, we need to recheck the condition.

Other sources of Race conditions

The previous race conditions were caused by the code you wrote; the compiler and CPU can also introduce them.

Memory models describe how threads’ accesses to memory in shared regions behave. A memory model tells the compiler and CPU which optimizations can be performed.

compiler: it performs optimizations that reduce the number of assembly instructions, e.g. loads and stores.

int sharedTotal = 0;
int FuncA() {
    //... code that uses sharedTotal ...
}
int FuncB() {
    //... code that uses sharedTotal ...
}

Instead of going to RAM for every access, it is much more efficient to keep sharedTotal in a register: register allocation. Here’s the problem: the compiler doesn’t know whether FuncA and FuncB run at the same time. If one thread keeps the value in $3 and the other in $8, updates are lost: a big race condition. Other optimizations cause similar problems.

volatile tells the compiler to always load/store the variable from RAM and never reorder its accesses. It disables these compiler optimizations, but it does not solve the race conditions in your code.
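
A sketch of what this looks like for the example above; volatile stops the compiler from caching sharedTotal in a register, but the update below still races without a lock:

volatile int sharedTotal = 0;  // every access now goes to RAM, in order

int FuncA() {
    // two threads can still interleave the load and the store here,
    // so a lock is still required around the update
    sharedTotal = sharedTotal + 1;
    return sharedTotal;
}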

CPU: conceptually it fetches the instruction at the program counter, decodes it, executes it, and moves on. But it’s more complex: pipelining, branch prediction, reordering instructions… The CPU also has a memory model, so you can notify the CPU that multi-threaded code is running here. Barrier (fence) instructions act as gates: at the entry gate the CPU stops rearranging memory operations, and after the exit gate the optimizations are turned back on. You don’t need to worry about this in OS/161.

Deadlocks

lock lockA, lockB;

int FuncA() {
    lock_acquire(lockA);
    lock_acquire(lockB);
    ...
    lock_release(lockA);
    lock_release(lockB);
}

int FuncB() {
    lock_acquire(lockB);
    lock_acquire(lockA);
    ...
    lock_release(lockB);
    lock_release(lockA);
}
  • Thread 1 executes lock_acquire(lockA)
  • Thread 2 executes lock_acquire(lockB)
  • Thread 1 executes lock_acquire(lockB)
  • Thread 2 executes lock_acquire(lockA)

Deadlock only shows up under certain orders of thread execution, which makes it very difficult to detect: you cannot easily tell a problematic wait from a wait that is on purpose.

Generally, it happens when multiple threads, each acquire multiple resources.

  1. Don’t try to acquire a lock you already own.
  2. Two strategies: no hold and wait (below), and resource ordering.

No hold and wait: while you own resources, do not wait for more (no blocking, no spinning); equivalently, be able to acquire all your resources at once. To support this we need a special implementation of acquire: if you can’t acquire a lock, you don’t fall asleep.

// returns true if you get the lock;
// returns false if you cannot, but does not go to sleep
bool try_acq(lk) {
    acq(lk->spin);
    if (lk->held) {
        release(lk->spin);
        return false;
    }
    lk->held = true;
    lk->owner = me;
    release(lk->spin);
    return true;
}

Solution to previous deadlock:

acq(A); // I own nothing and safely acquire

while (!try_acq(B)) {
    release(A);
    // if you want to improve the performance, put yield here
    acq(A);
}

But this approach is nasty if we want to acquire many resources, and it spins.

Resource ordering: assign every resource a unique number and acquire resources only in strictly increasing order. This breaks down if different programmers follow different orderings.

Process

is an environment in which an application program runs.

includes virtualized resources that its program can use, e.g. one (or more) threads

each program’s process isolates it from other programs in other processes

fork

creates a brand new process (the child) that is a clone of the original (the parent)

  • new process structure
  • new address space
  • new thread array
  • new PID
  • clones the content of the address space

_exit

terminates the calling process. The process does not necessarily disappear: it stops executing and its address space gets cleaned up, but if its parent is still alive, it leaves behind a message (its exit status) so the parent can figure out why the child died.

waitpid

offers synchronization between processes: it causes the caller to wait for the process with the given PID to terminate. It is restricted: a parent can only wait for its own children to die.

Children cannot wait for their parents; if the parent dies first, the children become orphans.

execv

Now, the last of the five system calls: execv. It does not create a new process.

It changes the program that an existing process is running. The process structure stays the same, but the address space is changed.

E.g. a hello world program can call execv to become Matlab.

Create a new address space, and load the program into the new address space.

Parent-child relationship: if you go home tonight and dye your hair green, your parents may not like it, but it cannot change the fact that genetically they are your parents. The same is true for processes: you can change the program that you are running, but you cannot change the parent-child relationship.

Example from slides:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>

int main()
{
    int rc = 0;
    char *args[4];

    args[0] = (char *) "/testbin/argtest";
    args[1] = (char *) "first";
    args[2] = (char *) "second";
    args[3] = 0; // null terminator

    rc = execv("/testbin/argtest", args); // takes two parameters
    printf("If you see this execv failed\n");
    printf("rc = %d errno = %d\n", rc, errno);
    exit(0);
}

takes two parameters. Many times you need command line arguments. Array of pointers.

If execv succeeds, it does not return: it successfully created a new address space and loaded the new program into it, so the calling program no longer exists. The only time execv returns is when it fails. Why would it fail? Say you have 30MB of RAM and then exec Matlab: it’s possible that there is no memory left, or that you called the program incorrectly. When you write your execv, you need to test for these error cases and return the appropriate error codes (though the CS350 tests don’t cover all of these).

Notice that execv can’t tell how many arguments there are automatically, so we use a null-terminated array and you are responsible for counting.

Another example. This time with fork.

int main()
{
    int rc, status, child_pid, p, child_exit;
    char *args[4];
    /* set args here */
    rc = fork(); /* returns 0 to child, pid to parent */
    if (rc == 0) {
        status = execv("/testbin/argtest", args);
        printf("If you see this execv failed\n");
        printf("status = %d errno = %d\n", status, errno);
        exit(0);
    } else {
        child_pid = rc;
        parent_code();
        p = waitpid(child_pid, &child_exit, 0);
    }
}

Now I have two processes. Even though they are separate, their contents are identical, with the exception of the value of rc: 0 for the child, the child’s PID for the parent.

If it is the child, call execv: the child is now running a different program. The parent can still wait for the child to terminate; the parent-child relationship is not affected by execv and stays forever.

Go home and write thread fork bomb.

System Calls

  • Process management calls, e.g. fork, are called by user programs. They are also system calls.
  • System calls are the interface between user processes and the kernel.
  • The kernel code runs at the highest privilege level, where any CPU instructions can be executed.
  • Program in user space cannot execute code or instructions belonging to a higher-level of privilege.

Kernel Privilege

The CPU implements different levels (or rings) of execution privilege as a security and isolation mechanism. Kernel code at highest level.

Key Question: Since application programs cannot directly call the kernel, how does a program make a system call such as fork?

  1. Interrupts
    • Interrupts are raised by devices (hardware)
    • An interrupt causes the CPU to transfer control to a fixed location in memory (specified by the designers of the CPU) where an interrupt handler must be located.
    • Interrupt handlers are part of the kernel.
  2. Exceptions
    • conditions that occur during the execution of a program instruction.
    • The CPU handles exceptions like it handles interrupts:
      • Control is transferred to a fixed location, where an exception handler is located.
      • The processor switches to privileged execution mode.
    • In OS/161 the same routine is used to handle exceptions and interrupts.

When the CPU receives an interrupt or exception, your program stops executing and the exception/interrupt handler, part of the kernel code, is called. The CPU switches from unprivileged mode to privileged mode, so now kernel code can execute. It produces the trap frame: storing every single register value (including the special ones) so we can return to the exact point of program execution after handling.

Extra: Spectre and Meltdown Papers.

Key Question: There is only one syscall exception, EX_SYS. So how does the OS distinguish between a fork and getpid system call?

system call codes

  • The kernel expects the code to be placed in a specified location before executing syscall (for OS/161 on MIPS, reg v0)
  • the codes and the code location are part of the kernel ABI (Application Binary Interface)

The ABI is not secret, because user land needs to know that information, and it tells absolutely nothing about what’s happening inside the kernel: the kernel stays a black box.

Example:

  • li v0, 0 where 0 is the syscall code for fork.
  • Syscall parameters: load them into registers a0 to a3. If there are more, put them on the stack or heap and pass the address in a register.
  • Use syscall to raise exception.

Return two values:

  • reg a3: success/fail
  • reg v0: return value/error code
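
Putting the convention together, a hypothetical user-side stub for fork might look like the following (a sketch of the rules above, not the actual OS/161 library code; the label name is made up):

li   v0, 0              /* syscall code for fork goes in v0 */
syscall                 /* raise the system call exception */
beq  a3, zero, success  /* a3 == 0 means success */
nop                     /* branch delay slot */
/* failure: v0 holds the error code */
success:
/* success: v0 holds fork's return value */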

User and Kernel Stacks

Every OS/161 process thread has two stacks, although it only uses one at a time.

The only thing that goes onto the user stack is the user application’s data. But before we save the trap frame, we must change the stack pointer from the user stack to the kernel stack.

  1. The User (Application) Stack is used while the thread is executing application code.
    • Saw this in CS 241.
    • The kernel creates this stack when it sets up the virtual address space for the process (or thread if the OS supports multi-threaded code).
  2. The Kernel Stack is used while the thread is executing kernel code, i.e. after an exception or interrupt.
    • This stack is a kernel structure (i.e. it is located in the kernel’s address space).
    • It also holds trap frames and switch frames (because it is the kernel that creates trap frames and switch frames)

Exception Handling

  • the first code to run is the assembly routine common_exception, which
    • saves the stack pointer,
    • switches SP to point to the thread’s kernel stack,
    • carefully saves the application’s state and the address of the interrupted instruction,
    • calls mips_trap, passing a pointer to the trap frame as an argument.
  • After mips_trap is finished, the common_exception handler will
    • restore application’s state
    • jump back to application instruction that was interrupted and switch back to unprivileged execution mode.
  • See kern/arch/mips/locore/exception-mips1.S (assembly file)

mips_trap

determines what type of exception it is by looking at the exception code.

(there is a separate handler in the kernel for each type of exception.) Based on that exception code, call appropriate handler.

  • interrupt? call mainbus_interrupt
  • address translation exception? vm_fault
  • system call? call syscall (kernel function), passing it the trap frame pointer
  • kern/arch/mips/syscall/syscall.c

See kern/arch/mips/locore/trap.c

One important thing: when an exception is raised, one of the first things common_exception does is disable interrupts on the CPU. So when we enter mips_trap, interrupts are off.

  • If it was actually an interrupt, they stay off until the hardware has been handled.
  • If not, we don’t have to call mainbus_interrupt, and it is safe to be interrupted again: since it is not a hardware interrupt, interrupts are turned back on.

Increment the program counter: if we don’t, the syscall instruction will re-execute and the system call exception will be raised over and over again. The exception halts execution before the PC is advanced, so we must advance it manually, by 4 bytes.

Kernel is just a program, a sequence of instructions.

Two type of calls:

  • procedure calls: used by apps to execute other application code,
  • sys calls: exec kernel code.

Multiprocessing

or multitasking.

  • The OS ensures that processes are isolated from one another.
  • Interprocess communication should be possible.
  • all processes must share the available hardware resources. \(\implies\) This sharing is coordinated by the operating system.

Keep in mind: it’s not processes that run; threads execute. A process has to have a thread in order to execute, so we are really talking about multithreading. We have a pool of threads which belong to different processes. When a quantum expires (timer interrupt), we context switch.

Stack Diagrams

In the next few minutes, I’m gonna show you stack diagrams because you’d better bet they are going to show up on your exams.

mips_trap’s job is to figure out what kind of exception was actually raised, then call the handler specific to that exception. After mips_trap checks whether it is an interrupt and realizes that it is not, it turns exceptions and interrupts back on. It means that while we are executing our system call, we CAN be interrupted. KEEP THAT IN MIND.

Look at v0: that tells you which syscall we actually want to run (a great big old switch statement). We want sys_fork. Interrupts are now on, which means we can be preempted while executing this fork.

So now what happens to our stack if we get a timer interrupt? The first thing you see is a trap frame (produced by common_exception) pushed onto the kernel stack. Since we are already in privileged mode and already on the kernel stack, common_exception detects this and does not switch stacks; it just saves the trap frame right there. Then it calls mips_trap to figure out what just happened: it’s mainbus_interrupt, a hardware interrupt. While handling an interrupt from hardware, we don’t want to handle other hardware interrupts, because this behaviour is outside the control of the running thread and we don’t want to disrupt it for too long, so interrupts generally stay off.

In mainbus_interrupt, we discover it’s the clock, so we call the interrupt handler for the clock: the scheduling quantum is exceeded, so I’d better preempt you. So we call thread_yield -> thread_switch -> switchframe_switch, and a context switch occurs. Later, popping back off the stack, we return through mips_trap and back to common_exception, where we restore the trap frame (by the way, this is where interrupts are turned back on). Then we switch back to the user stack, reset the CPU to unprivileged mode, and the process’s thread resumes running its regular user code. This is similar to the pure multithreading case, except we have two stacks.

Now suppose there is no interrupt. We return to syscall. At the very end of syscall, we set the return value and the success/failure value for that system call: zero for success, one for failure. The last thing the syscall dispatcher does is increment the program counter, because when the exception was raised we didn’t actually increment it.

if (err) { /* err */
    tf->tf_v0 = err;
    tf->tf_a3 = 1;
} else { /* no err */
    tf->tf_v0 = retval;
    tf->tf_a3 = 0;
}

/* advance PC */
tf->tf_epc += 4;

mips_trap returns to common_exception. The trapframe data is restored. Switch from kernel to user stack. Switch to unprivileged mode (rfe: magical instruction). User code continues execution.

  • These diagrams are always on either midterm or final.
  • It is not possible to have two trap frames back-to-back in OS/161. The reason is that when an exception is raised, the CPU turns interrupts off, and they do not come back on until at least halfway through mips_trap, which means each trap frame must be separated by at least mips_trap.

Inter-Process Communication (IPC)

Processes can talk to each other. IPC is a family of methods used to send data between processes.

File, Socket, Pipe, Shared Memory, Message Passing/Queue.

Virtual Memory

Physical addresses are provided directly by the hardware; they address the installed RAM. There is one physical address space per computer.

Virtual addresses (or logical addresses) are addresses provided by the OS to the processes. One VA space per process.

The conversion of a virtual address to a physical address is called address translation.

The OS provides each process with the illusion that it has a large amount of contiguous memory available exclusively to itself, i.e. an address space.

The goals are

  • to efficiently translate between virtual and physical addresses,
  • to provide transparency so the programmer does not need to worry about the difference,
  • to protect one process’s address space from other (perhaps buggy) processes.

We use \(P\) for physical addresses: if it takes \(P\) bits to specify the physical address of each byte, then the maximum amount of addressable physical memory is \(2^P\) bytes.

Sys/161 uses \(P=32\): 32-bit physical addresses. The slides use \(P=18\).

Address Space                     Size of Address Space (s)   Bits Needed (b)
0 1                               2                           1
00 01 10 11                       4                           2
000 001 010 011 100 101 110 111   8                           3

Use the formula \(b=\log_2(s)\) and \(s=2^b\).

Address Translation

Similarly, we have \(V\) bits for virtual addresses.

For MIPS \(V=32\). In slide \(V=16\).

Running applications only see virtual addresses. Each process is isolated in its virtual memory and cannot access another process’s virtual memory.

Virtual memory is used to

  • isolate processes from each other and from the kernel,
  • potentially support virtual memory that is larger than physical memory.

Each virtual memory address is mapped to a different part of physical memory.

Address translation is performed in hardware, in the Memory Management Unit (MMU), using information provided by the kernel.

We will consider a series of five increasingly more sophisicated methods of address translation.

1. Dynamic Relocation

The virtual address of each process is translated using two values:

  1. offset or relocation (R): the address in physical memory where the process’s memory begins
  2. limit (L): the amount of memory used by the process.

MMU does the following calculation

if (v < limit) then
    p ← v + offset
else
    raise memory exception
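
A quick made-up example: suppose offset R = 0x4000 and limit L = 0x2000. Then v = 0x1234 is valid (0x1234 < 0x2000) and translates to p = 0x1234 + 0x4000 = 0x5234, while v = 0x2100 is out of range and raises a memory exception.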

However, this suffers from external fragmentation: as processes come and go, free memory is broken into pieces that may be too small to be useful.

Using just two registers per process, dynamic relocation: allows multiple processes to share RAM and isolates each process’s address space from all other processes.

2. Segmentation

Extra Reading

Instead of mapping the entire virtual memory to physical memory, we map each segment of the address space separately. The kernel maintains an offset and limit value for each segment. With segmentation, a virtual address can be thought of as having two parts: (segment ID, offset within segment)

With \(K\) bits for segment ID, we can have up to

  • \(2^K\) segments
  • \(2^{V-K}\) bytes per segment

So if there are 4 segments, then \(\log_2 4=2\) bits are required to represent the segment number, and the maximum size of each segment is \(2^{V-2}\) bytes.

Fragmentation of physical memory is still possible.

Many approaches for translation.

Approach 1: MMU has a relocation register and a limit register for each segment

  • for \(i\)th segment, \(R_i\) be the relocation offset and \(L_i\) be the limit.
  • To translate v to p:
s ← SegmentNumber(v)
a ← AddressWithinSegment(v)

if (a ≥ limit[s]) then
    raise memory exception
else
    p ← a + offset[s]
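
A worked example with the slides’ \(V=16\) and \(K=2\): virtual address 0x5234 has segment number 0b01 = 1 (the top 2 bits) and address-within-segment 0x1234 (the remaining 14 bits), so it translates to p = 0x1234 + offset[1], provided 0x1234 < limit[1].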

Approach 2: Maintain a segment table

If the segment number in v is greater than the number of segments
then
    raise an exception
else
    use the segment number to lookup the limit and relocation values from the segment table.

No internal fragmentation.

Limitation: space for the heap and stack cannot grow beyond a certain point, even though there is more memory available, and the space between segments gets wasted because it is too small to fit anything (external fragmentation).

3. Paging: Physical Memory

Key Idea: Divide Physical memory into fixed-size chunks called frames (a.k.a. physical pages).

Take physical memory and logically divide it into fixed-size chunks called frames (physical pages). Frames are commonly 4KB. The size of physical memory divided by the frame size gives the number of frames.

Let’s do the same thing with virtual memory: take virtual memory and divide it into equal-sized pages, programmatically, not physically.

Page size (virtual memory) must equal frame size (physical memory).

Now take the code segment: 7 pages in the slides. Instead of requiring contiguous memory, just ask RAM for any 7 frames, and map every single page to whichever frame is available.

We don’t have to track the limit, but we do track which frame each page maps to. Any page can map to any frame ⇒ we need a method to track this mapping, i.e. page tables.

The page table has an entry for every single page, regardless of whether we are using it or not.

In our page table (for now), we will have

  • frame number: which frame is this page map to
  • valid bit: 1 if we are actually using this page. 0 if not.

One register holds the addr of page table, stored in RAM.

This is the virtual address: the higher-order bits, like in segmentation, indicate which page; the remaining bits are the page offset.

Extract the page number and use it to look up an entry in the page table. Check the valid bit: if 0, throw an exception. If 1, then combine: physical address = (frame number × frame size) + offset.

The page offset is only related to the size of page. Number of Bits for Offset = log(Page Size).

Number of PTEs = Maximum Virtual Memory Size / Page Size

Number of Bits for Page Number = log(Number of PTEs)

Number of Bits for Frame Number = log(Number of Frames)
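
A worked example (assuming 4KB pages and 32-bit virtual addresses): bits for offset = log(4096) = 12; number of PTEs = 2^32 / 2^12 = 2^20; bits for page number = log(2^20) = 20. With 1GB of RAM there are 2^30 / 2^12 = 2^18 frames, so 18 bits are needed for the frame number.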

We want to have some entries read-only, like code segment.

  • protection bit
  • reference (use) bit: has this PTE been used to translate any addr?
  • dirty bit: has any addr translation resulted in memory being written to particular page
  • present bit: is this page in RAM or not?

How big are page tables? We know the number of PTEs and the size of each PTE; the page table size is the product of these two. E.g. with \(2^{20}\) PTEs of, say, 4 bytes each, each page table is 4MB.

Page tables can get very, very large.

Page tables are kernel data structures. They live in the kernel’s memory.

As the number of bits in our virtual addresses increases, this just doesn’t scale at all.

Shrink the page table with extra organization: we turn it into a tree. If there are no valid pages in a sub-page-table, we do not make a table for it, so there will be significantly fewer tables.

4. Two-Level Paging

Instead of four bits for the page number, we use 2 bits for the first-level page number and 2 bits for the second-level page number.

Instead of a two-part address (page number and page offset), we now have a multi-part address. We still have the same page offset, which is only related to the page size, plus a bunch of page numbers.

And our address translation is the same:

Physical Address = Frame Number * Page Size + offset

the look up is different.

5. Multi-Level Paging

We want each individual page table (the directory and each subtable) to fit exactly in one page, since this makes on-demand paging very convenient: if we look for a table and it’s not in memory, we only need to do one read from disk.

So we can figure out how many levels we need to create multi-level tree. The goal is to reduce the size of individual page tables.

In addition to the formulas above: how many bits do I need to index one such table? Bits for page number = log(PTEs per page).

LEVELS. Say we have \(V\) bits to index virtual address.

Then the levels = \(\left\lceil{V- \text{bits for page offset}\over \text{bits for page number}}\right\rceil\). There are different conventions that you can split the page numbers…
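
A worked example (assuming 4KB pages and 4-byte PTEs): each page holds \(2^{12}/2^{2} = 2^{10}\) PTEs, so bits for page number = 10. With \(V = 40\) and 12 offset bits, levels \(= \lceil (40-12)/10 \rceil = 3\).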

Summary: Roles of the Kernel and the MMU

Managing virtual memory requires both software and hardware.

  • Kernel (software):
    • Manages MMU registers on address space switches (context switch from thread in one process to thread in a different process)
    • Creates and manages page tables
    • Manages (allocates/deallocates) physical memory
    • Handles exceptions raised by the MMU
  • MMU (hardware):
    • Translates virtual addresses to physical addresses
    • Checks for and raises exceptions when necessary

The MMU’s only job is the actual address translation. When it can’t translate, it raises an exception. The kernel is then responsible for handling the exceptions raised by the MMU.

The operating system is responsible for implementing and managing virtual memory: creating page tables, mapping pages to frames, etc. All of this is done by the kernel.

TLBs

We have a bit of a problem… As you know, programs all use virtual addresses, so the PC is a virtual address. For example,

add $1 $2 $3
sub $1 $2 $3
...

For add, I need to go to RAM, grab a page table entry, and bring it back. Then the same for sub. There’s a high probability that add and sub are on the same page of virtual memory, being sequential instructions, so going back and forth is a waste of time. Going to RAM is slow, since RAM is not at the CPU; it’s a separate entity, while the MMU is at the CPU, connected by an address bus (it takes time for a signal to propagate along the wire).

Keep in mind: Each assembly instruction requires a minimum of one memory operation. Address translation through a page table adds a minimum of one extra memory operation (for page table entry lookup) for each memory operation performed during instruction execution.

Problem: This extra step can be slow!

Solution: include a Translation Lookaside Buffer (TLB) (a cache for PTEs) in the MMU

  • TLB is a small, fast, dedicated cache of address translations in the MMU
  • Each TLB entry stores a single (page# → frame#) mapping.
  • I.e. store the pair (p,f) which means translate p to f

The MMU has the TLB in it, which is a cache of PTEs. When we do address translation, we first ask the TLB: hey, do you have this? If it does, we just use the local cache. If not, we have a cache miss, and only then do we actually go to RAM. This speeds us up.

As it turns out, there are two ways to implement TLB.

Hardware-Managed TLB

Algorithm for a Hardware-Managed TLB:

if (p has an entry (p,f) in the TLB) then
    return f // TLB hit!
else
    find p’s frame number (f) from the page table
    add (p,f) to the TLB, evicting another entry if full
    return f // TLB miss

In the else branch, the hardware goes to RAM, gets the PTE, and tries to put it in the TLB. If it can’t fit, the hardware chooses an entry to kick out, inserts the new entry, and then returns the frame number.

Advantage: fast

Downside: the hardware dictates the page table format. The hardware must know how to interpret a PTE on a TLB miss and how to update the TLB, so there is no flexibility: the hardware enforces a restriction that the kernel must obey.

For a hardware-managed TLB the MMU handles TLB misses, including page table lookup and replacement of TLB entries.
⇒ MMU must understand the kernel’s page table format.

Software-Managed TLB

if (p has an entry (p,f) in the TLB) then
    return f // TLB hit!
else
    raise exception // TLB miss

In case of a TLB miss, the kernel must

  • determine the frame number for p,
  • add (p,f) to the TLB, evicting another entry if necessary

After the miss is handled, the instruction that caused the exception is re-tried. At this time, it will succeed. This gives us more flexibility.

As a warning, OS/161 currently has no cache eviction strategy. If TLB is full, it panics and dies.

Because we have changed our hardware, we must also change what happens on a context switch. So now our problem: what do we do with the TLB on a context switch? We already update the MMU on a context switch by changing the page table base register to match the new process. But we must now also change the TLB, since the frames cached in it are with respect to the previously running process: when we context switch to a different process, we must clear the TLB out so we don’t have any other process’s mappings in there.

As a side note, as_activate: it is tempting to call it to activate an address space. Don’t call it for that; it does not do what it advertises: it clears the TLB. See the details in A2b.

  • The MIPS TLB has room for 64 entries.
  • Each entry is 64 bits (8 bytes) long, as shown.
  • If TLBLO_DIRTY (a.k.a. write permission) is set (i.e. is 1) it means that you can write to this page. (The write permission bit indicates whether the mapped page can be written to. If TLBLO_DIRTY is clear (i.e. 0), an attempt to translate an address on this page for a store instruction (e.g. sw or sb) will result in an exception.)
  • See kern/arch/mips/include/tlb.h

The high-order word contains the page number, and the low-order word the frame number. What is the 6-bit PID field? The process these entries belong to? Theoretically speaking, that’s exactly what it could be used for, except it’s not used. This field exists in every TLB entry, and up until very recently it was not used. One downside of not using it is that you clear the TLB on every context switch, so nothing stays cached. That field can be used for cache affinity: when we come back to a previously running process, we don’t just check the page number, we match the page number and the PID, so we don’t need to clear the TLB on a context switch. The field can also be used for security: keep track of which PID is actually running right now and only perform address translations if the two PIDs match, which adds isolation as well. That’s a fairly new thing these days.

Paging - Conclusion

Benefits

  • paging does not introduce external fragmentation
  • multi-level paging reduces the amount of memory required to store page-to-frame mappings

Costs

  • TLB misses are increasingly expensive with deeper page tables
    • To translate an address causing a TLB miss for a three-level page table requires three memory accesses, i.e. one for each page table

FYI: DOS, the Disk Operating System, predecessor of Windows, didn’t use virtual memory, and neither did the earliest Windows. Linux already supports 5-level paging with 57-bit virtual addresses (an odd number, not a power of 2, which seems bizarre).

dumbvm in OS/161

Now let’s move our focus away from theoretical implementation of VM and start to talk about how OS/161 implements VM. It is called dumbvm because it is really really dumb.

Recall: MIPS uses 32-bit paged virtual and physical addresses, and MIPS has a software-managed TLB.

OS/161 does segmentation. OS/161 sets each segment to be a contiguous block of pages and requires that those pages map to a contiguous block of frames (physical memory). So if the code segment needs 12 contiguous pages, it needs 12 contiguous frames in physical memory. In A3, you are going to fix this: you will add paging back and combine paging with segmentation, which is an excellent way of doing virtual memory.

The implementation of virtual memory for OS/161 is in dumbvm.c. Any time the MMU throws an exception, the function that handles that exception is vm_fault. If you see

  • TLB Miss on LOAD -> you were trying to read from an address,
  • TLB Miss on STORE -> you were trying to write to an address,

that is not valid.

The addrspace Structure

struct addrspace {
    vaddr_t as_vbase1;      /* base virtual address of code segment */
    paddr_t as_pbase1;      /* base physical address of code segment */
    size_t as_npages1;      /* size (in pages) of code segment */
    vaddr_t as_vbase2;      /* base virtual address of data segment */
    paddr_t as_pbase2;      /* base physical address of data segment */
    size_t as_npages2;      /* size (in pages) of data segment */
    paddr_t as_stackpbase;  /* base physical address of stack */
};

Note that there are three segments in the address space for OS/161: the code, data, and stack segments. Because it is using segmentation, we keep track of each segment’s starting physical address, its size in pages (i.e. its limit), and its starting virtual address. We do not set aside and create a segment table, and we are not setting aside higher-order bits for a segment number; we will see in a minute why not.

For the code segment, we have vbase1: the virtual base address, the first address of that virtual segment region. Then we have pbase1, the beginning of the code segment in physical memory; these are page-aligned because the MMU uses pages. The data segment is analogous. Notice that for the stack segment we don’t store all of these fields; the only one is as_stackpbase, the physical base address. The reason: OS/161’s stack has a fixed size and fixed location in virtual memory for all processes: 12 pages (a macro).

Then, how does OS/161 do its own addr translation?

// First calculate the top and bottom address for each segment.
vbase1 = as->as_vbase1;
vtop1 = vbase1 + as->as_npages1 * PAGE_SIZE;
vbase2 = as->as_vbase2;
vtop2 = vbase2 + as->as_npages2 * PAGE_SIZE;
stackbase = USERSTACK - DUMBVM_STACKPAGES * PAGE_SIZE;
stacktop = USERSTACK;

Where the following are predefined constants

USERSTACK = 0x8000 0000
DUMBVM_STACKPAGES = 0xC  // decimal 12
PAGE_SIZE = 0x1000       // decimal 4096 or 4K

Key Task: To perform the address translation of faultaddress

  1. Determine if the address is valid by checking if it is in the proper range for any of the code, data or stack segments.
  2. If so then
    • (a) calculate its offset from the virtual base address for that segment. Then
    • (b) add the physical base address to that offset.
if (faultaddress >= vbase1 && faultaddress < vtop1) {
    paddr = (faultaddress - vbase1) + as->as_pbase1; // code segment
}
else if (faultaddress >= vbase2 && faultaddress < vtop2) {
    paddr = (faultaddress - vbase2) + as->as_pbase2; // data segment
}
else if (faultaddress >= stackbase && faultaddress < stacktop) {
    paddr = (faultaddress - stackbase) + as->as_stackpbase; // stack segment
}
else {
    return EFAULT;
}
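
A made-up example of this translation: suppose as_vbase2 = 0x1000 0000, as_npages2 = 4, and as_pbase2 = 0x0020 0000. Then vtop2 = 0x1000 4000, so faultaddress 0x1000 2468 falls in the data segment and translates to paddr = (0x1000 2468 - 0x1000 0000) + 0x0020 0000 = 0x0020 2468.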

Conclusion: We now know how to translate a virtual address.

Next Question: But where do those virtual addresses come from?

Answer: They come from the program’s object file…

ELF

The binary file tells the operating system the data and code segments’ sizes and starting addresses. When we use load_elf to load the address space, it reads this info from the header of the binary. In this course, we use the ELF (Executable and Linking Format) format. The header tells you all about the address space: what it needs to look like and where it needs to live. In particular, the ELF header describes the segment images:

  • the virtual address of the start of the segment
  • the length of the segment in the virtual address space
  • the location of the segment in the ELF file
  • the length of the segment in the ELF file

However, you might be wondering why we need to specify both the segment’s size in the file and its size in the virtual address space. Shouldn’t these be the same? Usually yes, but it is also possible to request more virtual memory for a segment than the file actually provides, so that we have room to grow the heap a little bit.

The ELF file identifies the (virtual) address of the program’s first instruction (i.e. the entry point).

The ELF file also contains lots of other information (e.g. section descriptors, symbol tables, relocation tables, external symbol references) that is useful to compilers, linkers, debuggers, loaders, and other tools used to build programs.

ELF also contains info for each segment: is this a read-only segment? read-and-write?

The only thing missing from this file is a description of the stack, because the OS dictates the size and location of the stack, not the compiler.

OS/161 ELF Files

OS/161’s dumbvm implementation assumes that an ELF file contains two segments:

  1. a text segment, containing the program code and any read-only data
  2. a data segment, containing any other global program data

The images in the ELF file are an exact copy of the binary data to be stored in the address space. But the ELF file does not describe the stack, because the initial contents of the stack (command line arguments) are unknown until run time. dumbvm creates a stack segment for each process: it is 12 pages long, running from virtual address 0x7FFF C000 to 0x7FFF FFFF.

Virtual Memory for the Kernel

It would be nice if we let kernel live in virtual memory, but there are some challenges:

  • Bootstrapping: Since the kernel helps to implement virtual memory, how can the kernel run in virtual memory when it is just starting?
    How do we actually solve the bootstrapping problem? It is usually solved by the architecture, and the solution is sometimes surprisingly simple.
  • Sharing: Sometimes data needs to be copied between the kernel and application programs. How can this happen if they are in different virtual address spaces?
    We are going to make the kernel’s virtual memory overlap with every single process. So the kernel has access to every single process’s virtual memory without overlapping any addresses.

You might be wondering why the stack lives in the middle: because we are dividing our virtual address range in half. The lower half is for the user, the upper half is for the kernel, and the MMU uses a different translation scheme for each. In OS/161, kernel addresses just use dynamic relocation, which solves the bootstrapping problem.

It looks like every process has a copy of the kernel’s memory, but there are not multiple copies; there is only one. In unprivileged mode, the MMU should only translate user addresses. When the CPU changes to privileged mode, the MMU can translate both.

OS/161 Memory

Key Point: Sys/161 only supports 1GB of physical memory. The remaining 3GB are not available (i.e. not usable).

Taking a closer look at virtual memory, the first 2GB are dedicated to user programs. We call this kuseg, the user segment. The remaining 50%, the kernel’s virtual memory, is divided into three segments:

  • kseg0 (512MB): for kernel data structures, stacks, etc.
  • kseg1 (512MB): for addressing devices. It does not really store anything; if we see these addresses, we know we are trying to talk to a device.
  • kseg2 (1GB): not used (this is true not only in OS/161)

How do we manage the memory? Physical memory is divided into frames, which are managed by the kernel in the coremap, a structure that tracks which frames are available and which are not. But OS/161 does not have a coremap: it has a pointer to the next free location. When you request memory, the pointer advances; when you free memory, nothing happens. In A3 you will implement the coremap, so that memory is properly managed and freed when it is done.
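
A paraphrased sketch of that pointer-based scheme (modelled on OS/161’s ram_stealmem; the exact details may differ):

paddr_t
ram_stealmem(unsigned long npages)
{
    paddr_t paddr;
    size_t size = npages * PAGE_SIZE;

    if (firstpaddr + size > lastpaddr) {
        return 0;               /* out of physical memory */
    }
    paddr = firstpaddr;
    firstpaddr += size;         /* requesting advances the pointer... */
    return paddr;               /* ...and freeing never moves it back */
}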

So we are dividing physical memory into frames and user memory into pages.

kseg0: not paged, not cached; we use dynamic relocation. To translate a kseg0 address to a physical one, subtract 0x8000 0000 from the virtual address; i.e., the TLB is not used. kseg0 maps to the first 512MB of physical memory, though it may not use all of this space. This 512MB is the maximum memory the kernel can use; whatever it leaves unused is available to user programs.

kseg1: the magical segment. Addresses within kseg1 are for accessing devices, such as the hard drive or the timer. To translate a kseg1 address to a physical one, subtract 0xA000 0000 from the virtual address. Again, the TLB is not used. kseg1 also maps to the first 512MB of physical memory, though it does not use all of this space. Although kseg1 and kseg0 both map to the first 512MB of physical memory, in practice the memory used for kseg1 will not be used for kseg0, so it works out.
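
Because the translation is pure subtraction, it can be written as a pair of macros (paraphrasing the ones OS/161 defines in mips/vm.h):

#define MIPS_KSEG0 0x80000000
#define MIPS_KSEG1 0xa0000000

/* kseg0 virtual address -> physical address (no TLB involved) */
#define KVADDR_TO_PADDR(vaddr) ((paddr_t)(vaddr) - MIPS_KSEG0)
/* physical address -> kseg0 virtual address, so the kernel can
   get a usable pointer to any frame in the first 512MB */
#define PADDR_TO_KVADDR(paddr) ((vaddr_t)(paddr) + MIPS_KSEG0)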

By the way, my favorite final exam problem is that I’ll give you a panic msg and you tell me what’s wrong.

Exploiting Secondary Storage

What we are doing right now is called address space preloading: we load the whole address space into RAM, which allows fast access. But here is the problem. Take Matlab: how much of Matlab do you actually use when you start it up? You load 5GB but only use 512MB, which is rather wasteful. What makes a lot more sense is to keep only the address space components we are actively using in RAM, keep everything else in secondary storage, and load it only when needed. So we want to exploit secondary storage.

Goals:

  • Allow virtual address spaces that are larger than the physical address space (i.e. installed RAM).
  • Allow greater multiprogramming levels by using less of the available primary memory (i.e. installed RAM) for each process.

Method:

  • Allow pages from virtual memories to be stored in secondary storage, i.e., on a hard disk drive (HDD) or solid state drive (SSD).
  • Swap pages (or segments) between secondary storage and primary memory so that they are in primary memory when they are needed.

Resident Sets and Present Bits

What additional information needs to be tracked to allow us to exploit secondary storage?

  • When swapping is used, some pages of virtual memory will be in primary memory and others will not be in primary memory.
    • The set of virtual pages present in physical memory is called the resident set of a process.
    • A process’s resident set will change over time as pages are swapped in and out of physical memory
  • To track which pages are in physical memory, each PTE needs to contain an extra bit, called the present bit:
    • valid=1, present=1 => page is valid and in memory
    • valid=1, present=0 => page is valid, but not in memory
    • valid=0 => invalid page

If the present bit is zero, the page is not in RAM, so the PTE has no frame number. It is possible for a page to be valid but not in memory, because there are ranges of the address space that are technically in use but simply not resident right now. And we don’t care about valid=0 entries, since those are not valid address mappings. An interesting question: what about TLB entries? Should the TLB contain an entry for a page that is not present in physical memory? No: in addition to maintaining the present bit, whenever we change a page mapping’s present status, we must make sure anything that is not physically in memory is not in the TLB either. If a page is not in physical memory, it has no frame number and thus no address translation, so why would you put it in the TLB? The TLB is supposed to be fast, not garbage.
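
As a sketch, a PTE that supports swapping might look like this (a hypothetical layout; dumbvm has no page table, so this is illustrative only):

struct pte {
    unsigned frame;         /* frame number; meaningful only when present = 1 */
    unsigned swap_slot;     /* where the page lives on disk when present = 0 */
    unsigned valid   : 1;   /* is this page part of the address space at all? */
    unsigned present : 1;   /* is the page currently in physical memory? */
};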

Page Faults

When a process tries to access a page that is not in memory, the situation is detected because the page’s present bit is zero.

  • On a machine with a hardware-managed TLB
    • the MMU detects this situation when it checks the page’s PTE,
    • it generates an exception, which the kernel must handle.
  • On a machine with a software-managed TLB
    • the kernel detects this situation when it checks the page’s PTE after a TLB miss
    • i.e., the TLB should not contain any entries that are not present in RAM.

This event (attempting to access a non-resident page) is called a page fault.

When a page fault happens, it is the kernel’s job to:

  1. Swap the page into memory from secondary storage, possibly evicting another page (move it from RAM to secondary storage) if necessary.
  2. Update the PTE for that page (i.e. set the present bit).
  3. Return from the exception so that the application can retry the virtual memory access that caused the page fault.
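
A hypothetical sketch of those three steps in code (every name here is made up; OS/161’s dumbvm does no swapping, so nothing like this exists there):

int
handle_page_fault(struct addrspace *as, vaddr_t faultaddr)
{
    struct pte *pte = pte_lookup(as, faultaddr);  /* hypothetical lookup helper */

    if (pte == NULL || !pte->valid) {
        return EFAULT;                     /* not a valid mapping at all */
    }
    if (!pte->present) {
        paddr_t frame = get_free_frame();  /* step 1: may first evict a victim page */
        swap_in(pte->swap_slot, frame);    /* copy the page from disk into RAM */
        pte->frame = frame / PAGE_SIZE;    /* step 2: update the PTE... */
        pte->present = 1;                  /* ...and set the present bit */
    }
    return 0;  /* step 3: returning from the exception retries the faulting access */
}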

Page Faults are Slow…

  • The impact of swapping on the average memory access time depends on the fault frequency.
  • If secondary storage access is 1000 times slower than primary memory, then … (see the worked example below).
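
To make that concrete (the numbers are illustrative, not from the course): suppose a memory access takes 100 ns and servicing a page fault takes 100 µs, 1000 times longer. If p is the page fault rate, the effective access time is EAT = (1 − p) × 100 ns + p × 100 000 ns. Even p = 0.001, one fault per thousand accesses, gives EAT ≈ 200 ns: the average access time has doubled.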

To improve the performance of virtual memory with on-demand paging, reduce the occurrence of page faults.

  • Limit the number of processes, so that there is enough physical memory per process.
  • Try to be smart about which pages are kept in physical memory and which are evicted, i.e. have a replacement policy.
  • Hide latencies, e.g., by prefetching pages before a process needs them.

A Simple Replacement Policy: FIFO

A replacement policy specifies how to determine which page to evict from physical memory. The FIFO policy: replace the page that has been in memory the longest (see example in slide).
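
A minimal sketch of FIFO replacement (with a made-up reference string, not the slide’s example): because pages are evicted in insertion order, a single rotating index is all the bookkeeping needed.

#include <stdio.h>

#define NFRAMES 3

int main(void) {
    int frames[NFRAMES] = {-1, -1, -1};        /* -1 marks an empty frame */
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2};  /* made-up reference string */
    int oldest = 0, faults = 0;

    for (unsigned i = 0; i < sizeof(refs) / sizeof(refs[0]); i++) {
        int hit = 0;
        for (int f = 0; f < NFRAMES; f++) {
            if (frames[f] == refs[i]) { hit = 1; break; }
        }
        if (!hit) {                           /* page fault */
            frames[oldest] = refs[i];         /* evict the longest-resident page */
            oldest = (oldest + 1) % NFRAMES;  /* the next slot is now the oldest */
            faults++;
        }
    }
    printf("%d page faults\n", faults);
    return 0;
}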

Though it is fair and easy to implement… here is the best possible algorithm:

Optimal Page Replacement

There is an optimal page replacement policy for demand paging, called MIN: replace the page that will not be referenced for the longest time.

  • MIN requires knowledge of the future, so it is only a theoretical (impossible) policy.
  • Other practical policies can be compared to MIN’s performance.

Locality

Real programs do not access their virtual memories randomly. Instead, they exhibit locality.

There are two types of locality.

  1. temporal locality: programs are more likely to access pages that they have accessed recently than pages that they have not accessed recently.
  2. spatial locality: programs are likely to access parts of memory that are close to parts of memory they have accessed recently.

Locality helps the kernel keep the page fault rates low.
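
A tiny illustration (mine, not from the course materials): both functions compute the same sum, but the first walks the array in memory order, so consecutive accesses usually land on the same page (spatial locality), while the second jumps a full 4KB row between accesses.

#define ROWS 1024
#define COLS 1024

static int a[ROWS][COLS];   /* each row is 1024 * 4 bytes = one 4KB page */

long row_major_sum(void) {  /* good spatial locality */
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j]; /* adjacent addresses, mostly the same page */
    return sum;
}

long col_major_sum(void) {  /* poor spatial locality */
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j]; /* each access is a full page away from the last */
    return sum;
}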

Least Recently Used (LRU) Page Replacement

Using the same three-frame example as before, Least Recently Used (LRU) page replacement would evict the page that has not been used for the longest period of time.

Challenges with using LRU

  • It must track usage and find a maximum value, which is expensive.
  • The kernel is not aware of which pages a program is using unless there is an exception.
  • This makes it difficult for the kernel to exploit locality by implementing a replacement policy like LRU.

Solution: Have the MMU track page accesses in hardware.

  • Simple scheme: add a use bit (or reference bit) to each PTE.
    This bit:
    • is set by the MMU each time the page is used, i.e., each time the MMU translates a virtual address on that page
    • can be read and cleared by the kernel.
  • The use bit provides a small amount of memory usage information that can be exploited by the kernel.

use bit: every time the MMU performs a translation on a particular page, it sets that page’s use bit to 1. Periodically, the kernel takes all the use bits and resets them to zero. So the use bit indicates whether an address translation has been used since the last time the kernel cleared it. We are not keeping track of the exact time a page was used, just a rough idea of which translations have been used since the kernel last cleared the bits.

The Clock Replacement Algorithm

The Clock Algorithm (also known as Second Chance) is one of the simplest algorithms that exploits the use bit.

The clock algorithm can be visualized as a victim pointer that cycles through the page frames. The pointer moves whenever a replacement is necessary:

while (use_bit[victim] == 1) {
    use_bit[victim] = 0;                // clear it: the page gets a second chance
    victim = (victim + 1) % num_frames;
}
evict(victim);                          // safe to evict: its use bit is 0
victim = (victim + 1) % num_frames;     // the next fault's scan resumes here

… The next time there is a page fault, the scan starts where the victim pointer left off. You might be wondering what happens if every single page has been used, i.e. all the use bits are 1. When we try to victimize an entry and see that it was used, we reset its use bit to zero so that it is tracked again in the near future. After travelling from the starting position all the way around the clock, we arrive back where we began; by then its use bit has been reset to zero, so we victimize it and stop. So the scan does not go on forever.

This is something we can implement, since we only need to worry about unsuccessful translations.

more on wiki.

Scheduling

Lesley: lec 17; Kevin: page 327

Devices and I/O

Lesley: lec 18, 19 and 20; Kevin: page 370

File Systems

Lesley: lec 21, 22 and 23; Kevin: page 409

Virtual Machines

course note

Lesley: lec 23 and 24; Kevin: N/A since this material is cut off in W20



Copyright © 2017-2024 Sibelius Peng.