跳转至

Thread-Level Parallelism

Threads

  • Threads (short for thread of execution) is a single stream of instructions.

  • Each thread has:

    • Its own registers (including stack pointer)
    • Its own program counter (PC)
    • Shared memory (heap, global variables) with other threads
  • Each processor provides one (or more) hardware threads (through multi-core or single-core multithreading, later) that actively execute instructions

  • Within a given program’s process, threads can run concurrently.

  • Operating system (OS) multiplexes multiple software threads onto the available hardware threads

Fork-Join Model

Fork-Join model: A program / process can split, or fork itself into separate threads, which can (in theory) execute concurrently.

  • Main thread executes sequentially until first parallel task region.
  • Fork: Main thread then creates a team of parallel subthreads.
  • Join: When subthreads complete their parallel task region, they synchronize and terminate, leaving only the main thread.

OpenMP

OpenMP: Threads

C
#include <omp.h>
C
#pragma omp parallel
{
    // This code is executed by all threads
}

OpenMP creates as many threads as specified in the environment variable OMP_NUM_THREADS.

  • Set this variable to max number of threads you want to use
  • Generally, default: (# physical cores) * (# threads/core). Use lscpu command to obtain the numbers (e.g. 6 physical cores * 2 threads/core = 12 threads)

OpenMP threads are OS (software) threads, which are then multiplexed onto available hardware threads.

OpenMP Intrinsic Description
omp_set_num_threads(x) Set number of threads to x.
num_th = omp_get_num_threads() Get number of threads.
th_ID = omp_get_thread_num() Get thread ID number.

OpenMP: Shared/Private Variables

  • Shared variables: All threads read/write the same variable.

    • Variable declared outside of parallel region
    • Heap-allocated variables
    • Static variables
  • Private variables: Each thread has its own copy of the variable.

    • Variables declared inside parallel region (recall separate stack frames)
C
int var1, var2;
char *var3 = malloc(...);
#pragma omp parallel private(var2)
{
    int var4;
    // var1 shared (default)
    // var2 private
    // var3 shared (heap)
    // var4 private (thread’s stack)
    ...
}

Data Race

  • Two memory accesses form a data race if:

    • They are from different threads to the same location
    • At least one is a write, and
    • They occur one after another.
  • Recall thread model: shared memory.

    • For a given thread, these two operations don't necessarily happen together...! Thread scheduling is non-deterministic.
      • Read current value of x
      • Write new value of x

Not a data hazard!

  • Data hazard: Sequential instructions have data dependencies during concurrent execution (instruction-level parallelism via pipelining).

  • Here, even with no ILP, can have nondeterministic results. This results from lack of synchronization on which thread accesses memory first.

Data Race Example

C
int x = 0;
#pragma omp parallel
{
    x = x + 1;
}

x could be 0~ # threads.

OpenMP Synchronization

  • To enforce multithreaded program correctness, we often need to synchronize threads, i.e., coordinate their execution.
    • Most commonly, know when one task is finished writing so that it is safe for another to read.

Critical Sections with OpenMP

  • A critical section is a segment of code that must be executed by a single thread at a time, thereby enforcing synchronization.
C
#pragma omp barrier
  • Forces all threads to wait until all threads have hit the barrier.
C
#pragma omp critical
  • Creates a critical section within a parallel code segment; only one thread can run a critical section at a time.

OpenMP has very restrictive parallelism.

  • Really only good for parallelizing loops
    • If your critical section is too large, then effectively serial program → Amdahl's law quickly rears its ugly head
    • If critical sections not defined well, can run into deadlock.

Lock Synchronization

  • Use a “Lock” to grant access to a region (critical section) so that only one thread can have the lock and operate at a time

    • Need all processors to be able to access the lock, so use a location in shared memory as the lock
  • Processors read lock and either wait (if locked) or set lock and go into critical section

    • 0 means lock is free / open / unlocked / lock off
    • 1 means lock is set / closed / locked / lock on
  • Locks are one approach to implementing thread synchronization.

  • Suppose lock in shared memory @ 0x5555 0000

  • Pseudocode:

Text Only
Check lock @ 0x5555 0000
Set lock @ 0x5555 0000
Critical section (e.g. change shared variables)
Unset lock @ 0x5555 0000
  • Two operations, formally:
    • Acquire: try to acquire the lock. If successful, keep going. Otherwise, wait and try later;
    • Release: Unlock and continue (only works if we originally had the lock)

Lock Synchronization Realization

Hardware Synchronization

  • Hardware support required to prevent an interloper (another thread) from changing the value

    • Atomic read/write memory operation (all other operations must to happen strictly before/after the read/write)
    • No other access to the location allowed between the read and write
  • How to implement in software?

    • Single instruction: atomic swap of register ↔ memory through atomic instructions;
    • Pair of instructions: one for read (and lock), one for write (and unlock);
  • Needed even on uniprocessor systems

    • Interrupts can happen: can trigger thread context switches...

RISC-V: Two solutions

  • Option 1: Read/Write Pairs

    • Pair of instructions for “linked” read and write
    • Load-reserved and Store-conditional
    • No other access permitted between read and write
    • Must use shared memory (multiprocessing)
  • Option 2: Atomic Memory Operations

    • Atomic swap of register ↔ memory

Option 1: Read/Write Pairs

  • Load-reserved instruction: lr rd, rs

    • Load the word (doubleword) pointed to by rs into rd, and register a (hardware thread) reservation set
  • Store-conditional: sc rd, rs1, rs2

    • Store the value in rs2 into the memory location pointed to by rs1, only if the reservation is still valid and set the status in rd
      • Returns 0 (success) to rd if location has not changed since the lr
      • Returns nonzero (failure) to rd if location has changed: Actual store will not take place
    • Invalid the (hardware thread) reservation whether success or not

C
    li t2, 1
Try:
    lr t1, s1
    bne t1, zero, Try
    sc t0, s1, t2
Locked:
    # critical section
Unlocked:
    sw zero, 0(s1)

Option 2: RISC-V Atomic Memory Operations (AMOs)

  • Encoded with an R-type instruction format

    • swap, add, and, or, xor, max, min
    • AMOSWAP rd, rs2, (rs1)
    • AMOADD rd, rs2, (rs1)
  • Take the value pointed to by rs1

    • Load it into rd
    • Apply the operation to that value with the contents in rs2
      • If rs2 = rd, use the old value in rd
    • Store the result back to where rs1 is pointed to
  • This allows atomic swap as a primitive

    • It also allows "reduction operations" that are common to be efficiently implemented
Text Only
    li t0, 1
Try:
    amoswap.w.aq t1, t0, 0(a0)
    bnez t1, Try
    # critical section
Unlocked:
    amoswap.w.rl zero, zero, 0(a0)

评论