
Parallelism

Parallelism Overview

  • Parallel Requests
  • Parallel Threads
  • Parallel Instructions
  • Parallel Data
  • Hardware Descriptions
  • Programming Languages

Amdahl's Law

\[ \text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time not affected} \]
\[ \text{Speed-up} = \frac{\text{Original execution time}}{\text{Execution time after improvement}} = \frac{1}{(1 - F) + \frac{F}{S}} \]
where \(F\) is the fraction of the original execution time affected by the improvement and \(S\) is the speed-up of that fraction.
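As a worked example (numbers chosen purely for illustration): if 80% of the execution time is affected by an improvement that makes that part 4× faster, then \(F = 0.8\), \(S = 4\), and

\[ \text{Speed-up} = \frac{1}{(1 - 0.8) + \frac{0.8}{4}} = \frac{1}{0.4} = 2.5 \]

Even as \(S \to \infty\), the speed-up is bounded by \(\frac{1}{1 - F} = 5\): the unaffected fraction dominates.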

Strong and Weak Scaling

  • Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem;
  • Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (a numeric illustration follows)
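A numeric illustration (all numbers invented for this example): suppose a program does 10 s of serial work plus 90 s of perfectly parallelizable work on one processor, 100 s in total.

\[ \text{Strong scaling, 10 processors, same problem: } 10 + \frac{90}{10} = 19 \text{ s} \Rightarrow \text{speedup} = \frac{100}{19} \approx 5.3 \]
\[ \text{Weak scaling, 10 processors, } 10\times \text{ problem: } 10 + \frac{900}{10} = 100 \text{ s} \]

Under weak scaling the 10× larger problem finishes in the original 100 s, so the scaled speedup is counted as 10.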

Flynn's Taxonomy

Single Instruction Single Data (SISD)

  • Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.
    • E.g. Our RISC-V processor up to now;
    • A superscalar processor is still SISD because the programming model is sequential

Single Instruction Multiple Data (SIMD)

  • A SIMD computer exploits multiple data streams against a single instruction stream, performing operations that may be naturally parallelized.
    • Intel SIMD instruction extensions
    • NVIDIA Graphics Processing Unit (GPU)
    • Vector processors

Multiple Instruction Multiple Data (MIMD)

  • Multiple autonomous processors simultaneously executing different instructions on different data (see the sketch after this list).
    • Multicore
    • Warehouse-scale computers (WSC)
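A small illustration of MIMD-style thread-level parallelism (a sketch using POSIX threads; all function and variable names are invented): two threads simultaneously execute different instruction streams (a sum and a max) on different data streams.
C
#include <pthread.h>
#include <stdio.h>

void *sum_task(void *arg) {               /* instruction stream 1 */
    int *a = arg, s = 0;
    for (int i = 0; i < 4; i++) s += a[i];
    printf("sum = %d\n", s);
    return NULL;
}

void *max_task(void *arg) {               /* instruction stream 2 */
    int *a = arg, m = a[0];
    for (int i = 1; i < 4; i++) if (a[i] > m) m = a[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int xs[4] = {1, 2, 3, 4}, ys[4] = {7, 5, 9, 6};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, xs);  /* data stream 1 */
    pthread_create(&t2, NULL, max_task, ys);  /* data stream 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}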

Multiple Instruction Single Data (MISD)

  • Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
    • Rare, mainly of historical interest only
    • Some literature categorizes systolic arrays as MISD

SIMD

SIMD Architecture

  • Instruction: One instruction is fetched & decoded for the entire operation
  • Register: Use wide registers (128-bit XMM, 256-bit YMM, 512-bit ZMM) to store multiple data elements
  • Pipelining & Parallelism: Data loading, ALU operations, and data storing are pipelined and parallelized
  • Special Operation Unit: Use a vector operation unit to perform SIMD operations (SSE/AVX, ARM NEON, RISC-V Vector); a minimal sketch follows this list
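The register bullets above map directly onto Intel's SSE intrinsics. A minimal sketch (function and variable names are illustrative; n is assumed to be a multiple of 4): one `_mm_add_ps` performs four single-precision additions inside a 128-bit XMM register per instruction.
C
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add scalar s to every element of x; n is assumed a multiple of 4. */
void add_scalar_sse(float *x, int n, float s) {
    __m128 vs = _mm_set1_ps(s);          /* broadcast s into all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);  /* load 4 floats */
        v = _mm_add_ps(v, vs);           /* 4 additions in one instruction */
        _mm_storeu_ps(&x[i], v);         /* store 4 results */
    }
}
This is the same computation as the loop-unrolling example below, with the copies of the loop body collapsed into single SIMD instructions.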

Loop Unrolling

  • Unroll the loop and adjust the iteration step to expose more DLP (data-level parallelism) than is available in a single iteration of the loop.
C
// Original loop: one element per iteration
for (i = 1000; i > 0; i--) {
    x[i] = x[i] + s;
}
C
// Unrolled 4x: four independent elements per iteration expose more DLP
for (i = 1000; i > 0; i -= 4) {
    x[i]   = x[i]   + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}
  • Given a loop with \(n\) iterations and \(k\) copies of the body, and assuming \(n \text{ mod } k \neq 0\),
  • we run the loop with 1 copy of the body \(n \text{ mod } k\) times, and then with \(k\) copies of the body \(\left\lfloor \frac{n}{k} \right\rfloor\) times (see the sketch below).
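A minimal C sketch of this split (continuing the x[i] = x[i] + s example above with k = 4; the variable n is assumed to hold the iteration count):
C
int rem = n % 4;                 /* n mod k leftover iterations */
/* cleanup loop: 1 copy of the body, run n mod k times */
for (i = n; i > n - rem; i--) {
    x[i] = x[i] + s;
}
/* unrolled loop: k copies of the body, run floor(n/k) times */
for (i = n - rem; i > 0; i -= 4) {
    x[i]   = x[i]   + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}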

RISC-V Vector Extension

  • 32 vector registers
  • Need to set up the data width and the number of vector registers to work on before use (vconfig)!
  • vflw.s: vector float load word (the .s form): load a single word and put it in v1 ‘vector length’ times
  • vsetvl: ask for a certain vector length – the hardware grants what it can do (at most maxvl)!
Text Only
    # assume x1 contains size of array
    # assume t1 contains address of array
    # assume x4 contains address of scalar s
    vconfig 0x63        # enable 4 vregs, 32-bit data (float)
    vflw.s v1, 0(x4)    # load scalar value s to v1 ('vector length' times)

loop:
    vsetvl x2, x1       # will set vl and x2 both to min(maxvl, x1)
    vflw v0, 0(t1)      # will load `vl` elements out of `vec`
    vfadd.s v2, v1, v0  # do the add
    vsw v2, 0(t1)       # store result back to `vec`
    slli x5, x2, 2      # bytes consumed from `vec` (x2 * sizeof(float))
    add t1, t1, x5      # increment `vec` pointer
    sub x1, x1, x2      # subtract from total (x1) work done this iteration
    bne x1, x0, loop    # if x1 not yet zero, still work to do
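This loop is the classic strip-mining pattern: each pass asks vsetvl for as many elements as the hardware will grant, processes that strip, then advances the pointer and the remaining count. A scalar C model of the same control flow (a sketch; maxvl stands in for the hardware's maximum vector length):
C
/* Scalar model of the strip-mined vector loop above. */
void vec_add_s(float *vec, int n, float s, int maxvl) {
    while (n > 0) {
        int vl = (n < maxvl) ? n : maxvl;  /* vsetvl: min(maxvl, n) */
        for (int i = 0; i < vl; i++)       /* vflw + vfadd.s + vsw */
            vec[i] = vec[i] + s;
        vec += vl;                         /* increment `vec` pointer */
        n -= vl;                           /* subtract work done */
    }
}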
