Cache¶
Intro¶
DRAM & SRAM¶
- Dynamic Random Access Memory (DRAM) as main memory
- Static Random Access Memory (SRAM) as cache
Locality¶
- Temporal locality (locality in time)
    - If a memory location is referenced, then it will tend to be referenced again soon
- Spatial locality (locality in space)
    - If a memory location is referenced, then locations with nearby addresses will tend to be referenced soon
Principle of Locality¶
- Programs access a small portion of the address space at any instant of time (spatial locality) and repeatedly access that portion (temporal locality)
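A minimal C sketch of both kinds of locality in one loop (the array size `N` is arbitrary):

```c
#include <stdio.h>

#define N 1024

/* Summing an array exhibits both kinds of locality:
 * - spatial: a[i] walks through consecutive addresses, so each
 *   cache block fetched from memory is fully used;
 * - temporal: sum and i are reused on every iteration, so they
 *   stay cached (or live in registers). */
int main(void) {
    static int a[N];   /* zero-initialized */
    int sum = 0;
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }
    printf("sum = %d\n", sum);
    return 0;
}
```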
Bane of Locality: Pointer Chasing¶
- Pointer-based data structures: linked lists, trees, etc.
    - Easy to append onto and manipulate...
- But they have horrid locality properties
    - Every time you follow a pointer, it leads to an unrelated location: no spatial reuse from previous pointers
    - And if you don't chase the pointers again, you don't get temporal reuse either
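For contrast, a minimal C sketch of pointer chasing; the `node` type and the list-building loop are hypothetical, and with `malloc` the nodes may land at scattered addresses:

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Each step follows next to a potentially unrelated address, so the
 * block fetched for one node gives no spatial reuse for the next one;
 * traversed only once, there is no temporal reuse either. */
int sum_list(const struct node *head) {
    int sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

int main(void) {
    struct node *head = NULL;
    for (int i = 0; i < 4; i++) {      /* build a tiny list */
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    printf("sum = %d\n", sum_list(head));
    while (head != NULL) {             /* free the list */
        struct node *next = head->next;
        free(head);
        head = next;
    }
    return 0;
}
```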
Simple Cache¶
- With a cache, the datapath/core does not directly access main memory;
- Instead, the core asks the cache for data, at improved speed;
- A hardware cache controller is devised to provide the desired data
Example: load word instruction `lw t0, 0(t1)`, with `t1: 0x12F0` and `1234` stored at memory address `0x12F0`; after the load, `t0: 1234`.

Without a cache:

1. Processor issues address `0x12F0` to memory
2. Memory reads `1234` at address `0x12F0`
3. Memory sends `1234` to the processor
4. Processor loads `1234` into register `t0`

With a cache:

1. Processor issues address `0x12F0` to the cache
2. Cache checks if the data at address `0x12F0` is in it
    - If it is in the cache: cache hit, read `1234`
    - If not matched: cache miss, and
        1. Cache sends address `0x12F0` to memory
        2. Memory reads address `0x12F0` and sends `1234` to the cache
        3. Due to limited size, the cache replaces some data with `1234`
3. Cache sends `1234` to the processor
4. Processor loads `1234` into register `t0`
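A minimal C sketch of the controller's hit-or-miss decision above, assuming a hypothetical one-entry cache and a modeled main memory that holds `1234` at `0x12F0`:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical one-entry cache for illustration. */
struct cache_entry {
    bool     valid;
    uint16_t addr;   /* which memory address this entry holds */
    int      data;
};

static struct cache_entry entry;

static int memory_read(uint16_t addr) {
    return addr == 0x12F0 ? 1234 : 0;    /* the example's memory content */
}

/* Hit -> return the cached data; miss -> fetch from memory,
 * replace the entry, then return the data. */
int cache_read(uint16_t addr) {
    if (entry.valid && entry.addr == addr) {
        return entry.data;               /* cache hit */
    }
    int data = memory_read(addr);        /* cache miss: go to memory */
    entry = (struct cache_entry){ true, addr, data };  /* replace */
    return data;
}

int main(void) {
    printf("%d\n", cache_read(0x12F0));  /* miss: fetched from memory */
    printf("%d\n", cache_read(0x12F0));  /* hit: served by the cache */
    return 0;
}
```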
Typical Values¶
- L1 cache
    - size: tens of KB
    - hit time: complete in one clock cycle
    - miss rate: 1% to 5%
- L2 cache
    - size: hundreds of KB
    - hit time: few clock cycles
    - miss rate: 10% to 20%
        - The L2 miss rate is the fraction of L1 misses that also miss in L2.
Cache Terminology¶
- Cache line/block: a single entry in the cache
- Cache line/block size: #bytes per cache line/block
- Capacity: total #bytes that can be stored in a cache
Cache "Tag"¶
- We need a way to tell whether the cache has a copy of a location in memory, so that it can decide on a hit or a miss;
- On a cache miss, put the memory address of the block in the "tag" field of the cache block.
Understanding Cache Misses: 3Cs¶
- Compulsory (cold start or process migration, 1st reference):
    - First access to a block, impossible to avoid; small effect for long-running programs
    - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
- Capacity:
    - Cache cannot contain all blocks accessed by the program
    - Solution: increase cache size (may increase access time)
- Conflict (Collision):
    - Multiple memory locations map to the same cache location, causing conflicts even when the cache has not reached full capacity (see the sketch after this list)
    - Solution 1: increase cache size
    - Solution 2: increase associativity (may increase access time)
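A minimal C sketch of conflict misses, assuming a hypothetical 16-set direct-mapped placement (direct mapping is detailed later); the two addresses share an index, so they keep evicting each other even though the cache is nearly empty:

```c
#include <stdio.h>
#include <stdint.h>

#define SETS 16

/* Tag-only direct-mapped cache: 4 offset bits, 4 index bits. */
static struct { int valid; uint16_t tag; } cache[SETS];

int is_hit(uint16_t addr) {
    unsigned index = (addr >> 4) % SETS;
    uint16_t tag   = addr >> 8;
    if (cache[index].valid && cache[index].tag == tag)
        return 1;
    cache[index].valid = 1;   /* miss: install the block */
    cache[index].tag   = tag;
    return 0;
}

int main(void) {
    /* 0x0040 and 0x0440 map to the same set (index 4): every
     * access below is a conflict miss. */
    uint16_t trace[] = { 0x0040, 0x0440, 0x0040, 0x0440 };
    for (int i = 0; i < 4; i++)
        printf("0x%04X: %s\n", trace[i], is_hit(trace[i]) ? "hit" : "miss");
    return 0;
}
```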
Block Replacement Policy¶
Least Recently Used (LRU)¶
- Replace the entry that has not been used for the longest time, i.e. has the oldest previous access.
- Pro: Temporal locality!
    - Recent past use implies likely future use
- Con: Complicated hardware to keep track of access history
    - Add extra information to record cache usage.
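A minimal C sketch of LRU bookkeeping, assuming a hypothetical 4-way set with per-way timestamps as the extra usage information:

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

/* Each way records the "time" of its last access; the victim
 * is the way with the oldest timestamp. */
struct way {
    int      valid;
    uint32_t tag;
    unsigned last_used;   /* extra information recording usage */
};

static struct way set[WAYS];
static unsigned now;      /* global access counter */

int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < WAYS; i++)
        if (!set[i].valid ||
            (set[victim].valid && set[i].last_used < set[victim].last_used))
            victim = i;
    return victim;
}

void access(uint32_t tag) {
    now++;
    for (int i = 0; i < WAYS; i++)
        if (set[i].valid && set[i].tag == tag) {
            set[i].last_used = now;   /* hit: refresh the timestamp */
            return;
        }
    int v = lru_victim();             /* miss: evict the LRU way */
    set[v] = (struct way){ 1, tag, now };
}

int main(void) {
    /* Tags 1-4 fill the set; re-accessing 1 refreshes it, so the
     * access to 5 evicts tag 2, the least recently used. */
    uint32_t trace[] = { 1, 2, 3, 4, 1, 5 };
    for (int i = 0; i < 6; i++) access(trace[i]);
    for (int i = 0; i < WAYS; i++)
        printf("way %d: tag %u\n", i, set[i].tag);
    return 0;
}
```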
Most Recently Used (MRU)¶
- Replace the entry that has the newest previous access
First In First Out (FIFO)¶
- Replace the oldest line in the set (queue)
Last In First Out (LIFO)¶
- Replace the most recent line in the set (stack)
Random¶
- Replace a randomly chosen line in the set
Cache Mapping¶
Fully Associative Cache¶
An arbitrary memory address can go to any cache block.
- Cache size: total size of the cache (\(C\))
- Cache block size: \(C_B \to\) decides the number of offset bits (\(o\)): \(2^o = C_B\)
- Number of cache blocks (\(N\)): \(N = C / C_B\)
- Bit width of memory address (\(w\)): 16-bit in our examples
- Bit width of tag (\(t\)): \(t = w - o\) (no index bits; the tag must be compared against every block)
Direct Mapped Cache¶
The data at a memory address can be stored at exactly one possible block in the cache.
- Cache capacity/size: total size of the cache (\(C\))
- Cache block size: \(C_B \to\) decides the number of offset bits (\(o\)): \(2^o = C_B\)
- Number of cache blocks (\(N\)): \(N = C / C_B\)
- Bit width of memory address (\(w\))
- Bit width of Index (\(i\)): \(i = \log_2(N)\)
- Bit width of tag (\(t\)): \(t = w - o - i\)
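A minimal C sketch of this breakdown, assuming a hypothetical 4 KB direct-mapped cache with 16-byte blocks (\(o = 4\), \(i = 8\), \(t = 4\)) and the 16-bit addresses used in these examples:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* 16-byte blocks  -> o = 4 */
#define INDEX_BITS  8   /* 256 blocks      -> i = 8 */

int main(void) {
    uint16_t addr   = 0x12F0;
    uint16_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint16_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint16_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* prints: addr 0x12F0 -> tag 0x1, index 0x2F, offset 0x0 */
    printf("addr 0x%04X -> tag 0x%X, index 0x%X, offset 0x%X\n",
           addr, tag, index, offset);
    return 0;
}
```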
Set Associative Cache¶
The data can only be stored at one index, but the set at that index contains multiple blocks/slots.
- Cache capacity/size: total size of the cache (\(C\))
- Cache block size: \(C_B \to\) decides the number of offset bits (\(o\)): \(2^o = C_B\)
- Number of cache blocks (\(M\)): \(M = C / C_B\)
- Bit width of memory address (\(w\))
- \(N\)-way SA cache: \(N\) cache blocks in a set
- Number of sets (\(S\)): \(S = M / N\)
- Bit width of Index (\(i\)): \(i = \log_2(S) = \log_2(M / N)\)
- Bit width of tag (\(t\)): \(t = w - o - i\)
- Given \(N\)-way, \(t\), \(i\), and \(o\), total capacity = \(2^o \times 2^i \times N\)
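A minimal C sketch of these formulas, assuming a hypothetical 32 KB, 4-way set-associative cache with 64-byte blocks and 32-bit addresses:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    unsigned capacity = 32 * 1024;   /* C   */
    unsigned block    = 64;          /* C_B */
    unsigned ways     = 4;           /* N   */
    unsigned w        = 32;          /* address width */

    unsigned blocks = capacity / block;   /* M = C / C_B = 512 */
    unsigned sets   = blocks / ways;      /* S = M / N   = 128 */
    unsigned o = (unsigned)log2(block);   /* offset bits = 6  */
    unsigned i = (unsigned)log2(sets);    /* index bits  = 7  */
    unsigned t = w - o - i;               /* tag bits    = 19 */

    printf("M=%u S=%u o=%u i=%u t=%u\n", blocks, sets, o, i, t);
    return 0;
}
```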
| Feature | Direct Mapped | Set Associative | Fully Associative |
| --- | --- | --- | --- |
| Hit time | Fast | Mid | Slow |
| Miss rate | High | Mid | Low |
| Miss penalty | \(\sim\) | \(\sim\) | \(\sim\) |
Write Policy¶
- Store instructions write to memory, which changes values.
- Hardware needs to ensure that cache and memory have consistent information.
Write hit:
- Write-through: write to both cache and memory at the same time.
    - (more writes to memory \(\to\) longer time)
- Write-back: write the data in the cache and set a dirty bit to 1 (sketched after this list).
    - When this block gets replaced from the cache, write it "back" to memory.
    - (not the "Write back" phase in the pipeline)
Write miss:
- Write-allocate: allocate a space in the cache for this write (cache block replacement)
    - Update LRU
    - Set the dirty bit and follow the write-back policy; or write through
- No-write-allocate: the data is written directly to main memory without loading it into the cache.
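A minimal C sketch of write-back with write-allocate, assuming a hypothetical one-entry cache; memory sees a write only when a dirty line is evicted:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical one-entry write-back cache. */
struct line {
    bool     valid, dirty;
    uint16_t addr;
    int      data;
};

static struct line line;

static void memory_write(uint16_t addr, int data) {
    printf("memory[0x%04X] <- %d\n", addr, data);
}

void cache_write(uint16_t addr, int data) {
    if (line.valid && line.addr == addr) {   /* write hit */
        line.data  = data;                   /* only the cache changes */
        line.dirty = true;
        return;
    }
    if (line.valid && line.dirty)            /* evict: write back now */
        memory_write(line.addr, line.data);
    line = (struct line){ true, true, addr, data };  /* write-allocate */
}

int main(void) {
    cache_write(0x12F0, 1234);  /* miss: allocate, mark dirty */
    cache_write(0x12F0, 5678);  /* hit: memory is not touched */
    cache_write(0x3400, 42);    /* conflict: 5678 written back first */
    return 0;
}
```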
Cache Performance & Metrics¶
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 – Hit rate
- Miss penalty: time to replace a line/block from a lower level of the memory hierarchy into the cache
- Hit time: time to access cache memory (including tag comparison)
AMAT¶
AMAT: Single-Level Cache¶
Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache.
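In formula form, with a worked example assuming a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty:

\(\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}\)

e.g. \(\text{AMAT} = 1 + 0.05 \times 100 = 6\) cycles.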
AMAT: Multi-Level Cache¶
- Local miss rate: the fraction of references to one level of a cache that miss
- Global miss rate: the fraction of references that miss in all levels of a multilevel cache
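For a two-level cache, using the local L2 miss rate:

\(\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \big(\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}\big)\)

and the global L2 miss rate is \(\text{Miss rate}_{L1} \times \text{Miss rate}_{L2}\).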
Improve Cache Performance¶
- Reduce hit time: use a smaller cache, but this may increase miss rate (capacity misses)
- Reduce miss rate:
    - Program dependent;
    - Larger capacity (may decrease capacity misses but increase hit time and hardware cost);
    - Higher associativity (may reduce conflict misses; but requires extra consideration for the replacement policy and increases hardware cost)
    - Larger cache blocks (may reduce compulsory misses; better use of spatial locality; but may harm temporal locality, with recently-used data evicted)
- Reduce miss penalty:
    - Prefetch
    - Increase levels of cache
    - Victim cache
Advanced Cache¶
Processors with Shared Memory¶
Multiprocessor with Shared Memory¶
- A multiprocessor with shared memory offers multiple cores/processors a single, shared, coherent memory.
- It should really be called a shared-address multiprocessor, because all processors share a single physical address space (more later, with VM)
Multiprocessor Cache¶
- Memory is a performance bottleneck even with one processor.
- Use private caches to reduce bandwidth demands on main memory!
- Only cache misses have to access the shared common memory
- Each core/processor has its own cache
- All cores communicate with each other and memory through a bus
- One memory shared by all cores
Cache Coherence¶
- Coherent: any read of a data item returns the most recently written value of that data item
- Because there is shared memory, a computer architect must design the system to keep cache values coherent.
- Idea: when any processor has a cache miss or writes, use the bus to notify the other processors.
    - If only reading, many processors can have copies
    - If a processor writes, invalidate any other copies.
- One cache coherence protocol: each cache controller "snoops" for write transactions on the common bus
    - The bus is a broadcast medium
    - On any write transaction on the bus, each controller checks if its own cache has a copy
    - If it exists, then invalidate its own cache's copy
Coherence Miss¶
New cache miss type: coherence miss (a.k.a. communication miss), caused by writes to shared data made by other processors.
- For some parallel programs, coherence misses can dominate total misses;
- The 4th "C" of cache misses
Snoopy Cache¶
Write Invalidate¶
- Processor \(k\), wanting to write to an address, grabs a bus cycle and sends a "write invalidate" message
- All the other snooping caches invalidate their copy of the appropriate cache line
- Processor \(k\) writes to its cached copy (assume for now that it also writes through to memory)
- Any shared read in the other processors will now miss in the cache and refetch the new data.
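A minimal C sketch of the invalidation step, assuming a hypothetical valid/invalid state per processor for a single shared line:

```c
#include <stdio.h>

#define NPROC 3

/* Hypothetical per-processor line state for one shared address. */
enum state { INVALID, VALID };
static enum state line[NPROC];

/* Processor k writes: broadcast "write invalidate" on the bus;
 * every other snooping cache drops its copy. */
void bus_write_invalidate(int k) {
    for (int p = 0; p < NPROC; p++)
        if (p != k)
            line[p] = INVALID;
    line[k] = VALID;   /* the writer keeps the only valid copy */
}

int main(void) {
    for (int p = 0; p < NPROC; p++) line[p] = VALID;  /* all share a copy */
    bus_write_invalidate(1);                          /* processor 1 writes */
    for (int p = 0; p < NPROC; p++)
        printf("P%d: %s\n", p, line[p] == VALID ? "VALID" : "INVALID");
    return 0;
}
```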
Write Update¶
- A CPU wanting to write grabs a bus cycle and broadcasts the new data as it updates its own copy
- All snooping caches update their copy
Advanced Cache¶
- Inclusiveness of multi-level caches
    - If all blocks in the higher-level cache are also present in the lower-level cache, then the lower-level cache is said to be inclusive of the higher-level cache.