©2019 VMware, Inc.
Project PBerry: FPGA Acceleration for Remote Memory
Irina Calciu, Ivan Puddu, Aasheesh Kolli
Andreas Nowatzyk, Jayneel Gandhi, Onur Mutlu
Pratap Subrahmanyam
HotOS 2019
Recent remote memory systems
✧ Systems: Infiniswap, HotPot, Remote regions
✧ Same goal: pool remote memory
✧ Different interfaces: swapping, DSPM (distributed shared persistent memory), file system
✧ Same mechanism: page faults
Problem: Remote memory systems use page faults
Overheads:
1. Page fault overheads
2. Page granularity
Solution: Use cache coherence (w/ FPGA)
Implications:
1. Reduce page faults
2. Cache-line granularity
Impact:
1. Remote memory
2. Live migration
3. Code analysis
4. Security
5. …
Use cache coherence traffic to track application memory accesses
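Remote memory
[Diagram: an application host (local) running APP1, APP2, and APP3 on an OS with its own memory, connected over a microsecond network to a memory host (remote) with its own OS and memory; numbered pages 1 to 4 are shown in the memories]
1. Demand paging (fetch pages)
2. Evict pages (local → remote)
2’. Dirty data tracking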
How to track dirty data?
Current solution:
1. Write-protect page
2. On page fault: mark page dirty
[Diagram: application host with APP1, the OS, and memory holding pages 1 to 4; one page is marked X]
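The two steps of the current solution can be pictured with a toy model. This is a minimal sketch, not real OS code: `WriteProtectTracker`, its methods, and the 4 KB page size are assumptions made for illustration, and the "fault" is only a counter.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KB base pages


class WriteProtectTracker:
    """Toy model of write-protection-based dirty tracking."""

    def __init__(self, num_pages):
        self.memory = bytearray(num_pages * PAGE_SIZE)
        self.write_protected = [True] * num_pages  # step 1: protect every page
        self.dirty = set()                         # page numbers marked dirty
        self.faults = 0                            # simulated write page faults

    def write(self, addr, data):
        first = addr // PAGE_SIZE
        last = (addr + len(data) - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if self.write_protected[page]:
                # Step 2: the write faults; the handler marks the whole page
                # dirty and lifts the protection so later writes do not fault.
                self.faults += 1
                self.dirty.add(page)
                self.write_protected[page] = False
        self.memory[addr:addr + len(data)] = data

    def collect_dirty(self):
        """Return the dirty pages and re-protect them for the next round."""
        pages = sorted(self.dirty)
        for page in pages:
            self.write_protected[page] = True
        self.dirty.clear()
        return pages


tracker = WriteProtectTracker(num_pages=4)
tracker.write(10, b"ab")    # first write to page 0: one fault
tracker.write(20, b"cd")    # page 0 is already unprotected: no fault
tracker.write(5000, b"x")   # first write to page 1: one fault
print(tracker.faults, tracker.collect_dirty())
```

Note that a single written byte marks an entire page dirty, and every tracking round pays the cost of re-protecting pages and taking new faults.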
Overheads in current remote memory systems
1. Dirty data amplification (4KB, 2MB, 1GB)
• Breaking large pages to 4KB pages [Smarduch et al., CloudOpen-Japan 2015]
| Application | Amplification (4KB pages) | Amplification (2MB pages) | Amplification (64B cache-line) |
|---|---|---|---|
| Redis-random | 96.82% (3.87 KB) | 99.99% (1.99 MB) | 31.70% (20.29 B) |
| Redis-seq | 82.91% (3.32 KB) | 99.83% (1.99 MB) | 17.18% (10.99 B) |
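To make the effect of tracking granularity concrete, here is a small sketch over a synthetic write trace. It assumes amplification means the fraction of bytes tracked as dirty that were never actually written; the slide does not define the term, and the trace, function names, and numbers below are illustrative, not the Redis measurements.

```python
def dirty_bytes(writes, unit):
    """Bytes reported dirty when tracking at `unit`-byte granularity."""
    units = set()
    for addr, size in writes:
        first = addr // unit
        last = (addr + size - 1) // unit
        units.update(range(first, last + 1))
    return len(units) * unit


def amplification(writes, unit):
    """Fraction of tracked-dirty bytes that were not actually written."""
    written = set()
    for addr, size in writes:
        written.update(range(addr, addr + size))
    tracked = dirty_bytes(writes, unit)
    return (tracked - len(written)) / tracked


# Synthetic trace of (address, size) writes: 32 bytes spread over 3 pages.
writes = [(0, 8), (4096, 16), (8200, 8)]

for name, unit in [("4KB page", 4096), ("64B cache line", 64)]:
    print(name, dirty_bytes(writes, unit), f"{amplification(writes, unit):.2%}")
```

For this trace, page granularity reports 12288 dirty bytes while cache-line granularity reports 192, even though only 32 bytes changed.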
Overheads in current remote memory systems
1. Dirty data amplification (4KB, 2MB, 1GB)
• Breaking large pages to 4KB pages [Smarduch et al., CloudOpen-Japan 2015]
2. Application slowdown due to (write) page faults
3. Cost of write-protecting pages
Key idea: use cache coherence for tracking memory accesses
MESI protocol
States: Modified, Exclusive, Shared, Invalid
[Diagram: CPU caches and memory, with the MESI state machine; transitions between states are labeled Read, Write, and Flush]
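A minimal sketch of the MESI states named on this slide, seen from a single cache and handling only local reads, writes, and flushes. The event labels and the `other_sharers` flag are illustrative assumptions, and snooped requests from other caches are left out.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, other_sharers=False):
    """State of one cache line in one cache after a local event.

    `event` is "read", "write", or "flush" (hypothetical labels).
    `other_sharers` says whether another cache holds the line on a read miss.
    """
    if event == "read":
        if state is State.INVALID:
            # Read miss: the line arrives Shared if someone else has it,
            # otherwise Exclusive.
            return State.SHARED if other_sharers else State.EXCLUSIVE
        return state  # read hits do not change the state
    if event == "write":
        # Any local write leaves the line Modified (dirty).
        return State.MODIFIED
    if event == "flush":
        # Flushing (writing back if Modified) leaves the line Invalid.
        return State.INVALID
    raise ValueError(f"unknown event: {event}")


state = State.INVALID
for event in ["read", "write", "flush"]:
    state = next_state(state, event)
    print(event, "->", state.value)
```

The point for tracking is that a line only reaches Modified through a write, so the coherence state already encodes which cache lines are dirty.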
Field Programmable Gate Array (FPGA)
✧ Integrated circuit
✧ Can be reconfigured after manufacturing
✧ “Program the hardware”
Key idea: use cache coherence for tracking memory accesses
[Diagram: a CPU (with caches and memory) and an FPGA (with memory) joined by a coherent interconnect; a cache line is marked S (Shared)]
[Diagram, next step: the same CPU, FPGA, and coherent interconnect, with the cache line now marked M (Modified)]
FPGA primitives
✧ Read and write tracking using cache coherence
  ✧ Data pages
  ✧ Code pages
  ✧ OS pages (page tables)
✧ Local and remote copy
✧ In-place transform
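One way to picture the read and write tracking primitive is a recorder that files each observed access under its page and cache line. This is a sketch under stated assumptions: `AccessTracker`, `observe`, and the `("read" | "write", address)` event format are hypothetical, and the derivation of such events from coherence messages is not modeled.

```python
from collections import defaultdict

CACHE_LINE = 64   # bytes
PAGE_SIZE = 4096  # bytes; assumes 4 KB pages


class AccessTracker:
    """Records which cache lines of each page were read or written."""

    def __init__(self):
        self.read_lines = defaultdict(set)     # page -> line indexes read
        self.written_lines = defaultdict(set)  # page -> line indexes written

    def observe(self, kind, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        line = offset // CACHE_LINE
        if kind == "read":
            self.read_lines[page].add(line)
        elif kind == "write":
            self.written_lines[page].add(line)
        else:
            raise ValueError(f"unknown access kind: {kind}")

    def dirty_bytes(self, page):
        """Bytes to copy for `page` when only written lines are sent."""
        return len(self.written_lines.get(page, ())) * CACHE_LINE


tracker = AccessTracker()
for kind, addr in [("read", 0), ("write", 8), ("write", 70), ("write", 4100)]:
    tracker.observe(kind, addr)
print(tracker.dirty_bytes(0), tracker.dirty_bytes(1))
```

The same record could feed the copy and transform primitives, which would then operate on the written lines rather than on whole pages.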
Remote memory using a cache coherent FPGA
[Diagram: an application host (APP1, OS, CPU memory next to FPGA memory) connected over a microsecond network to a memory host (OS, memory); the memory on the application host is labeled "DRAM cache for remote memory"]
1. Demand paging (fetch pages)
2. Evict pages (local → remote)
2’. Dirty data tracking
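The three steps above can be sketched as a small local cache in front of a simulated memory host. Everything here is an assumption for illustration: the `RemoteMemoryCache` class, the FIFO eviction policy, the dictionary standing in for the memory host, and the page and cache-line sizes. It is not the authors' design.

```python
PAGE_SIZE = 4096  # bytes; assumed page size
CACHE_LINE = 64   # bytes; assumed cache-line size


class RemoteMemoryCache:
    """Local page cache that writes back only dirty cache lines."""

    def __init__(self, remote, capacity_pages):
        self.remote = remote            # page -> bytearray; stands in for the memory host
        self.capacity = capacity_pages
        self.local = {}                 # page -> bytearray; insertion order = FIFO age
        self.dirty_lines = {}           # page -> set of dirty line indexes
        self.bytes_written_back = 0

    def _ensure_local(self, page):
        if page in self.local:
            return
        if len(self.local) >= self.capacity:
            self._evict(next(iter(self.local)))  # step 2: evict the oldest page
        # Step 1: demand paging, fetch the page from the memory host.
        self.local[page] = bytearray(self.remote.get(page, bytes(PAGE_SIZE)))
        self.dirty_lines[page] = set()

    def _evict(self, page):
        data = self.local.pop(page)
        home = self.remote.setdefault(page, bytearray(PAGE_SIZE))
        for line in sorted(self.dirty_lines.pop(page)):
            start = line * CACHE_LINE
            home[start:start + CACHE_LINE] = data[start:start + CACHE_LINE]
            self.bytes_written_back += CACHE_LINE

    def read(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        self._ensure_local(page)
        return self.local[page][offset]

    def write(self, addr, value):
        page, offset = divmod(addr, PAGE_SIZE)
        self._ensure_local(page)
        self.local[page][offset] = value
        # Step 2': dirty data tracking at cache-line granularity.
        self.dirty_lines[page].add(offset // CACHE_LINE)


cache = RemoteMemoryCache(remote={}, capacity_pages=1)
cache.write(0, 7)      # page 0 fetched, one line dirtied
cache.write(4096, 9)   # page 1 fetched, page 0 evicted with 64 bytes written back
print(cache.read(0), cache.bytes_written_back)
```

With one byte changed per page, each eviction in this toy sends 64 bytes to the memory host instead of a full 4 KB page.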
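1. “Demand paging” using a cache coherent FPGA
[Diagram: application host (APP1, OS, CPU memory next to FPGA memory) and memory host (OS, memory) connected by a microsecond network; page 1 is shown on both hosts, and a region on the application host is labeled "Fake physical address space"]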
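1. “Demand paging” optimization: translation triggered prefetching
[Diagram: application host (APP1, OS, CPU memory next to FPGA memory) and memory host (OS, memory) connected by a microsecond network; page 1 is shown on both hosts, and a region on the application host is labeled "Fake physical address space"]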
Benefits
✧ Reduce (write) page faults
✧ Reduce dirty data amplification (cache-line)
✧ Determine access pattern, smart prefetching
✧ Downside: FPGA memory slower than CPU memory
Preliminary results (emulated)
✧ Redis-random:
  ✧ 89% less dirty data (cache-line vs. page granularity)
  ✧ 27% faster than WP (write-protection)
✧ Redis-seq:
  ✧ 50% less dirty data (cache-line vs. page granularity)
  ✧ 4% faster than WP (write-protection)
1. Demand paging (fetch pages)
2. Evict pages (local → remote)
2’. Dirty data tracking
Other use cases
| Use case | Page type | Primitive |
|---|---|---|
| Live migration | Data, page tables | Read/write tracking, remote copy |
| Compression, Encryption | Data | Read/write tracking, in-place transform |
| Replication, Checkpointing | Data | Write tracking, local/remote copy |
| Near-memory compute | Data | Read/write tracking, in-place transform |
| Code analysis | Code | Read tracking |
| Security | Data | Read/write tracking |
Problem: Remote memory systems use page faults
Solution: Use cache coherence (w/ FPGA)
Impact: Other use cases
Use cache coherence traffic to track application memory accesses
Contact: [email protected]
Thank you!