Agenda
NVM Evolution
Persistent Memory Linux Software Stack
Using and Emulating PMEM on Linux
Remote PMEM
Micro Storage Architecture
NVM Evolution
Persistent Memory
Yesterday: battery-backed RAM
Today: NVDIMMs combining RAM + flash
On power down, contents are copied to flash; on power up, copied back to RAM
Emerging NVDIMM media: PCM, 3D XPoint, memristor, etc.
Offer ~1000x the speed of NAND -> much closer to RAM
Characteristics as seen by software: a synchronous model
Access via load/store memory instructions
No paging
Latency is low enough that stalling the CPU during the access is reasonable
With new-generation hardware, the NVM itself is no longer the bottleneck
But it is still limited by block-stack latency + the asynchronous I/O model
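To make the synchronous model concrete, a minimal sketch (the pointer is assumed to come from a byte-addressable persistent mapping, e.g. the DAX mmap shown later in this deck): the access is an ordinary CPU store, not an I/O request.

    #include <stdint.h>

    /* 'counter' is assumed to point into a byte-addressable persistent
     * mapping (e.g. obtained via mmap on a DAX file, shown later). */
    void bump(uint64_t *counter)
    {
        *counter += 1;   /* a plain load + store: the CPU simply stalls for the
                            short media latency; no block I/O is built or queued.
                            Durability still needs a cache flush, covered later. */
    }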
Asynchronous Model: NVMe
"When Poll is Better than Interrupt", Yang et al., USENIX FAST 2012: https://www.usenix.org/legacy/events/fast12/tech/full_papers/Yang.pdf
● Active polling (sync) gives lower latency, at the expense of CPU, vs MSI-X interrupts (async)
● Used in Intel SPDK
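A schematic of the trade-off (not the SPDK API, just the general polling pattern): the waiter spins on a completion flag instead of sleeping until an MSI-X interrupt wakes it, trading a busy core for lower completion latency.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Illustrative completion flag set by the device/driver when the I/O is
     * done; a real NVMe driver polls the completion queue's phase bit instead. */
    extern _Atomic bool io_done;

    /* Sync path: burn a core spinning, observe the completion almost as soon
     * as the device posts it (no interrupt delivery or context switch). */
    static inline void wait_by_polling(void)
    {
        while (!atomic_load_explicit(&io_done, memory_order_acquire))
            ;   /* spin */
    }

    /* The async path would instead sleep and be woken by the MSI-X interrupt
     * handler, freeing the CPU but adding wakeup latency to every I/O. */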
Enter Persistent Memory
[Figure: Intel data comparing the latency of a 4KB block read vs a 64B cache-line read]
Moving away from Block I/O
[Figure: latency vs. access characteristics]
This leads to a new tiered software stack
Challenge: Durability
PMEM Linux Software Stack
Linux kernel (>4.2) subsystem
NVDIMM Software Architecture
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
BTT vs DAX
BTT: Block Translation Table
Provides atomic sector update semantics for persistent memory devices
Applications that rely on sector writes not being torn can continue to do so
For legacy applications
DAX: Direct Access
Allows mapping a pmem range directly into userspace via mmap
If the application is aware of persistent, byte-addressable memory, and can use it to an advantage, DAX is the best path for it
If the application relies on atomic sector update semantics, it must use the BTT
Note that PMEM pages are not backed by struct page, only by a PFN (so far)
Using and Emulating PMEM on Linux
Kernel Config ( > 4.2 )
Enable NVDIMM dynamic debug before you start playing with NVDIMMs
Add to the kernel cmdline: libnvdimm.dyndbg nfit.dyndbg nd_pmem.dyndbg nd_blk.dyndbg ignore_loglevel
Pick your PMEM
Use ACPI 6.0-compatible NVDIMM hardware or legacy NVDIMMs
Use virtual NVDIMMs provided by a hypervisor
Use RAM as persistent memory
PCMSIM: NVM-disk Emulation
Emulation: RAM as PMEM
Bare metal:
Adding 'memmap=16G!16G' to the kernel boot parameters will reserve 16G of memory, starting at offset 16G
cat /proc/cmdline:
BOOT_IMAGE=/boot/vmlinuz-4.3.0-1-default root=UUID=39635fd6-64ee-4538-9964-7de6bb181181 resume=/dev/sda1 splash=silent quiet showopts memmap=1G!5G memmap=1G!7G
BTT works
QEMU NVDIMM
Qemu:
qemu-system-x86_64 -object memory-backend-file,share,id=mem1,mem-path=/dax/D1 -device nvdimm,memdev=mem1,reserve-label-data,id=nv1 -m 2048,maxmem=100G,slots=10 ...
Not yet in upstream Qemu:
https://github.com/xiaogr/qemu/tree/nvdimm-v9
Seabios integration:
http://www.seabios.org/pipermail/seabios/2015-September/009770.html
Still missing some features + high overhead for some operations
Supports PMEM only -> good for NFIT development
Playing with DAX
Only ext2, ext4 and xfs currently support DAX
Note that the block size should match the page size
mkfs.ext4 -b 4096 /dev/pmem1
mount -t ext4 -o dax /dev/pmem1 /tmp/dax/
Playing with DAX - Cont
Then you just have to mmap it!
But remember: CLFLUSH, etc., is still needed for durability
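A minimal sketch of that flow, assuming the ext4-DAX mount from the previous slide; the file name is made up, and the CLFLUSH-per-cache-line loop is just one way to make the stores durable (msync would also work):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <emmintrin.h>          /* _mm_clflush, _mm_sfence */

    #define LEN 4096

    int main(void)
    {
        /* "/tmp/dax/example" is a hypothetical file on the dax-mounted ext4 fs */
        int fd = open("/tmp/dax/example", O_CREAT | O_RDWR, 0644);
        if (fd < 0 || ftruncate(fd, LEN) != 0)
            return 1;

        /* With -o dax the mapping goes straight to the pmem, no page cache */
        char *pmem = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (pmem == MAP_FAILED)
            return 1;

        strcpy(pmem, "hello, persistent world");   /* plain CPU stores */

        /* Durability: flush the touched cache lines, then fence */
        for (size_t off = 0; off < LEN; off += 64)
            _mm_clflush(pmem + off);
        _mm_sfence();

        munmap(pmem, LEN);
        close(fd);
        return 0;
    }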
NVML: let somebody else do the heavy lifting
http://pmem.io/
libpmem – Basic persistence handling
libvmmalloc - Transparently converts all dynamic memory allocations into persistent memory allocations
libpmemblk – Block access to pmem
libpmemlog - Log file on pmem (append-mostly)
libpmemobj - Transactional Object Store on pmem
Many more… pynvm, C++ bindings, etc.
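A minimal libpmem sketch, assuming NVML's libpmem is installed and the DAX mount from earlier (the file name is hypothetical): pmem_map_file maps a file on pmem-aware storage and pmem_persist issues the right flush instructions for the platform. Build with cc example.c -lpmem.

    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* "/tmp/dax/nvml-example" is a made-up file on the DAX mount */
        char *addr = pmem_map_file("/tmp/dax/nvml-example", 4096,
                                   PMEM_FILE_CREATE, 0644,
                                   &mapped_len, &is_pmem);
        if (addr == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        strcpy(addr, "hello from libpmem");

        if (is_pmem)
            pmem_persist(addr, mapped_len);   /* cache-line flushes + fence */
        else
            pmem_msync(addr, mapped_len);     /* fallback for non-pmem mappings */

        pmem_unmap(addr, mapped_len);
        return 0;
    }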
Remote PMEM
Remote NVMe : using RDMA to transfer NVMe commands & data
http://blog.pmcs.com/flash-memory-summit-2015-special-nvm-express-rdma-awesome/
Transitioning from Indirect to Direct Flow
● Project Donard (PMC - Microsemi)
● struct page-backed PMEM patch (I/O memory is normally accessed via PFN only)
Comes with a challenge: durability vs visibility
http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/ChetDouglas_RDMA_with_PM.pdf
RDMA + DDIO
RDMA + Non Allocating write
Peer 2 Peer : Bypassing CPU + SW bottleneck
● NVM HW - Expose BAR address
● March 16 : RFC patchset for DAX allowing DMA to I/O mem
● CCIX fabric
● Use case:
  ○ Pre-process in the data path
  ○ Avoid RAM buffers (HMM style)
  ○ SW only fetches what is necessary
Future Hyperscale Architecture
NVMe gravy train for 3-5 years
Transition to PMEM-optimised apps, and
a natural evolution from Ethernet-connected drives => fabric-connected PMEM
Durable Array of Wimpy Nodes
Direct PMEM
Low-power, high-performance K/V storage
Use a pluggable front end
Rearranged based on needs
Links
Driver specs: http://pmem.io/documents/
NVDIMM Namespace Specification: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
NVDIMM Driver Writers Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
NVDIMM DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
Linux docs: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/nvdimm/nvdimm.txt
Qemu: https://github.com/xiaogr/qemu/tree/nvdimm-v9
Seabios: http://www.seabios.org/pipermail/seabios/2015-September/009770.html
Libraries:
NVML: https://github.com/pmem/nvml/
pynvm: https://github.com/perone/pynvm
OpenNVM: http://opennvm.github.io/index.html
SPDK: https://github.com/spdk/spdk
Projects:
PMFS: https://github.com/linux-pmfs/pmfs
NOVA (NOn-Volatile memory Accelerated log-structured file system): https://github.com/NVSL/NOVA
PCMSIM: https://code.google.com/p/pcmsim/
Patch: Donard, a PCIe peer-to-peer kernel patch: https://github.com/sbates130272/donard
Adds struct page backing for I/O memory and as such allows I/O memory to be used as a DMA target: http://www.spinics.net/lists/linux-mm/msg103990.html
Thank You! Questions?
NVDIMM block I/O path