MongoDB Internals

MongoDB: https://www.mongodb.com/

Prutha Date (dprutha1@umbc.edu), Siraj Memon (siraj1@umbc.edu)

Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References

What is MongoDB?
• MongoDB is a document-oriented NoSQL database.
• It provides a flexible, semi-structured schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open-source software.
• License: GNU Affero General Public License (AGPL) and Apache License.
• MongoDB is a server process that runs on Linux, Windows, and OS X. It can run as either a 32-bit or a 64-bit application.

When to use MongoDB?

“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes, AOL)
• Messaging: vary message metadata easily per message or message type without needing to maintain separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB (Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location-based systems: make use of geospatial indexes (Foursquare, City government of Chicago)

Terminologies – RDBMS vs MongoDB

*JSON – JavaScript Object Notation

Storage Internals - Directory Layout
The data directory is found at /data/db by default.

Internal File Format

Extent Structure

Extents and Records

To Sum Up: Internal File Format
• Files on disk are broken into extents, which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially, up to 2 GB.
• Namespace entries in the ns (namespace) file point to the first extent for that collection.
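
The summary above can be pictured as a chain of extents hanging off a namespace entry. The Python sketch below is illustrative only: the class names, the 4 KB starting size, and the doubling factor are assumptions made for the example, not MongoDB's actual on-disk values. Only the 2 GB cap and the "namespace entry points to the first extent" relationship come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_EXTENT_BYTES = 2 * 1024**3   # 2 GB cap, as stated in the slides
FIRST_EXTENT_BYTES = 4 * 1024    # assumed starting size (hypothetical)
GROWTH_FACTOR = 2                # assumed growth factor (hypothetical)


@dataclass
class Extent:
    size: int                                   # capacity in bytes
    used: int = 0                               # bytes consumed by records
    records: List[bytes] = field(default_factory=list)
    next: Optional["Extent"] = None             # link to the following extent


@dataclass
class Namespace:
    name: str
    first_extent: Optional[Extent] = None       # the ns entry points here
    last_extent: Optional[Extent] = None

    def insert(self, record: bytes) -> None:
        """Append a record, allocating a larger extent when the last one is full."""
        ext = self.last_extent
        if ext is None or ext.used + len(record) > ext.size:
            ext = self._allocate(len(record))
        ext.records.append(record)
        ext.used += len(record)

    def _allocate(self, needed: int) -> Extent:
        prev = self.last_extent
        size = FIRST_EXTENT_BYTES if prev is None else prev.size * GROWTH_FACTOR
        size = min(max(size, needed), MAX_EXTENT_BYTES)
        ext = Extent(size=size)
        if prev is None:
            self.first_extent = ext
        else:
            prev.next = ext
        self.last_extent = ext
        return ext

    def extent_sizes(self) -> List[int]:
        """Walk the chain from the first extent and collect each capacity."""
        sizes, ext = [], self.first_extent
        while ext is not None:
            sizes.append(ext.size)
            ext = ext.next
        return sizes
```

Inserting thirteen 1 KB records into an empty `Namespace` fills a 4 KB extent, then an 8 KB one, and starts a 16 KB one, which is the exponential growth pattern the slide describes.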

Virtual Address Space

Storage Engine - MMAP (Memory Mapped)
• All data files are memory-mapped into virtual memory by the OS.
• MongoDB just reads and writes to RAM in the filesystem cache.
• The OS takes care of the rest!
• Virtual process size = total file size + overhead (connections, heap).
• Memory-mapped files are created with the mmap() system call.
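
To make the memory-mapping idea concrete, here is a minimal Python sketch using the standard `mmap` module. It is not MongoDB code: the function names and the single-page file are assumptions for illustration. It shows the pattern the slide describes, where the program writes to memory and the OS is responsible for getting the bytes to the file.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # map exactly one page to keep the example small


def write_through_mapping(path: str, payload: bytes) -> bytes:
    """Write payload into a file via a memory mapping, then read it back."""
    # Pre-size the file: a mapping cannot extend past the end of the file.
    with open(path, "wb") as f:
        f.write(b"\x00" * PAGE)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE) as mm:
            mm[0:len(payload)] = payload   # looks like a plain memory write
            mm.flush()                     # ask the OS to write dirty pages back
    # Read through the normal file API to confirm the data reached the file.
    with open(path, "rb") as f:
        return f.read(len(payload))


def demo() -> bytes:
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        return write_through_mapping(path, b'{"_id": 1}')
    finally:
        os.remove(path)
```

Calling `demo()` returns the same bytes that were assigned into the mapping, even though the write itself never used `file.write`.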

Storage Engine - WiredTiger
• Designed especially for write-intensive applications
• Document-level locking
• Compression and record-level locking
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log-Structured Merge (LSM) trees for very high insert workloads

What makes MongoDB cool?

• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geospatial Indexing

Sharding
• Horizontal scaling: divides the data set and distributes the data over multiple servers, or shards.
• Used to support deployments with very large data sets and high-throughput operations.
• Sharded cluster components:
  • Shards: mongod instances or replica sets
  • Config servers: multiple mongod instances
  • Routing instances: multiple mongos instances
• The data on the shards is divided into fixed-size chunks using ranges of shard key values.
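
Range-based chunking can be sketched as a sorted table of chunk lower bounds, with a binary search deciding which chunk (and therefore which shard) owns a key. The boundaries, shard names, and function names below are made up for illustration and are not taken from a real cluster or from the MongoDB router.

```python
import bisect
from typing import List, Set

# Hypothetical chunk table. Chunk i covers [LOWER_BOUNDS[i], LOWER_BOUNDS[i+1]).
LOWER_BOUNDS: List[float] = [float("-inf"), 100, 200, 300]
OWNERS: List[str] = ["shardA", "shardB", "shardA", "shardC"]


def chunk_index(shard_key_value: float) -> int:
    """Index of the chunk whose range contains the given shard key value."""
    return bisect.bisect_right(LOWER_BOUNDS, shard_key_value) - 1


def route(shard_key_value: float) -> str:
    """Shard that owns the chunk containing this shard key value."""
    return OWNERS[chunk_index(shard_key_value)]


def shards_for_range(low: float, high: float) -> Set[str]:
    """All shards a query on low <= key <= high would have to contact."""
    first, last = chunk_index(low), chunk_index(high)
    return set(OWNERS[first:last + 1])
```

A point lookup such as `route(150)` touches one shard, while `shards_for_range(150, 250)` spans two chunks and therefore two shards.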

Sharding Internals

Choosing a Shard Key
The choice of shard key affects:
• Distribution of reads and writes
  • Uneven distribution of reads/writes across shards.
  • Solution: hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution: multi-tenant compound index
• The number of shards each query hits
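
The effect of hashing ids on write distribution can be shown with a small simulation. This is a sketch under stated assumptions: `hashlib.md5` and the modulo step stand in for MongoDB's hashed index (which range-partitions the hashed values rather than taking a modulo), and the shard count and range boundaries are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4                       # hypothetical cluster size
BOUNDARIES = (1000, 2000, 3000)      # hypothetical range-chunk boundaries


def range_shard(key: int) -> int:
    """Range partitioning: shard index = number of boundaries at or below key."""
    return sum(key >= b for b in BOUNDARIES)


def hashed_shard(key: int) -> int:
    """Hash partitioning (simplified): hash the key, then pick a shard."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Fresh, monotonically increasing ids, such as timestamps or counters.
new_keys = range(5000, 6000)
range_load = Counter(range_shard(k) for k in new_keys)
hashed_load = Counter(hashed_shard(k) for k in new_keys)
```

With range partitioning every new id lands on the last shard (`range_load` has a single entry), while `hashed_load` spreads the same inserts across all four shards.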

Aggregation Framework
• Aggregation Pipeline
• Map-Reduce
• Single-Purpose Aggregation Operations (deprecated in latest version)

Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single collection through a pipeline.
• The pipeline transforms the documents into aggregated results and is accessed through the aggregate database command.
• Operators: $match, $project, $unwind, $sort, $limit
• The user chooses which operators make up the pipeline.
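
The pipeline idea can be mimicked in plain Python by treating each stage as a function from a list of documents to a list of documents. The function names mirror the operators listed above, but their semantics are deliberately simplified (equality-only matching, no `_id` handling), and the sample documents are invented for the example. This is a sketch of the concept, not the server's implementation.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Stage = Callable[[List[Doc]], List[Doc]]


def match(cond: Doc) -> Stage:
    """Keep documents whose fields equal every value in cond."""
    return lambda docs: [d for d in docs
                         if all(d.get(k) == v for k, v in cond.items())]


def project(fields: List[str]) -> Stage:
    """Keep only the named fields of each document."""
    return lambda docs: [{k: d[k] for k in fields if k in d} for d in docs]


def unwind(field: str) -> Stage:
    """Emit one document per element of the array stored in field."""
    return lambda docs: [{**d, field: item}
                         for d in docs for item in d.get(field, [])]


def sort(field: str, descending: bool = False) -> Stage:
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=descending)


def limit(n: int) -> Stage:
    return lambda docs: docs[:n]


def aggregate(docs: List[Doc], pipeline: List[Stage]) -> List[Doc]:
    """Feed the output of each stage into the next one."""
    for stage in pipeline:
        docs = stage(docs)
    return docs


orders = [  # made-up sample collection
    {"cust": "a", "status": "A", "items": ["pen", "ink"], "total": 30},
    {"cust": "b", "status": "B", "items": ["cap"], "total": 50},
    {"cust": "c", "status": "A", "items": ["pad"], "total": 10},
]

result = aggregate(orders, [
    match({"status": "A"}),
    unwind("items"),
    project(["cust", "items", "total"]),
    sort("total", descending=True),
    limit(2),
])
```

Reordering or dropping stages changes the result, which is the sense in which the user assembles the pipeline.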

Aggregation Pipeline - Example

Continued…

Map-Reduce

Capped Collection
• A capped collection is a fixed-size collection.
• Create one with the db.createCollection command and mark it as capped.
• e.g. db.createCollection('logs', {capped: true, size: 2097152})
• When it reaches the size limit, the oldest documents are automatically removed.
• Guarantees preservation of the insertion order.
• Maintains insertion order identical to the order on disk by prohibiting updates that increase document size.
• Allows the use of a tailable cursor to retrieve documents.
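
The eviction behaviour of a capped collection can be modelled with a byte budget and a queue. The `CappedLog` class below is a hypothetical in-memory stand-in written for illustration; it says nothing about how MongoDB lays out capped collections on disk.

```python
from collections import deque


class CappedLog:
    """Fixed-size, insertion-ordered store that evicts the oldest entries."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.docs = deque()   # leftmost element is the oldest document

    def insert(self, doc: bytes) -> None:
        if len(doc) > self.max_bytes:
            raise ValueError("document is larger than the whole collection")
        # Remove the oldest documents until the new one fits.
        while self.used + len(doc) > self.max_bytes:
            oldest = self.docs.popleft()
            self.used -= len(oldest)
        self.docs.append(doc)
        self.used += len(doc)
```

With a 10-byte cap, inserting three 4-byte documents leaves only the last two, still in insertion order.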

GridFS
• GridFS is a specification for storing and retrieving files that exceed the BSON (binary JSON) document size limit of 16 MB.
• Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
• By default, GridFS limits chunk size to 255 KB.
• GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16 MB but also for storing any files that you want to access without having to load the entire file into memory.
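
The chunking scheme can be sketched without a database. The helper names below are hypothetical and plain Python lists stand in for the two collections. The field names (`files_id`, `n`, `data`, `length`, `chunkSize`) follow the GridFS specification, and `read_range` illustrates the last bullet: serving part of a file from only the chunks that overlap it.

```python
from typing import Any, Dict, List, Tuple

CHUNK_SIZE = 255 * 1024  # the 255 KB default mentioned above


def split_file(file_id: int, filename: str, data: bytes,
               chunk_size: int = CHUNK_SIZE) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return one metadata document and the list of chunk documents."""
    files_doc = {"_id": file_id, "filename": filename,
                 "length": len(data), "chunkSize": chunk_size}
    chunks = [
        {"files_id": file_id, "n": n, "data": data[off:off + chunk_size]}
        for n, off in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunks


def read_range(chunks: List[Dict[str, Any]], chunk_size: int,
               start: int, length: int) -> bytes:
    """Read bytes [start, start + length), assuming length > 0 and in range."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    buf = b"".join(c["data"] for c in chunks[first:last + 1])
    offset = start - first * chunk_size
    return buf[offset:offset + length]
```

A 1 MiB payload splits into four full chunks plus a 4,096-byte tail, and a read that straddles the first chunk boundary only touches chunks 0 and 1.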

Geospatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB provides two special indexes:
  • 2d indexes, which use planar geometry when returning results.
  • 2dsphere indexes, which use spherical geometry to return results.
• Store location data as GeoJSON objects with this coordinate-axis order: longitude, latitude.
• GeoJSON objects supported: Point, LineString, Polygon, etc.
• Query operations: inclusion, intersection, proximity.
• You cannot use a geospatial index as the shard key index.
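
To illustrate the GeoJSON axis order and what a proximity query computes on a sphere, here is a brute-force Python sketch. It scans every document and applies the haversine formula, which is a stand-in for the concept and not how a geospatial index works internally. The function names, the Earth radius constant, and the approximate coordinates in the usage note are assumptions.

```python
import math
from typing import Any, Dict, List

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def point(lon: float, lat: float) -> Dict[str, Any]:
    """GeoJSON Point: coordinates are [longitude, latitude], in that order."""
    return {"type": "Point", "coordinates": [lon, lat]}


def haversine_m(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Great-circle distance in metres between two GeoJSON points."""
    lon1, lat1 = map(math.radians, a["coordinates"])
    lon2, lat2 = map(math.radians, b["coordinates"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def near(places: List[Dict[str, Any]], center: Dict[str, Any],
         max_distance_m: float) -> List[Dict[str, Any]]:
    """Places within max_distance_m of center, nearest first."""
    hits = [(haversine_m(p["loc"], center), p) for p in places]
    hits.sort(key=lambda t: t[0])
    return [p for d, p in hits if d <= max_distance_m]
```

For example, with places at roughly (-76.71, 39.26), (-76.61, 39.29), and (-87.63, 41.88), a 20 km search around the first point returns the first two and excludes the third.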

Performance Analysis
• Yahoo! Cloud Serving Benchmark (YCSB)
• Throughput (ops/second)

Workload                                      Cassandra   Couchbase   MongoDB
50% read, 50% update                            134,839     106,638   160,719
95% read, 5% update                             144,455     187,798   196,498
50% read, 50% update (durability optimized)       6,289       1,236    31,864

Limitations
• The working set needs to fit in memory; otherwise performance might suffer.
• Map-Reduce and aggregation are single-threaded: more specifically, one thread per mongod.
• No joins across collections.
• On 32-bit systems, it is limited to about 2.5 GB of data.
• Sharding has some unique exceptions. If you plan to shard your data, you need to shard early, as some things that are feasible on a single server are not feasible on a sharded collection.

Conclusion
• MongoDB is a semi-structured, document-oriented NoSQL database.
• It has two storage engines: MMAP and WiredTiger.
• Multiple aggregation frameworks: Aggregation Pipeline and Map-Reduce.
• Support for GridFS, geospatial indexing, and capped collections.
• Higher throughput than Cassandra and Couchbase in the YCSB results shown.
• Ongoing work: in-memory and HDFS support.

References
• https://www.mongodb.com/presentations/storage-engine-internals
• http://docs.mongodb.org/manual/core/data-modeling-introduction/
• http://docs.mongodb.org/manual/core/aggregation-introduction/
• https://2013.nosql-matters.org/bcn/wp-content/uploads/2013/12/storage-talk-mongodb.pdf
• http://info-mongodb-com.s3.amazonaws.com/High Performance Benchmark White Paper final.pdf
• https://www.mongodb.com/collateral/mongodb-architecture-guide
• Book: MongoDB: The Definitive Guide by Kristina Chodorow and Michael Dirolf

Questions?

Thank you!
