MongoDB Internals

Post on 11-Apr-2017

MongoDB
https://www.mongodb.com/

Prutha Date (dprutha1@umbc.edu)
Siraj Memon (siraj1@umbc.edu)

Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References

What is MongoDB?
• MongoDB is a NoSQL document-oriented database.
• It provides a flexible, semi-structured schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open source software.
• License: GNU Affero General Public License (AGPL) and Apache License
• MongoDB is a server process that runs on Linux, Windows and OS X. It can be run as either a 32-bit or a 64-bit application.

When to use MongoDB?
“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes, AOL)
• Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB (Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location based systems: makes use of Geospatial indices (Foursquare, City government of Chicago)

Terminologies – RDBMS vs MongoDB

*JSON – JavaScript Object Notation

Storage Internals - Directory Layout
• The data directory is found at /data/db by default.

Internal File Format

Extent Structure

Extents and Records

To Sum Up: Internal File Format
• Files on disk are broken into extents, which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially in size, up to 2 GB.
• Namespace entries in the ns (namespace) file point to the first extent for that collection.
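The capped exponential growth described above can be sketched in a few lines. This is only an illustration: the 64 MB starting size and the doubling factor are assumptions chosen for the example, and only the 2 GB ceiling comes from the slide.

```python
MAX_EXTENT_BYTES = 2 * 1024**3  # 2 GB ceiling stated in the slide


def extent_sizes(initial_bytes, growth_factor, count):
    """Return the sizes of `count` successive extents under capped exponential growth.

    `initial_bytes` and `growth_factor` are hypothetical parameters for this sketch.
    """
    sizes = []
    size = initial_bytes
    for _ in range(count):
        sizes.append(min(size, MAX_EXTENT_BYTES))
        size = min(size * growth_factor, MAX_EXTENT_BYTES)
    return sizes


# Assumed starting point: 64 MB extents that double each time.
sizes = extent_sizes(initial_bytes=64 * 1024**2, growth_factor=2, count=8)
for index, size in enumerate(sizes):
    print(f"extent {index}: {size // 1024**2} MB")
```

Under these assumed parameters the sizes stop growing once they reach the 2 GB ceiling.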

Virtual Address Space

Storage Engine - MMAP (Memory Mapped)
• All data files are memory mapped to virtual memory by the OS.
• MongoDB just reads from and writes to RAM in the filesystem cache.
• The OS takes care of the rest!
• Virtual process size = total files size + overhead (connections, heap)
• Uses memory-mapped files via the mmap() system call.
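To make the memory-mapping idea concrete, here is a minimal sketch using Python's standard mmap module. It is not MongoDB code: the file name datafile.0 and the 4096-byte size are hypothetical, and the point is only that writes go to mapped memory while the OS carries them to the file.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload, file_size=4096):
    """Write `payload` through a memory mapping, then read it back from the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "datafile.0")  # hypothetical file name
        # Preallocate the file so there is something to map.
        with open(path, "wb") as f:
            f.truncate(file_size)
        # Map the file and write to it as if it were an in-memory byte array.
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), file_size) as mapped:
                mapped[0:len(payload)] = payload
                mapped.flush()  # ask the OS to write dirty pages back to the file
        # Read through ordinary file I/O to confirm the data reached the file.
        with open(path, "rb") as f:
            return f.read(len(payload))


print(mmap_roundtrip(b"hello mmap"))
```

The application never issues an explicit write call for the payload; paging between RAM and disk is left to the operating system, which is the division of labour the slide describes.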

Storage Engine - WiredTiger
• Designed especially for write-intensive applications
• Document-level locking
• Compression and record-level locking
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log Structured Merge (LSM) trees for very high insert workloads

What makes MongoDB cool?
• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geo-Spatial Indexing

Sharding
• Horizontal scaling: divides the data set and distributes the data over multiple servers, or shards.
• Used to support deployments with very large data sets and high-throughput operations.
• Sharded cluster components:
  • Shards – mongod instances or replica sets
  • Config Servers – multiple mongod instances
  • Routing Instances – multiple mongos instances
• The data on the shards is divided into chunks, bounded by a maximum chunk size, using ranges of shard key values.
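Range-based chunking can be pictured as a sorted list of split points: a shard key value falls into the chunk whose range contains it, and each chunk lives on one shard. The sketch below is a simplified model, not the mongos routing code; the split points and shard names are hypothetical.

```python
import bisect

# Hypothetical split points: chunks are (-inf, 100), [100, 200), [200, 300), [300, +inf).
SPLIT_POINTS = [100, 200, 300]
# Hypothetical placement of those four chunks on three shards.
CHUNK_TO_SHARD = ["shardA", "shardB", "shardA", "shardC"]


def route(shard_key):
    """Return (chunk index, shard name) for a shard key value."""
    # bisect_right makes each chunk's lower bound inclusive.
    chunk_index = bisect.bisect_right(SPLIT_POINTS, shard_key)
    return chunk_index, CHUNK_TO_SHARD[chunk_index]


for key in (50, 100, 250, 999):
    print(key, route(key))
```

A query that specifies the shard key can be sent to a single shard this way, while a query without it has to be sent to every shard.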

Sharding Internals

Choosing a Shard Key
The choice of shard key affects:
• Distribution of reads and writes
  • Uneven distribution of reads/writes across shards.
  • Solution – Hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution – Multi-tenant compound index
• The number of shards each query hits
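The effect of hashing on write distribution can be illustrated with a small simulation. With monotonically increasing ids and range-based placement, all new inserts land on the shard that owns the highest range; hashing the id first spreads them out. Everything here is an assumption made for the illustration: four shards, 1000 ids per range, and MD5 as a stand-in hash (MongoDB's hashed indexes use their own function).

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4      # hypothetical cluster size
RANGE_WIDTH = 1000  # hypothetical: each shard owns a contiguous block of ids


def shard_by_range(doc_id):
    """Range placement: contiguous blocks of ids map to the same shard."""
    return min(doc_id // RANGE_WIDTH, NUM_SHARDS - 1)


def shard_by_hash(doc_id):
    """Hashed placement: MD5 is only a stand-in for a real hash function."""
    digest = hashlib.md5(str(doc_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# The most recent inserts all carry the largest ids.
new_ids = range(3000, 4000)
range_load = Counter(shard_by_range(i) for i in new_ids)
hash_load = Counter(shard_by_hash(i) for i in new_ids)

print("range-based:", dict(range_load))
print("hashed:     ", dict(hash_load))
```

The trade-off is that hashed placement scatters neighbouring ids, so range queries on the shard key can no longer be answered by a single shard.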

Aggregation Framework
• Aggregation Pipeline
• Map-Reduce
• Single Purpose Aggregation Operations (deprecated in latest version)

Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single collection through a pipeline.
• The pipeline transforms the documents into aggregated results, and is accessed through the aggregate database command.
• Operators: $match, $project, $unwind, $sort, $limit
• The user chooses which operators to apply and in what order.
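The pipeline idea, where each stage consumes the output of the previous one, can be sketched without a server. The tiny interpreter below is an assumption-laden stand-in for the aggregate command: it handles only equality $match, inclusion-only $project, single-key $sort and $limit, and the orders data is made up for the example.

```python
def run_pipeline(documents, pipeline):
    """Apply a small subset of pipeline stages to a list of dicts, in order."""
    results = list(documents)
    for stage in pipeline:
        (operator, spec), = stage.items()  # each stage holds exactly one operator
        if operator == "$match":      # equality conditions only
            results = [d for d in results
                       if all(d.get(k) == v for k, v in spec.items())]
        elif operator == "$project":  # field inclusion only
            results = [{k: d[k] for k in spec if spec[k] and k in d}
                       for d in results]
        elif operator == "$sort":     # one key; 1 = ascending, -1 = descending
            (field, direction), = spec.items()
            results = sorted(results, key=lambda d: d[field],
                             reverse=(direction == -1))
        elif operator == "$limit":
            results = results[:spec]
        else:
            raise ValueError(f"unsupported stage: {operator}")
    return results


# Made-up sample collection.
orders = [
    {"customer": "A", "status": "shipped", "total": 30},
    {"customer": "B", "status": "pending", "total": 50},
    {"customer": "C", "status": "shipped", "total": 75},
    {"customer": "D", "status": "shipped", "total": 10},
]

# Stage order matters: filter first, then sort, cut, and reshape.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$sort": {"total": -1}},
    {"$limit": 2},
    {"$project": {"customer": 1, "total": 1}},
]

top_shipped = run_pipeline(orders, pipeline)
print(top_shipped)
```

The pipeline itself is a list of single-key documents, which is the same general shape the aggregate command accepts; only the execution is simulated here.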

Aggregation Pipeline - Example

Continued…

Map-Reduce

Capped Collection
• A capped collection is a fixed-size collection.
• Create it with the db.createCollection command, marking it as capped.
• e.g. db.createCollection('logs', {capped: true, size: 2097152}) (size is in bytes; 2097152 bytes = 2 MB)
• When it reaches the size limit, the oldest documents are automatically removed.
• Guarantees preservation of the insertion order.
• Maintains insertion order identical to the order on disk by prohibiting updates that increase document size.
• Allows the use of a tailable cursor to retrieve documents.

GridFS
• GridFS is a specification for storing and retrieving files that exceed the BSON (binary JSON) document size limit of 16 MB.
• Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
• By default GridFS limits chunk size to 255 KB.
• GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16 MB but also for storing any files for which you want access without having to load the entire file into memory.
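The chunking scheme can be sketched as plain byte slicing. This is a simplified model rather than a GridFS driver: the helper functions and the file id are hypothetical, and the documents carry only a few fields (files_id, n, data for chunks; length and chunkSize for metadata).

```python
CHUNK_SIZE = 255 * 1024  # default chunk size from the slide, in bytes


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Build one chunk document per slice of the file, numbered from 0."""
    return [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]


def file_metadata(file_id, filename, data, chunk_size=CHUNK_SIZE):
    """Build the metadata document that describes the whole file."""
    return {"_id": file_id, "filename": filename,
            "length": len(data), "chunkSize": chunk_size}


def read_range(chunks, start, end, chunk_size=CHUNK_SIZE):
    """Read bytes [start, end) touching only the chunks that cover the range."""
    first, last = start // chunk_size, (end - 1) // chunk_size
    needed = b"".join(c["data"] for c in chunks[first:last + 1])
    offset = start - first * chunk_size
    return needed[offset:offset + (end - start)]


payload = bytes(range(256)) * 4096  # 1 MiB of sample data
chunks = split_into_chunks("file-1", payload)
meta = file_metadata("file-1", "sample.bin", payload)

print(len(chunks), "chunks; last chunk holds", len(chunks[-1]["data"]), "bytes")
print(read_range(chunks, 300000, 300010))
```

read_range shows why chunking helps with partial access: a ten-byte read in the middle of the file needs a single chunk, not the whole file.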

GeoSpatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB provides two special indexes:
  • 2d indexes, which use planar geometry when returning results.
  • 2dsphere indexes, which use spherical geometry to return results.
• Store location data as GeoJSON objects with this coordinate-axis order: longitude, latitude.
• Supported GeoJSON objects: Point, LineString, Polygon, etc.
• Query operations: inclusion, intersection, proximity.
• You cannot use a geospatial index as the shard key index.
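The longitude-first coordinate order and the idea of a proximity query on a sphere can be sketched as follows. This is an illustration only: the haversine formula and the 6371 km Earth radius are assumptions standing in for what a spherical index computes, and the near helper is hypothetical and scans every point instead of using an index.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def geojson_point(longitude, latitude):
    """Build a GeoJSON Point; note the order is longitude first, then latitude."""
    return {"type": "Point", "coordinates": [longitude, latitude]}


def haversine_km(p1, p2):
    """Great-circle distance between two GeoJSON Points, in kilometres."""
    lon1, lat1 = map(math.radians, p1["coordinates"])
    lon2, lat2 = map(math.radians, p2["coordinates"])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near(center, points, max_km):
    """Hypothetical proximity query: points within max_km, nearest first."""
    hits = [(haversine_km(center, p), p) for p in points]
    return [p for d, p in sorted(hits, key=lambda h: h[0]) if d <= max_km]


origin = geojson_point(0, 0)
candidates = [geojson_point(2, 0), geojson_point(1, 0), geojson_point(10, 0)]
print(near(origin, candidates, max_km=250))
```

Swapping longitude and latitude is a common mistake with GeoJSON, which is why the helper names its parameters explicitly.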

Performance Analysis
• Yahoo! Cloud Serving Benchmark (YCSB)
• Throughput (ops/second)

Workload                                       Cassandra   Couchbase   MongoDB
50% read, 50% update                             134,839     106,638   160,719
95% read, 5% update                              144,455     187,798   196,498
50% read, 50% update (durability optimized)        6,289       1,236    31,864
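To read the table as relative throughput, the figures can be divided pairwise. The short calculation below uses only the numbers reported in the table; the mongodb_ratio helper is a hypothetical name introduced for this example.

```python
# Throughput in ops/second, copied from the table above.
THROUGHPUT = {
    "50% read, 50% update":
        {"Cassandra": 134839, "Couchbase": 106638, "MongoDB": 160719},
    "95% read, 5% update":
        {"Cassandra": 144455, "Couchbase": 187798, "MongoDB": 196498},
    "50% read, 50% update (durability optimized)":
        {"Cassandra": 6289, "Couchbase": 1236, "MongoDB": 31864},
}


def mongodb_ratio(workload, other):
    """MongoDB throughput divided by the other system's, rounded to 2 places."""
    row = THROUGHPUT[workload]
    return round(row["MongoDB"] / row[other], 2)


for workload in THROUGHPUT:
    for other in ("Cassandra", "Couchbase"):
        print(f"{workload}: {mongodb_ratio(workload, other)}x {other}")
```

On these reported figures the gap is modest for the first two workloads and widest for the durability-optimized one.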

Limitations
• The working set needs to fit into memory; otherwise performance might suffer.
• MapReduce and aggregation are single-threaded; more specifically, one thread per mongod.
• No joins across collections.
• On 32-bit builds, data size is limited to about 2.5 GB.
• Sharding has its own restrictions. If you plan to shard your data, shard early, because some operations that are feasible on a single server are not feasible on a sharded collection.

Conclusion
• MongoDB is a semi-structured, document-oriented NoSQL database.
• It has two storage engines: MMAP and WiredTiger.
• Multiple aggregation frameworks: Aggregation Pipeline and Map-Reduce
• Support for GridFS, GeoSpatial Indexing, Capped Collection
• Higher throughput than Cassandra and Couchbase in the YCSB results shown.
• On-going work – In-memory and HDFS support

DEMO

Questions?

Thank you!