MongoDB Interview Questions

Follow the top basic and advanced MongoDB interview questions and turn yourself into an essential MongoDB Developer. We have covered the most-asked questions on how MongoDB supports efficient querying against array fields and also how recursive queries are supported within MongoDB. With these top interview questions on MongoDB, you will understand the detailed structure of MongoDB and the different applications of MongoDB. These will qualify you to become a MEAN Stack Developer, Backend engineer, front-end developer and many more.

Beginner

db.<collection>.find().skip(n).limit(n)

Note: n is the page size. The command above returns the second page; for the first page skip() is not needed, and in general page p is fetched with skip((p - 1) * n).limit(n).

limit(n) limits the documents returned from the cursor to n, and skip(n) skips the first n documents from the cursor.
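
The skip/limit arithmetic can be sketched in plain JavaScript. The helper names pageWindow and paginate, and the 1-based page numbering, are assumptions made for illustration; they are not part of MongoDB, and the array slice only imitates what the cursor does.

```javascript
// Hypothetical helper: compute skip/limit values for a 1-based page number.
function pageWindow(page, pageSize) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be an integer >= 1");
  }
  // Page 1 skips nothing; page p skips the (p - 1) pages before it.
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// Imitate cursor.skip(s).limit(l) on an in-memory array of documents.
function paginate(docs, page, pageSize) {
  const { skip, limit } = pageWindow(page, pageSize);
  return docs.slice(skip, skip + limit);
}

const docs = [1, 2, 3, 4, 5, 6, 7].map((n) => ({ _id: n }));
console.log(paginate(docs, 2, 3).map((d) => d._id)); // [ 4, 5, 6 ]
```

In the shell, the same values would be passed to skip() and limit(), usually together with a sort() so that the page order is stable.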

This can be achieved in MongoDB using the $type operator. A null value, i.e., BSON type null has the type number 10.  Using this type number, only those documents can be retrieved whose value is null. 

Take the example of the two documents below in the startup collection

{ _id: 1, name: "XYZ Tech", website: null },   { _id: 2, name: "ABC Pvt Ltd" }

The query { website : { $type: 10 } } will retrieve only those documents where the website is null, in the above case it would be the startup “XYZ Tech”

Note: The query { website : null } on the other hand will match documents where the website is null or the documents where the website field does not exist. For the above collection data, this query will return both the startups.

The $exists operator, when set to true, matches only those documents that contain the field specified in the query, including documents where the field value is null.

For the following documents in the employee collection

{ _id: 1, name: "Jonas", linkedInProfile: null },   { _id: 2, name: "Williams" }

The query { linkedInProfile: { $exists: true } } will return only the employee “Jonas” 
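
To make the difference between the three query shapes concrete ({ field: { $type: 10 } }, { field: null } and { field: { $exists: true } }), here is a small plain-JavaScript sketch. The predicate names are hypothetical stand-ins that only imitate the matching rules described above; they are not MongoDB APIs.

```javascript
// { field: { $type: 10 } } -> the field is present and its value is null
const isTypeNull = (doc, field) => field in doc && doc[field] === null;

// { field: null } -> the field is null OR the field is missing
const equalsNull = (doc, field) => !(field in doc) || doc[field] === null;

// { field: { $exists: true } } -> the field is present, even if it is null
const fieldExists = (doc, field) => field in doc;

const employees = [
  { _id: 1, name: "Jonas", linkedInProfile: null },
  { _id: 2, name: "Williams" },
];

// Return the names of the employees matched by a predicate.
const names = (pred) =>
  employees.filter((d) => pred(d, "linkedInProfile")).map((d) => d.name);

console.log(names(isTypeNull));  // [ 'Jonas' ]
console.log(names(equalsNull));  // [ 'Jonas', 'Williams' ]
console.log(names(fieldExists)); // [ 'Jonas' ]
```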

In MongoDB we have built-in roles as well as custom roles. Built-in roles have predefined access associated with them. We can assign these roles directly to users or groups for access. To run mongostat, a user needs permission to run the serverStatus command on the server.

The built-in role clusterMonitor comes with the required access for this.

Custom roles or user-defined roles are the ones where we have to manually define access actions to a particular resource. MongoDB provides method db.createRole() for creating user-defined roles. These roles can be created in a specific database as MongoDB uses a combination of database and role name to uniquely define the role.

We will create a custom role mongostatRole that provides only the privileges to run mongostat.

First, we need to connect to mongod or mongos on the admin database as a user that has privileges to create roles in the admin database as well as other databases.

mongo --port 27017 -u admin -p 'abc***' --authenticationDatabase 'admin'

Now we will create a desired custom role in the admin database.

use admin
db.createRole({
     role: "mongostatRole",
     privileges: [
       { resource: { cluster: true }, actions: [ "serverStatus" ] }
     ],
     roles: []
})

This role can now be assigned to members of the monitoring team.

Intermediate

In MongoDB data is stored as BSON (JSON-like) documents. These documents can have different sets of fields, with a different data type for each field. For example, a single collection can hold documents whose values are a number, a string and an array.

         { "a" : 143 }

         { "name" : "john" }

         { "x" : [1,2,3] }

It is not correct to say MongoDB is schemaless. In fact, schema plays an important role in the design of MongoDB applications. MongoDB has a dynamic schema, with a database structure made of collections and indexes. These collections can be created either implicitly or explicitly.

Due to the dynamic behaviour of the schema, MongoDB has several advantages over RDBMS systems.

Schema migrations become very easy. In traditional systems we have to run an ALTER TABLE command to add a column, often followed by a reorg, which can require downtime. In MongoDB, such adjustments are transparent and automatic. For example, if we want to add a CITY field to the people collection, we can add the attribute and resave the document, and that is all.

The first part of the query matches all documents where y >= 10, so we have 2 documents:

d> { "_id" : 4, "x" : 4, "y" : 10 }
e> { "_id" : 5, "x" : 5, "y" : 75 }

The second part of the query sets the value of y to 75 for the above 2 documents. However, document e already has y: 75, so it is not modified.

Finally, only 1 document is actually updated by the provided query.

d> { "_id" : 4, "x" : 4, "y" : 10 }
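
The difference between matched and modified documents can be sketched as follows. The original query is not shown here, so this assumes it is a multi-document update that sets y to 75 wherever y >= 10; the function name and the in-memory array are illustrative only.

```javascript
// Imitates an update of the assumed form
//   updateMany({ y: { $gte: 10 } }, { $set: { y: 75 } })
// on in-memory documents and reports matched vs. modified counts.
function setYWhereAtLeast(docs, threshold, newValue) {
  let matched = 0;
  let modified = 0;
  for (const doc of docs) {
    if (doc.y >= threshold) {
      matched += 1;
      // A document that already holds the new value is matched but not modified.
      if (doc.y !== newValue) {
        doc.y = newValue;
        modified += 1;
      }
    }
  }
  return { matched, modified };
}

const docs = [
  { _id: 4, x: 4, y: 10 },
  { _id: 5, x: 5, y: 75 },
];
const result = setYWhereAtLeast(docs, 10, 75);
console.log(result); // { matched: 2, modified: 1 }
```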

Every operation on the primary is logged in the operation log, known as the oplog. The oplog is replicated to the secondary members. For a healthy replica set, it is recommended that all members are in sync with no replication lag. Data is first written on the primary by the applications and then replicated to the secondaries. This synchronization is important to maintain up-to-date copies of data on all members. Synchronization happens in 2 ways: initial sync and continuous replication.

Initial Sync: When we add a new member to the replica set, data from one member is copied to the new member. During an initial sync, MongoDB copies all databases one by one except the local database. This is done by scanning all collections in each source database and inserting their documents on the new member. All indexes are also built during the initial sync. The data set may change while the initial sync is running, so at the end of the copy the changes to the already copied collections are applied using the oplog.

Continuous Replication: After the initial sync the secondary members replicate data continuously. We can decide which member a secondary syncs from. Secondary members replicate from their sync source asynchronously, using the oplog.

The oplog is the operation log that keeps a record of all operations that modify the data stored in the databases. We can define the oplog size while starting MongoDB by specifying the --oplogSize option. If we do not specify this option it takes the default value, which is 5% of free disk space in the case of WiredTiger. While the default value is sufficient for most workloads, in some cases we may need to change the oplog size for the replica set.

The oplog size is changed in a rolling manner: first on all secondary members and then on the primary member of the replica set. To change the oplog size:

  • First we need to connect to any secondary member.
  • Verify the size of current oplog by running below command on the local database.

use local

db.oplog.rs.stats().maxSize

  • Change the oplog size using the admin command replSetResizeOplog, specifying the new oplog size in megabytes as a number.

db.adminCommand({replSetResizeOplog: 1, size: <size-in-MB>})

  • Repeat the same process for other secondary members and then on the primary member of the replica set.

MongoDB applies database operations on the primary and then records the operations in the primary’s oplog. The secondary members then copy and apply these operations in an asynchronous process. For each operation, there is a separate oplog entry.

First, let’s check how many documents the query would match by changing the delete to a find operation.

db.sample.find( { state : "WA" } )

This returns all the documents where the state is WA.

{"firstName" : "Arthur", "lastName" : "Aaronson", "state" : "WA", "city" : "Seattle", "likes" : [ "dogs", "cats" ] }
{"firstName" : "Beth", "lastName" : "Barnes", "state" : "WA", "city" : "Richland", "likes" : [ "forest", "cats" ] }
{"firstName" : "Dawn", "lastName" : "Davis", "state" : "WA", "city" : "Seattle", "likes" : [ "forest", "mountains" ] }

Ideally a delete would remove all matching documents, but the query says deleteOne.

Had the query said deleteMany, all the matching documents would have been deleted and there would be 3 oplog entries, but deleteOne removes only the first matching document. So 1 oplog entry is generated by the provided query.

Idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application. In MongoDB, oplog entries are idempotent, meaning that even if they are applied multiple times the same output is produced. So if the server goes down and we need to replay the oplog there will not be any inconsistency; even if entries that were already applied are applied again, the end state of the database does not change.

Also, there was a desire to make the new state of a document independent of its previous state. For this, all operators that rely on the previous state to determine the new value are transformed to record the actual resulting values. For example, if an addition operation modifies a value from ‘21’ to ‘30’, the operation is recorded as setting the value ‘30’ on the field. Replaying the operation multiple times then produces the same result.
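
The 21-to-30 example can be sketched in plain JavaScript. The entry shape used here ({ _id, set }) is a simplified assumption for illustration and does not reproduce the real oplog document format.

```javascript
// Turn a relative increment into an absolute "set" entry, as described above:
// an increment that takes 21 to 30 is recorded as "set the value to 30".
function toIdempotentEntry(doc, field, incrementBy) {
  return { _id: doc._id, set: { [field]: doc[field] + incrementBy } };
}

// Apply an entry to a document; replaying it any number of times is harmless.
function applyEntry(doc, entry) {
  return { ...doc, ...entry.set };
}

const original = { _id: 1, qty: 21 };
const entry = toIdempotentEntry(original, "qty", 9);

let replica = { ...original };
for (let i = 0; i < 3; i += 1) {
  replica = applyEntry(replica, entry); // replayed three times
}

console.log(entry.set);   // { qty: 30 }
console.log(replica.qty); // 30
```

Had the entry stored the increment itself, three replays would have produced 48 instead of 30, which is why the relative operator is rewritten before it is logged.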

In MongoDB we can read from the primary as well as the secondary members of the replica set. We control this behaviour by defining the desired read preference, which determines how clients route read operations to members of the replica set. If we do not specify any read preference, by default MongoDB reads from the primary. There are situations when you would want to reduce the load on your primary by directing applications to read from secondaries.

Below are different MongoDB read preference modes:

  • primary: This is the default mode. Applications read from the replica set primary.
  • primaryPreferred: In this mode, all applications read from the primary, but if the primary member is not available they start reading from a secondary.
  • secondary: All applications read from the secondary members of the replica set.
  • secondaryPreferred: In this mode, all applications read from secondaries, but if no secondary member is available they start reading from the primary.
  • nearest: In this mode, applications read from the member which is nearest to them in terms of network latency, irrespective of the member being primary or secondary.
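
The routing rules of the five modes can be sketched as a small selection function. This is a simplification under stated assumptions: the member shape ({ name, state, pingMs }) and the function pickMember are hypothetical, and the sketch always picks the lowest-latency candidate, whereas real drivers apply additional selection rules.

```javascript
// Returns the member a read would be routed to under each mode, or null
// when no eligible member exists.
function pickMember(mode, members) {
  const primary = members.find((m) => m.state === "PRIMARY") || null;
  const secondaries = members.filter((m) => m.state === "SECONDARY");
  // Lowest-latency member of a list, or null for an empty list.
  const closest = (list) =>
    list.reduce(
      (best, m) => (best === null || m.pingMs < best.pingMs ? m : best),
      null
    );

  switch (mode) {
    case "primary":
      return primary;
    case "primaryPreferred":
      return primary || closest(secondaries);
    case "secondary":
      return closest(secondaries);
    case "secondaryPreferred":
      return closest(secondaries) || primary;
    case "nearest":
      return closest(members);
    default:
      throw new Error(`unknown read preference mode: ${mode}`);
  }
}

// Made-up replica set members for illustration.
const members = [
  { name: "a", state: "PRIMARY", pingMs: 20 },
  { name: "b", state: "SECONDARY", pingMs: 5 },
  { name: "c", state: "SECONDARY", pingMs: 12 },
];

console.log(pickMember("primary", members).name);            // a
console.log(pickMember("secondaryPreferred", members).name); // b
console.log(pickMember("nearest", members).name);            // b
```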

Shard key selection is based on the workload. Since the first query accounts for 90% of the workload, it should drive the selection of the shard key.

A combination of fields from that query would make the best shard key. This eliminates options b, c and d.

Options a and e both use a subset of fields from the most frequent query and either could be the shard key, but the option with more fields would be more suitable.

Chunk split operations are carried out automatically by the system when any insert operation causes chunk to exceed the maximum chunk size. Balancer then migrates recently split chunks to new shards. But in some cases we may want to pre-split the chunks manually:

  • If we have deployed a cluster using existing data, we may have a large amount of data and very few chunks. In such cases, pre-splitting helps achieve an even distribution.
  • If the cluster uses a hashed shard key, or we know the distribution of our data very well, we can pre-split the chunks so that the data is distributed evenly between shards.
  • If we perform an initial bulk load, all data would go to a single shard and those documents would later migrate to other shards, doubling the number of writes. If we pre-split the collection across the expected values instead, the documents do not have to be written twice.

To split the chunks manually we can use the split command or the helpers sh.splitFind() and sh.splitAt().

Example:

To split the chunk of the test.people collection at the employeid value "713626", the command below should be used.

sh.splitAt( "test.people", { "employeid": "713626" } )

We should be careful while pre-splitting chunks as sometimes it can lead to a collection with different sized chunks.

In some cases, chunks can grow beyond the specified chunk size but cannot undergo a split. The most common scenario is when a chunk represents a single shard key value. Since the chunk cannot split, it continues to grow beyond the chunk size, becoming a jumbo chunk. These jumbo chunks can become a performance bottleneck as they continue to grow, especially if the shard key value occurs with high frequency.

The addition of new data or new shards can result in data distribution imbalances within the cluster. A particular shard may acquire more chunks than another shard, or the size of a chunk may grow beyond the configured maximum chunk size.

MongoDB ensures a balanced cluster using two processes: chunk splitting and the balancer.

In replication we have multiple copies of the same data in sync with each other. It is mainly useful for high availability. In sharding, by contrast, we divide our entire dataset into small chunks and distribute them among several servers. Sharding is used where we have some sort of hardware bottleneck or want the benefits of query parallelism. If our dataset is very small, sharding does not provide many advantages, but as the data grows we should move to sharding.

Below are a few of the situations where sharding is recommended over replication.

  • Our MongoDB instance cannot keep up with the application's write load, and we have exhausted the RAM and CPU options for the server.
  • Our data set is too big to fit in a single MongoDB instance, and we have reached the disk limits for the server.
  • We want to improve read performance for the application. By using targeted queries in the sharded cluster we read only the required data and skip the rest.
  • The data set is taking too much time to back up and restore.

Breaking the dataset over shards means that each shard has more resources available to handle the subset of data it owns, and operations that move data across machines for replication, backups and restores also become faster.

Loading every document into RAM means that the query does not use an index efficiently and has to fetch documents from disk into RAM.

To use an index, the initial match in the find statement should use either an index or an index prefix.

The query below filters on key b, which does not match any existing index or index prefix, so it has to fetch documents from disk.

db.sample.find( { b : 1 } ).sort( { c : 1, a : 1 } ) 

If we start the Mongo server using mongod with the --auth option, authorization is enabled but we will not be able to do anything; even listing databases using show dbs will fail with the error "not authorized on admin to execute the command". This is because we have not yet authenticated to the database. But this is a new database with no users, so how can we create new users without any authorization?

MongoDB provides localhost exception for creating the first user on the database without any authorization. With this first user, we can create other users with relevant access.

But there are a few considerations to it:

  • localhost exception only applies for the first user in a database.
  • localhost exception only applies when connected to the database via the localhost interface, meaning on the same computer, if not we will not be able to utilize localhost exception.

MongoDB follows the Role-Based Access Control (RBAC) authorization model. In this model, users are assigned one or more roles which provide them access to the database resources and operations. Apart from these role assignments users do not have any access to the resources. When we enable internal authentication it automatically enables client authorization. Authorization can be enabled by starting mongod with the --auth option or by providing the security.authorization setting in the config file.

Roles: Roles are groups of privileges, actions over resources that are granted to users over a given database. A role grants privileges to perform the specified actions on the resource. Each privilege is either specified explicitly in the role or inherited from another role or both.

Actions: All operations and commands that a user can perform in MongoDB are called actions. Actions are performed on resources.

Resources: Resources are any objects that hold a state in a database.

Privilege: A privilege is the combination of a resource and the actions permitted on that resource.

A group of these privileges forms a role that can be assigned to the user.

MongoDB provides several built-in roles such as Database User Roles, Database Administration Roles, Cluster Administration Roles, Backup and Restoration Roles, All-Database Roles and Superuser Roles. We can also create our own custom roles based on the requirement, which are called user-defined roles.

For example, suppose we want a role that only allows running the serverStatus command on the cluster (which is what mongostat needs). We can create the user-defined role below.

use admin
db.createRole({
     role: "mongostatRole",
     privileges: [
       { resource: { cluster: true }, actions: [ "serverStatus" ] }
     ],
     roles: []
})

If secondaries are falling behind they are experiencing replication lag, which is a delay in the application of oplog from primary to secondary. Replication lag can be a significant issue and can seriously affect MongoDB replica set deployments. Excessive replication lag makes a member ineligible to quickly become primary and increases the possibility of distributed read operations to be inconsistent.

We can check the replication lag by calling the rs.printSlaveReplicationInfo() method (named rs.printSecondaryReplicationInfo() in newer shell versions) or by running the rs.status() command.

Possible causes of replication lag include:

  1. Network issues between the primary and secondary servers. Sometimes the network becomes a bottleneck. We should ensure there is no packet loss or any other network-related issue.
  2. Disk issues on the secondary server. Sometimes the primary and secondary run on different hardware. If the primary is using SSD and the secondary HDD, disk flushes will be slow on the secondary and it will lag behind. There might be other disk issues, which can be checked with vmstat and iostat.
  3. Concurrency issues. Long-running operations on the primary can block replication on the secondary. We should configure the write concern to require acknowledgement of replication to a secondary. This prevents write operations from returning if replication cannot keep up with the write load.
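
As a rough sketch of how lag can be derived from rs.status() output, the function below compares each secondary's optimeDate with the primary's. It reads only the name, stateStr and optimeDate fields of each member; the sample document, host names and timestamps are made up for illustration.

```javascript
// Compute per-secondary lag in seconds from an rs.status()-like document.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no primary found in status document");
  }
  const lag = {};
  for (const m of status.members) {
    if (m.stateStr === "SECONDARY") {
      // Subtracting two Date objects yields milliseconds.
      lag[m.name] = (primary.optimeDate - m.optimeDate) / 1000;
    }
  }
  return lag;
}

// Hypothetical sample shaped like the relevant part of rs.status().
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T10:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T10:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T10:00:00Z") },
  ],
};

console.log(replicationLagSeconds(sampleStatus));
// { 'db2:27017': 2, 'db3:27017': 30 }
```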

The storage engine is the component that lies between the database and the storage layer and is primarily responsible for managing data. MongoDB provides a few choices of storage engines, enabling us to use the one best suited for our application. Choosing the appropriate storage engine can significantly impact performance.

WiredTiger Storage Engine (Default)

WiredTiger replaced MMAPv1 in 3.2 to become the default storage engine. If we install MongoDB and do not specify a storage engine, WiredTiger is used. As it provides a document-level concurrency model, checkpointing and compression, it is suited for most workloads. It also supports encryption at rest in MongoDB Enterprise.

In-Memory Storage Engine

Various applications require predictable latencies which can be achieved by storing the documents in memory rather than on disk. In-Memory Storage Engine is helpful for such applications. It is available only in MongoDB enterprise edition.

MMAPv1 Storage Engine (Deprecated as of MongoDB 4.0)

MongoDB started with only the MMAPv1 storage engine, but it worked well for just a specific subset of use cases, which is why it was deprecated in version 4.0.

Concurrency in MongoDB is different for different storage engines. While WiredTiger uses document-level concurrency control for write operations, MMAPv1 has concurrency at the collection level.

In WiredTiger locking is at the document level, so multiple clients can modify different documents of a collection at the same time; it uses optimistic concurrency control for most reads and writes. Instead of exclusive locks, it uses only intent locks at the global, database and collection levels. If the storage engine detects a conflict between two operations, one of them incurs a write conflict, causing MongoDB to transparently retry that operation.

Sometimes global “instance-wide” locks are required for global operations which involve multiple databases. Exclusive database locks are taken for operations such as dropping a collection.

MMAPv1 still uses collection-level locks, meaning that if we have 3 applications writing to the same collection, 2 have to wait until the first application completes its write, as it holds a collection-level write lock.

MongoDB WiredTiger ensures data durability with journaling and checkpoints. Journals are write-ahead logs, while checkpoints are point-in-time snapshots.

Checkpoints:

In WiredTiger, with the start of each operation, a point-in-time snapshot is taken which presents a consistent view of the in-memory data. WiredTiger then writes all snapshot data to disk in a consistent way across all data files. This data on disk is durable and acts as a checkpoint in the data files. The checkpoint ensures all data files are consistent as of the last checkpoint.

These checkpoints usually occur every 60 seconds, so we have a consistent snapshot at 60-second intervals, thus ensuring durability.

Journal:

The journal is a write-ahead log which persists all data changes between 2 checkpoints. If we require the data written between checkpoints for recovery, these journal files can be used. The journal files act as crash recovery files in case of interruptions. Once the system is back up, the journal files can be replayed for recovery.

In MongoDB data set consistency is ensured by locking. In any database system long running queries degrade the performance as requests and operations have to wait for a lock. Locking issues are intermittent and so need to be resolved immediately.

MongoDB provides us with tools and utilities to troubleshoot these locking issues. The serverStatus command gives us a view of the system, including locking-related information. We should look at the locks and globalLock sections of the serverStatus output when troubleshooting locking issues.

We can use below commands to filter locking related information from the serverStatus output.

db.serverStatus().globalLock
db.serverStatus().locks

To get the approximate average wait time for a lock mode we can divide locks.timeAcquiringMicros by locks.acquireWaitCount.

To check the number of times deadlocks occurred, locks.deadlockCount should be checked.

If the application performance is constantly degraded there might be concurrency issues. In such cases, we should look at globalLock.currentQueue.total; a high value indicates concurrency issues.

Sometimes globalLock.totalTime is high relative to uptime which suggests database has been in a lock state for a significant time.
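
The division described above can be sketched as a small function over the locks section of the serverStatus output. The sample object below is made up and uses plain numbers; it only mirrors the nesting of resource type, then counter name, then lock mode.

```javascript
// For every resource type and lock mode, divide timeAcquiringMicros by
// acquireWaitCount to get the approximate average wait in microseconds.
function averageLockWaitMicros(locks) {
  const result = {};
  for (const [resource, stats] of Object.entries(locks)) {
    const waits = stats.acquireWaitCount || {};
    const micros = stats.timeAcquiringMicros || {};
    for (const [mode, count] of Object.entries(waits)) {
      // Skip modes that never waited or have no timing recorded.
      if (count > 0 && mode in micros) {
        result[`${resource}.${mode}`] = micros[mode] / count;
      }
    }
  }
  return result;
}

// Hypothetical sample shaped like db.serverStatus().locks.
const sampleLocks = {
  Global: {
    acquireWaitCount: { r: 4, w: 2 },
    timeAcquiringMicros: { r: 2000, w: 5000 },
  },
  Database: {
    acquireWaitCount: { W: 1 },
    timeAcquiringMicros: { W: 300 },
  },
  Collection: { acquireCount: { r: 10 } },
};

console.log(averageLockWaitMicros(sampleLocks));
// { 'Global.r': 500, 'Global.w': 2500, 'Database.W': 300 }
```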

Indexes are important to consider while designing databases as they impact application performance. Without an index, a query performs a collection scan in which all the documents of the collection are scanned one by one to find those matching the query. If we have an index that supports a particular query, MongoDB can use it to limit the number of documents scanned, as the index stores the values of the indexed fields in ascending or descending order.

While indexes help improve the performance of find operations, they can have a significant negative impact on write operations like insert and update, because with each write MongoDB also needs to update the indexes associated with the collection. This is overhead on the system and we may end up with performance degradation.

So while find() performance will improve, updateOne and insertOne will become slower, as with every update or insert the related indexes need to be updated.

MongoDB provides several utilities for data movement activities, like mongodump, mongoexport etc. Mongodump is used to export the contents of a collection to an external file in BSON (binary) format. The exported contents can then be used by the mongorestore command to restore into another database or a different collection. Mongodump does not capture index data; it captures only the documents and the index definitions, and mongorestore rebuilds the indexes. Since the contents are exported in binary format, this method cannot be used for exporting to a CSV file.

To export the contents in JSON or CSV format we can use the mongoexport command. The exported collection can then be loaded using the mongoimport command. Since mongoexport does not export BSON, not all of the rich BSON data types are preserved while exporting the data. For this reason, mongoexport should be used with careful consideration.

Below is the command that can be used to export to CSV.

mongoexport --host host:27017 -d test -c sample --type=csv -f fields -o sample.csv

In a sharded cluster we may have a database which has sharded as well as unsharded collections. While the sharded collections are spread across the shards, all the unsharded collections are stored on a single shard known as the primary shard. Every database in the sharded cluster has its own primary shard. When we create a new database, mongos picks the shard with the least amount of data in the cluster and marks it as the primary shard.

If there is a need to change the primary shard we can do so by using the movePrimary command. This migration may take significant time to complete. We should not access any collections associated with the migrating database until the process completes. Also, the migration of the primary shard should be done during off-peak hours as it may impact the performance of the overall cluster.

E.g., to migrate the primary shard of the accounts database to shard0007, the command below should be used.

db.adminCommand( { movePrimary : "accounts", to : "shard0007" } )

When we create any collection in MongoDB, a unique index on the _id field is created automatically. This unique index prevents applications from inserting multiple documents with the same value for the _id field. This is enforced by the system and we cannot drop the index on the _id field. Moreover, in replica sets the unique _id values are used in the oplog to reference the documents to update.

In a sharded cluster, if _id values are not unique across the sharded collection, chunk migrations may fail: when documents migrate to another shard, documents with identical _id values cannot be inserted on the receiving shard. In such cases, the application should be coded so that it ensures uniqueness of _id for the given collection across all shards in the cluster.

If we use _id as the shard key, this uniqueness is automatically enforced by the system. In that case each chunk range is assigned to a single shard, and that shard enforces uniqueness on the values in the range.

MongoDB provides a database profiler which captures and stores detailed information about commands executed on a running instance. Captured details include CRUD operations, administrative commands and configuration commands. The data collected by the profiler is stored in the system.profile collection of the database being profiled.

By default, the profiler is turned off. We can enable it and set it to different profiling levels based on the requirements. It provides 3 profiling levels:

0 – Profiler OFF, no data collected (default).

1 – Profiler ON, data collected for operations taking longer than slowms.

2 – Profiler ON, data collected for all operations.

To capture slow-running queries, we can start the profiler with profiling level 1 (level 2 captures them too, along with every other operation). The default slow operation threshold is 100 milliseconds, and we can change it by specifying the slowms option.

E.g., to enable a profiler that captures all queries slower than 50 ms, the command below should be used:

db.setProfilingLevel(1, { slowms: 50 })
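
The effect of the three levels can be sketched as a simple decision function. This is a simplified model of the rules listed above, not the profiler's actual implementation; the function name isProfiled is hypothetical.

```javascript
// Decide whether an operation would be recorded, given the profiling level,
// the slowms threshold and the operation's duration in milliseconds.
function isProfiled(level, slowms, durationMs) {
  switch (level) {
    case 0:
      return false; // profiler off
    case 1:
      return durationMs > slowms; // only operations slower than slowms
    case 2:
      return true; // every operation
    default:
      throw new RangeError("level must be 0, 1 or 2");
  }
}

console.log(isProfiled(1, 50, 120)); // true
console.log(isProfiled(1, 50, 10));  // false
console.log(isProfiled(2, 50, 10));  // true
```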

Advanced

  • Compound indexes not only support queries that match all the index fields, they also support queries on the index prefixes as well.

         Consider the following compound index
                       { "accountHolder": 1, "accountNumber": 1, "currency": 1 }

The index prefixes are

                      { accountHolder: 1 }

                     { accountHolder: 1, accountNumber: 1 }

The query plan can use this index if the query has the following fields

  • accountHolder
  • accountHolder and accountNumber
  • accountHolder and accountNumber and currency

Ordering is very important: the index is used only when the query includes the leading (leftmost) fields of the compound index, in the order in which the index defines them. A query on accountNumber alone, for example, is not supported by any prefix of this index.
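
The prefix rule can be sketched as a function that counts how many leading index fields a query covers. The function name matchedPrefixLength is hypothetical, and the sketch looks only at field names; the real query planner also weighs sort order, range predicates and other factors.

```javascript
// Count how many leading fields of a compound index appear in the query.
// 0 means the query does not match any prefix of the index.
function matchedPrefixLength(indexFields, queryFields) {
  const queried = new Set(queryFields);
  let length = 0;
  for (const field of indexFields) {
    if (!queried.has(field)) break; // the prefix ends at the first gap
    length += 1;
  }
  return length;
}

const index = ["accountHolder", "accountNumber", "currency"];

// The order in which the query lists its fields does not matter.
console.log(matchedPrefixLength(index, ["accountNumber", "accountHolder"])); // 2
// Without the leading field, no prefix matches.
console.log(matchedPrefixLength(index, ["accountNumber", "currency"]));      // 0
// Only the accountHolder prefix matches; currency falls after a gap.
console.log(matchedPrefixLength(index, ["accountHolder", "currency"]));      // 1
```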

The $addToSet operator should be used with the $each modifier for this. The $each modifier allows the $addToSet operator to add multiple values to the array field.

For example, startups are tagged with the technology skills that they excel in

{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }

Now the startup needs to be updated with additional skills

db.startups.update(
   { _id: 5 },
   { $addToSet: { skills: { $each: [ "Machine Learning", "RPA" ] } } }
)

The resultant document after update()

{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud", "Machine Learning", "RPA" ] }

Note: There is no particular ordering of elements in the modified set, $addToSet does not guarantee that. Duplicate items will not be added.   
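
The effect of $addToSet with $each can be imitated on a plain array. This sketch compares values with Array.prototype.includes, so it is only meant for primitive values such as strings; the function name addToSetEach is hypothetical.

```javascript
// Imitates { $addToSet: { <field>: { $each: values } } } on a plain array:
// each value is appended only if it is not already present.
function addToSetEach(existing, values) {
  const result = [...existing];
  for (const value of values) {
    if (!result.includes(value)) {
      result.push(value);
    }
  }
  return result;
}

const skills = ["Big Data", "AI", "Cloud"];
console.log(addToSetEach(skills, ["Machine Learning", "AI", "RPA"]));
// [ 'Big Data', 'AI', 'Cloud', 'Machine Learning', 'RPA' ]
```

The duplicate "AI" is skipped, while the two new skills are added.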

When "fast reads" are the single most important criterion, embedded documents can be the best way to model one-to-one and one-to-many relationships.

Consider the example of certifications awarded to an employee. In the example below the certification data is embedded in the employee document, which is a denormalized way of storing data

{
   _id: "10",
   name: "Sarah Jones",
   certifications: [
      {
         certification: "Certified Project Management Professional",
         certifying_auth: "PMI",
         date: "06/06/2015"
      },
      {
         certification: "Oracle Certified Professional",
         certifying_auth: "Oracle Corporation",
         date: "10/10/2017"
      }
   ]
}

In a normalized form, there would be a reference to the employee document from the certificate document, for example

{
   employee_id: "10",
   certification: "Certified Project Management Professional",
   certifying_auth: "PMI",
   date: "06/06/2015"
}

Embedded documents are best used when the entire relationship data needs to be frequently retrieved together. Data can be retrieved via single query and hence is much faster.

Note: Embedded documents should not grow unbounded, otherwise they can slow down both read and write operations. Other factors like consistency and frequency of data change should be considered before making the final design decision for the application.

MongoDB has the db.collection.explain() and cursor.explain() methods and the explain command to provide information on the query plan. The results of explain contain a lot of information, the key items being

  • rejectedPlans, other plans considered by the database (if any)
  • winningPlan, the plan selected by the query optimizer for execution
  • executionStats, gives detailed information on the winning plan
  • IXSCAN stage, if the query planner selects an index, the explain result will include this stage. This is one of the key things to look for when analyzing the query plan for performance optimization
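
Checking an explain result for an IXSCAN stage can be sketched as a walk over the winningPlan tree. This assumes the classic explain shape, where nested stages hang off inputStage or inputStages; the sample document below is made up, and the helper names are hypothetical.

```javascript
// Walk a winningPlan-style tree and collect the stage names it contains.
function collectStages(plan, found = []) {
  if (!plan || typeof plan !== "object") return found;
  if (plan.stage) found.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, found);
  for (const child of plan.inputStages || []) {
    collectStages(child, found);
  }
  return found;
}

// True when the winning plan contains an index scan.
const usesIndex = (explainResult) =>
  collectStages(explainResult.queryPlanner.winningPlan).includes("IXSCAN");

// Hypothetical sample shaped like the relevant part of an explain result.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "state_1" },
    },
    rejectedPlans: [],
  },
};

console.log(collectStages(sampleExplain.queryPlanner.winningPlan)); // [ 'FETCH', 'IXSCAN' ]
console.log(usesIndex(sampleExplain)); // true
```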

Recursive queries can be performed within a collection using $graphLookUp which is an aggregate pipeline stage.

If a collection has a self-referencing field like the classic example of Manager for an employee, then a query to get the entire reporting structure for manager “David” would look like this

db.employees.aggregate( [
   {
      $graphLookup: {
         from: "employees",
         startWith: "David",
         connectFromField: "manager",
         connectToField: "name",
         as: "Reporting Structure"
      }
   }
] )

For the following documents in the employees collection,

{ "_id" : 4, "name" : "David", "manager" : "Sarah" }
{ "_id" : 5, "name" : "John", "manager" : "David" }
{ "_id" : 6, "name" : "Richard", "manager" : "John" }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard" }

the "Reporting Structure" array produced by the above $graphLookup operation would contain the following 3 documents

{ "_id" : 5, "name" : "John", "manager" : "David", … }
{ "_id" : 6, "name" : "Richard", "manager" : "John", … }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard", … }

The hierarchy starts with “David”, which is specified in startWith, and from there the documents for each member of that reporting hierarchy are fetched recursively.
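
The recursive search can be imitated in memory to see why these three documents are found. The sketch below is plain JavaScript, not the server's implementation: graphLookupSketch is a hypothetical helper that matches the start value against the manager field and then continues from the name of every match, which is the direction needed to list everyone below David.

```javascript
const employees = [
  { _id: 4, name: "David", manager: "Sarah" },
  { _id: 5, name: "John", manager: "David" },
  { _id: 6, name: "Richard", manager: "John" },
  { _id: 7, name: "Stacy", manager: "Richard" }
];

// In-memory imitation of the recursive search: find documents whose
// connectToField equals a value in the frontier, then continue from
// their connectFromField values until nothing new is found.
function graphLookupSketch(docs, startWith, connectFromField, connectToField) {
  const found = new Map(); // _id -> document, so each document is visited once
  let frontier = [startWith];
  while (frontier.length > 0) {
    const next = [];
    for (const doc of docs) {
      if (frontier.includes(doc[connectToField]) && !found.has(doc._id)) {
        found.set(doc._id, doc);
        next.push(doc[connectFromField]);
      }
    }
    frontier = next;
  }
  return [...found.values()];
}

const reports = graphLookupSketch(employees, "David", "name", "manager");
console.log(reports.map((d) => d.name)); // [ 'John', 'Richard', 'Stacy' ]
```

Swapping the last two arguments walks the hierarchy upwards instead, from an employee towards the managers above.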

A $graphLookup that collects the chain of managers above each employee in the employees collection, where “manager” is the self-referencing field, looks like this

db.employees.aggregate( [
   {
      $graphLookup: {
         from: "employees",
         startWith: "$manager",
         connectFromField: "manager",
         connectToField: "name",
         as: "Reporting Structure"
      }
   }
] )

The value of as, which is “Reporting Structure” in this case, is the name of the array field added to each output document. It contains the documents traversed by the $graphLookup starting from that document.

For the following documents in the employees collection,

{ "_id" : 4, "name" : "David", "manager" : "Sarah" }
{ "_id" : 5, "name" : "John", "manager" : "David" }
{ "_id" : 6, "name" : "Richard", "manager" : "John" }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard" }

the output documents would look like this (the order of the documents inside the array is not guaranteed)

{
"_id" : 4, "name" : "David", "manager" : "Sarah",
"Reporting Structure" : []
}
{
"_id" : 5, "name" : "John", "manager" : "David",
"Reporting Structure" : [ { "_id" : 4, "name" : "David", "manager" : "Sarah" } ]
}
{
"_id" : 6, "name" : "Richard", "manager" : "John",
"Reporting Structure" : [ { "_id" : 5, "name" : "John", "manager" : "David" },
      { "_id" : 4, "name" : "David", "manager" : "Sarah" } ]
}
{
"_id" : 7, "name" : "Stacy", "manager" : "Richard",
"Reporting Structure" : [ { "_id" : 6, "name" : "Richard", "manager" : "John" },
      { "_id" : 5, "name" : "John", "manager" : "David" },
      { "_id" : 4, "name" : "David", "manager" : "Sarah" } ]
}

Yes, there is a simpler way of achieving this than doing it programmatically. The $unwind aggregation stage deconstructs an array field, producing one document for each element.

Consider user “John” with multiple addresses

{
"_id" : 1, "name" : "John", "addresses" : [ "Permanent Addr", "Temporary Addr", "Office Addr" ]
}

db.users.aggregate( [ { $unwind : "$addresses" } ] )

would result in 3 documents, one for each of the addresses

{ "_id" : 1, "name" : "John", "addresses" : "Permanent Addr" }
{ "_id" : 1, "name" : "John", "addresses" : "Temporary Addr" }
{ "_id" : 1, "name" : "John", "addresses" : "Office Addr" }
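
For intuition, the effect of $unwind on a single document can be reproduced in a few lines of plain JavaScript. This is only an in-memory sketch; unwindSketch is a hypothetical helper name and it handles just the simple case where the field is a non-empty array.

```javascript
// Emit one copy of the document per element of the array field.
function unwindSketch(doc, field) {
  return doc[field].map((element) => ({ ...doc, [field]: element }));
}

const user = {
  _id: 1,
  name: "John",
  addresses: ["Permanent Addr", "Temporary Addr", "Office Addr"]
};

const unwound = unwindSketch(user, "addresses");
console.log(unwound.length); // 3
console.log(unwound[1]); // { _id: 1, name: 'John', addresses: 'Temporary Addr' }
```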

MongoDB supports capped collections, which are fixed-size collections. Once the allocated space is filled up, room is made for new documents by overwriting the oldest documents. Insertion order is preserved, and if a query does not specify any ordering, results are returned in insertion order. The oplog.rs collection is a capped collection, which ensures that the operation log does not grow indefinitely.
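
The overwrite behaviour can be pictured with a toy model. The class below is a hypothetical in-memory sketch in plain JavaScript, limited by document count for simplicity, whereas a real capped collection is limited by its size in bytes.

```javascript
// Toy capped collection limited by document count (an assumption made
// for simplicity; the real limit is a size in bytes).
class CappedSketch {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = [];
  }
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.maxDocs) this.docs.shift(); // drop the oldest
  }
  find() {
    return [...this.docs]; // natural order is insertion order
  }
}

const cappedLog = new CappedSketch(3);
for (let i = 1; i <= 5; i++) cappedLog.insert({ seq: i });
console.log(cappedLog.find().map((d) => d.seq)); // [ 3, 4, 5 ]
```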

A query that is able to return entire results only by using the index is called a Covered Query. This is one of the optimization techniques that can be used with queries for faster retrieval of data. A query can be a covered query only if

  • all the fields in the query are part of an index and
  • all the fields that are returned are also part of the same index

Since everything is part of the index, there is no need for the query to check the documents for any information.
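
The two conditions can be expressed as a small check. The sketch below is a simplified, hypothetical helper in plain JavaScript: it looks only at top-level field names and ignores further restrictions such as multikey indexes. It also accounts for the fact that _id is returned by default, so a query is covered only when _id is excluded from the projection or is itself part of the index.

```javascript
// indexKeys: fields of one index; filterFields: fields used in the query filter;
// projection: a find() projection object such as { city: 1, _id: 0 }.
function isCoveredSketch(indexKeys, filterFields, projection) {
  const inIndex = (field) => indexKeys.includes(field);
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  // _id is returned by default, so it counts unless explicitly excluded.
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");
  return returned.length > 0 && filterFields.every(inIndex) && returned.every(inIndex);
}

const indexKeys = ["name", "city"];
console.log(isCoveredSketch(indexKeys, ["name"], { city: 1, _id: 0 })); // true
console.log(isCoveredSketch(indexKeys, ["name"], { city: 1 }));         // false (_id is returned)
console.log(isCoveredSketch(indexKeys, ["name"], { city: 1, age: 1, _id: 0 })); // false (age not indexed)
```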

Multikey indexes can be used for supporting efficient querying against array fields. MongoDB creates an index key for each element in the array.

Note: MongoDB will automatically create a multikey index if any indexed field is an array, no separate indication required.

Consider the startups collection with an array of skills

{ _id: 1, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }

A multikey index allows searching on the values in the skills array

db.startups.createIndex( { skills :  1 } )

The query db.startups.find( { skills :  "AI" } ) will use this index on skills to return the matching document
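
To picture what "an index key for each element" means, the sketch below builds a comparable lookup structure in memory. It is plain JavaScript and not how the server stores indexes; buildMultikeySketch is a hypothetical helper and the second startup document is made up for the example.

```javascript
// Build an in-memory map from each array element to the _ids of the
// documents containing it, imitating one index key per array element.
function buildMultikeySketch(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const values = Array.isArray(doc[field]) ? doc[field] : [doc[field]];
    for (const value of new Set(values)) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

const startups = [
  { _id: 1, name: "XYZ Technology", skills: ["Big Data", "AI", "Cloud"] },
  { _id: 2, name: "ABC Pvt Ltd", skills: ["AI", "IoT"] } // hypothetical second document
];

const skillsIndex = buildMultikeySketch(startups, "skills");
console.log(skillsIndex.get("AI")); // [ 1, 2 ]
console.log(skillsIndex.size); // 4
```

Looking up "AI" touches a single key instead of scanning every skills array, which is the benefit the multikey index provides.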

All three projection operators, i.e., $, $elemMatch and $slice, operate on arrays. They are used to limit the contents of an array returned in the query results.

For example, db.startups.find( {}, { skills: { $slice: 2 } } ) selects the first 2 items from the skills array for each document returned.

Starting in version 4.0, multi-document transactions are possible in MongoDB on replica sets (support for sharded clusters followed in version 4.2). Before version 4.0, atomic operations were possible only on a single document.

With embedded documents and arrays, data in the documents are generally denormalized and stored in a single structure. With this as the recommended data model, MongoDB's single document atomicity is sufficient for most of the applications. 

Multi-document transactions now allow the remaining small percentage of applications that need them (because related data is spread across documents) to rely on the database to handle transactions, rather than implementing this logic in the application, which can add performance overhead.

Note: In most cases multi-document transactions carry a higher performance cost than single-document writes, so they should be used judiciously.

In the case of an error, whether the remaining operations get processed depends on whether the bulk operation is ordered or unordered. If it is ordered, MongoDB will not process the remaining operations, whereas if it is unordered, MongoDB will continue to process the remaining operations.

Note: ordered is an optional Boolean parameter that can be passed to bulkWrite(); by default it is true.
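
The difference can be sketched without a database. In the plain JavaScript below, bulkWriteSketch is a hypothetical stand-in that runs a list of functions, where a thrown error plays the role of a failed write; it only illustrates the control flow and is not the driver's bulkWrite().

```javascript
// Each operation is a function that either succeeds or throws.
function bulkWriteSketch(operations, { ordered = true } = {}) {
  const result = { succeeded: [], failed: [] };
  for (let i = 0; i < operations.length; i++) {
    try {
      operations[i]();
      result.succeeded.push(i);
    } catch (err) {
      result.failed.push(i);
      if (ordered) break; // ordered: stop at the first error
    }
  }
  return result;
}

const ok = () => {};
const bad = () => { throw new Error("duplicate key (simulated)"); };
const ops = [ok, bad, ok, ok];

console.log(bulkWriteSketch(ops));                     // { succeeded: [ 0 ], failed: [ 1 ] }
console.log(bulkWriteSketch(ops, { ordered: false })); // { succeeded: [ 0, 2, 3 ], failed: [ 1 ] }
```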

The MongoDB Enterprise version includes an auditing capability that is fairly easy to set up. Some salient features of auditing in MongoDB:

  • DML, DDL as well as authentication and authorization actions can be captured.
  • Logging every event will impact performance, so using audit filters to log only specific events is recommended.
  • Audit logs can be written in multiple formats and to various destinations: to the console, to syslog, or to a file (JSON / BSON). Performance-wise, writing to a file in BSON format is better than JSON format.
  • The BSON file can be passed to the MongoDB utility bsondump for human-readable output.

Note: Auditing adds performance overhead and the amount of overhead is determined by a combination of the several factors listed above. The specific needs of the application should be taken into account to arrive at the optimal configuration.

Once selected, the shard key cannot simply be changed later, so it should be chosen after careful consideration. The documents of a collection are distributed between the cluster shards based on the shard key. Effective chunk distribution is important for efficient querying and writing, and it depends directly on the shard key. That is why choosing the right shard key up front is of utmost importance.

When any text content within a document needs to be searchable, all the string fields of the document can be indexed using the $** wildcard specifier.

db.articles.createIndex( { "$**" : "text" } )

Note: Any new string field added to a document after creating the index will automatically be indexed. When the data is huge, wildcard text indexes will have an impact on performance and hence should be used with due consideration.

BSON is a binary-encoded serialization of JSON-like documents. Inside the database, a binary representation is needed for efficiency.

There are 3 major reasons for preferring BSON:

  • Fast scannability - In MongoDB, documents can be quite large. BSON makes it possible to skip undesired portions of a document, which enables fast scanning.

Example: The document below has a large subdocument named hobbies. To read the field "active" we can skip over "hobbies" without parsing it, because BSON stores the length of each embedded document ahead of its content.

{ _id: "32781",
  name: "Smith", age: 30,
  hobbies: { .............................500 KB ..............},
  active: "true" }

  • Data types - BSON provides several data types that JSON lacks, such as Date, BinData (binary data) and ObjectId.
  • Compact storage - Data is stored in a compact binary format, using less space. Data also travels between client and server in BSON format, and can be converted to JSON at the client side using a custom program or the MongoDB drivers.

First, we have the MongoDB query language.

This is the set of instructions and commands that we use to interact with MongoDB. All CRUD operations and the documents that we send back and forth in MongoDB are managed by this layer. It translates the incoming BSON wire protocol messages, which MongoDB uses to communicate with the client-side application libraries that we call drivers, into MongoDB operations.

Then, we have the MongoDB Data Model Layer.

This is the layer responsible for applying all the CRUD operations defined in the MongoDB query language and how they should result in the data structures managed by MongoDB. Management of namespaces, database names, and collections, which indexes are defined per namespace and which interactions need to be performed to respond to the incoming requests are all managed here.

This is also the layer where a replication mechanism is defined. This is where we define WriteConcerns, ReadConcerns that applications may require. 

Next, we have the storage layer.

This layer handles everything related to persisting data on a physical medium: how data is stored on disk, what kinds of files are used, and what levels of compression apply, amongst other settings. MongoDB has several different types of storage engines that persist data with different properties, depending on how the system is configured. WiredTiger is the default storage engine. All actions regarding flushes to disk, journal commits, compression operations and low-level system access happen at this layer.

We also have two transversal layers, which are the security and administration layers.

All operations regarding user management, authentication, network, encryption are managed by the security layer.

All the operations around server administration, like creating databases, renaming collections, logging infrastructure, and such are managed by the administration layer. 

MongoDB is also a distributed data management system supporting replica sets and sharded clusters for high availability and scalability respectively.

Replica sets are groups of mongod instances that contain the same data. The purpose of a replica set is to ensure high availability and automatic failover. Replica set nodes can have different roles (Primary, Secondary, Arbiter), different hardware configurations, and different operating systems.

MongoDB is also a scalable database: it allows a dataset to be segmented into shards, and storage capacity grows by adding shard nodes to the cluster.

Shards themselves are replica sets, that is, highly available units. A sharded cluster has the following components:

mongos - the sharded cluster routing component. The mongos instances are responsible for routing all of our operations and commands to the shards.

Shards - MongoDB replica sets where the actual data is stored.

Config servers - A special type of replica set managing all meta-information of our cluster.

Suppose we have 3 servers abc.com, xyz.com and pqr.com.

  • First, we need to start mongod on each of the servers with appropriate options.

mongod --replSet "rs0" --bind_ip abc.com --port 27017

mongod --replSet "rs0" --bind_ip xyz.com --port 27017

mongod --replSet "rs0" --bind_ip pqr.com --port 27017

The --replSet option is used for creating a replica set; we have named the replica set rs0. --bind_ip is the address on which the server listens for incoming connections.

  • Log in to any of the servers and connect to the mongo shell.

Log in to server abc.com and run the command

mongo

This will take you to the mongo shell.

Now we need to initiate the replica set with a configuration of all 3 members.

rs.initiate( {
   _id : "rs0",
   members: [
      { _id: 0, host: "abc.com:27017" },
      { _id: 1, host: "xyz.com:27017" },
      { _id: 2, host: "pqr.com:27017" }
   ]
})

MongoDB initiates the replica set using the supplied configuration, with default values for any settings that are not specified.

  • To view the configuration of the replica set from any member we can run a command 

rs.conf()

Also to check the status for each member we can run command 

rs.status()

The server from which we run rs.initiate() will typically become the primary, and the other 2 servers will become secondaries.

The first requirement eliminates any five-node replica set where one node is an arbiter, as arbiters do not hold a copy of the data.

The second requirement rules out a priority of 0 for dc1-01, dc1-02, dc2-01 and dc2-02. To be electable as primary they can be assigned any positive priority, or keep the default value of 1.

As per the third requirement, dc3-01 can never be primary, so its priority has to be set to 0.

Finally, as per the fourth requirement, dc3-01 cannot be configured as hidden, as this would prevent reading from this replica set member.

So the configuration below meets all the requirements. The members are listed in the order dc1-01, dc1-02, dc2-01, dc2-02, dc3-01; the _id of each member must be an integer.

{ "_id" : "rs0",
  "version" : 1,
  "members" : [
    { "_id" : 0, "host" : "mongodb0.example.net:27017" },
    { "_id" : 1, "host" : "mongodb1.example.net:27017" },
    { "_id" : 2, "host" : "mongodb2.example.net:27017" },
    { "_id" : 3, "host" : "mongodb3.example.net:27017" },
    { "_id" : 4, "host" : "mongodb4.example.net:27017", "priority" : 0 } ]
}

When the primary of a replica set becomes unavailable, a secondary becomes the new primary. This is done via an election, in which the most appropriate member of the replica set is promoted to primary. Apart from unavailability, there are a few other situations in which elections are triggered, such as:

  • When we initiate any replica set
  • Adding new members to the replica set
  • Changing the configuration of the replica set using rs.reconfig() command.
  • Performing maintenance on a replica set by stepping down the member using rs.stepDown().
  • Loss of connectivity between primary and secondary members for more than configured time.

Until the election is completed and a new primary is elected, the replica set cannot accept write operations and can only serve reads, if configured for reading from secondaries. The following factors affect the election of a new primary:
  1. MongoDB introduced replication protocolVersion: 1 to reduce replica set failover time. With the earlier protocol version, the time taken to elect a new primary was higher.
  2. All members of a replica set send ping requests as heartbeats to every other member every two seconds. If a member does not receive a reply within 10 seconds, it assumes the other member is down and marks it as inaccessible.
  3. In the replica set configuration, members are assigned a priority which influences the election of the primary. Members with a higher priority are preferred as primary over members with a lower priority. Members with priority zero can never become primary. We can use this setting to control which member can become primary. For example, if we want a member in a particular data centre never to become primary, we can assign it priority zero. Arbiters always have priority zero.
  4. There might be situations when a complete data centre is down. In such cases, the ability of the replica set to elect a primary from another data centre may be affected.
  5. In case of a network partition, the primary may end up in the partition with a minority of the nodes. In that case the primary steps down to secondary, as it can see only a minority of the nodes. If another partition contains a majority of the members, a primary is elected from among them.
  6. A replica set can contain at most 7 voting members. These members participate in the election of the primary. We control this by choosing which members vote: each member's votes setting is either 1 (voting) or 0 (non-voting).
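
The majority rule behind the partition behaviour can be written down as simple arithmetic. The helpers below are a hypothetical sketch in plain JavaScript; they only count voting members and ignore priorities and oplog freshness, which also play a part in a real election.

```javascript
// Votes needed to elect a primary: more than half of all voting members.
function majorityNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Can a partition that sees `reachableVoters` voters elect or keep a primary?
function canElectPrimary(votingMembers, reachableVoters) {
  return reachableVoters >= majorityNeeded(votingMembers);
}

console.log(majorityNeeded(5));     // 3
console.log(canElectPrimary(5, 3)); // true
console.log(canElectPrimary(5, 2)); // false
```

With five voting members split 3/2 by a network partition, only the side with three members can have a primary, which is why a primary stranded with the minority steps down.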

Big data systems with large data sets or high throughput requirements usually challenge the capacity of a single server. For example, a large number of parallel queries can exhaust the CPU capacity of the server. Also, working sets larger than the RAM of the system can cause I/O bottlenecks and degrade disk performance. Such growth is generally handled by either vertical scaling or horizontal scaling.

Vertical Scaling

Here bottlenecks are handled by increasing the capacity of a single server by adding more RAM, a more powerful CPU or more storage. This works only up to a limit, as even the biggest server has limits on RAM, CPU and storage beyond which we cannot add capacity. Also, this scaling method is very expensive, as bigger servers cost much more than commodity servers.

Horizontal Scaling

Here bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU as the data is spread out. This also allows higher throughput, as the resources can work in parallel. We also get the benefit of comparatively lower cost due to the use of commodity servers.

MongoDB supports horizontal scaling through sharding, which lets it handle very large data sets and high-throughput operations. In sharding, data is distributed among several machines called shards.

A MongoDB sharded cluster consists of the following components:

shard: Application data in a MongoDB sharded cluster is stored in shards. Each shard holds a subset of the collection data, divided on the basis of the shard key, which we define when sharding the collection. Shards can be deployed as replica sets. A query performed directly against a single shard returns only a subset of the data. Applications usually should not connect to individual shards; such connections should be made only by administrators for maintenance purposes.

mongos: In a sharded cluster, applications should connect through mongos, which acts as a query router and as the interface between client applications and the sharded cluster. mongos fetches the metadata about which data is on which shard from the config servers and caches it. mongos then uses this metadata to route each query to the appropriate shard. We should have multiple mongos instances for redundancy, and they can be deployed either on separate servers or alongside the application servers. To reduce latency, it is recommended to deploy them on the application servers. The mongos instances use minimal server resources and do not have any persistent state.

config servers: All the metadata and configuration settings for the sharded cluster are stored in the config servers. The metadata shows which data is stored in which shard, the number of chunks, and the distribution of shard keys across the cluster. It is recommended to deploy the config servers as a replica set. If the config server replica set has no primary at some point, the cluster cannot perform metadata changes and its metadata becomes read-only for that period, so the config server replica set should be monitored and maintained just like the application data shards.
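
The routing role of mongos can be sketched with a small lookup table. The plain JavaScript below is an illustration under stated assumptions: the chunks array is hypothetical metadata of the kind a mongos would cache from the config servers, with a numeric shard key and chunk ranges of the form [min, max).

```javascript
// Hypothetical chunk metadata, as a mongos might cache it from the config servers.
// Each chunk covers shard key values in [min, max).
const chunks = [
  { min: -Infinity, max: 100, shard: "rs0" },
  { min: 100, max: 200, shard: "rs1" },
  { min: 200, max: Infinity, shard: "rs0" }
];

// Targeted query: the filter contains the shard key, so one shard is enough.
function routeByShardKey(value) {
  return chunks.find((c) => value >= c.min && value < c.max).shard;
}

// Without the shard key the router must ask every shard (scatter-gather).
function routeBroadcast() {
  return [...new Set(chunks.map((c) => c.shard))];
}

console.log(routeByShardKey(150)); // rs1
console.log(routeBroadcast());     // [ 'rs0', 'rs1' ]
```

Queries that include the shard key are therefore cheaper, since only the shard owning the matching chunk has to do any work.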

MongoDB sharded cluster has 3 components namely shards, mongos and config servers. We will deploy all components using the below process. 

  • Deploy shards as a replica set.

We need to start all the members of the replica sets with the --shardsvr option.

mongod --replSet "rs0" --shardsvr
mongod --replSet "rs1" --shardsvr

Suppose we have 2 shards with a 3-member replica set each; all 6 members should be started with the above option. These shard members are deployed as replica sets on the hosts (h1, h2, h3…. h6) at port 27017.

sh1 (M1, M2, M3 as replica set “rs0”) and sh2 (M4, M5, M6 as replica set “rs1”)

  • Deploy Config Server Replica Set 

We need to start all members of the config servers as a replica set with --configsvr

mongod --configsvr --replSet "cf1"

Config servers (members c1, c2 and c3 as replica set cf1) run on hosts h7, h8 at port 27017.

  • Deploy mongos.

Start the mongos, specifying the config server replica set name followed by a slash / and at least one of the config server hostnames and ports. Mongos is deployed on server h9 at port 27017.

mongos --configdb cf1/h7:27017,h8:27017,h9:27017
  • Add all shard replica sets (rs0 and rs1) to the cluster with the sh.addShard command from the mongos.
mongo h9:27017/admin
sh.addShard( "rs0/h1:27017,h2:27017,h3:27017" )
sh.addShard( "rs1/h4:27017,h5:27017,h6:27017" )
  • At this point we have sharded cluster ready, we can check the status using sh.status command.
  • At last, we need to enable sharding at the database level, create an index on the shard key and shard a collection on the indexed shard key.
mongo h9:27017/admin
sh.enableSharding( "test" )
use test
db.test_collection.createIndex( { a : 1 } )
sh.shardCollection( "test.test_collection", { "a" : 1 } )

Shard key selection is an important aspect of the sharded cluster as it affects the performance and overall efficiency of a cluster. Chunk creation and distribution among several shards is based on the choice of the shard key. Ideally shard key should allow MongoDB to distribute documents evenly across all the shards in the cluster.

There are three main factors that affect the selection of the shard key:

Cardinality

Cardinality refers to the number of distinct values for a given shard key. Ideally the shard key should have high cardinality, because the cardinality determines the maximum number of chunks that can exist in the cluster.

For example, suppose we have an application that is used only by members of a particular city and we are sharding on the state. We will have a maximum of one chunk, as both the upper and lower values of the chunk would be that state. And one chunk would only allow us to use one shard. Hence we need to ensure the shard key field has high cardinality.

If we cannot find a field with high cardinality, we can increase the cardinality of our shard key by creating a compound shard key. So in the above scenario, we can use a shard key combining state and name to ensure cardinality.

Frequency

Apart from having a large number of different values for our shard key, it is important that each value occurs with a similar frequency. If certain values occur more often than others, the load may not be distributed equally across the cluster. This limits the ability to scale reads and writes. For example, suppose the majority of people using our application have the last name ‘jones’; the throughput of our application would then be constrained by the shard holding those values. Chunks containing these values grow larger and larger and may become jumbo chunks. Jumbo chunks reduce the ability to scale horizontally as they cannot be split. To address such issues, we should choose a good compound shard key. In the above scenario, we can add _id as a second field so that each value of the compound shard key occurs with low frequency.

Rate of change of Shard key values

We should avoid shard keys on fields whose values are always increasing or decreasing. An example is ObjectId in MongoDB, whose value increases with each new document. In such a case, all our writes go to the same chunk, the one holding the upper bound of the key range. For monotonically decreasing values, writes go to the first chunk, the one holding the lower bound. We can still use an ObjectId in the shard key as long as it is not the first field.
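
Cardinality and frequency can be estimated from a sample of documents before committing to a shard key. The sketch below is plain JavaScript with made-up sample data; shardKeyStats is a hypothetical helper that reports the number of distinct key values and the share of documents held by the most common value.

```javascript
// Summarise a candidate shard key over a non-empty sample of documents:
// cardinality = number of distinct values,
// maxShare = share of the documents that carry the most common value.
function shardKeyStats(docs, keyFields) {
  const counts = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(keyFields.map((f) => doc[f]));
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const maxCount = Math.max(...counts.values());
  return { cardinality: counts.size, maxShare: maxCount / docs.length };
}

// Hypothetical sample data.
const people = [
  { _id: 1, state: "NY", lastName: "jones" },
  { _id: 2, state: "NY", lastName: "jones" },
  { _id: 3, state: "NY", lastName: "jones" },
  { _id: 4, state: "NY", lastName: "smith" }
];

console.log(shardKeyStats(people, ["state"]));           // { cardinality: 1, maxShare: 1 }
console.log(shardKeyStats(people, ["lastName"]));        // { cardinality: 2, maxShare: 0.75 }
console.log(shardKeyStats(people, ["lastName", "_id"])); // { cardinality: 4, maxShare: 0.25 }
```

A low cardinality or a maxShare close to 1 signals a poor candidate; adding a second field to form a compound key improves both numbers in this sample.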

To back up a sharded cluster we need to back up the config database and the individual shards.

  • Disable the balancer

First, we would need to disable the balancer from mongos. If we do not stop the balancer, the backup could duplicate data or omit data as chunks migrate while recording backups.

use config
sh.stopBalancer()
  • Lock one secondary member of each replica set

For each shard replica set in the sharded cluster, connect a mongo shell to the secondary member’s mongod instance and run db.fsyncLock().

db.fsyncLock()
  • Lock config server replica set secondary

Connect to secondary of config server replica set and run

db.fsyncLock()
  • Backup one config server and then unlock the member

Now we will back up the locked config server secondary. We are using mongodump for the backup, but we can also use any other method such as cp or rsync.

Once the backup is taken, we can unlock the member so that it resumes applying the oplog from the config primary.

mongodump --oplog
db.fsyncUnlock()
  • Back up a replica set member for each shard

Now we will back up the locked member of each shard. We are using mongodump for the backup, but we can also use any other method such as cp or rsync.

Once the backup is taken, we can unlock the member so that it resumes applying the oplog from the shard primary.

mongodump --oplog
db.fsyncUnlock()
  • Re-enable the balancer process

Once we have the backups of the config server and each shard, we re-enable the balancer by connecting to the config database.

use config
sh.setBalancerState(true)

We can broadly divide MongoDB authentication mechanisms into 2 parts: client/user authentication, which deals with how clients of the database authenticate to MongoDB, and internal authentication, which is how the members of replica sets or sharded clusters authenticate with each other.

  • Client/User authentication: MongoDB supports the following mechanisms for authenticating client access to the database.
SCRAM-SHA-1
MONGODB-CR
X.509
LDAP
KERBEROS

Community Editions – SCRAM-SHA-1, MONGODB-CR and X.509 are available with MongoDB community versions.

SCRAM-SHA-1 and MONGODB-CR are challenge/response mechanisms. From version 3.0, SCRAM-SHA-1 is the default authentication mechanism and has replaced MONGODB-CR.

With SCRAM-SHA-1, the client authenticates by sending the server a response to a challenge. The password is never sent in plain text, which protects against several kinds of attacks.

X.509 is a certificate-based authentication mechanism. It became an authentication option as of version 2.6. With X.509, we are required to have a TLS connection. MongoDB 3.2.6 or greater, is already compiled with TLS support. 

Enterprise Editions – LDAP and KERBEROS are only available with enterprise versions.

LDAP is a directory service protocol commonly used by companies. With LDAP authentication support, users can authenticate to MongoDB using their LDAP credentials. This makes LDAP an external authentication mechanism. This means that the actual credentials used to authenticate the client are not stored directly in MongoDB. LDAP wasn’t designed specifically for authentication but rather for storing metadata about users in an organization but is widely used as an authentication mechanism also.

Kerberos is an industry standard authentication protocol for large client-server systems. It is widely accepted to be a very secure authentication mechanism and was designed specifically for the purpose of authentication. 

Like LDAP, Kerberos is also an external authentication mechanism. This means that the actual credentials used to authenticate the client are not stored in MongoDB.

  • Internal Authentication: If our replica set or sharded cluster spans multiple data centres or touches the internet in any way, it's very important to enable internal authentication.

MongoDB currently supports two internal authentication mechanisms: keyfile authentication, which uses SCRAM-SHA-1, and X.509 authentication.

With keyfile authentication, the contents of keyfile essentially act as a shared password between the members of a replica set or sharded cluster. The same keyfile must be present on each member that talks to one another. 

X.509 is the other internal authentication mechanism. It uses certificates to authenticate members to one another. Although we can use the same certificate on all members, it is recommended to issue a different certificate to each member. This way, if one of the certificates is compromised, we only need to reissue and deploy that one certificate instead of updating the entire cluster.

It's important to note that whenever we enable internal authentication, either with X.509 or with keyfile based authentication, this automatically will enable client authentication.

There are a few key differences when setting up authentication on a sharded cluster. To set up authentication we should connect to mongos instead of mongod. Also, clients that want to authenticate to the sharded cluster must do so through mongos.

Ensure the sharded cluster has at least two mongos instances available, as the procedure requires restarting each mongos in the cluster. If the sharded cluster has only one mongos instance, this results in downtime during the period that the mongos is offline.

  • Generate a keyfile using any method of your choice. Copy the keyfile to each server hosting the sharded cluster members. Ensure that the user running the mongod or mongos instances is the owner of the file and can access the keyfile.
  • From mongos, create an administrative user with the clusterAdmin and userAdmin roles on the admin database.
db.createUser({
    user: "admin",
    pwd: "<password>",
    roles: [
      { role: "clusterAdmin", db: "admin" },
      { role: "userAdmin", db: "admin" }]});
  • Change the current mongos configuration file to one with keyfile authentication enabled.

security:
   transitionToAuth: true
   keyFile: <path-to-keyfile>

The new configuration file should contain all of the configuration settings previously used by the mongos as well as the new security settings.

  • Now restart all mongos instances, one at a time, with the new configuration file.
  • Now change the configuration file to enable keyfile authentication for all members of the config server replica set. First, all secondary nodes should be updated. To update the primary, force a failover so that it steps down to secondary, and then update its configuration file.
  • Now we will create the shard-local administrator for each shard. In a sharded cluster that enforces authentication, each shard replica set should have its own shard-local administrator. We cannot use a shard-local administrator for one shard to access another shard or the sharded cluster.

Connect to the primary member of each shard replica set and create a user with the db.createUser() method.

db.createUser({
    user: "admin1",
    pwd: "<password>",
    roles: [
      { role: "clusterAdmin", db: "admin" },
      { role: "userAdmin", db: "admin" }]});

This user can be used for maintenance activities on individual shards.

  • Now change the configuration file to enable keyfile authentication for all shards. First, all secondary nodes should be updated. To update the primary, force a failover so that it steps down to secondary, and then update its configuration file.

When deploying MongoDB in production, we should have a strategy for capturing and restoring backups in the case of data loss events. Below are the different backup options:

Back Up with Atlas

MongoDB Atlas, the official MongoDB cloud service, provides 2 fully-managed methods for backups:

Continuous Backups, which take incremental backups of data in your cluster, ensuring your backups are typically just a few seconds behind the operational system.

Cloud Provider Snapshots, which provide localized backup storage using the native snapshot functionality of the cluster’s cloud service provider.

Back Up with MongoDB Cloud Manager or Ops Manager

MongoDB Cloud Manager and Ops Manager provide backup, monitoring, and automation services for MongoDB. They support backing up and restoring MongoDB replica sets and sharded clusters from a graphical user interface.

Back Up by Copying Underlying Data Files

Back Up with Filesystem Snapshots

MongoDB can also be backed up with operating system features that are not specific to MongoDB. Point-in-time filesystem snapshots can be used for backup if the volume where MongoDB stores its data files supports snapshots.

Back Up with cp or rsync

MongoDB deployments can also be backed up with the system commands cp or rsync if the storage system does not support snapshots. It is recommended to stop all writes to the mongod before copying the database files, because copying multiple files is not an atomic operation.

Back Up with mongodump

mongodump is a utility that backs up a MongoDB database as BSON files. The backup files can then be used by the mongorestore utility to restore the data into another database. mongodump reads the data page by page, which takes a lot of time, so it is not recommended for large deployments.

Encryption plays a key role in securing any production environment. MongoDB offers encryption at-rest as well as transport encryption.

Transport encryption encrypts the network traffic between the client and the server. MongoDB supports TLS/SSL (Transport Layer Security/Secure Sockets Layer) to encrypt all of MongoDB’s network traffic. TLS/SSL ensures that MongoDB network traffic is only readable by the intended client.

Encryption at rest encrypts the data on disk. This can be achieved by encrypting either at the storage engine level or at the application level. Application-level encryption is done on the application side and is similar to the masking traditionally done in RDBMS.

Encrypted Storage Engine

MongoDB Enterprise 3.2 introduces a native encryption option for the WiredTiger storage engine. This allows MongoDB to encrypt data files such that only parties with the decryption key can decode and read the data.

The data encryption process includes:

  • Generating a master key.
  • Generating keys for each database.
  • Encrypting data with the database keys.
  • Encrypting the database keys with the master key.

The encryption occurs transparently in the storage layer; i.e. all data files are fully encrypted from a file system perspective, and data only exists in an unencrypted state in memory and during transmission.
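The four steps above follow the general "envelope encryption" pattern, which can be sketched with Node.js's built-in crypto module. This is only an illustration of the key hierarchy, not WiredTiger's actual implementation; the helper names, the sample document and the choice of AES-256-GCM are assumptions made for the example.

```javascript
const crypto = require("crypto");

// Encrypt a Buffer with AES-256-GCM; returns the IV, auth tag and ciphertext.
function encrypt(key, plaintext) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// Reverse of encrypt(); throws if the key or the ciphertext is wrong.
function decrypt(key, box) {
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, box.iv);
  decipher.setAuthTag(box.tag);
  return Buffer.concat([decipher.update(box.data), decipher.final()]);
}

const plaintext = '{"name":"XYZ Tech"}';          // hypothetical document

const masterKey = crypto.randomBytes(32);         // 1. generate a master key
const dbKey = crypto.randomBytes(32);             // 2. generate a key for one database
const encryptedData = encrypt(dbKey, Buffer.from(plaintext)); // 3. encrypt data with the database key
const wrappedDbKey = encrypt(masterKey, dbKey);   // 4. encrypt the database key with the master key

// Read path: unwrap the database key with the master key, then decrypt the data.
const unwrappedDbKey = decrypt(masterKey, wrappedDbKey);
const roundTrip = decrypt(unwrappedDbKey, encryptedData).toString();
console.log(roundTrip === plaintext);
```

Only the wrapped database key and the encrypted data need to be stored; anyone without the master key cannot recover either of them.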

Application Level Encryption

Application Level Encryption provides encryption on a per-field or per-document basis within the application layer. To encrypt document or field level data, write custom encryption and decryption routines or use a commercial solution.

The MongoDB balancer is a background process that monitors the number of chunks on each shard. When the number of chunks on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate chunks between shards and reach an equal number of chunks per shard.

All chunk migrations use the following procedure:

  1. The balancer sends the moveChunk command to the source shard.
  2. The source shard starts the move with an internal moveChunk command. During the migration, all operations on the chunk are still routed to the source shard, which also handles all incoming writes for the chunk.
  3. All required indexes are built on the destination shard.
  4. Once the indexes are built, the destination shard starts requesting the documents in the chunk from the source shard and receives copies of them.
  5. After receiving the final document in the chunk, the destination shard runs a synchronization step so that all changes made during the migration are also migrated.
  6. Once synchronization is complete, the source shard updates the cluster metadata in the config database with the new location of the chunk.
  7. Finally, the source shard verifies that the cluster metadata has been updated with the new chunk location and, once verified, deletes its copy of the migrated documents.

The MongoDB WiredTiger storage engine uses both the WiredTiger internal cache and the filesystem cache for storing data. If we do not set the WiredTiger internal cache size, by default it uses the larger of 256 MB or 50% of (RAM - 1 GB). For example, on a system with a total of 6 GB of RAM, 2.5 GB (50% of (6 GB - 1 GB)) will be allocated to the WiredTiger internal cache. This default assumes that only one mongod process is running on the server. If we run multiple MongoDB instances on the same server, we should decrease the WiredTiger internal cache size of each to accommodate the other instances.
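The default sizing rule can be checked with a few lines of JavaScript. This is a minimal sketch of the formula stated above; the function name is made up for the example and sizes are expressed in GB.

```javascript
// Default WiredTiger internal cache size:
// the larger of 256 MB (0.25 GB) or 50% of (RAM - 1 GB).
function defaultWiredTigerCacheGB(totalRamGB) {
  return Math.max(0.25, 0.5 * (totalRamGB - 1));
}

console.log(defaultWiredTigerCacheGB(6)); // 2.5
console.log(defaultWiredTigerCacheGB(1)); // 0.25 (the 256 MB floor applies)
```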

WiredTiger also provides compression options for both collections and indexes by default. While snappy compression is used for collections, prefix compression is used for all indexes. We can set the compression at the database as well as collection and index level.

The WiredTiger internal cache and the filesystem cache differ in how their data representation relates to the on-disk format.

  1. Data in the filesystem cache is stored in the same form as the on-disk format, including the compression of the data files. The operating system uses the filesystem cache to reduce disk I/O.
  2. Indexes loaded in the WiredTiger internal cache have a different representation from the on-disk format, but they still take advantage of index prefix compression to reduce RAM usage.
  3. Collection data in the WiredTiger internal cache also uses a different representation from the on-disk format. This data is uncompressed, which allows it to be manipulated by the server, whereas the on-disk format uses block compression, which provides significant storage savings.

All free memory that is not used by wiredTiger cache or by any other process is automatically used by MongoDB filesystem cache.

Any query on a sharded cluster goes through mongos, which uses the metadata from the config database to find out how the chunks are distributed.

These queries fall broadly into 2 groups:

Scatter gather queries:

Scatter-gather queries are the ones that do not include the shard key. Without the shard key, mongos does not know which shard to send the query to, so it sends the query to all shards in the cluster. These queries are generally inefficient and are not feasible for routine operations on large clusters.

Targeted queries:

If a query includes the shard key, mongos directs the query only to the specific shards that hold the matching shard key ranges. These queries are very efficient.

Now, in this case, we have a query with the shard key condition 15000<=employeeid<=70000, which covers a subset of the data in the cluster, so it is a targeted query. Any shard holding employee ids within this range will be queried. From the above sample, we can see that the shards below fall within this range and will all be accessed by the query.

  • Shard0000
  • Shard0002
  • Shard0003
  • Shard0004
  • Shard0005
  • Shard0006
  • Shard0007
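How a range on the shard key maps to a set of shards can be sketched in plain JavaScript. The chunk map below is entirely hypothetical (it is not the sample referred to above), and the function is a simplified model of the routing decision, not the actual mongos code.

```javascript
// Hypothetical chunk map: each chunk covers [min, max) of the shard key employeeid.
const chunks = [
  { min: -Infinity, max: 10000, shard: "shardA" },
  { min: 10000, max: 40000, shard: "shardB" },
  { min: 40000, max: 80000, shard: "shardC" },
  { min: 80000, max: Infinity, shard: "shardD" },
];

// Return the shards whose chunk ranges overlap the inclusive query range [lo, hi].
function targetShards(chunkMap, lo, hi) {
  const shards = new Set();
  for (const c of chunkMap) {
    if (c.min <= hi && lo < c.max) shards.add(c.shard);
  }
  return [...shards];
}

// Targeted query: the shard key range narrows the set of shards.
const targeted = targetShards(chunks, 15000, 70000);
// Scatter-gather query: without a shard key, every shard has to be asked.
const scatterGather = [...new Set(chunks.map((c) => c.shard))];

console.log(targeted);      // [ 'shardB', 'shardC' ]
console.log(scatterGather); // [ 'shardA', 'shardB', 'shardC', 'shardD' ]
```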

If MongoDB cannot split a chunk that exceeds the specified chunk size or contains a number of documents that exceeds the max, MongoDB labels the chunk as jumbo. If the chunk size no longer hits the limits, MongoDB clears the jumbo flag for the chunk when the mongos reloads or rewrites the chunk metadata.

But in some cases we need to follow the process below to clear the jumbo flag manually:

Divisible Chunks

If the chunk is divisible, MongoDB removes the flag upon successful split of the chunk.

Process

  • Connect to mongos and run sh.status(true) looking for jumbo chunks.

Below output from sh.status(true) shows that chunk with shard key range { "x" : 2 } -->> { "x" : 4 } is jumbo.

--- Sharding Status ---
 ..................
 ..................
test.foo
           shard key: { "x" : 1 }
        chunks:
             shard-b  2
             shard-a  2
        { "x" : { "$minKey" : 1 } } -->> { "x" : 1 } on : shard-b Timestamp(2, 0)
        { "x" : 1 } -->> { "x" : 2 } on : shard-a Timestamp(3, 1)
        { "x" : 2 } -->> { "x" : 4 } on : shard-a Timestamp(2, 2) jumbo
        { "x" : 4 } -->> { "x" : { "$maxKey" : 1 } } on : shard-b Timestamp(3, 0)
  • Split the jumbo chunk using sh.splitAt()
sh.splitAt( "test.foo", { x: 3 })

MongoDB removes the jumbo flag upon successful split of the chunk.

Indivisible Chunks

In some instances, MongoDB cannot split the no-longer jumbo chunk, such as a chunk with a range of single shard key value, and the preferred method to clear the flag is not applicable. 

Process

  • Stop the balancer.
  • Create a backup of config database.
mongodump --db config --port <config server port> --out <output file>
  • Connect to mongos and check for jumbo chunks using sh.status
  • Update chunks collection.

In the chunks collection of the config database, unset the jumbo flag for the chunk. For example,

db.getSiblingDB("config").chunks.update(
   { ns: "test.foo", min: { x: 2 }, jumbo: true },
   { $unset: { jumbo: "" } }
)
  • Clear the cached routing information.

After the jumbo flag has been cleared out from the chunks collection, update the cluster routing metadata cache.

db.adminCommand( { flushRouterConfig:  "test.foo" } )

Monitoring is a critical component of all database administration. A firm grasp of MongoDB’s reporting will allow us to assess the state of the database and maintain deployment without crisis. 

Below are some of the utilities used for MongoDB monitoring.

  • mongostat

The mongostat utility provides a quick overview of the status of a currently running mongod or mongos instance. mongostat is functionally similar to the UNIX/Linux file system utility vmstat but provides data regarding mongod and mongos instances.

To run mongostat, the user must have the serverStatus privilege action on the cluster resource.

E.g., to have mongostat report every 2 minutes, the command below can be used.

mongostat 120
  • mongotop

mongotop provides a method to track the amount of time a mongod instance spends reading and writing data. mongotop provides statistics at a per-collection level. By default, mongotop returns values every second.

E.g., to have mongotop report every 30 seconds, the command below can be used.

mongotop 30
  • Commands

MongoDB includes a number of commands that report on the state of the database.

  • serverStatus

The serverStatus command, or db.serverStatus() from the shell, returns a general overview of the status of the database, detailing disk usage, memory use, connections, journaling, and index access. The command returns quickly and does not impact MongoDB performance.

  • dbStats

The dbStats command, or db.stats() from the shell, returns a document that addresses storage use and data volumes. The dbStats reflect the amount of storage used, the quantity of data contained in the database, and the object, collection, and index counters.

We can use this data to monitor the state and storage capacity of a specific database. This output also allows us to compare usage between databases and to determine the average document size in a database.

  • collStats

The collStats command, or db.collection.stats() from the shell, provides statistics that resemble dbStats at the collection level, including a count of the objects in the collection, the size of the collection, the amount of disk space used by the collection, and information about its indexes.

  • replSetGetStatus

The replSetGetStatus command (rs.status() from the shell) returns an overview of replica set’s status. The replSetGetStatus document details the state and configuration of the replica set and statistics about its members.

This data can be used to ensure that replication is properly configured, and to check the connections between the current host and the other members of the replica set.
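As a sketch of how this output can be used, the function below derives the replication lag of each secondary from a status document. The document here is a hypothetical, heavily trimmed stand-in for real rs.status() output (host names and dates are made up); only the members, name, stateStr and optimeDate fields are used.

```javascript
// Hypothetical, trimmed rs.status()-style document.
const status = {
  set: "rs0",
  members: [
    { name: "host1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T10:00:10Z") },
    { name: "host2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T10:00:08Z") },
    { name: "host3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T10:00:10Z") },
  ],
};

// Seconds each secondary is behind the primary's last applied operation.
function replicationLagSeconds(rsStatus) {
  const primary = rsStatus.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary, e.g. during an election
  const lag = {};
  for (const m of rsStatus.members) {
    if (m.stateStr === "SECONDARY") {
      lag[m.name] = (primary.optimeDate - m.optimeDate) / 1000;
    }
  }
  return lag;
}

const lag = replicationLagSeconds(status);
console.log(lag); // { 'host2:27017': 2, 'host3:27017': 0 }
```

In the shell, the same function could be given the result of rs.status() directly.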

  • Ops-manager/Cloud-manager

Apart from the above tools MongoDB also provides an option for GUI based monitoring with ops-manager and cloud-manager. These are very efficient and are mostly used in large enterprise environments.

Security is very important for any production database. MongoDB provides a set of best practices to harden our MongoDB deployment. This list should act as a security checklist before we give the green light to any production deployment.

  • We should enable authentication for our deployment. All clients should be required to authenticate before they can access the MongoDB server. Methods like SCRAM-SHA-1, certificate-based authentication or LDAP can be enabled. It is important to enable authentication on every MongoDB server, because any server left unprotected could become a point of access for intruders.

  • We should enable authorization via the role-based access control (RBAC) model for our deployments. An administrative user should be created first and used to configure the other users. We should have a unique user for each person and application that accesses the database. These users should follow the principle of least privilege, meaning users should not have access that they do not need. As a best practice, we should group common access privileges into roles and then assign these roles to individual users or groups.
  • We should enable encryption for our deployment. All connections, whether from clients or between nodes, should go through TLS/SSL. Apart from encrypting communication, it is important to encrypt data at rest using the native MongoDB encryption available for WiredTiger. It is also important to rotate the encryption keys, using KMIP or another key-management protocol. Protecting the MongoDB data files by assigning appropriate filesystem permissions is important as well.

  • We can significantly improve the security of a MongoDB deployment with a strong network security process. Firewalls should be configured to control access to our MongoDB systems. On cloud deployments, proper VPCs/VPNs should be configured. We should use firewalls to limit network traffic to specific systems on the given port, so that only traffic from trusted sources reaches the mongod or mongos instances. In addition, the bindIp setting (--bind_ip on the command line) restricts the network interfaces and IP addresses on which a mongod or mongos instance listens for connections.

  •  It is important to audit any kind of database configuration changes. Sometimes it may be required to audit changes in data within a database. It should be noted that there are performance implications in enabling auditing.

  • We should not run our MongoDB processes as the root user. A dedicated user should be created for this purpose.

  • MongoDB should be run with secure configuration options. The HTTP status interface and REST API must be disabled (on versions before 3.6, which still include them). Also, if we are not using operations like mapReduce, group, and $where, server-side scripting should be disabled. This protects MongoDB from malicious JavaScript attacks.

The balancer is a background process that runs on the primary of the config server replica set in a cluster. It constantly monitors the number of chunks on each shard, and if the number of chunks on a specific shard exceeds the migration threshold, it tries to automatically migrate chunks between shards so that there is an equal number of chunks per shard. The balancer migrates chunks from shards with more chunks to shards with fewer chunks. For example, suppose we have 2 shards [shard01, shard02] with 4 and 5 chunks respectively, and we need to add another shard [shard03]. Initially, shard03 will have no chunks. The balancer will notice this uneven distribution and migrate chunks from shard01 and shard02 to shard03 until all 3 shards have three chunks each.
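The example with shard01, shard02 and shard03 can be reproduced with a small simulation. This is a deliberately simplified model (move one chunk at a time from the fullest shard to the emptiest one), not the balancer's real algorithm, and the threshold of 2 is an illustrative value chosen for the sketch.

```javascript
// Simplified model: repeatedly move one chunk from the shard with the most
// chunks to the shard with the fewest, until their counts differ by less
// than `threshold`.
function rebalance(chunkCounts, threshold = 2) {
  const counts = { ...chunkCounts };
  let migrations = 0;
  for (;;) {
    const names = Object.keys(counts);
    const most = names.reduce((a, b) => (counts[b] > counts[a] ? b : a));
    const least = names.reduce((a, b) => (counts[b] < counts[a] ? b : a));
    if (counts[most] - counts[least] < threshold) break;
    counts[most] -= 1;
    counts[least] += 1;
    migrations += 1;
  }
  return { counts, migrations };
}

const result = rebalance({ shard01: 4, shard02: 5, shard03: 0 });
console.log(result.counts);     // { shard01: 3, shard02: 3, shard03: 3 }
console.log(result.migrations); // 3
```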

Chunk migrations carry some overhead in terms of bandwidth and workload, which can impact database performance. To minimize this impact, the balancer:

  1. Lets a shard participate in only one chunk migration at a given time, so multiple migrations involving the same shard happen one after the other. From MongoDB 3.4 onward, migrations between different pairs of shards can run in parallel: a cluster with n shards can perform at most n/2 (rounded down) simultaneous chunk migrations, so a cluster with 4 shards can run 2 at once.
  2. Kicks off a balancing round only when the difference in the number of chunks between the shard with the most chunks and the shard with the fewest chunks reaches the migration threshold.

Impact of Adding and Removing Shards on a balancer

Adding a shard to the cluster or removing one from it creates an imbalance, because either the new shard has no chunks or the chunks of the removed shard need to be redistributed across the cluster. When the balancer notices this imbalance, it starts the chunk migration process immediately. If a shard is removed from a cluster with an uneven chunk distribution, the balancer first moves the chunks off the draining shard and only then balances the remaining uneven chunks. The migration process takes time to complete.

MongoDB creates oplog entries for each operation on the primary, and these are then replicated to the secondaries. MongoDB uses asynchronous replication and automatic failover to do this efficiently.

Asynchronous Replication

Oplog entries from the primary are applied to the secondaries asynchronously. This helps applications continue without downtime despite the failure of members. MongoDB deployments usually run on commodity servers, and with synchronous replication on such servers the latency of waiting for each acknowledgement is on the order of 100 ms, which is quite high. For this reason, MongoDB prefers asynchronous replication.

From version 4.0.6, MongoDB provides the capability to log entries of slow oplog operations for secondary members of a replica set. These slow oplog messages are logged for the secondaries in the diagnostic log under the REPL component. These slow oplog entries do not depend on log levels or profiling level but depend only on the slow operation threshold. The profiler does not capture slow oplog entries.

Automatic Failover

Many traditional databases follow a master-slave setup, but in case of a master failure we have to cut over to a slave database manually. In MongoDB, we can have one primary with multiple secondaries. With only a few servers we could still afford a manual cutover, but a large MongoDB deployment may have 100 shards, and it is impractical to cut over manually every time. So MongoDB has automatic failover. When the primary is unable to communicate with the other members for longer than the configured time (electionTimeoutMillis), an eligible secondary calls an election to nominate itself as primary. Until the new primary is elected, the cluster cannot serve write requests and can only serve read requests that are configured to run on secondaries. Once the new primary is elected, the cluster resumes normal operations.

The architecture of the cluster should be designed keeping in mind network latency and the time required for replica sets to complete elections, as both affect how long our cluster runs without a primary.

Indexes help improve the performance of queries. Without indexes, a query must perform a collection scan, in which every document of the collection is examined for the desired query result. With proper indexes, we can limit the number of documents scanned and thus improve query performance.

Like collections, indexes also use storage, as they store a small portion of the collection’s data. For example, an index on the field ‘name’ stores the values of that field in ascending or descending order, which also helps sort operations. Using indexes, we can satisfy equality matches and range-based queries more efficiently.

Some of the different index options available for MongoDB are:

_id Index

By default, MongoDB creates an index on the _id field when a collection is created. This is a unique index and prevents applications from inserting two documents with the same _id value. MongoDB does not allow this index to be dropped.

Single field and compound index

These are indexes on either a single field or a combination of fields.

For example:

db.records.createIndex( { score: 1 } )                // index on the single field "score"
db.products.createIndex( { "item": 1, "stock": 1 } )  // compound index on "item" and "stock"

Multikey Index

MongoDB provides the option of creating an index on the contents stored in arrays. For every element of the array, a separate index entry is created. We can select matching elements of the array using multikey indexes more efficiently.
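The "one index entry per array element" idea can be sketched in plain JavaScript. This is a conceptual model only, with hypothetical documents and a made-up function name; it does not reflect how the storage engine actually lays out index entries.

```javascript
// Sketch: list the entries a multikey index on `field` would conceptually hold,
// one per array element (a scalar value produces a single entry).
function multikeyEntries(docs, field) {
  const entries = [];
  for (const doc of docs) {
    const value = doc[field];
    const keys = Array.isArray(value) ? value : [value];
    for (const key of keys) entries.push({ key, _id: doc._id });
  }
  // Index entries are kept in key order.
  return entries.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const docs = [
  { _id: 1, tags: ["db", "nosql"] },
  { _id: 2, tags: ["db"] },
];

const entries = multikeyEntries(docs, "tags");
// A lookup for "db" only needs to visit the matching entries.
const idsForDb = entries.filter((e) => e.key === "db").map((e) => e._id);
console.log(entries.length); // 3
console.log(idsForDb);       // [ 1, 2 ]
```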

Geospatial Index

MongoDB also provides geospatial indexes, which help to query geospatial coordinate data efficiently: 2d indexes for planar geometry and 2dsphere indexes for spherical geometry.

Text Indexes

To support searching string content in a collection, MongoDB provides text indexes. These indexes store only root words and ignore language-specific stop words such as ‘the’ and ‘a’.

Partial Indexes

Partial indexes index only the documents in a collection that match a specified filter expression. Since they cover only a subset of the documents, they have lower storage requirements, and the cost of creating and maintaining them is also lower.
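Which documents end up in a partial index can be illustrated as follows. The collection, field names and filter are hypothetical, and matchesFilter is a toy evaluator that only understands equality and $gt, written for this sketch.

```javascript
// Hypothetical index definition. In the shell it would be created with
// db.restaurants.createIndex(indexSpec.keys, indexSpec.options).
const indexSpec = {
  keys: { cuisine: 1 },
  options: { partialFilterExpression: { rating: { $gt: 5 } } },
};

// Toy filter evaluator: supports only equality and $gt on top-level fields.
function matchesFilter(doc, filter) {
  return Object.entries(filter).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object" && "$gt" in cond) {
      return doc[field] > cond.$gt;
    }
    return doc[field] === cond;
  });
}

const restaurants = [
  { _id: 1, cuisine: "thai", rating: 8 },
  { _id: 2, cuisine: "thai", rating: 3 },
  { _id: 3, cuisine: "greek" }, // no rating field, so it does not match
];

// Only documents matching the filter expression get an index entry.
const indexedIds = restaurants
  .filter((d) => matchesFilter(d, indexSpec.options.partialFilterExpression))
  .map((d) => d._id);
console.log(indexedIds); // [ 1 ]
```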

Sparse Indexes

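A sparse index contains entries only for documents that have the indexed field, even if the value of that field is null. Documents that do not contain the field are skipped, so the index does not cover every document of the collection.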

TTL Indexes

Certain applications require documents to be removed automatically after a certain amount of time. We can achieve this using TTL indexes. We specify a TTL (time to live) for the documents, after which a background process removes them. This index is ideal for logs, session data and event data, as such data only needs to persist for a limited time.
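The expiry rule can be sketched as a small check: a document becomes eligible for removal once the value of its indexed date field plus expireAfterSeconds lies in the past. The field name createdAt, the one-hour TTL and the sample documents are assumptions made for this example; the actual removal is done by the server's background process, not by application code.

```javascript
// Index definition in the shape accepted by createIndex(keys, options);
// the field name "createdAt" is hypothetical.
const ttlIndex = { keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } };

// True once indexedDate + expireAfterSeconds <= now.
function isExpired(doc, index, now) {
  const field = Object.keys(index.keys)[0];
  const value = doc[field];
  if (!(value instanceof Date)) return false; // non-date values never expire
  return value.getTime() + index.options.expireAfterSeconds * 1000 <= now.getTime();
}

const now = new Date("2024-01-01T12:00:00Z");
const sessions = [
  { _id: "a", createdAt: new Date("2024-01-01T10:30:00Z") }, // older than 1 hour
  { _id: "b", createdAt: new Date("2024-01-01T11:30:00Z") }, // still within 1 hour
  { _id: "c", createdAt: "not a date" },
];

const expiredIds = sessions.filter((s) => isExpired(s, ttlIndex, now)).map((s) => s._id);
console.log(expiredIds); // [ 'a' ]
```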

It is important to maintain data consistency in any database especially when multiple applications are accessing the same piece of data simultaneously. MongoDB uses locking and other concurrency control measures to ensure consistency. Multiple clients can read and write the same data while ensuring that all writes to single document either occur in full or not at all so that clients never see inconsistent data.

Effect of sharding on concurrency

In sharding, collections are distributed among several shard servers, which improves concurrency. The mongos process routes multiple operations concurrently to different shards and finally combines the results before sending them back to the client.

In a sharded cluster locking is at individual shard level rather than cluster level so the operations in one shard do not block other shard operations. Each shard uses its own locks independent of other shards in the cluster.

Effect of replication on concurrency

  • Primary

In a MongoDB replica set, each operation on the primary is also written to a special capped collection in the local database called the oplog. So every time an application writes to MongoDB, both databases are locked, i.e. the database containing the collection and the local database. Both must be locked at the same time to keep the database consistent and to ensure that, even with replication, write operations keep their “all-or-nothing” behavior.

  • Secondary

In MongoDB replication, the application does not write to the secondaries; instead, the secondaries receive writes from the primary in the form of oplog entries. These entries are not applied one at a time: they are collected in batches, and the batches are applied in parallel, while writes are still applied in the order in which they appear in the oplog. While oplog entries are being applied, the secondary does not allow reads of the data being applied, to maintain consistency.

MongoDB uses replication to provide high availability and redundancy, which are the basis for any production database. With replica sets, we can achieve both HA and DR capability. Replication also enables us to scale out on commodity servers instead of relying on enterprise servers. With a proper configuration, we can avoid downtime even if an entire data center goes down.

There are several types of replica members based on the requirement:

  1. Primary: This is the member that accepts all writes from the application. If the primary goes down, a new primary is elected, which then accepts all writes. MongoDB applies write operations on the primary and then records them in the primary’s oplog.
  2. Secondary: A secondary maintains a copy of the primary’s data set. To replicate data, a secondary applies operations from the primary’s oplog to its own data set in an asynchronous process. A replica set can have one or more secondaries.
  3. Arbiter: An arbiter does not hold a copy of the data, so it can never become primary. It only votes in elections for primary when needed; it is there to maintain a quorum.
  4. Hidden Replica Set Members: A hidden member maintains a copy of the primary’s data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set. They cannot become primary.
  5. Delayed Replica Set Members: Delayed members contain copies of a replica set’s data set. However, a delayed member’s data set reflects an earlier, or delayed, state of the set. They act as a “rolling backup” or a running “historical” snapshot of the data set, which may help you recover from various kinds of human error.
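The quorum that an arbiter helps maintain comes down to simple arithmetic: electing a primary needs the votes of a strict majority of the voting members. The sketch below, with function names invented for the example, shows why adding an arbiter to a two-member set helps.

```javascript
// Votes needed to elect a primary: a strict majority of the voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Number of voting members that can be lost while a primary can still be elected.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

// Primary + secondary only: losing either member leaves no majority.
console.log(faultTolerance(2)); // 0
// Primary + secondary + arbiter: one member can fail and an election still succeeds.
console.log(faultTolerance(3)); // 1
```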

We can change the configuration of the replica set as per the requirements of the application. Configuration changes may include adding a new member, adding an arbiter, removing a member, changing the priority or votes of members, or changing a member from a normal secondary to a hidden or delayed member.

To add a new member, we first need to start the mongod process with the --replSet option on the new server.

  • To add new secondary

rs.add({ host: "<hostname>:<port>" })

Once added, the member fetches the data from the primary using an initial sync and then keeps up through ongoing replication.

  • To add Arbiter 

rs.addArb("<hostname>:<port>")

  • To remove a member

rs.remove("<hostname>:<port>")

As a good practice, we should shut down the member being removed before running the above command.

  • The above steps can also be performed with the command below by providing a new configuration.

rs.reconfig(<new config>)

Reconfiguration is easier to explain with an example. Suppose we have a replica set “rs0” with the configuration below.

  • To change the priority of member 1:

From Primary:

cfg = rs.conf();

cfg.members[1].priority = 2;

rs.reconfig(cfg);

  • To change the Votes of member 2:

cfg = rs.conf();

cfg.members[2].votes = 0;

rs.reconfig(cfg);

  • To change a current secondary member into a delayed member with a 1-hour delay:

cfg = rs.conf()

cfg.members[n].priority = 0

cfg.members[n].hidden = true

cfg.members[n].slaveDelay = 3600   // delay in seconds; the field is named secondaryDelaySecs from MongoDB 5.0

rs.reconfig(cfg)

  • To change current secondary member to hidden member

cfg = rs.conf()

cfg.members[n].priority = 0

cfg.members[n].hidden = true

rs.reconfig(cfg)

Description

MongoDB is an open-source NoSQL database that uses a document-oriented data model and a non-structured query language. It overcomes one of the biggest pitfalls of traditional database systems, namely scalability. MongoDB is used by some of the biggest companies in the world and offers a distinctive set of features for handling unstructured data.

MongoDB is used by companies across multiple domains. The research found that 26,929 companies are using it. These companies are most often found in the United States and in the Computer Software industry, and typically have 10-50 employees and a revenue of 1 million to 10 million dollars.

There is a huge demand for professionals who are qualified and certified in both the basics and the advanced features of MongoDB, and they can expect a promising career. Organizations around the world are using MongoDB to meet the fast-changing requirements of their customers.

The MongoDB interview questions and answers are prepared by experienced industry experts and can prove very useful for newcomers as well as experienced professionals who want to become a MongoDB Developer. These interview questions on MongoDB will help you strengthen your technical skills, prepare for a new job test and quickly revise the concepts. Going through them will give you in-depth knowledge and help you ace your MongoDB interview.

To relieve you of the worry and burden of preparing for your upcoming interviews, we have compiled the above MongoDB interview questions with answers prepared by industry experts.

Learning MongoDB will definitely give a boost to your career, because the demand for MongoDB in the market is increasing at a tremendous pace. All the best!
