MongoDB basics and cluster administration
Introduction
We may feel confident about concepts while learning them, but when we want to put them into practice, we often need a quick reference for confirmation. Been there? This blog is a quick brush-up on a few concepts of MongoDB basics and cluster administration.
Overview
In MongoDB, a database stores collections (the equivalent of tables), and a collection stores individual records called documents. The combination of a database name and a collection name (for example, movies.credits) is called a namespace.
- Security can be established at various levels: we can authorize at the database level or the collection level, but document-level authorization is not supported.
- Schema - Provides the list of fields, their datatypes, and a summary of the range of values for each field. It can be defined using JSON Schema. For a detailed overview on schemas, check MongoDB schemas.
- Supported datatypes include String, Integer, Boolean, Double, Min/Max keys, Array, Timestamp, Object, Null, Symbol, Date, ObjectId, Binary data, Code and Regular expression.
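As an illustration, a JSON Schema can be attached to a collection as a validator at creation time. This is a minimal sketch in the mongo shell; the collection and field names below are made up:

```javascript
// Create a hypothetical "users" collection that validates documents with $jsonSchema.
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "age"],
      properties: {
        name: { bsonType: "string", description: "name must be a string" },
        age: { bsonType: "int", minimum: 0, description: "age must be a non-negative integer" }
      }
    }
  }
})
```

With this in place, inserts that violate the schema (for example, a missing name or a negative age) are rejected by the server.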
Frequently used operations
- insertOne() - Inserts a document/record into a collection. An _id is auto-generated so that each record is unique; it can also be supplied explicitly.
- insertMany() - Used for inserting multiple documents at the same time as an array.
- updateOne() - Updates the first document that matches the condition.
- updateMany() - Updates all the documents matching the filter.
- For updates, a flag called upsert can be used, which means update the document if it already exists, else insert a new one.
- $set, $unset, $min, $max, $currentDate, $rename, $inc, $mul and $setOnInsert are a few update operators.
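A quick sketch of these write operations in the mongo shell (the movies collection and its fields are hypothetical):

```javascript
db.movies.insertOne({ title: "The Big Lebowski", year: 1998 })   // _id auto-generated
db.movies.insertMany([{ title: "Crimson Tide" }, { title: "True Grit" }])

// Update with operators; upsert: true inserts the document if no match is found.
db.movies.updateOne(
  { title: "The Big Lebowski" },
  { $set: { genre: "Comedy" }, $inc: { views: 1 } },
  { upsert: true }
)
```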
- find() - Can be used to filter the data. To filter on a nested document, a dot (.) should be used. For example, {"wind.direction.angle": 20}, where direction resides inside wind and angle resides inside direction.
- db.movies.find({"cast" : ["Jeff Bridges", "Tim Robbins"]}) - finds documents whose cast array is exactly these two names in this order (use $all to match both names regardless of order).
- db.movies.find({"cast" : "Jeff Bridges"}) - finds documents having "Jeff Bridges" anywhere in the cast array.
- db.movies.find({"cast.0" : "Jeff Bridges"}) - an array index can also be specified, so only that position in the array is checked by the filter.
- Projection is a concept in MongoDB where we return only particular fields from the queried result set.
- db.movies.find({ genre: "Adventure" }, { name: 1, year: 1 }) is the same as: SELECT _id, name, year FROM movies WHERE genre = "Adventure"
- 1 - include the field; 0 - exclude the field from the result set.
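The query and projection behaviours above can be sketched together (hypothetical collections and fields):

```javascript
db.data.find({ "wind.direction.angle": 20 })          // dot notation into nested documents
db.movies.find({ cast: "Jeff Bridges" })              // matches any element of the array
db.movies.find({ "cast.0": "Jeff Bridges" })          // matches only the first element
db.movies.find(
  { genre: "Adventure" },
  { name: 1, year: 1 }                                 // projection: _id, name and year only
)
```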
- deleteOne() - Deletes the first record that matches the filter.
- deleteMany() - Deletes all the records that match the filter.
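For example (filters are made up for illustration):

```javascript
db.movies.deleteOne({ title: "True Grit" })        // removes the first matching document
db.movies.deleteMany({ year: { $lt: 1950 } })      // removes every matching document
```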
- User management
- db.createUser();
- db.dropUser();
- Database management
- db.dropDatabase();
- db.createCollection();
- db.serverStatus();
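A rough sketch of these management helpers (user, password and collection names are placeholders):

```javascript
db.createUser({ user: "appUser", pwd: "changeMe", roles: [{ role: "readWrite", db: "movies" }] })
db.dropUser("appUser")

db.createCollection("logs")        // explicitly create a collection
db.serverStatus().connections      // one sub-document of the large status report
db.dropDatabase()                  // drops the current database - irreversible
```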
File structures of MongoDB standalone server
- Check the files generated at the location provided for dbpath.
- Never modify these files manually; doing so may lead to crashes or data loss.
- Write operations are buffered in memory and flushed to disk every 60 seconds, at each checkpoint.
- In the event of a failure, WiredTiger can use the journal to recover writes that occurred between checkpoints.
- If the mongod crashes between checkpoints, there is a possibility that data was not safely and completely written.
- When the mongod restarts, WiredTiger checks whether there is any recovery to be made.
Logging
Process logs of MongoDB can be captured at various levels based on the configured verbosity. db.getLogComponents() returns the current verbosity settings. The verbosity settings determine how many log messages MongoDB produces for each log component. The default verbosity level is 0 for all components.
Log verbosity levels are,
- -1 : Inherit from parent
- 0 : Default verbosity; includes informational messages.
- 1-5 : Increasing verbosity; includes debug messages.
getLog() is an administrative command that returns the most recent 1024 logged mongod events. It does not read log data from the mongod log file; instead it reads from a RAM cache of logged mongod events.
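Putting the logging helpers together in the mongo shell:

```javascript
db.getLogComponents()                    // current verbosity settings per component
db.setLogLevel(2, "query")               // raise the "query" component to debug level 2
db.adminCommand({ getLog: "global" })    // recent log events from the in-memory cache
db.setLogLevel(-1, "query")              // revert: inherit verbosity from the parent again
```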
Profiling the database
The profiler can be used to capture the list of operations executed against a running mongod instance, which includes:
- CRUD operations
- Configurations
- Administration commands.
A new collection called system.profile will be created, where all the above information resides. The profiler is off by default and can be enabled with three different levels:
- 0 - Profiler is off and does not collect any data.
- 1 - Profiler collects data on slow operations. We can also define the slowness threshold. Example: db.setProfilingLevel(1, { slowms: 500 });
- 2 - Profiler collects all the data.
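A short profiling session might look like this:

```javascript
db.setProfilingLevel(1, { slowms: 500 })   // profile operations slower than 500 ms
db.getProfilingStatus()                    // confirm the current level and threshold

// Inspect the five most recent profiled operations.
db.system.profile.find().sort({ ts: -1 }).limit(5)

db.setProfilingLevel(0)                    // turn the profiler back off
```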
Authentication
Authentication is used to identify the user - basically, to find out who you are. MongoDB supports four authentication mechanisms:
- SCRAM - Salted Challenge Response Authentication Mechanism, the default mechanism used by MongoDB. It is essentially password-based security.
- X.509 - Available in the community version. Uses X.509 certificates for verification. More complex than SCRAM.
- LDAP - Lightweight Directory Access Protocol. Available only in the Enterprise edition.
- Kerberos - Considered the strongest of these mechanisms. Available only in the Enterprise edition.
Authorization
Authorization is used to identify the privileges of a user - basically, to find out what level of access you have. MongoDB uses RBAC (Role-Based Access Control), which can be defined while creating a user.
- User can have one or more roles.
- Role can have one or more privileges.
- A privilege is a group of actions plus the resources those actions apply to.
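The role/privilege hierarchy can be sketched in the shell. The role, user and database names below are invented for illustration:

```javascript
// A custom role: one privilege = two actions ("find", "insert") on one resource.
db.createRole({
  role: "insertAndFind",
  privileges: [
    { resource: { db: "movies", collection: "credits" }, actions: ["find", "insert"] }
  ],
  roles: []   // no inherited roles
})

// A user holding that role (users may hold one or more roles).
db.createUser({ user: "etlUser", pwd: "changeMe", roles: ["insertAndFind"] })
```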
Replication
It is the process of maintaining the same data in various places, providing redundancy and high availability. There will be a primary node and a set of secondary nodes to which the data is replicated. All operations performed on the primary node are recorded in the oplog, which is shared with the secondary nodes for replication.
Types of Replication
- Binary - Replicates changes to the underlying files, so it is necessary to know exactly how the files change. Considered faster and carries less data. Uses binary logs.
- Statement-based - Replays the operations themselves, so there is no need to care about how the data is physically stored, and it is not bound to a particular operating system. MongoDB uses this approach via the oplog.
Facts about replication
- Members of a replica set can be configured as primary, secondary or arbiters.
- Arbiters do not hold any data, but they can vote in an election in case of primary node failure. Using arbiters can cause consistency issues.
- It is preferred to have an odd number of nodes in a replica set, which helps in elections.
- A maximum of 50 replica set members can be created, of which only 7 can vote.
- Priority can be set to define the hierarchy within the replica set. For arbiters the priority should be 0.
- Every node in a replica set has its own oplog.
- Hidden nodes replicate data and can vote in elections, but are invisible to client applications.
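A minimal replica set configuration, as a sketch (hostnames and the set name are placeholders):

```javascript
// Three members: two data-bearing nodes plus an arbiter that votes but holds no data.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1:27017", priority: 2 },        // preferred primary
    { _id: 1, host: "node2:27017", priority: 1 },
    { _id: 2, host: "node3:27017", arbiterOnly: true }   // voting-only member
  ]
})
rs.status()   // verify member states after initiation
```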
Failover & Election
In case of a primary node failure, an election happens among the available secondary nodes to select the next primary node.
Write concern can be configured for various levels to confirm the data writes,
- 0 - Don't wait for acknowledgement
- 1 - Default. Wait for acknowledgment from the primary only.
- >=2 - Wait for acknowledgement from the primary and one or more secondary members.
- majority - Wait for acknowledgement from majority of replica set members.
- A higher write concern provides more durability, but writes take more time.
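Write concern is passed per operation; for example (collection and document are hypothetical):

```javascript
// Wait until a majority of replica set members acknowledge the write,
// giving up after a 5-second timeout.
db.orders.insertOne(
  { item: "book", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```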
Read concern can be configured at various levels to control how reliable the data returned by a read is:
- Local - Default. Returns the most recent data on the queried node, with no guarantee it has been replicated.
- Available - Default for reads against secondaries.
- Majority - Returns documents that have been acknowledged by a majority of the replica set members.
- Linearizable - Available from MongoDB 3.4. Majority plus read-your-own-writes functionality.
- Read concern can also be used with write concern for best durability guarantee.
Read preferences can be configured based on the requirement for data reads. The list of Read-preferences are,
- Primary - Default. Reads only from the primary node.
- Primary Preferred - Reads from primary node. If primary node is not available, reads from secondary node.
- Secondary - Reads only from secondary node.
- Secondary Preferred - Reads from secondary node. If no secondary node is available, reads from primary.
- Nearest - Reads randomly from any of the available nodes based on specified latency threshold.
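Read concern and read preference can both be set per query, as a sketch (collection and filter are made up):

```javascript
// Only return data acknowledged by a majority of members.
db.orders.find({ item: "book" }).readConcern("majority")

// Prefer secondaries for this read, falling back to the primary if none are available.
db.orders.find({ item: "book" }).readPref("secondaryPreferred")
```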
Follow the Replication documentation for configuration and setup.
Sharding
- Distributing the datasets across various nodes is called Sharding.
- Mongo by default uses horizontal scaling.
- Divide the datasets into pieces and distribute those across many shards.
- Deploying each shard as a replica set provides high availability and fault tolerance.
- mongos is used for querying data across a sharded cluster.
- mongos uses metadata from the config servers to find where the data exactly resides.
- mongos routes the queries to shards based on the information provided by the config servers.
- This metadata is stored in the config database.
- We should never write anything to the config database manually, as it is maintained for internal purposes.
- Shard keys are used to partition the data in sharded collections.
- Shard key fields must be indexed.
- Shard keys are immutable, which means the shard key fields and values cannot be changed after sharding.
- Once a collection is sharded, it cannot be reverted.
- A good shard key has high cardinality, low frequency, and values that do not change monotonically.
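The sharding workflow can be sketched as follows (database, collection and key names are hypothetical):

```javascript
sh.enableSharding("movies")                          // allow sharding on this database
db.credits.createIndex({ userId: "hashed" })         // shard key field must be indexed
sh.shardCollection("movies.credits", { userId: "hashed" })   // hashed key spreads monotonic values
sh.status()                                          // inspect the shard distribution
```

A hashed key is one common way to satisfy the "not monotonic" guideline when the natural key (such as a timestamp or ObjectId) always increases.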
Chunks
- The collections are broken into various chunks and distributed among shards.
- Each chunk's min bound is inclusive and its max bound is exclusive.
- All documents of the same chunk live in the same shard.
- The default chunk size is 64 MB. It can be configured between 1 MB and 1 GB.
- Large group of documents are categorized logically.
- Increasing the chunksize can help eliminate jumbo chunks.
The sharding balancer migrates chunks between shards to keep data evenly distributed, which helps performance. The balancer runs on the primary member of the config server replica set.
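Balancer state and chunk size can be managed from the shell. Note that adjusting the chunk size via config.settings is the documented exception to the "never write to the config database" rule above:

```javascript
sh.getBalancerState()     // is the balancer enabled?
sh.isBalancerRunning()    // is a balancing round in progress right now?
sh.stopBalancer()         // disable during maintenance windows
sh.startBalancer()

// Documented exception: change the chunk size (value is in MB).
use config
db.settings.updateOne({ _id: "chunksize" }, { $set: { value: 128 } }, { upsert: true })
```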