Many applications depend on file management and file storage for their data processing. Files are frequently kept in a third-party storage service or CDN (Content Delivery Network), such as those offered by Amazon Web Services, although this complicates management. It is preferable to access all your resources from a single storage location rather than several different ones, because every additional location is another place where retrieval can fail.
Until the addition of GridFS in MongoDB, it was difficult to store files directly in a database using a single API request. This article shows how GridFS uses indexes and small chunks of data for faster retrieval, and then explores the benefits and limitations of using GridFS.
What is GridFS?
GridFS is a driver specification for storing and retrieving files in MongoDB, including files larger than the 16 MB size limit of a BSON document. Rather than storing a file as a single document, it divides the file into portions, or chunks, and saves each chunk as a separate document.
By default, each chunk is 255 KB in size. Every chunk except the last is exactly that size; the final chunk is only as large as it needs to be, so it is 255 KB or smaller.
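The chunk arithmetic can be sketched in a few lines of JavaScript. The helper name chunkLayout is hypothetical and not part of any driver; it only assumes the default chunk size of 255 KB (255 * 1024 = 261120 bytes).

```javascript
// Default GridFS chunk size: 255 KB = 255 * 1024 bytes.
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 261120

// chunkLayout is a hypothetical helper for illustration only.
// It returns how many chunks a file needs and how large the last one is.
function chunkLayout(fileLength, chunkSize = DEFAULT_CHUNK_SIZE) {
  if (fileLength === 0) {
    return { chunkCount: 0, lastChunkSize: 0 };
  }
  const chunkCount = Math.ceil(fileLength / chunkSize);
  // Every chunk before the last one is full, so the remainder is the last chunk.
  const lastChunkSize = fileLength - (chunkCount - 1) * chunkSize;
  return { chunkCount, lastChunkSize };
}

console.log(chunkLayout(15720));   // { chunkCount: 1, lastChunkSize: 15720 }
console.log(chunkLayout(1000000)); // { chunkCount: 4, lastChunkSize: 216640 }
```

A file smaller than one chunk occupies a single, partially filled chunk, while a 1,000,000-byte file needs three full chunks plus a shorter fourth one.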
GridFS is an appropriate technique for storing files in MongoDB, and it complements the flexible, schema-less retrieval of information that the document model offers.
Because files are split into smaller parts, it is easier to access a specific section of a file and avoid memory-intensive work such as loading the whole file.
When reading from GridFS, the driver reassembles the chunks as needed. This means you can read only the part of a file that falls within a requested range, for example to listen to a segment of an audio file or retrieve a segment of a video clip.
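As a rough illustration of why a range read touches only some chunks, the following sketch maps a byte range to chunk sequence numbers. The helper chunksForRange is hypothetical; it assumes the default 255 KB chunk size, an inclusive start offset, and an exclusive end offset.

```javascript
// chunksForRange is a hypothetical helper that shows the arithmetic only.
// start is inclusive and end is exclusive; both are byte offsets into the file.
function chunksForRange(start, end, chunkSize = 255 * 1024) {
  if (start < 0 || end <= start) {
    throw new RangeError("expected 0 <= start < end");
  }
  const firstChunk = Math.floor(start / chunkSize);
  const lastChunk = Math.floor((end - 1) / chunkSize);
  return {
    firstChunk,
    lastChunk,
    // Bytes to skip at the beginning of the first chunk that is read.
    skipInFirstChunk: start - firstChunk * chunkSize,
    // Bytes to keep from the beginning of the last chunk that is read.
    takeFromLastChunk: end - lastChunk * chunkSize,
  };
}

console.log(chunksForRange(300000, 600000));
// { firstChunk: 1, lastChunk: 2, skipInFirstChunk: 38880, takeFromLastChunk: 77760 }
```

In this example chunk 0 and any chunks after chunk 2 never need to be fetched, which is what keeps partial reads cheap.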
MongoDB GridFS Indexes
For efficiency, GridFS uses an index on each of the chunks and files collections. For convenience, drivers that conform to the GridFS specification create these indexes automatically.
The GridFS specification defines a basic API and describes advanced capabilities that drivers may choose to offer in their implementations. It also defines the meaning and purpose of every field in the GridFS data model, disambiguates GridFS terminology, and documents configuration choices that were previously unspecified. Beyond the two indexes described here, you can add as many indexes as your application needs.
The Chunks Index
GridFS uses the files_id and n fields to create a unique compound index on the chunks collection. This enables efficient chunk retrieval, as shown in the following example:
db.fs.chunks.find( { files_id: myFileID } ).sort( { n: 1 } )
Drivers that conform to the GridFS specification automatically check that this index exists before performing read and write operations. For the specific behavior of your GridFS application, consult the documentation of the driver you use.
If this index does not exist, you can create it with the following operation in the MongoDB Shell (mongosh). mongosh is a full JavaScript and Node.js REPL environment for working with MongoDB deployments, and you can use it to test queries and operations directly against your database.
db.fs.chunks.createIndex( { files_id: 1, n: 1 }, { unique: true } );
The Files Index
GridFS uses an index on the files collection based on the filename and uploadDate fields. This index enables efficient file retrieval, as illustrated in the following example:
db.fs.files.find( { filename: myFileName } ).sort( { uploadDate: 1 } )
If this index does not already exist, you can use mongo shell to build it:
db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );
Drivers that conform to the GridFS specification automatically check that this index exists before performing read and write operations. For the specific behavior of your GridFS application, consult the documentation of the driver you use.
MongoDB GridFS Sharding
GridFS uses two collections, files and chunks, and each has its own sharding considerations.
Chunks Collection
The chunks collection stores the binary chunks. To shard it, use either { files_id: 1, n: 1 } or { files_id: 1 } as the shard key index. files_id is an ObjectId and changes monotonically.
You cannot use hashed sharding if the MongoDB driver runs the filemd5 command.
Each document in the chunks collection represents a unique chunk of a file in GridFS. This collection's documents take the following format:
{
"_id" : <ObjectId>,
"files_id" : <ObjectId>,
"n" : <num>,
"data" : <binary>
}
Documents in the chunks collection contain the following fields:
- chunks._id: The unique ObjectId of the chunk.
- chunks.files_id: The _id of the parent document, as specified in the files collection.
- chunks.n: The chunk's sequence number. GridFS assigns a number to each chunk, beginning with 0.
- chunks.data: The payload of the chunk as a BSON Binary type.
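To make the document format above concrete, here is a minimal in-memory sketch that splits a Buffer into chunk-shaped objects and puts them back together by sorting on n. The helpers splitIntoChunks and reassemble are hypothetical, the string "file-1" stands in for a real ObjectId, and the tiny chunk size of 16 bytes is chosen only to keep the demo small. A real driver performs these steps for you.

```javascript
// splitIntoChunks and reassemble are hypothetical helpers that mimic the
// shape of documents in the chunks collection without touching a database.
function splitIntoChunks(filesId, buffer, chunkSize) {
  const docs = [];
  for (let offset = 0, n = 0; offset < buffer.length; offset += chunkSize, n++) {
    docs.push({
      files_id: filesId,
      n, // sequence number, starting at 0
      data: buffer.subarray(offset, offset + chunkSize),
    });
  }
  return docs;
}

function reassemble(chunkDocs, filesId) {
  // Same idea as find({ files_id }).sort({ n: 1 }) on the chunks collection.
  // Real ObjectIds would need an equals() comparison instead of ===.
  const ordered = chunkDocs
    .filter((doc) => doc.files_id === filesId)
    .sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((doc) => doc.data));
}

const original = Buffer.from("GridFS stores a file as ordered chunks.");
const chunks = splitIntoChunks("file-1", original, 16);
const restored = reassemble([...chunks].reverse(), "file-1");

console.log(chunks.map((c) => c.n));    // sequence numbers of the chunks
console.log(restored.equals(original)); // whether the round trip preserved the bytes
```

The chunks are deliberately passed to reassemble in reverse order to show that only the sequence number n, not the storage order, determines how the file is rebuilt.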
Files Collection
The files collection stores each file's metadata. It is small and contains only metadata. None of the keys that GridFS requires lend themselves to an even distribution in a sharded environment, so the collection is usually left unsharded, which lets all of the file metadata documents reside on the primary shard.
If you need to shard the files collection, use the _id field, possibly in combination with an application field.
Each document in the files collection represents a file in GridFS and takes the following format:
{
"_id" : <ObjectId>,
"length" : <num>,
"chunkSize" : <num>,
"uploadDate" : <timestamp>,
"md5" : <hash>,
"filename" : <string>,
"contentType" : <string>,
"aliases" : <string array>,
"metadata" : <any>,
}
Documents in the files collection contain some or all of the following fields:
- files.length: The size of the document in bytes.
- files._id: The unique identifier of the document. The _id is of the data type you chose when creating the original document. BSON ObjectId is the default type for MongoDB documents.
- files.chunkSize: The size of each chunk in bytes. GridFS divides the document into chunks of size chunkSize, except for the last chunk, which is only as large as needed. The default size is 255 kilobytes (kB).
- files.uploadDate: The date on which GridFS first stored the document. This value has the Date type.
- files.md5: An MD5 hash of the complete file, as returned by the filemd5 command. This value has the String type.
- files.metadata: The metadata field can hold any data type and any additional information you choose to store. If you want to add arbitrary fields to documents in the files collection, add them to an object in the metadata field.
- files.aliases: An array of alias strings.
- files.contentType: Optional. A valid MIME type for the GridFS file.
- files.filename: Optional. A human-readable name for the GridFS file.
Example:
{
"_id" : ObjectId("6177da181964fd7f82e2aaa9"),
"length" : 15720,
"chunkSize" : 261120,
"uploadDate" : ISODate("2021-10-26T16:06:08.091+05:30"),
"filename" : "ishanfile.docx",
"contentType" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
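For illustration, the sketch below assembles a files-style document by hand using only Node.js built-ins. The helper buildFilesDocument is hypothetical. Real drivers create this document and its _id for you, and depending on the driver version the md5 field may not be written at all.

```javascript
const crypto = require("crypto");

// buildFilesDocument is a hypothetical helper. It mirrors the field names of
// the files collection but does not talk to MongoDB and does not assign an _id.
function buildFilesDocument(buffer, filename, contentType, chunkSize = 255 * 1024) {
  return {
    length: buffer.length, // size of the file in bytes
    chunkSize,             // size of every chunk except possibly the last
    uploadDate: new Date(),
    md5: crypto.createHash("md5").update(buffer).digest("hex"),
    filename,
    contentType,
  };
}

const doc = buildFilesDocument(Buffer.from("hello"), "hello.txt", "text/plain");
console.log(doc);
```

The length and chunkSize fields together are enough to work out how many chunk documents belong to the file.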
Like the chunks collection, the files collection uses a compound index, in this case on the filename and uploadDate fields, to enable efficient file retrieval, for example:
db.fs.files.find( { filename: fileName } ).sort( { uploadDate: 1 } )
If this index does not exist, run the following command in a mongo shell:
db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );
How to Read and Write files in MongoDB GridFS?
To follow the rest of the tutorial, your machine must have the following software installed:
- Node.js
- MongoDB with MongoDB Compass
- VS Code
Step 1: Create a folder named mongo_grid. Launch the VS Code editor and open this folder. It becomes the workspace that holds all of the code files for this tutorial.
Step 2: In this workspace, create two folders named filestoread and filestowrite. The filestoread folder holds the files that will be read from disk and saved to the database, and the filestowrite folder receives the files that are read back from the database.
Step 3: Open the VS Code terminal and run npm init -y
This command creates a package.json file for the workspace with default values.
Install gridfs-stream and mongoose using the following commands:
npm install gridfs-stream
npm install mongoose
After installation, both packages are listed in the dependencies section of the package.json file.
The gridfs-stream package lets you stream files to and from MongoDB GridFS. The mongoose package is a MongoDB object modeling tool designed to work in an asynchronous environment.
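Step 4: Keep the project organized as follows: the mongo_grid workspace contains the filestoread and filestowrite folders, the package.json file, and the scripts created in the steps below.
Put a few image, video, or audio files in the filestoread folder. These files will be used for the write and read operations. This example uses a sample file named gfs.png.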
Step 5: Open MongoDB Compass and connect to your MongoDB deployment. Create a database named filesDB with a collection named files.
Step 6: To write a file to GridFS, create a JavaScript file named writefile.js with the following code:
// 1. Load the mongoose driver
var mongooseDv = require("mongoose");
// 2. Connect to MongoDB and its database
// (useMongoClient is a Mongoose 4.x option; it was removed in Mongoose 5)
mongooseDv.connect("mongodb://localhost/filesDB", { useMongoClient: true });
// 3. The connection object
var connection = mongooseDv.connection;
// readyState: 0 = disconnected, 1 = connected, 2 = connecting, 3 = disconnecting
console.log(connection.readyState.toString());
// Report connection failures instead of failing silently
connection.on("error", (err) => {
  console.log("Not connected", err);
});
// 4. The path module
var path = require("path");
// 5. The gridfs-stream module
var grid = require("gridfs-stream");
// 6. The file-system module
var fs = require("fs");
// 7. Path of the image file in the filestoread folder
var filesrc = path.join(__dirname, "./filestoread/gfs.png");
// 8. Tell gridfs-stream which MongoDB driver to use
grid.mongo = mongooseDv.mongo;
// 9. Once the connection is open, write the file
connection.once("open", () => {
  console.log("Connection Open");
  var gridfs = grid(connection.db);
  if (gridfs) {
    // 9a. Create a write stream that stores the file in the database
    var streamwrite = gridfs.createWriteStream({
      // the name the file is stored under
      filename: "gfs.png"
    });
    // 9b. Read the file from the filestoread folder
    // and pipe it into the database
    fs.createReadStream(filesrc).pipe(streamwrite);
    // 9c. The "close" event fires when the write has completed
    streamwrite.on("close", function (file) {
      console.log("successfully written in database");
    });
  } else {
    console.log("No Grid FS Object");
  }
});
// Runs immediately, before the asynchronous write has finished
console.log("done");
The path of the file in the filestoread folder is passed to the fs module's createReadStream() function. The pipe() method of the resulting read stream takes the write stream created from the gridfs object, so the contents of the image file flow into GridFS.
Step 7: Run the code using node writefile
When the write succeeds, the console shows the message "successfully written in database".
Now check MongoDB Compass. The uploaded file appears as a document in the fs.files collection of the filesDB database.
Step 8: To read a file from GridFS, create a JavaScript file named readfile.js with the following code:
// Load the mongoose driver and connect to the database
var mongooseDv = require("mongoose");
// (useMongoClient is a Mongoose 4.x option; it was removed in Mongoose 5)
mongooseDv.connect("mongodb://localhost/filesDB", { useMongoClient: true });
var connection = mongooseDv.connection;
// readyState: 0 = disconnected, 1 = connected, 2 = connecting, 3 = disconnecting
console.log(connection.readyState.toString());
// Report connection failures instead of failing silently
connection.on("error", (err) => {
  console.log("Not connected", err);
});
var path = require("path");
var grid = require("gridfs-stream");
var fs = require("fs");
// Tell gridfs-stream which MongoDB driver to use
grid.mongo = mongooseDv.mongo;
connection.once("open", () => {
  console.log("Connection Open");
  var gridfs = grid(connection.db);
  if (gridfs) {
    // Destination file in the filestowrite folder
    var fsstreamwrite = fs.createWriteStream(
      path.join(__dirname, "./filestowrite/gfs.png")
    );
    // Read the stored file from GridFS by its filename
    var readstream = gridfs.createReadStream({
      filename: "gfs.png"
    });
    // Pipe the data from GridFS into the destination file
    readstream.pipe(fsstreamwrite);
    readstream.on("close", function (file) {
      console.log("File Read successfully from database");
    });
  } else {
    console.log("No Grid FS Object");
  }
});
// Runs immediately, before the asynchronous read has finished
console.log("done");
Step 9: Run the above code using node readfile
The script reads the file from MongoDB GridFS and writes it to the filestowrite folder. When the read succeeds, the console shows the message "File Read successfully from database".
When to Use the MongoDB GridFS Storage System
The MongoDB GridFS storage system is not widely used, but the following situations may call for it:
- When your file system limits the number of files that can be stored in a single directory.
- When you need to access only a portion of the stored information. GridFS lets you recall sections of a file without reading the entire file into memory.
- When you want files and their metadata kept in sync across geographically distributed replica sets. MongoDB can then distribute the files and their metadata automatically to a number of instances and facilities.
When Not to Use the MongoDB GridFS Storage System
GridFS should not be used if you need to update the content of an entire file atomically. As an alternative, you can store multiple versions of each file and record the current version in the metadata. After uploading the new version of the file, you can atomically update the metadata field that indicates "latest" status, and then remove older versions if necessary.
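The multiple-versions approach can be sketched in memory as follows. The array and the helper revisionsNewestFirst are hypothetical stand-ins for documents in the files collection, and this sketch picks the current version by uploadDate rather than by a dedicated "latest" flag in the metadata, which is a simplification of the technique described above.

```javascript
// Hypothetical in-memory stand-ins for documents in the files collection.
const files = [
  { _id: 1, filename: "report.pdf", uploadDate: new Date("2021-10-01T00:00:00Z") },
  { _id: 2, filename: "report.pdf", uploadDate: new Date("2021-10-20T00:00:00Z") },
  { _id: 3, filename: "notes.txt", uploadDate: new Date("2021-10-25T00:00:00Z") },
];

// Return all revisions of one filename, newest first.
// This is the in-memory equivalent of filtering on filename and
// sorting on uploadDate in descending order.
function revisionsNewestFirst(allFiles, filename) {
  return allFiles
    .filter((f) => f.filename === filename)
    .sort((a, b) => b.uploadDate - a.uploadDate);
}

const [latest, ...older] = revisionsNewestFirst(files, "report.pdf");
console.log(latest._id);              // 2
console.log(older.map((f) => f._id)); // [ 1 ]
```

The older revisions are the candidates for cleanup once the new version has been uploaded completely.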
Likewise, if your files are all smaller than the 16 MB BSON document size limit, consider storing each file in a single document instead of using GridFS. You can use the BinData data type to store the binary data. For details on using BinData, consult your driver's documentation.
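A simple size check along these lines could guide the decision. The helper chooseStorage and its headroomBytes parameter are assumptions made for this sketch; the headroom is an arbitrary allowance for the other fields stored in the same document, not a value defined by MongoDB.

```javascript
// The BSON document size limit: 16 MB, taken here as 16 * 1024 * 1024 bytes.
const MAX_BSON_DOCUMENT_SIZE = 16 * 1024 * 1024;

// chooseStorage is a hypothetical helper. headroomBytes is an assumed
// allowance for the other fields that share the document with the file data.
function chooseStorage(fileSizeBytes, headroomBytes = 1024) {
  return fileSizeBytes + headroomBytes <= MAX_BSON_DOCUMENT_SIZE
    ? "single-document"
    : "gridfs";
}

console.log(chooseStorage(15720));            // single-document
console.log(chooseStorage(20 * 1024 * 1024)); // gridfs
```

Files close to the limit are routed to GridFS because the document that holds them needs room for its own fields as well.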
MongoDB GridFS Limitations
The GridFS File System has the following restrictions:
- Serving files alongside database content can severely churn your RAM working set. If you do not want to disrupt your working set, serve your files from a separate MongoDB server.
- File serving performance will be slower than serving the file natively from your webserver and filesystem. However, the additional management benefits may outweigh the slowdown.
- GridFS does not support atomic file updates. If you need them, keep multiple versions of each file and select the appropriate version.
The power and rise of GridFS
GridFS lets developers store large files in MongoDB and retrieve portions of those files as needed, which makes it useful across a variety of applications. Its main benefit is that part of a file can be read without loading the complete file into memory.