What is Big Data? 🤔

Sep 17, 2020

Big Data is a term used to describe a collection of data that is huge in volume and keeps growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

Have you ever wondered how much Internet data the whole world consumes in a day? Or how big multinational companies like Google, Facebook and Instagram store thousands of terabytes of data with high speed and high efficiency?

Don’t worry about it. I will answer all these questions and explain what the Big Data problem is and how we can solve it.

How much data does the whole world consume daily?

Given how much data is on the internet, the actual amount of data used is difficult to calculate.

But if we’re talking about how much data is created every day, the current estimate stands at 1.145 trillion MB per day.

Throughout 2020, every individual created 1.7 MB of data every second.
2.5 quintillion bytes of data are produced by humans every day.
350 million photos are uploaded to Facebook each day.

Facebook generates 4 petabytes of data every day.
Every day, 306.4 billion emails are sent, and 5 million Tweets are made.
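
The figures above mix several units (MB, bytes, petabytes), which makes them hard to compare. Here is a small sketch that converts them to a common scale. It assumes decimal (SI) units, where each step is a factor of 1000, and the helper name human_readable is made up for this example.

```python
# Assumption: decimal (SI) units, so each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop once the value fits in this unit (or we ran out of units).
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


MB = 10**6   # bytes in one megabyte (decimal)
PB = 10**15  # bytes in one petabyte (decimal)

# The numbers are the estimates quoted in this article, not new measurements.
figures = {
    "Data created per day (1.145 trillion MB)": 1.145e12 * MB,
    "Data produced by humans per day (2.5 quintillion bytes)": 2.5e18,
    "Facebook per day (4 petabytes)": 4 * PB,
}

for label, size in figures.items():
    print(f"{label}: {human_readable(size)}")
```

Seen this way, both daily estimates land on the exabyte scale, which is far beyond what a single machine can hold.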

Have you ever imagined how and where this huge amount of data gets stored?

Yes, the answer sounds simple: in data centers, which have hundreds of servers with great storage capacity and computing power. But it isn’t that simple. Behind the scenes, there are many problems in storing such huge amounts of data. Big data has become a huge problem!

🤔 Big data problems:

Big data is a huge amount of data. While storing and maintaining it, we run into various problems.

Now we will discuss the challenges:

Volume: Volume refers to the sheer size of the ever-exploding data of the computing world. It raises the question about the quantity of data.

Velocity: Velocity refers to the processing speed. It raises the question of how fast the data can be processed.

Variety: Variety refers to the types of data, i.e., structured data (RDBMS tables) and unstructured data (video streams, social media content, etc.).

Veracity: This refers to the quality of the collected data. If source data is not correct, analyses will be worthless. As the world moves toward automated decision-making, where computers make choices instead of humans, it becomes imperative that organizations be able to trust the quality of the data.

If there are problems, there are solutions too. 👍

To tackle these issues (the 4 V’s), we use a concept called a Distributed Storage System.

What is a Distributed Storage System? 🤔

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

In this system, we split the data into small blocks and give one block to each computer or server, so the volume each machine has to hold is reduced. Since we store the blocks in parallel, this saves time and also eases the input/output bottleneck, i.e., the velocity problem. The more independent servers we have, the less time is required to store the data.
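
To make the idea concrete, here is a minimal sketch of splitting data into blocks and storing them in parallel. It is only an illustration under some assumptions: the StorageNode class is a made-up stand-in for a real server (it keeps blocks in memory), blocks are handed out round-robin, and the block size is tiny so the output is easy to follow. Real systems use much larger blocks and send them over the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

BLOCK_SIZE = 4  # bytes per block; tiny on purpose for readability


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class StorageNode:
    """Hypothetical stand-in for one server; keeps its blocks in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}

    def store(self, block_id: int, block: bytes) -> None:
        self.blocks[block_id] = block


def store_distributed(data: bytes, nodes: List[StorageNode]) -> int:
    """Spread the blocks over the nodes round-robin, writing in parallel."""
    blocks = split_into_blocks(data)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [
            pool.submit(nodes[block_id % len(nodes)].store, block_id, block)
            for block_id, block in enumerate(blocks)
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker thread
    return len(blocks)


def reassemble(nodes: List[StorageNode]) -> bytes:
    """Collect every block from every node and join them in block order."""
    all_blocks: Dict[int, bytes] = {}
    for node in nodes:
        all_blocks.update(node.blocks)
    return b"".join(all_blocks[i] for i in sorted(all_blocks))


if __name__ == "__main__":
    nodes = [StorageNode(f"node-{i}") for i in range(3)]
    data = b"big data needs many servers"
    count = store_distributed(data, nodes)
    print(f"stored {count} blocks")
    for node in nodes:
        print(node.name, node.blocks)
    print(reassemble(nodes) == data)
```

Each node ends up holding only part of the data, and the original can still be rebuilt because every block carries its position.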

This topology is known as the Master and Slave Model.

Master/slave is a model of asymmetric communication or control where one device or process (the "master") controls one or more other devices or processes (the "slaves") and serves as their communication hub.

This whole setup is known as a cluster.
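
The master/slave arrangement described above can be sketched in a few lines. This is a simplified illustration, not how any particular product works: the Master and Slave classes and their methods are invented for this example, everything runs in one process, and the master relays the data itself to match the "communication hub" description. In many real systems the master only keeps track of where blocks live.

```python
class Slave:
    """Hypothetical storage worker: just holds the blocks it is given."""

    def __init__(self, name):
        self.name = name
        self._blocks = {}

    def put(self, block_id, block):
        self._blocks[block_id] = block

    def get(self, block_id):
        return self._blocks[block_id]


class Master:
    """Hypothetical coordinator: decides which slave stores which block."""

    def __init__(self, slaves, block_size=4):
        self.slaves = slaves
        self.block_size = block_size
        # Metadata only: file name -> ordered list of (block_id, slave).
        self.block_map = {}

    def write(self, filename, data):
        placements = []
        for index, start in enumerate(range(0, len(data), self.block_size)):
            slave = self.slaves[index % len(self.slaves)]  # round-robin choice
            block_id = f"{filename}#{index}"
            slave.put(block_id, data[start:start + self.block_size])
            placements.append((block_id, slave))
        self.block_map[filename] = placements

    def read(self, filename):
        # Ask each slave for its block, in the order recorded at write time.
        return b"".join(
            slave.get(block_id) for block_id, slave in self.block_map[filename]
        )


if __name__ == "__main__":
    master = Master([Slave("slave-1"), Slave("slave-2")])
    master.write("log.txt", b"hello cluster")
    for block_id, slave in master.block_map["log.txt"]:
        print(block_id, "->", slave.name)
    print(master.read("log.txt"))
```

The master and its slaves together form the cluster: the slaves provide the storage, while the master remembers where every block lives.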

To implement any concept, we need software.

To implement a Distributed Storage Cluster, we have several software options: