Understanding Distributed Data Storage

With the advent of the 21st century we entered a digital world. People now use technology in every aspect of life, adopting it more and more to make life easier and save time. With this increased use, the demand to store and preserve user data has grown manifold. You take pictures with a camera and, to treasure those memories, upload them to web-enabled storage such as Picasa. Just imagine: Picasa has about 7 billion photos uploaded to it, which is a little more than Flickr, less than Photobucket, and just a tiny fraction of Facebook. This figure should give you an idea of the scale of the data stores these companies must be running.

With data requirements like these, local storage solutions cannot keep pace with the continuous increase in demand. We have reached the stage at which the traditional way of storing data, on a stand-alone network-attached device, no longer serves the purpose.

As per a report published by IDC, "This is the digital universe. It is growing 40% a year into the next decade, expanding to include not only the increasing number of people and enterprises doing everything online, but also all the “things” – smart devices – connected to the Internet, unleashing a new wave of opportunities for businesses and people around the world. Like the physical universe, the digital universe is large – by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe – the data we create and copy annually – will reach 44 zettabytes, or 44 trillion gigabytes."
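The figures in the quote are consistent with each other, and a few lines of arithmetic show why. The sketch below uses only the numbers quoted above (40% annual growth, 44 zettabytes) and assumes decimal units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes. The function and variable names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.40        # "growing 40% a year", from the quote
ZETTABYTE_IN_GB = 10**12    # assumes decimal units: 10^21 bytes / 10^9 bytes

def growth_factor(annual_growth: float, years: float) -> float:
    """How many times larger a quantity is after compounding for `years`."""
    return (1 + annual_growth) ** years

def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

two_year_factor = growth_factor(ANNUAL_GROWTH, 2)    # 1.4 * 1.4
doubling_years = doubling_time_years(ANNUAL_GROWTH)
gigabytes_2020 = 44 * ZETTABYTE_IN_GB

print(f"Growth over two years: x{two_year_factor:.2f}")
print(f"Doubling time: {doubling_years:.2f} years")
print(f"44 ZB = {gigabytes_2020:,} GB")
```

Growing 40% a year multiplies the total by 1.96 over two years, which is why the report can describe the same trend as "doubling in size every two years".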

The answer does not lie in having faster, bigger drives or higher-bandwidth data networks. It lies in a different approach to storing data: Distributed Data Storage.

Data storage solutions have now evolved to accommodate this ever-growing need for flexible and scalable storage resources.

So what is Distributed Data Storage?

To quote Wikipedia, "A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes."

Using Distributed Data Storage, you can deliver any of the three types of storage: block, file or object.

How is Distributed Data Storage different from traditional way of storing data?

The difference between these two ways of storing data becomes clear once you look at their salient features and their approach to the solution.

Traditional storage is mostly hardware-defined and depends on SAN and NAS storage systems, whereas Distributed Data Storage has given a new meaning to Software Defined Storage (SDS).

Cost -

Because traditional storage is hardware-defined, cost plays a major role in the decision. With a traditional SAN or NAS storage array, you shell out a huge amount to procure and deploy the hardware, whereas with SDS you can start a storage cluster on commodity off-the-shelf hardware: standard servers, drives and network. You don't need any specialized hardware to deploy Distributed Data Storage. This minimizes the cost of the infrastructure, which can be significant considering the continuous growth in storage demand. It also combines storage and compute, making full use of the servers and consuming less power, cooling and space.

Nowadays, such storage solutions are also available on ARM platforms, which makes them more useful still. Ambedded Technology, a winner at Interop 2016, is the first company to have pioneered in this direction.

Scalable (capacity & performance) -

A traditional SAN or NAS can only be scaled up to a limit: the maximum number of controllers it supports, or the maximum number of drives that can be connected to each controller. That is the real ceiling for these devices. With SDS / Distributed Data Storage, there is virtually no limit on expanding capacity. It is, by design, a cloud: you keep adding new nodes and your cluster keeps growing. Maintenance is also comparatively easy, as you can add a new node and remove a faulty one from the cluster.
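To make scale-out concrete, here is a minimal sketch of one well-known placement technique, rendezvous (highest-random-weight) hashing. This is an assumption chosen for illustration, not the method of any particular product, and the node and object names are hypothetical. It shows that when a node is added, only the objects that now belong on the new node have to move.

```python
import hashlib

def score(node: str, key: str) -> int:
    """Pseudo-random but repeatable weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(key: str, nodes: list) -> str:
    """Rendezvous hashing: the node with the highest score owns the key."""
    return max(nodes, key=lambda node: score(node, key))

# Hypothetical cluster and object names, for illustration only.
keys = [f"object-{i}" for i in range(10_000)]
nodes = ["node-1", "node-2", "node-3", "node-4"]

before = {k: place(k, nodes) for k in keys}
after = {k: place(k, nodes + ["node-5"]) for k in keys}  # scale out by one node

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of objects moved")
print("all moved objects landed on node-5:",
      all(after[k] == "node-5" for k in moved))
```

Going from four nodes to five should relocate roughly one fifth of the objects, all of them onto the new node, while the rest stay where they were. Real systems use more elaborate placement schemes, but the principle of growing the cluster without reshuffling everything is the same.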

Flexibility -

You can add or remove nodes from the data storage cluster on the fly, so maintenance does not require downtime. Management is easy too: most tasks can be done with CLI commands or REST APIs.

Reliability / Resilience -

Most distributed storage solutions are fault-tolerant by design. They can replicate the stored data to guard against unforeseen events. Because of their distributed nature, data is not kept on a single node; it is spread across the nodes of the cluster and further replicated, with the number of replicas defined by the cluster admin. Almost all of them also run data integrity checks, so stored data is verified against corruption.

In Distributed Data Storage, it is the job of the Software Defined Storage system to ensure that everything is divided, distributed and replicated and, in case of any issue, rectified by intelligent algorithms configured by admins.
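The replication and integrity checking described above can be sketched with a toy in-memory cluster. Everything here is an assumption made for illustration: the Node class, the put and get helpers and the SHA-256 checksum are hypothetical, not the API of any real storage system. Each object is written to a configurable number of distinct nodes, and a read verifies the checksum, skips damaged or missing copies and repairs them from an intact one.

```python
import hashlib

class Node:
    """A storage node holding key -> (data, checksum) in memory."""
    def __init__(self, name: str):
        self.name = name
        self.blobs = {}

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def replica_nodes(key: str, nodes: list, replicas: int) -> list:
    """Pick `replicas` distinct nodes for a key, in a repeatable order."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{n.name}:{key}".encode()).digest(),
        reverse=True,
    )
    return ranked[:replicas]

def put(key: str, data: bytes, nodes: list, replicas: int = 3) -> list:
    """Write the object and its checksum to every replica node."""
    targets = replica_nodes(key, nodes, replicas)
    for node in targets:
        node.blobs[key] = (data, checksum(data))
    return targets

def get(key: str, nodes: list, replicas: int = 3) -> bytes:
    """Return the first intact copy and repair the other replicas from it."""
    targets = replica_nodes(key, nodes, replicas)
    good = None
    for node in targets:
        entry = node.blobs.get(key)
        if entry is not None and checksum(entry[0]) == entry[1]:
            good = entry
            break
    if good is None:
        raise IOError(f"no intact replica of {key!r}")
    for node in targets:
        if node.blobs.get(key) != good:
            node.blobs[key] = good  # heal a missing or corrupted copy
    return good[0]

cluster = [Node(f"node-{i}") for i in range(1, 6)]
payload = b"holiday photo"
holders = put("photo.jpg", payload, cluster, replicas=3)

# Simulate bit rot on one replica and the loss of another.
holders[0].blobs["photo.jpg"] = (b"corrupted", checksum(payload))
del holders[1].blobs["photo.jpg"]

restored = get("photo.jpg", cluster, replicas=3)
print("data survived:", restored == payload)
print("replicas on:", sorted(n.name for n in holders))
```

With three replicas, the object survives one corrupted copy and one lost copy at the same time, and the read leaves all three replicas healthy again.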

Open source, no vendor lock-in -

With hardware NAS / SAN devices, you are mostly tied to the solution already deployed and have to buy the same hardware every time you wish to increase storage capacity. Distributed storage systems, or for that matter Software Defined Storage systems in general, are installed on commodity hardware, so you are not bound to any particular hardware or server.

Multi-tenancy -

Software Defined Storage solutions are designed with cloud workloads in mind, so multi-tenant support is built in. They can isolate each tenant's data and restrict access to it.
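As a rough illustration of tenant isolation, the sketch below keeps every object under a (tenant, key) pair and resolves the tenant from an access token. The TenantStore class, its methods and the token scheme are hypothetical and far simpler than what a real system does. The point is only that two tenants can use the same object name without seeing each other's data.

```python
class TenantStore:
    """Toy object store that namespaces every key by tenant."""

    def __init__(self):
        self._tokens = {}    # access token -> tenant name
        self._objects = {}   # (tenant, key) -> data

    def register(self, tenant: str, token: str) -> None:
        self._tokens[token] = tenant

    def _tenant_for(self, token: str) -> str:
        if token not in self._tokens:
            raise PermissionError("unknown access token")
        return self._tokens[token]

    def put(self, token: str, key: str, data: bytes) -> None:
        self._objects[(self._tenant_for(token), key)] = data

    def get(self, token: str, key: str) -> bytes:
        # Lookup is always scoped to the caller's own tenant.
        return self._objects[(self._tenant_for(token), key)]

store = TenantStore()
store.register("alice", "alice-token")   # hypothetical tenants and tokens
store.register("bob", "bob-token")

store.put("alice-token", "report.txt", b"alice's numbers")
store.put("bob-token", "report.txt", b"bob's numbers")
store.put("alice-token", "alice-only.txt", b"private")

print(store.get("alice-token", "report.txt"))
print(store.get("bob-token", "report.txt"))

try:
    store.get("bob-token", "alice-only.txt")
except KeyError:
    print("bob cannot see alice-only.txt")
```

Because the tenant is derived from the token rather than supplied by the caller, there is no way for one tenant to address another tenant's namespace in this sketch.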

Facts and figures

The global cloud storage market is expected to grow from USD 18.87 Billion in 2015 to USD 65.41 Billion by 2020, at a Compound Annual Growth Rate (CAGR) of 28.2% during the forecast period.
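The quoted growth rate can be checked from the two market sizes. The sketch below applies the standard compound annual growth rate formula, (end / start) ^ (1 / years) - 1, to the figures above, assuming the forecast period is the five years from 2015 to 2020.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values over `years`."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures from the text, in USD billion; 2015 -> 2020 is five years.
rate = cagr(18.87, 65.41, 5)
print(f"CAGR: {rate:.1%}")
```

The result comes out at roughly 28.2%, which matches the rate stated in the forecast.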

Demand in this market is also being driven by big data and the increasing adoption of cloud storage gateways. In 2015, North America is estimated to be the top contributor to the cloud storage market, due to increasing technological acceptance and high awareness of emerging data storage concerns in organizations. However, APAC and some countries in MEA are expected to show tremendous growth in this market.

The cloud storage market is broadly segmented by type: solutions and services; by solution: primary storage solution, backup storage solution, cloud storage gateway solution, and data movement and access solution; by service: consulting services, system and networking integration, and support training and education; by deployment model: public, private, and hybrid; by organizational size: SMBs and large enterprises; by vertical: BFSI, manufacturing, retail and consumer goods, telecommunication and IT, media and entertainment, government, healthcare and life sciences, energy and utilities, research and education, and others; and by region: North America, Europe, APAC, Latin America, and MEA.

Cloud storage growth dominates projections for 2016 digital storage. According to Scality’s Jerome Lecat, “By the end of 2016, 80% of SMBs will host some or most of their business in the cloud. Because of this, the service providers that provide the cloud-services to such businesses will need new infrastructure to meet the increasing demand, and support its customers.”

According to Mario Blandini from SwiftStack, “Ubiquitous access with cloud APIs for all unstructured data will represent the lion’s share of unstructured data stored. Optimal storage solutions will be those that can store and return the same data based on authentication rules no matter the access method. In via file, out via object, or vice versa.”

Ceph, Swift and Gluster are some of the well-known Software Defined Storage solutions for building distributed data storage.

This brings us to the end of this document. I hope it has helped you understand the difference between traditional storage and distributed data storage design.

Daleep Singh Bais

Red Hat Certified Architect (RHCA) with 12 years of technical experience in implementing various projects in India and overseas. Passionate about OpenSource Products and Technologies.