Over the past few years, NoSQL or non-relational database tools have become very popular for storing huge amounts of data and scaling easily. There is an ongoing debate about whether non-relational databases will replace relational databases in the future. With the growing volume of social data and other unstructured data, the following questions are often raised about relational databases.
Are relational databases capable of handling big data?
Are relational databases able to scale out to handle massive amounts of data?
Are relational databases suited to modern data?
Before answering these questions, let us review some basics of both relational and non-relational databases.
Relational databases: The concept of the relational database was developed in the 1970s. The most important feature of relational databases is their support for the ACID properties (Atomicity, Consistency, Isolation and Durability), which ensure that all transactions are processed reliably.
Atomicity: Each transaction is treated as a single unit of work. If any logical part of a transaction fails, the whole transaction is rolled back so that the data are left unchanged.
Consistency: All data written to the database are subject to the rules defined (constraints, triggers, etc.).
Isolation: Changes made in a transaction are not visible to other transactions until they are committed.
Durability: Changes committed in a transaction are stored and remain available in the database even if there is a power failure or the database goes offline suddenly.
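The atomicity and consistency rules above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the accounts table, the CHECK rule and the helper names make_bank and transfer are assumptions made for the example, not part of any particular product.

```python
import sqlite3


def make_bank():
    """Create a throwaway in-memory database with two accounts (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit breaks the CHECK rule, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw an account returns False and leaves both balances exactly as they were, because the credit that was already applied is rolled back along with the failed debit.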
Strictly structured: The objects in a relational database have a strict structure. All data in a table are stored as rows and columns, and each column has a data type. The data are mostly normalized. Structured Query Language (SQL) is used to store and retrieve data in a structured way, and its queries read like plain English commands. A table has a fixed number of columns, although additional columns can be added later. Most tables are related to each other through primary and foreign keys, which provides “Referential Integrity” among the objects. The major products are Oracle, SQL Server, MySQL, PostgreSQL, etc.
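As a small sketch of how primary and foreign keys give referential integrity, the following uses Python's sqlite3 module again. The customers and orders tables and the helper names are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON is issued on the connection; other relational databases usually enforce them by default.

```python
import sqlite3


def make_shop():
    """Create two related tables in memory (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " total REAL NOT NULL)"
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 25.5)")
    conn.commit()
    return conn


def add_order(conn, order_id, customer_id, total):
    """Insert an order; return False if it points at a customer that does not exist."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def orders_with_names(conn):
    """JOIN the two tables through the foreign key."""
    return conn.execute(
        "SELECT c.name, o.total FROM orders o"
        " JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()
```

Calling add_order with a customer_id that is not in the customers table is rejected by the database itself, so an order can never refer to a missing customer.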
Non-relational databases: Non-relational databases emerged to handle the rapid growth of unstructured data and to scale out easily. They provide a flexible schema, so the database does not enforce “Referential Integrity” as relational databases do. The data are highly denormalized and do not require JOINs between objects. These systems relax the ACID properties and are instead designed around the CAP theorem (Consistency, Availability and Partition tolerance), which says that a distributed system can guarantee only two of the three at any point in time. So, as opposed to ACID, many of them offer BASE (Basically Available, Soft state, Eventual consistency). Early databases built on these concepts include Bigtable by Google, HBase (an open-source database modelled on Bigtable), Cassandra by Facebook, etc.
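Eventual consistency can be pictured with a toy model in plain Python. The class below is purely hypothetical: it keeps three in-memory replicas, applies a write to one of them immediately and delivers it to the others only when sync() is called. Real systems add conflict resolution, versioning and failure handling, all of which are left out here.

```python
class EventuallyConsistentStore:
    """Toy model: a write reaches one replica now and the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet delivered

    def write(self, key, value, replica=0):
        """Apply the write locally and queue it for every other replica."""
        self.replicas[replica][key] = value
        for index in range(len(self.replicas)):
            if index != replica:
                self.pending.append((index, key, value))

    def read(self, key, replica):
        """Read from one replica; the answer may be stale (None if not yet seen)."""
        return self.replicas[replica].get(key)

    def sync(self):
        """Deliver all queued updates, after which the replicas agree."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()
```

Between write() and sync(), a reader on another replica is still served (basically available) but may see an old value (soft state). After sync(), every replica returns the same value (eventual consistency).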
Categories of non-relational databases: Non-relational databases can be classified into four major categories: key-value databases, column databases, document databases and graph databases.
Key-value database: This is the simplest form of NoSQL database, where each value is associated with a unique key (e.g. Redis).
Column database: This type of database can store and process large amounts of data. A row key points to many columns, and the data are distributed over a cluster (e.g. HBase).
Document database: This type of database stores documents made up of key-value pairs, which can be nested to many levels. Efficient querying is possible with this type of database. The documents are stored in JSON or a JSON-like format (e.g. MongoDB).
Graph database: Instead of traditional rows and columns, this type of database uses nodes and edges to represent graph structures and store data (e.g. Neo4j).
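To make the four categories more concrete, here is a rough sketch of the shape of data in each, using only plain Python structures. It does not use Redis, HBase, MongoDB or Neo4j, and all keys, names and values are made up for illustration.

```python
import json

# Key-value: one opaque value per unique key.
key_value = {"session:42": "user=asha;theme=dark"}

# Column (wide-column): row key -> column family -> column -> value.
wide_column = {
    "user#1": {
        "profile": {"name": "Asha", "city": "Pune"},
        "stats": {"logins": "17"},
    }
}

# Document: a self-contained, nested, JSON-like record.
document = {
    "_id": 1,
    "name": "Asha",
    "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Graph: nodes plus labelled edges between them.
nodes = {"asha", "ravi", "mei"}
edges = [("asha", "FOLLOWS", "ravi"), ("ravi", "FOLLOWS", "mei")]


def total_quantity(doc):
    """Query inside a nested document without any JOIN."""
    return sum(item["qty"] for item in doc["orders"])


def neighbours(node, label):
    """Follow outgoing edges with the given label from one node."""
    return [dst for src, lab, dst in edges if src == node and lab == label]


def to_json(doc):
    """Serialize a document to JSON text."""
    return json.dumps(doc, sort_keys=True)
```

The document example shows the denormalized style: the orders live inside the customer record, so reading them needs no JOIN.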
Scalability is the ability of a system to accommodate rapidly growing incoming data without significant performance problems. It is a key requirement for any system that must handle growing data. There are two scaling methods, known as vertical and horizontal scaling.
Relational database tools generally support vertical scaling, also known as scale-up. This is the method of increasing the power of a system by adding CPU, memory and disk space. To cope with rapid incoming data, the single production server is upgraded. With this technique there is always a single production server to which all applications and users connect. A cluster can also be created by replicating the data across several nodes. Because of the ACID properties, all nodes must hold the same set of data, and data synchronization becomes complicated as the number of nodes grows. This setup is well suited to read scaling.
The benefit of this scaling methodology is the tight integration of data and its consistency across the nodes in a cluster. All nodes have the same set of data, and if there is a problem with the production server, the applications are automatically connected to another node. Such a cluster is known as a fail-over cluster.
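The fail-over idea can be sketched in a few lines of Python. Node and FailoverClient are hypothetical classes written for this illustration. They assume every node already holds an identical copy of the data, and they model an outage simply as a flag on the node.

```python
class Node:
    """One server in the cluster, holding a full copy of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.online = True

    def get(self, key):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.data[key]


class FailoverClient:
    """Try the production node first, then each standby in order."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def get(self, key):
        for node in self.nodes:
            try:
                return node.name, node.get(key)
            except ConnectionError:
                continue  # this node is down, try the next one
        raise ConnectionError("no node available")


# Hypothetical two-node cluster with identical data on both nodes.
shared = {"order:1": "shipped"}
primary = Node("primary", shared)
standby = Node("standby", shared)
client = FailoverClient([primary, standby])
```

While the primary is online the client reads from it. Setting primary.online = False makes the same call return the value from the standby instead, which is the behaviour the applications rely on.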
Non-relational database tools generally support horizontal scaling, also known as scale-out. This is the method of adding more computers to the network to cope with rapid incoming data. It is easy to add more nodes to the cluster as the data grow. Data are split automatically and processed across the nodes in the cluster. This is a distributed data environment. The Hadoop Distributed File System (HDFS) is a classic example of this.
The benefit of this scaling technique is that, since data are split and replicated across nodes, if any node goes offline the application can still get the data from other nodes, which keeps the data available at all times. This method is very useful in cases where no JOINs are required among the data on different nodes. It is also helpful for separating data and keeping them in different geographical locations.
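Splitting and replicating data across nodes can be sketched as follows. ShardedStore is a hypothetical in-memory class that picks the owner nodes of a key from a hash of the key and copies each value to two nodes. It uses simple modulo placement, which would move most keys whenever a node is added. Production systems typically use consistent hashing or a similar scheme to avoid that.

```python
import hashlib


class ShardedStore:
    """Toy scale-out store: each key lives on `replicas` of the nodes."""

    def __init__(self, node_names, replicas=2):
        if replicas > len(node_names):
            raise ValueError("replicas cannot exceed the number of nodes")
        self.order = list(node_names)
        self.nodes = {name: {} for name in self.order}
        self.replicas = replicas
        self.offline = set()  # names of nodes that are currently down

    def owners(self, key):
        """Pick the nodes for a key from a stable hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.order)
        return [
            self.order[(start + i) % len(self.order)] for i in range(self.replicas)
        ]

    def put(self, key, value):
        """Write to every owner (assumes all owners are reachable at write time)."""
        for name in self.owners(key):
            self.nodes[name][key] = value

    def get(self, key):
        """Read from the first owner that is still online."""
        for name in self.owners(key):
            if name not in self.offline:
                return self.nodes[name][key]
        raise ConnectionError("all replicas for this key are offline")
```

With three nodes and two replicas, taking any single node offline still leaves one copy of every key reachable, which is the availability benefit described above.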
While both scaling techniques have advantages and disadvantages, a good environment can combine them. For example, we can keep a scale-up read-and-write database that requires ACID properties on a single server, and distribute historical data across several scale-out nodes for data mining purposes.