Although MapReduce is a solid solution to the "Volume" problem of Big Data, there is still a strong need to keep data in a relational, SQL environment. Applications such as Data Warehouses, OLAP cubes, OLTP systems and Business Intelligence platforms still drive heavy demand in enterprises all over the world, and for these use cases traditional tools cannot keep up with the volumes of data. To serve these applications, Massively Parallel Processing (MPP) appliances have been created. MPP appliances provide parallel, distributed processing across an integrated set of servers and storage. They also integrate with relational DBMSs and Business Intelligence/Data Warehousing tools to provide a SQL interface and store data in relational form. The appliance package makes it possible to scale performance, storage and memory by adding servers, and the appliance arrives at your data center pre-configured for your networking environment, so there is no need to manage disk systems, software configuration, hardware configuration or optimization.
Although many of these offerings are growing in demand and popularity, the market is currently dominated and led by the following:
- Microsoft Parallel Data Warehouse
- IBM Netezza
- Teradata Data Warehouse Appliance
- Oracle Exadata
- SAP HANA
- EMC Greenplum
SMP vs. MPP
Symmetric Multi-Processing (SMP)
System architecture in which all of the processors connect to shared resources (memory, I/O and network) under a single operating system. Each processor has a private cache and access to the main memory. Processors are interconnected using buses, switches or on-chip networks.
- Relatively inexpensive single machine design (no racks needed)
- Symmetric distributed computing
- Efficiently/Quickly process small to medium data volumes
- Inefficient/Lengthy at processing large data volumes
- Scaling up or down requires a machine upgrade/downgrade
- Resource/Memory contention between processors
- External interrupts impact all processing
- Operating System limitation on scalability (OS can only support 64-100 multi-processors)
- Expensive (time and cost) to upgrade hardware
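The shared-memory model described above can be sketched in a few lines of Python. This is an illustrative sketch, not any vendor's API: worker threads operate inside one memory space and must synchronize access to shared state, which is exactly the resource contention listed among SMP's drawbacks.

```python
import threading

# All workers share one memory space (the SMP model): a single
# results dict guarded by one lock, illustrating resource contention.
shared_totals = {}
lock = threading.Lock()

def worker(name, values):
    partial = sum(values)   # private work in the thread's own scope
    with lock:              # contention point: one writer at a time
        shared_totals[name] = partial

data = {"t1": range(0, 100), "t2": range(100, 200)}
threads = [threading.Thread(target=worker, args=(n, v))
           for n, v in data.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(shared_totals.values()))  # 19900
```

Because every thread competes for the same lock and the same memory, adding processors eventually stops helping; this is the scalability ceiling the bullet points describe.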
Massively Parallel Processing (MPP)
System architecture in which processing is distributed in parallel across an integrated set of servers known as compute nodes. Each compute node contains its own set of processors, memory and bus, and each runs its own operating system and DBMS, allowing it to act as an independent processing unit. Compute nodes are interconnected through control and management nodes, which split, distribute and manage the processing. Compute nodes can be added or removed by adding or removing servers from the rack.
- Relatively inexpensive hardware needed to scale (cost of new server is cheaper than buying a new machine)
- No resource contention across compute nodes
- Scaling up or down is easy and can be performed without taking down the system
- Ability to add failover and backup servers
- Efficiently/Quickly process large data volumes
- No hard limit on the number of compute nodes that can be added
- Additional maintenance required (rack space, cooling, monitoring)
- Additional maintenance costs (power, cooling, hardware upgrades)
- Unused resources when processing small and medium data volumes
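The split/distribute/merge pattern performed by the control node can be sketched with Python's `multiprocessing` module. This is a minimal illustration of the shared-nothing idea, not a real appliance's interface: each "compute node" is a separate process with its own memory, and the parent process plays the control node, partitioning the data and merging the partial results.

```python
from multiprocessing import Pool

def node_aggregate(shard):
    # Local aggregation on the node's own data shard;
    # no memory is shared with other nodes.
    return sum(shard)

def control_node(data, nodes=4):
    # The control node splits the data into shards, distributes
    # them to the compute nodes, and merges the partial results.
    shards = [data[i::nodes] for i in range(nodes)]
    with Pool(processes=nodes) as pool:
        partials = pool.map(node_aggregate, shards)
    return sum(partials)

if __name__ == "__main__":
    print(control_node(list(range(1_000))))  # 499500
```

Because the nodes never contend for shared memory, scaling out is a matter of raising `nodes` (in a real appliance, adding servers to the rack), which is the property the advantages above describe.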
As mentioned earlier, traditional BI, Data Warehouse and DBMS tools are not able to keep up with today's data volumes. These tools use a Symmetric Multi-Processing (SMP) architecture: Moore's Law cannot keep pace with the velocity and volume of data growth, and shared resources and memory limit scalability and distributed processing.
To compensate for this hardware lag (the disparity between Moore's Law and data growth), hardware must be integrated and coupled. The Massively Parallel Processing (MPP) architecture pools hardware resources into a more powerful system that can meet the demands of processing and storing Big Data while still providing the usability of tools built on the SMP architecture.