An MPP (massively parallel processing) database distributes both data and query processing across the nodes of a cluster of commodity servers. Greenplum takes a distinctive approach to building an MPP data warehouse. Because it builds on an established open source database, PostgreSQL, its engineers can focus on adding value where it counts: parallelization and the associated query planning, a columnar data store for analytics, and management capabilities.
Pivotal released Greenplum as open source under the Apache 2.0 license in 2015. The latest release, Greenplum 6.0, goes a long way toward re-integrating the Greenplum core with PostgreSQL, incorporating nearly six years of improvements from the PostgreSQL project. As a result, Greenplum will pick up new PostgreSQL features and enhancements for “free” going forward, while Pivotal focuses on making these additions work well in a parallel environment.
An MPP database uses what is known as a shared-nothing architecture. In this architecture, individual database servers (based on PostgreSQL), known as segments, each store and process a portion of the data before returning their results to a master host. Similar architectures appear in other data processing systems, such as Spark and Solr. This is one of the key architectural features that allows Greenplum to integrate other parallel systems, such as those for machine learning or text analytics.
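To make the shared-nothing idea concrete, here is a minimal Python sketch of the scatter/gather pattern described above. It is a toy model, not Greenplum code: the function names, the CRC32 hash, the four-segment cluster, and the sample sales rows are all assumptions made for illustration, and the "segments" run one after another in a single process rather than in parallel on separate hosts.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment index.

    CRC32 is an illustrative stand-in; it is not Greenplum's hash function.
    """
    return crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Scatter rows so that each segment owns a disjoint slice of the data."""
    segments = [[] for _ in range(num_segments)]
    for customer, amount in rows:
        segments[segment_for(customer, num_segments)].append((customer, amount))
    return segments


def segment_partial_sum(local_rows):
    """Work done on one segment: aggregate only the rows it holds locally."""
    totals = defaultdict(int)
    for customer, amount in local_rows:
        totals[customer] += amount
    return dict(totals)


def master_gather(partials):
    """Work done on the master: merge the partial results from every segment."""
    merged = defaultdict(int)
    for partial in partials:
        for customer, subtotal in partial.items():
            merged[customer] += subtotal
    return dict(merged)


def run_query(rows, num_segments=NUM_SEGMENTS):
    """Simulate SELECT customer, SUM(amount) ... GROUP BY customer."""
    segments = distribute(rows, num_segments)
    # In a real cluster these calls would run concurrently, one per segment.
    partials = [segment_partial_sum(local) for local in segments]
    return master_gather(partials)


# Made-up sample data: (customer, amount) pairs.
SALES = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]

if __name__ == "__main__":
    print(run_query(SALES))
```

Because rows are placed by hashing the customer name, all rows for a given customer land on the same segment, so each segment can compute its share of the answer without consulting any other. The master only has to combine the partial results.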