Apache Doris just ‘graduated’: Why care about this SQL data warehouse

Posted on 23-06-2022, by: admin

In case you are wondering who “she” is and what school she went to, Doris is an open source, massively parallel processing (MPP) analytical data warehouse that was under development at Apache Incubator.

Last week, the Apache Software Foundation (ASF) said that Doris had achieved top-level status, which according to the foundation means that a project “has proven its ability to be properly self-governed.” 

The SQL-based data warehouse, which is compatible with the MySQL protocol, was recently released in version 1.0, its eighth release while undergoing development at the incubator, along with six Connector releases linking Doris to various analytics and processing technologies. It has been built to support online analytical processing (OLAP) workloads, often used in data science scenarios, among others.

Doris was born inside Chinese internet search giant Baidu, where it was known as Palo, as a data warehousing system for the company’s advertising business. It was open sourced in 2017 and entered the Apache Incubator in 2018.

Doris has roots in Apache Impala and Google Mesa

Doris integrates technology from Google Mesa and Apache Impala, an open source MPP SQL query engine developed in 2012 and based on the underpinnings of Google F1.

Mesa, which was designed around 2014 to be a highly scalable analytic data warehousing system, was used to store critical measurement data related to Google’s Internet advertising business.

According to its developers, both at Baidu and at the Apache Incubator, the database offers a simple architecture while providing high availability, reliability, fault tolerance, and scalability.

“The simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system are the main features of Doris,” the Apache Software Foundation said in a statement, adding that the data warehouse supports multidimensional reporting, user portraits, ad-hoc queries, and real-time dashboards.

Other features of Doris include columnar storage, parallel execution, vectorization technology, query optimization, ANSI SQL, and integration with big data ecosystems via Connector support for Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Spark, and Elasticsearch, among other systems.
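To make the columnar storage feature more concrete, the short Python sketch below contrasts a row-oriented layout with a column-oriented one. It is only an illustration of the general idea, not Doris's storage format: the sample ad records, the field names, and the helper functions are all hypothetical.

```python
from typing import Dict, List

# Hypothetical sample data: one dict per record (row-oriented layout).
rows: List[Dict[str, object]] = [
    {"ad_id": 1, "region": "EU", "clicks": 120},
    {"ad_id": 2, "region": "US", "clicks": 300},
    {"ad_id": 3, "region": "EU", "clicks": 80},
]


def to_columns(records: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Pivot row records into one list per field (column-oriented layout)."""
    columns: Dict[str, List[object]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_clicks_row_store(records: List[Dict[str, object]]) -> int:
    """Row layout: every whole record is visited to read a single field."""
    return sum(record["clicks"] for record in records)


def total_clicks_column_store(columns: Dict[str, List[object]]) -> int:
    """Column layout: only the 'clicks' column is read; other fields are skipped."""
    return sum(columns["clicks"])


columns = to_columns(rows)
```

Both functions return the same total, but the columnar version touches only the values it needs, which is the property that tends to suit analytical queries that aggregate a few fields over many records.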

Uptake of open source databases forecast to grow

Uptake of enterprise-grade, open source databases is expected to grow. In Gartner’s State of the Open-Source DBMS Market 2019 report, the consulting firm predicted that more than 70% of new in-house applications will be developed on an Open Source Database Management System (OSDBMS) or an OSDBMS-based Database Platform-as-a-Service (dbPaaS) by 2022.

In addition, as data proliferates and businesses’ need for real-time analytics grows, a simple yet massively parallel processing database that is also open source seems to be the need of the hour.

“As data volumes have grown, MPP databases became the only realistic way to process data quickly enough or cheaply enough to meet organizations’ demands,” said David Menninger, research director at Ventana Research.

Cloud fuels use of MPP databases

Another trend fueling MPP databases is the availability of relatively inexpensive cloud-based server instances, which can be used as part of the MPP configuration, thus eliminating the need to procure and install the physical hardware these systems use, Menninger said.

Further, making a case for Doris, Menninger said that while there are many MPP database alternatives, some of which are open sourced, there isn’t really an open source MySQL option.

“MySQL itself and MariaDB have been extended to support larger analytical workloads, but they were initially designed for transaction processing,” Menninger said, adding that open source MPP database Greenplum and hyperscaler services such as Google BigQuery, Amazon Redshift and Microsoft Synapse could be considered as rivals to Doris.

ClickHouse, Apache Druid, and Apache Pinot could also be considered rivals, said Sanjeev Mohan, former research vice president of big data and analytics at Gartner.

Doris offers architectural simplicity, fast query times

According to the Apache Software Foundation, using Doris could have multiple advantages, such as architectural simplicity and faster query times.

One of the reasons behind Doris’ simplicity is that it does not depend on multiple components for tasks such as cluster management, synchronization and communication.

The fast query times can be attributed to vectorization, a process that allows a program or an algorithm to operate on a set of multiple values at one time rather than on a single value.
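As a rough illustration of that idea, the Python sketch below contrasts value-at-a-time processing with batch-at-a-time processing of a single column. It is a toy model and not Doris code: the function names, the sample prices, and the batch size are assumptions made for illustration. Pure Python gains no real hardware speedup here; only the shape of the execution model is shown.

```python
from typing import List


def filter_sum_value_at_a_time(prices: List[float], threshold: float) -> float:
    """Handle one value per step: test it, then add it to the total."""
    total = 0.0
    for price in prices:
        if price > threshold:
            total += price
    return total


def filter_sum_batched(prices: List[float], threshold: float,
                       batch_size: int = 4) -> float:
    """Handle a batch of column values per step.

    Each step applies the comparison to the whole batch, producing a mask,
    and then sums the selected values of that batch in one operation.
    """
    total = 0.0
    for start in range(0, len(prices), batch_size):
        batch = prices[start:start + batch_size]
        mask = [price > threshold for price in batch]
        total += sum(price for price, keep in zip(batch, mask) if keep)
    return total


# Hypothetical sample column of prices.
sample_prices = [5.0, 12.5, 7.0, 20.0, 3.5, 15.0]
```

Both functions compute the same result. In a real vectorized engine, the per-batch comparison and sum would typically map onto tight loops over contiguous column data that the CPU can execute with SIMD instructions, which is where the speedup comes from.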

Another benefit of the data warehouse, according to the developers at the Apache Software Foundation, is Doris’ ability to handle concurrency, as well as updates and deletes of data. Concurrency here refers to requests from multiple users to process data and gain insights from the database at the same time.

The need for concurrency has increased because most organizations now allow many employees to access data in order to drive insights, in contrast to past practices, in which mainly C-suite executives and specialists had access to analytics.

Apache Doris is currently being used in more than 500 enterprises globally.