Why Implement Vertica?
A year ago we in GoodData decided to migrate big analytical databases from Postgres to a columnar database. In comparison to traditional relational row-oriented databases, columnar databases had been evolving for fewer years, and there is still a smaller number of products in this market. These engines differ from traditional databases in the way they store data on hard disc. They keep data of one column in a single file. So to compute total of one fact, you read only the data you really need, just the column where the fact is placed. This is especially beneficial in analytics where a typical query uses only a few columns out of many (often hundreds).
We evaluated available engines including Vectorwise, Monetdb, Greenplum, Infobright and LucidDB, and chose Vertica since it was the fastest, and provided other features important for us like massive parallelism, high availability mode, and also functional features like time slices on integrated R environment. We believe it is the best product we may have as an underlying database engine of our analytical data marts with large fact tables. Of course, since it is a hot new technology, we have experienced some stability and deployment issues during the implementation, but we have successfully solved them, partially in cooperation with HP, Vertica’s provider.
Some Real Results
So far, we have measured thousands of experiments, including report computation time of real GoodData projects. We want to give you a flavor of Vertica advantages in this blog post. Especially in case of tens or hundreds of millions of rows in fact tables Vertica is excellent, and this the primary use case for the columnar database. Smaller projects usually perform well on Postgres, and the difference is not as visible.
GoodData Customer 1 - there are many time consuming reports on Postgres (green) that are significantly faster on Vertica (blue). The Y-axis is the number of seconds to load a report and the X-axis are the reports by report number.
GoodData Customer 2 - overall Vertica (blue) provides significantly better performance than Postgres (green) out of the box. However as you can see from the chart, there are some exceptions that require further fine-tuning. The Y-axis is the number of seconds to load a report and the X-axis are the reports by report number.
On average, reports on Vertica are 3 times faster than on Postgres, but for large reports the improvement is much more visible: among reports taking more than 20 seconds, the average time on Vertica is about 28 seconds, but on Postgres it is 150 seconds!
On Vertica 98% of all reports are computed within 48 seconds, while Postgres needs 294 seconds. So the 98th percentile of report execution time is six times better on Vertica.
And these numbers are not final! The results have been captured on exactly the same data model on both engines, Postgres and Vertica, without any special optimizations specific for the columnar database. We are now working on some improvements for Vertica, so we expect the difference will be much visible once we roll it out. We plan that the first customers should be able to run their GoodData analytical projects on Vertica by the end of March 2014.
And the great news for our existing customers: we will identify projects that may perform better on Vertica, and migrate them without affecting the users. Their reports will just suddenly get faster.
About The AuthorTomáš Jirotka is a product manager at GoodData for analytical data marts and related infrastructure. He joined the company in July 2011 and participated on Extensible Analytical Engine (XAE) development thanks to this he has deep knowledge of the platform backend.