May 01, 2017
There are few large, hard to solve problems that keep your typical software engineer up at night — and a malfunctioning database is absolutely one of them. We’ll walk you through how our team at Contactually discovered the issue of query performance slippage on some of our largest tables (200 mil — 3 billion+ records) and what we did to address it.
In early 2016, we noticed query performance on our largest tables were starting to slip in NewRelic.
Over the course of a few weeks, queries that were once taking 10ms were now taking upwards of 6 seconds or more. The team considered numerous causes from poorly written queries to expensive background jobs that may be causing systematic performance issues. Eventually, with the aid of NewRelic, Amazon RDS logs, and hundreds of EXPLAIN queries, we came to the conclusion that this was a database level issue, not poorly written queries in our application (a very common cause of poor database performance).
After days of research, we narrowed in on the cause: the default autovacuum settings for large tables.
The default settings can cause weeks to elapse without triggering an autovacuuming process to run on a large table. When autovacuuming doesn’t occur, the query planner is using outdated, incorrect data to decide how to most efficiently execute a query.
Imagine trying to tell a friend where the milk is in a grocery store using directions from what the store looked like 5 years ago — you’re friend is going to waste a lot of time using your outdated directions.
There are numerous benefits to regular autovacuuming. However, this type of operation consumes system resources. The purpose of the default settings for autovacuuming is to ensure it does not run too often, siphoning precious system resources away from answering queries. It’s a delicate balance to strike.
The side effects of a poorly functioning query planner can be catastrophic. The CPU and Disk IO hit caused by using these bad directions can lock up vital resources needed by the rest of the database, dramatically decreasing query performance across the board, even for tables that aren’t affected by this specific issue.
Take a look at how a simple select statement, finding a row by its primary key, was impacted (before and after):
We use Amazon’s Relational Database Service (RDS) to host our Postgres databases, so much of the infrastructure complexity is hidden from us, e.g. network, filesystem, etc. However, there is a lot to be desired from the default RDS Postgres database configuration.
We found our solution in a rather infrequently mentioned “auto vacuum scale factor” setting that helps Postgres determine when it’s time to perform maintenance on a particular table. RDS defaults to running an autovacuum process when 20% of the tuples in a table have been created, updated or deleted. Some of our larger tables have as many as 3 billion records, so this meant that autovacuuming wouldn’t occur until 600 million rows had been changed!
For our larger tables we decided to update our autovacuum scale factor down to 2% instead of the default 20% (or 60 million records instead of 600).
vacuum threshold = vacuum base threshold + vacuum scale factor * numberof tuples
While the majority of the work to solve our issues was in research, better understanding the internals of Postgres, all it took to rectify the solution was run the following query and wait for Postgres to do its work.
ALTER TABLE table_name SET autovacuum_vacuum_scale_factor = 0.02;
We were able bring down our average query execution time by nearly 50% across the board, but often times upwards of 90% faster on these larger tables.
It goes without saying that this level of speed improvement shouldn’t be surprising when regular database maintenance that should be constantly occurring doesn’t happen for large periods of time.
Contactually is looking for experienced software engineers who are passionate about solving complex problems with code. We’ve taken a novel approach to building business software — focus on the end user — and it’s been working! Our users love us. Come help us build a product that makes contact management easy and rescue 10,000s of people from the jaws of clunky, outdated software.
Postgres at Scale: Query Performance and Autovacuuming for Large Tables was originally published in Contactually Product and Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.
Next Post: Engineering Great Design Process