June 19, 2017

Adding Read Replicas in a Production Ruby on Rails System with Zero Downtime Using Makara

Title slide bg wide

Thaddaeus Lim

When query, schema and index optimizations aren’t enough to support the load of a high throughput database system, fundamental changes in how data flows through that system are often necessary.

Over the past few years, our team at Contactually has been fortunate enough to tackle numerous database performance issues, ranging from autovacuum tuning to full production database upgrades. Growth as a company doesn’t occur without these types of issues and we’re happy to share our experiences with our peers.

Why Splitting Your Read/Write Queries is Necessary

Load balancing is a fundamental concept in any system design. It’s important to design your data systems to handle the peaks or spikes in usage, as opposed to designing to handle the average load a system experience.

By splitting queries between your primary and follower databases, you’re able to ensure that users interacting with your web application have a snappy and performant experience, while saving heavy background jobs for a follower.

How We Researched Potential Solutions

There are a few options when it comes to choosing a database adapter that supports primary/follower query balancing. After evaluating Makara and octopus, we decided to go with Makara for the following reasons:

  1. Thread-safety: we use Sidekiq for our background job processing, so a non-thread-safe database adapter was a deal-breaker.
  2. Community support / used in large production environments. We read about a few companies experience using Makara, namely Instacart.

In our research, we discovered a gem called distribute_reads (made by Instacart), that sealed the deal for us. It also fed directly into our ideal incremental release strategy.

Plan a Detailed Roll Out Strategy

Given the importance of a highly available database, we had to be very careful in how we rolled out these changes to production. The key take aways were to:

Implement a Kill Switch

Use a database config that could easily be reverted using ENV['USE_MAKARA’] in case we quickly needed to roll back our changes and use the default postgresql adapter.

development: &default
<% if ENV["USE_MAKARA"] %>
url: postgresql-makara:///
sticky: true
- role: master
name: primary
url: <%= ENV["DATABASE_URL"] %>
- name: replica
<% else %>
adapter: postgresql
url: <%= ENV["USE_DATABASE_URL"] %>
<% end %>

<<: *default

We ran into an issue with our first deployment where we needed to immediately roll back. Users’ access to Contactually was unaffected and we were able to quickly address the problem and re-deploy. You can see some more suggestions from this gist here.

Start Slow — Pick a Non-Critical Job

Pick a non-critical, background job to introduce distribute_reads on. We picked our job that generate merge suggestions for our users. If the jobs failed for whatever reason, we could roll back, and just run them again.

distribute_reads do
# your regular application logic

If the Makara adapter is being used, reads will be distributed throughout the pool of slaves. If it’s not, for whatever reason, distribute_reads will gracefully use the default postgresql adapter.

Monitoring is Key

Ensure you have proper monitoring in place. We use NewRelic and Amazon RDS metrics to see the effect of our changes. If you can’t monitor results, you have no way of knowing how effective your changes were.

Example RDS Metrics showing our databases’ health and load.

Effective monitoring helped us quickly (within seconds) determine that our initial rollout had an issue that needed to be quickly fixed.

Warm Your Follower Database’s Cache

Garett Mizunaka

If you’re introducing a follower database that has never been used or queried before, it’s important to make sure the database is warmed up before hitting it with lots of complex queries.

Just like a car in cold weather, you wouldn’t want to start your engine and slam on the gas before the engine has had a chance to warm up.

If the follower instances were recently created, initiate simple count queries for your most recently active users to warm both the database and operating system level caches. For more information, check out pg_prewarm.

Results — Huge Success!

Results so far have been a staggering improvement in system-wide database performance improvement. We saw a ~30% decrease in time spent in the database for our web-initiated queries and an almost total elimination of spikes in Queue Depth.

If you or your team have implemented a system like this, we’d love to hear your feedback and share what we’ve learned in greater depth!

For millions of professionals, relationships are the backbone of a viable business. Whether you’re working with clients, prospects, or potential investors, Contactually helps you build stronger relationships with the people who can make you successful.

Contactually is looking for experienced software engineers who are passionate about solving complex problems with code. We’ve taken a novel approach to building business software — focus on the end user — and it’s been working! Our users love us. Come help us build a product that makes contact management easy and rescue 10,000s of people from the jaws of clunky, outdated software.

Adding Read Replicas in a Production Ruby on Rails System with Zero Downtime Using Makara was originally published in Contactually Product and Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.