4 Ways to Scale on Heroku

Mastering Heroku — Issue #6

A few weeks ago I gave a presentation at the Columbus Ruby Brigade about my approach and mental model for scaling Heroku apps. My attempt to record the talk failed, so I rerecorded it as a screencast just for you. ❤️

Here’s what I cover in the video:

  • Scaling option #1: Horizontal—add more dynos. If you see increased request queue times in Scout or New Relic, you need to make your app faster or add more dynos. As soon as you’re using more than one dyno, automate your dyno count instead of playing a guessing game.

  • Scaling option #2: Vertical—increase dyno size. Because of Heroku’s random routing, you need concurrency within a single dyno. This means running more web processes, which consume more memory and may require a larger dyno type. Aim for at least three web processes per dyno (see the Puma sketch after this list).

  • Scaling option #3: Process types. You’re not limited to just “web” and “worker” process types in your Procfile. Consider multiple worker process types that pull from different job queues. These can be scaled independently for more control and flexibility.

  • Scaling option #4: App instances. Heroku Pipelines make it relatively easy to deploy a single codebase to multiple production apps. This can be helpful to isolate your main app traffic from traffic to your API or admin endpoints, for example. Heroku will route traffic to the correct app based on the request subdomain and the custom domains configured for each app.
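To make option #2 concrete, here’s a minimal `config/puma.rb` sketch. It assumes Puma and the usual Heroku env var conventions; the numbers are starting points, not recommendations.

```ruby
# config/puma.rb
# Run several web processes per dyno so the app server can route
# requests intelligently, even though Heroku's router picks dynos
# at random.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 3))

# Threads per process; pair this with your database pool size.
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# Load the app before forking workers for copy-on-write memory savings.
preload_app!
```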

My general advice:

  • Start simple.

  • Configure multiple web processes per dyno, increasing dyno size if needed.

  • If you need more than one web dyno, autoscale it.

  • If certain background workers are resource hogs, they may require a larger dyno size. Split them into their own process types with dedicated job queues so they can be scaled independently (see the sketch after this list).

  • If you have dedicated sections of your web app such as an API or admin section, split them into their own subdomain so you can divert traffic to a separate app instance. Keep them on a single app instance until the additional complexity is absolutely necessary.
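Here’s a sketch of those dedicated process types, assuming Sidekiq. The Procfile might define `worker: bundle exec sidekiq -q default` alongside `heavy_worker: bundle exec sidekiq -q heavy -c 2` (`-q` and `-c` are Sidekiq’s queue and concurrency flags; the process and queue names are made up), and each job opts into a queue:

```ruby
# Hypothetical jobs pinned to different queues so each Procfile
# process type (worker, heavy_worker) can be scaled independently.
class WelcomeEmailJob
  include Sidekiq::Worker
  sidekiq_options queue: :default   # lightweight, high-volume work

  def perform(user_id)
    # ...
  end
end

class VideoTranscodeJob
  include Sidekiq::Worker
  sidekiq_options queue: :heavy     # resource hog: fewer, bigger dynos

  def perform(video_id)
    # ...
  end
end
```

Each type then scales on its own, e.g. `heroku ps:scale worker=2 heavy_worker=1`.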

Did you find the video helpful? Anything you’d add or change? Let me know!

Happy scaling!
— Adam (@adamlogic)

Mastering Heroku — Issue #5

A reader asked me for some help this week:

Hi Adam, saw your post and decided to reach out. I used AutoScale in the past but am now on performance dyno. We keep running into Rack::Timeout::RequestTimeoutException errors. Wondering if you may have any suggestion.

I feel this pain. Request timeouts are the worst. If you're not a Rubyist or if you're just unfamiliar with the error above, Rack::Timeout is a library for timing out long-running requests in Rack apps, including Rails.

Why would you want to time out a request? Because if you don't, Heroku will:

Occasionally a web request may hang or take an excessive amount of time to process by your application. When this happens the router will terminate the request if it takes longer than 30 seconds to complete.

We’ve all seen these infamous H12 errors appear in our logs and in our Heroku metrics panel.

It's unfortunate when a user sees an error page, but what's worse is that your app has no idea when Heroku times out a request. Your app will continue processing to completion, whether it takes an additional five seconds or an additional five minutes.

While the router has returned a response to the client, your application will not know that the request it is processing has reached a time-out, and your application will continue to work on the request.

Libraries like Rack::Timeout allow you to halt processing of a long-running request before Heroku times out. This gives you more control over the error the user sees and prevents a hung request from bringing down your app server.

It's a Band-Aid, though, and this particular Band-Aid often introduces more problems than it solves.

Raising mid-flight in stateful applications is inherently unsafe. A request can be aborted at any moment in the code flow, and the application can be left in an inconsistent state.

This is straight from the Rack::Timeout docs, which do an excellent job of warning you about the risks and tradeoffs. I’ve personally seen all kinds of strange and frustrating behavior with Rack::Timeout. As far as I’m concerned, it’s just not worth it.

So what’s a safe alternative that prevents hung requests from bringing down your app? As usual, Nate Berkopec sums it up well:

Nate Berkopec@nateberkopec

Most apps could eliminate usage of rack-timeout if they just set aggressive timeouts on network and db. e.g. setting statement timeouts: https://t.co/Ji9delcuB8 I understand some apps can't for legitimate reasons, but rack-timeout is a big, big hammer.

Nate references the Ultimate Guide to Ruby Timeouts, which, if you’re a Rubyist, you should bookmark right now. By setting library-specific timeouts on database queries and network requests, you can gracefully handle unpreventable slowdowns—roll back transactions, show a meaningful error message, whatever you need to do.
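Here’s what those library-specific timeouts look like in practice. A sketch only: the host and values are illustrative.

```ruby
require "net/http"

# HTTP calls: fail fast instead of hanging on a slow external service.
http = Net::HTTP.new("api.example.com", 443)
http.use_ssl = true
http.open_timeout = 3   # seconds allowed to establish the connection
http.read_timeout = 5   # seconds allowed per read of the response

# Postgres: abort any single query that runs too long.
# In config/database.yml (value in milliseconds):
#
#   production:
#     variables:
#       statement_timeout: 5000
```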

This isn’t a perfect solution, of course. You could still have a single request with 1,000 database queries, none of which individually time out, but collectively are way over Heroku’s 30-second limit.

In these cases, I still don’t think it’s worth reaching for a “big, big hammer”. Instead, set up alerting for Heroku’s H12 errors. You can use Heroku’s threshold alerting for this, or set up alerting in your log management tool (I use both). Heroku add-ons like Logentries will alert you on H12 errors out of the box.

With these alerts in place, you can investigate your timeouts to fix the root cause instead of relying on a Band-Aid. The H12 error will tell you the exact URL that timed out, so use that along with an APM tool like Scout to determine what went wrong. Chances are, you either have an N+1 query or you’ve omitted a timeout on some I/O.

To recap:

  • Set library-specific timeouts for all I/O (database, network, etc.)

  • Avoid solutions that arbitrarily halt application processing in the middle of a request. It’ll lead to unpredictable and hard-to-debug behavior.

  • Monitor your H12 errors.

  • Use an APM tool to fix those slow endpoints.

And of course, use autoscaling to ensure a few slow requests don’t slow down your entire app. 😁

Happy scaling!
Adam

Mastering Heroku — Issue #4

I recently had a consulting call with Jesse Hanley (creator of Bento) that got me thinking about an approach many of us are guilty of when our apps are struggling: We just turn the knobs up to 11.

Jesse Hanley@jessethanley

If you're using @heroku and have questions about scale, you should hit up @adamlogic.

Just had a really insightful call with him that took me from "I have no idea what I'm doing, app needs more Performance-L dynos lol" to confident to experiment finding a profitable balance.

September 17, 2018

On Heroku, "turning it to 11" translates to blindly adding dynos and increasing dyno size. This approach can work, but it comes with major strings attached:

  • It gets expensive fast.

  • You risk overwhelming your downstream dependencies, such as hitting a connection limit on Postgres (I touched on this last week).

  • You're masking or ignoring underlying root causes of performance issues.

I subscribe to this crazy idea that Heroku is not expensive. When optimally configured, it can be a steal.

Adam McCrea@adamlogic

Yesterday @railsautoscale handled 1.64M requests. I pay ~$100/mo to host it on @heroku. It *can* get expensive, but that's avoidable. Small teams don't need a dev-ops engineer, they just need guidance. https://t.co/fNOKhXva9d https://t.co/AIycHgDghm

August 14, 2018

So how do you optimize your Heroku setup? Here are some tips that got Jesse on the right track:

  • If absolutely consistent performance is a hard requirement, you need performance dynos. The shared architecture of standard dynos means your performance will fluctuate due to factors completely out of your control.

  • Remove the guesswork from choosing the number of dynos. Use Heroku's own autoscaling, HireFire, or Rails Autoscale to automatically scale up and down as needed. Autoscaling should be a hard requirement for a production app.

  • Once you’re autoscaling, there's little reason to use a larger dyno type than necessary. A smaller dyno lets you autoscale at a finer granularity and save a whole lot more cash. This is another reason jumping straight to Perf-L dynos is usually a bad idea.

  • On the other hand, you do need a large enough dyno to run multiple web processes. Heroku's random routing architecture means you’ll stabilize your performance by adding web processes so your app server can intelligently route requests within a dyno. A good rule of thumb is to choose a dyno type with enough memory to run at least three web processes.

  • Do the math to ensure you don’t exceed database connection limits: [connection pool] * [processes per dyno] * [max dynos]. Do this for each process type (web, worker, release, etc.); see the worked example after this list.
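A worked example with made-up numbers, against the 120-connection limit of a Heroku Postgres Standard-0 plan:

```ruby
# [connection pool] * [processes per dyno] * [max dynos], per process type
web     = 5 * 3 * 6    # pool of 5, 3 Puma workers per dyno, max 6 dynos = 90
worker  = 10 * 1 * 2   # pool of 10, 1 Sidekiq process, max 2 dynos     = 20
release = 5 * 1 * 1    # one-off release dyno                           =  5

web + worker + release # => 115, safely under the 120 limit
```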

With that basic setup, you can focus your efforts on your app itself. Use an APM like New Relic or Scout to measure and diagnose potential bottlenecks. Any improvements there will result in faster response times, less scaling up, and lower Heroku bills.

Happy scaling!
—Adam



Mastering Heroku — Issue #3

I’ve been thinking a lot about these tweets from Nate Berkopec (author of The Complete Guide to Rails Performance):

Nate Berkopec@nateberkopec

Hi! Do you have a Rails application? Open up your code *right now* and double check to make sure that your production database pool (config/database.yml), Sidekiq concurrency, and Puma thread count are all the same number (preferably using an env variable like RAILS_MAX_THREADS).

September 7, 2018

Nate Berkopec@nateberkopec

This is true if on Heroku or not: 5 for Puma/web procs, 10-25 for Sidekiq. Increase sidekiq number if your jobs do lots of HTTP calls. If mostly DB, shoot lower (10 is fine). To set this number on a per-dyno basis, add it your Procfile (e.g. "web: RAILS_MAX_THREADS=5 rails s") https://t.co/nIfgUKMgKM

September 7, 2018

It’s good, simple advice, and it’s absolutely an area where many apps get into trouble, on Heroku or otherwise. If your connection pools are too small, your application will unnecessarily waste time waiting for an available connection, perhaps encountering timeout errors after waiting too long. If your connection pools are too large, you might run up against a connection limit in your datastore. It’s a delicate balance.

Multiply those considerations by your number of datastores (I use Postgres and Redis) and the number of different processes you’re running (I have 5 defined in my Procfile), and suddenly you’re juggling a lot of information.
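For reference, here’s a minimal sketch of what Nate describes, driving all the numbers from RAILS_MAX_THREADS. The ERB-in-YAML pattern is standard in Rails and Sidekiq config files; the values are just examples.

```ruby
# config/puma.rb — web thread count comes from RAILS_MAX_THREADS
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# config/database.yml — the Active Record pool uses the same variable,
# so every process type gets a pool exactly as big as its thread count:
#
#   production:
#     pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
#
# config/sidekiq.yml:
#
#   :concurrency: <%= ENV.fetch("RAILS_MAX_THREADS", 10).to_i %>
#
# Procfile — set the value per process type, as Nate suggests:
#
#   web: RAILS_MAX_THREADS=5 bundle exec puma -C config/puma.rb
#   worker: RAILS_MAX_THREADS=10 bundle exec sidekiq
```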

Nate’s advice above is a great starting point, but you’ll eventually need a solid understanding of why it’s good advice and how changing any of these settings will impact your application as a whole. I find it hard to wrap my head around without some kind of visualization. Something like this tool from Manuel van Rijn, but more visual and not specific to Sidekiq and Redis. Here’s a rough sketch of what I’m imagining.

This is too app-specific to be a Heroku feature, or even an add-on. It’s just a tool for you to plug in your own numbers and visualize the result. Here are the kinds of questions I want it to answer:

  • If I need to scale my app from 3 to 5 web dynos, how many extra datastore connections would I create? (See the worked example after this list.)

  • What if I increase the number of processes per dyno (Puma workers, for example)?

  • How can I maximize the usage of my limited connections on an entry-level Postgres/Redis tier?
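To make that first question concrete with made-up numbers: going from 3 to 5 web dynos, with 3 Puma workers per dyno and a pool of 5, adds 2 × 3 × 5 = 30 Postgres connections.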

It could also highlight potential issues such as when a connection pool is smaller than the number of threads or when the total datastore connections exceed the current plan limit. Taking it a step further, I’d love to provide guidance on how to implement these settings, but that’s very framework-specific. I’m not sure… it’s a very rough idea right now that I thought would be fun to share.

Would you use something like this? Is it a dumb idea? Reply and let me know what you think!

Mastering Heroku — Issue #2

I want to talk about a problem we ran into last week at work.

We use pipelines for deployment—our master branch is automatically deployed to staging via CI, then we promote staging to production. This works amazingly well, and is one of the features I love most about Heroku. But this week, one of our promotions from staging to production failed. Specifically, the release command failed.

Don’t worry if you’re not a Rubyist. Here’s what’s happening:

  • The release phase is trying to migrate the database via `rails db:migrate` (as specified in our Procfile).

  • It’s encountering an error while booting the app environment, before it even touches the database.

  • The error makes no sense. It’s failing to parse JSON that’s internal to Uglifier, a Ruby wrapper for UglifyJS.

This kind of thing isn’t supposed to happen with pipelines. The beauty of pipelines is that the compiled slug is reused, so the same application code and dependencies are guaranteed across environments.

But it was happening, and we needed to fix it.

Debugging 101: Reproduce the problem

Before we could fix the problem, we needed a reliable way to reproduce it. Fortunately, repeated promotions did reproduce the exact same error. Unfortunately, we couldn’t step into that process to introspect what was going on.

We were hopeful that we could manually reproduce the problem by running `rails db:migrate` in a Heroku Bash session, but that ran without issue on both staging and production. Of course, this wasn’t a true reproduction anyway because our Bash session on production was using the old code—we didn’t have a way of getting the new code on production and manually reproducing the error.

So we were stumped again.

What are the differences?

We knew that staging worked with the exact same code, so the next question was: what’s different between the two environments? Pipelines ensure parity in the code, which leaves two potential differences: the attached resources (like the database) and the config vars.

The two environments use different databases, but the app hadn’t even attempted to connect to the database yet. We were pretty sure this error was encountered before any external dependency came into play, so we focused on config var differences. One stood out to me in particular:

Jemalloc is an alternative memory allocator for Ruby, but that’s not really important. What is important is that I’d enabled it in production, but not staging. An oversight on my part.

The stack trace showed a JSON parse error, though—nothing remotely to do with memory. This couldn’t be a relevant difference at all. Right?

You know how this story ends. We disabled Jemalloc in production, promoted staging again, and the release went through without a problem.

Takeaways

This isn’t exactly a happy ending to the story. We still have absolutely NO IDEA why we encountered this error. It makes no sense at all. We’ve left Jemalloc disabled for now and are moving on.

There are still some good lessons to take away from the experience:

  • Parity between environments is important.

  • Know your tooling. The Heroku CLI was especially useful here for running Bash and exploring config vars.

Elsewhere…

Speaking of environment parity, this mini-debate on Twitter caught my attention:

Michał Matyas@nerdblogpl

I know creating custom environment in Rails (like staging) is a bad practice, but why exactly? Asking for the next time I need to explain it to someone with arguing style of "I disagree strongly not because I'm right but because you don't have good arguments"

August 21, 2018

So there are two approaches for staging/review environments on Heroku:

  1. Create a separate custom environment in your application. In Rails, this means a new file in config/environments and changing RAILS_ENV for each app instance.

  2. Treat all Heroku app instances as “production”, and use config vars to differentiate behavior and credentials by environment. This is Heroku’s recommendation.

I’m with Heroku on this one because inevitably someone (not you or me of course) will add an explicit “production” environment check to the code. Something like this made-up example:
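```ruby
# A made-up example of the check that inevitably sneaks in:
if Rails.env.production?
  Analytics.track(event)  # "we only want this in production... right?"
end
```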

Will that code execute in your staging environment? Who knows! Don’t do this. Use config vars instead. 👍
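With a config var, the behavior is explicit for each app instance instead. A sketch, using a hypothetical `ANALYTICS_ENABLED` var:

```ruby
# ANALYTICS_ENABLED is a hypothetical config var set per Heroku app:
#   heroku config:set ANALYTICS_ENABLED=true --app my-production-app
if ENV["ANALYTICS_ENABLED"] == "true"
  Analytics.track(event)
end
```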

I also don’t like custom environments because it should be painful when your environments diverge. Custom environments make it too easy for staging to drift far away from production.

Anyway, that’s enough ranting from me. 😀 What do you think about all this?

Have a great week!
—Adam
