This week, Heroku made Router 2.0 generally available, bringing features like HTTP/2, performance improvements and reliability enhancements out of the beta program!
Throughout the Router 2.0 beta, our engineering team has addressed several bugs, all fairly straight-forward with one exception involving Puma-based applications. A small subset of Puma applications would experience increased response times upon enabling the Router 2.0 flag, reflected in customers’ Heroku dashboards and router logs. After thorough router investigation and peeling back Puma’s server code, we realized what we had stumbled upon was not actually a Router 2.0 performance issue. The root cause was a bug in Puma! This blog takes a deep dive into that investigation, including some tips for avoiding the bug on the Heroku platform while a fix in Puma is being developed. If you’d like a shorter ride (aka. the TL;DR), skip to The Solution section of this blog. For the full story and all the technical nitty gritty, read on.
Reproduction
The long response times issue first surfaced through a customer support ticket for an application running a Puma + Rails web server. As the customer reported, in high load scenarios, the performance differences between Router 2.0 and the legacy router were disturbingly stark. An application scaled to 2 Standard-1X
dynos would handle 30 requests per second just fine through the legacy router. Through Router 2.0, the same traffic would produce very long tail response times (95th and 99th percentiles). Under enough load, throughput would drop and requests would fail with H12: Request Timeout
. The impact was immediate upon enabling the http-routing-2-dot-0
feature flag:
At first, our team of engineers had difficulty reproducing the above, despite running a similarly configured Puma + Rails app on the same framework and language versions. We consistently saw good response times from our app.
Then we tried varying the Rails application’s internal response time. We injected some artificial server lag of 200 milliseconds and that’s when things really took off:
This was quite the realization! In staging environments, Router 2.0 is subject to automatic load tests that run continuously, at varied request rates, body sizes, protocol versions. etc.. These request rates routinely reach much higher levels than 30 requests per second. However, the target applications of these load tests did not include a Heroku app running Puma + Rails with any significant server-side lag.
Investigation 🔍
With a reproduction in-hand, we were now in a position to investigate the high response times. We spun up our test app in a staging environment and started injecting a steady load of 30 requests per second.
Our first thought was that perhaps the legacy router is faster at forwarding requests to the dyno because its underlying TCP client manages connections in a way that plays nicer with the Puma server. We hopped on a router instance and began dumping netstat
connection states for one of our Puma app's web dynos :
Connections from legacy router → dyno
root@router.1019708 | # netstat | grep ip-10-1-38-72.ec2:11059
tcp 0 0 ip-10-1-87-57.ec2:28631 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:30717 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:15205 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:17919 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:24521 ip-10-1-38-72.ec2:11059 TIME_WAIT
Connections from Router 2.0 → dyno
root@router.1019708 | # netstat | grep ip-10-1-38-72.ec2:11059
tcp 0 0 ip-10-1-87-57.ec2:24630 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:22476 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:38438 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38444 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31034 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38448 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:41882 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23622 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31060 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31042 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23648 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31054 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23638 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38436 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31064 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:22492 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38414 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:42218 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:41880 ip-10-1-38-72.ec2:11059 TIME_WAIT
In the legacy router case, it seemed like there were fewer connections sitting in TIME_WAIT
. This TCP state is a normal stop point along the lifecycle of a connection. It means the remote host (dyno) has sent a FIN
indicating the connection should be closed. The local host (router) has sent back an ACK
, acknowledging the connection is closed.
The connection hangs out for some time in TIME_WAIT
, with the value varying among operating systems. The Linux default is 2 minutes. Once that timeout is hit, the socket is reclaimed and the router is free to re-use the address + port
combination for a new connection.
With this understanding, we formed a hypothesis that the Router 2.0 HTTP client was churning through connections really quickly. Perhaps the new router was opening connections and forwarding requests at a faster rate than the legacy router, thus overwhelming the dyno.
Router 2.0 is written in Go and relies upon the language’s standard HTTP package. Some research turned up various tips for configuring Go’s http.Transport
to avoid connection churn. The main recommendation involved tuning MaxIdleConnsPerHost
. Without explicitly setting this configuration, the default value of 2 is used.
type Transport struct {
// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int
...
}
const DefaultMaxIdleConnsPerHost = 2
The problem with a low cap on idle connections per host is that it forces Go to close connections more often. For example, if this value is set to a higher value, like 10, our HTTP transport will keep up to 10 idle connections for this dyno in the pool. Only when the 11th connection goes idle does the transport start closing connections. With the number limited to 2, the transport will close more connections which also means opening more connections to our dyno. This could put strain on the dyno as it requires Puma to spend more time handling connections and less time answering requests.
We wanted to test our hypothesis, so we set DefaultMaxIdleConnsPerHost: 100
on the Router 2.0 transport in staging. The connection distribution did change and now Router 2.0 connections were more stable than before:
root@router.1020195 | # netstat | grep 'ip-10-1-2-62.ec2.:37183'
tcp 0 0 ip-10-1-34-185.ec:36350 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:11956 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:51088 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:60876 ip-10-1-2-62.ec2.:37183 ESTABLISHED
To our dismay, this had zero positive effect on our tail response times. We were still seeing the 99th percentile at well over 2 seconds for a Rails endpoint that should only take about 200 milliseconds to respond.
We tried changing some other configurations on the Go HTTP transport, but saw no improvement. After several rounds of updating a config, waiting for the router artifact to build, and then waiting for the deployment to our staging environment, we began to wonder—can we reproduce this issue locally?
Going local
Fortunately, we already had a local integration test set-up for running requests through Router 2.0 to a dyno. We typically utilize this set-up for verifying features and fixes, rarely for assessing performance. We subbed out our locally running “dyno” for a Puma server with a built-in 200ms lag on the /fixed
endpoint. We then fired off 200 requests over 10 different connections with hey:
❯ hey -q 200 -c 10 -host 'purple-local-staging.herokuapp.com' http://localhost:80/fixed
Summary:
Total: 8.5804 secs
Slowest: 2.5706 secs
Fastest: 0.2019 secs
Average: 0.3582 secs
Requests/sec: 23.3090
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.202 [1] |
0.439 [185] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.676 [0] |
0.912 [0] |
1.149 [0] |
1.386 [0] |
1.623 [0] |
1.860 [0] |
2.097 [1] |
2.334 [6] |■
2.571 [7] |■■
Latency distribution:
10% in 0.2029 secs
25% in 0.2038 secs
50% in 0.2046 secs
75% in 0.2086 secs
90% in 0.2388 secs
95% in 2.2764 secs
99% in 2.5351 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0003 secs, 0.2019 secs, 2.5706 secs
DNS-lookup: 0.0002 secs, 0.0000 secs, 0.0034 secs
req write: 0.0003 secs, 0.0000 secs, 0.0280 secs
resp wait: 0.3570 secs, 0.2018 secs, 2.5705 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0175 secs
Status code distribution:
[200] 200 responses
As you can see, the 95th percentile of response times is over 2 seconds, just as we had seen while running this experiment on the platform. We were now starting to worry that the router itself was inflating the response times. We tried targeting Puma directly at localhost:3000
, bypassing the router altogether:
❯ hey -q 200 -c 10 http://localhost:3000/fixed
Summary:
Total: 8.3314 secs
Slowest: 2.4579 secs
Fastest: 0.2010 secs
Average: 0.3483 secs
Requests/sec: 24.0055
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.201 [1] |
0.427 [185] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.652 [0] |
0.878 [0] |
1.104 [0] |
1.329 [0] |
1.555 [0] |
1.781 [0] |
2.007 [0] |
2.232 [2] |
2.458 [12] |■■■
Latency distribution:
10% in 0.2017 secs
25% in 0.2019 secs
50% in 0.2021 secs
75% in 0.2026 secs
90% in 0.2042 secs
95% in 2.2377 secs
99% in 2.4433 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0002 secs, 0.2010 secs, 2.4579 secs
DNS-lookup: 0.0001 secs, 0.0000 secs, 0.0016 secs
req write: 0.0001 secs, 0.0000 secs, 0.0012 secs
resp wait: 0.3479 secs, 0.2010 secs, 2.4518 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0003 secs
Status code distribution:
[200] 200 responses
Wow! These results suggested the issue is reproducible with any ‘ole Go HTTP client and a Puma server. We next wanted to test out a different client. The load injection tool, hey
is also written in Go, just like Router 2.0. We next tried ab
which is written in C:
❯ ab -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.538 seconds
Complete requests: 200
Failed requests: 0
Total transferred: 35000 bytes
HTML transferred: 600 bytes
Requests per second: 23.42 [#/sec] (mean)
Time per request: 426.911 [ms] (mean)
Time per request: 42.691 [ms] (mean, across all concurrent requests)
Transfer rate: 4.00 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 204 409 34.6 415 434
Waiting: 204 409 34.7 415 434
Total: 205 410 34.5 415 435
Percentage of the requests served within a certain time (ms)
50% 415
66% 416
75% 416
80% 417
90% 417
95% 418
98% 420
99% 429
100% 435 (longest request)
Another wow! The longest request took about 400 milliseconds, much lower than the 2 seconds above. Had we just stumbled upon some fundamental incompatibility between Go’s standard HTTP client and Puma? Not so fast.
A deeper dive into the ab
documentation surfaced this option:
❯ ab -h
Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
...
-k Use HTTP KeepAlive feature
That’s different than hey
’s default of enabling keepalive by default. Could that be significant? We re-ran ab
with -k
:
❯ ab -k -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.564 seconds
Complete requests: 200
Failed requests: 0
Keep-Alive requests: 184
Total transferred: 39416 bytes
HTML transferred: 600 bytes
Requests per second: 23.35 [#/sec] (mean)
Time per request: 428.184 [ms] (mean)
Time per request: 42.818 [ms] (mean, across all concurrent requests)
Transfer rate: 4.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 6
Processing: 201 405 609.0 202 2453
Waiting: 201 405 609.0 202 2453
Total: 201 406 609.2 202 2453
Percentage of the requests served within a certain time (ms)
50% 202
66% 203
75% 203
80% 204
90% 2030
95% 2242
98% 2267
99% 2451
100% 2453 (longest request)
Now the output looked just like the hey
output. Next, we ran hey
with keepalives disabled:
❯ hey -disable-keepalive -q 200 -c 10 http://localhost:3000/fixed
Summary:
Total: 8.3588 secs
Slowest: 0.4412 secs
Fastest: 0.2091 secs
Average: 0.4115 secs
Requests/sec: 23.9269
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.209 [1] |
0.232 [3] |■
0.255 [1] |
0.279 [0] |
0.302 [0] |
0.325 [0] |
0.348 [0] |
0.372 [0] |
0.395 [0] |
0.418 [172] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.441 [23] |■■■■■
Latency distribution:
10% in 0.4140 secs
25% in 0.4152 secs
50% in 0.4160 secs
75% in 0.4171 secs
90% in 0.4181 secs
95% in 0.4187 secs
99% in 0.4344 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0011 secs, 0.2091 secs, 0.4412 secs
DNS-lookup: 0.0006 secs, 0.0003 secs, 0.0017 secs
req write: 0.0001 secs, 0.0000 secs, 0.0011 secs
resp wait: 0.4102 secs, 0.2035 secs, 0.4343 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0002 secs
Status code distribution:
[200] 200 responses
Again, no long tail response times and the median values comparable to the first run with ab
.
Even better, this neatly explained the performance difference between Router 2.0 and the legacy router. Router 2.0 adds support for HTTP keepalives by default, in line with HTTP/1.1 spec. In contrast, the legacy router closes connections to dynos after each request. Keepalives usually improve performance, reducing time spent in TCP operations for both the router and the dyno. Yet, the opposite was true for a dyno running Puma.
Diving deep into Puma
Note that we suggest reviewing this brief Puma architecture document if you’re unfamiliar with the framework and want to get the most out of this section. To skip the code review, you may fast-forward to The Solution.
This finding was enough of a smoking gun to send us deep into the the Puma server code, where we homed in on the process_client
method. Let’s take a look at that code with a few details in mind:
- Each Puma thread can only handle a single connection at at time. A client is a wrapper around a connection.
- The
handle_request
method handles exactly 1 request. It returnsfalse
when the connection should be closed andtrue
when it should be kept open. A client with keepalive enabled will end up in thetrue
condition on line470
. -
fast_check
is onlyfalse
once we’ve processed@max_fast_inline
requests serially off the connection and when there are more connections waiting to be handled. - For some reason, even when the number of connections exceeds the max number of threads,
@thread_pool.backlog > 0
is often times false. - Altogether, this means the below loop usually executes indefinitely until we’re able to bail out when
handle_request
returnsfalse
.
Code snippet from puma/lib/puma/server.rb
in Puma 6.4.2.
When does handle_request
actually return false
? That is also based on a bunch of conditional logic, the core of it is in the prepare_response
method. Basically, if force_keep_alive
is false
, handle_request
will return false
. (This is not exactly true. It’s more complicated, but that’s not important for this discussion.)
Code snippet from puma/lib/puma/request.rb
in Puma 6.4.2.
The last thing to put the puzzle together: max_fast_inline
defaults to 10
. That means Puma will process at least 10 requests serially off a single connection before handing the connection back to the reactor class. Requests that may have come in a full second ago are just sitting in the queue, waiting for their turn. This directly explains our 10*200ms = 2 seconds
of added response time for our longest requests!
We figured setting max_fast_inline=1
might fix this issue, and it does sometimes. However, under sufficient load, even with this setting, response times will climb. The problem is the other two OR’ed conditions circled in blue and red above. Sometimes the number of busy threads is less than the max and sometimes, there are no new connections to accept on the socket. However, these decisions are made at a point in time and the state of the server is constantly changing. They are subject to race conditions since other threads are concurrently accessing these variables and taking actions that modify their values.
The Solution
After reviewing the Puma server code, we came to the conclusion that the simplest and safest way to bail out of processing requests serially would be to flat-out disable keepalives. Explicitly disabling keepalives in the Puma server means handing the client back to the reactor after each request. This is how we ensure requests are served in order.
Once confirming these results with the Heroku Ruby language owners, we opened a Github issue on the Puma project and a pull request to add an enable_keep_alives
option to the Puma DSL. When set to false
, keepalives are completely disabled. The option will be released soon, likely in Puma 6.5.0.
We then re-ran our load tests with enable_keep_alives
disabled in Puma and Router 2.0 enabled on the app:
// config/puma.rb
...
enable_keep_alives false
The response times and throughput improved, as expected. Additionally, once disabling Router 2.0, the response times stayed the same:
Moving forward
Keeping keepalives
Keeping connections alive reduces time spent in TCP operations. Under sufficient load and scale, avoiding this overhead cost can positively impact apps’ response times. Additionally, keepalives are the de facto standard in HTTP/1.1 and HTTP/2. Because of this, Heroku has chosen to move forward with keepalives as the default behavior for Router 2.0.
Through raising this issue on the Puma project, there has already been movement to fix the bad keepalive behavior in the Puma server. Heroku engineers remain active participants in discussions arounds these efforts and are committed to solving this problem. Once a full fix is available, customers will be able to upgrade their Puma versions and use keepalives safely, without risk of long response times.
Disabling keepalives as a stopgap
In the meantime, we have provided another option for disabling keepalives when using Router 2.0. The following labs
flag may be used in conjunction with Router 2.0 to disable keepalives between the router and your web dynos:
heroku labs:enable http-disable-keepalive-to-dyno -a my-app
Note that this flag has no effect when using the legacy router as keepalives between the legacy router and dyno are not supported. For more information, see Heroku Labs: Disabling Keepalives to Dyno for Router 2.0.
Other options for Puma
You may find that your Puma app does not need keepalives disabled in order to perform well while using Router 2.0. We recommend testing and tuning other configuration options, so that your app can still benefit from persistent connections between the new router and your dyno:
- Increase the number of threads. More threads means Puma is better able to handle concurrent connections.
- Increase the number of workers. This is similar to increasing the number of threads.
-
Decrease the
max_fast_inline
number. This will limit the number of requests served serially off a connection before handling queued requests.
Other languages & frameworks
Our team also wanted to see if this same issue would present in other languages or frameworks. We ran load tests, injecting 200 milliseconds of server-side lag over the top languages and frameworks on the Heroku platform. Here are those results.
Language/Framework | Router | Web dynos | Server-side lag | Throughput | P50 Response Time | P95 Response Time | P99 Response Time |
---|---|---|---|---|---|---|---|
Puma | Legacy | 2 Standard-1X | 200 ms | 30 rps | 215 ms | 287 ms | 335 ms |
Puma with keepalives | Router 2.0 | 2 Standard-1X | 200 ms | 23 rps | 447 ms | 3,455 ms | 5,375 ms |
Puma without keepalives | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 215 ms | 271 ms | 335 ms |
NodeJS | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
NodeJS | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Python | Legacy | 4 Standard-1X | 200 ms | 30 rps | 223 ms | 607 ms | 799 ms |
Python | Router 2.0 | 4 Standard-1X | 200 ms | 30 rps | 223 ms | 607 ms | 735 ms |
PHP | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 367 ms | 431 ms |
PHP | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 367 ms | 431 ms |
Java | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Java | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Go | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Go | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
These results indicate the issue is unique to Puma, with Router 2.0 performance comparable to the legacy router in other cases.
Conclusion
We were initially surprised by this keepalive behavior in the Puma server. Funny enough, we believe Heroku’s significance in the Puma/Rails world and the fact that the legacy router does not support keepalives may have been factors in this bug persisting for so long. Reports of it had popped up in the past (see Issue 3443, Issue 2625 and Issue 2331), but none of these prompted a fool-proof fix. Setting enable_keep_alives false
does completely eliminate the problem, but this is not the default option. Now, Puma maintainers are taking a closer look at the problem and benchmarking potential fixes in a fork of the project. The intention is to fix the balancing of requests without closing TCP connections to the Puma server.
Our Heroku team is thrilled that we were able to contribute in this way and help move the Puma/Rails community forward. We’re also excited to release Router 2.0 as GA, unlocking new features like HTTP/2 and keepalives to your dynos. We encourage our users to try out this new router! For advice on how to go about that, see Tips & Tricks for Migrating to Router 2.0.