As part of our recent "belt and suspenders" reliability improvements, we updated our clients to route all connections through Fastly, including the Server-Sent Events (SSE) connections we use to update configurations in our clients almost instantly. Unfortunately, our experience with SSE over the streaming-miss feature wasn't great, and we had to quickly go back to clients that direct traffic straight to our infrastructure. We wanted to share some notes about this process.
The Problem We Hoped To Solve
We sell Feature Flags, and a big part of that is that feature flag changes are a dish best served instantly. SSE achieves this by keeping a connection open and pushing updates as they happen. This is great, but it requires a long-lived connection, and managing all of those open connections isn't trivial.
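For the unfamiliar, an SSE response is just a long-lived HTTP response with a text/event-stream content type: the server writes data: lines as events happen, and comment lines (starting with ":") work as keep-alives. Here's a rough sketch of a minimal origin endpoint, using Flask purely for illustration; the route, payload, and timing are made up and this isn't our actual server code:

import time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/events")
def events():
    def stream():
        while True:
            # a real server would push only when a flag actually changes
            yield 'data: {"flag": "new-checkout", "enabled": true}\n\n'
            # comment lines are ignored by clients but keep the connection
            # from being timed out by intermediaries
            yield ": keep-alive\n\n"
            time.sleep(15)
    return Response(stream(), mimetype="text/event-stream")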
I was particularly excited to use the request collapsing and streaming miss features as described on Fastly's blog to move thousands of long-lived connections from our API services to Fastly. These features don't incur additional costs beyond standard request and transfer pricing, so it seemed too good to pass up. Unfortunately, we learned the hard way that what works in demos and testing doesn't always translate well to the real world. In particular, I failed to heed these lines from the documentation:
For requests to be collapsed together, the origin response must be cacheable and still 'fresh' at the time of the new request. However, if the server has ended the response, and the resource is still considered fresh, it will be in the Fastly cache and new requests will simply receive a full copy of the cached data immediately instead of receiving a stream.
This sounds reasonable, but I failed to consider what it actually means in practice. I expected the combination of request collapsing and streaming miss to always produce long-lived connections for our SSE use case. However, when our server ended the connection before the scheduled cache expiry, Fastly returned very short-lived successful (200 OK) responses instead of long-lived connections.
Surprise Request Volume
As our largest customer updated their Prefab clients, our API servers experienced a significant reduction in connections. Yay! However, we soon noticed an unexpected surge in streaming logs from Fastly. The volume was so high that we had to scale up our log processing capabilities and investigate potential bottlenecks. Upon closer inspection, I discovered that many of these logs showed requests to our SSE endpoint lasting only about 25 milliseconds.
It's worth noting that our customers' data updates remained unaffected during this period. The issue manifested as a much higher number of outbound connections than anticipated.
Initially, we suspected we might be hitting connection count limits, but our tests to verify this were inconclusive. It took a couple of days before I realized the correlation: these spikes occurred whenever we restarted our API to deploy changes.
So What Was Happening?
We use 5-minute-long SSE connections with periodic keep-alives. Under normal circumstances, the Prefab client would connect to Fastly and remain connected for up to five minutes. However, when our backend server restarted (unexpectedly for Fastly), our client would connect for only a few milliseconds, receive the full cached-to-date response with a 200 status code, and then repeatedly reconnect for the remainder of the cache TTL.
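Some back-of-the-envelope math shows why the log volume jumped. The numbers below are illustrative rather than measured, but the shape is right: with no backoff, each client hammers the cached object until its TTL expires.

remaining_ttl_seconds = 4 * 60   # hypothetical: deploy happened one minute into a 5-minute TTL
reconnect_cost_seconds = 0.1     # the ~25 ms response we saw in logs, plus connection overhead
clients = 1_000                  # hypothetical number of connected SDK instances

reconnects_per_client = remaining_ttl_seconds / reconnect_cost_seconds
print(f"~{reconnects_per_client:.0f} reconnects per client")             # ~2400
print(f"~{reconnects_per_client * clients:,.0f} requests in 4 minutes")  # ~2,400,000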
To probe Fastly's behavior, we used a simple SSE client implementation, shown below:
import logging
import time

import requests

URL = "https://example.com/sse"  # placeholder; point this at the SSE endpoint under test
LOGGER = logging.getLogger(__name__)

def connect():
    headers = {
        "Accept": "text/event-stream",
        "testHeader": "python-test",
        "Accept-Encoding": "identity",
    }
    while True:
        start_time = time.time()
        lines = 0
        try:
            # stream=True keeps the response open so we can read events as they arrive
            with requests.get(URL, headers=headers, stream=True, timeout=(10, 30)) as response:
                response.raise_for_status()
                if response.ok:
                    for line in response.iter_lines(decode_unicode=True):
                        if line:
                            lines += 1
            end_time = time.time()
            LOGGER.warning(f"read {lines} lines in {end_time - start_time} seconds")
        except Exception as e:
            LOGGER.error(f"Unexpected error: {str(e)}")
            # a real client would do backoff in case of error
Experience quickly suggests that this code should implement a backoff strategy for errors (which our real clients do). However, until this incident, I hadn't considered that this code—designed to process connections for minutes at a time—should also back off when faced with "successful" but very short connections.
We have clients in many languages and, consequently, many different SSE implementations—some open-source, some written in-house. Surprisingly, only one of these appears to have a setting to handle the short connection situation we encountered.
Recommendations
While the absence of additional costs for using streaming-miss and request collapsing is tempting, it's not an ideal solution for SSE. If you decide to use Fastly in this manner, it's crucial to understand how your clients behave when encountering rapidly closed yet successful responses. Implementing a minimum reconnection time can help mitigate request spikes, though they'll still occur—just not as severely as if you hadn't considered this behavior at all.
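As a sketch of what that might look like in the test client above (the threshold and backoff values here are made up, not tuned recommendations): treat any connection that ends too quickly as if it had failed, and back off before reconnecting.

import random

MIN_CONNECTION_SECONDS = 5    # anything shorter is suspicious, even with a 200
BASE_BACKOFF_SECONDS = 1
MAX_BACKOFF_SECONDS = 60

def next_delay(connection_duration, previous_delay):
    """How long to sleep before the next connection attempt."""
    if connection_duration >= MIN_CONNECTION_SECONDS:
        return 0  # the connection lived a reasonable time; reconnect right away
    # A short "successful" connection gets the same treatment as an error:
    # exponential backoff with a little jitter.
    delay = min(MAX_BACKOFF_SECONDS, max(BASE_BACKOFF_SECONDS, previous_delay * 2))
    return delay + random.uniform(0, 1)

In the loop above, you'd call this after every attempt, successful or not, passing end_time - start_time as the duration, and sleep on the result.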
The Path Forward
For now we're back to hosting the long-lived SSE connections directly on our API nodes. In the weeks ahead we'll look at using push-based, SSE-specific tooling like Fastly's Fanout, or at rolling our own Pushpin instances in-house.
We worked with Fastly support to understand what was going on here and have been offered a credit for our misunderstanding; they’ll also be updating the older SSE blog post with new information.
Huge thanks to the team at Fastly for their help and understanding here. Software is hard and mistakes happen. Turning an expensive oops into a learning opportunity is a win for everyone and makes us more confident that Fastly will continue to be there for us as we grow.