30 posts tagged with "Engineering"


· 9 min read
Jeff Dwyer

Let's take a look at the Prefab pricing page and go step by step through how we modeled it using Stripe's new Usage Based Billing APIs.

Feature flag pricing

Prefab sells things like Feature Flags and Dynamic Log Levels with awesome developer tooling. We charge money on two dimensions, both of which try to reflect the effort it takes us to support our customers. Those dimensions are:

Servers We need to keep a live connection open for each of the server-side SDKs that you connect. We charge $2 / server for this, but that goes down to $1 or $.80 at higher volume.

Requests Front end clients make requests to get feature flag evaluations. We charge for these requests at the rate of $10 per 1.5M, with better pricing on the higher tiers.
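To make that concrete with some illustrative math (not a real customer's bill): at the list rates above, a team running 50 server-side SDK processes and serving 3M front-end requests in a month would owe roughly 50 × $2 + 2 × $10 = $120, before volume discounts.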

Modeling Associated Products

To set up billing in Stripe, you need to decide what your products and prices are going to be. This sounds like it should be easy, but there are actually a lot of different ways to go about modeling things. Here's what we did.

In this diagram the green boxes represent the "product" and the blue represent prices.

Visualization of Products and prices

Having a single "product" for "Servers" is clearly the right move, but what about the base tier pricing? You could probably model the 3 tiers (Basic/Pro/Enterprise) as a single "product" called "platform" with 3 different prices. Why not do that? Well, the Stripe pricing table lets you use an HTML widget <stripe-pricing-table>. It was our intention to use this widget, because the less code the better, right? When you go to set up that widget, however, "add your product" is very much the primary choice. Since for us the choice here is Basic/Pro/Enterprise, this led us to have a Product for each.

note

The <stripe-pricing-table> lets you add more than one price for the same product, but this seems to be for allowing monthly / annual billing.


Moving Beyond Pure Usage Based

Originally, we were purely usage based, ie "just $1 / server" but we discovered customers wanted easier ways to estimate their bill. Ask a customer to estimate their bill and there's friction, but tell them it's "$99 for up to 100 server connections" and they can just say "oh great, we have fewer than that so the bill won't be more than $99/month". It's a touch irrational when you've got your engineer hat on, but it turns out that being kind to the actual people doing the software purchasing is a good idea for us.

In order to have tiers in combination with usage based overage we end up with pricing of the form "$99 for the first 100 and then $99 per 100 after that". The term of art for this is Graduated / Tier based pricing even though our tiers are the same price. We'll get into more details shortly.
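For reference, here's roughly what a graduated, metered price of that shape looks like when created via Stripe's API. This is a sketch rather than our exact configuration: the product ID is a placeholder and the amounts (in cents) follow the "$99 for the first 100, then $99 per 100" example above.

Stripe::Price.create(
  product: "prod_servers_placeholder",   # hypothetical product ID
  currency: "usd",
  recurring: { interval: "month", usage_type: "metered" },
  billing_scheme: "tiered",
  tiers_mode: "graduated",
  tiers: [
    { up_to: 100, flat_amount: 9900 },   # first 100 for a flat $99
    { up_to: "inf", unit_amount: 99 }    # then $0.99 per unit, i.e. $99 per 100
  ]
)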

Connecting Subscriptions to Prices

When we go to create a subscription, we'll see that a subscription is basically just a payment method associated with a set of prices. Each connection between the subscription and a price is called a SubscriptionItem.

Stripe::SubscriptionItem.create({
  subscription: 'sub_1Mr6rbL',
  price: 'price_1Mr6rdLkdI'
})

Here's an ER diagram of the billing modeling, representing a customer subscribed to the Pro plan. You can see that the basic and enterprise prices are not connected to the user's subscription.

tip

I've named the prices Pro.Servers.0. I would highly recommend that you do something similar and add a number to indicate the price version. These prices are pretty immutable once you get going and it's easy to make mistakes. A little bit of version control in your naming will prevent server-price-basic-v2-copy-use-this-one type fiascos.

The important take-away here is that in our world these are really "tightly correlated prices". If you have the "pro" price on your subscription, then you need to have the "pro.server" price for your Server product.

The main code that we're going to end up writing is the logic to ensure that these prices remain correlated as subscriptions change.

Usage Based Billing

Here's the data model of Stripe's usage based billing, right from their documentation.

How Meters work at stripe

This is much better than the previous model, which we discussed earlier.

Our usage based tracking is really going to be very simple. For the prices basic.requests.0, pro.requests.0 etc, we just set them up to reference the same meter: billable_requests. This makes our code that records usage totally oblivious to what subscription the customer has, which is just what we want.
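To make that concrete, here's a sketch of the setup using Stripe's API. The IDs and names here are placeholders rather than our real ones, and the tier numbers just mirror the "$10 per 1.5M requests" shape described above.

# One Meter, shared by every tier's request price.
meter = Stripe::Billing::Meter.create(
  display_name: "Billable Requests",
  event_name: "billable_requests",          # MeterEvents will reference this name
  default_aggregation: { formula: "sum" }   # sum the `value` field of each event
)

# basic.requests.0, pro.requests.0, etc. all point at the same meter;
# only the tier structure and amounts differ between them.
Stripe::Price.create(
  product: "prod_requests_placeholder",     # hypothetical product ID
  nickname: "pro.requests.0",
  currency: "usd",
  recurring: { interval: "month", usage_type: "metered", meter: meter.id },
  billing_scheme: "tiered",
  tiers_mode: "graduated",
  tiers: [
    { up_to: 1_500_000, unit_amount: 0 },               # first 1.5M requests included
    { up_to: "inf", unit_amount_decimal: "0.000666" }   # ~$10 per 1.5M after that (cents per request)
  ]
)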

This is also really useful for trials / free tiers. We create a customer for each Prefab team when they sign up and we can instantly start tracking usage for them against a Meter. Eventually our customer will add a subscription and at that point the usage can start making it onto an invoice. But it's nice to be able to fully separate the concern of measuring and tracking usage from the more intricate dance of creating and adjusting subscriptions & prices.

def create_billable_requests_meter_events(amount, api_usage, event_type)
  attr = {
    event_name: "billable_requests",
    timestamp: api_usage.day.to_time(:utc).to_i,
    payload: {
      stripe_customer_id: @team.stripe_customer_id,
      value: amount,
    }
  }
  Stripe::Billing::MeterEvent.create(attr)
end

Rendering the Pricing Table

Ok, so we've seen how we modelled our Products and Prices. We've started recording metered usage for our customers. How do we connect the two? How do we actually create a subscription and add the correct 3 prices as subscription items?

The promise of Stripe handling the entire pricing table was appealing to me. With our product information already in the system, I was able to quickly create a PricingTable object from my Stripe Product Catalog and then I just dropped the following into my billing page.

<script async src="https://js.stripe.com/v3/pricing-table.js"></script>
<stripe-pricing-table pricing-table-id="prctbl_9999999"
publishable-key="pk_live_uCl3999999999">
</stripe-pricing-table>

This rendered me a nice-looking pricing table and clicking on the button took me to a nice checkout experience, so at first I thought I was all done.

stripe pricing table UI widget

Problems:

  1. A small thing, but I had to make a separate dark mode and light mode table, which was... unfortunate.
two UI widgets for darkmode
  2. It only worked for creating a subscription. I couldn't use the same tool to let people upgrade / downgrade a subscription. This was a bigger bummer. (You can only 'cancel' the subscription from the Customer Portal.)
the stripe customer portal

Because of #2, if I was going to let people switch from Basic to Pro without canceling Basic, I essentially felt the need to rewrite a pricing table. If I was going to do that anyway, then I wanted it to be consistent, so I did not end up using this pricing widget.

The Code We Wrote

The code we wrote came in 4 parts. Here are the 4 things that our code needs to accomplish:

  1. Creating a subscription. eg basic and basic.requests.0 and basic.servers.0.
  2. Change a subscription from eg basic to pro which should remove basic and add pro
  3. Catch webhook for subscription change and "reconcile". Remove basic.requests.0 and basic.servers.0 and add pro.requests.0 and pro.servers.0.
  4. Mapping in code of the underlying stripe price IDs, so that when we reconcile we can know what price ID to add in production for pro.

1) Basic Create of the Core Subscription

def self.create_checkout_session(team, new_plan_name, return_url)
  new_object = StripeProducts.find_by(name: new_plan_name)

  session = Stripe::Checkout::Session.create(
    customer: team.stripe_customer_id,
    line_items: [
      {
        price: new_object.product_price_id,
        quantity: 1
      },
      {
        price: new_object.servers_price_id
      },
      {
        price: new_object.requests_price_id
      }
    ],
    mode: 'subscription',
    success_url: return_url,
    cancel_url: return_url,
    subscription_data: { billing_cycle_anchor: StripeProducts.get_billing_cycle_anchor }
  )
  session
end

We don't actually have to add the servers and requests prices here, since the webhook will reconcile them for us. However, the checkout page is better if it has all 3 prices at the time the customer adds a credit card.

2) Change a Subscription

def self.change_subscription(team, new_plan_name)
  subscription = team.get_subscription

  old_product_object = get_product_for(subscription)
  new_object = StripeProducts.find_by(name: new_plan_name)

  if old_product_object
    old_si = subscription.items.filter { |si| si.price.id == old_product_object.product_price_id }.first
    Stripe::SubscriptionItem.delete(old_si.id)
  end

  new_si = Stripe::SubscriptionItem.create(
    subscription: subscription.id,
    price: new_object.product_price_id
  )
end

3) Catch the Webhook and Reconcile

def process_webhooks
  case data.dig("type")
  when 'customer.subscription.created', 'customer.subscription.updated'
    process_new_subscription
  end
end

def process_new_subscription
  subscription = data.dig("data", "object")
  team = Team.find_by_stripe_customer_id(subscription.dig("customer"))
  ensure_correct_subscription_items(team)
end

def ensure_correct_subscription_items(team)
  subscription = team.get_subscription

  prices_to_add(subscription).each do |price_id|
    Stripe::SubscriptionItem.create({
      subscription: subscription.id,
      price: price_id
    })
  end

  subscription_items_to_remove(subscription).each do |item|
    Stripe::SubscriptionItem.delete(item.id, clear_usage: true)
  end
end

def prices_to_add(subscription)
  ## if subscription is Pro, return [pro.requests.0, pro.servers.0]
end

def subscription_items_to_remove(subscription)
  ## if subscription is Pro, return any items whose price isn't "pro.*"
end

4) A Mapping of Associated Products

Backing all of this, we did have to have a map of these "associated prices". When the "reconciler" runs it can use this mapping to find the request and server price ids for the given product price id.

class StripeProducts < ActiveHash::Base
  LOG = SemanticLogger[self]
  self.data = [
    { id: 1,
      name: 'Basic',
      default: OpenStruct.new(
        product_price_id: 'price_1Or44444444',
        requests_price_id: 'price_1Or5555555',
        servers_price_id: 'price_1Or06666666'
      ),
      production: OpenStruct.new(
        product_price_id: 'price_1O77777777',
        requests_price_id: 'price_1O8888888',
        servers_price_id: 'price_1Oz1999999'
      )
    },
    { id: 2,
      name: 'Pro',
      default: OpenStruct.new(
        product_price_id: 'price_1Or4111111',
        ...

This big file of the various Pricing IDs is not my favorite, but it works. I considered Stripe Terraform but it didn't have support for the new Meter object yet. I considered generating my prices in code and then saving the IDs out to a file. That might be a more elegant solution, but the quantity of these was below the threshold where automation is an obvious win, in my opinion.
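For what it's worth, the generate-and-save approach would look something like this sketch. It's hypothetical code, not something we run; the plan definitions and file path are made up, and the real version would create the tiered request and server prices too.

require "stripe"
require "json"

# Hypothetical plan definitions
PLANS = {
  "Basic" => { product: "prod_basic_placeholder", amount: 0 },
  "Pro"   => { product: "prod_pro_placeholder",   amount: 9900 }
}

mapping = PLANS.transform_values do |plan|
  price = Stripe::Price.create(
    product: plan[:product],
    currency: "usd",
    unit_amount: plan[:amount],
    recurring: { interval: "month" }
  )
  { product_price_id: price.id }
end

# Persist the generated IDs so the app can read them instead of a hand-maintained map.
File.write("config/stripe_prices.json", JSON.pretty_generate(mapping))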

Conclusion

The great thing about Stripe's new support for Usage Based Billing and Meters is that the "usage" part has gotten very simple and is hardly something you need to think about anymore. Just decide your meters and record usage for the customer.

It still takes a decent amount of thinking to correctly model your products and prices, but now you can focus on that as its own problem. I hope taking a look at what we did at Prefab was helpful. Get in touch if you want to run your modeling past me while this is all still loaded in my brain.

· 7 min read
Jeff Dwyer

Stripe just launched a major update to their support for Usage Based billing. It's a big improvement and has made Prefab's billing code a lot simpler. Let's look at the change and how that impacts the code you need to write.

Usage Based Billing with Stripe

  • We have APIs that get used.
  • We have products and prices setup in Stripe.
  • Now we just have to send usage to Stripe and tally the billing.

How hard could that be? Well, it used to be harder than you'd think, but Stripe's new usage based billing helps a ton.

The new Stripe usage code helps us in 2 main ways:

  1. Simplifying our code so less of our code needs to understand the customer's Stripe subscription
  2. More flexibility & reliability around getting the right usage on the right invoice.

Previously...

In the legacy Usage Based Billing, the data model looked like this.

To track customer usage you had to specify the SubscriptionItem ID like this:

Stripe::SubscriptionItem.create_usage_record(
  '{{SUBSCRIPTION_ITEM_ID}}',
  {
    quantity: 100,
    timestamp: 1714398365,
    action: 'increment',
  }
)

This was a problem for a few reasons. How do we get the correct subscription item to our usage code? The source of truth for a customer's subscription has to be Stripe, but if our code needs to understand the details, that means catching a lot of webhooks and trying to maintain a local picture of the subscription. Subscriptions can be complicated beasts too: free trials, upgrades, downgrades & cancellations. SubscriptionItems are tightly tied to a specific Product & Price. There's a lot of opportunity for race conditions or missed edge cases.

The second issue was that this usage record needed to be created during the current billing period. That means that if you want something on the bill for March, you need to write the usage record during March. Again, the edge cases abound. What about usage in the last minute of the last hour of March? What about failure scenarios in your billing or aggregation code?

The Update

With the new system, Stripe has added an entirely new concept called a Meter.

This seems like a small change, but it makes a world of difference.

How Does This Help?

Usage Code needed to understand Subscriptions

Prefab charges based on the number of requests to the Feature Flag evaluation endpoint. But I couldn't just tell Stripe that customer A did 1M API requests. I had to tell Stripe to put 1M API requests onto the specific pro tier pricing subscription item for a given customer.

So now it's all better. To put the usage onto the meter, all we need to know is the stripe_customer_id.

def create_billable_requests_meter_events(amount, api_usage, event_type)
  attr = {
    event_name: "billable_requests",
    timestamp: api_usage.day.to_time(:utc).to_i,
    payload: {
      stripe_customer_id: @team.stripe_customer_id,
      value: amount,
    }
  }
  Stripe::Billing::MeterEvent.create(attr)
end

Now we create as many Prices as we need for the given Meter. At Prefab that means a Basic, Pro and Enterprise price. I went through a full breakdown of how we've modeled our pricing and usage in modeling usage based billing at Prefab. Our usage aggregation code is blissfully unaware of trial end dates, upgrades or cancellations.

note

The price itself can hold the more complex logic. On our Basic tier, there's no up-front cost for 1.5M requests (~= 5k feature flag MAU). After that it's $10 per 1.5M requests. So our price is a usage-based, per-tier, graduated price. The first tier is 1->1.5M for $0 and after that it's $0.00000666 / request.

Very Difficult to align Billing periods

At Prefab, we see a lot of Feature Flag evaluations. Far far too many to be writing the usage to Stripe for each evaluation. We aggregate the data at two different levels: first inside our backend code, with an in-memory aggregation that outputs to BigQuery, and then a second level of aggregation that rolls up to hourly data. We store this in Postgres where it's more convenient to query. Finally we have another job that pulls from this table and writes the data to Stripe.

Ideally, usage that happens in March should land on the March bill. But it turns out that's easier said than done. For usage that happens in the last hour of the last day of March, just what exactly are the mechanics for getting that usage onto the March bill?

In the legacy system, all billing usage needed to be updated during the current billing period. See the note in the legacy documentation:

Reporting usage outside of the current billing period results in an error.

During the first few minutes of each billing period, you can report usage that occurred within the last few minutes of the previous period. If the invoice for the previous period isn’t finalized, we add that usage to it. Otherwise, we bill that usage in the current period. After the grace period, you can’t report usage from the previous billing period.

Don’t rely on the grace period for reporting usage outside of a billing period. It’s intended only to account for possible clock drift, and we don’t guarantee it.

There's a bunch of built-in delay in this pipeline while we wait for these aggregation steps so it was very possible for things to run past the end of the "clock drift". We could have adopted a more streaming / dataflow version of this, but that wouldn't really solve the problem. Pipelines can freeze / pause too. We want a billing system that can reliably put the right usage on the right bill even if there's more than 5 minutes of latency in the pipes.

The new system supports:

  1. Recording usage before a Customer has created a subscription.
  2. A full 1-hour grace period for usage recorded after invoice creation.
  3. Cancelling events that have been sent to Stripe within the last 24 hours (see the sketch after this list).
  4. Backdated subscription creation to capture usage from before a subscription was created.
  5. Mid-cycle updates to prices.
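As an example of item 3 above, cancelling a previously reported event looks roughly like this. It's a sketch that assumes the original MeterEvent was created with an identifier we kept around; the identifier and customer ID are placeholders.

# Report usage with an identifier we can refer back to later.
Stripe::Billing::MeterEvent.create(
  event_name: "billable_requests",
  identifier: "usage-2024-04-01-team-123",   # placeholder idempotency key
  payload: { stripe_customer_id: "cus_placeholder", value: 1_500_000 }
)

# Within 24 hours, cancel that event by its identifier.
Stripe::Billing::MeterEventAdjustment.create(
  event_name: "billable_requests",
  type: "cancel",
  cancel: { identifier: "usage-2024-04-01-team-123" }
)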

Work Still In Progress

Our billing code is much improved, but we still have some work in progress.

Bucket Pricing

If you charge $100 per 15M requests with the first 15M requests free and the customer makes 16M requests, what should the bill be?

  • If you answered $6.66 that suggests you want it to be graduated.
  • If you answered $100 that suggests that you're thinking like enterprise software ;)

As engineers, the rate makes sense to us and usage based billing gives me some comfort with non-round numbers. It turns out that software buyers are not necessarily rational however and a lot of people would prefer a bill that is $99, $99, $99 vs a bill that bounces all over the place like: $71, $92, $53.

We haven't exactly decided which way we'll go on this. But for now the limitation is actually a technical one so we are charging in the graduated manner. We would like the flexibility to be able to charge for a "bucket" of requests. What this requires is that we could set the pricing to have a transform_quantity option. In our case we would transform_quantity / 15_000_000 to get the number of buckets and then charge $99 per bucket. At the time of this writing this was not an option with tiered usage based billing at Stripe, but I'm assured that it's just around the corner.
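For reference, the bucket-style price we'd like would look roughly like this. It's a sketch: the product ID is a placeholder, and as noted above this couldn't yet be combined with our tiered, metered setup at the time of writing.

Stripe::Price.create(
  product: "prod_requests_placeholder",   # hypothetical product ID
  currency: "usd",
  recurring: { interval: "month", usage_type: "metered" },
  unit_amount: 9900,                                           # $99 per bucket
  transform_quantity: { divide_by: 15_000_000, round: "up" }   # 15M requests per bucket
)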

Conclusion

Stripe's new Usage Based billing support has been a big upgrade for us. Our code has better separation of concerns. We have a much improved story around consistency and reliability even in the face of delays in our aggregation pipelines. We're really excited to see where Stripe takes this next, particularly when it comes to support for detailed breakdowns of our invoices.

· 5 min read
Jeff Dwyer
info

tl;dr As of 1.6.0 prefab-cloud-ruby's dynamic logging now works by being an injectable filter for semantic_logger rather than a full logging library, ie SemanticLogger.add_appender(filter: Prefab.log_filter). This is awesome.

After we released our dynamic logging library for Ruby, I stepped back and tried to do an overview of the pros and cons of the major logging libraries in the Ruby ecosystem in Before You Lograge.

One redditor commented:

Kudos to the author for not putting their company's product at #1 when it is clearly a very solid #2. I had forgotten how good semantic logger is.

It's nice to feel people appreciated me being fair. But also... kinda stinks to be #2.

The fact of the matter was that they were right. Semantic Logger is awesome. It's been around for 12 years and it's had 62 contributors. It's a solid piece of code that can deal with whatever weirdness Rails throws at it and has had time to build custom appenders for everything from Sentry to syslog, DataDog to Splunk.

At Prefab we really don't want to be in the business of building logging libraries. We want you to be able to turn on debug logging for a single customer in an instant, with our dynamic logging, but we don't really want to have an opinion on your aggregator, your formatting or how logging integrates with Rails.

In other languages, like Java or Python, the standard logging libraries have a clear place to inject a custom filter, and this was perfect for us. In Ruby, however, the standard logging libraries didn't have any concept of a pluggable filter, so in our initial version we ended up rolling our own.

This worked quite well, but we were always going to be behind SemanticLogger when it came to a question of:

Supporting lots of different logging sinks/aggregators

# ElasticSearch example
SemanticLogger.add_appender(
appender: :elasticsearch,
url: "http://localhost:9200",
index: "my-index",
data_stream: true
)
# LogStash example
log_stash = LogStashLogger.new(type: :tcp, host: "localhost", port: 5229)
SemanticLogger.add_appender(logger: log_stash)

Structured Logging

logger.debug("Calling stripe", item: "si_124", customer: "cust_456")
#2024-04-01 D MyController -- Calling stripe -- { :item => "si_124", :customer => "cust_456" }

Duration Metrics

logger.measure_info("Called sleep", metric: "supplier/add_user") do
sleep 3
end
#2024-04-01 I (3.005s) MyController -- Called sleep

Tagged Logging

SemanticLogger.tagged(user: "Jack", zip_code: 12345) do
# All log entries in this block will include the above named tags
logger.debug("Hello World")
end
# 2024-04-01 D {user: Jack, zip_code: 12345} MyController -- Hello World

Standardizing the very odd output of Rails logging internals

See rails_semantic_logger for details, but we get really nice structured breakdowns of Rails' internal logging.

2024-04-01 16:09:38.404316 W [99437:puma srv tp 001] {path: application_controller.prefab_controller} (3.005s) PrefabController -- Called supplier
2024-04-01 16:09:38.404822 W [99437:puma srv tp 001] {path: application_controller.prefab_controller} PrefabController -- Calling Supplier -- { :request => "update", :user => "Jack" }
2024-04-01 16:09:38.405205 W [99437:puma srv tp 001] {user: Jack, zip_code: 12345, path: application_controller.prefab_controller} PrefabController -- Hello World
2024-04-01 16:09:38.409065 D [99437:puma srv tp 001] {path: action_view} ActionView -- Rendering -- { :template => "prefab/index.html.haml" }
2024-04-01 16:09:38.416935 D [99437:puma srv tp 001] {path: action_view} (7.391ms) ActionView -- Rendered -- { :template => "prefab/index.html.haml", :within => "layouts/application", :allocations => 13541 }
2024-04-01 16:09:38.515765 D [99437:puma srv tp 001] {path: action_view} (1.488ms) ActionView -- Rendered -- { :partial => "layouts/_amplitude.html.haml", :allocations => 2874 }
2024-04-01 16:09:38.519110 D [99437:puma srv tp 001] {path: active_record} (1.332ms) ActiveRecord -- Brand Load -- { :sql => "SELECT \"brands\".* FROM \"brands\" ORDER BY slug", :allocations => 8, :cached => nil }
2024-04-01 16:09:38.530351 D [99437:puma srv tp 001] {path: action_view} (1.356ms) ActionView -- Rendered -- { :partial => "layouts/_footer.html.haml", :allocations => 3857 }
2024-04-01 16:09:38.530932 I [99437:puma srv tp 001] {path: application_controller.prefab_controller} (3.153s) PrefabController -- Completed #index -- { :controller => "PrefabController", :action => "index", :format => "HTML", :method => "GET", :path => "/prefab", :status => 200, :view_runtime => 118.27, :db_runtime => 5.8, :allocations => 171461, :status_message => "OK" }
2024-04-01 16:09:38.596435 D [99438:puma srv tp 002] {path: active_record} (0.668ms) ActiveRecord -- { :sql => "SELECT \"schema_migrations\".\"version\" FROM \"schema_migrations\" ORDER BY \"schema_migrations\".\"version\" ASC", :allocations => 7, :cached => nil }

If You Can't Beat Them, Join Them!

All of this is to say that as of 1.6.0 we're very happy that prefab-cloud-ruby works by leveraging semantic_logger and implementing dynamic log levels where they were meant to be, as a filter.

To get dynamic log levels for all of your code, you just need to:

# config/initializers/logging.rb
SemanticLogger.sync! # Use synchronous processing for targeted logging with current context
SemanticLogger.default_level = :trace # Prefab will take over the filtering
SemanticLogger.add_appender(
  io: $stdout, # Use whatever appender you like
  filter: Prefab.log_filter, # Insert our Prefab filter
)

PR (#173) was a delightful +304/-1046, which reflects all of the work that we no longer had to do by being able to just rely on our new logging friend.

· 8 min read
Jeff Dwyer

I've got a fun sleuthing mystery for you today. Follow along and see if you can figure out what the problem is before the end.

So we're spinning up a new application at Prefab. (We want to offer SAML to our customers. So we're building out a quick authentication app, with the idea that perhaps we also sell this as a service one day.)

This is a brand new Rails app, 3 weeks old. We deploy to kubernetes which is running in GKE autopilot.

  1. Locally everything working great.
  2. Staging, everything working great.
  3. Production... not happy. Pod restart alerts.

What gives?

So first things first, we pull out Lens and see we've got some OOM. Okay, that's annoying and a bit surprising. It's configured for 768Mi and that should be enough since we've got other rails apps doing more with less, so something is weird.

And when I say "running out of memory" I don't mean "slow memory leak". This thing is getting killed very quickly: 1, 2 minutes and kaboom.

What's very odd though is that the service is actually up! The pod is getting killed constantly, but.. actually 1 pod of the 2 pod deployment is happy. What the heck??

Well, let's see if we're just under-provisioned and give it a bit more room. Let's give it 3GB as overkill and see what happens.

Memory usage increasing rapidly

Nope! That's a hungry hungry hippo, easily chewing through 3GB of memory.

That's a hungry hungry hippo!

Time For More Information

So ok, looks like some kind of memory leak, so let's get some more info. First we turned on Datadog profiling.

To do that we run the app with ddprofrb, i.e. bundle exec ddprofrb exec rails server -p 3000 -b 0.0.0.0, and set the ENV vars:

  DD_PROFILING_ENABLED: "true"
DD_PROFILING_ALLOCATION_ENABLED: "true"
DD_PROFILING_EXPERIMENTAL_HEAP_ENABLED: "true" #alpha!
DD_PROFILING_EXPERIMENTAL_HEAP_SIZE_ENABLED: "true" #alpha!

And here's what we get:

Datadog heap profiling

Huh... those are pretty pictures, but... I don't see any sign of a memory leak. Allocations and heap all look fine. If anything it's reporting way too small. It's saying there's only 26MB and it's not growing. Huh.

Ok, well let's try some rack-mini-profiler, because I've had luck with that before. This lets us append ?pp=profile-memory to any URL and get output that includes:

memprof profiling

and... basically the same thing.

note

Allowing just anyone to ?pp=profile-memory on your production instance is a terrible idea, so this doesn't work out of the box. I enabled it with:

#application_controller.rb
before_action do
if Prefab.enabled?("rack.mini-profiler")
Rack::MiniProfiler.authorize_request
end
end

and

#application.rb
config.to_prepare do # https://github.com/MiniProfiler/rack-mini-profiler/issues/515
Rack::MiniProfiler.config.enable_advanced_debugging_tools = Prefab.get("rack.mini-profiler.enable_advanced_debugging_tools")
end

Which let me turn this on as a targeted feature flag.

Either way, I'm not seeing anything suspicious here. The overall memory allocations are nothing like the 3GB monster that I'm seeing in the Datadog metrics.

At this point I'm pretty confused. Conventional wisdom on the Internet is that jemalloc is better, so I threw that at the wall.

RUN apt-get update && apt-get install -y libjemalloc2
ENV LD_PRELOAD=libjemalloc.so.2

That shouldn't fix a leak, but... maybe magic happens? Nope. No discernible change.

Alright... Time to Think

Let's recap:

  1. 1 Deployment with 2 pods: one good, one evil.
  2. Evil pod is a very hungry hippo. Easily gets to 3GB quickly.
  3. Memory profiling tools don't reflect this usage, more like 90MB.
  4. Running on a GKE autopilot kubernetes cluster.
  5. Brand new Rails 7.1.3 application.
  6. Many other rails apps working fine with the same env vars, application.yaml etc.

So, can you figure it out? Time for some classic choose your own adventure:

  • If you think there's a bug in Rails turn to page 33.
  • If you think there's a memory leak in our code turn to page 43.
  • If you think this is a Kubernetes issue turn to page 56.
  • If you think DHH is personally at fault turn to page 88.

And what would you do as next steps?

  • Give the pods 34 GB of memory and cross your fingers.
  • Get deep in the code trying to print out more ObjectSpace to see where this leak is.
  • Go old-school and run top on the servers.
  • Investigate Ruby memory management eg RUBY_GC_MALLOC_LIMIT_MAX & RUBY_GC_HEAP_GROWTH_FACTOR.
  • Try to reproduce locally under load.
  • Other.

While you wait I'll make a brief plug for:

Feature Flags: They are Great

Feature Flags are great and you should use them. This was not a very fun day at the office, but all this sleuthing around would've been soooo much worse if there were any customers using this service. As is this just made a lot of noise and affected internal testing, but we could turn it off at anytime because it was flagged.

So we had a lot of alarms going off because something was unhappy in production, but there was no actual impact on customers. This is living the Feature Flag dream!

So use feature flags everyone. Use your internal tool or some other SaaS or ours, but USE THEM! Yay for feature flags.

And the answer is...

If you guessed it's DHH's fault you win! (Please read this with the loving kindness I intend, I just had a chuckle that it was actually his commit that caused this issue.)

The answer is that a commit in a very recent Rails release triggered our app to behave weirdly when deployed.

So what is the issue?

Rails recently did a bunch of upgrades to the default Puma configuration. This issue in particular was actually really interesting and an awesome look at a lot of smart people negotiating in public: https://github.com/rails/rails/issues/50450. The upshot is that while Rails did a bunch of work on the Puma defaults that is better in most cases, one commit had some pathological behavior on shared hosting.

Rails / Puma is different from many other languages / frameworks in that it typically uses a combination of threads and workers. In other languages like Java you would run a single process and then parallelize with threads. In Ruby it's common to fork processes to gain parallelism. Forking gives us another copy of our process. It shares some memory because of copy-on-write, but in general it's a whole new process with its own memory overhead.

With that in mind, here's the problematic commit. https://github.com/rails/rails/commit/839ac1e70f7fd340a93348d54cde626fbdbedcb6

worker_count = ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count }

Can you see the issue yet?

The key diagnostic step that blew the case wide open was the humble top command. James ran it on the happy pod and saw a handful of ruby processes. Then he tried it on the evil pod and saw... OMG SO MANY RUBY. Yep, 32! ruby processes all chugging away.

What happened?

Physical Processors In A Virtual World

We're running GKE autopilot. This is a fully managed kubernetes that is very easy to set up and that we quite like. We don't need to worry about anything except our pods and how much cpu / memory we want them each to have.

This is fine, up until the point that a pod starts asking tough questions about the underlying hardware. Shared hosting means my pod may be on a 4CPU instance or a 64CPU instance. In our case, we have 2 pods guaranteed to run on different instances because of anti-affinity rules. In particular, one of the pods ended up on an e2-standard-4 and one on an e2-highcpu-32. So one pod was running with 4 workers and one was trying to spawn 32 workers. That explains the 3GB of memory usage!! And this explains why we had one good pod and one evil pod.

I believe it also explains why the Datadog profiling and ruby memory profiling tools didn't work as expected. Each worker of a ruby process is a FORK. It's a different pid, different process, no communication back to the mothership. So when we were running ObjectSpace and other things, we were just looking at our own little slice of the universe, and each slice of the universe was fine. It was just a case of too many slices.
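A tiny illustration of that effect (just a sketch, not from our app): each forked child has its own pid and its own heap, so any per-process measurement only ever sees one slice.

3.times do
  fork do
    # Each child reports only its own heap; nothing here sees the sum across workers.
    puts "pid=#{Process.pid} heap_live_slots=#{GC.stat(:heap_live_slots)}"
  end
end
Process.waitall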

This also explains why this only affected our newest app. This change came into Rails recently in the 7.1.0.beta1 and this was our only app on Rails 7.1.

Good News

The good news is that it's not in rails/main as it was reverted in https://github.com/rails/rails/commit/379d7951f1ea92b925d2fab44b3e7e0245456c74.

Bad news is that it's not in a released version yet (latest is 7.1.3.2). But I'm sure it'll be out soon.

Summary

We did it! We removed Concurrent.physical_processor_count by hand and replaced it with our config system since that's how we roll. For us that looked like the following, but ENV vars or hardcoding would work too.

threads Prefab.get("rails.min.threads"), Prefab.get("rails.max.threads")

if Prefab.get("rails.worker.count") > 1
require "concurrent-ruby"
workers Prefab.get("rails.worker.count")
end

And it worked! Our hippos have been satiated.

Happy example of memory usage

My particular take-away from this was a reminder to my future self that forking makes diagnostic / observability tools behave quite differently from some of my expectations. That's a tricky one to keep in mind as I bounce between languages.

Hope you enjoyed the ride!


· 3 min read
Jeffrey Chupp

If you're just getting started with the LSP, you might wonder what language to build your Language Server (LS) with. This article will help you pick the right language. You can choose anything (seriously, I built a toy Language Server in Bash). There's no universally correct answer, but there’s a correct one for you.

Thinking...

Consideration 1: Audience

Your audience is the most important consideration. If you're writing a language server for a new Python web framework (Language Servers aren't just for languages, people), then implementing the language server in Java might raise a few eyebrows.

The audience for a Python framework is less likely to contribute to a language server written in a language they're less familiar with. There's nothing wrong with Java (IMHO), but the biases associated with languages could hurt adoption.

If your language server is for a specific language or tooling tied to a specific language, you should probably use the same language to build the server.

Consideration 2: Personal Preference

If you're building your language server as a hobby, the first user is yourself. Optimize for your own enjoyment.

You’re less likely to have fun building if you pick a popular language with unfamiliar (or poor) ergonomics. If you're not having fun, you're less likely to get very far with the project, and your language server won't ever matter to anyone else anyway.

This doesn't mean you should limit yourself to languages you're already an expert in -- building a language server is a great way to learn how a new language handles

  • stdin/stdout and other communication channels
  • concurrency and parallelism
  • error handling
  • testing, debugging, profiling
  • etc.

Consider picking a language you'll enjoy using.

Non-consideration: Performance

Unless you're building a language server to replace one that is demonstrably slow, you should probably avoid optimizing your decision for performance. Measure first before you start hand-coding assembly code.

You're a developer; I get it. You want to think performance matters. Suppose computationally intensive behaviors are required to calculate diagnostics/code actions/etc. In that case, you can always shell out to something tuned for performance and still keep the Language Server itself implemented at a higher level.

Don't worry about performance. It isn't important at first, and you have options later.

Non-consideration: Ecosystem and Libraries

Many languages already have libraries that provide abstractions to help you write language servers. These can jump-start your development but aren't going to make or break your project.

You have all the building blocks you need if you can read and write over stdin/stdout and encode and decode JSON.

Learn more and build alongside me in my LSP From Scratch series.

You can build a language server without third-party libraries.

What If There's No Clear Winner?

If the considerations above haven't helped you pick a clear winner, choose TypeScript (or, if you must, JavaScript).

The first-party libraries (e.g., vscode-languageserver-node) are written in TypeScript, and the community and ecosystem are excellent. A discussion on the vscode-languageserver-node project often leads to an update to the spec itself.

As a bonus, servers written in TypeScript (and JavaScript) can be bundled inside a VS Code extension and be available in the VS Code Marketplace as a single download. I've put up a Minimum Viable VS Code Language Server Extension repo where you can see how this all fits together.

All things being equal, choose TypeScript.

· 6 min read
Andrew Yip

It's common to use static site generators like Jekyll or Docusaurus for marketing or documentation websites. However, it's not always easy to run A/B tests when using these tools.

Prefab makes it simple. In this post we'll show how to set up an A/B test on a statically-generated Docusaurus website. We'll also show you how to send your experiment exposures to an analytics tool. We'll be using Posthog, but the process should be very similar for any analytics tool that has a JS client.

Installing Prefab

This step is the same as for adding Prefab to any other React project.

npm install @prefab-cloud/prefab-cloud-react

Initializing Prefab in Docusaurus

We recommend using the PrefabProvider component from our React library. In a normal React application, you would insert this component somewhere near the top level of your app. For a Docusaurus site, the easiest place to add it is in the Root component. That way Prefab will be available for experimentation on any page of your site.

tip

If you haven't already swizzled the Root component, here's a link to the Docusaurus docs for how to do it: https://docusaurus.io/docs/swizzling#wrapper-your-site-with-root

Everything that we're going to do here needs to run client side, so we'll start by adding the Docusaurus useIsBrowser hook to our Root component.

import React from "react";
import useIsBrowser from "@docusaurus/useIsBrowser";

export default function Root({ children }) {
const isBrowser = useIsBrowser();

if (isBrowser) {
// do client stuff
}

return <>{children}</>;
}

This is the basic initialization for the Prefab client.

import React from "react";
import useIsBrowser from "@docusaurus/useIsBrowser";
import { PrefabProvider } from "@prefab-cloud/prefab-cloud-react";

export default function Root({ children }) {
  const isBrowser = useIsBrowser();

  if (isBrowser) {
    const onError = (error) => {
      console.log(error);
    };

    return (
      <PrefabProvider apiKey={"YOUR_CLIENT_API_KEY"} onError={onError}>
        {children}
      </PrefabProvider>
    );
  }

  return <>{children}</>;
}

Adding Context for Consistent Bucketing

Often A/B tests are bucketed based on users. To do that, we need some consistent way to identify the user, even if they're not logged in...which is usually the case for a static site. Luckily you can probably get an identifier from whatever analytics tool you have installed, or you can generate one yourself.

const uniqueIdentifier = window.posthog?.get_distinct_id();

Once you have the identifier, you can pass it to the Prefab client as context.

const contextAttributes = {
user: { key: uniqueIdentifier },
};

<PrefabProvider
...
contextAttributes={contextAttributes}
...
>
{children}
</PrefabProvider>
tip

We have some opinions about why you might want to generate your own unique tracking ID.

Tracking Experiment Exposures

Your experiment is only going to be useful if you have data to analyze. Prefab is designed to work with whatever analysis tool you already have, so you don't have a competing source of truth. To do this we make it easy to forward exposure events to your tool of choice.

Typically you will have initialized your tracking library as part of the Docusaurus config. You can then provide an afterEvaluationCallback wrapper function to the Prefab client. This will be called after each use of isEnabled or get to record the flag evaluation and resulting value. In this example we're using the Posthog analytics platform.

<PrefabProvider
...
afterEvaluationCallback={(key, value) => {
window.posthog?.capture("Feature Flag Evaluation", {
key, // this is the feature flag name, e.g. "my-experiment"
value, // this is the active flag variant, e.g. true, "control", etc.
});
}}
...
>
{children}
</PrefabProvider>

Here's an example chart from Posthog showing an experiment funnel going from experiment exposure to viewing any other page.

Prefab experiment analysis
tip

Prefab also provides evaluation charts for each feature flag, which you can find under the Evaluations tab on the flag detail page. This telemetry is opt-in, so you need to pass collectEvaluationSummaries={true} to PrefabProvider if you want the data collected. While these are lossy and not a substitute for analysis in your analytics tool of choice, they can be useful for troubleshooting experiment setup. Below is an example of an experiment with a 30/70 split.

Prefab experiment analysis

Setting up Your Experiment Code

Congrats, now you're ready to use Prefab from any Docusaurus JSX page or component. Import the usePrefab hook and use it to get a value for your experiment.

import React from "react";
import Layout from "@theme/Layout";
import { usePrefab } from "@prefab-cloud/prefab-cloud-react";

export default function Hello() {
const { isEnabled } = usePrefab();

return (
<Layout title="Hello" description="Hello React Page">
{isEnabled("my-experiment") && (
<div>
<p>"Some experimental copy..."</p>
</div>
)}
</Layout>
);
}
tip

The usePrefab hook also provides a get function for accessing non-boolean feature flags.

Is it Fast?

The Prefab client loads feature flag data via our CDN to ensure minimal impact on your page load speed. It also caches flag data after the initial load. You can read more about the Prefab client architecture in our docs.

Will it Flicker?

There's a catch here, which is not specific to using Prefab. Since Docusaurus is a static site generator, it does not execute any server-side logic when pages are requested. There are more details in the Docusaurus static site generation docs.

This means that the page will first render the static version, which means no access to cookies or to the Prefab flags data. Once your React code runs client-side, it will render again with the correct feature flag values from Prefab.

So in the example above, the page will initially load without your experiment content. Then it will pop-in on the re-render. You'll have to make a judgement call on whether this negatively impacts the user experience, depending on where the experiment is on the page and how it affects the layout of other page elements.

The alternative is to render a loading state on the initial render, then display the actual content once the Prefab client has loaded.

const MyComponent = () => {
  const {get, loading} = usePrefab();

  if (loading) {
    return <MySpinnerComponent />;
  }

  switch (get("my-experiment")) {
    case "experiment-on":
      return (<div>Render the experiment UI...</div>);
    case "control":
    default:
      return (<div>Render the control UI...</div>);
  }
};

You can read a more in-depth discussion of handling loading states in the Prefab React client docs.

Configuring the Experiment in the Prefab Dashboard

I wrote a detailed walkthrough of creating flags in the Prefab UI in a previous blog post.

For a simple experiment with only a control and an experiment treatment, you'll want to create a boolean feature flag. The important part for making it an experiment is defining a rollout rule for targeting. Notice that we are setting user.key as the "sticky property". This means that Prefab will use the unique identifier we passed in for segmenting users into the two experiment variants.

Prefab experiment settings

· 5 min read
Jeffrey Chupp

So you've got a misbehaving function in your Node app, and you need to debug it. How can you get more logging? It would be great if you could add log lines to your function, but only output them when you need them so you don't create a bunch of noise & expense. You can do this with Prefab dynamic logging for Node.

Let's see how to enable logging for:

  • A single function
  • A particular user
  • For just 1 hour

The Code We Want To Debug

Here's a really basic skeleton of an Express app. It has a simple route that takes a user id from the url and returns some data from the database. Let's pretend it's misbehaving and we need to debug it.

We've added two console.log statements, but this probably isn't shippable as is because, at high throughput, we're going to print out way too much logging.

app.get("/users/:id", (req, res) => {
const userId = req.params.id;

var sql = "SELECT * FROM users WHERE id = $1";
console.log(`running the following SQL ${sql}`, { userId: userId });

db.run(sql, [userId], (err, rows) => {
if (err) {
// ...
}

console.log("query returned", { rows: rows });
res.send(`200 Okey-dokey`);
});
});

Add & Initialize Prefab

The first thing we're going to do is add Prefab. We'll use the standard NodeJS server side client. This gives us an SSE connection to Prefab's API out-of-the-box so we'll get instant updates when we change our log levels.

const { Prefab } = require("@prefab-cloud/prefab-cloud-node");

const prefab = new Prefab({
apiKey: process.env.PREFAB_API_KEY,
defaultLogLevel: "warn",
});

// ... later in our file
await prefab.init();

Swap Logging to Prefab

Rather than use a console.log we will create a Prefab logger with the name express.example.app.users-path and the default level of warn so we don't get too much output.

We can replace our console.log with some logger.debug and logger.info and now it's safe to deploy. They won't emit logs until we turn them on.

const logger = prefab.logger("express.example.app.users-path", "warn");

// simple info logging
logger.info(`getting results for ${userId}`);

var sql = "SELECT * FROM table WHERE user_id = $1";

// more detailed debug logging
logger.debug(`running the following SQL ${sql} for ${userId}`);

db.run(sql, [userId], function (err, rows) {
  logger.debug("query returned", { rows: rows });
  res.send(`200 Okey-dokey`);
});

Listen for changes and Turn On Debugging in the UI

We can now toggle logging in the Prefab UI! Just choose express.example.app.users-path, change it to debug and a minute later you'll see the debug output in your logs.

Change the log level for this express route

Adding Per User Targeting

To add per user targeting, we need to set some context for Prefab so it can evaluate the rules. We should move the logger creation inside this context so that the logger knows about the user id.

// take the context from our url /users/123 and give it to prefab as context
const prefabContext = { user: { key: userId } };

// wrap our code in this context
prefab.inContext(prefabContext, (prefab) => {
  const logger = prefab.logger("express.example.app.users-path", "warn");

  logger.info(`getting results for ${userId}`);

  var sql = "SELECT * FROM table WHERE user_id = $1";

  // more detailed debug logging
  logger.debug(`running the following SQL ${sql} for ${userId}`);

  db.run(sql, [userId], function (err, rows) {
    logger.debug("query returned", { rows: rows });
    return res.send(`200 Okey-dokey`);
  });
});

We can now create the rules in the Prefab UI for just 1 hour and just user 1234. This will let us see the debug output for just that user and automatically stop debug logging after the hour is up.

Target express route logging to just a single user

That's It!

If we load the pages /users/1000, /users/1001 and /users/1234 we'll see the following output in our logs. We have INFO level logging for the first two, but DEBUG level logging for the last one because it matches our user.key rule.

INFO  express.example.app.users-path: getting results for 1000
INFO express.example.app.users-path: getting results for 1001
INFO express.example.app.users-path: getting results for 1234
DEBUG express.example.app.users-path: running the following SQL SELECT * FROM table WHERE user_id = $1 for 1234
DEBUG express.example.app.users-path: query returned { rows: [ { id: 1, user_id: 1234, account: active, balance: 340 } ] }

Full Code Example

const express = require("express");
const { Prefab } = require("@prefab-cloud/prefab-cloud-node");

const prefab = new Prefab({
  apiKey: process.env.PREFAB_API_KEY,
  defaultLogLevel: "warn",
});

const app = express();
const port = 3000;

// Mock database for the purposes of this example
const db = {
  run: (sql, params, callback) => {
    callback(null, []);
  },
};

const main = async () => {
  app.get("/users/:id", (req, res) => {
    const userId = req.params.id;
    // take the context from our url /users/123 and give it to prefab as context
    const prefabContext = { user: { key: userId } };

    // wrap our code in this context
    prefab.inContext(prefabContext, (prefab) => {
      const logger = prefab.logger("express.example.app.users-path", "warn");

      logger.info(`getting results for ${userId}`);

      var sql = "SELECT * FROM table WHERE user_id = $1";

      // more detailed debug logging
      logger.debug(`running the following SQL ${sql} for ${userId}`);

      db.run(sql, [userId], function (err, rows) {
        logger.debug("query returned", { rows: rows });
        return res.send(`200 Okey-dokey`);
      });
    });
  });

  await prefab.init();

  app.listen(port, () => {
    console.log(`Example app listening on port ${port}`);
  });
};

main();

To learn more about Prefab dynamic logging, check out the dynamic logging docs, or check out the other things you can do with Prefab in Node, like feature flags.

· 6 min read
Jeff Dwyer

So you've got a misbehaving Netlify function and you need to debug it. How can you get more logging? It would be great if we could add log lines to our function, but only output them when we need them so we don't create a bunch of noise & expense. We can do this with Prefab dynamic logging for Netlify.

In this post, we'll add dynamic logging to our Netlify function that will let us turn on debug logging for:

  • A single function
  • A particular user
  • For just 1 hour

The Code We Want To Debug

Here's a really basic skeleton of a Netlify function. It's a simple function that takes a user id from the url and returns some data from the database. Let's pretend it's misbehaving and we need to debug it.

We've added two console.log statements, but this probably isn't shippable as is because, at high throughput, we're going to print out way too much logging.


export default async (req, context) => {
  const {userId} = context.params;

  var sql = "SELECT * FROM table WHERE user_id = $1";
  console.log(`running the following SQL ${sql}`, {userId: userId});

  db.run(sql, [userId], function (err, rows) {
    console.log("query returned", {rows: rows});
    return new Response("200 Okey-dokey");
  });
};

export const config = {
  path: "/users/:userId"
};

Add & Initialize Prefab

The first thing we're going to do is add Prefab. We'll use the standard NodeJS server-side client, but we'll turn off the background processes. Since we're running on a lambda, we don't want any background processes in our function.

import {Prefab} from "@prefab-cloud/prefab-cloud-node";

var prefab = new Prefab({
apiKey: process.env.PREFAB_API_KEY,
enableSSE: false, // we don't want any background process in our function
enablePolling: false, // we'll handle updates ourselves
defaultLogLevel: "warn",
collectLoggerCounts: false, // turn off background telemetry
contextUploadMode: "none", // turn off background telemetry
collectEvaluationSummaries: false, // turn off background telemetry
});
await prefab.init();

Swap Logging to Prefab

Rather than use a console.log, we will create a Prefab logger with the name netlify.functions.hello and the default level of warn so we don't get too much output.

We can replace our console.log with some logger.debug and logger.info, and now it's safe to deploy. They won't emit logs until we turn them on.

const logger = prefab.logger("netlify.functions.hello", "warn");

// simple info logging
logger.info(`getting results for ${userId}`);

var sql = "SELECT * FROM table WHERE user_id = $1";

// more detailed debug logging
logger.debug(`running the following SQL ${sql} for ${userId}`);
db.run(sql, [userId], function (err, rows) {
  logger.debug("query returned", {rows: rows});
  return new Response("200 Okey-dokey");
});

This logging will not show up in your Netlify logs yet, because the logger is warn but the logging here is info and debug. That means it's safe to go ahead and deploy.

Listen for changes and Turn On Debugging in the UI

Since we turned off the background polling, we'll want to update prefab in line. We can do this by calling the updateIfStalerThan with our desired polling frequency. This is a quick check to a CDN, taking around 40ms (once every minute).

prefab.updateIfStalerThan(60 * 1000); // check for new updates every minute

We can now toggle logging in the Prefab UI! Just choose the function, change it to debug, and a minute later, you'll see the debug output in your logs.

Change the log level for this netlify function

This is pretty cool and you can stop here if this solves your needs. With this pattern you'll be able to instantly turn logging on and off for any function in your app.

Adding Per User Targeting

Now we'll go deeper and add per user targeting. This will let us laser focus in on a particular problem.

To add per user targeting, we need to tell Prefab who the current user is. We do this by setting some context for Prefab so it can evaluate the rules. We should also move the logger creation inside this context so that the logger has this context available to it.

// take the context from our url /users/123 and give it to prefab as context
const {userId} = context.params;
const prefabContext = {user: {key: userId}};

// wrap our code in this context
prefab.inContext(prefabContext, (prefab) => {
  // logger goes inside the context block
  const logger = prefab.logger("netlify.functions.hello", "warn");

  logger.info(`getting results for ${userId}`);

  var sql = "SELECT * FROM table WHERE user_id = $1";

  logger.debug(`running the following SQL ${sql} for ${userId}`);
  db.run(sql, [userId], function (err, rows) {
    logger.debug("query returned", {rows: rows});
    return new Response("200 Okey-dokey");
  });
});

We can now create the rules in the Prefab UI for just 1 hour and just user 1234. This will let us see the debug output for just that user and automatically stop debug logging after the hour is up.

Target netlify function logging to just a single user

That's It!

If we load the pages /users/1000, /users/1001, and /users/1234, we'll see the following output in our logs. We have INFO level logging for the first two, but DEBUG level logging for the last one because it matches our user.key rule.

INFO  netlify.functions.hello: getting results for 1000
INFO netlify.functions.hello: getting results for 1001
INFO netlify.functions.hello: getting results for 1234
DEBUG netlify.functions.hello: running the following SQL SELECT * FROM table WHERE user_id = $1 for 1234
DEBUG netlify.functions.hello: query returned { rows: [ { id: 1, user_id: 1234, account: active, balance: 340 } ] }

Full Code Example

import {Prefab} from "@prefab-cloud/prefab-cloud-node";

var prefab = new Prefab({
  apiKey: process.env.PREFAB_API_KEY,
  enableSSE: false, // we don't want any background process in our function
  enablePolling: false, // we don't want any background process in our function
  defaultLogLevel: "warn",
  collectLoggerCounts: false, // we don't want any background process in our function
  contextUploadMode: "none", // we don't want any background process in our function
  collectEvaluationSummaries: false, // we don't want any background process in our function
});

export default async (req, context) => {
  prefab.updateIfStalerThan(60 * 1000); // check for new updates every minute

  // take the context from our url /users/123 and give it to prefab as context
  const {userId} = context.params;
  const prefabContext = {user: {key: userId}};

  prefab.inContext(prefabContext, (prefab) => {
    const logger = prefab.logger("netlify.functions.hello", "warn");

    logger.info(`getting results for ${userId}`);

    var sql = "SELECT * FROM table WHERE user_id = $1";

    logger.debug(`running the following SQL ${sql} for ${userId}`);
    db.run(sql, [userId], function (err, rows) {
      logger.debug("query returned", {rows: rows});
      return new Response("200 Okey-dokey");
    });
  });
};

export const config = {
  path: "/users/:userId"
};

To learn more about Prefab dynamic logging, check out the dynamic logging docs, or explore the other things you can do with Prefab in Netlify, like feature flags.

We set this up to target a particular user, but you can easily target anything else you provide in the context. Team ID, transaction ID, device ID, and device type are all common examples.
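For example, here's a rough sketch of what richer context could look like. The team and device values below are made up for illustration; only the user context appears in the examples above.

const prefabContext = {
  user: {key: userId},
  team: {key: teamId}, // hypothetical: the team this request belongs to
  device: {type: deviceType}, // hypothetical: e.g. "ios" or "android"
};

prefab.inContext(prefabContext, (prefab) => {
  const logger = prefab.logger("netlify.functions.hello", "warn");
  // targeting rules in the Prefab UI can now match on team.key or device.type as well as user.key
  logger.debug("handling request", {userId, teamId, deviceType});
});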

Happy dynamic logging!

· 4 min read
Jeff Dwyer

Introduction

How should we integrate feature flags into Netlify functions? We'll explore why it's a bit tricky with lambdas, and I'll guide you through the best approaches to make it work efficiently.

The Lambda Challenge

Lambdas, like those in Netlify functions, are transient and don't run indefinitely. They're frozen after execution. This behavior poses a unique challenge for feature flags, which need to be swift and efficient and typically achieve this by using a background process to update the flag definitions.

Understanding Feature Flag Paradigms

Feature flags generally operate in two ways:

  1. Server-Side Flags: Here, your server connects to the flag server, downloads the necessary data, and performs local flag evaluations. This setup ensures no network calls during flag evaluations. Plus, we can manage telemetry asynchronously to avoid slowing down requests.

  2. Client-Side Flags: Common in web browsers, this approach involves making a network call to fetch flag values. For example, sending user data to an evaluation endpoint on page load, which returns the flag states. These endpoints need to be optimized for low latency, because they get called on every request.

Netlify Functions: A Middle Ground

Netlify functions are neither purely server-side nor client-side. They can't traditionally run background processes, but they are more persistent than a web browser, so it would be nice to avoid network calls on every request. So, what's the best approach?

Feature Flags in Netlify: The Browser-Like Approach

A practical solution is to treat Netlify functions like a browser. Prefab's JavaScript client, for instance, caches flag evaluations per user in a CDN. Here's a sample code snippet for this approach:

import { prefab, Context } from "@prefab-cloud/prefab-cloud-js";

export default async (req, context) => {
  const clientOptions = {
    apiKey: process.env.PREFAB_API_KEY,
    context: new Context({user: {key: 1234}}),
  };

  await prefab.init(clientOptions);

  if (prefab.get("my-flag")) {
    // Your code here
  }

  return new Response("ok");
};

In my testing from a Netlify function, I see around 50ms of latency initially and then around 10ms for each subsequent request with the same context. That may be too slow for some applications, but it's a good starting point and very easy to set up.
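If you want to sanity-check those numbers in your own function, a minimal timing sketch (reusing the clientOptions from the example above) is enough:

const start = performance.now();
// cold invocations go out to the API/CDN; later invocations with the same context hit the CDN cache
await prefab.init(clientOptions);
console.log(`prefab.init took ${Math.round(performance.now() - start)}ms`);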

The nice thing about this solution is that you're going to get instant updates when you change a flag. The next request will have up-to-date data.

The Server-Side Alternative

Alternatively, you can implement a server-side strategy using the Prefab NodeJS client. The key is configuring the client to disable background updates and background telemetry, then performing updates on your own timeline.

Here's a sample code snippet for this approach:

import { Prefab } from "@prefab-cloud/prefab-cloud-node";

const prefab = new Prefab({
  apiKey: process.env.PREFAB_API_KEY,
  enableSSE: false, // we don't want any background process in our function
  enablePolling: false, // we'll handle updates ourselves
  collectLoggerCounts: false, // turn off background telemetry
  contextUploadMode: "none", // turn off background telemetry
  collectEvaluationSummaries: false, // turn off background telemetry
});

// initialize once on cold start
await prefab.init();

export default async (req, context) => {
  const { userId } = context.params;
  const prefabContext = { user: { key: userId } };

  return prefab.inContext(prefabContext, (prefab) => {
    if (prefab.get("my-flag")) {
      // Your code here
    }

    // every 60 seconds, check for updates in-process
    prefab.updateIfStalerThan(60 * 1000);
    return new Response("ok");
  });
};

export const config = { path: "/users/:userId" };

With this approach, most of our requests will be fast, but we'll have a periodic update that takes a bit longer, about 50ms in my testing from a Netlify function. We're entirely in control of the frequency here, so it's a judgment call on how real-time you want your feature flag updates. You could even disable the updates altogether if tail latency is of utmost concern and you don't mind redeploying to update your flags.
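For example, one way to make that trade-off explicit is to pull the refresh interval out into a constant (the name here is just illustrative) and skip the call entirely when it's set to zero:

// refresh interval as a deliberate knob; 0 means never update in-request, redeploy to change flags
const FLAG_REFRESH_MS = 60 * 1000;

if (FLAG_REFRESH_MS > 0) {
  prefab.updateIfStalerThan(FLAG_REFRESH_MS);
}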

Is there a better way?

The best way to solve this problem would be to use a Lambda Extension which could run a sidecar process to update the flags, then serve the flag data over localhost to your function. Unfortunately, Netlify doesn't support Lambda Extensions yet, but this is an exciting avenue to explore for other serverless platforms.
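As a purely illustrative sketch (the localhost port and payload shape here are invented, and again, Netlify doesn't support this today), a function talking to such a sidecar might look roughly like this:

export default async (req, context) => {
  // hypothetical: a sidecar extension keeps flag data warm and serves it over loopback,
  // so this call never leaves the machine
  const res = await fetch("http://localhost:3333/flags");
  const flags = await res.json();

  if (flags["my-flag"]) {
    // Your code here
  }

  return new Response("ok");
};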

Conclusion

Deciding between a browser-like or server-side approach depends on your specific use case in Netlify functions. Both methods have their merits. The browser-like method offers simplicity and instant flag updates, whereas the server-side approach gives a much better average response time at the cost of some tail latency and a configurable delay before flag changes are picked up. Choose what fits best for your application's architecture and performance requirements. Happy coding!

· 7 min read
Jeff Dwyer

We build configuration tooling here at Prefab, so it was a little embarrassing that our own local development configuration was a mess. We fixed it, we feel a lot better about it and we think you might dig it.

So What Was Wrong?

We used our own dynamic configuration for much of our config and that worked well, but when we needed environment variables everything started to fall apart. The pain points were:

Defaults In Multiple Places

Environment variables sound nice: "I'll just have my app be configurable from the outside". But in practice it can get messy. What are the default values? Do I need to specify defaults for everything? How do I share those defaults? When do I have fallback values and when do I blow up?

We had ended up with defaults in:

Ruby code:

# puma.rb
ENV.fetch('WEB_CONCURRENCY') { 2 }

A .env.example file:

GCS_BUCKET_NAME=app-development-billing-csv-uploads

Other yaml configs like config/application.yml:

STRIPE_SECRET_KEY: sk_test_1234566

production:
  STRIPE_PRODUCT_BUSINESS_MONTHLY_2022_PRICE_ID: price_1234556

And in Terraform in another repo:

resource "kubernetes_config_map" "configs" {
  metadata {
    name = "configs"
  }

  data = {
    "redis.uri" = "${local.redis_base_uri}/1"
  }
}

Per Env Configuration All Over the Place

Beyond defaults, where do I put the environment-specific overrides? Are these all in my devops CD pipeline? That's kind of a pain. Where are the production overrides? Could be anywhere! We had them in each of:

  1. config/production.rb
  2. database.yml production: section
  3. config/application.yml production: section

Duplicated Defaults Across Repos

Because we have multiple services, some of the defaults in the Ruby app's .env.example were also showing up in our Java app's src/main/resources/application-development.yml.

No Easy Way to Share Secrets / API Keys

As if all of the above wasn't enough of a mess, secrets had to have an entirely different flow. We were good about not committing anything to source control, but it was a pain to get the secrets to the right place and easy to forget how to do it.

Summary

We were surviving, but it wasn't fun, and the understanding and context fell out of our heads quickly. Whenever we needed to change something, we had to reload how it all worked into working memory, and it took longer than it needed to. For a longer rant on environment variables, check out 9 Things I Hate About Environment Variables.

What Would Be Better?

So, what would be better? We wanted:

  • A single place to look to see all of our configuration
  • Developers have a single API key to manage, no local env var mysteries
  • Defaults that are easy to override for local dev, but aren't footguns leading to Works On My Machine issues
  • Easy to share configuration between projects
  • Interoperability with our Terraform / IaaS / Kubernetes variables
  • A system that supports secrets as well as configuration
  • Telemetry on what values are actually being used in production for our IaaS / Terraform provided values

We had a ton of the infrastructure in place to support this from our dynamic configuration work, but when it came to environment variables we were still in the stone age.

Our Dream

Our dream looked like this: a single API key per developer and a single callsite:

# .env
# One API key per developer
PREFAB_API_KEY=123-Development-P456-E2-SDK-c12c561b-22c3-4a52-9c38-a8f24355c102

# database.yaml
default: &default
  database: <%= Prefab.get("postgres.db.name") %>

We wanted to be able to see all of our configuration in one place:

The Prefab Config UI for a config using normal strings as well as provided strings.

Prefab UI for database name

It's clear what the value is in every environment, and I can see which environments are getting the value from a Terraform-provided env var.

What We Did to Enable This

There were 3 big things we needed to support to make this happen: Environment variables, Datafiles & Secrets.

Provided ENV Vars as a Config Type

First we needed to allow a config value to be "provided by" an environment variable. You can now do that within the Prefab UI or CLI.

Set config to be provided by an ENV VAR in some environments

If you check the box for "Provide by ENV VAR" you can then specify the ENV VAR name for any environments that it should be provided in.
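The nice part is that application code doesn't care where a value comes from. Here's a small sketch using the Node client, assuming a config named postgres.db.name like the one above:

import { Prefab } from "@prefab-cloud/prefab-cloud-node";

const prefab = new Prefab({ apiKey: process.env.PREFAB_API_KEY });
await prefab.init();

// In environments where this config is "provided by" an ENV VAR, the value comes from that
// variable; everywhere else it comes from the value set in the UI. The calling code is identical.
const dbName = prefab.get("postgres.db.name");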

Datafile Support

Datafile support allows the Prefab client to start up using a single local file instead of reaching out to the Prefab API. This is useful for CI environments where you may want perfect reproducibility and no external network calls. You can generate a datafile for your local environment and then commit it to source control. This allows you to have a single source of truth for your configuration and secrets.

In pursuit of our goal of having a "Single Source of Truth" for our configuration, the original system of default files like .prefab.default.config.yaml ended up being more of a hindrance than a help. There's a big difference between a UI that is all-knowing and a system with partial knowledge that could be overridden by other files, re-introducing complexity into the system.

Making the API all-knowing is lovely, but if everything is in the API, what do we do for CI / Testing?

Our solution is to have 2 modes:

  1. Live mode: connect to the Prefab API and stay up to date.
  2. Datafile mode: offline, load a single local datafile and never make a network call.

The datafiles are easy to generate. You simply run prefab download --env test and it will download a datafile for the test environment. You can then commit that datafile to source control and use it in CI.

In CI environments you can then set PREFAB_DATAFILE=prefab.test.123.config.json and the client will use that datafile instead of reaching out to the API.

Secrets

The last big piece of this work was supporting secrets. If we were going to clean this all up once and for all, it just didn't work to still be on our own for secrets. I'll cover that in a future blog post, but if you're interested in our Secrets Management Beta, please let us know. It's a zero-trust, CLI-based solution that we think is dead simple and easy to use.

Prefab Secret Management

What's Next?

We're really happy with how this turned out. Everything just feels... right. Configuration is important. Configuration goes in one place. It sounds like that should be easy, but in my experience, it's not the world many of us have been living in.

If you've been living in a monolith world deploying to Heroku, you've long been enjoying the simple pleasure of heroku config:set GITHUB_USERNAME=joesmith. But if you have more than one application, apps in different languages, or aren't deploying to something simple like Heroku, the story has been much worse.

What we've built has been a big improvement for us and we think it will be for you too. We're going to be rolling this out to all of our SDKs over the next few weeks. We'd love to hear what you think.

· 3 min read
Jeff Dwyer

We're thrilled to introduce our new Editor Tools for React developers!

As React developers, we cherish our time in the editor. However, dealing with Feature Flags typically meant stepping out of that space. We pondered, "What if we could manage everything directly from the editor?" The result is something we're really proud of.

Feature Flag Autocompletion

First off, we've integrated an autocomplete feature for feature flags. A mistyped feature flag name can be a nuisance, especially since missing flags default to false, leading to tricky debugging. Let your editor assist you. Enjoy autocomplete for flag and configuration names, and the ability to auto-create simple flags if they don't exist yet.

Feature Flag autocomplete

Feature Flag Evaluations Data on Hover

Implementing a feature flag is often straightforward. The real challenge is monitoring its status. Is it active? Can it be removed? What's its production value?

We envisioned how amazing it would be to integrate evaluation data directly into the editor. The result is indeed amazing! Now, you can get all the answers with a simple hover, without ever leaving your editor.

Feature flag evaluation data on hover

This lets you see if a flag is set to true in your staging or demo environment, or is doing a percentage rollout in production.

Toggle Feature Flags

Don't leave your editor to toggle a feature flag. Simply click on the flag and set it to true.

Toggle a feature flag from your editor

Personal Overrides

Ever accidentally committed if true || flags.enabled?("myflag")? I've done it. It happens when I want to see a flag's effect but don't want to change it globally. So, I temporarily tweak it to true and then sometimes forget to revert it.

Personal override for a feature flag

Wouldn't it be better to simply click on the flag and set it to true just for your local environment? This personal overrides feature, linked to your developer account, lets you do just that. Test your feature without disrupting others, all within your coding flow.

Summary

We're absolutely digging these tools internally and we're excited to expand upon them. We think being able to detect zombie or stale flags right in the editor would be very useful. We feel like we've taken a big step forward with the inline evaluation data on hover, but we want to keep pushing. We'd love to hear your ideas for how we can make these tools even better.

· 8 min read
Jeff Dwyer

In the world of software development, environment variables are how we configure our applications. The Twelve-Factor app methodology made this canonical and was a significant improvement over the terrible things we'd done before. However, I think we can do better.

Looking at the big picture, we've essentially created a system of global, untyped variables with no declarations and no defaults – a scenario that would be unacceptable in regular code. Yet, here we are, using this approach for one of the most critical aspects of our applications.

Specific Challenges with Environment Variables

1. Environment Variable Whack-a-Mole

How often have you cloned an app only to be greeted with a slew of errors due to missing environment variables? Start the app, it explodes, hunt down the value for the env var, start the app, explode on another env var, etc. I asked a friend how big a problem this was on a scale of 1-10. I think he spoke for us all when he said: "Mostly a 1 or 2. Yesterday, it was an 11."

  • Examples:
    • Api.get(key: ENV["THE_KEY"]) will lead us to frustrating, mysterious 401 errors when it isn't defined
    • Api.get(key: ENV.fetch("THE_KEY")) will raise an error, but now we're mole-whacking.
  • Doesn't Dotenv Fix It?: Sometimes. Dotenv has been a huge improvement, but over time each developer's local .env starts straying from the common .env.example and we get a lot of "it works on my machine" issues. Oh... and it's got nothing for secrets.

2. Scattered Defaults

Env vars are big global variables, and there isn't even a clear answer to where we put the values. Most codebases end up with a mix of defaults in the ENV invocation, some in .env files or maybe a .env.production file. Possibly a config/staging.yaml. Maybe something from our continuous deployment. Some things in a kubernetes configmap. It's a mess.

  • Examples:
    • .env.production using dotenv for deployed envs.
    • config/default.yaml or config/production.yaml YAML configs.
    • config.x.swarm_count = ENV.fetch('SWARM_COUNT', 3) in-line defaults.
    • config.x.configure_sys = !Rails.env.test? this looks like a config value but isn't actually updateable.
  • Issue: Defaults are inconsistently spread throughout the codebase, creating a chaotic and confusing setup.

3. No Types & Unsafe Interpolation

Speaking of chaotic messes, how much fun is it debugging an issue when your env var is a string but you're expecting a boolean? Or when you're expecting a comma-delimited array, but somebody left a space in it and the env var isn't quoted somewhere? Good times.

  • Examples:
    • config.x.use_seeds = ENV.fetch('USE_SEEDS', 'false') == 'true' (Potential boolean misinterpretation)
    • config.x.cors_origins = ENV.fetch('CORS_ORIGINS', '').split(',') (Complications with array parsing)
    • config.x.timeout_millis = ENV.fetch('TIMEOUT', '1') * 1000 (Potential for unit mismatches abound)
  • Issue: The lack of inherent type safety necessitates extra coding for handling data types, increasing the risk of errors.

4. What Value is it in Production?

How many times have you had to SSH into a production server to check the value of an environment variable? Or had to ask an ops person to do it for you? It's a pain, it's gross, it's a security risk. Environment variables: the really important configuration variables that you can't actually see or audit.

Partly, this is from scattered defaults, but mostly, this is from the complexity of the systems we've built to inject these variables and the lack of telemetry on their usage.

  • Issue: Assessing the environment variable values in production is cumbersome, requiring system access and specific commands.
  • Impact: This adds complexity to troubleshooting and configuration verification in live environments.

Why can't I use a CLI to see this? Why can't I just hover in my editor and see the configuration in each environment and the actual runtime values?

~/app (main)  $ prefab info
? Which item would you like to see? postgres.db.ip

- Default: 127.0.0.1
- Development: [inherit]
- Production: `POSTGRES_DB_IP` via ENV
- Staging: `POSTGRES_DB_IP` via ENV
- Test: [inherit]

Evaluations over the last 24 hours:

Production: 5
- 100% - 10.1.1.1

Staging: 2
- 100% - 10.11.12.13

Development: 25
- 100% - 127.0.0.1

No more ssh and printenv; I should just be able to do this from the comfort of home.

5. Refactoring Environment Variables is Terrible

Want to change an environment variable name? Good luck. Enjoy slacking everyone that they need to update their .env file in every repo.

Want to spin up a new application? Copy pasta the old .env around and let the duplication party begin.

Want to update the default across all your apps? Good luck.

  • Issue: Each .env is a massive duplication of our configuration, and this makes refactoring hard.
  • Impact: We get crufty code.

6. Cross Language Incompatibility

In truth, Rails has a decent story around all of this for a monolith. And various languages and frameworks have good approaches. But, what's that you say? You have a node app and a rails app? A Java app, too? And you'd like to... gasp... share a configuration value across them all? Sorry, mate, you're on your own.

  • Issue: Custom configuration libraries for each language create a lack of consistency and interoperability.
  • Impact: Lack of interoperability means cut-overs to the new redis.uri need to happen on a per-language basis and require understanding the configuration system (or systems) in each repository.

7. Question of Scale: How Many is Too Many?

How many environment variables is the right number? Ten or twenty is certainly fine. 100 sure feels like a lot and makes things ugly. 1000? More? No, thank you.

But... how many aspects of my system would I like to be configurable? Well, if you take off the shackles of having to jam everything into an env var, I suppose I'd actually like to configure lots of things. Should my http timeout be the same for every single request? Actually, I’d like to tune that at a fine-grained level. But I sure as heck am not going to do that if there is one env var per config. TIMEOUT_AUTH_SERVICE_FROM_BILLING_SERVICE=5000 is madness.

  • Issue: The way environment variables work fundamentally encourages a small number of variables, which is at odds with the desire to have a highly configurable system.
  • Impact: We build systems and libraries without as many knobs and levers as we'd like, and this limits our options for real-time adjustments to production issues.
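To make that concrete, here is a hedged sketch of what a fine-grained knob could look like once adding one doesn't cost you another env var. The config key name is hypothetical:

import { Prefab } from "@prefab-cloud/prefab-cloud-node";

const prefab = new Prefab({ apiKey: process.env.PREFAB_API_KEY });
await prefab.init();

// instead of TIMEOUT_AUTH_SERVICE_FROM_BILLING_SERVICE=5000 as yet another env var,
// a dotted, per-call-site key can be tuned live, per environment, with its default kept in one place
const timeoutMs = prefab.get("billing-service.auth-service.http.timeout-ms");

const response = await fetch("https://auth.example.internal/check", {
  signal: AbortSignal.timeout(timeoutMs),
});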

8. Updates: Slow and Forgettable

How long does it take to update an environment variable in production? At most places, I would expect an hour or two. Yes, that's crazy, but that's the reality. Usually this is a ticket into your devops team, and then they have to go update the value in a ConfigMap or something. (I will admit that if you're on Heroku this probably takes 1 minute. This is how it should be!)

Changing a variable should be instant, but we have these variables locked into a system that, for most of us, is slow to update.

  • Issue: Updating environment variables can be time-consuming, particularly in larger and more complex systems.
  • Impact: Slow MTTR when issues could be fixed by configuration changes.

9. Secrets Management Requires a Different System

Secrets are just configuration too, or they should be, albeit with more permissions and confidentiality. However, our code needs to know the values just like it would any other variable. Instead, almost all of us have to operate two totally separate tools/processes for managing secrets and configuration.

  • Issue: Managing sensitive data often requires a separate system from standard environment variables, adding to the complexity of configuration management.

Secrets and config should live next to each other

I should be able to see all my configuration in one place, secrets, too. Sure, secrets are confidential and should be encrypted, but that doesn't mean I shouldn't be able to understand that my applications are using them.

Conclusion

Environment variables have got us a long way, but we can do better, and indeed, lots of organizations have built sophisticated dynamic configuration systems that address all of these issues. The future just isn't evenly distributed. Or... hasn't been until now.

The key elements of a better system are:

  1. A single view of all of my configuration
  2. Typed values like: string, bool, duration, arrays, etc.
  3. Defaults that are easy to override for local dev
  4. Easy to share configuration between projects
  5. Telemetry on what values are actually being used in production
  6. Interoperability with Terraform / IaaS / Kubernetes / Existing Secrets Management
  7. A system that supports secrets as well as configuration

As I said, to my knowledge, the best examples of systems that support all this typically come from internal tools at large companies. HubSpot talks briefly about theirs in How we deploy 300 times a day. Amplitude covers the architecture decisions behind theirs in Using DynamoDB for Dynamic Configuration, and Netflix's open source Archaius has a lot of the underpinning pieces, though no help on the UI. And, of course, we have Prefab, which is our attempt to bring this to the world.

What's Next?

I think we're a fair way along this journey here at Prefab, and we're excited to share what we've learned and what we've built. I'd love you to check out our dynamic configuration and let me know what you think.

To a world of better config for all 🚀