
· 5 min read
Jeffrey Chupp

Wouldn't it be great if you could log all of the gory details of a request, but only if it fails?

Choosing the appropriate logging level for our web app can be challenging. Specifying the DEBUG level yields detailed information, but the signal for failed requests can get lost in the noise of successful ones. A WARN level reduces volume but lacks detail to help track down bugs.

Take this example:

Rails.logger.level = :warn

class CriticalBusinessCode
  def calculate_grumpkin_quotient(apples, bananas)
    log.debug "calling fruit api with #{apples}, #{bananas}" # :( no info
    FruitApi.call(apples, bananas) # <= Throws mysterious "400 Bad Request"
  end
end

If we knew at the beginning of a request that it would fail, we'd set the log level to DEBUG (or even TRACE), right?

We don't have the technology to build a psychic Logger, but we can build the next-best thing: a Time-traveling Logger.

Time Traveling Logger

A first version

Let's start with the core Ruby Logger. Here's how it is used without modification.

require 'logger'

logger = Logger.new($stdout)
logger.level = Logger::WARN

logger.debug "DEBUG"
logger.info "INFO"
logger.warn "WARN"
logger.error "ERROR"

We told the logger ahead of time that we wanted WARN as our level, so we only see the WARN and ERROR output, not the DEBUG and INFO output.

How can we build on top of the core logger to allow specifying the level later in the program's execution — after some logging statements have already executed?

Queued logging

The key idea is to defer flushing the logs until the end of the request. We don’t have the context we need to make the verbosity decision up-front, but when the request finishes, we will. We just need a place to stash these logs while the request processes.

Building on top of the core Logger, we can implement a queue by overriding the add method:

require 'logger'

class QueuedLogger < Logger
  def initialize(*args)
    super
    @queue = []
  end

  def add(severity, message = nil, progname = nil)
    @queue << -> { super(severity, message, progname) }
  end

  def flush!(level = nil)
    old_level = self.level
    self.level = level if level

    @queue.each(&:call)
    @queue = []
  ensure
    self.level = old_level
  end
end

The implementation here isn't too exciting (in a good way). .debug, .info, etc. call add. With this change, we don’t immediately write the message but instead we throw it in a queue to be logged later.

When we're ready to dump out the logs, we can specify the level we want.

logger = QueuedLogger.new($stdout)
logger.level = Logger::WARN

logger.debug "DEBUG"
logger.info "INFO"
logger.warn "WARN"
logger.error "ERROR"

logger.flush!(Logger::DEBUG)

At flush! time, we look back through history and evaluate the log statements with the provided level.

Despite the level being set to WARN initially, our DEBUG and INFO lines will show up since we're flushing at the DEBUG level.
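As a quick sanity check, here's the flip side: flushing with no argument keeps the WARN filtering we configured. The class is restated here so the snippet runs standalone; the StringIO sink is just for inspecting output.

```ruby
require 'logger'
require 'stringio'

# The QueuedLogger from above, restated so this snippet runs on its own.
class QueuedLogger < Logger
  def initialize(*args)
    super
    @queue = []
  end

  def add(severity, message = nil, progname = nil)
    @queue << -> { super(severity, message, progname) }
  end

  def flush!(level = nil)
    old_level = self.level
    self.level = level if level
    @queue.each(&:call)
    @queue = []
  ensure
    self.level = old_level
  end
end

out = StringIO.new
logger = QueuedLogger.new(out)
logger.level = Logger::WARN

logger.debug "debug detail"
logger.warn "warn detail"

logger.flush! # no argument: the default WARN level still filters
puts out.string.include?("debug detail") # false: the debug line was filtered
puts out.string.include?("warn detail")  # true
```

Had we called flush!(Logger::DEBUG) instead, both lines would have been written, exactly as in the example above.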

Example usage

Let’s imagine we’re writing software for managing appointments for physicians. We have a background job to remind patients about upcoming appointments via SMS.

class AppointmentReminderJob
  def perform(doctor, patient, time, timezone)
    logger.debug "Scheduling appointment for #{doctor} and #{patient} on #{time} in #{timezone}"

    message = compose_message(doctor, patient, time, timezone)
    SMS.send(message, patient.phone)
  end

  def compose_message(doctor, patient, time, timezone)
    # ...
  end

  def logger
    @logger ||= Logger.new($stdout).tap do |logger|
      logger.level = Logger::WARN
    end
  end
end

If an exception happens when sending the reminder (perform), we’d like as much detail as possible. Unfortunately our default level of WARN means that our logger.debug statement is never written.

We could set the default level to DEBUG. While the logger.debug line here is useful if something goes wrong, it clutters our logs with noise when everything is working as intended.

We can get the best of both worlds by applying our QueuedLogger to capture debug-level messages only when an exception occurs.

Here’s the updated class:

class AppointmentReminderJob
  def perform(doctor, patient, time, timezone)
    logger.debug "Scheduling appointment for #{doctor} and #{patient} on #{time} in #{timezone}"

    message = compose_message(doctor, patient, time, timezone)
    SMS.send(message, patient.phone)
  rescue => ex
    logger.error(ex)
    # write at the debug level so we get all the details
    logger.flush!(:debug)
  ensure
    # write at the default level we specified (WARN in this example)
    logger.flush!
  end

  def compose_message(doctor, patient, time, timezone)
    # ...
  end

  def logger
    @logger ||= QueuedLogger.new($stdout).tap do |logger|
      logger.level = Logger::WARN
    end
  end
end

Where do we go from here?

QueuedLogger isn't quite ready to be thrown into your Rails app yet with config.logger = QueuedLogger.new(...). To get ready for prime-time, we'd want to isolate the queue for the request (maybe using ActiveSupport::CurrentAttributes, concurrent-ruby, or RequestStore).
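As a sketch of that isolation (using a plain thread-local rather than any of those libraries; this is illustrative, not a production recommendation), the queue lookup can be made per-thread so concurrent requests don't interleave their buffers:

```ruby
require 'logger'

# QueuedLogger with the queue moved into a thread-local, so each
# request thread gets its own buffer. A sketch only: CurrentAttributes
# or RequestStore would also reset state between requests for you.
class ThreadQueuedLogger < Logger
  def add(severity, message = nil, progname = nil)
    queue << -> { super(severity, message, progname) }
  end

  def flush!(level = nil)
    old_level = self.level
    self.level = level if level
    queue.each(&:call)
    queue.clear
  ensure
    self.level = old_level
  end

  private

  # Thread.current is keyed per-thread; on a thread-per-request server
  # this approximates per-request isolation.
  def queue
    Thread.current[:queued_logger_queue] ||= []
  end
end
```

Note that flush! still mutates the shared level for its duration, so a fully concurrent version would want to pass the level through to each write instead of toggling it on the logger.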

If you're interested in a battle-hardened version of this logger that is ready to use, let us know and I'll see what I can put together :)

There's some interesting discussion happening on Reddit.

Have a great day!

· 3 min read
Jeffrey Chupp

There's an open issue from March of 2020 titled sassc is very slow to compile and install. The issue has people pleading for help and asking for sassc-ruby to ship precompiled binaries (as nokogiri does). The place this hurts the most is building your Rails app with Docker, where you can pay a 10+ minute install time every time you modify any part of your Gemfile.lock.

Oof.

I have good news for those still stuck on sassc: Your Docker builds don't have to be slow.

Docker enthusiasts know that layer caching is a huge time saver. But modifying your Gemfile.lock breaks your bundle install layer and causes the previous cache to be unusable. Even though sassc has been at version 2.4.0 since June of 2020 and isn't likely to be updated, even a minor version bump on any other gem in your Gemfile means you're reinstalling sassc again.

Fortunately the fix is a trivial change in your Dockerfile: before your RUN bundle install command, add RUN gem install sassc:2.4.0.

The sassc install will be cached as its own Docker layer and then your subsequent bundle install will use the existing sassc from disk.

You can use this strategy for other rarely-changed gems with native extensions for more savings.

Altogether this looks like:

FROM ruby:3.2.1

# 1. Install gems that rarely change or are very slow
ENV RAILS_VERSION 6.1.7.2
RUN gem install rails --version "$RAILS_VERSION"

# pull sassc out and install early so we don't pay the price on each gem change
RUN gem install sassc:2.4.0
RUN gem install bundler


# 2. Install gems that change more frequently
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY Gemfile /usr/src/app/
COPY Gemfile.lock /usr/src/app/

# don't bother installing development and test gems
RUN bundle config set --local without 'development test'
RUN bundle install --jobs 4

# 3. Now move the app code in, because this changes every build
COPY . /usr/src/app

RUN RAILS_ENV=production bundle exec rake assets:precompile

That's it!

For us, pulling gem install sassc:2.4.0 into its own layer has saved 8 minutes per build. Hopefully this can help you too.

Speaking of things that are slow, have you ever wanted to change an environment variable without restarting your server? Ever wanted to change the log level without pushing a new build? Or perhaps you're simply interested in a fabulous Rails Feature Flag solution that has everything you need, but doesn't charge by the seat? If so, check us out here at Prefab!

· 9 min read
Jeff Dwyer

Introduction

If you're setting up tracking for a new project, the natural thing to do is follow the instructions from whatever product analytics tool or event hub you're planning to use: Segment / Amplitude / PostHog etc. Each of these will assign a trackingId for you, but I would suggest you take this into your own hands. Let's dive in and see why.

The Standard Solution

The current standard of user identification is basically:

  1. Generate a GUID cookie for anonymous visitors
  2. Track events against this GUID
  3. On login / account creation, send an identify call saying userId == guid
  4. When performing analysis, do it across all GUIDs and UserIds

This is how Segment, Amplitude, and PostHog work.

I'll get it out of the way now: this is definitely the most complete and "correct" solution. The problem, in my eyes, is that it's also the most complex to implement, analyze, and reason about. And sometimes bad things happen.

The complex method of user identification has gained popularity because on paper it provides a more comprehensive view of user behavior across multiple devices and browsers. By merging anonymous IDs with user IDs, anytime we realize that one set of activity is actually coming from a specific user, we can stitch them together. This sounds great.

The Problems

Challenging Analysis: The core issue is readily apparent from the diagram, however. We have 4 different streams. In a data warehouse, this is going to look like 4 different primary keys. In order to do any analysis, we're going to have to union all events from these 4 keys together every time and sort them. In particular, because of the changes to how browsers handle cookies, we are going to get lots and lots of tracking_ids per user, and at "analytics scale" this can be a non-trivial problem. Over time, trying to select * from events where tracking_id IN (select tracking_id from user_tracking_ids where user_id = 123) can simply fall over.

While this process generally works well within 3rd party tools, this is because it's really one of their core competencies and they have really optimized to solve it. That's great for them, but the problem is this isn't one of your core competencies. If you want to write all the events to the data warehouse, which I think you should, you're going to have a harder time. Now, do you really need to store raw events in your warehouse? It's your call, but I'd strongly encourage it. Owning your data means you control your destiny and being able to dive into the core of the issue with raw data is often key to solving thorny problems.

Interoperability & Portability The other issue is that this method of user identification is less portable. If you want to switch from Posthog to Amplitude, you're likely going to have to re-identify all your users. Similarly if you're looking to move data between systems or describe your data to another system, it's just awkward to not have a single immutable ID for each user. Say you want to build a simple realtime personalization API. You're going to build a kafka listener on your event stream, run some KSQL and save off something like "how many times person X looked at the pricing page". Should this API be responsible for understanding that all 4 GUIDs are actually the same person? That's a lot of work for a simple API.

Bad Merges: Accidental/incorrect merges are a real thing and can create an amazing amount of confusion. These merge-style systems are greedy. Their whole purpose in life is to join together different streams of the user experience into a unified view. But when a few users share a computer or have multiple accounts, suddenly these systems slurp two users into one and you have a real mess on your hands. Try explaining to everyone why you had 400 users in March last week, but now there are only 396 because your 3rd party analytics system accidentally merged two users. It's not fun.

Even in a "correct merge" situation, very weird things can happen. From the Amplitude docs

When users are merged, the user could "lose" user property values that were never meant to be changed (e.g. 'Start Version' or initial UTM parameters) because the new user property values will overwrite the original user property values. If you are a paying customer and this affects you, please reach out to our Support team.

Not That Useful: The reason this standard solution exists is that we would really love to be able to track an anonymous user across every device and browser they use, before they convert, so we can understand the full acquisition picture. This would be very useful to understand acquisition channels, but sadly it's not on the menu. The reality is that new privacy controls have made this close to impossible. Honestly this isn't a huge change. Even before new browser cookie policies, cross device tracking was dark magic that never really worked that well.

What we do get from this complex approach is typically:

  1. 1 stream of anonymous traffic that converts to a user [very useful]
  2. 5 streams of anonymous traffic that we never associate with a user [sad, but nothing to be done]
  3. 20 streams of anonymous traffic that then logs in. [not very useful]

Our simpler solution is going to give us #1 & #2 above and will track #3 but de-emphasize it.

A Simpler Solution

My preferred solution makes one big sacrifice and reaps a ton of simplicity. Here's the approach pictorially.

We now have a single stream of events for our user and we can select * from events where tracking_id = 'abc123'. This is a lot easier to reason about.

We also have a "marooned" stream. This is what it looks like when an existing user of yours comes to your site and logs in. It is not connected directly to the other stream (but we can connect it if we need to). This is the tradeoff. The core of the reasoning is that, in practice, detailed analysis of this funnel is just not that important. How much analysis happens of the create new user flow? A ton. How much of the login flow? Not much.

Our proposed solution simplifies user identification as follows:

  1. All pages/applications should set the same cookie for new visitors if it's not already set.
  2. Upon signup, transfer the GUID to a single tracking_id field on the user.
  3. On each page load, if a user is defined, set the tracking cookie to the user tracking_id. Otherwise, track using the cookie.
  4. If step #3 would change an existing cookie that means we have a crossover event. Track 2 "transition" events, one in each stream to aid analysis if you'd like.

This approach results in a single GUID per user, making analysis and identification much simpler, especially when it comes to exploring raw data in your data warehouse.
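Steps 3 and 4 can be sketched in a few lines of plain Ruby. Here, track_event is a stand-in for whatever analytics call you use, not a real API, and the function stands in for the per-page-load logic:

```ruby
require 'securerandom'

# Stand-in for your real analytics call (an assumption for illustration).
def track_event(name, props)
  puts "#{name}: #{props.inspect}"
end

# Decide which tracking_id to use for this page load (step 3) and emit
# transition events when a logged-in user's id displaces an existing
# anonymous cookie (step 4).
def resolve_tracking_id(cookies, user_tracking_id)
  cookie_id = cookies["tid"]

  if user_tracking_id && cookie_id && user_tracking_id != cookie_id
    # one transition event in each stream, to aid later stitching
    track_event("transition", tracking_id: cookie_id, switched_to: user_tracking_id)
    track_event("transition", tracking_id: user_tracking_id, switched_from: cookie_id)
  end

  user_tracking_id || cookie_id || SecureRandom.uuid
end

resolve_tracking_id({ "tid" => "anon-guid" }, "user-guid") # logged-in user wins
resolve_tracking_id({ "tid" => "anon-guid" }, nil)         # anonymous visitor keeps cookie
```

The returned id is what you'd write back to the cookie and hand to your analytics tools.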

Addressing Potential Concerns

  1. Working with 3rd party tools: Just because we're taking control of the tracking_id doesn't mean we shouldn't use 3rd party analytics tools. All we need to do is identify(tracking_id) and they'll respect our authority on the matter.
  2. Multiple devices and browsers: Post-login activity from different devices can still have the same tracking_id as long as your mobile apps follow the same protocol.
  3. Data loss during transition: Firing "switchover" events helps maintain data in the warehouse, enabling the possibility of stitching data together if needed.
  4. Data privacy and compliance: There's no significant impact on data privacy and compliance with GDPR or CCPA. When you go to delete a user you'll need to delete all this data, but that's true of any solution.

Applicability and Limitations

This simpler solution is well-suited for large B2C and B2B SaaS companies, where acquisition analysis is a priority. However, if re-acquisition is central to your business, this approach may not be the best fit.

Implementation

Here is a suggested implementation in Rails.

The core of the implementation is a class that generates a tracking_id and persists it to cookies. The rest is just plugging it into Rails and saving it to the user on signup.

class TrackingId
  COOKIE_KEY = "tid".freeze

  def self.build(user:, cookies:)
    builder = new(user, cookies)
    builder.persist_to_cookies
    builder.tracking_id
  end

  def self.new_tracking_id
    SecureRandom.uuid
  end

  def initialize(user = nil, cookies = {})
    @user = user
    @cookies = cookies
  end

  def tracking_id
    @tracking_id ||= user_tracking_id || cookie_tracking_id || self.class.new_tracking_id
  end

  def persist_to_cookies
    @cookies[COOKIE_KEY] = {
      value: tracking_id,
      expires: 1.year.from_now
    }
  end

  private

  def user_tracking_id
    @user.try(:tracking_id)
  end

  def cookie_tracking_id
    return if @cookies[COOKIE_KEY].blank?
    @cookies[COOKIE_KEY]
  end
end

All webpages a user might land on should set the cookie if it's not already set. Here's an example in JS.

function setGUIDCookie() {
  // Check if the tid cookie has already been set
  if (document.cookie.indexOf("tid=") === -1) {
    const guid = crypto.randomUUID();
    const expirationDate = new Date(
      new Date().setFullYear(new Date().getFullYear() + 1)
    ).toUTCString();
    document.cookie = `tid=${guid}; expires=${expirationDate}; SameSite=Lax; path=/`;
  }
}
setGUIDCookie();

Conclusion

This proposed solution offers a more straightforward approach to user identification for 3rd party tracking systems, simplifying analysis and reducing complexity, particularly when examining raw data in your data warehouse.

This works with all the existing tools like Segment, Amplitude, PostHog, etc. You'll just take control and send them the tracking_id you generate and don't rely on their merge logic.

In my experience, this simpler solution is better and leads to fewer headaches. Taking control of the identity puts you in the driver's seat and makes it easier to analyze across whatever analytics tools you decide to use.

Good luck!

· 4 min read
Jeff Dwyer

I didn't think I'd be writing this.

I really thought it would be a 3 line commit. All I wanted to know was how many streaming connections I had. DataDog was already set up and we were happily sending metrics to it, so I figured I'd just add a gauge and be done with it.

But here we are.

A Basic Gauge

Micrometer is a "Vendor-neutral application observability facade," which is Java speak for "a common library of metrics stuff like Counters, Timers, etc." If you want a basic "what is the level of X over time," a gauge is the meter you are looking for.

Here's a basic example of using a Gauge. This is a Micronaut example, but is pretty generalizable.

@Singleton
public class ConfigStreamMetrics {

  private final AtomicInteger projectConnections;

  @Inject
  public ConfigStreamMetrics(MeterRegistry meterRegistry) {
    projectConnections =
      meterRegistry.gauge(
        "config.broadcast.project-connections",
        Tags.empty(),
        new AtomicInteger()
      );
  }

  @Scheduled(fixedDelay = "1m")
  public void recordConnections() {
    projectConnections.set(calculateConnections());
  }
}

Ok, so with that code in place, and feeling pretty sure that calculateConnections() was returning a consistent value, you can imagine how I felt looking at the following chart, which shows my gauge value bouncing between 0, 1, and 2 (it should just be 2).

Why is my gauge not working?

What is happening here? The gauge is all over the place. It made sense to me that taking the avg was going to be wrong: if I have 2 servers, I don't want the average of the gauge on each of them, I want the sum. But I'm charting the sum() here and that doesn't explain what's happening.

The Key

The key is remembering how statsd with tagging works and discovering some surprising behavior from a default DataDog setup.

Metrics from micrometer come out looking like config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc.

As an aside, I'd highly recommend setting up a quick clone of statsd locally that just outputs to stdout when you're trying to get this all working.
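One throwaway way to do that without installing anything is a few lines of Ruby: bind a UDP socket on the statsd port and print whatever arrives. This is a sketch only; real statsd also aggregates, which this deliberately skips.

```ruby
require 'socket'

# A throwaway statsd stand-in: receive UDP datagrams and echo them to
# stdout so you can see the exact metric strings your app emits.
class EchoStatsd
  def initialize(port: 8125) # 8125 is the conventional statsd port
    @socket = UDPSocket.new
    @socket.bind("127.0.0.1", port)
  end

  # Receive and print a single datagram; loop over this in practice:
  #   server = EchoStatsd.new; loop { server.receive_one }
  def receive_one
    data, _addr = @socket.recvfrom(2048)
    puts data
    data
  end
end
```

Point your app's statsd output at localhost:8125 and watch the raw metric lines scroll by.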

The "aha" is that all of these metrics get aggregated based on just that string. So if you have

Server 1: config.broadcast.project-connections.connections:99|g|#statistic:value,type:grpc

Server 2: config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc

A gauge is expecting a single value at any given point, so what we end up with here is a heisengauge that could be either 0 or 99. Our sum doesn't work, because we don't have two data points to sum across. We just have one value that is flapping back and forth.
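A tiny simulation makes the clobbering concrete. This is an illustration of the aggregation behavior, not statsd's actual code: if the key is just the metric string, the last server to report wins.

```ruby
# Gauges are keyed on the full metric string; with no host/pod tag in
# that string, two servers share one slot and the last write wins.
gauges = {}

record_gauge = lambda do |metric_line|
  name, rest = metric_line.split(":", 2)
  value = rest.split("|", 2).first
  gauges[name] = value.to_i
end

record_gauge.call("config.broadcast.project-connections:99|g|#statistic:value,type:grpc") # server 1
record_gauge.call("config.broadcast.project-connections:0|g|#statistic:value,type:grpc")  # server 2

puts gauges["config.broadcast.project-connections"] # 0 -- server 1's 99 was clobbered
```

With a host or pod tag baked into the key, the two servers would occupy distinct slots and sum() would behave.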

The gotcha

Now we know what's up, and it's definitely a sad state of affairs. What we want is to output a different key per pod and then sum across those. But why aren't these metrics getting tagged with the pod?

It turns out that the Micronaut Micrometer DataDog reporter (https://micronaut-projects.github.io/micronaut-micrometer/latest/guide/#metricsAndReportersDatadog) sends metrics to DataDog directly, not through my local DataDog agent, which is normally responsible for adding these host & pod tags.

Since it goes straight there and we aren't explicitly sending a pod or host tag, these metrics are clobbering each other.

Two solutions

1) Point your metrics to your datadog agent and get the host tags that way

This makes a lot of sense, but I wasn't able to get it working easily.

2) Set CommonTags Yourself

The other solution is to calculate the same DataDog hostname that the datadog agent uses and manually add that as a commonTag to our MetricRegistry. Doing that looks like this:

@Order(Integer.MAX_VALUE)
@Singleton
@RequiresMetrics
public class MetricFactory
  implements MeterRegistryConfigurer<DatadogMeterRegistry>, Ordered {

  @Property(name = "gcp.project-id")
  protected String projectId;

  @Override
  public void configure(DatadogMeterRegistry meterRegistry) {
    List<Tag> tags = new ArrayList<>();
    addIfNotNull(tags, "env", "MICRONAUT_ENVIRONMENTS");
    addIfNotNull(tags, "service", "DD_SERVICE");
    addIfNotNull(tags, "version", "DD_VERSION");
    addIfNotNull(tags, "pod_name", "POD_ID");

    if (System.getenv("SPEC_NODENAME") != null) { // construct the hostname that the datadog agent uses
      final String hostName =
        "%s.%s".formatted(System.getenv("SPEC_NODENAME"), projectId);
      tags.add(Tag.of("host", hostName));
    }

    meterRegistry.config().commonTags(tags);
  }

  private void addIfNotNull(List<Tag> tags, String tagName, String envVar) {
    if (System.getenv(envVar) != null) {
      tags.add(Tag.of(tagName, System.getenv(envVar)));
    }
  }

  @Override
  public Class<DatadogMeterRegistry> getType() {
    return DatadogMeterRegistry.class;
  }
}

Passing the node & pod names in required some kubernetes yaml work so that the pod name and node name were available as environment variables.

spec:
  containers:
    - image: gcr.io/-----
      name: -----------
      env:
        - name: SPEC_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name

Wrap

With all of that in place we're finally in a good place. Our gauges are independently gauging and our sum is working as expected.

Yay