CDNs are great, but they can be a bit of a black box. What if you want to see your CDN traffic broken down by customer? The CDN has that information, but how can we get it out?
Fastly's CDN hosts our customers' configuration data so that it is available even if Prefab's API is unavailable. To understand our CDN traffic broken down by customer, we set out to use Fastly's logging product to stream data back to us with information from our API responses, but we hit a bump in understanding the Varnish/Fastly object model.
The default Fastly log format doesn't contain a lot of data. To make downstream computation more straightforward, we set out to add a couple of headers to our API's configuration responses that would identify the API key used to request the configuration: prefab-key-id and prefab-env-id.
Second Problem: Getting New Headers Back In Fastly Logs
First, I thought it would be as easy as adding something like "prefab_env_key_id": "%{beresp.http.Prefab-EnvId}V" to the JSON log format to get the new headers shipped our way, but I was confronted with an error message indicating the vcl_log method didn't have access to that variable. I asked a question in the Fastly Developers community and got a helpful answer.
Fastly is a great user experience wrapped around the Varnish HTTP Cache, which shows in its references to VCL (Varnish Configuration Language) snippets and callbacks like vcl_fetch, vcl_deliver, and vcl_log. Each of those callbacks (subroutines) has access to different objects, so the trick is to use those subroutines to copy data from the origin response (known as beresp in Varnish) onto the cached response object resp, so that future invocations of vcl_log have access to it.
In the Fastly configuration UI, Content → Headers lets us do this without writing any VCL.
Now that the header Prefab-EnvId is set on the cached object, we can refer to it as resp.http.Prefab-EnvId in the JSON logging format.
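The exact field names are up to you. Assuming the companion key header is named Prefab-KeyId and follows the same pattern, the format entries might look like:

{
  "prefab_env_id": "%{resp.http.Prefab-EnvId}V",
  "prefab_key_id": "%{resp.http.Prefab-KeyId}V"
}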
After months of having all of our Datadog alerts for both staging and production routed to a single Slack #alerts channel, we all got tired of looking at that room with urgency only to find something happening in staging again. The team asked me to figure out how to create environment-specific alert channels; here's what I did.
Once you've set up a monitor with a Slack alert, you'll see something like this in the message: @slack-Prefab_Internal-alerts. That's a hyphen-delimited integration name, integration install name, and then the channel.
To target specific environments by the env tag that's usually available, edit the monitor's message like this: @slack-Prefab_Internal-alerts-{{env.name}}. The drop-down box below the message won't be helpful, but this works.
Then set up a Slack channel like #alerts-production.
You may also want to specify the “Multi Alert” aggregation options we chose.
import fetch from "node-fetch";
import {
  createConnection,
  InitializeResult,
  TextDocumentSyncKind,
  TextDocuments,
} from "vscode-languageserver/node";
import { TextDocument } from "vscode-languageserver-textdocument";

const connection = createConnection();
const documents: TextDocuments<TextDocument> = new TextDocuments(TextDocument);

type Edge = {
  node: {
    title: string;
    number: number;
    url: string;
    body: string;
  };
};

type Response = {
  data: { search: { edges: Edge[] } };
};

// Replace with your real GitHub access token
const accessToken = "YOUR_GITHUB_ACCESS_TOKEN";

const searchScope = "repo:facebook/docusaurus";

const search = async (input: string): Promise<Response> => {
  const query = `
{
  search(query: "${searchScope} ${input}", type: ISSUE, first: 10) {
    edges {
      node {
        ... on PullRequest {
          title
          number
          body
          url
        }
        ... on Issue {
          title
          number
          body
          url
        }
      }
    }
  }
}
`;

  const response = await fetch("https://api.github.com/graphql", {
    method: "POST",
    body: JSON.stringify({ query }),
    headers: {
      Authorization: `Bearer ${accessToken}`,
    },
  });

  if (!response.ok) {
    const error = {
      status: response.status,
      statusText: response.statusText,
      body: await response.json(),
    };

    connection.console.error(
      `Error fetching from GitHub: ${JSON.stringify(error)}`
    );
  }

  return await response.json();
};

connection.onInitialize(() => {
  const result: InitializeResult = {
    capabilities: {
      textDocumentSync: TextDocumentSyncKind.Incremental,
      // Tell the client that the server supports code completion
      completionProvider: {
        // We want to listen for the "#" character to trigger our completions
        triggerCharacters: ["#"],
      },
    },
  };

  return result;
});

connection.onCompletion(async (params) => {
  const doc = documents.get(params.textDocument.uri);

  if (!doc) {
    return null;
  }

  const line = doc.getText().split("\n")[params.position.line];

  // return early if we don't see our trigger character
  if (!line.includes("#")) {
    return null;
  }

  const input = line.slice(line.lastIndexOf("#") + 1);

  const json = await search(input);

  const items = json.data.search.edges.map((edge) => {
    return {
      label: `#${edge.node.number}: ${edge.node.title}`,
      detail: `${edge.node.title}\n\n${edge.node.body}`,
      textEdit: {
        range: {
          start: {
            line: params.position.line,
            character: line.lastIndexOf("#"),
          },
          end: {
            line: params.position.line,
            character: params.position.character,
          },
        },
        newText: `${edge.node.url}`,
      },
    };
  });

  return {
    isIncomplete: true,
    items,
  };
});

// Make the text document manager listen on the connection
// for open, change and close text document events
documents.listen(connection);

// Listen on the connection
connection.listen();
It works, but searchScope (e.g., repo:facebook/docusaurus) and our GitHub access token are hard-coded. That's fine for local development, but to get this language server ready to share, we should make those configurable.
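Our server can ask the editor for those values with the LSP configuration request. Here's a rough sketch of a getConfig helper using vscode-languageserver's connection.workspace.getConfiguration; the myLSP section name and the two setting keys are placeholders you'd also declare in your extension's package.json:

const getConfig = async () => {
  // Ask the client (VS Code) for our extension's settings
  const config = await connection.workspace.getConfiguration("myLSP");

  return {
    searchScope: config.searchScope,
    githubAccessToken: config.githubAccessToken,
  };
};

search can then read the scope and token from getConfig() instead of the hard-coded constants.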
Note that the myLSP prefix on the properties matches the string we use in our getConfig function.
Once you've repackaged and reinstalled your extension, you can modify the settings by pressing ctrl+shift+p (cmd+shift+p on Mac) and typing "Preferences: Open Settings (UI)".
Once in the Settings UI, search for myLSP to modify your settings.
Changes take effect immediately because we pull the settings in our server as needed.
Language servers let you meet developers where they are: their editor. They give you the power of in-editor tooling without rebuilding your tools for each editor you want to support. They can implement features like completion, go-to-definition, code actions, and more.
The Language Server Protocol (LSP) specifies how a language server should interface with an editor.
import {
  createConnection,
  InitializeResult,
  TextDocumentSyncKind,
  TextDocuments,
} from "vscode-languageserver/node";
import { TextDocument } from "vscode-languageserver-textdocument";

const connection = createConnection();
const documents: TextDocuments<TextDocument> = new TextDocuments(TextDocument);

connection.onInitialize(() => {
  const result: InitializeResult = {
    capabilities: {
      textDocumentSync: TextDocumentSyncKind.Incremental,
      // Tell the client that the server supports code completion
      completionProvider: {
        // We want to listen for the "#" character to trigger our completions
        triggerCharacters: ["#"],
      },
    },
  };

  return result;
});

// Make the text document manager listen on the connection
// for open, change and close text document events
documents.listen(connection);

// Listen on the connection
connection.listen();
After our imports, we set up a connection and some documents. The connection handles the communication between the editor (the client) and our code (the server). documents is a store of open files in our editor. documents stays in sync when files open, close, and change.
When the editor starts our server, onInitialize is fired. The editor sends some params describing what it can do (and what it wants our server to do), but we're not using those here, so we'll omit them. We reply with a list of things we can do (primarily, we offer completions triggered by #).
We tell documents to listen on the connection and the connection to listen for communication from the editor.
Our server could now run hosted by the editor (more on that later), but it doesn't do anything useful yet.
Let's add the initial hard-coded completion after our onInitialize.
connection.onCompletion(() => {
  return [
    {
      label: "Cannot set properties of undefined (setting 'minItems')",
      data: "https://github.com/facebook/docusaurus/issues/9271",
    },
    {
      label: "docs: fix typo in docs-introduction",
      data: "https://github.com/facebook/docusaurus/pull/9267",
    },
    {
      label: "Upgrade notification command does not copy properly",
      data: "https://github.com/facebook/docusaurus/issues/9239",
    },
  ];
});
While the LSP standardizes how language servers are implemented and how they talk to editors, there's no good standard for making your editor aware of a language server so that it's ready to be used. For this post, I'll cover VS Code and Neovim.
You connect a language server to VS Code by creating a thin extension layer. To make things as easy as possible for you, I've published a Minimum Viable VS Code Language Server Extension. Follow the directions in that repo and replace the entire contents of server/src/server.ts with the code we're writing in this post.
After following those directions, we'll also need to install node-fetch for the server since the node version packaged with VS Code is too old to include native fetch support.
cd server && npm install node-fetch@2 && npm install --save-dev @types/node-fetch && cd -
Now, you should be able to test your extension with the steps mentioned here. You should be able to open a new file, type #, and then see something like this:
For Neovim, you pass some configuration to vim.lsp.start. For testing, I've put the following in ~/.config/nvim/after/ftplugin/ruby.lua. Note that this location means it only runs in ruby files. Feel free to start the server elsewhere.
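Here's a minimal sketch of that file, assuming the server is built to communicate over stdio; adjust the cmd path to wherever your compiled server lives:

vim.lsp.start({
  name = "github-issues-lsp",
  cmd = { "node", "/path/to/server/out/server.js", "--stdio" },
})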
GitHub provides a GraphQL API. Get a GitHub access token and have it handy. It appears a classic token with "repo" permissions is required to get the search query to work correctly (at the time of writing).
Here's a query to search the Docusaurus repo for issues and PRs that include the word "bug."
{
  search(query: "repo:facebook/docusaurus bug", type: ISSUE, first: 10) {
    edges {
      node {
        ... on PullRequest {
          title
          number
          body
          url
        }
        ... on Issue {
          title
          number
          body
          url
        }
      }
    }
  }
}
We could replace repo:facebook/docusaurus with owner:facebook to scope the search to every repo owned by Facebook (that we have access to).
You can replace the searchScope with your organization or repo name when applying the changes below. Be sure to swap out the placeholder YOUR_GITHUB_ACCESS_TOKEN with your token. In a future post, we'll cover language server configuration to pass in our search scope and access token.
With this information and the documents we initialized earlier, we have everything we need to understand what the user was typing.
connection.onCompletion(async (params) => {
  const doc = documents.get(params.textDocument.uri);

  if (!doc) {
    return null;
  }

  const line = doc.getText().split("\n")[params.position.line];

  // return early if we don't see our trigger character
  if (!line.includes("#")) {
    return null;
  }

  const input = line.slice(line.lastIndexOf("#") + 1);

  // ...
After selecting our completion choice, you’ll notice that the # character hangs around. How do we prevent this? And why do we need a trigger character anyway?
If we didn't have a trigger character, this completion would trigger constantly as we type any code. That's wasteful for API requests and means we're doing unnecessary work -- we only want to show issues and PRs in very specific scenarios. So, the behavior should be opt-in, and a trigger character is beneficial.
To prevent the # from hanging around, we can replace our naive insertText with a TextEdit. TextEdits allow you to specify a range in the document to replace with new text. We'll modify our completion item formatting function:
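const items = json.data.search.edges.map((edge) => {
  return {
    label: `#${edge.node.number}: ${edge.node.title}`,
    detail: `${edge.node.title}\n\n${edge.node.body}`,
    textEdit: {
      range: {
        start: {
          line: params.position.line,
          character: line.lastIndexOf("#"),
        },
        end: {
          line: params.position.line,
          character: params.position.character,
        },
      },
      newText: `${edge.node.url}`,
    },
  };
});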
We put the user.id in one place, and it is now in all the log output. We could even put that in an around_action and have it in all the logging for our request.
logger.tagged "user-#{user.id}"do logger.tagged team.name do logger.info "hello" end end # The plain text version of the log reads nicely #INFO [user-1] [The Best Team] hello # But the JSON Log isn't ideal #{severity: INFO, message: "hello", tags: ["user-123", "The Best Team"]}
Almost every log aggregator supports JSON logging, and user-123 won't be as easy to search as {user: 123}. With a structured format, you can display and search the user column for 123. Additionally, the team name is just floating there without a key. The post Structured logging in Rails lays out the problem well. Which would your logging aggregator rather receive: a bag of tags or real key-value pairs?
🤔 But are these things you're tagging really "tags," or should they be new key-value pairs? In a practical example, the code is more likely something like this:
Rails.logger.tagged("BillingJob")do Rails.logger.tagged("User #{user.id}").info("nested tagged log message") end
and wouldn't that be better JSON as:
{ "class":"BillingJob", "user":1234 }
vs
{"tags":["BillingJob","User 1234"]}
If you're nodding along, at this point you're probably wondering whether we should just write:
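logger.info "nested tagged log message", class: "BillingJob", user: user.id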
As you can see, structured logging is pretty straightforward. It lets us do something like this:
log.info "hello",user:"1",team:"The best team" # Text Log # INFO hello user=1, team=The Best Team # JSON Log # {severity: INFO, message: "hello", user: 1, team: "The Best Team"}
Hey now, that really is better.
So, what libraries do this? Quite a few, actually.
Lograge is sort of a structured logger. It does output structured logs for your Rails requests, but it doesn't give you a way to structure your own logs. It has 101M downloads and is a proven tool, but I'm not including it here because we're talking about structured logging and tagged logging in general.
So, is Tagged Logger worse than Structured Logging?
Well, yes. Mostly. Honestly, I didn't have an opinion before I really dug in and researched this, but in my view structured logging gives us the same benefits as tagged logging in a cleaner, more machine-readable way that fits better with modern log aggregators.
🤔 However, there is one thing I found valuable in tagged logging: nested contexts. They're a convenient way to propagate context instead of adding it to each log entry. Does using structured logging mean we have to give that up?
No! Fortunately, we can do this in SemanticLogger as well. It looks like this:
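# A sketch using SemanticLogger's named tags; a Hash passed to `tagged`
# becomes key-value context on every log entry inside the block
require "semantic_logger"

SemanticLogger.add_appender(io: $stdout, formatter: :json)
logger = SemanticLogger["MyApp"]

logger.tagged(user: "user-123") do
  logger.tagged(team: "The Best Team") do
    logger.info "hello"
  end
end

# JSON Log (abbreviated)
# {"message": "hello", "named_tags": {"user": "user-123", "team": "The Best Team"}}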
I ended up writing this post because we're working on a dynamic log library, and I wanted to make sure we had all our bases covered with respect to structured logging and tagged logging.
SemanticLogger is pretty darn good, but if we're going to really DRY things up, we could reuse the context we already use for feature flags. Here's what that might look like.
My proposed ideal code
# application_controller.rb
class ApplicationController < ActionController::Base
  around_action do |_, block|
    # Put our context together in one place for the whole request
    $prefab.with_context({
      user: {
        key: current_user.tracking_id,
        id: current_user.id,
        name: current_user.name,
      }
    }, &block)
  end
end

# application.rb
module MyApplication
  class Application < Rails::Application
    # tell the logger to add the values of these contexts to log messages
    logger.add_context_keys "user.id", "user.name"

    # or add extra logging context in a block
    logger.with_context({ runtime: { value: 42 } }) do
      logger.info "hello"

      # Feature flag evaluation uses existing context for user.id and runtime.value
      $prefab.get("my-feature-flag")
    end
  end
end

# Log Output
# INFO hello user.id=1234, user.name=Bob Beemer, runtime.value=42
This is the approach we're planning on, but I'd love to hear your thoughts.
Programmatically changing the log level in Java is easy. The only up-front hitch is that you can't do it via SLF4J -- you'll need to use the underlying logging framework.
Log4j has a Configurator class we can use, and in Logback we simply get the logger and set the level. Here's what that looks like:
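A sketch of both; the wrapper class is just to show the two one-liners side by side (you wouldn't have both frameworks on the classpath at once):

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;
import org.slf4j.LoggerFactory;

public class LogLevelChanger {
  // Log4j 2: Configurator sets the level for a named logger at runtime
  public static void setLog4jLevel(String loggerName, Level level) {
    Configurator.setLevel(loggerName, level);
  }

  // Logback: cast the SLF4J logger to Logback's Logger, then set the level
  public static void setLogbackLevel(String loggerName, ch.qos.logback.classic.Level level) {
    ch.qos.logback.classic.Logger logger =
        (ch.qos.logback.classic.Logger) LoggerFactory.getLogger(loggerName);
    logger.setLevel(level);
  }
}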
But changing the level programmatically isn't really the goal. What you're actually after is changing the log level without restarting the application: getting more debug information out of a production system without redeploying, or turning off excessive logging that is costing you money in log ingestion.
Changing the log level programmatically is only half the battle (or maybe a tenth of the battle). In any real application, we now need to figure out how this code gets called.
This challenge is more complex, but it breaks down into two big parts:
Where does the user input the logger and level?
How does this system broadcast this information to all services?
Capturing the user's desire to change the log level
The developer wants more debug information about FooClass. How do we capture this intent? Collecting the changes comes in a few flavors.
Your developer can POST to an internal URL.
Something like POST /log-levels?logger=com.example.FooClass&level=DEBUG would work. The question will be how to broadcast this change to all instances of the service.
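A sketch of that endpoint, assuming Spring MVC and the Log4j Configurator from above:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LogLevelController {
  // POST /log-levels?logger=com.example.FooClass&level=DEBUG
  @PostMapping("/log-levels")
  public String setLevel(@RequestParam String logger, @RequestParam String level) {
    Configurator.setLevel(logger, Level.valueOf(level));
    return "Set " + logger + " to " + level;
  }
}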
Log4j supports JMX to trigger configuration reloads or edit the config file itself. There's even a client GUI! As with an API request to a custom endpoint, that'll only work on one application instance at a time.
Your developer can edit a shared log4j.properties or logback.xml somewhere. This is the most common way to do this, but sharing the file around can be an annoying infrastructure challenge, and the behavior if you corrupt the file is not always obvious. Log4j also supports (re)loading configuration from a URI, which is still basically just a remote shared file or files.
Maybe you build an internal UI, or you use something like Spring Boot Admin or Prefab. The Spring Boot Admin UI looks like this:
The Prefab UI is similar, giving us a hierarchical view as well as some feedback on log volume at each level.
A centralized UI may or may not have knowledge about the loggers that you're using.
Prefab knows both the logger names and how frequently they are being used, so it's able to render this helpful UI. That's useful so you don't have to type the logger name from memory, but also so you know how many messages/sec are happening at each logger. To do this, you'll need to send aggregate telemetry back to a central location on logging calls.
Broadcasting the change to all instances of the service
Once we've captured the intent to change the log level, we need to broadcast the change to all instances of the service. Some of the approaches, like shared files or URI-based configuration, handle this easily; others, like the API endpoint approach, have shortcomings in this department. Hit that custom logging-configuration API and the load balancer delivers the POST to a single instance of the service, where we change the log level. But... what about the other instances? Hope that was the one you wanted!
We could skip this step and only change the log level on the one instance that received the user intent. Depending on your use case, this may be sufficient. If you only have a single instance, or you're just looking for a sampling of logs, it should serve you well. If you have multiple instances and you're looking for a specific log message, it will not work, and it may lead to confusion.
We could implement some type of service discovery to propagate the change to peer application instances. If each instance is aware of the other instances, then we can broadcast the change to all of them. It's unlikely you're going to want to build this out yourself just for this logging use case, but if this is technology you already have in place, it could be a reasonable option.
We can listen to a pub/sub system for changes and programmatically configure the logging system using the code above. Your API endpoint could publish to a Redis pub/sub channel, and all instances would listen for changes. Sandeep has a full-fledged example of this (in Python) in route triggered state update across workers.
This is likely the easiest solution if you're going to DIY. Redis pub/sub is easy to use; you'll need to decide on the message format, but it's a pretty straightforward solution.
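Here's a sketch of the listener side using Jedis; the channel name and the "logger=LEVEL" message format are made up for the example:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class LogLevelListener {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      // Blocks and invokes onMessage for every published change
      jedis.subscribe(new JedisPubSub() {
        @Override
        public void onMessage(String channel, String message) {
          // e.g. "com.example.FooClass=DEBUG"
          String[] parts = message.split("=", 2);
          Configurator.setLevel(parts[0], Level.valueOf(parts[1]));
        }
      }, "log-level-changes");
    }
  }
}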
One wrinkle is that the log level change will reset as instances reboot. Pub/sub is fleeting: instances that are listening at the time of the change will set their levels, but new instances that start afterwards will never see it. If you want new instances to pick up the change, you'll need to use something like Kafka or Redis Streams, which persist the message and allow new instances to pick it up.
The last option is a bit different. Rather than listening for changes and setting the log level, we can use a log filter that decides whether or not to log each message by looking up the current level in a dynamic configuration system.
This dynamic configuration system will use some form of pub/sub to ensure it's working from the most current configuration, so this could be seen as a variation of #4, but it's distinct in that the pub/sub system isn't a one-off just for logging. This is what Prefab does: specifically, we use SSE to get the changes and then a Logback TurboFilter that looks up the current value. Pseudo-code for this looks like:
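Something like this; TurboFilter and FilterReply are real Logback classes, while the dynamic-config lookup is the hypothetical part:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.turbo.TurboFilter;
import ch.qos.logback.core.spi.FilterReply;
import org.slf4j.Marker;

public class DynamicConfigTurboFilter extends TurboFilter {
  @Override
  public FilterReply decide(Marker marker, Logger logger, Level level,
                            String format, Object[] params, Throwable t) {
    // Hypothetical: look up the current level for this logger in dynamic config
    Level configured = lookupConfiguredLevel(logger.getName());

    if (configured == null) {
      return FilterReply.NEUTRAL; // fall back to the normal Logback config
    }

    return level.isGreaterOrEqual(configured) ? FilterReply.ACCEPT : FilterReply.DENY;
  }

  private Level lookupConfiguredLevel(String loggerName) {
    // e.g. walk the logger-name hierarchy against your config store
    return null; // stub
  }
}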
Dynamic logging is a great way to get more information out of your production system without having to redeploy.
At first, the problem looks like a simple question of figuring out how to change the log level programmatically, but in the real world we see that the real challenge is:
Building an ergonomic solution for our developers to input their desired log levels
Propagating this change across our service instances
Building this ourselves is very doable. Adding a POST endpoint to our services and using Redis pub/sub to broadcast the change is a perfectly viable solution. Alternatively, a service like Prefab gives us the capability out of the box, with some nice bells and whistles like a UI, log volume information, and advanced targeting.
A health check is a page or endpoint that indicates whether an application can properly serve requests. Health checks are a useful diagnostic mechanism, and it's handy to have them for all of your application's dependencies. On the other hand, a failing health check can prevent your application instances from serving requests unaffected by the failing dependency, to the point of complete downtime. How do you keep your application's instances or pods healthy in the face of failing health checks?
There’s a fair amount written on the topic of health check management already (some links below), including approaches such as these:
Break services down so they have simpler functionality and dependency graphs
Categorize your critical dependencies differently from your nice-to-have dependencies
Separate liveness checks from readiness checks
Think about failing open vs. failing closed when all health checks have gone bad
These are the right approaches, but it's tough to get them right on the first shot. In practice, they require a fair amount of tuning to dial in the right behavior. The upshot is that it's fairly common to end up in a situation where a health check starts failing and your application instances or pods are withdrawn from service, but you'd prefer to keep them up and running.
This Redis healthcheck is failing and taking our pods out of service, but the apps are still mostly functional. How do we fix this immediately?
To deal with a problem right now, we took an alternative approach: dynamically manage an ignore-list of health checks that will be excluded from aggregation into an overall health status.
We recently encountered an issue with one of our Redis hosts that caused downtime in our staging environment as pods considered unhealthy were withdrawn from service (despite most core functionality being unaffected). Once we realized what was going on, I quickly subclassed Micronaut's DefaultHealthAggregator and created a dynamically configured ignore list to exclude certain health checks. I also added logging to gain better visibility into failing health checks. These changes give us more visibility and the flexibility to quickly handle similar dependency failures in the future.
Now in dynamic configuration we can add a list of health checks to ignore. We'll create a List<String> with the name micronaut.ignored.health.checks and then add the name of the health check we want to ignore. For testing, let's add redis to the list in staging.
Dynamic configuration will propagate this value out to all connected SDKs. But how do we know that it's all working? Well, luckily for us we have some useful debug logging. To verify that it's working, we can turn on debug logging for the PrefabHealthAggregator class in staging using dynamic logging.
Then we can look over in our logging aggregator and see that the health checks are being ignored:
WARN  10:52:10 Ignoring healthcheck redis with status UP and details {} based on prefab config micronaut.ignored.health.checks
DEBUG 10:52:10 passing healthcheck grpc-server with status UP and details {host=localhost, port=60001}
DEBUG 10:52:10 passing healthcheck diskSpace with status UP and details {total=101203873792, free=49808666624, threshold=10485760}
DEBUG 10:52:10 passing healthcheck service with status UP and details null
DEBUG 10:52:10 passing healthcheck compositeDiscoveryClient() with status UP and details {services={}}
DEBUG 10:52:10 passing healthcheck jdbc:postgresql://1.1.1.1:5432 with status UP and details {database=PostgreSQL, version=15.2}
Front-end logging can feel like finding a needle in an avalanche of haystacks.
Since the browser is the bridge between user interaction and the back-end, front-end logs have the potential to be a treasure trove of insights.
Datadog and others offer browser log collection, but everyone I've talked to who has tried it has turned it off. Why? Way too expensive.
How can we get meaningful signal from our front-end logging? Three steps:
Transports: We need to put some logic between the log statement and the aggregator. Learn how to direct your logs to the right places and why it matters.
Context: We need to give that logic context, so it can decide what it shouldLog. Crafting actionable error messages and adding the right context will make your logs meaningful.
Targeting: We need some rule evaluation for laser targeting. We'll go over how to selectively capture logs for troubleshooting without overwhelming your system.
Transports: Sending the right logs to the right places
Step one in front-end logging is figuring out how to get access to the content. It doesn't matter how good your logging is if it only exists in your end-user's browser.
You'll likely want to sign up with a SaaS for log aggregation rather than rolling your own. We'll use Datadog as our example.
After following the installation steps for @datadog/browser-logs, you're able to use datadogLogs.logger.info (or error, or debug, etc.) wherever you'd normally use console.log.
🤔 We probably don't want to replace console.log entirely, right? For local development, it'd sure be nice to see the logs in my browser rather than going to the Datadog UI.
This is where we start thinking in "transports." There's no reason all our logging should go to the same places. Instead, we can have a logger with intelligent routing. This opens up possibilities like:
Everything in local dev stays local in the console. Nothing is suppressed.
Production shows only ERROR messages in the console. Less severe messages don't show up.
Production sends anything INFO or more severe to Datadog.
Here's a rough sketch of what this might look like:
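(A simplified version without the Prefab targeting we'll add later; the complete code appears at the end of this post.)

const ERROR = 0;
const INFO = 1;
const DEBUG = 2;

const severityLookup = {
  [ERROR]: "error",
  [INFO]: "info",
  [DEBUG]: "debug",
};

// e.g. wraps datadogLogs from @datadog/browser-logs
const datadogTransport = (severity, message) =>
  datadogLogs.logger[severity](message);

const consoleTransport = (severity, message) => console[severity](message);

export const logger = (environment) => {
  const transports = [];

  if (environment === "production") {
    transports.push([datadogTransport, { defaultLevel: INFO }]);
    transports.push([consoleTransport, { defaultLevel: ERROR }]);
  } else {
    // local dev: everything goes to the console
    transports.push([consoleTransport, { defaultLevel: DEBUG }]);
  }

  const report = (severity, message) => {
    transports.forEach(([transport, options]) => {
      const { defaultLevel } = options || {};

      // only forward messages at least as severe as the transport's threshold
      if (severity <= (defaultLevel ?? DEBUG)) {
        transport(severityLookup[severity], message);
      }
    });
  };

  return {
    debug: (message) => report(DEBUG, message),
    info: (message) => report(INFO, message),
    error: (message) => report(ERROR, message),
  };
};

Context: Crafting actionable error messages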
Without further details, "Something went wrong during checkout" is inactionable and likely to frustrate the developer on the receiving end.
"[checkout]: Charge has an invalid parameter" is a better message but still not yet actionable. Who did this happen to? What were they trying to buy?
Our logs are most actionable when we couple a well-crafted error message with metadata about the state of the user and app. Datadog and other offerings support this metadata out-of-the-box.
Consider the following:
logger.error("[checkout]: charge has an invalid parameter",{ customerId:"cus_9s6XKzkNRiz8i3", cart:{plan:"Pro",credits: desiredCredits }, });
This logs something like:
ERROR: [checkout]: Charge has an invalid parameter | customerId=cus_9s6XKzkNRiz8i3 cart.plan="Pro" cart.credits=-10
They tried to buy the Pro plan and -10 credits? -10 looks weird! That's a thread to pull on.
Adding context to our example logger implementation is easy enough. We can just add an extra context variable (already supported by @datadog/browser-logs) throughout.
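Targeting: Capturing just the right logs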
We ship our logger and everything is working great. Developers are adding more logging and it is easier than ever to track down issues in local dev. But since we're being strategic about what gets shipped to Datadog, we're not paying for those console.debug lines.
We start getting reports that the activity feed is broken for the user with trackingId=abc-123. How can we use our approach to access the DEBUG-level logs for this user without recording those same DEBUG-level logs for every user?
You could tweak the report innards to consider the context:
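const report = (severity, message, context) => {
  // A hard-coded allow-list of tracking ids that get verbose logging
  const verboseTrackingIds = ["abc-123"];

  transports.forEach(([transport, options]) => {
    const { defaultLevel } = options || {};

    const allowVerbose =
      context && verboseTrackingIds.includes(context.trackingId);

    if (severity <= (defaultLevel ?? DEBUG) || allowVerbose) {
      transport(severityLookup[severity], message, context);
    }
  });
};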
This gives us a nice mechanism for allow-listing certain tracking ids for verbose logging.
logger.debug("test",{trackingId: user.trackingId}); // Examples logger.debug("test",{trackingId:"abc-123"});// This is sent to Datadog (trackingId matches) logger.debug("test",{trackingId:"xyz-999"});// This is not sent
A hard-coded array of tracking ids means we'll need to PR, merge, deploy, etc. to get this change out, but this is still a powerful approach. We'll get all the details we need to fix the user's problem. Once fixed, we can remove their tracking id from the array and do the PR/merge/deploy dance once more.
That all works great. You can call it a day and move on, confident and happy that you've enabled yourself and other devs to log better than ever.
But let me offer you a superpower: sign up for Prefab and install @prefab-cloud/prefab-cloud-js (or the React flavor). Now you can target based on any context data to send just the right logs to just the right places, without having to change any code after the initial setup.
Here are the changes needed to start using prefab.shouldLog:
import { prefab, Context } from "@prefab-cloud/prefab-cloud-js";

// Set up prefab with the context of the current user
const options = {
  apiKey: "YOUR-API-KEY-GOES-HERE",
  context: new Context({
    user: { trackingId: "abc-123", email: "test@example.com" },
    device: { mobile: true },
  }),
};

prefab.init(options);

// ...

export const logger = (environment, loggerName) => {
  // ...

  const report = (severity, message, context) => {
    transports.forEach(([transport, options]) => {
      const { defaultLevel } = options || {};

      // Use prefab to check if we should log based on the user context
      if (
        prefab.shouldLog({
          loggerName,
          desiredLevel: severityLookup[severity],
          defaultLevel: severityLookup[defaultLevel] ?? "debug",
        })
      ) {
        transport(severityLookup[severity], message, context);
      }
    });
  };

  // ...
};
Now we can use the tools in the Prefab UI to target trackingId abc-123 and add and remove users without shipping code changes. Because you can set up custom rules to target whatever you provide in your context, we can even do things like target specific users on specific devices.
Full updated code
import { prefab, Context } from "@prefab-cloud/prefab-cloud-js";

// Set up prefab with the context of the current user
const options = {
  apiKey: "YOUR-API-KEY-GOES-HERE",
  context: new Context({
    user: { trackingId: "abc-123", email: "test@example.com" },
    device: { mobile: true },
  }),
};

prefab.init(options);

const ERROR = 0;
const INFO = 1;
const DEBUG = 2;

const severityLookup = {
  [ERROR]: "error",
  [INFO]: "info",
  [DEBUG]: "debug",
};

const datadogTransport = (severity, message, context) =>
  datadog.logger[severity](message, context);

const consoleTransport = (severity, message, context) =>
  console[severity](message, context);

export const logger = (environment, loggerName) => {
  const transports = [];

  if (environment === "production") {
    transports.push([datadogTransport, { defaultLevel: INFO }]);
    transports.push([consoleTransport, { defaultLevel: ERROR }]);
  } else {
    transports.push([consoleTransport]);
  }

  const report = (severity, message, context) => {
    transports.forEach(([transport, options]) => {
      const { defaultLevel } = options || {};

      // Use prefab to check if we should log based on the user context
      if (
        prefab.shouldLog({
          loggerName,
          desiredLevel: severityLookup[severity],
          defaultLevel: severityLookup[defaultLevel] ?? "debug",
        })
      ) {
        transport(severityLookup[severity], message, context);
      }
    });
  };

  return {
    debug: (message, context) => {
      report(DEBUG, message, context);
    },
    info: (message, context) => {
      report(INFO, message, context);
    },
    error: (message, context) => {
      report(ERROR, message, context);
    },
  };
};
Wait, you forgot to tell me which logging library to install!
You might not need one. I'd hold off picking until I knew more about why I wanted one.
IMHO, there isn't a clear winner. Since the major players (e.g., pino and winston) support providing your own transports, you can pick whichever you prefer and change your mind later.
Google Spanner is a remarkable piece of technology. It's a globally distributed, strongly consistent, relational database that scales horizontally with a 99.999% SLA. With the recent addition of granular instance sizing, it's now possible to start with Spanner at $40/month. Under the hood, it even uses atomic clocks! How cool is that?
If you decide to try Spanner, right off the bat you've got to choose between two dialects, and it's not clear which one is the best choice.
Spanner comes in two flavors: GoogleSQL (an ANSI-compliant SQL) and PostgreSQL. Unfortunately, it's a permanent choice on a per-database basis, and it will have a significant impact on the code you write, so you need to choose wisely!
Google provides a guide that can help you make an informed decision regarding the choice between PostgreSQL and GoogleSQL, along with a description of the features and limitations of Postgres that are and are not supported. These documents are a good starting point for your decision-making process.
As a long-time fan of Postgres, I was initially happy that we had chosen the PostgreSQL dialect. However, after becoming more familiar with its limitations compared to GoogleSQL, the PostgreSQL dialect feels like a second-class citizen in Google Spanner, and we have recently started creating new tables in a GoogleSQL dialect database instead. Here are the details of our experience.
Spanner implements many Postgres functions, but when I started to do some in-database rollups of data within time windows, I quickly reached for good old date_trunc and found it unimplemented, along with date_bin. The list of supported Postgres date/time methods is short and undistinguished. GoogleSQL has much better date/time functions.
If you think you might query by date or timestamp in your database, you should probably use GoogleSQL.
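For example, the kind of daily rollup I wanted is a one-liner in GoogleSQL (table and column names are hypothetical):

SELECT TIMESTAMP_TRUNC(created_at, DAY) AS day, COUNT(*) AS events
FROM events
GROUP BY day
ORDER BY day;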
Google provides a Spanner emulator, runnable from a Docker image, for the GoogleSQL dialect, which enables cheap, conflict-free testing in local and CI/CD environments. PostgreSQL dialect users must pay to run tests against real Spanner database instances, with all the headaches that come with that approach. There's an open issue, but it's been almost a year since the last update.
The PostgreSQL Spanner driver makes you write ugly code like this to build statements:
Statement.Builder builder = Statement
    .newBuilder("SELECT * FROM cool_table WHERE project_id = $1 AND id >= $2")
    .bind("p1")
    .to(projectId)
    .bind("p2")
    .to(startAt);
The GoogleSQL Spanner driver lets you use this much more pleasant named-placeholder approach:
Statement.Builder builder = Statement
    .newBuilder("SELECT * FROM cool_table WHERE project_id = @projectId AND id >= @id")
    .bind("projectId")
    .to(projectId)
    .bind("id")
    .to(startAt);
The PostgreSQL dialect might be a good fit for an application already using Postgres that also happens not to use any of the missing Postgres features. For us, the compatibility isn't quite there, so we'll be steadily moving towards GoogleSQL.
Wouldn't it be great if you could log all of the gory details of a request, but only if it fails?
Choosing the appropriate logging level for our web app can be challenging. The DEBUG level yields detailed information, but the signal for failed requests can get lost in the noise of successful ones. The WARN level reduces volume but lacks the detail to help track down bugs.
Take this example:
Rails.log_level = :warn

class CriticalBusinessCode
  def calculate_grumpkin_quotient(apples, bananas)
    log.debug "calling fruit api with #{apples}, #{bananas}" # :( no info
    FruitApi.call(apples, bananas) # <= Throws mysterious "400 Bad Request"
  end
end
If we knew at the beginning of a request that it would fail, we'd set the log level to DEBUG (or even TRACE), right?
We don't have the technology to build a psychic Logger, but we can build the next-best thing: a Time-traveling Logger.
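Here's the standard behavior with Ruby's built-in Logger:

require "logger"

logger = Logger.new($stdout)
logger.level = Logger::WARN

logger.debug "getting ready to say hello" # not written
logger.info  "almost there"               # not written
logger.warn  "hello!"                     # written
logger.error "oh no"                      # written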
We told the logger ahead of time that we wanted WARN as our level. Because we set the level to WARN, we only see the WARN and ERROR output, not the DEBUG and INFO output.
How can we build on top of the core logger to allow specifying the level later in the program's execution — after some logging statements have already executed?
The key idea is to defer flushing the logs until the end of the request. We don’t have the context we need to make the verbosity decision up-front, but when the request finishes, we will. We just need a place to stash these logs while the request processes.
Building on top of the core Logger, we can implement a queue by overriding the add method:
require "logger"

class QueuedLogger < Logger
  def initialize(*args)
    super
    @queue = []
  end

  def add(severity, message = nil, progname = nil)
    @queue << -> { super(severity, message, progname) }
  end

  def flush!(level = nil)
    old_level = self.level
    self.level = level if level

    @queue.each(&:call)
    @queue = []
  ensure
    self.level = old_level
  end
end
The implementation here isn't too exciting (in a good way). .debug, .info, etc. all call add. With this change, we don't immediately write the message; instead, we throw it in a queue to be logged later.
When we're ready to dump out the logs, we can specify the level we want.
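For example, building on the class above:

logger = QueuedLogger.new($stdout)
logger.level = Logger::WARN

logger.debug "debug details"
logger.warn "something suspicious"

# Nothing has been written yet. Flushing at :debug writes everything:
logger.flush!(:debug)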
Let’s imagine we’re writing software for managing appointments for physicians. We have a background job to remind patients about upcoming appointments via SMS.
class AppointmentReminderJob
  def perform(doctor, patient, time, timezone)
    logger.debug "Scheduling appointment for #{doctor} and #{patient} on #{time} in #{timezone}"

    message = compose_message(doctor, patient, time, timezone)
    SMS.send(message, patient.phone)
  end

  def compose_message(doctor, patient, time, timezone)
    # ...
  end

  def logger
    @logger ||= Logger.new($stdout).tap do |logger|
      logger.level = Logger::WARN
    end
  end
end
If an exception happens when sending the reminder (perform), we'd like as much detail as possible. Unfortunately, our default level of WARN means that the logger.debug statement is never written.
We could set the default level to DEBUG. While the logger.debug line here is useful if something goes wrong, it clutters our logs with noise when everything is working as intended.
We can get the best of both worlds by applying our QueuedLogger to capture debug-level messages only when an exception occurs.
Here’s the updated class:
class AppointmentReminderJob
  def perform(doctor, patient, time, timezone)
    logger.debug "Scheduling appointment for #{doctor} and #{patient} on #{time} in #{timezone}"

    message = compose_message(doctor, patient, time, timezone)
    SMS.send(message, patient.phone)
  rescue => ex
    logger.error(ex)

    # write at the debug level so we get all the details
    logger.flush!(:debug)
  ensure
    # write at the default level we specified (WARN in this example)
    logger.flush!
  end

  def compose_message(doctor, patient, time, timezone)
    # ...
  end

  def logger
    @logger ||= QueuedLogger.new($stdout).tap do |logger|
      logger.level = Logger::WARN
    end
  end
end
QueuedLogger isn't quite ready to be thrown into your Rails app yet with config.logger = QueuedLogger.new(...). To get ready for prime-time, we'd want to isolate the queue for the request (maybe using ActiveSupport::CurrentAttributes, concurrent-ruby, or RequestStore).
If you're interested in a battle-hardened version of this logger that is ready to use, let us know and I'll see what I can put together :)
There's an open issue from March of 2020 titled "sassc is very slow to compile and install." The issue has people pleading for help and asking for sassc-ruby to ship precompiled binaries (as nokogiri does). The place this hurts the most is building your Rails app with Docker, where you can pay a 10+ minute install time every time you modify any part of your Gemfile.lock.
Oof.
I have good news for those still stuck on sassc: Your Docker builds don't have to be slow.
Docker enthusiasts know that layer caching is a huge time saver. But modifying your Gemfile.lock breaks your bundle install layer and causes the previous cache to be unusable. Even though sassc has been at version 2.4.0 since June of 2020 and isn't likely to be updated, even a minor version bump on any other gem in your Gemfile means you're reinstalling sassc again.
Fortunately the fix is a trivial change in your Dockerfile: before your RUN bundle install command, add RUN gem install sassc:2.4.0.
The sassc install will be cached as its own Docker layer and then your subsequent bundle install will use the existing sassc from disk.
You can use this strategy for other rarely-changed gems with native extensions for more savings.
Altogether this looks like:
FROM ruby:3.2.1

# 1. Install gems that rarely change or are very slow
ENV RAILS_VERSION 6.1.7.2
RUN gem install rails --version "$RAILS_VERSION"

# pull sassc out and install early so we don't pay the price on each gem change
RUN gem install sassc:2.4.0

RUN gem install bundler

# 2. Install gems that change more frequently
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

COPY Gemfile /usr/src/app/
COPY Gemfile.lock /usr/src/app/

# don't bother installing development and test gems
RUN bundle config set --local without 'development test'
RUN bundle install --jobs 4

# 3. Now move the app code in, because this changes every build
COPY . /usr/src/app

RUN RAILS_ENV=production bundle exec rake assets:precompile
For us, pulling gem install sassc:2.4.0 out of bundle install has saved 8 minutes per build. Hopefully this can help you too.
Speaking of things that are slow, have you ever wanted to change an environment variable without restarting your server? Ever wanted to change the log level without pushing a new build? Or perhaps you're simply interested in a fabulous Rails feature flag solution that has everything you need but doesn't charge by the seat? If so, check us out here at Prefab!
If you're setting up tracking for a new project, the natural thing to do is follow the instructions from whatever product analytics tool or event hub you're planning to use: Segment / Amplitude / PostHog etc. Each of these will assign a trackingId for you, but I would suggest you take this into your own hands. Let's dive in and see why.
I'll get it out of the way now: this is definitely the most complete and "correct" solution. The problem, in my eyes, is that it's also the most complex to implement, analyze, and reason about. And sometimes bad things happen.
The complex method of user identification has gained popularity because, on paper, it provides a more comprehensive view of user behavior across multiple devices and browsers. By merging anonymous IDs with user IDs, any time we realize that one set of activity is actually coming from a specific user, we can stitch them together. This sounds great.
Challenging Analysis:
The core issue is readily apparent from the diagram, however. We have 4 different streams. In a data warehouse, this is going to look like 4 different primary keys. To do any analysis, we're going to have to union all the events from these 4 keys together every time and sort them. Because of the changes to how browsers handle cookies, we're going to get lots and lots of tracking_ids per user, and at "analytics scale" this can be a non-trivial problem. Over time, trying to select * from events where tracking_id IN (select tracking_id from user_tracking_ids where user_id = 123) can simply fall over.
This process generally works well within 3rd-party tools because it's one of their core competencies and they have optimized to solve it. That's great for them, but the problem is that it isn't one of your core competencies. If you want to write all the events to the data warehouse, which I think you should, you're going to have a harder time. Now, do you really need to store raw events in your warehouse? It's your call, but I'd strongly encourage it. Owning your data means you control your destiny, and being able to dive into the core of an issue with raw data is often key to solving thorny problems.
Interoperability & Portability
The other issue is that this method of user identification is less portable. If you want to switch from PostHog to Amplitude, you're likely going to have to re-identify all your users. Similarly, if you're looking to move data between systems or describe your data to another system, it's just awkward not to have a single immutable ID for each user. Say you want to build a simple realtime personalization API. You're going to build a Kafka listener on your event stream, run some KSQL, and save off something like "how many times person X looked at the pricing page." Should this API be responsible for understanding that all 4 GUIDs are actually the same person? That's a lot of work for a simple API.
Bad Merges: Accidental/Incorrect merges are a real thing and can create an amazing amount of confusion.
These merge-style systems are greedy. Their whole purpose in life is to join different streams of the user experience into a unified view. But a few users share a computer or have multiple accounts, and suddenly these systems slurp two users into one and you have a real mess on your hands. Try explaining why you had 400 users in March last week, but now there are only 396 because your 3rd-party analytics system accidentally merged two users. It's not fun.
Even in a "correct merge" situation, very weird things can happen.
From the Amplitude docs
When users are merged, the user could "lose" user property values that were never meant to be changed (e.g. 'Start Version' or initial UTM parameters) because the new user property values will overwrite the original user property values. If you are a paying customer and this affects you, please reach out to our Support team.
Not That Useful: The reason this standard solution exists is that we would really love to track an anonymous user across every device and browser they use, before they convert, so we can understand the full acquisition picture. That would be very useful for understanding acquisition channels, but sadly it's not on the menu. The reality is that new privacy controls have made it close to impossible. Honestly, this isn't a huge change: even before the new browser cookie policies, cross-device tracking was dark magic that never really worked that well.
What we do get from this complex approach is typically:
1 stream of anonymous traffic that converts to a user [very useful]
5 streams of anonymous traffic that we never associate with a user [sad, but nothing to be done]
20 streams of anonymous traffic that then log in. [not very useful]
Our simpler solution is going to give us #1 & #2 above and will track #3 but de-emphasize it.
My preferred solution makes one big sacrifice and reaps a ton of simplicity. Here's the approach pictorially.
We now have a single stream of events for our user and we can select * from events where tracking_id = 'abc123'. This is a lot easier to reason about.
We also have a "marooned" stream. This is what it looks like when an existing user comes to your site and logs in. It is not connected directly to the other stream (but we can connect it if we need to). This is the tradeoff. The core of the reasoning is that, in practice, detailed analysis of this funnel is just not that important. How much analysis happens on the create-new-user flow? A ton. How much on the login flow? Not much.
Our proposed solution simplifies user identification as follows:
All pages/applications should set the same cookie for new visitors if it's not already set.
Upon signup, transfer the GUID to a single tracking_id field on the user.
On each page load, if a user is defined, set the tracking cookie to the user tracking_id. Otherwise, track using the cookie.
If step #3 would change an existing cookie that means we have a crossover event. Track 2 "transition" events, one in each stream to aid analysis if you'd like.
This approach results in a single GUID per user, making analysis and identification much simpler, especially when it comes to exploring raw data in your data warehouse.
Working with 3rd-party tools: Just because we're taking control of the tracking_id doesn't mean we shouldn't use 3rd-party analytics tools. All we need to do is identify(tracking_id) and they'll respect our authority on the matter.
Multiple devices and browsers: Post-login activity from different devices can still have the same tracking_id, as long as your mobile apps follow the same protocol.
Data loss during transition: Firing "switchover" events helps maintain data in the warehouse, enabling the possibility of stitching data together if needed.
Data privacy and compliance: There's no significant impact on data privacy and compliance with GDPR or CCPA. When you go to delete a user you'll need to delete all this data, but that's true of any solution.
This simpler solution is well-suited for large B2C and B2B SaaS companies, where acquisition analysis is a priority. However, if re-acquisition is central to your business, this approach may not be the best fit.
The core of the implementation is a class that generates a tracking_id and persists it to cookies. The rest is just plugging it into Rails and saving it to the user on signup.
class TrackingId
  COOKIE_KEY = "tid".freeze

  def self.build(user:, cookies:)
    builder = new(user, cookies)
    builder.persist_to_cookies
    builder.tracking_id
  end

  def initialize(user = nil, cookies = {})
    @user = user
    @cookies = cookies
  end

  def tracking_id
    @tracking_id ||= user_tracking_id || cookie_tracking_id || self.class.new_tracking_id
  end

  def persist_to_cookies
    @cookies[COOKIE_KEY] = { value: tracking_id, expires: 1.year.from_now }
  end

  private

  def user_tracking_id
    @user.try(:tracking_id)
  end

  def cookie_tracking_id
    return if @cookies[COOKIE_KEY].blank?

    @cookies[COOKIE_KEY]
  end

  def self.new_tracking_id
    SecureRandom.uuid
  end
end
All webpages a user might land on should set the cookie if it's not already set. Here's an example in JS.
function setGUIDCookie() {
  // Check if the tid cookie has already been set
  if (document.cookie.indexOf("tid=") === -1) {
    const guid = crypto.randomUUID();
    const expirationDate = new Date(
      new Date().setFullYear(new Date().getFullYear() + 1)
    ).toUTCString();
    document.cookie = `tid=${guid}; expires=${expirationDate}; SameSite=Lax; path=/`;
  }
}

setGUIDCookie();
This proposed solution offers a more straightforward approach to user identification for 3rd party tracking systems, simplifying analysis and reducing complexity, particularly when examining raw data in your data warehouse.
This works with all the existing tools like Segment, Amplitude, PostHog, etc. You'll just take control and send them the tracking_id you generate and don't rely on their merge logic.
In my experience, this simpler solution is better and leads to fewer headaches. Taking control of the identity puts you in the driver's seat and makes it easier to analyze across whatever analytics tools you decide to use.
I really thought it would be a 3-line commit. All I wanted to know was how many streaming connections I had. Datadog was already set up and we were happily sending metrics to it, so I figured I'd just add a gauge and be done with it.
Micrometer is a "Vendor-neutral application observability facade," which is Java-speak for "a common library of metrics stuff like Counters, Timers, etc."
If you want a basic "what is the level of X over time", a gauge is the meter you are looking for.
Here's a basic example of using a Gauge. This is a Micronaut example, but is pretty generalizable.
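A sketch; Gauge.builder is the real Micrometer API, and here it watches an AtomicInteger that you'd increment and decrement as connections come and go:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import jakarta.inject.Singleton;

import java.util.concurrent.atomic.AtomicInteger;

@Singleton
public class ConnectionMetrics {
  private final AtomicInteger connections = new AtomicInteger();

  public ConnectionMetrics(MeterRegistry registry) {
    // The gauge samples the current value whenever metrics are published
    Gauge.builder("config.broadcast.project-connections.connections", connections, AtomicInteger::get)
        .tag("type", "grpc")
        .register(registry);
  }

  public void connectionOpened() { connections.incrementAndGet(); }

  public void connectionClosed() { connections.decrementAndGet(); }
}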
OK: with that code in place, I felt pretty sure that calculateConnections() was returning a consistent value. So you can imagine how I felt looking at the following, which shows my gauge value going all over the place from 0 to 1 to 2 (it should just be 2).
What is happening here? The gauge is all over the place.
It made sense to me that taking the avg was going to be wrong: if I have 2 servers, I don't want the average of the gauge on each of them, I want the sum.
But I'm charting the sum() here, and that doesn't explain what's happening.
The key is remembering how statsd with tagging works and discovering some surprising behavior from a default DataDog setup.
Metrics from Micrometer come out looking like config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc.
As an aside, I'd highly recommend setting up a quick clone of statsd locally that just outputs to stdout when you're trying to get this all working.
The "aha" is that all of these metrics get aggregated based on just that string. So if you have
Server 1:
config.broadcast.project-connections.connections:99|g|#statistic:value,type:grpc
Server 2:
config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc
A gauge is expecting a single value at any given point in time, so what we end up with here is a heisengauge that could be either 0 or 99. Our sum doesn't work because we don't have two data points to sum across; we just have one value that is flapping back and forth.
Now we know what's up, and it's definitely a sad state of affairs.
What we do want is outputting a different key per pod and then summing across those.
But why aren't these metrics getting tagged with the pod?
It turns out that the Micronaut Micrometer Datadog registry (https://micronaut-projects.github.io/micronaut-micrometer/latest/guide/#metricsAndReportersDatadog) hits Datadog directly, not my local Datadog agent, which is normally responsible for adding the host and pod tags. Since the metrics go straight there and we aren't explicitly sending a pod or host tag, they clobber each other.
The other solution is to calculate the same hostname that the Datadog agent uses and manually add it as a common tag on our MeterRegistry.
Doing that looks like this:
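import io.micrometer.core.instrument.MeterRegistry;

import java.net.InetAddress;
import java.net.UnknownHostException;

public class CommonTags {
  // Tag every metric from this instance with its hostname so gauges from
  // different pods stop clobbering each other. Matching the exact hostname
  // the Datadog agent reports is environment-specific; this is a sketch.
  public static void apply(MeterRegistry registry) throws UnknownHostException {
    String hostname = InetAddress.getLocalHost().getHostName();
    registry.config().commonTags("host", hostname);
  }
}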