I didn't think I'd be writing this.
I really thought it would be a 3-line commit. All I wanted to know was how many streaming connections I had. DataDog was already set up and the app was happily sending metrics to it, so I figured I'd just add a gauge and be done with it.
But here we are.
A Basic Gauge
Micrometer is a "Vendor-neutral application observability facade", which is Java speak for "a common library of metrics stuff like Counters, Timers, etc." If you want a basic "what is the level of X over time", a gauge is the meter you are looking for.
Here's a basic example of using a Gauge. This is a Micronaut example, but it is pretty generalizable.
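import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micronaut.scheduling.annotation.Scheduled;
import jakarta.inject.Inject; // javax.inject on Micronaut 2.x and earlier
import jakarta.inject.Singleton;
import java.util.concurrent.atomic.AtomicInteger;

@Singleton
public class ConfigStreamMetrics {

    // The registry only holds a weak reference to the gauge's state object,
    // so keep a strong reference to it here.
    private final AtomicInteger projectConnections;

    @Inject
    public ConfigStreamMetrics(MeterRegistry meterRegistry) {
        projectConnections =
            meterRegistry.gauge(
                "config.broadcast.project-connections",
                Tags.empty(),
                new AtomicInteger()
            );
    }

    // calculateConnections() is application-specific and not shown here.
    @Scheduled(fixedDelay = "1m")
    public void recordConnections() {
        projectConnections.set(calculateConnections());
    }
}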
Ok, with that code in place, and feeling pretty sure that calculateConnections() was returning a consistent value, you can imagine how I felt looking at the chart, which showed my gauge value going all over the place from 0 to 1 to 2 (it should just be 2).
Why is my gauge not working?
What is happening here? The gauge is all over the place.
It made sense to me that taking the avg was going to be wrong: if I have 2 servers, I don't want the average of the gauge on each of them, I want the sum.
But I'm charting the sum() here, and that doesn't explain what's happening.
The Key
The key is remembering how statsd with tagging works and discovering some surprising behavior from a default DataDog setup.
Metrics from Micrometer come out looking like this:
config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc
As an aside, I'd highly recommend setting up a quick clone of statsd locally that just outputs to stdout when you're trying to get this all working.
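If you don't have a statsd clone handy, a few lines of Java get you most of the way there. This is only a minimal sketch: it assumes your registry is emitting statsd lines over UDP to localhost, and that 8125 (the conventional statsd port) is free. The class name StatsdStdout and the receiveOne helper are made up for this example.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

public class StatsdStdout {

    // Blocks until one UDP datagram arrives and returns its payload as text.
    static String receiveOne(DatagramSocket socket) throws IOException {
        byte[] buffer = new byte[8192];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.receive(packet);
        return new String(
            packet.getData(), packet.getOffset(), packet.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: metrics are sent to this port; pass a different one as the first argument.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8125;
        try (DatagramSocket socket = new DatagramSocket(port)) {
            System.out.println("Listening for statsd datagrams on UDP " + port);
            while (true) {
                // Print every raw line so the exact metric name and tags are visible.
                System.out.println(receiveOne(socket));
            }
        }
    }
}
```

It does no aggregation at all, which is the point: you see exactly the string each server is sending.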
The "aha" is that all of these metrics get aggregated based on just that string. So suppose you have:
Server 1:
config.broadcast.project-connections.connections:99|g|#statistic:value,type:grpc
Server 2:
config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc
A gauge is expecting a single value at any given point, so what we end up with here is a heisengauge that could be either 0 or 99. Our sum doesn't work, because we don't have two data points to sum across. We just have one value that is flapping back and forth.
The gotcha
Now we know what's up, and it's definitely a sad state of affairs. What we want is to output a different key per pod and then sum across those. But why aren't these metrics getting tagged with the pod?
It turns out that the Micronaut Datadog reporter (https://micronaut-projects.github.io/micronaut-micrometer/latest/guide/#metricsAndReportersDatadog) hits DataDog directly, not my local Datadog agent, which is normally responsible for adding these host & pod tags.
Since it goes straight there and we aren't explicitly sending a pod or host tag, these metrics are clobbering each other.
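To convince myself of the clobbering, it helps to model it. The sketch below is a toy and not how DataDog is actually implemented: GaugeStore is a hypothetical stand-in that keeps one value per "name plus tags" string, last write wins. The host:server-1 and host:server-2 tags are illustrative values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GaugeAggregationDemo {

    /** Hypothetical backend model: one stored value per series key, last write wins. */
    static final class GaugeStore {
        private final Map<String, Integer> series = new LinkedHashMap<>();

        // A series is identified by the metric name plus its tag string, nothing else.
        void report(String name, String tags, int value) {
            series.put(name + "|#" + tags, value);
        }

        int seriesCount() {
            return series.size();
        }

        // Sum of the latest value of every distinct series.
        int sum() {
            return series.values().stream().mapToInt(Integer::intValue).sum();
        }
    }

    public static void main(String[] args) {
        String name = "config.broadcast.project-connections.connections";

        // Identical tags on both servers: the second report overwrites the first.
        GaugeStore untagged = new GaugeStore();
        untagged.report(name, "statistic:value,type:grpc", 99); // server 1
        untagged.report(name, "statistic:value,type:grpc", 0);  // server 2
        System.out.println(untagged.seriesCount() + " series, sum=" + untagged.sum());
        // prints "1 series, sum=0"

        // A distinguishing host tag gives each server its own series to sum across.
        GaugeStore tagged = new GaugeStore();
        tagged.report(name, "statistic:value,type:grpc,host:server-1", 99);
        tagged.report(name, "statistic:value,type:grpc,host:server-2", 0);
        System.out.println(tagged.seriesCount() + " series, sum=" + tagged.sum());
        // prints "2 series, sum=99"
    }
}
```

With identical tags the "sum" is just whichever server reported last, which matches the flapping chart.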
Two solutions
1) Point your metrics to your datadog agent and get the host tags that way
This makes a lot of sense, but I wasn't able to get it working easily.
2) Set CommonTags Yourself
The other solution is to calculate the same DataDog hostname that the Datadog agent uses and manually add that as a commonTag to our MeterRegistry.
Doing that looks like this:
@Order(Integer.MAX_VALUE)
@Singleton
@RequiresMetrics
public class MetricFactory
    implements MeterRegistryConfigurer<DatadogMeterRegistry>, Ordered {

    @Property(name = "gcp.project-id")
    protected String projectId;

    @Override
    public void configure(DatadogMeterRegistry meterRegistry) {
        List<Tag> tags = new ArrayList<>();
        addIfNotNull(tags, "env", "MICRONAUT_ENVIRONMENTS");
        addIfNotNull(tags, "service", "DD_SERVICE");
        addIfNotNull(tags, "version", "DD_VERSION");
        addIfNotNull(tags, "pod_name", "POD_ID");
        if (System.getenv("SPEC_NODENAME") != null) {
            // construct the hostname that the datadog agent uses
            final String hostName =
                "%s.%s".formatted(System.getenv("SPEC_NODENAME"), projectId);
            tags.add(Tag.of("host", hostName));
        }
        meterRegistry.config().commonTags(tags);
    }

    // Adds a tag only when the named environment variable is set.
    private void addIfNotNull(List<Tag> tags, String tagName, String envVar) {
        if (System.getenv(envVar) != null) {
            tags.add(Tag.of(tagName, System.getenv(envVar)));
        }
    }

    @Override
    public Class<DatadogMeterRegistry> getType() {
        return DatadogMeterRegistry.class;
    }
}
Passing the node & pod names in required some Kubernetes YAML work so that the pod name and node name were available as environment variables:
spec:
  containers:
    - image: gcr.io/-----
      name: -----------
      env:
        - name: SPEC_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
Wrap
With all of that in place we're finally in a good place. Our gauges are independently gauging and our sum is working as expected.