Some Opinions on Tracking Users
Introduction
If you're setting up tracking for a new project, the natural thing to do is follow the instructions from whatever product analytics tool or event hub you're planning to use: Segment / Amplitude / PostHog etc. Each of these will assign a trackingId for you, but I would suggest you take this into your own hands. Let's dive in and see why.
The Standard Solution
The current standard of user identification is basically:
- Generate a GUID Cookie for anonymous visitors
- Track events against this GUID
- On login / account creation, send an identify call saying userId == guid
- When performing analysis, do it across all GUIDs and userIds.
This is how Segment, Amplitude, and PostHog work.
I'll get it out of the way now: this is definitely the most complete and "correct" solution. The problem in my eyes is that it's also the most complex to implement, analyze and reason about. And sometimes bad things happen.
The complex method of user identification has gained popularity because on paper it provides a more comprehensive view of user behavior across multiple devices and browsers. By merging anonymous IDs with user IDs, anytime we realize that one set of activity is actually coming from a specific user, we can stitch them together. This sounds great.
The Problems
Challenging Analysis:
The core issue, however, is readily apparent from the diagram. We have 4 different streams. In a data warehouse, this is going to look like 4 different primary keys. In order to do any analysis, we're going to have to union all events from these 4 keys together every time and sort them. In particular, because of the changes to how browsers handle cookies, we are going to get lots and lots of tracking_ids per user, and at "analytics scale" this can be a non-trivial problem. Over time, a query like select * from events where tracking_id IN (select tracking_id from user_tracking_ids where user_id = 123) can simply fall over.
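To make the shape of that lookup concrete, here is a minimal Ruby sketch of the same query over in-memory data. The USER_TRACKING_IDS and EVENTS arrays and the events_for_user helper are hypothetical stand-ins for warehouse tables, not any vendor's schema.

```ruby
# Hypothetical stand-ins for two warehouse tables (illustrative data only).
USER_TRACKING_IDS = [
  { user_id: 123, tracking_id: "guid-a" },
  { user_id: 123, tracking_id: "guid-b" },
  { user_id: 123, tracking_id: "guid-c" },
  { user_id: 456, tracking_id: "guid-d" }
].freeze

EVENTS = [
  { tracking_id: "guid-b", name: "viewed_pricing", at: 3 },
  { tracking_id: "guid-a", name: "landed",         at: 1 },
  { tracking_id: "guid-d", name: "landed",         at: 2 },
  { tracking_id: "guid-c", name: "signed_up",      at: 4 }
].freeze

# Equivalent of the IN (subquery) lookup: resolve every tracking_id for the
# user, gather events from all of those streams, then sort into one timeline.
def events_for_user(user_id)
  ids = USER_TRACKING_IDS.select { |row| row[:user_id] == user_id }
                         .map { |row| row[:tracking_id] }
  EVENTS.select { |event| ids.include?(event[:tracking_id]) }
        .sort_by { |event| event[:at] }
end

timeline = events_for_user(123)
puts timeline.map { |event| event[:name] }.join(" -> ")
# prints landed -> viewed_pricing -> signed_up
```

Every analysis has to pay for the ID resolution step first, and that step grows with the number of tracking_ids each user accumulates.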
While this process generally works well within 3rd party tools, this is because it's really one of their core competencies and they have really optimized to solve it. That's great for them, but the problem is this isn't one of your core competencies. If you want to write all the events to the data warehouse, which I think you should, you're going to have a harder time. Now, do you really need to store raw events in your warehouse? It's your call, but I'd strongly encourage it. Owning your data means you control your destiny and being able to dive into the core of the issue with raw data is often key to solving thorny problems.
Interoperability & Portability: The other issue is that this method of user identification is less portable. If you want to switch from PostHog to Amplitude, you're likely going to have to re-identify all your users. Similarly, if you're looking to move data between systems or describe your data to another system, it's just awkward to not have a single immutable ID for each user. Say you want to build a simple realtime personalization API. You're going to build a Kafka listener on your event stream, run some KSQL and save off something like "how many times person X looked at the pricing page". Should this API be responsible for understanding that all 4 GUIDs are actually the same person? That's a lot of work for a simple API.
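For contrast, here is roughly how small that listener could be if every event already carried one immutable ID per user. This is a minimal in-memory sketch, not a Kafka or KSQL implementation; the PricingPageCounter class and the event shape are hypothetical.

```ruby
# Hypothetical in-memory stand-in for a stream consumer. A real one would
# read from Kafka, which is out of scope for this sketch.
class PricingPageCounter
  def initialize
    @views = Hash.new(0) # tracking_id => number of pricing page views
  end

  # Called once per event. No identity resolution happens here because each
  # user is assumed to have exactly one immutable tracking_id.
  def consume(event)
    return unless event[:name] == "page_view" && event[:path] == "/pricing"

    @views[event[:tracking_id]] += 1
  end

  def views_for(tracking_id)
    @views[tracking_id]
  end
end

counter = PricingPageCounter.new
[
  { tracking_id: "abc123", name: "page_view", path: "/pricing" },
  { tracking_id: "abc123", name: "page_view", path: "/docs" },
  { tracking_id: "abc123", name: "page_view", path: "/pricing" },
  { tracking_id: "def456", name: "page_view", path: "/pricing" }
].each { |event| counter.consume(event) }

puts counter.views_for("abc123") # prints 2
```

With merged identities, the same counter would first need a lookup from each of the 4 GUIDs to the person they belong to, and that lookup can change after the fact.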
Bad Merges: Accidental/incorrect merges are a real thing and can create an amazing amount of confusion. These merge-style systems are greedy. Their whole purpose in life is to join together different streams of the user experience into a unified view. However, a few users share a computer or have multiple accounts and suddenly these systems slurp two users into one and you have a real mess on your hands. Try explaining to everyone why March showed 400 users last week, but now shows only 396 because your 3rd party analytics system accidentally merged a few users. It's not fun.
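A toy model shows how greedy merging goes wrong. The IdentityGraph class below is a hypothetical union-find sketch, not how Segment, Amplitude, or PostHog actually implement merging; it only assumes that every identify call links a device GUID to a user ID.

```ruby
# Toy union-find over identifiers: every identify call links two IDs, and
# anything linked transitively counts as the same person.
class IdentityGraph
  def initialize
    @parent = {}
  end

  # Returns the canonical representative for an ID, registering it if new.
  def find(id)
    @parent[id] ||= id
    @parent[id] = find(@parent[id]) unless @parent[id] == id
    @parent[id]
  end

  # Models identify(user_id, anonymous_id): merge the two streams.
  def identify(user_id, anonymous_id)
    @parent[find(anonymous_id)] = find(user_id)
  end

  # Number of distinct "people" among the given user IDs.
  def distinct_users(user_ids)
    user_ids.map { |id| find(id) }.uniq.size
  end
end

graph = IdentityGraph.new
graph.identify("user:alice", "guid-laptop")
graph.identify("user:bob", "guid-phone")
puts graph.distinct_users(["user:alice", "user:bob"]) # prints 2

# Bob logs in once on Alice's laptop: the shared GUID links both accounts.
graph.identify("user:bob", "guid-laptop")
puts graph.distinct_users(["user:alice", "user:bob"]) # prints 1
```

One shared device is enough to collapse two accounts, and nothing in the merge step can tell that apart from a legitimate cross-device link.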
Even in a "correct merge" situation, very weird things can happen. From the Amplitude docs:
When users are merged, the user could "lose" user property values that were never meant to be changed (e.g. 'Start Version' or initial UTM parameters) because the new user property values will overwrite the original user property values. If you are a paying customer and this affects you, please reach out to our Support team.
Not That Useful: The reason this standard solution exists is that we would really love to be able to track an anonymous user across every device and browser they use, before they convert, so we can understand the full acquisition picture. This would be very useful to understand acquisition channels, but sadly it's not on the menu. The reality is that new privacy controls have made this close to impossible. Honestly this isn't a huge change. Even before new browser cookie policies, cross device tracking was dark magic that never really worked that well.
What we do get from this complex approach is typically:
1. 1 stream of anonymous traffic that converts to a user [very useful]
2. 5 streams of anonymous traffic that we never associate with a user [sad, but nothing to be done]
3. 20 streams of anonymous traffic that then log in [not very useful]
Our simpler solution is going to give us #1 & #2 above and will track #3 but de-emphasize it.
A Simpler Solution
My preferred solution makes one big sacrifice and reaps a ton of simplicity. Here's the approach pictorially.
We now have a single stream of events for our user and we can select * from events where tracking_id = 'abc123'. This is a lot easier to reason about.
We also have a "marooned" stream. This is what it looks like when an existing user of yours comes to your site and logs in. It is not connected directly to the other stream (but we can connect it if we need to). This is the tradeoff. The core of the reasoning is that, in practice, detailed analysis of this funnel is just not that important. How much analysis happens of the create new user flow? A ton. How much of the login flow? Not much.
Our proposed solution simplifies user identification as follows:
1. All pages/applications should set the same cookie for new visitors if it's not already set.
2. Upon signup, transfer the GUID to a single tracking_id field on the user.
3. On each page load, if a user is defined, set the tracking cookie to the user's tracking_id. Otherwise, track using the cookie.
4. If step #3 would change an existing cookie, that means we have a crossover event. Track 2 "transition" events, one in each stream, to aid analysis if you'd like.
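The per-page-load decision in these steps can be sketched in plain Ruby, independent of any framework. The resolve_tracking_id function, the Resolution struct, and the transition_out / transition_in event names are all hypothetical; treat this as one possible reading of the steps rather than a definitive implementation.

```ruby
require "securerandom"

# Result of resolving identity for one page load: the ID to track against,
# plus any transition events that should be emitted.
Resolution = Struct.new(:tracking_id, :events)

# user_tracking_id:   the tracking_id stored on the logged-in user, or nil.
# cookie_tracking_id: the current cookie value, or nil if no cookie is set.
def resolve_tracking_id(user_tracking_id:, cookie_tracking_id:)
  # Anonymous visitor: keep the cookie, or mint a GUID on a first visit.
  if user_tracking_id.nil?
    return Resolution.new(cookie_tracking_id || SecureRandom.uuid, [])
  end

  # Logged-in user whose cookie is missing or already matches: no crossover.
  if cookie_tracking_id.nil? || cookie_tracking_id == user_tracking_id
    return Resolution.new(user_tracking_id, [])
  end

  # Crossover: the cookie is about to change, so record one transition
  # event in each stream to allow stitching later.
  events = [
    { tracking_id: cookie_tracking_id, name: "transition_out", to: user_tracking_id },
    { tracking_id: user_tracking_id, name: "transition_in", from: cookie_tracking_id }
  ]
  Resolution.new(user_tracking_id, events)
end

anonymous = resolve_tracking_id(user_tracking_id: nil, cookie_tracking_id: "guid-1")
puts anonymous.tracking_id # prints guid-1

crossover = resolve_tracking_id(user_tracking_id: "guid-1", cookie_tracking_id: "guid-9")
puts crossover.tracking_id # prints guid-1
puts crossover.events.size # prints 2
```

At signup the user's tracking_id is copied from the cookie, so the two values match and the second branch applies: the anonymous stream simply continues as the user's stream.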
This approach results in a single GUID per user, making analysis and identification much simpler, especially when it comes to exploring raw data in your data warehouse.
Addressing Potential Concerns
- Working with 3rd party tools: Just because we're taking control of the tracking_id doesn't mean we shouldn't use 3rd party analytics tools. All we need to do is call identify(tracking_id) and they'll respect our authority on the matter.
- Multiple devices and browsers: Post-login activity from different devices can still have the same tracking_id as long as your mobile apps follow the same protocol.
- Data loss during transition: Firing "switchover" events helps maintain data in the warehouse, enabling the possibility of stitching data together if needed.
- Data privacy and compliance: There's no significant impact on data privacy and compliance with GDPR or CCPA. When you go to delete a user you'll need to delete all this data, but that's true of any solution.
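On the stitching point, here is a minimal sketch of how switchover events could be used to reconnect a marooned stream after the fact. The EVENT_ROWS data, the transition_out event name with its to field, and both helper functions are hypothetical.

```ruby
# Hypothetical event rows. "transition_out" is an assumed name for the
# switchover event fired in the marooned stream.
EVENT_ROWS = [
  { tracking_id: "guid-1", name: "signed_up",        at: 1 },
  { tracking_id: "guid-9", name: "viewed_pricing",   at: 5 },
  { tracking_id: "guid-9", name: "transition_out",   at: 6, to: "guid-1" },
  { tracking_id: "guid-1", name: "opened_dashboard", at: 7 }
].freeze

# Map each marooned tracking_id to the user tracking_id it switched over to.
def stitch_map(events)
  events.select { |event| event[:name] == "transition_out" }
        .to_h { |event| [event[:tracking_id], event[:to]] }
end

# Treat marooned IDs as belonging to their owner and return one sorted timeline.
def stitched_timeline(events, tracking_id)
  owners = stitch_map(events)
  events.select { |event| owners.fetch(event[:tracking_id], event[:tracking_id]) == tracking_id }
        .sort_by { |event| event[:at] }
end

puts stitched_timeline(EVENT_ROWS, "guid-1").map { |event| event[:name] }.join(" -> ")
# prints signed_up -> viewed_pricing -> transition_out -> opened_dashboard
```

The default query stays a single-key lookup; stitching is an opt-in step for the rare analysis that needs the pre-login stream.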
Applicability and Limitations
This simpler solution is well-suited for large B2C and B2B SaaS companies, where acquisition analysis is a priority. However, if re-acquisition is central to your business, this approach may not be the best fit.
Implementation
Here is a suggested implementation in Rails.
The core of the implementation is a class that generates a tracking_id and persists it to cookies. The rest is just plugging it into Rails and saving it to the user on signup.
require "securerandom"

# Resolves the tracking_id for the current request and keeps the cookie in sync.
class TrackingId
  COOKIE_KEY = "tid".freeze

  # Convenience entry point: resolve the ID, write it to the cookie, return it.
  def self.build(user:, cookies:)
    builder = new(user, cookies)
    builder.persist_to_cookies
    builder.tracking_id
  end

  def self.new_tracking_id
    SecureRandom.uuid
  end

  def initialize(user = nil, cookies = {})
    @user = user
    @cookies = cookies
  end

  # Precedence: the user's stored ID, then the existing cookie, then a fresh GUID.
  def tracking_id
    @tracking_id ||= user_tracking_id || cookie_tracking_id || self.class.new_tracking_id
  end

  def persist_to_cookies
    @cookies[COOKIE_KEY] = {
      value: tracking_id,
      expires: 1.year.from_now
    }
  end

  private

  def user_tracking_id
    @user.try(:tracking_id)
  end

  # Returns nil when the cookie is missing or blank.
  def cookie_tracking_id
    @cookies[COOKIE_KEY].presence
  end
end
All webpages a user might land on should set the cookie if it's not already set. Here's an example in JS.
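function setGUIDCookie() {
  // Only set the cookie if no cookie named exactly "tid" exists yet.
  // (A plain indexOf("tid=") would also match names like "utm_tid=".)
  const hasTid = document.cookie
    .split("; ")
    .some((cookie) => cookie.startsWith("tid="));
  if (hasTid) return;

  // crypto.randomUUID() is only available in secure (HTTPS) contexts.
  const guid = crypto.randomUUID();
  const expirationDate = new Date();
  expirationDate.setFullYear(expirationDate.getFullYear() + 1);
  document.cookie = `tid=${guid}; expires=${expirationDate.toUTCString()}; SameSite=Lax; path=/`;
}

setGUIDCookie();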
Conclusion
This proposed solution offers a more straightforward approach to user identification for 3rd party tracking systems, simplifying analysis and reducing complexity, particularly when examining raw data in your data warehouse.
This works with all the existing tools like Segment, Amplitude, PostHog, etc. You just take control: send them the tracking_id you generate and don't rely on their merge logic.
In my experience, this simpler solution is better and leads to fewer headaches. Taking control of the identity puts you in the driver's seat and makes it easier to analyze across whatever analytics tools you decide to use.
Good luck!