Revisiting MULSID's Design

Mitch

Mitch

If you have not read the original blog about MULSID, I highly recommend reading that for context before reading this blog post.

Introduction

A while ago I built MULSID, the Mini Unique Lexicographically-Sortable ID. The goal behind this ID format was simple; it needed to be:

  • compact
  • lexicographically sortable
  • URL-safe
  • human-friendly
  • collision resistant enough for low-throughput applications

These requirements resulted in a 10-character base62 ID format that encoded a timestamp and a small amount of randomness.

MULSID worked well, and there really were not any glaring issues with the original implementation. It achieved short IDs which were lexicographically sortable, practical for environments where we only generate a small number of IDs per second, and had a usable lifetime that spanned for many generations before becoming problematic.

All of that said, after revisiting the current design in my head recently, I realized something important:

MULSID is designed conceptually around the characters in the ID, but fundamentally it is a store of information

At a low level, IDs are just fancy integers. So why was I focused on the characters rather than the number that they represent? I decided I needed to revisit the internal implementation and design behind MULSID so that it better utilizes the underlying structure behind the ID while maintaining the original requirements that I set.

MULSID before

The original MULSID format was conceptually a character-based split between the timestamp and randomness entropy.

ComponentSize
Timestamp7 chars
Randomness3 chars

The ID handles lexicographic sorting via the timestamp portion, and uniqueness via the randomness portion. With 9ms "ticks" for the timestamp, this allowed for roughly 1000 years of coverage before IDs would overflow. This system worked fine; with those 3 characters of randomness, each ID had 62^3 (238,328) possible values per-tick.

If you are curious about this original design, I explored it in-depth in the original blog post about MULSID. At the time, this original format was intuitive and easy to reason about, but internally the ID was still a base62 encoded integer. The character "regions" were more conceptual.

Shifting to a Bit-packed Approach

When revisiting MULSID's design in my head, I started with a simple observation; A 10-character base62 ID has 62^10 (839,299,365,868,340,200) possible values. In terms of information that can store, that is equal to log_2(62^10) (~59.54) bits.

If we view through that lens, the original character-based approach started to feel a bit too rigid for my liking. The "system" doesn't care about characters, it cares about what information we actually store in those bits. So, instead of partitioning the ID via characters, the new design treats the ID for what it is; an integer with information packed into the bits.

With that in mind, I first looked at the randomness portion of the ID. In the old design, the 3 character randomness covers just under 18 bits (log_2(62^3) = ~17.86 bits). I could just bump that to hold a full 18 bits instead to get a little bit more uniqueness out of our design. Now, we can hold 2^18 (262,144) possible values per-tick instead.

Now, with randomness improved, we just need to pack the timestamp in the remaining ~41.54 bits. If we keep the same 9ms timestamp window, we can still get a bit over 912 years of lifetime out of the ID before we start to overflow. This reduces the lifetime of the ID from the original format, but I think that is a fair trade for a more even and slightly improved amount of randomness.

So, with all of that in mind, we now have an ID that holds the randomness in the lower 18 bits, and the timestamp in the higher 41.54 bits. This removes the artificial character-owned idea entirely.

High-level Comparison

At a high level, the new design is not that different from the original.

The IDs:

  • are still 10 characters long
  • are still base62
  • are still lexicographically sortable
  • still use 9ms ticks for the timestamp
  • still have a very long lifetime

But internally, we're using the entropy budget a bit more efficiently.

PropertyOld LayoutNew Layout
Tick size9ms9ms
Random states238,328262,144
Lifespan~1004 years~912 years

So, we are simply trading a relatively small amount of timestamp lifetime (~9% shorter) for a cleaner internal model and implementation, and improved collision resistance (~10% more values).

Statistical Comparison

This redesign is also interesting from a statistical perspective. Let's revisit the birthday collision problem from the original blog.

The probability of a collision inside a timestamp tick can be approximated using this equation:

Where R is the generation rate per second of IDs, d is time in seconds per-tick, and V is the number of possible random values.

Under the original design:

V = 62^3 = 238,328

Under the new design:

V = 2^18 = 262,144

Using the same birthday-problem approximation from the original MULSID analysis, under a 1-second time window, the collision probability comparison looks like this:

IDs/secOldNew
5000.4709%0.4389%
10001.8704%1.7438%
15004.1592%3.8866%
20007.2741%6.6357%
250011.1306%10.1728%
500037.6233%34.8908%

It is important to keep in mind, this is the chance that there is ONE collision in ANY tick within the window. It does not mean that when we're generating 5000 IDs/sec, that 35% of them will have a duplicate. In reality, at 5000 IDs/sec, the probability of a collision within one tick in the new design is only 0.377% (vs 0.412% with the old design).

So, obviously we aren't making enormous gains here, but they are not insignificant. And we still have the monotonic mode for a guaranteed unique and strictly increasing ID.

Conclusion

I think the most interesting part of this whole exercise was realizing that the old "character region" model was not the best way to think about the design of MULSID. Fundamentally, MULSID, or any ID really, is simply an integer which is encoding some data.

Once we start looking at this in terms of information budget and density rather than "characters with jobs", we can come up with a better design where we:

  • preserve MULSID's compactness and sortability
  • improve same-tick collision resistance
  • simplify implementation
  • retain similar lifetime

Sometimes, optimization is not about massive gains. It can simply be minor improvements that make meaningful change for the better.

You can try MULSID online at mulsid.mpact.llc or install it via npm or jsr to use in your own JS/TS projects.

If you enjoyed this, check out my other blog posts or the MPACT LLC catalog. Thanks for reading!