可以利用唯一的词典分类排序标识符来按时间查询s3对象,而无需支持元数据存储,此处's how!
瑞安·布朗(Ryan Brown)Trek10
瑞安·斯科特·布朗 | 2020年4月14日

The rise of distributed data stores and the general decomposition of systems into smaller pieces means that coordination between each server, service, or function is less available. In my first applications, unique ID generation meant setting auto_increment=True on a column in the SQL database. Easy, done, no problem. Today, each microservice has its own data source(s) and NoSQL stores are common. Every NoSQL DB is "NoSQL" in its own way, but they usually eschew coordinated and single-writer solutions in the name of reliability/performance/both. You can't have an auto-increment column without implementing the coordination client-side.

使用 numbers as identifiers 也 creates problems. Auto-incrementing can lead to enumeration-based attacks. Fields can have fixed sizes. These issues can go unrealized until you overflow the uint32 field, and now your logs are a pile of ID conflict errors. Instead of integers, we can use a different kind of fixed-length field and make it non-sequential so that different hosts can generate IDs without a central coordinating point.

UUID是一种改进,可以避免在分布式设置中发生冲突,但是严格来说,它是随机的,您无法轻松地对它们进行排序或确定粗略顺序。 Segment不久前发布了博客 about one replacement for UUIDs with the KSUID (K-Sortable Universal ID) but it has limitations and uses a strange 14e8 offset to avoid running out of epoch time in the next 100 years.

输入 唯一的按词典顺序排序的标识符(ULID). These are sortable, high-entropy identifiers that we can generate anywhere in our pipeline without coordination and have confidence that there won't be collisions. A ULID looks like 01E5TZRCM5WZYPB2BH7KMYR5HT, and the first 10 characters are a timestamp, and the next 16 characters are random.


I came across the need for ULID / KSUID when working with S3 objects that needed to be named, but I 也 wanted to be able to query for recent objects. Typically, when I need a random identifier, I reach for UUID-v4. Why v4?

  • UUID v1和v2包含基于生成它们的主机的MAC地址。这并不是真正的安全性问题,因为L2地址不会对公共互联网有太大帮助。但是,这确实意味着,如果我的UUID是在Lambda中生成的,则MAC地址没有语义值。我无法通过SSH进入Lambda并查找MAC或以其他方式使用该信息。
  • UUID v3 requires a seed, and I would just be using random.randint() or the equivalent to pick my seed value. Any system that requires a seed means that I have to think about what to use as a seed, how that impacts the randomness, and how that might impact security or collisions.
  • UUID v4是随机的,但是由于它是完全随机的,因此它不提供语义重载。

为什么我要在系统上语义上重载UUID?我从语义超负荷向导中得到了提示, 里克·霍利汉(Rick Houlihan)。我花了一些时间在单表DynamoDB设计上,而这种思考方式也蔓延到了S3存储系统的设计中。


基于索引的思维可能很有启发性,特别是因为IT充满了本质上分类的存储系统。 S3在返回对象键和前缀时对其进行排序,无论它们以什么顺序添加。

选择密钥以按存储桶和前缀列出。例如,考虑一个名为“ dictionary”的存储桶,其中包含每个英语单词的键。您可能会打电话列出该存储桶中所有以字母“ q”开头的键。列表结果始终以UTF-8二进制顺序返回。

AWS S3文档


使用 Python中的ULID is simple. First, you need to install the ulid-py library, then you can import ulid and start generating identifiers:

This would upload an object with just a ULID as the name, with the contents abc. Then when we list objects in the CLI or in any other app, they're sorted by the time they were created even if there were several new objects in a single millisecond.

$ aws --profile personal s3 ls s3://t10-blog-ulids
2020-04-13 21:17:53          3 01E5V474WE4DE0N63ZWT7P6YWH
2020-04-13 21:17:54          3 01E5V475QFRCEHXKJAS3BRS6BV
2020-04-13 21:24:51          3 01E5V4KXFTP52C9M5DVPQ2XR8T
2020-04-13 21:48:33          3 01E5V5Z9J0GX72VFSENBCKMHF0


>>> import ulid
>>> u = ulid.new()
>>> u.str
>>> u.bytes
>>> u.int
>>> u.uuid
>>> u.float
>>> bin(u.int)

Especially useful is the u.uuid type that 所有ows you to replace existing UUIDs in your system with ULIDs without changing the value format. This means you can start benefitting from the ordering properties of ULIDs in existing systems.


因为48位时间戳+ 100位随机性的ULID格式意味着我们可以获得100位 每毫秒 几乎消除了碰撞的可能性*。将此与我们的自动递增数字列进行对比。增量使我们不得不集中管理数据库中的该编号,以避免ID冲突。使用ULID,我们可以在任何Lambda,容器或EC2实例中生成ID。



Further, you may choose to go off-spec and use the 2 most-significant bits of the ULID that our encoding gives us. This is possible because there are 150 bits available in text representation, minus 148 used by the timestamp and randomness in the spec. You can get 4 sub-types of ULID in the same spirit of descriptive IDs like i-0123456789 and AKIAXNMVN by making the ID itself contain an encoded type.

*如果您是Amazon Retail,请不要遵循此建议,百万分之一的事件每小时发生两次,且规模足够大。


DynamoDB的新趋势是 单表设计。使用具有允许不同GSI服务多个查询的设计的单个表。里克(Rick)在推特上发布了这个真实示例 Kindle收藏权服务 使用4个GSI服务9个查询。

Kindle DynamoDB表设计

这些单表设计依赖于使用可排序的属性来启用查询,通常通过组合哈希来实现&Range in novel ways for each type of object. For example, you might create a key like Hash=Org#Trek10 Range=Post#2020-04-03#ca21477c-5693-4f2d-92e5-068102b24be9 which is composed of a type, org name, create time, and UUIDv4. Instead, with a ULID, you would be able to obviate the timestamp and ID combination and use a range key of Range=Post#01E5WF8AERWH9F8PDTQ5K4GW7R. This is a more efficient representation which 也 lets you use that same ID as a foreign key.



> const monotonicFactory = require('ulid').monotonicFactory;
> const ulid = monotonicFactory()
> ulid(1586872590191)
> ulid(1586872590191)
> ulid(1586872590191)
> ulid(1586872590191)
> ulid(1586872590191)



Let's return to our earlier S3 example for a moment. When looking for data in a specific time range, you can reduce the number of objects returned by ListObjects significantly. The Delimiter argument lets you narrow the range of your search in 5-bit increments. A ULID has 10 leading characters that represent a 48-bit timestamp with millisecond precision, with each character encoding 5 bits of the number.

48-bit millisecond epoch timestamps will run out of space in 10889 AD, mark your calendar. The astute reader will 也 note that a 48-bit timestamp value doesn't evenly encode to 50 bits available in a Crockford Base32 string, so the highest timestamp that will ever be representable is actually 7ZZZZZZZZZ not ZZZZZZZZZZ.

t = time character
r = randomness character


  • 第一个角色:407226天
  • 第二个角色:12725天
  • 第三角色:397天
  • 第四字符:12天10个小时
  • 第5个角色:9小时19分钟
  • 第六位角色:17分28秒
  • 第7个角色:32秒
  • 第8个字符:1秒
  • 第9个字符:30毫秒
  • 第十个字符:1毫秒

This means that with S3's ListObjectsV2 API and the Delimiter parameter, you can grab 17-minute spans of your written data by using the 6th character of the ULID as your Delimiter. Take these objects:

2020-04-13 21:17:54          3 01E5V475QFRCEHXKJAS3BRS6BV
2020-04-13 21:24:51          3 01E5V4KXFTP52C9M5DVPQ2XR8T
2020-04-13 21:48:33          3 01E5V5Z9J0GX72VFSENBCKMHF0

We can slice between the 01E5V5Z... span with the following code:

>>> [k['Key'] for k in s3.list_objects_v2(
>>> [k['Key'] for k in s3.list_objects_v2(



在本文中,我们介绍了几种方式,即语义收费的标识符在您的存储层中很有用。总体而言,ULID和可分类标识符的类似规范是对标准完全随机UUID的改进。它们可以使您的应用程序运行得更快,同时仍然避免冲突和枚举攻击,并且 可以更有效地存储(26个字符对36个字符)。

瑞安·布朗(Ryan Brown)Trek10