服务
关于
CloudProse博客
聚光灯

利用ULID在无序数据存储中创建订单

可以利用唯一的词典分类排序标识符来按时间查询s3对象,而无需支持元数据存储,此处's how!
瑞安·布朗(Ryan Brown)Trek10
瑞安·斯科特·布朗 | 2020年4月14日

The rise of distributed data stores and the general decomposition of systems into smaller pieces means that coordination between each server, service, or function is less available. In my first applications, unique ID generation meant setting auto_increment=True on a column in the SQL database. Easy, done, no problem. Today, each microservice has its own data source(s) and NoSQL stores are common. Every NoSQL DB is "NoSQL" in its own way, but they usually eschew coordinated and single-writer solutions in the name of reliability/performance/both. You can't have an auto-increment column without implementing the coordination client-side.

使用 numbers as identifiers 也 creates problems. Auto-incrementing can lead to enumeration-based attacks. Fields can have fixed sizes. These issues can go unrealized until you overflow the uint32 field, and now your logs are a pile of ID conflict errors. Instead of integers, we can use a different kind of fixed-length field and make it non-sequential so that different hosts can generate IDs without a central coordinating point.

UUID是一种改进,可以避免在分布式设置中发生冲突,但是严格来说,它是随机的,您无法轻松地对它们进行排序或确定粗略顺序。 Segment不久前发布了博客 about one replacement for UUIDs with the KSUID (K-Sortable Universal ID) but it has limitations and uses a strange 14e8 offset to avoid running out of epoch time in the next 100 years.

输入 唯一的按词典顺序排序的标识符(ULID). These are sortable, high-entropy identifiers that we can generate anywhere in our pipeline without coordination and have confidence that there won't be collisions. A ULID looks like 01E5TZRCM5WZYPB2BH7KMYR5HT, and the first 10 characters are a timestamp, and the next 16 characters are random.

UUID呢?

I came across the need for ULID / KSUID when working with S3 objects that needed to be named, but I 也 wanted to be able to query for recent objects. Typically, when I need a random identifier, I reach for UUID-v4. Why v4?

  • UUID v1和v2包含基于生成它们的主机的MAC地址。这并不是真正的安全性问题,因为L2地址不会对公共互联网有太大帮助。但是,这确实意味着,如果我的UUID是在Lambda中生成的,则MAC地址没有语义值。我无法通过SSH进入Lambda并查找MAC或以其他方式使用该信息。
  • UUID v3 requires a seed, and I would just be using random.randint() or the equivalent to pick my seed value. Any system that requires a seed means that I have to think about what to use as a seed, how that impacts the randomness, and how that might impact security or collisions.
  • UUID v4是随机的,但是由于它是完全随机的,因此它不提供语义重载。

为什么我要在系统上语义上重载UUID?我从语义超负荷向导中得到了提示, 里克·霍利汉(Rick Houlihan)。我花了一些时间在单表DynamoDB设计上,而这种思考方式也蔓延到了S3存储系统的设计中。

用于启用S3时间查询的ULID

基于索引的思维可能很有启发性,特别是因为IT充满了本质上分类的存储系统。 S3在返回对象键和前缀时对其进行排序,无论它们以什么顺序添加。

选择密钥以按存储桶和前缀列出。例如,考虑一个名为“ dictionary”的存储桶,其中包含每个英语单词的键。您可能会打电话列出该存储桶中所有以字母“ q”开头的键。列表结果始终以UTF-8二进制顺序返回。

AWS S3文档

这对我们的应用意味着什么?这意味着,如果我们向S3提供可排序的键,并按照我们实际希望接收项目的顺序对其进行排序,那么我们将能够按顺序获取对象,而无需在客户端进行任何排序。在对象名称中使用ULID(或者更好的方法是用前缀分割ULID)可以避免冲突,也可以防止对对象进行枚举相关的攻击。

使用 Python中的ULID is simple. First, you need to install the ulid-py library, then you can import ulid and start generating identifiers:

This would upload an object with just a ULID as the name, with the contents abc. Then when we list objects in the CLI or in any other app, they're sorted by the time they were created even if there were several new objects in a single millisecond.

$ aws --profile personal s3 ls s3://t10-blog-ulids
2020-04-13 21:17:53          3 01E5V474WE4DE0N63ZWT7P6YWH
2020-04-13 21:17:54          3 01E5V475QFRCEHXKJAS3BRS6BV
2020-04-13 21:24:51          3 01E5V4KXFTP52C9M5DVPQ2XR8T
2020-04-13 21:48:33          3 01E5V5Z9J0GX72VFSENBCKMHF0

自动排序很有用,当然,可以根据需要以不同的方式格式化ULID。

>>> import ulid
>>> u = ulid.new()
>>> u.str
'01E5V7GWA9CHP337PB8SR18ZP4'
>>> u.bytes
b'\x01qvxqIdl1\x9e\xcbFp\x14~\xc4'
>>> u.int
1918360407572615930874316424782053060
>>> u.uuid
UUID('01717a42-cde2-b5be-eed8-55222c867b58')
>>> u.float
1.918360407572616e+36
>>> bin(u.int)
'0b1011100010111011001111000011100010100100101100100011011000011000110011110110010110100011001110000000101000111111011000100'

Especially useful is the u.uuid type that 所有ows you to replace existing UUIDs in your system with ULIDs without changing the value format. This means you can start benefitting from the ordering properties of ULIDs in existing systems.

分散发电

因为48位时间戳+ 100位随机性的ULID格式意味着我们可以获得100位 每毫秒 几乎消除了碰撞的可能性*。将此与我们的自动递增数字列进行对比。增量使我们不得不集中管理数据库中的该编号,以避免ID冲突。使用ULID,我们可以在任何Lambda,容器或EC2实例中生成ID。

由于ID本身带有时间戳,因此我们可以容忍分区和延迟。插入最新数据不会引起排序问题,因为在生成ID时会为项目加上时间戳,如果需要,我们随时可以在摄取时添加另一个时间戳字段。这些ID使我们能够维护订单并在以后插入数据,而不必添加单独的提取过程。

分布式生成确实意味着没有“真正的时钟”可让我们完美地订购放置ULID的项目。在任何规模的系统中,中央同步点(用于订购)和更高的可靠性/弹性之间的这种折衷是常见的,并且在规模上几乎成为必需。

Further, you may choose to go off-spec and use the 2 most-significant bits of the ULID that our encoding gives us. This is possible because there are 150 bits available in text representation, minus 148 used by the timestamp and randomness in the spec. You can get 4 sub-types of ULID in the same spirit of descriptive IDs like i-0123456789 and AKIAXNMVN by making the ID itself contain an encoded type.

*如果您是Amazon Retail,请不要遵循此建议,百万分之一的事件每小时发生两次,且规模足够大。

DynamoDB中的ULID

DynamoDB的新趋势是 单表设计。使用具有允许不同GSI服务多个查询的设计的单个表。里克(Rick)在推特上发布了这个真实示例 Kindle收藏权服务 使用4个GSI服务9个查询。

Kindle DynamoDB表设计

这些单表设计依赖于使用可排序的属性来启用查询,通常通过组合哈希来实现&Range in novel ways for each type of object. For example, you might create a key like Hash=Org#Trek10 Range=Post#2020-04-03#ca21477c-5693-4f2d-92e5-068102b24be9 which is composed of a type, org name, create time, and UUIDv4. Instead, with a ULID, you would be able to obviate the timestamp and ID combination and use a range key of Range=Post#01E5WF8AERWH9F8PDTQ5K4GW7R. This is a more efficient representation which 也 lets you use that same ID as a foreign key.

通过将随机性值处理为单调性,ULID也可以用于关联同时创建的相似项。

以创建一个ULID的NodeJS示例为例,然后使用该ULID的随机性来创建一系列相关项,这些项将按词法排序在一起:

> const monotonicFactory = require('ulid').monotonicFactory;
> const ulid = monotonicFactory()
> ulid(1586872590191)
'01E5WFM7VFPWCNF4DM76ADV80W'
> ulid(1586872590191)
'01E5WFM7VFPWCNF4DM76ADV80X'
> ulid(1586872590191)
'01E5WFM7VFPWCNF4DM76ADV80Y'
> ulid(1586872590191)
'01E5WFM7VFPWCNF4DM76ADV80Z'
> ulid(1586872590191)
'01E5WFM7VFPWCNF4DM76ADV810'

这些ULID可用于关联动作和事件,或将来自特定任务或主机的活动分组。

在S3中去格子

Let's return to our earlier S3 example for a moment. When looking for data in a specific time range, you can reduce the number of objects returned by ListObjects significantly. The Delimiter argument lets you narrow the range of your search in 5-bit increments. A ULID has 10 leading characters that represent a 48-bit timestamp with millisecond precision, with each character encoding 5 bits of the number.

48-bit millisecond epoch timestamps will run out of space in 10889 AD, mark your calendar. The astute reader will 也 note that a 48-bit timestamp value doesn't evenly encode to 50 bits available in a Crockford Base32 string, so the highest timestamp that will ever be representable is actually 7ZZZZZZZZZ not ZZZZZZZZZZ.

t = time character
r = randomness character
ttttttttttrrrrrrrrrrrrrrrr

每个字符的范围多长?好吧,这是每个中可表示的最低有效位的几个数量级。

  • 第一个角色:407226天
  • 第二个角色:12725天
  • 第三角色:397天
  • 第四字符:12天10个小时
  • 第5个角色:9小时19分钟
  • 第六位角色:17分28秒
  • 第7个角色:32秒
  • 第8个字符:1秒
  • 第9个字符:30毫秒
  • 第十个字符:1毫秒

This means that with S3's ListObjectsV2 API and the Delimiter parameter, you can grab 17-minute spans of your written data by using the 6th character of the ULID as your Delimiter. Take these objects:

2020-04-13 21:17:54          3 01E5V475QFRCEHXKJAS3BRS6BV
2020-04-13 21:24:51          3 01E5V4KXFTP52C9M5DVPQ2XR8T
2020-04-13 21:48:33          3 01E5V5Z9J0GX72VFSENBCKMHF0

We can slice between the 01E5V5Z... span with the following code:

>>> [k['Key'] for k in s3.list_objects_v2(
    Bucket='t10-blog-ulids',
    Delimiter='4',
    Prefix='01E5V4'
)['Contents']]
['01E5V475QFRCEHXKJAS3BRS6BV', '01E5V4KXFTP52C9M5DVPQ2XR8T']
​
​
>>> [k['Key'] for k in s3.list_objects_v2(
    Bucket='t10-blog-ulids',
    Delimiter='5',
    Prefix='01E5V5'
)['Contents']]
['01E5V5Z9J0GX72VFSENBCKMHF0']

正如预期的那样,键在返回时是有序的,我们可以使用按位运算符(也称为魔术)将所需的任何时间戳或范围转换为S3前缀查询。这使我们可以进行基于时间范围的筛选,而无需列出存储桶中的每个对象或使用S3库存这样的外部作业来列出所有对象名称和时间戳。

包起来

在本文中,我们介绍了几种方式,即语义收费的标识符在您的存储层中很有用。总体而言,ULID和可分类标识符的类似规范是对标准完全随机UUID的改进。它们可以使您的应用程序运行得更快,同时仍然避免冲突和枚举攻击,并且 可以更有效地存储(26个字符对36个字符)。

作者
瑞安·布朗(Ryan Brown)Trek10
瑞安·斯科特·布朗