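A note on a bug that cost me a long time.
To keep hundreds of thousands of nodes from hitting the central database at the same moment (the thundering herd problem), we usually add jitter (a random delay) to scheduled tasks. For example, a task with a 10-minute period plus up to 3 minutes of jitter is meant to spread nodes uniformly over the window [10 min, 13 min).
But if the random draw accidentally ends up inside the low-level poll loop, things go badly wrong.
Reproducing the bug
The wrong approach: inside a ticker that fires every 5 seconds, regenerate the jitter as a local variable.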
    // Wrong: stateless re-rolling. Needs "math/rand" and "time".
    func (p *Registry) writeConfigs() {
        // Fatal mistake: every 5 s wake-up rolls a fresh 0-180 s jitter.
        // Int63n takes an int64, so the Duration is converted explicitly.
        jitter := time.Duration(rand.Int63n(int64(180 * time.Second)))
        nextTime := p.lastWrite.Add(10*time.Minute + jitter)
        if nextTime.After(time.Now()) {
            return // not due yet, keep waiting
        }
        // Send the request here...
        p.lastWrite = time.Now()
    }
Why is this a disaster?
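Intuitively, I thought this was randomizing within a 3-minute window. Mathematically, though, the model is completely different.
Because the dice are re-rolled every 5 seconds, each poll becomes a Bernoulli trial whose success probability grows linearly with time: the longer a node has waited past the 10-minute mark, the easier it is for a fresh draw to satisfy jitter < elapsed time.
In discrete form this is a non-homogeneous geometric distribution; when the poll interval is small relative to the window, its continuous limit is the Rayleigh distribution.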
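The Rayleigh limit can be derived in a few lines. The sketch below uses notation that is not in the original note (W for the jitter window, Δ for the poll interval, t for the time elapsed past the 10-minute mark) and assumes that polls are aligned with the 10-minute mark and that the per-poll firing probability can be treated as a continuous hazard rate.

```latex
% Notation (assumed here, not from the original note):
%   W = 180 s       jitter window
%   \Delta = 5 s    poll interval
%   t               time elapsed past the 10-minute mark, 0 <= t <= W
\begin{align*}
\Pr[\text{fire at the poll at } t \mid \text{not fired yet}]
  &= \Pr[\text{jitter} \le t] = \frac{t}{W} \\
h(t) &\approx \frac{1}{\Delta}\cdot\frac{t}{W} = \frac{t}{W\Delta} \\
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right)
      = \exp\!\left(-\frac{t^2}{2W\Delta}\right) \\
f(t) &= -S'(t) = \frac{t}{\sigma^2}\,e^{-t^2/(2\sigma^2)},
  \qquad \sigma^2 = W\Delta = 900\ \mathrm{s}^2,\quad \sigma = 30\ \mathrm{s} \\
F(60\ \mathrm{s}) &= 1 - e^{-60^2/(2\cdot 900)} = 1 - e^{-2} \approx 0.865
\end{align*}
```

The density f(t) is the Rayleigh density with scale σ = 30 s. Its mode is σ, which gives the 30-second peak, and F(60 s) gives the 86.5% figure.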
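- Expected uniform distribution: a flat curve with standard deviation 180/√12 ≈ 52 s; traffic is spread evenly over the full 180 seconds.
- Actual Rayleigh distribution: a sharp peak crowded against the early edge of the window, with the standard deviation squeezed to about 19.7 s (scale σ = √(180 s × 5 s) = 30 s). The mode falls at 30 seconds.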
The result: requests never spread out evenly. About 86.5% of nodes sent their requests together within the first 60 seconds, which took the DB down.
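The clustering can be checked with a small Monte Carlo simulation. This is a minimal sketch, not the production code: the function names (`rerollDelay`, `onceDelay`, `summarize`) are hypothetical, polls are assumed to be aligned with the 10-minute mark, and only a single cycle is simulated. It compares re-rolling on every poll against drawing the jitter once.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

const (
	windowSec = 180.0 // jitter window in seconds (from the note)
	pollSec   = 5.0   // poll interval in seconds (from the note)
)

// rerollDelay replays the buggy loop for one node: every poll draws a fresh
// jitter and fires as soon as that draw is not larger than the time already
// elapsed past the 10-minute mark. It returns the firing delay in seconds.
func rerollDelay(rng *rand.Rand) float64 {
	for elapsed := 0.0; ; elapsed += pollSec {
		if rng.Float64()*windowSec <= elapsed {
			return elapsed
		}
	}
}

// onceDelay models the stateful version: one draw per cycle, and the node
// fires at the first poll at or after that draw.
func onceDelay(rng *rand.Rand) float64 {
	jitter := rng.Float64() * windowSec
	return math.Ceil(jitter/pollSec) * pollSec
}

// summarize simulates `nodes` independent nodes and returns the share that
// fires strictly before 60 s plus the standard deviation of the delays.
func summarize(draw func(*rand.Rand) float64, nodes int, seed int64) (early, std float64) {
	rng := rand.New(rand.NewSource(seed))
	var sum, sumSq float64
	count := 0
	for i := 0; i < nodes; i++ {
		d := draw(rng)
		sum += d
		sumSq += d * d
		if d < 60 {
			count++
		}
	}
	n := float64(nodes)
	mean := sum / n
	return float64(count) / n, math.Sqrt(sumSq/n - mean*mean)
}

func main() {
	cases := []struct {
		name string
		draw func(*rand.Rand) float64
	}{
		{"re-roll", rerollDelay},
		{"once", onceDelay},
	}
	for _, c := range cases {
		early, std := summarize(c.draw, 200000, 1)
		fmt.Printf("%-8s fired before 60s: %.1f%%  std: %.1fs\n", c.name, early*100, std)
	}
}
```

With a 5-second poll the re-rolling share before 60 s should land in the high 80s percent, close to the continuous-limit figure of 86.5% but not identical to it because the poll is discrete. The draw-once version should stay near 30%, with a standard deviation of roughly 52 s.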
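The fix: do not regenerate the jitter on every check
The fix is remarkably simple but goes to the heart of the problem: jitter must be stateful. Roll the dice once per cycle and store the result as an instance field.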
    // Correct: stateful jitter. Needs "math/rand" and "time".
    type Registry struct {
        lastWrite  time.Time
        nextJitter time.Duration // kept as instance state
    }

    func (p *Registry) writeConfigs() {
        nextTime := p.lastWrite.Add(10*time.Minute + p.nextJitter)
        if nextTime.After(time.Now()) {
            return
        }
        // Send the request here...
        p.lastWrite = time.Now()
        // After the write, draw and pin the jitter for the NEXT cycle.
        p.nextJitter = time.Duration(rand.Int63n(int64(180 * time.Second)))
    }
Takeaways
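- Do not re-sample (re-roll) random numbers inside a poll loop.
- A wrong jitter implementation, such as this one that clusters nodes into a Rayleigh peak, can do more damage than having no jitter at all.
- Bad jitter (especially from re-sampling) introduces a highly non-uniform delay. Because the Rayleigh shape squeezes mass toward the early side (most nodes fire within a narrow early window), over many cycles it can even destroy a naturally even spread, with the irregular accumulated delays herding nodes that started out staggered into the same narrow window.
- Many engineering problems that look like "badly tuned parameters" turn out, once you dig to the bottom, to be a flaw in the underlying mathematical or statistical model.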
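The first takeaway can be guarded by a regression test. The sketch below is one possible shape under stated assumptions: `registry`, `firstFire`, and the injected `now` and `rng` fields are hypothetical and are not the real Registry API. It checks that with stateful jitter the firing time is pinned by the single draw, so changing the poll interval only moves it by less than one poll. It also draws the first jitter at construction, which the snippet above does not show.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	period       = 10 * time.Minute
	maxJitter    = 3 * time.Minute
	pollInterval = 5 * time.Second
)

// registry is a hypothetical, test-friendly variant of Registry: the clock
// and the random source are injected so a schedule can be replayed.
type registry struct {
	lastWrite  time.Time
	nextJitter time.Duration
	now        func() time.Time
	rng        *rand.Rand
}

// drawJitter returns one jitter value in [0, maxJitter).
func (r *registry) drawJitter() time.Duration {
	return time.Duration(r.rng.Int63n(int64(maxJitter)))
}

// poll is what the ticker would call. It only compares against the stored
// jitter and re-draws after a write, never on a plain check.
func (r *registry) poll() bool {
	now := r.now()
	if r.lastWrite.Add(period + r.nextJitter).After(now) {
		return false
	}
	r.lastWrite = now
	r.nextJitter = r.drawJitter()
	return true
}

// firstFire runs one cycle on a fake clock advancing by tick and returns the
// jitter that was drawn and how long after the start the write fired.
func firstFire(seed int64, tick time.Duration) (jitter, fired time.Duration) {
	start := time.Unix(0, 0)
	clock := start
	r := &registry{
		lastWrite: start,
		now:       func() time.Time { return clock },
		rng:       rand.New(rand.NewSource(seed)),
	}
	r.nextJitter = r.drawJitter() // first cycle is jittered too (an assumption)
	jitter = r.nextJitter
	for {
		clock = clock.Add(tick)
		if r.poll() {
			return jitter, clock.Sub(start)
		}
	}
}

func main() {
	for seed := int64(1); seed <= 3; seed++ {
		jitter, coarse := firstFire(seed, pollInterval)
		_, fine := firstFire(seed, pollInterval/5)
		fmt.Printf("seed %d: jitter=%v fired(5s poll)=%v fired(1s poll)=%v\n",
			seed, jitter, coarse, fine)
	}
}
```

With the re-rolling version the same check would fail, because a shorter poll interval means more draws and therefore a systematically earlier firing time.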
Actually related to ir-air-sleek-report
- History of this investigation:
  - In ir-air-sleek we found that a slow DB led to a death spiral.
  - For ir-air-sleek, we resolved the death spiral.
  - But we still needed to fix the slow DB.
  - First we dropped the never-used update_at index on check_statues.
  - Then we changed sync_config from two queries to one CTE, which saved 50% of the WAL workload. We also replaced the update_at index on the node_config table, moving from BTREE to BRIN, to resolve the LWLock:BufferContent contention.
  - But on May 11 and May 18 we found we were still haunted by the success-ratio dip.
  - So we decided to re-align the SLO definition (we have not dropped the detector threshold yet) so that at least our SLO would not be hurt, while we kept hunting for the root cause.
  - Then on May 19 I found this issue, which is why the jitter never actually worked as expected.
Formal report hha-reportcurrentcheckconfigs-jitter-bug-report