A classic jitter pitfall: a Uniform distribution degenerating into a Rayleigh

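Notes on a bug that tripped me up for a long time.

To keep hundreds of thousands of nodes from firing requests at the same moment and overwhelming the central database (the thundering herd problem), we usually add jitter (a random offset) to scheduled tasks. For example, a task with a 10-minute period and 3 minutes of jitter is meant to spread the nodes uniformly over the window [10min, 13min).

But if the random-number generation accidentally ends up inside the low-level poll loop, disaster follows.

Reconstructing the buggy code

The wrong approach: inside a ticker that fires every 5 seconds, regenerate the jitter in a local variable.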

// Wrong: stateless re-rolling
func (p *Registry) writeConfigs() {
    // Fatal mistake: on every 5s wake-up, roll a fresh 0~180s die
    jitter := time.Duration(rand.Int63n(int64(180 * time.Second)))
    nextTime := p.lastWrite.Add(10*time.Minute + jitter)

    if nextTime.After(time.Now()) {
        return // keep waiting
    }
    // Send the request here...
    p.lastWrite = time.Now()
}

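Why is this a disaster?

Intuitively, I thought this was randomizing within a 3-minute window. Mathematically, though, the model is a completely different one.

Because the die is re-rolled every 5 seconds, the process becomes a sequence of Bernoulli trials whose success probability grows linearly with time: the longer a node has waited past the 10-minute mark, the easier it is to satisfy the firing condition jitter < elapsed time. In discrete form this is a non-homogeneous geometric distribution; when the poll interval is small relative to the window, its continuous limit is the Rayleigh distribution.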
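The continuous limit can be sketched as follows, as an approximation rather than an exact result. The symbols are introduced here for illustration only: $\Delta = 5\,\mathrm{s}$ is the poll interval, $W = 180\,\mathrm{s}$ is the jitter window, $t_k = k\Delta$ is the time elapsed past the 10-minute mark at the $k$-th poll, $S$ is the probability that a node has not fired yet, and $\sigma_R$ is the Rayleigh scale parameter (not the standard deviation). Polls are assumed to be aligned with the 10-minute mark.

$$
p_k = \Pr(\text{jitter} < t_k) = \frac{t_k}{W}, \qquad
S(t_k) = \prod_{i=1}^{k}\left(1 - \frac{i\Delta}{W}\right)
$$

$$
\ln S(t_k) \approx -\sum_{i=1}^{k}\frac{i\Delta}{W} = -\frac{\Delta\,k(k+1)}{2W} \approx -\frac{t_k^2}{2W\Delta}
\quad\Longrightarrow\quad
S(t) \approx \exp\!\left(-\frac{t^2}{2\sigma_R^2}\right), \quad \sigma_R^2 = W\Delta = 900\,\mathrm{s}^2
$$

$$
\text{mode} = \sigma_R = 30\,\mathrm{s}, \qquad
\text{std}_{\text{Rayleigh}} = \sigma_R\sqrt{\tfrac{4-\pi}{2}} \approx 19.7\,\mathrm{s}, \qquad
\text{std}_{\text{Uniform}} = \frac{W}{\sqrt{12}} \approx 52\,\mathrm{s}
$$

$$
\Pr(T \le 60\,\mathrm{s}) = 1 - S(60) \approx 1 - e^{-2} \approx 0.865
$$

The first approximation uses $\ln(1-x) \approx -x$, which is only accurate while $i\Delta/W$ is small, so the exact discrete figures for a 5-second poll can differ from these by a few percentage points depending on how the polls line up with the window.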

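Expected Uniform distribution: a flat curve with standard deviation $σ \approx 52 s$; traffic is spread evenly across the 180 seconds.
Actual Rayleigh distribution: a sharp peak crowded against the left edge of the window (a right-skewed shape), with the standard deviation squeezed down to $σ \approx 20 s$. The mode falls at 30 seconds.

The result: the requests never spread out evenly. Under the Rayleigh approximation, about 86.5% of the nodes ($1 - e^{-2}$) send their requests within the first 60 seconds, which took the DB down.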

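The fix: do not regenerate the jitter on every check

The fix is remarkably simple, yet it goes straight to the root cause: the jitter must be stateful. Roll the die only once per cycle and store the result as an instance field.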

// Correct: stateful jitter
type Registry struct {
    lastWrite  time.Time
    nextJitter time.Duration // kept as instance state; zero until the first write unless initialized
}

func (p *Registry) writeConfigs() {
    nextTime := p.lastWrite.Add(10*time.Minute + p.nextJitter)

    if nextTime.After(time.Now()) {
        return
    }
    // Send the request here...
    p.lastWrite = time.Now()
    // After the write, draw and pin the jitter for the NEXT cycle
    p.nextJitter = time.Duration(rand.Int63n(int64(180 * time.Second)))
}

Takeaways

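Do not re-roll random numbers inside a polling loop.
A wrong jitter implementation (such as this one, which produces Rayleigh clustering) can do even more damage than having no jitter at all. Bad jitter, especially the re-rolled kind, introduces a highly non-uniform random variable. Because the Rayleigh distribution squeezes its mass to the left (most nodes tend to fire inside a narrow early window), over several cycles it can even destroy a naturally even spread: the irregular accumulated offsets push nodes that were originally staggered into the same narrow window.
Many engineering problems that look like "a badly tuned parameter" turn out, once you dig all the way down, to be a flaw in the underlying mathematical or statistical model.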

This is actually related to ir-air-sleek-report

History of this investigation:
- In ir-air-sleek we found that a slow DB led to a death spiral
  - For ir-air-sleek, we resolved the death spiral
    - But then we still needed to resolve the slow DB
      - First we dropped the never-used update_at index on check_statues.

        Then we changed sync_config from two queries to one CTE, which saved 50% of the WAL workload. We also switched the ‘update_at’ index on the ‘node_config’ table from BTREE to BRIN to resolve the LWLock:BufferContent contention.

        But on May 11 and May 18 we found we were still haunted by the success-ratio dip.

        So we decided to re-align the SLO definition (we didn’t drop the detector threshold yet), so that at least our SLO would not be hurt. Meanwhile we kept hunting for the root cause.

        Then on May 19 I found this issue, which is why the jitter never actually worked as expected.

Formal report hha-reportcurrentcheckconfigs-jitter-bug-report
