Last July, Intel’s Peter Zijlstra proposed “Call Depth Tracking” as a mitigation approach for Retbleed that avoids the “performance horror show” of using IBRS (Indirect Branch Restricted Speculation). The most recent version of the call depth tracking code was released today, and the benchmark results look very promising for easing the performance impact of the Retbleed CPU mitigation.
As for what Peter has been working on with this call depth tracking code, he explained with the v2 patch series:
This version is significantly different from the previous one in that it no longer uses external call thunks allocated from module space. Instead, each function is 16-byte aligned and gets 16 bytes of (pre-symbol) padding. (This padding will also be useful for other things, like the kCFI/FineIBT work.)
Prior to these fixes, function alignment is basically non-existent, as such any instruction fetch for the first few instructions of a function will have (on average) half of the fetch window filled with whatever code precedes it. Pushing the alignment up to 16 bytes makes things better for chips that have a 16-byte i-fetch window (Intel) while not making things worse for chips that have a larger, 32-byte i-fetch window (AMD Zen). In fact, this improves the worst case for Zen from 31 bytes of garbage to 16 bytes of garbage.
As such, the first patches in the series fix many alignment quirks.
The second big difference is the introduction of the pcpu_hot structure. Because the compiler is entirely free to place two adjacent (in code) DEFINE_PER_CPU() variables in wildly different cache lines, introducing the per-CPU x86_call_depth variable sometimes added significant extra cache pressure, while other times it would sit nicely in the same line as preempt_count and not show up at all.
In order to alleviate this problem, introduce the pcpu_hot structure and collect a number of hot per-CPU variables in a way the compiler can’t mess up.
For more information on how call depth tracking mitigates Retbleed:
Apart from these changes, the core of the call depth tracking is still the same.
– objtool creates a list of (function) call sites.
– for each call, rewrite the target function’s padding with the accounting thunk (if not already done) and adjust the call site to target that thunk.
– the Retbleed return thunk mechanism is used for a custom return thunk that includes the return accounting and performs RSB stuffing when required.
This ensures that no new compiler is required and avoids almost all overhead for unaffected machines. The new option can be selected using:
retbleed=stuff
on the kernel command line.
The Return Stack Buffer (RSB) is a 16-entry stack that is filled on every call. On the return path, speculation will “pop” an entry and take that as the return target. Once the RSB is empty, the CPU falls back to other predictors, e.g. the branch history buffer, which can be poisoned from user space and mislead the speculation (return) path to a disclosure gadget of the attacker’s choice, as described in the Retbleed paper.
Call depth tracking is designed to break this speculation path by stuffing the RSB with calls to a speculation trap whenever the RSB is running low. That way the speculation stops and never falls back on the other predictors.
The assumption is that stuffing at the 12th return is sufficient to break the speculation before it hits the underflow and the fallback to the other predictors. Testing confirms that it works. Johannes, one of the Retbleed researchers, tried to attack this approach and confirmed that it brings the signal-to-noise ratio down to the crystal ball level.
The benchmark results look very promising:
Full details and the latest Call Depth Tracking v2 patches for the Linux kernel can be found via this mailing list thread.