Ribbon Filters - Paul's Notes

rel:: [[Databases MOC#probabilistic]] [[probabilistic]] [[datastructures]] # Ribbon Filters - [paper link](x-devonthink-item://B3442887-DB68-458E-8E67-0717082BE52B) - [fb engineering post](https://engineering.fb.com/2021/07/09/data-infrastructure/ribbon-filter/) [[probabilistic]] filter data structures trade off false positives (over-approximate) set queries for space savings. Ribbon filters are in the same class as - [[Bloom Filters]] - [Xor filters](x-devonthink-item://81C55E20-602D-43BA-9A49-556C6507FDD3) Ribbon filters have ~30% space savings for 3-4x CPU compared to [Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) while retaining Bloom-like configurability. ## [[RocksDB]] from [Ribbon Filter | RocksDB](x-devonthink-item://BD20F402-1F44-4347-9E44-6800599DC0A6) ### when to use in place of Bloom filters > filters that live in memory for more than ==an hour== should be Ribbon, and filters that live less than an hour should be Bloom > Write-heavy RocksDB loads are often backed by flash storage, which has some specified write endurance for its intended lifetime. This can be expressed as device writes per day (DWPD), and supported DWPD is typically < 10.0 even for high end devices (excluding NVRAM). Roughly speaking, the DB would need to be writing at a rate of 20+ DWPD for data to have an average lifetime of less than one hour. Thus, ==unless you are prematurely burning out your flash or massively under-utilizing available storage, using the Ribbon filter has the better cost profile on average.== ### LSM levels > LSM levels give us very strong data lifetime hints. Data in L0 might live for minutes or a small number of hours. Data in Lmax might live for days or weeks. So even if Ribbon filters weren’t the best choice on average for a workload, they almost certainly make sense for the larger, longer-lived levels of the LSM. As of RocksDB 6.24, you can specify a minimum LSM level for Ribbon filters with `NewRibbonFilterPolicy`, and earlier levels will use Bloom filters. ### Generic Recommendation > If using `NewBloomFilterPolicy(bpk)` for a large persistent DB using compression, try using `NewRibbonFilterPolicy(bpk)` instead, which will generate Ribbon filters during compaction and Bloom filters for flush, both with the same FP rate as the old setting. Once new SST files are generated under the new policy, this should free up some memory for more caching without much effect on burst or sustained write speed. Both kinds of filters can be read under either policy, so there’s always an option to adjust settings or gracefully roll back to using Bloom filter only (keeping in mind that SST files must be replaced to see effect of that change).