Maintaining snapshot consistency on eviction¶
When removing a
rows_entry (=r1), we need to record the fact that the range to which this row belongs is now discontinuous. For a single
mutation_partition that would be done by going to the successor of r1 (=r2) and setting its
continuous flag to
false, which would indicate that the range between r1’s predecessor and r2 is incomplete. With many partition versions, in order for the snapshot’s logical
mutation_partition to remain correct, special constraints on version contents and merging rules must apply as described below.
rows_entry is selected for eviction, we only have a reference to that object. We want to be able to evict it without having to update entries in other versions, to avoid associated overheads of locating sibling versions and lookups inside them. To support that, each
partition_version has its own continuity, fully specified in that version, independent of continuity of other versions. Row continuity of the snapshot is a union of continuous sets from all versions. This is the independent-continuity rule.
We also need continuous row intervals in different versions to be non-overlapping, except for points corresponding to complete rows. A row may overlap with another row, in which case the row from the later version completely overrides the one from the older version. A later version may have a row which falls into a continuous interval in the older version. A newer version cannot have a continuous interval with no rows which covers a row in the older version. This is the no-overlap rule. We make use of this rule to make calculating the union of intervals easier on version merge. Merging of newer version into older has time complexity of only O(size(new_version)) then. If we allowed for overlaps, we might have to walk over all entries in the old version to mark ranges between them as continuous.
Another rule is that row entries in older versions are evicted first, before any row in the newer version is evicted. This is needed so that we don’t appear to loose writes in case we have the same row in more than one version, and the one from the newer version gets evicted first. To achieve this, we only move to the front of the LRU (marking as more recently used, evicted last) row entries which belong to the latest
partition_version in a given
partition_entry. This implies that detached snapshots never update the LRU.
The above rules have consequences for range population. When populating a discontinuous range which is adjacent to an existing row entry in an older version, we need to insert an entry for the bound (due to the independent-continuity rule), and need to satisfy the no-overlap rule in one of the following ways:
copy complete row entry from older version into the latest version
insert a dummy entry in the latest version for position before(key).
add support for incomplete row entries, and insert an incomplete entry in the latest version
Options (3) and (2) are more efficient than (1), but (1) is the simplest to implement, so it was chosen. With option (2) we have an additional problem of cleaning up the extra dummy entries when the versions are finally merged. Option (3) makes cache reader more complicated.
partition_version always has a dummy entry at
position_in_partition::after_all_clustering_rows(), so that its row range can be marked as fully discontinuous when all of its rows get evicted. Note that we can’t remove fully evicted non-latest versions, because they may contain range tombstones and static row versions, which are needed to calculate snapshot’s view on those elements. We can’t merge them into newer versions in reclamation context due to no-allocation requirement, and because they could be referenced by snapshots.