TDBv0.2: Cache background revalidation and eviction #515

@krizhanovsky

This task is crucial: even with #2074 we can barely handle the case when web content doesn't fit in RAM. The failure in #2557 was also caused by cache overflow.

Depends on #1869

Scope

The tfw_cache_mgr thread must traverse the Web cache and evict stale records under memory pressure, or revalidate them otherwise. The thread must be accurately scheduled and throttled so that it does not impact system performance while still freeing the required memory efficiently. #500 must be kept in mind as well.
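A minimal sketch of the evict-or-revalidate decision with a per-wake-up scan budget for throttling. The record layout, `rec_action()` and `mgr_scan()` are hypothetical names for illustration, not Tempesta code; a real manager thread would walk TDB rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cached record; not the real TDB layout. */
struct cache_rec {
	uint64_t created;   /* creation time, seconds */
	uint64_t lifetime;  /* freshness lifetime, seconds */
};

enum rec_action { REC_KEEP, REC_REVALIDATE, REC_EVICT };

/* Fresh records are kept; stale ones are evicted under memory
 * pressure and revalidated otherwise. */
static enum rec_action
rec_action(const struct cache_rec *r, uint64_t now, bool mem_pressure)
{
	uint64_t age;

	assert(r);
	age = now > r->created ? now - r->created : 0;
	if (age < r->lifetime)
		return REC_KEEP;
	return mem_pressure ? REC_EVICT : REC_REVALIDATE;
}

/* Process at most `budget` records starting at `pos`, so that one
 * wake-up cannot monopolize the CPU. Returns the position to resume
 * from on the next wake-up. Here the actions are only counted. */
static size_t
mgr_scan(const struct cache_rec *recs, size_t n, size_t pos, size_t budget,
	 uint64_t now, bool mem_pressure,
	 unsigned *evicted, unsigned *revalidated)
{
	for (; pos < n && budget; ++pos, --budget) {
		switch (rec_action(&recs[pos], now, mem_pressure)) {
		case REC_EVICT:
			++*evicted;
			break;
		case REC_REVALIDATE:
			++*revalidated;
			break;
		case REC_KEEP:
			break;
		}
	}
	return pos;
}
```

The caller keeps the returned position between wake-ups and may enlarge the budget when memory pressure grows.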

Validation logic is defined by RFC 7234 4.3 and requires implementation of conditional requests.

Keep in mind the DoS attack from #520. The following items linked with #516 (TDB v0.3) must be implemented:

  • Revalidate cache entries on a configured per-vhost timeout (like S3 lifecycle). This logic is tricky and should live on the cache.c side, either in the tfw_cache_mgr thread or in a callback; see TDB eviction and stale response processing #2074 (comment)
  • TDB tables must be dynamically extensible and their size should not be restricted to a power of 2, e.g. 7GB should be fine. See comment Huge pages allocation issue and the crash on cache sizes >=2GB #1515 (comment)
  • UPDATE and DELETE operators must be implemented. Probably the lock-free index should be immutable, with deletion implemented using tombstones and updates being just copies of the data plus a tombstone for the old data.
  • Properly implement the reinsert and lookup & insert (tdb_rec_get_alloc()) logic from Temporal client accounting #1115 (temporarily implemented in Temporal client accounting #1178).
  • optional cache warmup (also see Huge pages allocation issue and the crash on cache sizes >=2GB #1515)
  • Race-free interface for large insertions. E.g. __cache_add_node() creates a TDB entry, which immediately becomes visible to other threads, and only later tfw_cache_copy_resp() inserts the actual data, so concurrent threads may get incomplete or corrupted data. This can be done in 2 phases (soft updates): (1) allocate space in the TDB data area and (2) perform the actual insert (index update) to link the data. The tfw_client_obtain() modifications from Temporal client accounting #1178, the similar HTTP sessions storage (Sticky cookies load balancing #685), and __cache_add_node() must be changed to use the soft updates. This also implies some versioning: while a softirq is sending data for the current cached object (probably very slowly, with Redesign of TCP synchronous sending and data caching #391 .1 in mind), the object may become stale and/or be replaced by a new version, so new scans must fetch only the new version, while the old version must reside in TDB until it is fully transmitted, and only then should it be evicted.
  • Support/fix constant address placement for small records, see Temporal client accounting #1178 (comment)
  • Generic item removal. On removal the HTrie must be shrunk. With record locking and/or reference counting, tombstone removal should probably be implemented.
  • There must be locks or reference counters for the stored entries so that entries being processed are not deleted (see e.g. Servicing stale cached responses and immediate purging #522)
  • A custom eviction strategy must be implemented (e.g. the Web cache should register its callbacks for freshness calculation) such that different tables can use different eviction strategies or no eviction at all. Custom triggers must be supported, e.g. the TLS cache should be able to specify the maximum number of stored sessions as 50 (see ssl_cache.c).
  • besides creation timestamp for eviction, entries must have minimum and maximum lifecycle honored by the eviction strategy
  • number of memset() calls must be reduced.
  • Fix data persistency on clean restart. Introduce non-persistent tables: the sessions (Sticky cookies load balancing #685) and client (Temporal client accounting #1115) tables should be non-persistent. _Probably for Beta we should go with non-persistent tables only (as for now). We definitely should have a configuration option to choose whether to read the full database into RAM on start, just throw out all the data, or do it in the background (for TDBv0.3: transactions, indexes, durability #516)._
  • Web cache data for different vhosts must be stored in different tables to prevent full path collisions and improve concurrency and security (tables separation plus tdbfs user/group access control instead of chroot isolation).
  • The current TDB table size maximum is 128GB, which is too small for the web cache on modern hardware. This is the subject of NUMA-aware cache modes #400
  • At the moment we have a very limited number of tables, but we might need to scale to thousands of tables, e.g. for logging Fast access logging for analytics #537
  • We need to create Tempesta DB tables at runtime (e.g. to reconfigure a hash table for a bots protection algorithm) to load Tempesta Language HTTPtables migration to eBPF #102 scripts at runtime.
  • cache tables must be per-vhost to get rid of unnecessary contention and index splitting for different vhosts. Important for the CDN use case. However, large tables still must be supported for single resource cases.
  • Avoid the __cache_entry_size() call, which introduces an extra response traversal. It seems we can just allocate new TDB data blocks and later reuse them if we have extra space, or just ignore the tail if it's unusable.
  • Consider sending cached content as compound pages, just like high-speed NICs do (e.g. see discussions in Random memory corruptions when modifying HTTP header data in an SKB. #447)
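The two-phase (soft update) insertion, reference counting and versioning items from the list above can be sketched roughly as follows. This is a user-space illustration with C11 atomics and a tiny spinlock standing in for whatever synchronization TDB would really use; `struct rec`, `struct slot` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one reference is owned by the index slot,
 * plus one per reader that is still transmitting it. */
struct rec {
	atomic_int refcnt;
	size_t len;
	char data[];
};

/* Hypothetical index slot; the flag is a minimal spinlock. */
struct slot {
	atomic_flag lock;
	struct rec *cur;
};

static void slot_lock(struct slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Phase 1: allocate and fill the data privately. Nothing is linked
 * into the index yet, so no reader can see a partial record. */
static struct rec *rec_prepare(const char *data, size_t len)
{
	struct rec *r = malloc(sizeof(*r) + len);

	if (!r)
		return NULL;
	atomic_init(&r->refcnt, 1);
	r->len = len;
	memcpy(r->data, data, len);
	return r;
}

/* Drop one reference; the last holder frees the record. */
static void rec_put(struct rec *r)
{
	if (atomic_fetch_sub(&r->refcnt, 1) == 1)
		free(r);
}

/* Phase 2: link the complete record (or NULL to delete). The old
 * version loses only the index reference, so it stays alive until
 * every reader has called rec_put(). */
static void slot_publish(struct slot *s, struct rec *r)
{
	struct rec *old;

	slot_lock(s);
	old = s->cur;
	s->cur = r;
	slot_unlock(s);
	if (old)
		rec_put(old);
}

/* Reader: take a reference on the current version, if any. */
static struct rec *slot_get(struct slot *s)
{
	struct rec *r;

	slot_lock(s);
	r = s->cur;
	if (r)
		atomic_fetch_add(&r->refcnt, 1);
	slot_unlock(s);
	return r;
}
```

New lookups see only the latest published version, while a slow sender keeps its old version alive through its own reference. A lock-free index would need a different reclamation scheme in place of the spinlock.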

The task is required to fix #803.

UPD. Since filtering (#731) and QoS (#488) also require eviction, the job should be done in a tdb_mgr thread instead.

UPD. TDB was designed to provide zero-copy access to stored data, so that a cached response body can be sent directly to a socket. This property imposed several design limitations and introduced many difficulties. However, with TLS we always have to copy data, so the TDB design can be significantly simplified by copying. Hence this task depends on #634.

Cache eviction

While CART is a well-known, good adaptive replacement algorithm, there are a number of caching algorithms based on machine learning that provide a much better cache hit ratio. See for example the survey and Cacheus. Some of these algorithms require access to columnar storage for statistics (a common practice in CDNs). Also consider Apache ATS CLFUS (Clocked Least Frequently Used by Size).

At least some interface for a user-space algorithm is required. Probably just CART with some weights, where the weights are loaded from user space into the kernel, would be enough.
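To illustrate the idea of weights supplied from user space, here is a deliberately simple weighted victim selection. It is not CART; it only shows how externally loaded weights could steer eviction. Integer arithmetic is used on the assumption that the code would run in the kernel. All names and fields are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical weights, e.g. loaded from user space. */
struct evict_weights {
	int32_t w_hits;  /* reward for frequently hit entries */
	int32_t w_age;   /* penalty per second since last access */
	int32_t w_size;  /* penalty per KB occupied */
};

/* Hypothetical per-entry statistics. */
struct cand {
	uint32_t hits;
	uint32_t age_sec;
	uint32_t size_kb;
};

/* Higher score means the entry is more worth keeping. */
static int64_t
keep_score(const struct evict_weights *w, const struct cand *c)
{
	return (int64_t)w->w_hits * c->hits
	       - (int64_t)w->w_age * c->age_sec
	       - (int64_t)w->w_size * c->size_kb;
}

/* Return the index of the entry with the lowest score; on ties the
 * first such entry wins. */
static size_t
pick_victim(const struct evict_weights *w, const struct cand *c, size_t n)
{
	size_t i, victim = 0;
	int64_t s, min;

	assert(w && c && n > 0);
	min = keep_score(w, &c[0]);
	for (i = 1; i < n; ++i) {
		s = keep_score(w, &c[i]);
		if (s < min) {
			min = s;
			victim = i;
		}
	}
	return victim;
}
```

Changing only the weights switches the behavior, for example from recency-driven to size-driven eviction, without touching the kernel-side code.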

The cache must implement per-vhost eviction strategies and space quotas to provide caching QoS for CDN cases. Probably 2-layer quotas are required to guard against poor configuration, e.g. a bad Vary specification on the application side, which may take too much space (linked with #733). Different eviction strategies are required to handle e.g. chunks of live streams (huge data volume, outdated chunks are removed immediately) and rarely updated web content like CSS (stale entries may be served).
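One possible reading of the 2-layer quotas is a per-vhost limit plus a per-resource limit (for instance, all Vary variants of one URI). The sketch below assumes exactly that split; `struct quota`, `quota_charge()` and `quota_release()` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quota counter; invariant: used <= limit. */
struct quota {
	uint64_t used;
	uint64_t limit;
};

enum quota_res { QUOTA_OK, QUOTA_RESOURCE_FULL, QUOTA_VHOST_FULL };

static bool
quota_fits(const struct quota *q, uint64_t bytes)
{
	assert(q->used <= q->limit);
	return bytes <= q->limit - q->used;
}

/* Charge both layers or neither. The result tells the caller which
 * scope to evict from: one resource's variants or the whole vhost. */
static enum quota_res
quota_charge(struct quota *vhost, struct quota *resource, uint64_t bytes)
{
	if (!quota_fits(resource, bytes))
		return QUOTA_RESOURCE_FULL;
	if (!quota_fits(vhost, bytes))
		return QUOTA_VHOST_FULL;
	resource->used += bytes;
	vhost->used += bytes;
	return QUOTA_OK;
}

/* Return space to both layers when an entry is evicted. */
static void
quota_release(struct quota *vhost, struct quota *resource, uint64_t bytes)
{
	assert(bytes <= resource->used && bytes <= vhost->used);
	resource->used -= bytes;
	vhost->used -= bytes;
}
```

With this split, a resource that produces too many variants exhausts only its own quota and cannot push the rest of the vhost's content out of the cache.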

It must be possible to 'lock' some records in evictable data sets (see #858 and #471).

Purging

Once this feature is implemented, we should be able to update site content normally, without restarting Tempesta or leaking memory. It's hard to track which pages appeared and which were deleted during a site content update, so in this task we need:

  1. full web content purging;
  2. regular expression purging, e.g. /foo/*.php or /foo/bar/*;
  3. immediate (purge in original [Cache] purging #501) strategy for the purging (we still need the mode that leaves stale responses in the cache for Servicing stale cached responses and immediate purging #522). Done in TDB eviction and stale response processing #2074.
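The pattern-based purging above can be sketched in user space with POSIX `fnmatch()`, since the example patterns look like shell wildcards. The entry layout, `purge()` and the `PURGE_*` mode names are assumptions for illustration; kernel code would need a different matcher.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

enum purge_mode {
	PURGE_INVALIDATE, /* mark as stale, keep for stale serving */
	PURGE_IMMEDIATE   /* drop the entry right away */
};

/* Hypothetical cache entry reduced to what purging needs. */
struct entry {
	const char *uri;
	bool present;
	bool stale;
};

/* Apply the purge to every present entry whose URI matches the
 * pattern. With flags == 0, '*' also matches '/', so "/foo/*.php"
 * covers nested paths as well. Returns the number of entries hit. */
static size_t
purge(struct entry *e, size_t n, const char *pattern, enum purge_mode mode)
{
	size_t i, hit = 0;

	assert(e && pattern);
	for (i = 0; i < n; ++i) {
		if (!e[i].present || fnmatch(pattern, e[i].uri, 0) != 0)
			continue;
		if (mode == PURGE_IMMEDIATE)
			e[i].present = false;
		else
			e[i].stale = true;
		++hit;
	}
	return hit;
}
```

Full content purging is then just the pattern `*`.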

Documentation

The https://github.com/tempesta-tech/tempesta/wiki/Caching-Responses#manual-cache-purging wiki page needs to be updated.

Testing

  • Measure throughput on large cached objects and compare with Nginx
  • Web content purging with invalidate and immediate strategies
  • Test on web cache larger than 4GB in 1 and 2 NUMA nodes with cache modes 1 and 2.

Metadata

Labels

TDB (Tempesta DB module and related issues), cache, crucial
