Does HTM obviate the need for crafty lock-free index designs?
How does HTM differ from lock-free index designs?
Given that lock-free designs are still relevant, can HTM help simplify lock-free design techniques while maintaining good performance?
ease the burden (a tedious and error prone task leading to deadlocks and race conditions) by delegating conflict detection and resolution from the developer to the system.
a best-effort model, optimistic concurrency.
store transaction buffers and provide isolation.
cache coherence protocol
detect conflicting transactional accesses.
read and write set of a transaction must fit in cache in order for it to be executed
limitation of CPU.
hardware events will abort a transaction.
the work of conflict detection
it is usually done at the granularity of a cache line. This may lead to cases of false sharing where aborts occur due to threads accessing and modifying separate items on the same cache line.
最简单的一种，是transaction不成功的时候，就改用lock，这个技术叫做lock elision；它将一个lock保护的区域作为transaction来执行，仅当transaction 没有成功commit的时候，它才falls back为block on the lock。
lock elision将MT作为一个fast path，slow path就是lock (a simple coarse grain locks).
In lock elision, the lock word needs to be included in the read set of a transaction, so that the transaction aborts when another thread acquires the lock (thus causing a conflict). Hence, once a thread resorts to non-transactional execution by taking the lock, all other concurrently executing transactions will abort, stalling overall progress.
the support of transactional memory in Intel starts from Haswell.
According to Intel, there is no guarantee that a transaction
will eventually succeed even if it is infinitely retried.
Hardware Lock Elision (HLE)
Restricted Transactional Memory (RTM)
- does not contain the fine-grained locking techniques and concurrency protocols
a mapping table that maps logical page identifiers (LPIDs) to virtual addresses
- delta record
- pages consolidation (to get search-optimized page)
- breaks an SMO into a sequence of atomic steps
for read operation, HTM provides high throughput with little effort.
such a workload in experiment 1 do not hold in general.
capacity limits (for Haswell’s CPU)
L1 cache is 32KB. no transaction can write more than can fit in L1.
associativity (for Haswell’s CPU)
hyper-threading (for Haswell’s CPU)
L1 cache and other resources is shared among hardware threads.
many of the properties that determine the HTM abort rate for a given tree may not be known until runtime. A tree’s key size, payload size, total size, and address access patterns all affect performance.
even trees with relatively small keys and payloads cannot always parallelize. With Haswell’s HTM almost all transactions abort with payloads larger than a few kilobytes.
two other problematic ways
speculation is not free
When a transaction falls back and acquires the lock, all other transactions in the critical section abort and cannot restart until the lock is released. The effect is that execution is fully serialized until the lock is released – even if the other transactions operate on non conflicting cache lines.
- mitigate the lemming effect
have transactions retry more than once
- cost of retrying a contentious transaction < serializing execution*
- mitigate the lemming effect
the workload skew: for the payload, the bigger the skew is, the latter it is updated at the end of a tree traversal operation, and the latter it is within the transaction.
The results show that as workload skew increases the performance of lock-elision drops sharply. At some point lock-elision performs even worse than spin-locks. Increasing the number of transactional attempts delays the performance cliff, and leads to a more graceful degradation of performance as skew increases.
for a set of concurrent transactions, it may be possible that none of them commit: a situation worse than using a spinlock.
The optimal number of transactional attempts depends on workloads and varies highly. chooses the number of transactional attempts dynamically depending on the workload.
lock-coupling is one of the most widely used approaches to achieve efficient fine-grained locking on B-Trees.
In lock-coupling a pair of locks are held as a worker traverses pages: one on a “source” page and another on a “target” page.
Because of restrictive support of lock-coupling in Haswell’s HTM interface, it is not possible on Haswell’s CPU.
for high contention workloads the reader throughput that the cpp-btree can sustain begins to drop.
a garbage-collection mechanism for memory safety
- epoch mechanism used in Bw-tree，类似引用计数。
additional indirection for atomic updates
lockfreedom influences the in-memory layout of structures*
cost of copy-on-write (when using paged copy-on-write semantics)
- the cost may be influenced by page size, access skew, the uniformity of payload sizes.
These costs are highly intertwined: tradeoffs for each influence the cost of the others
it is very difficult to architect and build complex lock-free
- MW-CAS同时操作的word数受限于L1 cache的大小。过多的word，则无法用HTM来实现MW-CAS。Bw-tree可以。
- Lock elision in the GNU C library
- Transactional Synchronization Extensions
- CPU Cache and Memory Ordering, 何登成
- This is a basic Cache Tutorial
- Computer Systems A Programmer’s Perspective, Randal E. Bryant, David R. O’Hallaron
- To Lock, Swap, or Elide: On the Interplay of Hardware Transactional Memory and Lock Free Indexing, Darko Makreshanski, Justin Levandoski, Ryan Stutsman