VLDB Paper reviews

What's the new sexy in the DB world?

Posted by Ricky on January 11, 2021

1. [Best Paper] Opportunities for Optimism in Contended Main-Memory Multicore Transactions

Wow, this is cool:

We evaluate several concurrency control designs under varying contention and varying workloads, including TPC- C, and find that implementation choices unrelated to concurrency control may explain much of OCC’s previously-reported degra- dation. When these implementation choices are made sensibly, OCC performance does not collapse on high-contention TPC-C. We also present two optimization techniques, commit-time updates and timestamp splitting, that can dramatically improve the high- contention performance of both OCC and MVCC. Though these techniques are known, we apply them in a new context and high- light their potency: when combined, they lead to performance gains of 3.4× for MVCC and 3.6× for OCC in a TPC-C workload.

The CC protocol investigated:

OSTO(OCC variant)
  • scalable timestamp allocator (like Silo)
  • RCU style GC
MSTO(MVCC variant)
  • Follows Cicada’s design
  • Each version has a write timestamp (wts), read timestamp (rts) and a state (PENDIGN/COMMIT/ABORT)
  • read-only TXNs always commit
  • Commit protocol (I think this is just the standard?) :
    • Bump the global wirte timestamp
    • 1: append to all write versions chain of a PENDING version (to lock it)
      • Abort when needed
    • 2: checks read set
    • 3: commit

The basis factors

image-20210103120341227

Contention regulation
  • avoid cache line thrashing
  • Exponential backoff
Memory Allocation
  • the default glibc sucks
    • bottleneck at low contention
  • use a scalable memory allocator like rpmalloc
    • 1.6x better at high contention
Abort Mechanism
  • Exception style is bad (because global lock is acquired)
  • Use checked-return is better
Index type
  • support hash table if can

  • support contention aware index:

    • sometimes, certain protection (like against phantoms) create additional conflict, which could be avoided given the workload or with the design of the index. Example of TPC-C workload (new order vs delivery):

      image-20210103121128646

Commit time update:

  • Some updates do not expose dependency, aka, its execution could be commutative with all operations after it. E.g.:
    • Its value is not read after update as either data/control dependency
  • These updates (usually, read-modify-write) could then be delayed to commit time
  • Use a functional object (Updater) that captures the update logic, and only execute it during execution time.

Timestamp splitting

Not a really fancy idea though:

  • have multiple (more than 1) timestamps for the record
  • some of the fields are never modified (so they could always be read safely), and they don’t need to see frequent updates to the timestamp that might trigger conflict as a result of updates to other hot fields.

Pushing Data-Induced Predicates Through Joins in Big-Data Clusters

Video: https://www.youtube.com/watch?v=XrtYA1UadGs&ab_channel=VLDB2020

Motivation
  • predicate push-down has been proven to be a great techiqnue for query optimization, enabling partition elimination.

    • Datatable is usually partiioned, with each partition having some sort of statistics (like min-max)
    • Knowing the predicate, one can skip scanning some of the partitions by looking at the stats
  • However, if the predicate is not on a join column, the predicate could not be pushed down to the other table, for example:

    • 1
      
      SELECT * FROM A JOIN B ON A.y = B.y WHERE B.x > 1
      
    • The predicate on B.x> 1 could not be pushed down to A (sucks if A is the larger table here especially)
Solution
  • By looking at the stats of both tables A, B, we can find all paritions which have its x fulfilling the predicate.
  • For all these paritions, we can then merge the stats of them on column y, (the join column, which doesn’t have any predicate), and from the merged stats, one can derive a predicate now on the join column, which could be pushed down (or propogate to other parts of the query tree given relationaional algebra)
More

The paper talks about other crucial things like:

  • How much could it filter?
    • Well, this really depends, one of the example the paper gave could filter as much as 95% partitions.
  • How to do it?
    • This is hard, it also is coupled with query plan search scheduling

2. [Honorable Mention] MyRocks

Motivation:

  • B tree based (InnoDB) has issues:

    • Large space fragmentation from the the B-index (up to 50% space usage)
    • Limited compression (storage capacity becomes the bottleneck with new storage technology)
    • Large write amplification on flash storage, which is bad.
    • Space overhead for transactional handing( 13 byte for each row)
  • RocksDB is good because:

    • L-tree based buffer writes, and much better write amplification

      LSM-tree is more effective because it avoids in-place updates to pages, which eventually caused page writes with small updates in UDB. Updates to a LSM-tree are batched and when they are written out, pages only contain updated entries, except for the last sorted run. When updates are finally applied to the last sorted run, lots of updates are already accumulated, so that a good percentage of page would be newly updated data.

    • Works well for compression. TXN related sequence number could be compressed away

RocksDB not neccessrily better:
  • On read performance:

    We picked LSM-tree over B-Tree to save space at the expense of read performance. For read intensive databases where all data resides in memory, MyRocks was hardly better than InnoDB and the space savings benefit was minimal

MyRocks Design:

  • RocksDB has more key comparisons => more sensitive to key comparison operatons
  • reverse key scan is slow for many reasons => implement reverse key comparator
  • stats estimation on a specific range for the optimizer costs CPU => skips esimation OR skips partially filled files to speed up.
  • improve scan speed => prefix bloom filter (does this sorted file contain any key with matching prefix?)
  • Too many tombstones in RocksDB because index update to the MyRocks will change keys in the RocksDB, which result in a Delete to the ROcksDB, that is implemented with tombstones (only remove at the last level of the LSM tree).
    • => this means scan needs to process lots of tombstones entries as well
    • => introduce SimpleDelete which works under the assumption that no other PUT will happen to the same key (which is the case for the index update scenario)

3. TiDB

The HTAP world.

Key idea:

  • Storage layer built on top of Raft protocol to ensure availability across multiple replicas. Its multi-Raft protocol extends the orignal Raft protocol with a few optimizations, such as -
    • parallel log replication with logs appending at the leader
    • decoupled apply thread from the main log handling thread
    • optimized reads such as read index (heartbeat request from leader to followers to ensure the local to-be-read log implies it is still a leader), lease read (reads could be served automatically by the leader within the lease, and the follow not starting election during the lease), follower read( allows followers to serve reads by querying the leader if its local read is committed on the leader)
    • regions split and merges
  • Introduce a learner node that is responsible for converting row-based tuples into columnar storage for OLAP queries.
Row to Column

Each learner will :

  1. scan through the logs, compact it (ignore those rollback logs)
  2. decode row-based tuples (remove the timestamps and transactional information)
  3. convert rows into columns (with consult to the schemas)

To keep the schemas updated:

  1. periodic schema sycing (regular sync)
  2. on-deman syncing (compulsive sync): when the decoded tuple has different number of columns as the cached schema (or any anomalies)

Data structure for columnar stores - Delta Tree

It uses the Delta tree to store write and serve reads.

Write: appending to the delta stores simply add the new deltas into a delta file, and also append to a B+ tree index for the updates

Merge: the deltas need to be merged periodically (with the help of B+ tree)

Read: read the merged version

Comparison with LSM:

(on Sysbench of a select count query while data is being updated)

  • better read latency
  • higher write amplification

Comparison with MemSQL

The performance evaluation of TiDB is:

image-20210111145604822

Which shows very little performance degradation of the transactional workload when more analytical clients are added. (a,b)

While for the MemSQL:

image-20210111145816948

The degradation to the transactional processing throughput is really obvious with more AP added.

Notes:

One thing I did not really understand after reading the paper is: **Why use a B+ tree index though? ** Seems the Delta tree data structures could just be used with a hash index. Does it need range query?

4. DIAMetrics: Benchmarking Query Engines at Scale

It is: an architecture for benchmarking that is capable of generating indicative benchmark work- loads over production deployments, executing them, and measuring a system’s performance on that workload

Goals:

  • (a) deliver a general so- lution that is capable of benchmarking end-to-end a variety of query engines;

  • (b) support every step of the benchmark- ing life-cycle;

  • (c) provide insights with respect to sys- tem performance and efficiency. It is a one-stop tool for all benchmarking needs including complex tasks such as bench- mark generation, execution, and result visualization.

Previous Issues:

  • the benchmark workload is statically defined
  • the system being benchmarked assumes complete con- trol of the entire data management stack

Solution:

image-20210112122554028

Wrokload Extractor: from query log generates features for each query, and the extractor will summarize them into some representative workloads (with “good” representative and “good” coverage of the original workload)

Scrambler: Anonymize the data with various tricks

Data Mover: Transform data into differnt formats for different query engines

Workload Runner: Handles scheduling and running of the workload with different configurations optios (benchmark coinfig, workload config)

My 5 cents:

I think the more interesting part of the paper is its Workload extractor , the rest seems rather standard practice. (The paper doesn’t seem to go into details how different components could be “pluggable” with use cases. So not sure how big the impact will be for agents other than Google iteself)