1.3 A Brief History of the TiDB Database Platform

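Since the v1.0.0 GA release, TiDB has been able to scale out freely at both the compute and storage layers, has been compatible with MySQL syntax and protocol, and has offered strongly consistent, truly distributed transactions. Today TiDB can be called a real HTAP system: no ETL tool is needed to transform data, and report queries can be run conveniently while the system is serving OLTP workloads.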


Shuaipeng Yu (于帅鹏)

Product Specialist & Effic Team Leader

Before we begin

  • Goal: Introduce a brief history of TiDB
  • Outline:
    • Ancient days of TiDB
    • TiDB with TiSpark
    • TiDB with TiFlash

Ancient days of TiDB

  • Inspired by Google Spanner, we built TiDB
  • As of the 1.0.0 GA release, TiDB offered
    • A freely scalable (compute and storage) database
    • Compatibility with MySQL syntax and protocol
    • A transparent data splitting policy: range splitting
    • Strongly consistent, distributed transaction support

Freely scalable compute and storage, compatible with MySQL syntax.
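To make the range-splitting bullet concrete, here is a minimal sketch of the general idea, not TiDB's actual implementation. Keys are kept in sorted, contiguous ranges, and a range is split in two once it grows past a threshold. The class name `RangeRouter`, the `max_keys_per_region` threshold, and the split-at-the-middle rule are all assumptions made for illustration.

```python
from bisect import bisect_right


class RangeRouter:
    """Toy range-based sharding: each region owns a contiguous key range."""

    def __init__(self, max_keys_per_region=4):
        # Hypothetical threshold; a real system would split on size, not key count.
        self.max_keys = max_keys_per_region
        # start_keys[i] is the inclusive lower bound of region i.
        # The empty string sorts before every other string, so region 0 starts at "".
        self.start_keys = [""]
        self.regions = [[]]  # sorted keys held by each region

    def locate(self, key):
        """Return the index of the region whose range contains key."""
        return bisect_right(self.start_keys, key) - 1

    def put(self, key):
        idx = self.locate(key)
        region = self.regions[idx]
        pos = bisect_right(region, key)
        if pos and region[pos - 1] == key:
            return  # key already present
        region.insert(pos, key)
        if len(region) > self.max_keys:
            self._split(idx)

    def _split(self, idx):
        """Split region idx at its middle key; the caller never sees this happen."""
        region = self.regions[idx]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[idx] = left
        self.regions.insert(idx + 1, right)
        self.start_keys.insert(idx + 1, right[0])


router = RangeRouter(max_keys_per_region=2)
for k in ["a", "b", "c"]:
    router.put(k)
```

The split is "transparent" in the sense that callers only ever use `put` and `locate`; which region a key lands in is decided by the router.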

TiDB Architecture - Original


Put simply, you can think of TiDB as a MySQL with unlimited capacity.

Datahub Capability - Syncer


Syncer synchronizes and consolidates data into TiDB.

Datahub Capability - Coprocessor


Coprocessor is used to aggregate data.
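As a rough illustration of aggregation pushed down to the storage layer, the sketch below assumes each storage node computes a partial aggregate over its own rows and the SQL layer merges the partials. The function names and row format are hypothetical and are not the real Coprocessor interface.

```python
from collections import defaultdict


def coprocessor_partial_agg(rows, group_col, value_col):
    """Runs on one storage node: per-group [sum, count] over local rows only."""
    partial = defaultdict(lambda: [0, 0])
    for row in rows:
        acc = partial[row[group_col]]
        acc[0] += row[value_col]
        acc[1] += 1
    return dict(partial)


def merge_partials(partials):
    """Runs on the SQL layer: combine the partial results from all nodes."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            merged[group][0] += s
            merged[group][1] += c
    # AVG is derived from the merged SUM and COUNT, never by averaging averages.
    return {g: {"sum": s, "count": c, "avg": s / c} for g, (s, c) in merged.items()}


node_a = [{"city": "bj", "amount": 10}, {"city": "sh", "amount": 4}]
node_b = [{"city": "bj", "amount": 20}]
result = merge_partials([
    coprocessor_partial_agg(node_a, "city", "amount"),
    coprocessor_partial_agg(node_b, "city", "amount"),
])
```

Only the small per-group partials cross the network, instead of every raw row.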

Datahub Capability

  • TiDB is ideal for Datahub scenarios
  • Protocol-compatible, so MySQL production databases are easy to synchronize
  • Transparent queries across shards
  • Data lands in real time
  • Massive storage allows multiple data sources to converge
  • Standby replica and Datahub analysis, two in one

One year later

  • TP Scenario
    • Customer: There are still some problems, but... it's really good!
  • AP Scenario
    • Client 1: Complex statements are so slow!
    • Client 2: Always OOM!
    • Client 3: Can't integrate with big data platform!


  • Either combine TiDB and TiKV
    • Completely refactor the optimizer and executor to build an MPP engine
    • High risk and long duration
  • Or,
    • Adopt an open-source distributed computing framework
    • High maturity and a wide user base



TiSpark was introduced to extend TiDB's single-node computing capability into multi-node parallel computing.


  • Spark helps us do distributed computing
    • A mature distributed computing platform
    • Faster(?), more stable(?)
    • Fully inherits the Apache Spark ecosystem
    • Painless integration into the big data ecosystem
    • Scripting, Python, R, Apache Zeppelin, Hadoop...


  • Apache Spark can only provide low-concurrency computation
    • Heavy computational model and high resource consumption
    • Better suited to reports and heavyweight ad hoc queries
  • In many situations users still need high-concurrency, small- to medium-scale AP capability
    • Complex query capability with low resource consumption
    • TiDB is far simpler to maintain than Spark clusters


  • We were also working on various optimizations around standalone TiDB
    • Smarter, more efficient, and faster in small- to medium-scale scenarios
  • Optimizer
    • Basic optimizer? --> RBO + CBO Optimizer --> Cascades Optimizer(WIP)
  • Executor
    • Classic Volcano Model --> Batch Execution --> Vectorized Execution
    • Better Concurrency and Pipeline
  • Partition tables, Index Merge, etc.
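The executor evolution listed above can be sketched in a few lines. In the classic Volcano model, each operator pulls one row at a time from its child. In batch or vectorized execution, operators exchange chunks of column values and loop over one column at a time. The code below is a toy illustration under those assumptions. The function names are hypothetical and it does not reflect TiDB's executor code.

```python
def scan_rows(table):
    """Volcano-style scan: yields one row per call."""
    for row in table:
        yield row


def filter_rows(child, pred):
    """Volcano-style filter: evaluates the predicate row by row."""
    for row in child:
        if pred(row):
            yield row


def scan_batches(columns, batch_size):
    """Batch-style scan: yields chunks laid out as {column name: list of values}."""
    if not columns:
        return
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}


def filter_batches(child, col, pred):
    """Batch-style filter: one tight loop over a single column per chunk."""
    for batch in child:
        keep = [i for i, v in enumerate(batch[col]) if pred(v)]
        if keep:
            yield {name: [vals[i] for i in keep] for name, vals in batch.items()}


table = [{"id": 1, "v": 5}, {"id": 2, "v": 15}, {"id": 3, "v": 25}]
columns = {"id": [1, 2, 3], "v": [5, 15, 25]}

row_result = list(filter_rows(scan_rows(table), lambda r: r["v"] > 10))
batch_result = list(filter_batches(scan_batches(columns, 2), "v", lambda v: v > 10))
```

Both pipelines return the same rows. The batch version makes far fewer operator-to-operator calls, which is where much of the per-row overhead in the Volcano model comes from.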

TiDB 1.0 vs 2.0


TiDB 2.0 vs 2.1


Core conflict

  • At this point, we were still left with two core conflicts.
    • Row storage is not friendly to analysis scenarios
      • "How dare you call yourselves HTAP without a column store?"
    • Workload isolation is not possible
      • "I ran a query and the CPU usage was 1000%!"
      • TiSpark scenarios would be worse.

Row vs Column Storage
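A small sketch of why row storage is unfriendly to analytical queries: summing one column in a row layout reads every field of every row, while a column layout reads only the values of that column. The table contents and the "values touched" counter are made up for illustration.

```python
def to_columns(rows, names):
    """Convert a row layout (list of tuples) into a column layout (dict of lists)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_row_layout(rows, col_index):
    """Sum one column in a row store; every field of each row is read."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)  # the whole row is fetched to reach one field
        total += row[col_index]
    return total, touched


def sum_column_layout(columns, name):
    """Sum one column in a column store; only that column is read."""
    col = columns[name]
    return sum(col), len(col)


names = ["id", "name", "age"]
rows = [(1, "alice", 30), (2, "bob", 41), (3, "carol", 29)]
columns = to_columns(rows, names)

row_total, row_touched = sum_row_layout(rows, names.index("age"))
col_total, col_touched = sum_column_layout(columns, "age")
```

With three columns the row layout touches three times as many values for the same answer, and the gap widens as tables get wider.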



  • Replicate an independent set of column-store replicas via Raft Learner
    • Raft Learner provides replica synchronization with extremely low overhead
    • The Raft Learner read protocol works with MVCC to provide strongly consistent reads
  • Physical isolation via labels
    • AP and TP workloads do not affect each other
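The following toy model sketches how a learner read combined with MVCC can stay consistent even though the learner replicates asynchronously. It assumes the learner asks the leader for its current commit index, waits until it has applied the log up to that index, and then returns the newest version whose commit timestamp is not later than the read timestamp. The classes, method names, and the in-process "replication" are all hypothetical simplifications, not the TiFlash implementation.

```python
class Leader:
    """Holds the committed log as (commit_ts, key, value) entries."""

    def __init__(self):
        self.log = []

    def write(self, commit_ts, key, value):
        self.log.append((commit_ts, key, value))

    def commit_index(self):
        return len(self.log)


class Learner:
    """Non-voting replica that applies the leader's log asynchronously."""

    def __init__(self, leader):
        self.leader = leader
        self.applied_index = 0
        self.versions = {}  # key -> list of (commit_ts, value)

    def replicate(self, up_to=None):
        """Apply log entries up to the given index (default: everything committed)."""
        target = self.leader.commit_index() if up_to is None else up_to
        while self.applied_index < target:
            ts, key, value = self.leader.log[self.applied_index]
            self.versions.setdefault(key, []).append((ts, value))
            self.applied_index += 1

    def read(self, key, read_ts):
        # Step 1: ask the leader how far the log is committed right now.
        read_index = self.leader.commit_index()
        # Step 2: catch up to that index (stand-in for waiting on replication).
        if self.applied_index < read_index:
            self.replicate(read_index)
        # Step 3: MVCC visibility, newest version with commit_ts <= read_ts.
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None


leader = Leader()
learner = Learner(leader)
leader.write(10, "k", "v1")
learner.replicate()
leader.write(20, "k", "v2")  # the learner has not applied this entry yet
latest = learner.read("k", read_ts=25)
older = learner.read("k", read_ts=15)
```

In this model the learner lags behind after the second write, but the read still sees it because the learner catches up to the leader's commit index before answering.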


TiFlash Architecture


Raft Learner - Sync







Till now

  • TiDB = HTAP
    • TiDB doesn't require you to choose between TP and AP; it is HTAP.
  • One platform, compatible with both row and column storage
    • Painless data synchronization
  • Easy to run analytics on the column store while the main TiDB cluster serves TP workloads

TiDB Today