1.3 A Brief History of the TiDB Database Platform
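Since the v1.0.0 GA release, TiDB has offered unlimited scaling of both compute and storage, compatibility with MySQL syntax and protocol, and strongly consistent, truly distributed transactions. Today TiDB can be called a real HTAP system: no ETL tool is needed for data conversion, and report queries can be run conveniently while the system serves OLTP workloads.
Speaker: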
Shuaipeng Yu (于帅鹏)
Product Specialist & Effic Team Leader
Before we begin
- Goal: Introduce a brief history of TiDB
- Outline:
- Ancient days of TiDB
- TiDB with TiSpark
- TiDB with TiFlash
Ancient days of TiDB
- Inspired by Google Spanner, we made TiDB
- In the 1.0.0 GA version, TiDB is
- A freely scalable (computing, storage) database
- Compatible with MySQL syntax and protocol
- Transparent data splitting policy: range splitting
- Strongly consistent, distributed transaction support
Compute and storage scale without limit, and the syntax is MySQL-compatible.
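Range splitting can be illustrated with a minimal sketch. The class below is a toy, not TiDB or TiKV code: the name `RangeRouter`, the `max_keys_per_region` threshold, and the split-at-the-middle rule are all assumptions made for illustration. It only shows the general idea that each region owns a contiguous key range and is split transparently once it grows too large.

```python
import bisect


class RangeRouter:
    """Toy router (hypothetical): maps keys to regions by sorted key ranges."""

    def __init__(self, max_keys_per_region=4):
        # Region i covers [start_keys[i], start_keys[i + 1]); the last one is unbounded.
        self.start_keys = [""]  # "" sorts before every other string
        self.regions = [[]]     # sorted keys held by each region
        self.max_keys = max_keys_per_region

    def locate(self, key):
        # The owning region is the rightmost one whose start key is <= key.
        return bisect.bisect_right(self.start_keys, key) - 1

    def put(self, key):
        idx = self.locate(key)
        region = self.regions[idx]
        pos = bisect.bisect_left(region, key)
        if pos < len(region) and region[pos] == key:
            return  # key already present
        region.insert(pos, key)
        if len(region) > self.max_keys:
            self._split(idx)

    def _split(self, idx):
        # Split the oversized region in the middle; callers never see the split.
        region = self.regions[idx]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[idx] = left
        self.regions.insert(idx + 1, right)
        self.start_keys.insert(idx + 1, right[0])


router = RangeRouter(max_keys_per_region=4)
for k in ["a", "b", "c", "d", "e"]:
    router.put(k)
```

After the fifth insert the single region exceeds the threshold and is split, yet `router.locate(key)` keeps working unchanged for the caller, which is the sense in which the splitting is transparent.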
TiDB Architecture - Original
Put simply, TiDB can be thought of as a MySQL with unlimited capacity.
Datahub Capability - Syncer
Syncer is used to synchronize and consolidate data into TiDB.
Datahub Capability - Coprocessor
The Coprocessor is used to aggregate data.
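The idea of aggregating close to the data can be sketched as a two-phase aggregation. This is only an illustration under assumptions: the function names `coprocessor_partial_sum` and `merge_partials`, the dictionary row format, and the sample data are hypothetical and do not reflect the real Coprocessor interface. Each region computes a small partial result, and the SQL layer merges those partials instead of pulling every row.

```python
from collections import defaultdict


def coprocessor_partial_sum(region_rows, group_col, value_col):
    """Storage side (hypothetical): reduce one region's rows to per-group [sum, count]."""
    partial = defaultdict(lambda: [0, 0])
    for row in region_rows:
        acc = partial[row[group_col]]
        acc[0] += row[value_col]
        acc[1] += 1
    return dict(partial)


def merge_partials(partials):
    """SQL layer (hypothetical): merge partial results into the final sum and average."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            total[group][0] += s
            total[group][1] += c
    return {g: {"sum": s, "avg": s / c} for g, (s, c) in total.items()}


# Two regions, each holding part of the same table (made-up sample rows).
regions = [
    [{"city": "A", "amount": 10}, {"city": "B", "amount": 5}],
    [{"city": "A", "amount": 30}],
]
partials = [coprocessor_partial_sum(r, "city", "amount") for r in regions]
result = merge_partials(partials)
```

Only one small dictionary per region crosses the network here, rather than every row, which is the benefit the slide points at.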
Datahub Capability
- TiDB is ideal for Datahub scenarios
- Protocol-compatible, so MySQL production databases are easy to synchronize
- Transparent, accessible cross-shard queries
- Data lands in real time
- Massive storage lets multiple data sources converge
- Standby and Datahub analysis, 2-in-1
One year later
- TP Scenario
- Customer: There are still some problems, but... it works great!
- AP Scenario
- Client 1: Complex statements are so slow!
- Client 2: Always OOM!
- Client 3: Can't integrate with big data platform!
Choice
- Either rely on TiDB and TiKV together
- Completely refactor the optimizer and executor to build an MPP engine
- High risk and long duration
- Or,
- Adopt an open source distributed computing framework
- High maturity and wide user base
TiSpark(1/3)
TiSpark was introduced to extend TiDB's single-node compute capability into multi-node parallel computation.
TiSpark(2/3)
- Spark helps us do distributed computing
- A mature distributed computing platform
- Faster(?), more stable(?).
- Full inheritance of the Apache Spark ecosystem: painless integration into the big data ecosystem
- Scripting, Python, R, Apache Zeppelin, Hadoop...
TiSpark(3/3)
- Apache Spark can only provide low concurrency computation
- Heavy computational model and high resource consumption
- Better suited to reports and heavyweight ad hoc queries
- Users still need high-concurrency, small-to-medium-scale AP capability in many situations
- Complex query capability with low consumption
- TiDB is far simpler to maintain than Spark Clusters
Meanwhile...
- We were also working on various optimizations around standalone TiDB
- Smarter, more efficient and faster in small to medium scale scenarios
- Optimizer
- Basic optimizer? --> RBO + CBO Optimizer --> Cascades Optimizer(WIP)
- Executor
- Classic Volcano Model --> Batch Execution --> Vectorized Execution
- Better Concurrency and Pipeline
- Partition tables, Index Merge, etc.
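The difference between the classic Volcano model and batch execution can be sketched in a few lines. This is a simplified illustration, not TiDB's executor: the function names, the `chunk_size` default, and the single filter-then-sum plan are assumptions. The first version passes one row at a time between operators; the second hands whole chunks to each operator, so the per-row call overhead is amortized.

```python
def volcano_filter_sum(rows, threshold):
    """Row-at-a-time: every operator yields a single row per step."""
    def scan():
        for row in rows:
            yield row

    def filt(child):
        for row in child:
            if row > threshold:
                yield row

    total = 0
    for row in filt(scan()):
        total += row
    return total


def chunked_filter_sum(rows, threshold, chunk_size=1024):
    """Batch style: every operator processes a whole chunk of values per call."""
    total = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]           # scan one chunk
        selected = [v for v in chunk if v > threshold]   # filter the chunk
        total += sum(selected)                           # aggregate the chunk
    return total


data = list(range(10))
row_result = volcano_filter_sum(data, 4)
batch_result = chunked_filter_sum(data, 4, chunk_size=3)
```

Both functions return the same answer; only the unit of work exchanged between operators differs. A vectorized executor goes one step further by evaluating expressions over whole columns of a chunk at once.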
TiDB 1.0 vs 2.0
TiDB 2.0 vs 2.1
Core conflict
- At this point, we were still left with 2 core contradictions.
- Row storage is not friendly to analysis scenarios
- "How dare you call yourselves HTAP without a column store?"
- Workload isolation is not possible
- "I ran a query and the CPU usage was 1000%!"
- TiSpark scenarios would be worse.
Row vs Column Storage
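A small sketch may help show why row storage is unfriendly to analysis. The table contents and the helper names `to_columns`, `sum_row_store`, and `sum_column_store` are made up for illustration. In the row layout an aggregate over one column still walks every full row, while in the column layout it reads one contiguous list.

```python
def to_columns(rows):
    """Convert a row-oriented table into a column-oriented one."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)
    return columns


def sum_row_store(rows, col):
    # Every row (with all of its fields) has to be visited to pick out one value.
    return sum(row[col] for row in rows)


def sum_column_store(columns, col):
    # Only the single column being aggregated is touched.
    return sum(columns[col])


# Hypothetical sample table.
row_table = [
    {"id": 1, "name": "apple", "price": 10},
    {"id": 2, "name": "pear", "price": 20},
    {"id": 3, "name": "plum", "price": 30},
]
column_table = to_columns(row_table)
```

Point lookups and updates of a whole record favor the row layout, which is why TP workloads keep using it, while scans over a few columns favor the column layout.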
TiFlash
- Synchronize a column-store replica independently via Raft Learner
- Raft Learner provides extremely low-overhead replica synchronization
- The Raft Learner read protocol works with MVCC to provide strongly consistent reads
- Physical isolation via Label
- AP / TP workloads do not affect each other
Synchronization is cheap, and physical isolation is achieved by labeling nodes.
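How a learner replica can still serve strongly consistent reads can be sketched as follows. This is a heavily simplified model under stated assumptions, not TiFlash code: the `Leader` and `Learner` classes, their method names, and the in-memory log are hypothetical, writes are assumed to arrive in ascending `commit_ts` order, and the "wait for replication" step is modeled as an immediate catch-up. The idea shown is: ask the leader for its latest commit index, make sure the learner has applied the log up to that index, then pick the newest MVCC version visible at the read timestamp.

```python
class Leader:
    """Toy Raft leader (hypothetical): holds the committed log."""

    def __init__(self):
        self.log = []  # entries are (commit_ts, key, value)

    def write(self, commit_ts, key, value):
        self.log.append((commit_ts, key, value))

    def commit_index(self):
        return len(self.log)


class Learner:
    """Toy non-voting replica that applies the leader's log asynchronously."""

    def __init__(self, leader):
        self.leader = leader
        self.applied_index = 0
        self.versions = {}  # key -> [(commit_ts, value), ...] in ascending commit_ts

    def replicate(self, upto):
        # Apply log entries in (applied_index, upto]; a no-op if already caught up.
        for commit_ts, key, value in self.leader.log[self.applied_index:upto]:
            self.versions.setdefault(key, []).append((commit_ts, value))
        self.applied_index = max(self.applied_index, upto)

    def read(self, key, read_ts):
        # 1. Ask the leader how far the log is committed right now.
        target = self.leader.commit_index()
        # 2. Catch up to that index (a real system would wait for replication).
        if self.applied_index < target:
            self.replicate(target)
        # 3. MVCC: return the newest version with commit_ts <= read_ts.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return visible[-1] if visible else None


leader = Leader()
leader.write(10, "k", "v1")
learner = Learner(leader)
learner.replicate(1)          # learner has applied the first entry only
leader.write(20, "k", "v2")   # learner is now one entry behind
latest = learner.read("k", read_ts=25)
older = learner.read("k", read_ts=15)
```

Because the read first catches up to the leader's commit index, the lagging learner still returns `v2` for a read at timestamp 25, while a read at timestamp 15 sees the older version through MVCC.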
TiFlash Architecture
Raft Learner - Sync
Row storage is converted to column storage.
Merge
Performance
Till now
- TiDB = HTAP
- TiDB doesn't require you to choose between TP and AP; it's HTAP.
- One platform, compatible with row and column storage
- Painless data synchronization
- Easy to analyze on columns when the main TiDB cluster runs TP services
TiDB Today