Roadmap 2023 #26

Open
28 of 38 tasks
hustnn opened this issue Jan 11, 2023 · 23 comments
Labels
enhancement New feature or request

Comments

@hustnn
Contributor

hustnn commented Jan 11, 2023

You are welcome to share your ideas on the roadmap. The updated roadmap for Q3 and Q4 is shown below.

Storage

External Table/Data Lake (project https://github.com/orgs/ByConity/projects/2)

Index

Runtime

  • Projection support - Q3
  • Grace hash join - Q3
  • Adaptive query scheduling - Q3
  • Common table expression (CTE) reuse - Q3
  • Materialized view - Q4
  • Extract, Load, Transform (ELT) phase 1 - Q3
    Asynchronous query execution, query queue, join spill
  • Extract, Load, Transform (ELT) phase 2 - Q4
    Exchange spill, colocated scheduling, batch execution
  • SQL UDF support - Q4
    ByConity Support UDF #427

Optimizer

  • CBO statistics auto collection - Q3
  • SQL plan management (manually creating binding) - Q3

Transaction

Enterprise feature

Performance improvement

Stability

Installation

CI

@hustnn hustnn pinned this issue Jan 11, 2023
@canhld94
Collaborator

canhld94 commented Jan 13, 2023

Some feedback from the community

  1. Usability:
  • Need better documentation for installation in non-containerized environments
  • Need better documentation for ease of use (e.g. storage engines)
  2. Storage:
  • Should support common object storage (S3, GCP)
  3. Integration:
  • Need to ensure compatibility with common DB drivers

Team discussions

  1. Support query from data lake (Hudi, Iceberg, Delta); currently we support Hive, and can extend this model to support others.
  2. Indexing services

Feel free to add more.

@zhbdesign
Contributor

Support DELETE, UPDATE, and user permission control, keeping them consistent with the latest version of ClickHouse.

@zhbdesign
Contributor

Support the CREATE TABLE AS SELECT * syntax

@zhbdesign
Contributor

Support SQL fingerprints
Support a JDBC facade
Support INSERT OVERWRITE

@zhbdesign
Contributor

Support other import sources: RocketMQ, MaterializedMySQL, MaterializedPostgreSQL, Flink, Pulsar

@LiuYangkuan

LiuYangkuan commented Jan 14, 2023

@canhld94 @hustnn

Storage:
Should support common object storage (S3, GCP)

Make the shared metadata of object storage compatible with JuiceFS. We could then share checkpoints of CnchMergeTree parts with the original ClickHouse, which would read them using DiskLocal through a mounted JuiceFS.

In some time-travel scan cases, we could then use the latest original ClickHouse, which has new features that ByConity does not have yet.

@zhbdesign
Contributor

Automatic collection, update, and analysis of statistics

@zbtzbtzbt

Support a distributed cache for a higher query cache hit rate

@zhbdesign
Contributor

Support JuiceFS

@zhbdesign
Contributor

ByConity provides an all-in-one package. You can install clickhouse-client, clickhouse-server, clickhouse-worker, tso_server, daemon_manager, and resource_manager all at once with this unified package. You can also install only some components or specify a component version as required.

@zhbdesign
Contributor

Support RBAC; support SQL-driven maintenance

@zhbdesign
Contributor

Clone table
Create a new table using the same schema and data as the original table.

@Adora627 Adora627 added the enhancement New feature or request label Feb 17, 2023
@s7monk

s7monk commented May 30, 2023

Can it support Apache Paimon?

@hustnn
Contributor Author

hustnn commented Jun 20, 2023

Roadmap update proposal
The proposal is provided in both English and Chinese versions below.

===English version===
We plan to update the roadmap for ByConity's third and fourth quarters. The updates consist of two parts: adding new features and removing or adjusting some old functionalities. The additions come from three sources. The first part includes features and performance requirements that have received high demand from the community. The second part involves functionalities ported from ByConity's old code baseline to the new code baseline (from version 19.x to 21.x). The third part comprises functionality requirements planned by ByConity's research and development team based on identified gaps, data warehouse positioning, and future trends. The updated roadmap is as follows. We use GitHub Projects to manage subtasks and track progress at https://github.com/orgs/ByConity/projects, and some features already have associated projects.

Storage
    Object store (S3) support - Q2
External Table/Data Lake (project https://github.com/orgs/ByConity/projects/2)
    Hive Usability - Q2-Q3
    Hudi COW and MOR support - Q3
    Multi-catalog (Glue/Hive) support - Q3
    Hive query execution improvement - Q3-Q4
    Iceberg support - Q4
Runtime
    Projection support - Q2
    Grace hash join - Q3
    Adaptive query scheduling - Q3
    Common table expression (CTE) reuse - Q3
    Extract, Load, Transform (ELT)
        Asynchronous query execution, query queue, join spill - Q3
        Exchange spill, colocated scheduling, batch execution - Q4
Optimizer
    CBO statistics auto collection - Q3
    SQL Plan Management (manually creating binding) - Q3
Transaction
    Direct insert values in worker - Q2
    Atomic attach - Q3
    Iterative transaction support - Q3
Enterprise feature
    HA with keeper - Q3
    Multi-tenant support - Q3
    Support RBAC - Q3 (project https://github.com/orgs/ByConity/projects/4)
    Fine grained access control - Q4
LLM-DB
    LLM vector store support - Q4
    Integrate with OpenAI, LangChain and LlamaIndex - Q4
Performance improvement
    Part cache lockless scan - Q3
    Hybrid part allocation - Q3
    IO scheduler - Q3
    Query result cache - Q3 (project https://github.com/orgs/ByConity/projects/3)
    Column statistics for part pruning - Q4
Stability
    Server isolation - Q3
    Metrics enhancement for better observability - Q3

As mentioned above, the roadmap update consists of two parts: adding new features and removing or adjusting some old functionalities. Let's break down these two parts of the update.
The additions fall into three parts. The first mainly stems from community demands since ByConity went open source, for example the ability to write directly to workers, which reduces the server's responsibilities and makes writes easier to scale horizontally, as well as optimizing I/O to improve cold read performance. The second part consists of functionalities ported from ByConity's old code baseline, such as projection. The third part covers functionality requirements planned by the research and development team based on identified gaps, data warehouse positioning, and future trend analysis, such as enhanced data lake capabilities and support for ELT (Extract, Load, Transform).

# Requirements from the ByConity community
External Table/Data Lake
    Hive Usability - Q2-Q3
    Hive query execution improvement - Q3-Q4
Transaction
    Direct insert values in worker - Q2
Performance improvement
    IO scheduler - Q3
    Query result cache - Q3
Stability
    Metrics enhancement for better observability - Q3

# Code baseline merge
Runtime
    Projection support - Q2
    Grace hash join - Q3
    Adaptive query scheduling - Q3
    Common table expression (CTE) reuse - Q3
Transaction
    Atomic attach - Q3
    Iterative transaction support - Q3
Enterprise feature
    HA with keeper - Q3
    Multi-tenant support - Q3
Performance improvement
    Part cache lockless scan - Q3
    Hybrid part allocation - Q3
Stability
    Server isolation - Q3
    
# R&D planning
External Table/Data Lake
    Hudi MOR support - Q3
Performance improvement
    Query result cache - Q3
Extract, Load, Transform (ELT)
    Asynchronous query execution, query queue, join spill - Q3
    Exchange spill, colocated scheduling, batch execution - Q4
LLM-DB
    LLM vector store support - Q4
    Integrate with OpenAI, LangChain and LlamaIndex - Q4

With the addition of the aforementioned high-priority features, we have also removed or adjusted some old functionalities. These involve features with unclear requirements and code refactoring. The detailed list is as follows, and we will allocate time to support the removed functionalities.

Storage
    Hudi COW support - (to Q3)
    Delta Lake support - Q2 (replan)
    Iceberg support - (to Q4)
Index
    Space-filling curves - Q1 (replan)
    Index auto recommendation - Q2 (replan)
Performance
    Column statistics for part pruning - (to Q4)
    Hybrid part allocation - (to Q3)
    Query result cache - Q3
Stability
    Server isolation - (to Q3)
    Metrics enhancement for better observability - (to Q3)
Enterprise feature
    Support RBAC - (to Q3)
    Fine grained access control - (to Q4)
    Backup and recover - Q2 (replan)
Transaction
    Direct write in worker - (to Q3)
    Iterative transaction support - (to Q3)
    Atomic attach - (to Q3)
    Code refactoring - (replan)

Towards the end of each quarter, we conduct a review and fine-tune the content for the following quarter. We synchronize these adjustments with the community and welcome discussions, comments, and new feature requests. The finalized roadmap will be updated after the first week of each quarter and any newly proposed feature requests will be considered for the subsequent quarter.

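===Chinese version (translated to English)===
We plan to update the ByConity roadmap for the third and fourth quarters. The update has two parts: the first adds some new features, and the second removes or adjusts some old ones. The additions come from three sources. The first is feature and performance requirements in high demand from the community. The second is features ported from ByConity's old code baseline to the new one (19.x to 21.x). The third is feature requirements planned by ByConity R&D based on functional gaps, data warehouse positioning, and its assessment of future trends. The updated roadmap is shown below. We use GitHub Projects to manage subtasks and track progress at https://github.com/orgs/ByConity/projects, and some features already have projects created.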

Storage
    Object store (S3) support - Q2
External Table/Data Lake (project https://github.com/orgs/ByConity/projects/2)
    Hive Usability - Q2-Q3
    Hudi COW and MOR support - Q3
    Multi-catalog (Glue/Hive) support - Q3
    Hive query execution improvement - Q3-Q4
    Iceberg support - Q4
Runtime
    Projection support - Q2
    Grace hash join - Q3
    Adaptive query scheduling - Q3
    Common table expression (CTE) reuse - Q3
    Extract, Load, Transform (ELT)
        Asynchronous query execution, query queue, join spill - Q3
        Exchange spill, colocated scheduling, batch execution - Q4
Optimizer
    CBO statistics auto collection - Q3
    SQL Plan Management (manually creating binding) - Q3
Transaction
    Direct insert values in worker - Q2
    Atomic attach - Q3
    Iterative transaction support - Q3
Enterprise feature
    HA with keeper - Q3
    Multi-tenant support - Q3
    Support RBAC - Q3 (project https://github.com/orgs/ByConity/projects/4)
    Fine grained access control - Q4
LLM-DB
    LLM vector store support - Q4
    Integrate with OpenAI, LangChain and LlamaIndex - Q4
Performance improvement
    Part cache lockless scan - Q3
    Hybrid part allocation - Q3
    IO scheduler - Q3
    Query result cache - Q3 (project https://github.com/orgs/ByConity/projects/3)
    Column statistics for part pruning - Q4
Stability
    Server isolation - Q3
    Metrics enhancement for better observability - Q3

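As noted above, the roadmap update has two parts: new features, and the removal or adjustment of some old features. Both are broken down here.
The additions consist of three parts. The first mainly comes from community requirements since open-sourcing, for example writing directly to workers to reduce the server's responsibilities and make writes easy to scale horizontally, or improving cold read performance by optimizing IO. The second consists of features ported from ByConity's old code baseline, such as projection. The third consists of feature requirements planned by R&D based on data warehouse positioning and its assessment of future trends, such as data lake enhancements and ELT support.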

# Requirements from the ByConity open-source community
External Table/Data Lake
    Hive Usability - Q2-Q3
    Hive query execution improvement - Q3-Q4
Transaction
    Direct insert values in worker - Q2
Performance improvement
    IO scheduler - Q3
    Query result cache - Q3
Stability
    Metrics enhancement for better observability - Q3

# Code baseline merge
Runtime
    Projection support - Q2
    Grace hash join - Q3
    Adaptive query scheduling - Q3
    Common table expression (CTE) reuse - Q3
Transaction
    Atomic attach - Q3
    Iterative transaction support - Q3
Enterprise feature
    HA with keeper - Q3
    Multi-tenant support - Q3
Performance improvement
    Part cache lockless scan - Q3
    Hybrid part allocation - Q3
Stability
    Server isolation - Q3
    
# R&D planning
External Table/Data Lake
    Hudi MOR support - Q3
Performance improvement
    Query result cache - Q3
Extract, Load, Transform (ELT)
    Asynchronous query execution, query queue, join spill - Q3
    Exchange spill, colocated scheduling, batch execution - Q4
LLM-DB
    LLM vector store support - Q4
    Integrate with OpenAI, LangChain and LlamaIndex - Q4

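Because the high-priority features above were added, we also removed or adjusted some old features. These include features with unclear requirements and code refactoring. The detailed list is shown below, and the removed features will be rescheduled.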

Storage
    Hudi COW support - (to Q3)
    Delta Lake support - Q2 (replan)
    Iceberg support - (to Q4)
Index
    Space-filling curves - Q1 (replan)
    Index auto recommendation - Q2 (replan)
Performance
    Column statistics for part pruning - (to Q4)
    Hybrid part allocation - (to Q3)
    Query result cache - Q3
Stability
    Server isolation - (to Q3)
    Metrics enhancement for better observability - (to Q3)
Enterprise feature
    Support RBAC - (to Q3)
    Fine grained access control - (to Q4)
    Backup and recover - Q2 (replan)
Transaction
    Direct write in worker - (to Q3)
    Iterative transaction support - (to Q3)
    Atomic attach - (to Q3)
    Code refactoring - (replan)

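Near the end of each quarter we hold a review and fine-tune the content of the following quarters, then share the adjustments with the community. Discussion, comments, and new feature requests are welcome. The roadmap is finalized after the first week of each quarter, and the community roadmap is then updated. New feature requests raised after finalization are deferred to the next quarter.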

@zhaojintaozhao
Contributor

zhaojintaozhao commented Jul 4, 2023

Metadata backup and recovery

If the metadata KV store, FoundationDB, breaks (for example, because of a disk failure or a logic fault) and metadata is lost, is there any way to restore the data? I therefore suggest adding a metadata backup and recovery function, and I hope it can be moved into the Q3 plan.

@zhaojintaozhao
Contributor

Extract, Load, Transform (ELT)
    Asynchronous query execution, query queue, join spill - Q3
    Exchange spill, colocated scheduling, batch execution - Q4

ELT support in ByConity is an exciting and valuable feature.
Once it is implemented, ByConity will be able to support more complex Hive SQL statements and SQL queries over larger data volumes.
The shuffle for Cnch Hive SQL will be a complex feature that requires a lot of effort.
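
For readers unfamiliar with the join items on the roadmap, here is a minimal Python sketch of the general Grace hash join idea: partition both inputs by a hash of the join key, then join each pair of partitions independently so that only one partition needs to fit in memory at a time. This is an illustration of the textbook technique only, not ByConity's implementation. The function name `grace_hash_join`, the row format (dicts), and the in-memory partitions standing in for spilled files are all assumptions made for the example.

```python
from collections import defaultdict


def grace_hash_join(left, right, key_left, key_right, num_partitions=4):
    """Inner-join two lists of dict rows with a Grace-style partitioned hash join.

    Hypothetical helper for illustration; a real engine would write each
    partition to disk instead of keeping it in a Python list.
    """
    # Phase 1: partition both inputs by the hash of the join key, so that
    # matching rows always land in the same partition number.
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for row in left:
        left_parts[hash(row[key_left]) % num_partitions].append(row)
    for row in right:
        right_parts[hash(row[key_right]) % num_partitions].append(row)

    # Phase 2: join each partition pair on its own with a plain hash join.
    result = []
    for pid in range(num_partitions):
        build_table = defaultdict(list)
        for row in left_parts.get(pid, []):
            build_table[row[key_left]].append(row)
        for row in right_parts.get(pid, []):
            for match in build_table.get(row[key_right], []):
                result.append({**match, **row})
    return result


users = [{"user_id": 10, "name": "a"}, {"user_id": 12, "name": "c"}]
orders = [
    {"order_id": 1, "user_id": 10},
    {"order_id": 2, "user_id": 11},
    {"order_id": 3, "user_id": 10},
]
joined = sorted(
    grace_hash_join(users, orders, "user_id", "user_id"),
    key=lambda r: r["order_id"],
)
print(joined)
```

Because each partition pair is processed separately, the memory needed at any moment is bounded by the largest partition rather than by the whole build side, which is why this scheme is a natural fit for spilling.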

@juppylm
Contributor

juppylm commented Jul 4, 2023

Will an inverted index also be supported? The primary key index currently performs well, but query performance on non-primary-key fields needs to be improved.
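
To make the request concrete, here is a minimal Python sketch of what an inverted index does for a non-primary-key column: it maps each value to the set of row positions containing it, so an equality filter can look rows up directly instead of scanning every row. The function name `build_inverted_index` and the sample rows are hypothetical, and this says nothing about how ByConity or ClickHouse would implement such an index.

```python
from collections import defaultdict


def build_inverted_index(rows, column):
    """Map each distinct value of `column` to the set of row positions holding it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        index[row[column]].add(row_id)
    return index


rows = [
    {"id": 1, "city": "Beijing"},
    {"id": 2, "city": "Singapore"},
    {"id": 3, "city": "Beijing"},
]
city_index = build_inverted_index(rows, "city")

# An equality filter on the non-key column becomes a dictionary lookup.
matches = sorted(city_index.get("Beijing", set()))
print([rows[i]["id"] for i in matches])
```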

@juppylm
Contributor

juppylm commented Jul 5, 2023

CnchMergeTree should support materialized views.
Materialized views are an important feature of ClickHouse, and I think CnchMergeTree should support them as well.

@FourSpaces

FourSpaces commented Jul 5, 2023

I hope ByConity can support a joint (federated) query engine across multiple tables and multiple data sources, pushing queries down to their respective data sources. The joint query engine would then consolidate the data from each data source.

A Merge table engine similar to ClickHouse's

@zhaojintaozhao
Contributor

Build Objectives

  • In the data warehouse scenario, ByConity should integrate with the general data warehouse ecosystem, fill in missing features, and improve performance.
  • In the OLAP scenario, we will build ByConity into a comprehensive cloud-native database for multi-dimensional analysis, with wide-table performance close to that of ClickHouse. This cloud-native big data database will support elastic scaling, resource isolation, and high reliability, and will support large-scale commercial deployment.

Requirements

1. Data warehouse scenario

In data warehouse scenarios, we focus on improving scenario coverage and supporting complex queries on large data warehouse tables.

Key Features

1.1. Multi-stage execution and ETL capabilities at the execution layer, with support for batch processing and exchange shuffle, so that complex SQL queries on large data warehouse tables are supported. (Key Feature)
1.2. Support special Hive functions and Hive UDFs. (Key Feature)
1.3. Performance improvement (e.g. ORC/Parquet native reader, block cache, min-max index).

General Features

1.4. Support multiple external catalogs.
1.5. Automatically infer the column types of external Hive tables.
1.6. ORC/Parquet file min-max index.
1.7. Read ORC/Parquet data files in distributed mode with a thread pool.
1.8. Schedule workloads across multiple workers/worker groups to make full use of resources.
1.9. Support Iceberg/Hudi.

2. OLAP Scenario

We focus on reliability and performance improvement in the OLAP scenario.
2.1. Projection
2.2. Detach/attach of parts
2.3. Automatic re-collection of full CBO statistics
2.4. Automatic collection of CBO statistics for incremental data
2.5. Caching of per-part min and max information, and query pruning based on it
2.6. Multi-disk local data cache
2.7. Separate primary index cache, mark cache, and data cache for CnchMergeTree
2.8. Cluster management in containerized and non-containerized scenarios (adding and deleting workers/worker groups/virtual warehouses)
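
As an illustration of item 2.5, here is a minimal Python sketch of min/max-based part pruning: if each part records the minimum and maximum of a column, a range filter can skip every part whose range does not overlap the query range. The `PartStats` class, the `prune_parts` function, and the sample parts are hypothetical names for this example, not ByConity data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartStats:
    """Hypothetical cached metadata for one part: min and max of a single column."""
    name: str
    min_value: int
    max_value: int


def prune_parts(parts, lo, hi):
    """Keep only parts whose [min_value, max_value] overlaps the query range [lo, hi]."""
    return [p for p in parts if p.max_value >= lo and p.min_value <= hi]


parts = [
    PartStats("part_1", 0, 99),
    PartStats("part_2", 100, 199),
    PartStats("part_3", 200, 299),
]

# A filter such as "150 <= x <= 220" only needs to read part_2 and part_3.
selected = prune_parts(parts, 150, 220)
print([p.name for p in selected])
```

Keeping these statistics in a cache means the pruning decision can be made without opening the parts themselves.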

3. Reliability enhancement

3.1. Metadata backup and restoration capabilities (Key Feature)
3.2. ByConity service monitoring and alerting
3.3. Monitoring, alerting, and capacity expansion for FoundationDB
3.4. Leader election and HA for multiple servers/TSO
3.5. Multiple instances and HA for RM/DM

The features we care about most are 1.1, 1.2, and 3.1.
@hustnn

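Goals

  • Data warehouse scenario: integrate with the general data warehouse ecosystem, fill in missing features, and improve performance.
  • OLAP scenario: build a complete cloud-native database whose wide-table performance is close to ClickHouse, with support for elastic scaling, resource isolation, high reliability, and large-scale commercial deployment.

Requirements

1. Data warehouse scenario

The data warehouse scenario focuses on improving scenario coverage and supporting complex queries on large data warehouse tables.

Key features

1.1. Multi-stage execution and ETL capabilities at the execution layer, with support for batch processing and exchange shuffle.
1.2. Support for special Hive functions and Hive UDFs.
1.3. Performance acceleration (ORC/Parquet native reader, block cache, index, etc.)

General features

1.4. Multiple external catalogs.
1.5. Automatic type inference for external tables.
1.6. ORC/Parquet min-max index.
1.7. Distributed data reading.
1.8. Scheduling of load across multiple workers/worker groups to make full use of resources.
1.9. Support for Iceberg/Hudi.

2. OLAP multi-dimensional analysis scenario

The OLAP scenario focuses on reliability and performance improvement.
2.1. Projection
2.2. Detach/attach of parts
2.3. Automatic re-collection of CBO statistics for full data
2.4. Automatic collection of CBO statistics for incremental data
2.5. Caching of per-part min/max and similar information, and query pruning
2.6. Multi-disk cache
2.7. Primary index cache, mark cache, and data cache for parts
2.8. Cluster management in containerized and non-containerized environments (adding and deleting workers/worker groups/virtual warehouses)

3. Reliability enhancement

3.1. Metadata backup and recovery capabilities
3.2. ByConity service monitoring and alerting
3.3. Monitoring, alerting, and capacity expansion for the metadata store FDB
3.4. Leader election and HA for multiple servers/TSO
3.5. Multiple instances and HA for RM/DM
The features we care about most are 1.1, 1.2, and 3.1.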

@hustnn
Contributor Author

hustnn commented Jul 11, 2023

Finished roadmap from Q1 and Q2

Storage

Index

Stability

Installation

CI

@hustnn hustnn changed the title Roadmap 2023 (discussion) Roadmap 2023 (Q3-Q4) Jul 11, 2023
@hustnn hustnn changed the title Roadmap 2023 (Q3-Q4) Roadmap 2023 Jul 12, 2023
@shijiaoming

Please support Apache Paimon. It's a very cool streaming data lake!

@kevinthfang
Contributor

kevinthfang commented Nov 2, 2023

Updates on roadmap:
reduced:

  • Atomic attach - Q4
  • Iterative transaction support (support multiple inserts atomic) - Q4

added:

  • Storage based HA support - Q4

@ixnzh ixnzh mentioned this issue Mar 4, 2024
31 tasks