On-disk size of data.kz is large when using multiple CREATE statements #3411

Open
prrao87 opened this issue Apr 30, 2024 · 1 comment

@prrao87
Member

prrao87 commented Apr 30, 2024

Migrating this from our Discord channel, as reported by a user. When a few thousand CREATE statements are sent as individual transactions, the on-disk size of the database directory grows much larger (400 MB) than when the same data is loaded with batched COPY statements (50 MB). The question is whether some sort of compaction process can be triggered to reduce disk usage when batched loading is not possible.

Original message:

Is there a way to reduce the on-disk footprint of a Kuzu graph DB? Can we trigger some sort of compaction on demand? If I load a few thousand transactions, nodes and edges combined, my on-disk size grows to 400 MB. The same transactions, when batched, create less than 50 MB. However, batching is not practical in many cases, so is there a way to load transactions and then trigger compaction to reduce the disk footprint?

Questions we asked for more clarification:

  • "A few thousand transactions, nodes and edges combined": Is this for creation only? i.e., no deletions?
  • Batched meaning a few thousand creations within a single transaction?
  • Does most of the size difference come from data.kz file?

Clarifications:

  • Creation only, no deletions.
  • It was not a single batch; around 20 batches (see the sketch after this list for what grouping statements into larger transactions could look like).
  • The size difference is all from the data.kz file. That single file has most of the data.
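
For reference, here is a minimal, hypothetical sketch of what grouping the individual CREATE statements into a handful of larger transactions could look like. It reuses init_kuzu_db and the statement template from the reproducer below, and assumes the Kuzu version in use accepts explicit BEGIN TRANSACTION / COMMIT statements through conn.execute:

def create_container_nodes_grouped(conn, batch_size=500):
    # Hypothetical sketch: commit CREATE statements in groups of batch_size
    # instead of one auto-committed transaction per statement.
    for start in range(0, 1000, batch_size):
        conn.execute("BEGIN TRANSACTION")
        for i in range(start, min(start + batch_size, 1000)):
            dml = ("CREATE(t:ContainerNode {{ my_key : '{0}', name1 : 'xyz', "
                   "name2 : 'abc', name3 : ['pqr'], count : {1} }})").format(i, i)
            conn.execute(dml)
        conn.execute("COMMIT")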

Reproducible example:

Here is a test that reproduces the issue: just two node types, with 1000 instances of each created once without batching and once batched via COPY, followed by printing the directory size for each.

import kuzu
import os
import shutil
import pandas as pd
from tempfile import NamedTemporaryFile


def run_kuzu_query(query, conn):
    # Execute a Cypher statement and return the result as a DataFrame
    qR = conn.execute(query)
    return qR.get_as_df()


def init_kuzu_db(_kuzu_loc):
    # Remove any existing database directory, then create a fresh database with the test schema
    if os.path.exists(_kuzu_loc):
        shutil.rmtree(_kuzu_loc)
    ddl1 = 'create node table ContainerNode(my_key STRING, name1 STRING, name2 STRING, name3 STRING[], count INT64, PRIMARY KEY (my_key))'
    ddl2 = 'create node table ContentNode(my_key STRING, name1 STRING, name2 STRING, name3 STRING[], val1 DOUBLE, val2 DOUBLE, val3 DOUBLE, x BOOLEAN, y BOOLEAN, z BOOLEAN, w STRING[], PRIMARY KEY (my_key))'
    ddl3 = 'CREATE REL TABLE ContContRel(FROM ContainerNode TO ContentNode)'
    buf_pool_gb = 1
    _kuzuDB = kuzu.Database(_kuzu_loc, buffer_pool_size=buf_pool_gb*(1024**3))
    conn = kuzu.Connection(_kuzuDB, num_threads=1)
    run_kuzu_query(ddl1, conn)
    run_kuzu_query(ddl2, conn)
    run_kuzu_query(ddl3, conn)
    return conn


def create_container_nodes_no_batch(conn):
    # One CREATE statement per node; each executes as its own auto-committed transaction
    for i in range(1000):
        dml  = "CREATE(t:ContainerNode {{ my_key : '{0}', name1 : 'xyz', name2 : 'abc', name3 : ['pqr'], count : {1} }})".format(i, i)
        run_kuzu_query(dml, conn)


def create_content_nodes_no_batch(conn):
    # Same as above, but for ContentNode
    for i in range(1000):
        dml  = ("CREATE(t:ContentNode {{ my_key : '{0}', name1 : 'xyz', name2 : 'abc', name3 : ['pqr'], val1 : {1}, val2: {2}, val3: 3.5, "
                "x: TRUE, y: FALSE, z: TRUE, w: ['random'] }})").format(i, i, 1.5*i)
        run_kuzu_query(dml, conn)


def create_container_nodes_batched(conn):
    # Build the rows in a DataFrame, write them to a temporary Parquet file,
    # and bulk-load them with a single COPY statement
    df = pd.DataFrame(columns=['my_key', 'name1', 'name2', 'name3', 'count'])
    for i in range(1000):
        df.loc[len(df.index)] = [f'{i}', 'xyz', 'abc', ['pqr'], i]

    df.reset_index(drop=True, inplace=True)

    tf = NamedTemporaryFile(suffix=".parquet")
    with tf:
        df.to_parquet(tf)
        tf.flush()
        dml = f"""COPY ContainerNode FROM "{tf.name}" """
        conn.execute(dml)



def create_content_nodes_batched(conn):
    # Same as above, but for ContentNode: Parquet + COPY
    df = pd.DataFrame(columns=['my_key', 'name1', 'name2', 'name3', 'val1', 'val2', 'val3', 'x', 'y', 'z', 'w'])
    for i in range(1000):
        df.loc[len(df.index)] = [f'{i}', 'xyz', 'abc', ['pqr'], 1.0*i, 1.0*i, 3.5, True, False, True, ['random']]

    df.reset_index(drop=True, inplace=True)
    tf = NamedTemporaryFile(suffix=".parquet")
    with tf:
        df.to_parquet(tf)
        tf.flush()
        dml = f"""COPY ContentNode FROM "{tf.name}" """
        conn.execute(dml)


def get_dir_size(path):
    # Recursively sum the sizes of all files under path
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size(entry.path)
    return total


## Without batching
conn_no_batch = init_kuzu_db("/tmp/no_batch.kuzudb")
create_container_nodes_no_batch(conn_no_batch)
create_content_nodes_no_batch(conn_no_batch)

## With batching
conn_batched = init_kuzu_db("/tmp/batched.kuzudb")
create_container_nodes_batched(conn_batched)
create_content_nodes_batched(conn_batched)


print('Size without batching:\t', get_dir_size("/tmp/no_batch.kuzudb"))
print('Size with batching:\t\t', get_dir_size("/tmp/batched.kuzudb"))
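
To confirm the clarification above that most of the difference comes from data.kz, a small hypothetical follow-up can print per-file sizes for the two directories created by the script:

# Hypothetical follow-up: list per-file sizes so that the contribution of
# data.kz to each directory's total is visible directly.
for db_dir in ("/tmp/no_batch.kuzudb", "/tmp/batched.kuzudb"):
    print(db_dir)
    for entry in sorted(os.scandir(db_dir), key=lambda e: e.name):
        if entry.is_file():
            print(f"  {entry.name}\t{entry.stat().st_size} bytes")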
prrao87 added the bug label on Apr 30, 2024
@ray6080
Contributor

ray6080 commented Apr 30, 2024

Looking at the script, one hypothesis for the size difference is compression: a few thousand small transactions might trigger re-compression of existing tuples, and right now we don't reclaim that space yet (this will be added later), whereas a single COPY statement doesn't trigger re-compression at all.

Will profile a bit more to verify if that's the case.
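
One hypothetical way to check this is to track the size of data.kz as the individual CREATE statements run; if re-compression is the cause, the file should grow in occasional large steps rather than smoothly. The sketch below reuses init_kuzu_db and run_kuzu_query from the reproducer above, and the database path is illustrative:

# Hypothetical profiling sketch: record the size of data.kz after each
# individual CREATE to see where (and by how much) the file grows.
conn = init_kuzu_db("/tmp/profile.kuzudb")
data_file = os.path.join("/tmp/profile.kuzudb", "data.kz")
sizes = []
for i in range(1000):
    dml = ("CREATE(t:ContainerNode {{ my_key : '{0}', name1 : 'xyz', "
           "name2 : 'abc', name3 : ['pqr'], count : {1} }})").format(i, i)
    run_kuzu_query(dml, conn)
    # The file may not exist until the first write is persisted
    sizes.append(os.path.getsize(data_file) if os.path.exists(data_file) else 0)

# Print only the statements after which the file grew
for i in range(1, len(sizes)):
    if sizes[i] != sizes[i - 1]:
        print(f"statement {i}: data.kz grew from {sizes[i - 1]} to {sizes[i]} bytes")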
