Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support of registered files (DuckDB-WASM) in vgplot #130

Open
spren9er opened this issue Jul 5, 2023 · 2 comments
Open

Support of registered files (DuckDB-WASM) in vgplot #130

spren9er opened this issue Jul 5, 2023 · 2 comments

Comments

@spren9er
Copy link
Contributor

spren9er commented Jul 5, 2023

Data loading helpers create DuckDB tables/views from parquet, JSON or CSV files. Unfortunately, this process can take a while when data is big. Sometimes — in DuckDB-WASM — it is nicer to register files and run queries directly on them (see registerFileURL). Query performance on DuckDB tables is optimal, but queries on parquet files are fast, too.

Here, we try to use a registered CSV file (via HTTP) in vgplot:

import * as vg from "@uwdata/vgplot";

import { DuckDBDataProtocol } from '@duckdb/duckdb-wasm';

const wasm = await vg.wasmConnector();

wasm.db.registerFileURL(
  'athletes.csv',
  'https://uwdata.github.io/mosaic-datasets/data/athletes.csv',
  DuckDBDataProtocol.HTTP,
  true,
);

vg.coordinator().databaseConnector(wasm);

const $query = vg.Selection.intersect();

export default vg.vconcat(
  vg.menu({
    label: 'Sport',
    as: $query,
    from: 'athletes.csv',
    column: 'sport',
    value: 'aquatics'
  }),
  vg.vspace(10),
  vg.plot(
    vg.barX(vg.from('athletes.csv', { filterBy: $query }), {
      x: vg.sum('gold'),
      y: 'nationality',
      fill: 'steelblue',
      sort: { y: '-x', limit: 10 },
    }),
    vg.xLabel('Gold Medals'),
    vg.yLabel('Nationality'),
    vg.yLabelAnchor('top'),
    vg.marginTop(15)
  )
);

Note, that references of mosaic clients point now to registered file athletes.csv, not a DuckDB table anymore.

This works! However, currently not all marks from vgplot are supported with registered files. Connected marks (e.g. area, line) don't work. When replacing above mark barX by

vg.lineY(vg.from('athletes.csv', { filterBy: $query }), {
  x: 'nationality',
  y: vg.sum('gold')
})

an error is raised:

Coordinator.js:145 Uncaught (in promise) Error: Catalog Error: Table with name csv does not exist!

I think the reason is probably due to some preprocessing of table name, see here. This logic conflicts with registered file name. I tried changing name of registered file athletes.csv, but if I omit .csv suffix, then DuckDB-WASM interprets reference as table.

It would be great if vgplot fully works with registered files (via HTTP(S), S3, local storage, etc.).

@spren9er spren9er changed the title Support of registered files (DuckDB-WASM) in _vgplot_ Support of registered files (DuckDB-WASM) in vgplot Jul 5, 2023
@jheer
Copy link
Member

jheer commented Jul 5, 2023

I won’t be able to take a closer look at this until next week (the issue is that Mosaic interprets the table name as schema.table), but in the meantime here are two ideas:

  • Can you create a view over the registered file (eg, using an exec query)? Then you can query the view as a normal table.
  • You might try creating a file without an extension or perhaps wrap the file name in an explicit sql expression (sql`’athletes.csv’` or similar)?

@spren9er
Copy link
Contributor Author

spren9er commented Jul 5, 2023

Thanks, first workaround (creating a view for it) worked out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants