Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFE] Optional backend proxy to enable bigger data sources #51

Open
saaj opened this issue May 16, 2021 · 11 comments
Open

[RFE] Optional backend proxy to enable bigger data sources #51

saaj opened this issue May 16, 2021 · 11 comments

Comments

@saaj
Copy link
Contributor

saaj commented May 16, 2021

As a user of Sqliteviz,
In order to simplify my workflow with bigger databases,
I want to be able to query them directly in sqliteviz w/o the need to pre-aggregate data for it.

A local SQLite database file can easily be 1-3 order of magnitude bigger than what (allowed to) fits into browser tab's memory. Using it directly w/o producing pre-aggregated samples for visualisation can simplify the workflow a lot (or make sqliteviz attractive to another audience). It's likely not about the volume of the result set (even what fits into the browser may not make sense to pass to Plotly -- there are much fewer pixels space there even taking into account zooming), but just the span of aggregates that is accessible.

UI

Code-wise, sql.js interactions are abstracted by a promise-based API, so it should not be much difference for the core logic.

HTTP API

Points:

  • No persistent connection on the backend -- everything is request-response
  • Distribution -- just a script(s) (preferably w/o or with few dependencies) in the repo

Database candidates on the surfaces (e.g. for a Python module):

  • stdlib sqlite3 and/or apsw (or apsw-wheels) and/or rqlite HTTP API directly
  • DuckDB -- just pip install duckdb away, much faster for OLAP queries than SQLite
  • can be SQLAlchemy for anything else as the central interface

Complying to the API any other SQL-speaking source can potentially be added it the need be.

Endpoints:

  1. Execute a SQL query script

    • Source DSN/URI (e.g. sqlite3.connect('file:path/to/database?mode=ro', uri=True)
      in Python). Credentials also fit here.
    • SQL query itself
    • Result set: list of dicts in JSON
    • Cursor meta?
  2. Auto-completion for database schema/hierarchy, functions, etc. Upfront or on-demand? E.g.
    SQLite has COMPLETION table-valued function. TBD.

  3. OpenAPI endpoint

https://github.com/roapi/roapi is nearly perfect for the task:

  1. Installed via pip or Docker

  2. Simple HTTP API

    Point to a file

    roapi-http --table 'uk_cities=test_data/uk_cities_with_headers.csv' 
    

    Then query like:

    curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
    

    Outputs JSON (and also supports CSV and Apache Arrow serialisation).

  3. Schema is available at /api/schema.

  4. Input transport: local filesystem, HTTP, S3.

  5. Input formats: SQLite, CSV, (ND)JSON, parquet, MySQL, DeltaLake, Google spreadsheet.

  6. Some interesting functionality is in to-do:

    • OpenAPI
    • Postgres
    • authn
@twoxfh
Copy link
Contributor

twoxfh commented Aug 9, 2021

You might be able to get most of what you want with a database function which takes in a URL and post parameters then returns a JSON document which parsed as a table with the JSON1 extension using roapi-http. Likely would need the ability to externalize database functions in Sqliteviz or possibly in your Sqljs build script.

For larger datasets you might want to store the JSON document and then process it.

@saaj
Copy link
Contributor Author

saaj commented Aug 9, 2021

I don't think SQLite is needed in between in this case. Yes, then there's an issue with inconsistency of the SQL dialect(s) in UI (datafusion's in case of roapi), but it's (much) smaller than the performance issues and the implementation cost. Especially given that this a kind of exploratory use case.

But also backends should be swappable, once Sqliteviz establishes the expected HTTP API. So making one for particular SQL-like source with the HTTP server technology of choice should be a very simple task.

@twoxfh
Copy link
Contributor

twoxfh commented Aug 9, 2021

It took me a second to understand the thought process, are you expecting sqliteviz to be a producer of data to roapi and then have one search interface? Or be the interface for roapi?

If you have a SQLiteviz with local data how do you query it with the data over http? When I talked about a database function I meant within JavaScript not a C program compiled with SQLite.

@saaj
Copy link
Contributor Author

saaj commented Aug 10, 2021

It took me a second to understand the thought process [...]

That's reciprocal ;-) But I'm curios to understand your idea.

How sqliteviz works now (and should continue) -- main use case.

   input file           browser
  ┌──────────┐         ┌───────────────────────────┐
  │ .csv     │         │                           │
  │ .sqlite  ├─────────┼───────────────────┐       │
  └──────────┘         │                   │       │
                       │  web worker       │       │
                       │ ┌─────────────────┼─────┐ │
                       │ │  SQLite†        │     │ │
                       │ │ ┌───────┐  ┌────▼───┐ │ │
                       │ │ │       │  │database│ │ │
                       │ │ │  ┌────┼──►byte    │ │ │
                       │ │ │  │    │  │array   │ │ │
                       │ │ └──┼────┘  └────────┘ │ │
                       │ │    │                  │ │
                       │ └────┼──────────────────┘ │
                       │      │                    │
                       │ ┌────▼──────┐             │
                       │ │           │             │
                       │ │ SQL query │             │
                       │ │ from user │             │
                       │ │           │             │
                       │ └───────────┘             │
                       │                           │
                       └───────────────────────────┘

What this RFE seeks is this exploratory/experimantal use case.

     roapi-http              browser
  ┌──────────────┐          ┌───────────────┐
  │              │          │               │
  │ ┌──────────┐ ◄──────────┼───────┐       │
  │ │datafusion│ │          │       │       │
  │ │   SQL†   │ │          │ ┌─────▼─────┐ │
  │ └──────────┘ ├────┐     │ │           │ │
  │              │    │     │ │ SQL query │ │
  └──────────────┘    │     │ │ from user │ │
                      │     │ │           │ │
                      │     │ └───────────┘ │
    input files       │     │               │
   ┌───────────┐      │     └───────────────┘
   │ .csv      ◄──────┘
   │ .parquet  │
   └───────────┘

What I am saying is that the effective SQL engine † can be different, but that's okay (even though it would complicate UI). If the experiment succeeds, later we can think how to combine the two (say, VFS for ATTACH DATABASE or something like that), but for now it's premature.

@twoxfh
Copy link
Contributor

twoxfh commented Aug 10, 2021

I agree in that scenario I am not sure how you would standardize all the functions and operations across multiple platforms.

If you pick JSON as the interchange format, sqljs at the worker level let's you define JavaScript database functions. You can then make api calls through the browser JavaScript and return the json in a table format using the JSON1 extension much like generate series is a table valued function. Make sense? Couple benefits to this approach are the choice between static and real time data and ability to automate multiple sources at once.

Of course CORS is an issue if you don't have access to the endpoint, but if it's centralized and configured for access I think this may meet your use case. This would be easiest on json arrays, access to manipulate hierarchial json would be achieved through sqlite JSON1 extension

@saaj
Copy link
Contributor Author

saaj commented Aug 11, 2021

I agree in that scenario I am not sure how you would standardize all the functions and operations across multiple platforms.

I guess there has already been some standardisation going on around ISO/IEC_9075 and we can just build on its subset. Basically SELECT with GROUP BY and a handful of aggregate functions is all that's needed for this use case. And as I said it does complicate things for the UI a little, but far from being a problem (e.g. how SQL auto-completion works now).

If you pick JSON as the interchange format, sqljs at the worker level let's you define JavaScript database functions. You can then make api calls through the browser JavaScript and return the json in a table format using the JSON1 extension much like generate series is a table valued function. Make sense? Couple benefits to this approach are the choice between static and real time data and ability to automate multiple sources at once.

Yes, now I understand. We touched it before, and I still stand by #56 (comment).

For this use case, I don't think it would work. It either requires you to load all the data to aggregate it in the browser, which defeats whole point of this RFE -- data doesn't fit in browser/RAM, aggregate it elsewhere prior to visualisation in sqliteviz.

SELECT json_extract(value, '$.name') name, COUNT(*) cnt
FROM json_each(fetch_sync('http://localhost/some_huge_json_array.json'))
GROUP BY 1 

Or in case of roapi it would require SQL inside SQL string and unnecessary JSON maniulation:

SELECT json_extract(value, '$.name') name, json_extract(value, '$.cnt') cnt
FROM json_each(
  fetch_sync_post(
     'http://localhost:8080/api/sql', 
     'SELECT name, COUNT(*) cnt FROM some_huge_csv_file GROUP BY 1'
  )
)

Where the suggested alternative is filling a form in UI to point to the host-port roapi listens on and execute simple:

SELECT name, COUNT(*) cnt
FROM some_huge_csv_file
GROUP BY 1

Plus there will be schema and auto-completion as it is now.

@twoxfh
Copy link
Contributor

twoxfh commented Aug 13, 2021

I think I get it now, you want to http request flat JSON from a website through your browser, convert and feed the result javascript array to the sqliteviz table, and visualize the data without using sqlite, right? Not sure how multiple tabs would work, maybe that's a per tab functionality.

My comments were based on collecting aggregated data from external systems and aggregating it with data in the browser sqlite; it sounds like your usecase is different.

It would require another extension (statement vtab) but it wouldn't be hard for a user to define a custom function: to abstract the URL/JSON1 queries and internally call the fetch_sync_post function above:

create table roapi_data as select * from roapi("select name, COUNT(*) cnt from any_table group by 1")

For me this adds a bit of transparency of where the data is from in a complete savable inquiry file that is repeatable. In my experience even one million records JSON1 extension is fast enough not to need tables. I will digress, different use case.

@zhaoxilingcheng

This comment was marked as off-topic.

@saaj

This comment was marked as off-topic.

@zhaoxilingcheng

This comment was marked as off-topic.

@zhaoxilingcheng

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants