Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider storing data_uri in component parts #658

Open
danielballan opened this issue Feb 20, 2024 · 1 comment
Open

Consider storing data_uri in component parts #658

danielballan opened this issue Feb 20, 2024 · 1 comment

Comments

@danielballan
Copy link
Member

In the Catalog assets table, we store data_uri as one string:

sqlite> select * from assets;
id  data_uri                                 is_directory  hash_type  hash_content  size  time_created         time_updated       
--  ---------------------------------------  ------------  ---------  ------------  ----  -------------------  -------------------
1   file://localhost/tmp/tmp_qqdv5m4/data/y  1                                            2024-02-20 12:52:14  2024-02-20 12:52:14

Prompted by a comment from @callumforrester, I'm wondering we should be storing this in component parts e.g. scheme (file), netloc (localhost), and path (/tmp/tmp_qqdv5m4/data/y).

Why? This would enables us to index them separately and easily:

  • Filter by scheme ("Show file-based URIs only...")
  • Extract (and possibly rewrite) the paths
  • In the future, extract (and possibly rewrite) the domain (~netloc)

How? The complete specification is large. The httpx.URL docstring has useful ASCII art:

    The components of a URL are broken down like this:

       https://jo%40email.com:a%20secret@müller.de:1234/pa%20th?search=ab#anchorlink
    [scheme]   [  username  ] [password] [ host ][port][ path ] [ query ] [fragment]
               [       userinfo        ] [   netloc   ][    raw_path    ]

We should give some thought to how finely to carve it, and whether or not we want to include everything at the start. (I'm not sure about fragment and I have concerns about userinfo.)

We could keep everything in one column but make it a JSON[B] column. That would make the components index-able. However I think that because the structure is well-defined and each entry is homogeneous, making a column for each component is the way to go.

@danielballan
Copy link
Member Author

danielballan commented Feb 20, 2024

I think this would be my first pass:

scheme VARCHAR
netloc VARCHAR
path VARCHAR
query JSON[B]  # array of [key, value] arrays
  • Omit fragment because I've never seen a data_uri that uses it, and we can easily add it if we find one.
  • Omit userinfo because we should not be handling plaintext credentials in our database, full stop. We can revisit how to handle this kind of situation when we add support for data drawn from non-file-based sources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant