In the Catalog assets table, we store data_uri as one string. Prompted by a comment from @callumforrester, I'm wondering whether we should be storing this in component parts, e.g. scheme (file), netloc (localhost), and path (/tmp/tmp_qqdv5m4/data/y).
Why? This would enable us to index them separately and easily:
Filter by scheme ("Show file-based URIs only...")
Extract (and possibly rewrite) the paths
In the future, extract (and possibly rewrite) the domain (~netloc)
How? The complete specification is large. The httpx.URL docstring has useful ASCII art:
The components of a URL are broken down like this:
  https://jo%40email.com:a%20secret@müller.de:1234/pa%20th?search=ab#anchorlink
[scheme]   [  username  ] [password] [ host ][port][ path ] [ query ] [fragment]
           [       userinfo        ] [   netloc   ][    raw_path    ]
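As a rough illustration of the split, here is a minimal sketch using the standard library's urllib.parse.urlsplit rather than httpx. The example URI is an assumption, assembled from the scheme, netloc, and path values mentioned above. Note that the standard library's netloc attribute includes any userinfo, unlike the diagram, where netloc covers only host and port.

```python
from urllib.parse import urlsplit

# Assumed example URI, assembled from the components named in this issue.
data_uri = "file://localhost/tmp/tmp_qqdv5m4/data/y"

parts = urlsplit(data_uri)

# Each component is available as a separate attribute.
scheme = parts.scheme   # "file"
netloc = parts.netloc   # "localhost" (would also carry userinfo, if any)
path = parts.path       # "/tmp/tmp_qqdv5m4/data/y"

print(scheme, netloc, path)
```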
We should give some thought to how finely to carve it, and whether or not we want to include everything at the start. (I'm not sure about fragment and I have concerns about userinfo.)
We could keep everything in one column but make it a JSON[B] column. That would make the components indexable. However, I think that because the structure is well-defined and each entry is homogeneous, making a column for each component is the way to go.
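To make the column-per-component option concrete, here is a small sketch using an in-memory SQLite database. The column names (data_uri_scheme, data_uri_netloc, data_uri_path), the index name, and the second sample row are hypothetical, chosen only for illustration; the real assets table and database backend may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one column per URI component instead of one string.
conn.execute(
    """
    CREATE TABLE assets (
        id INTEGER PRIMARY KEY,
        data_uri_scheme TEXT NOT NULL,
        data_uri_netloc TEXT NOT NULL,
        data_uri_path TEXT NOT NULL
    )
    """
)
# A plain index on the scheme column supports "file-based URIs only" filters.
conn.execute("CREATE INDEX ix_assets_scheme ON assets (data_uri_scheme)")

conn.executemany(
    "INSERT INTO assets (data_uri_scheme, data_uri_netloc, data_uri_path) "
    "VALUES (?, ?, ?)",
    [
        ("file", "localhost", "/tmp/tmp_qqdv5m4/data/y"),
        ("https", "example.org", "/data/z"),  # made-up non-file entry
    ],
)

# Filter by scheme and extract the paths.
file_paths = [
    row[0]
    for row in conn.execute(
        "SELECT data_uri_path FROM assets WHERE data_uri_scheme = ? ORDER BY id",
        ("file",),
    )
]
print(file_paths)
```

With separate columns, both the filter and the path extraction are ordinary column operations, with no string parsing inside the query.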
Omit fragment because I've never seen a data_uri that uses it, and we can easily add it if we find one.
Omit userinfo because we should not be handling plaintext credentials in our database, full stop. We can revisit how to handle this kind of situation when we add support for data drawn from non-file-based sources.
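One possible way to enforce these two omissions at write time is sketched below. The function name split_data_uri and the DataURIParts type are hypothetical, and rejecting URIs that contain userinfo (rather than silently stripping it) is an assumption about the desired policy. The fragment is simply not stored; handling of the query component is left open here.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class DataURIParts(NamedTuple):
    """Hypothetical container for the components we would store."""
    scheme: str
    netloc: str
    path: str


def split_data_uri(data_uri: str) -> DataURIParts:
    """Split a URI into storable parts, refusing embedded credentials."""
    parts = urlsplit(data_uri)
    # The stdlib netloc includes userinfo; an "@" means credentials are present.
    if "@" in parts.netloc:
        raise ValueError("data_uri must not contain userinfo (credentials)")
    # parts.fragment is deliberately dropped; parts.query is not handled here.
    return DataURIParts(parts.scheme, parts.netloc, parts.path)


print(split_data_uri("file://localhost/tmp/tmp_qqdv5m4/data/y#ignored"))
```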