Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --json-file/-j option to read from file w/ json contents #311

Open
wants to merge 40 commits into
base: master
Choose a base branch
from

Conversation

jacobmealey
Copy link
Contributor

@jacobmealey jacobmealey commented May 16, 2024

This adds --json-file as an option to read from a file that contains a json representation of the urls. It works very similarly to --url-file. The --json-file input structure is designed to match roughly what the output of --json produces. the simplest json object possible is:

[{ "parts": { "host": "example.com" }}]

You can pass an arbitrary number of these url JSON objects and perform operations on them like you would with any other url in trurl.

Something that I did that may be up for debate is how the query is handled. I chose to make the query only be able to be set by an array of "keys" and "values" (the "params" section of --json") and it will throw a warning if the "query" component of "parts" is set.

A few examples:

$ cat testfile/test004.txt
[
  {
    "parts": {
      "scheme": "ftp",
      "user": "scream",
      "host": "url.com",
      "path": "/",
      "query": "ignoredquery",
      "fragment": "a-fragment"
    },
    "params": [
      {
        "key": "query",
        "value": "pair"
      },
      {
        "key": "singlequery",
      }
    ]
  },
  {
    "parts": {
      "scheme": "http",
      "host": "example.org",
      "path": "/"
    }
  }
]
$ trurl -j testfiles/test0004.txt --set "fragment=a-new-fragment" 
trurl note: ignoring 'query', provide a separate 'params' array.
ftp://scream@url.com/?query=pair&singlequery#a-new-fragment
http://example.org/#a-new-fragment
$ trurl -j testfiles/test0004.txt --append "path=file"
trurl note: ignoring 'query', provide a separate 'params' array.
ftp://scream@url.com/file?query=pair&singlequery#a-fragment
http://example.org/file

Currently the feature is disabled by default, do turn it on you need to add TRURL_JSON_IN to your environment to enable it.

fixes: #291

@jacobmealey
Copy link
Contributor Author

jacobmealey commented May 16, 2024

okay I need to figure out whats going on with the windows builds, the cygwin was was working on my local branch until very recently, might be something i changed on the way we are doing the file i/o. I will take a closer look this evening.

Edit: Okay, i got the cygwin build working. I just need to add libjson-c to the curl-for-win build. @vszakats would this be done by manually compiling and linking it in here: https://github.com/curl/curl-for-win/blob/main/trurl.sh ?

@vszakats
Copy link
Member

vszakats commented May 16, 2024

It's a new dependency, kind of a big deal to add support for one in general. One approach is to make it a build option, disabled by default, so existing trurl builds (including curl-for-win) continue to work without modification. Another option is to delete/disable the curl-for-win CI job till I (or someone else) get around adding support for it. A 3rd one is to add an option to disable this feature, and I update curl-for-win to set it.

@vszakats
Copy link
Member

...that said, have you thought of using a JSON parser that's a self-contained C89 source (with a fitting license)? It's still a dependency, but perhaps easier to integrate and carry.

@jacobmealey
Copy link
Contributor Author

jacobmealey commented May 17, 2024

have you thought of using a JSON parser that's a self-contained C89 source

I was looking a lot at https://github.com/DaveGamble/cJSON, but I didn't like how it looks and json-c has some nice features. Both this and json-c are in vcpkg, freebsd repos, and most linux distro repositories so i was hoping it wouldn't be too much work either way for maintainers.

If folks want a smaller ANSI c json library instead of the one I chose, I think that's a fair request and I can start working this again.

for now, I will put it behind a compile time flag as you suggested

Edit:
There are a ton of other libraries listed on json.org that could be used too.

tests.json Outdated Show resolved Hide resolved
Co-authored-by: Viktor Szakats <vszakats@users.noreply.github.com>
Co-authored-by: Viktor Szakats <vszakats@users.noreply.github.com>
Copy link
Member

@bagder bagder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We made the -f function possible to work as a filter, as in it reads URLs and outputs the results in the mean time. This JSON reader reads the entire file before it acts on it, which makes it not work as a filter and if you want to work an a huge number of JSON URLs this will allocate a lot of memory...

trurl.c Outdated
Comment on lines 1655 to 1658
while((n = fread(json_string + latest, sizeof(char), json_buf_size, file))) {
latest += n;
json_buf_size *= 2;
char *new_json_string = realloc(json_string, json_buf_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is something wrong in the math here.

First loop entry: latest is 0, json_buf_size is 1024.

It loads 1024 bytes. latest is bumped to 1024. json_buf_size is doubled to 2048.

The buffer is realloced to be 2048 bytes.

Now loop.

The fread reads data into offset 1024 in the buffer that is 2048 bytes big. But fread is asked to read 2048 bytes, meaning that it will write outside the buffer if the file is larger than 2048 bytes.

Or am I wrong?

I think you should rather convert this to read data into a stack based fixed size buffer, then copy that data to the end of a realloced buffer of the appropriate size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a better approach. I was pretty unhappy with this in general (I tried I to rework it but it broke on windows lol) Along with your comment about making it act more like -f I will need to redo this.

It would be ideal if it only allocated the memory required for the current url in the array, which shouldnt be too hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A way to get it to behave more like -f, would be to have it scan for matching { and } (with proper checking you aren't in a string a what not) and when you've go a full enclosed object it would take that and parse it and perform the trurl operations on it. this would make it so it only allocates as much memory as required for the current buffer. the big issue i see with this is that trurl could parse non-conformant json unless we bake all that parsing logic for arrays in as well (at the point how far would we be from an actual json parser?).

trurl.c Outdated Show resolved Hide resolved
@jacobmealey
Copy link
Contributor Author

@bagder, i've made a few big modifications. Mainly I made it so it reads one json object at a time, so it now only allocates as much memory as required to parse the largest single json object, instead of attempting to read until the of the file. I also made it so it reads in a fixed buffer on the stack, and then allocates more data on the heap if need be as you suggested.

Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
jacobmealey and others added 2 commits May 24, 2024 16:12
Co-authored-by: Viktor Szakats <vszakats@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Get path components by index / json array
3 participants