CLI support? #73

Open
halaxa opened this issue Jan 28, 2022 · 18 comments
Labels
enhancement New feature or request

Comments

@halaxa
Owner

halaxa commented Jan 28, 2022

Please let me know by reactions/voting or comments if a CLI version of JSON Machine would be useful to have. Thanks.

The jm command would take a JSON stream from stdin and send items one by one to stdout, each wrapped in a single-item JSON object encoded as {key: value}.

Possible usage:

$ wget <big list of users to stdout> | jm --pointer=/results
{"0": {"name": "Frank Sinatra", ...}}
{"1": {"name": "Ray Charles", ...}}
...

Another idea might be to wrap the item in a JSON list instead of an object, like so:

$ wget <big list of users to stdout> | jm --pointer=/results
[0, {"name": "Frank Sinatra", ...}]
[1, {"name": "Ray Charles", ...}]
...
@fwolfsjaeger
Contributor

On Linux this can already be done using jq: https://stedolan.github.io/jq/

@halaxa
Owner Author

halaxa commented Jan 31, 2022

Good point, I didn't know that jq supports stream parsing. The speed will be incomparable, that's clear, but jq's usage with stream parsing seems somewhat unintuitive. While looking at the usage of jq, another option for jm came to my mind:

$ wget <big list of users to stdout> | jm --pointer=/results
{"key": 0, "value": {"name": "Frank Sinatra", ...}}
{"key": 1, "value": {"name": "Ray Charles", ...}}
...

It is extensible for other fields in the future such as position, matchedPointer and so on...
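
For comparison, the three output shapes proposed so far can be put side by side. This is only an illustrative Python sketch, not part of JSON Machine; the names wrap_item and to_lines and the style labels are made up for this example.

```python
import json

def wrap_item(key, value, style):
    """Wrap one streamed item in one of the three shapes proposed in this thread."""
    if style == "object":   # {"0": {...}}
        return {str(key): value}
    if style == "list":     # [0, {...}]
        return [key, value]
    if style == "record":   # {"key": 0, "value": {...}}
        return {"key": key, "value": value}
    raise ValueError(f"unknown style: {style}")

def to_lines(items, style):
    """Render (key, value) pairs as newline-delimited JSON, one item per line."""
    return [json.dumps(wrap_item(k, v, style)) for k, v in items]

users = [(0, {"name": "Frank Sinatra"}), (1, {"name": "Ray Charles"})]
for style in ("object", "list", "record"):
    print("\n".join(to_lines(users, style)))
```

Note that the "object" shape has to turn an array index into a string key, while the "list" and "record" shapes keep the index as a number. Only the "record" shape leaves room for extra named fields.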

@fwolfsjaeger
Contributor

You're right, using jq for stream parsing is not intuitive at all. A CLI wrapper for JSON Machine can easily be added:

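#!/usr/bin/env php
<?php

use JsonMachine\Items;

// Assumes this script lives one directory below the project root (e.g. bin/jm),
// so that vendor/ is a sibling of the script's directory.
$autoloader = dirname(__DIR__).'/vendor/autoload.php';

if ( ! is_file($autoloader)) {
    throw new LogicException('Composer autoloader missing. Try running "composer install".');
}

require_once $autoloader;

function usage()
{
    // Write usage to stderr so it never mixes with the JSON output on stdout.
    fwrite(STDERR, sprintf('usage: %s --pointer=""', basename(__FILE__))."\n");
    exit(1);
}

// The short-options argument must be a string; passing null is deprecated since PHP 8.1.
$options = getopt('', ['pointer:']);

if ( ! isset($options['pointer'])) {
    usage();
}

// getopt() returns ['pointer' => '...'], which is passed on as the options array.
$iterator = Items::fromFile('php://stdin', $options);

// Emit one JSON document per line; the item keys are not printed.
foreach ($iterator as $row) {
    echo json_encode($row)."\n";
}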

@halaxa
Owner Author

halaxa commented Feb 4, 2022

Yes, something along those lines. Using PassthruDecoder would eliminate the overhead of decoding each item and then encoding it back. Also, a simple templating system (Mustache, for example) could be used to let users format the decoded data if they wish. Like:

$ wget <big list of users to stdout> | jm --item-template="{{name}};{{born}}"
Frank Sinatra;1915
Ray Charles;1930

Combined with a JSON pointer, it could be quite versatile.
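
To make the template idea concrete, here is a rough Python sketch of the substitution step. It is not real Mustache and not JSON Machine code: it only handles flat {{field}} placeholders, and the render function and its treatment of missing fields are assumptions of this example.

```python
import re

# Matches {{field}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template, item):
    """Replace each {{field}} with the matching top-level field of item."""
    def substitute(match):
        # Missing fields render as an empty string (an assumption of this sketch).
        return str(item.get(match.group(1), ""))
    return PLACEHOLDER.sub(substitute, template)

users = [
    {"name": "Frank Sinatra", "born": 1915},
    {"name": "Ray Charles", "born": 1930},
]
for user in users:
    print(render("{{name}};{{born}}", user))
```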

@halaxa halaxa added the enhancement New feature or request label Sep 2, 2022
@pkoppstein

For the uninitiated, jq's streaming parser is usually quite difficult to use, but worse, for the following two essential tasks (described here using standard jq syntax), it is typically very slow (many hours or days) for very big files:

  1. .[] - that is, "explode" an array into a stream of its top-level items;

  2. keys_unsorted[] as $k | {($k): .[$k]} -- that is, "exploding" an object into a stream of corresponding singleton (i.e. single-key) objects.

To my knowledge, there is currently no CLI-tool for running these two jq queries conveniently, speedily, and losslessly against very large JSON arrays or objects, respectively. (By "lossless" I mean avoiding the loss of precision in handling JSON numbers.)
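
For readers who do not use jq, the semantics of the two queries can be sketched in Python. This sketch loads the whole document into memory, so it shows only what the output should look like, not how to stream it; the function name explode is made up here.

```python
import json

def explode(value):
    """Yield the top-level items of an array, or singleton objects for an object."""
    if isinstance(value, list):
        # Same result as jq's `.[]` applied to an array.
        yield from value
    elif isinstance(value, dict):
        # Same result as `keys_unsorted[] as $k | {($k): .[$k]}`.
        for key, item in value.items():
            yield {key: item}
    else:
        raise TypeError("only arrays and objects can be exploded")

for item in explode(json.loads('{"a": 1, "b": [2, 3]}')):
    print(json.dumps(item))
```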

Being able to use JSON Pointer to fine-tune the point of the "explosion" would be fantastic!

Thank you!

@pkoppstein

pkoppstein commented Oct 11, 2022

@fwolfsjaeger - Unfortunately your script does not preserve the JSON structure of the items at the specified point(s).

Or at least, I tried it with 'pointer' => '/-' and with input: [1,2,[10,20],30] but the array is in effect completely flattened.

--

Incidentally, after running composer successfully, I tried running your script (in the same directory), but the result is
a fatal error, with the message:

Composer autoloader missing.

(In fact, both the files

./vendor/autoload.php
./vendor/halaxa/json-machine/src/autoloader.php

are present.)

@halaxa
Owner Author

halaxa commented Oct 11, 2022

Or at least, I tried it with 'pointer' => '/-' and with input: [1,2,[10,20],30] but the array is in effect completely flattened.

That's correct behavior. If you want to iterate over the top level, use the empty-string JSON pointer (the default). Read more about it in the README to see exactly how a hyphen in a JSON pointer works. By using /- you tell it that you want to iterate over each iterable in the top-level array. Most of the values are scalars, so they are not iterated, just passed along as they are. But when it hits the only array you have there, that array is itself iterated. So the result is flat.
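
The difference between the two pointers can be modelled in a few lines of Python. This is only a model of the behavior described above, not JSON Machine's implementation; both function names are hypothetical, and object keys are dropped for brevity.

```python
def iterate_top_level(document):
    """Empty pointer "": yield each top-level item unchanged."""
    yield from document

def iterate_wildcard(document):
    """Pointer "/-" as described above: iterate every top-level item that is
    itself iterable and pass scalars along unchanged."""
    for item in document:
        if isinstance(item, list):
            yield from item
        elif isinstance(item, dict):
            yield from item.values()  # keys dropped in this sketch
        else:
            yield item

data = [1, 2, [10, 20], 30]
print(list(iterate_top_level(data)))
print(list(iterate_wildcard(data)))
```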

@pkoppstein

pkoppstein commented Oct 11, 2022

@halaxa - Thank you for your explanation. Please understand that the difficulty I had was precisely because I read the README quite closely, the point being that in JSON, numbers are scalars, not iterables. That is, I would have expected that an attempt to iterate over a number would either result in an error, or nothing at all. (In jq, gojq, and jaq, it results in an error e.g. $JQ -n '12 | .[]'.)

Part of my confusion arose from statements such as the following in an "Overview of JSON Pointer": (*)

Note that the first character of this String [the JSON Pointer] is a ‘/' – this is a syntactic requirement.

When I tried using "/" as the JSON Pointer, I just got an error, so "/-" seemed like the next best bet.

The fact that @fwolfsjaeger's script requires a pointer didn't help my understanding.

Now that I understand how to iterate over an array, I would like to know how to avoid loss of numeric precision, e.g.

400000000000000000000000000000000000000000000000000000000123 => 4.0e+59

Thank you again.

(*) https://www.baeldung.com/json-pointer

@halaxa
Owner Author

halaxa commented Oct 11, 2022

No problem :)

This sentence

Note that the first character of this String is a ‘/' – this is a syntactic requirement.

from here https://www.baeldung.com/json-pointer is incorrect. See https://www.rfc-editor.org/rfc/rfc6901#section-5. The official RFC is also linked from the JSON Machine README https://github.com/halaxa/json-machine#what-is-json-pointer-anyway.

"/" actually means "iterate over the empty-string key in the root dictionary".

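The RFC 6901 rules behind this can be sketched as a small Python tokenizer. The function name parse_json_pointer is made up for this example; the rules themselves (the empty string addresses the whole document, every other pointer starts with "/", and ~1 is unescaped before ~0) come from the RFC.

```python
def parse_json_pointer(pointer):
    """Split an RFC 6901 JSON Pointer into its unescaped reference tokens."""
    if pointer == "":
        # The empty pointer has no tokens and refers to the whole document.
        return []
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    # Unescape "~1" to "/" first, then "~0" to "~", as the RFC requires.
    return [
        token.replace("~1", "/").replace("~0", "~")
        for token in pointer[1:].split("/")
    ]

for example in ("", "/", "/results", "/a~1b/m~0n"):
    print(repr(example), parse_json_pointer(example))
```

So the quoted sentence is only true for non-empty pointers: "" yields no tokens at all, while "/" yields a single token that is the empty string.
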
@halaxa
Owner Author

halaxa commented Oct 11, 2022

I'll elaborate on the other two points of yours later. Hopefully tomorrow.

@pkoppstein

pkoppstein commented Oct 12, 2022

@halaxa @fwolfsjaeger - My PHP was never good to begin with and is by now very rusty, but the following script has already proven useful to me and might provide a basis for further improvements. Suggestions would of course be welcome.

[EDIT: The script has been moved to Issue#88 ]

@halaxa
Owner Author

halaxa commented Oct 12, 2022

Can you move this last post to a new discussion, please?

@pkoppstein

@halaxa - The last post is essentially a CLI script, so I thought this would be the best thread?

By the way, many of my colleagues who might benefit from a script such as jm would probably be discouraged by the installation hurdles that currently exist, so I was wondering whether you could envision at some point making JSON Machine available via homebrew? Or are there other alternatives you can suggest?

@halaxa
Owner Author

halaxa commented Oct 12, 2022

It sure is a CLI script. But I understand you want some suggestions, and Discussions would be a better place for that. If you want to actually contribute code to this repository, please use a pull request. This thread is mainly for ideas and suggestions about how the CLI interface should work. As for other installation channels, I'll leave that to someone else for now. It needs its own maintenance time, which I don't have. It's OSS, so anyone can generate any package from any revision. But thank you for your suggestion. Please keep them coming ;)

@halaxa
Owner Author

halaxa commented Oct 12, 2022

@pkoppstein That is, I would have expected that an attempt to iterate over a number would either result in an error, or nothing at all. (In jq, gojq, and jaq, it results in an error e.g. $JQ -n '12 | .[]'.)

I understand the confusion. The idea is that you can specify either an iterable or a scalar and JSON Machine will always give it to you. Of course, you can run into confusion when using a wildcard JSON Pointer. I have an idea: what about an option for enabling a strict mode? Either you specify AUTO (current behavior) or you explicitly set SCALAR_ONLY or VECTOR_ONLY, based on what you want to iterate and which error you want to get otherwise.
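
One way to read the strict-mode proposal is sketched below in Python. Nothing like this exists in JSON Machine; the Mode enum reuses the names from the proposal, while stream_match, its error type, and the exact error conditions are guesses made for illustration.

```python
from enum import Enum

class Mode(Enum):
    AUTO = "auto"                # current behavior: accept scalars and vectors
    SCALAR_ONLY = "scalar_only"  # error if the matched value is an array or object
    VECTOR_ONLY = "vector_only"  # error if the matched value is a scalar

def stream_match(value, mode=Mode.AUTO):
    """Stream one matched value, enforcing the requested mode."""
    is_vector = isinstance(value, (list, dict))
    if mode is Mode.SCALAR_ONLY and is_vector:
        raise TypeError("expected a scalar, got an array or object")
    if mode is Mode.VECTOR_ONLY and not is_vector:
        raise TypeError("expected an array or object, got a scalar")
    if isinstance(value, list):
        yield from value
    elif isinstance(value, dict):
        yield from value.values()  # keys dropped in this sketch
    else:
        yield value

print(list(stream_match(12)))
print(list(stream_match([10, 20], Mode.VECTOR_ONLY)))
```

Because stream_match is a generator, the mode check only fires once the caller starts iterating.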

@pkoppstein Now that I understand how to iterate over an array, I would like to know how to avoid loss of numeric precision, e.g. 400000000000000000000000000000000000000000000000000000000123 => 4.0e+59

Pass a custom ExtJsonDecoder instance, configured with JSON_BIGINT_AS_STRING, as the decoder option.

@pkoppstein

@halaxa - I've created Issue#88 in accordance with your request. Should I now delete the script from the message in this thread?

As mentioned in Issue#87, I'm not sure how JSON_BIGINT_AS_STRING helps, as it converts all "bigints" to strings, and doesn't even attempt to handle big or small decimals.
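
The two kinds of precision loss can be reproduced outside PHP. The Python sketch below is only an analogy: Python integers are arbitrary precision and parse_float=Decimal keeps decimal literals intact, which is not what PHP's JSON_BIGINT_AS_STRING does (that flag returns large integers as their original string and leaves decimals alone). The decimal literal is an invented example.

```python
import json
from decimal import Decimal

BIG_INT = "400000000000000000000000000000000000000000000000000000000123"
SMALL_DEC = "0.1000000000000000000000001"  # hypothetical high-precision decimal

# A 64-bit double cannot hold all the digits of either literal.
lossy_int = float(BIG_INT)
lossy_dec = json.loads(SMALL_DEC)

# Exact parsing: big integers stay integers, decimals become Decimal objects.
exact_int = json.loads(BIG_INT)
exact_dec = json.loads(SMALL_DEC, parse_float=Decimal)

print(int(lossy_int) == exact_int)   # the double no longer equals the original
print(str(exact_int) == BIG_INT)     # every digit of the integer survives
print(str(exact_dec) == SMALL_DEC)   # every digit of the decimal survives
```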

As for SCALAR_ONLY/VECTOR_ONLY, I think that the current behavior is fine, and maybe even the "correct" one, depending on how it's documented. In the "help" provided by my version of the jm script, I've attempted to clarify the intent by emphasizing the idea of "streaming" rather than "iteration".

@halaxa
Owner Author

halaxa commented Oct 12, 2022

I've created Issue#88 in accordance with your request. Should I now delete the script from the message in this thread?

I agree with deleting it.

As mentioned in Issue#87, I'm not sure how JSON_BIGINT_AS_STRING helps, as it converts all "bigints" to strings, and doesn't even attempt to handle big or small decimals.

Your example has an integer, so that's why I suggested it. What exact decimal problem are you facing? Oh, I read this before #87. Let's continue there.

@pkoppstein

pkoppstein commented Oct 16, 2022

It's not clear to me how a CLI script can be implemented to handle a JSON file that contains more than one top-level JSON entity. There are tools for converting such files to JSON Lines format (one JSON entity per line), but it's inconvenient to have to place each of these in a separate file for the sake of JSON Machine. In addition, some of these tools lose numerical precision.

Any suggestions would be appreciated. Thanks.
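
As a point of reference, reading several concatenated top-level JSON values is possible with Python's standard library alone. The sketch below is not a JSON Machine feature and it holds the whole input in memory, so it does not address very large files; the function name iter_json_values is made up. It relies on json.JSONDecoder.raw_decode, which returns the decoded value together with the position where it ended.

```python
import json
from decimal import Decimal

def iter_json_values(text):
    """Yield each top-level JSON value found in a string of concatenated values."""
    # parse_float=Decimal keeps decimal literals exact; integers are exact already.
    decoder = json.JSONDecoder(parse_float=Decimal)
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value

for value in iter_json_values('{"a": 1} [1, 2]\n"x" 3'):
    print(value)
```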


3 participants