Document-dependent path evaluation #102

Open
jakajancar opened this issue Jun 17, 2023 · 11 comments

Comments

@jakajancar

jakajancar commented Jun 17, 2023

Let's say we have a property with an array of dynamically-typed, user-provided parameters.

$options = ['pointer' => '/user-provided-parameters/-'];

Items::fromString('{"user-provided-parameters": [1,2,3]}', $options);
// Expected: [1,2,3]
// Actual: same

Items::fromString('{"user-provided-parameters": [1,false,null]}', $options);
// Expected: [1,false,null]
// Actual: same

Items::fromString('{"user-provided-parameters": [1, [2,3]]}', $options);
// Expected: [1, [2,3]]
// Actual: [1, 2, 3]

Items::fromString('{"user-provided-parameters": []}', $options);
// Expected: []
// Actual: Exception: Paths '/user-provided-parameters/-' were not found in json stream.

This makes JSON Machine not very useful for working with documents with a more dynamic schema. Moreover, even arrays have a special case at length == 0.

I checked the JSON Pointer spec to see if this is an implementation bug or by design. It seems JSON Pointer is not intended for (JSONPath-like) selection at all, but for navigation to a single node. Even the - is interpreted differently: in the spec it refers to the (nonexistent) member after the last array element, whereas here it is a wildcard that matches any array index. A spec-style pointer also would not have the above problem and would always navigate to the expected subtree. It would be better if the readme said "a syntax inspired by JSON Pointer".
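
For comparison, here is a minimal sketch of spec-style (RFC 6901) resolution in plain PHP. It is an illustration only and not JSON Machine code; the function name resolvePointer is made up for this example. It assumes the document was decoded with json_decode defaults (objects as stdClass, arrays as PHP arrays) and PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

/**
 * Resolve an RFC 6901 JSON Pointer against a decoded document.
 * Always returns exactly one node or throws; nothing is iterated.
 */
function resolvePointer(mixed $document, string $pointer): mixed
{
    if ($pointer === '') {
        return $document; // the empty pointer refers to the whole document
    }
    if ($pointer[0] !== '/') {
        throw new InvalidArgumentException("Pointer must be empty or start with '/'");
    }

    $node = $document;
    foreach (explode('/', substr($pointer, 1)) as $token) {
        // Unescape in the order the RFC requires: "~1" first, then "~0".
        $token = str_replace(['~1', '~0'], ['/', '~'], $token);

        if (is_array($node)) {
            if ($token === '-') {
                // In the spec, "-" names the nonexistent element after the last one.
                throw new OutOfBoundsException("'-' refers to a nonexistent array element");
            }
            $isIndex = preg_match('/^(0|[1-9][0-9]*)$/D', $token) === 1;
            if (!$isIndex || !array_key_exists((int) $token, $node)) {
                throw new OutOfBoundsException("No array element '$token'");
            }
            $node = $node[(int) $token];
        } elseif ($node instanceof stdClass) {
            if (!property_exists($node, $token)) {
                throw new OutOfBoundsException("No object member '$token'");
            }
            $node = $node->{$token};
        } else {
            throw new OutOfBoundsException("Cannot descend into a scalar at '$token'");
        }
    }
    return $node;
}

$doc = json_decode('{"user-provided-parameters": [1, [2, 3]], "empty": []}');
echo json_encode(resolvePointer($doc, '/user-provided-parameters')), "\n"; // [1,[2,3]]
echo json_encode(resolvePointer($doc, '/empty')), "\n";                    // []
```

Under these semantics the nested array and the empty array are both returned whole, because the pointer only navigates and never iterates.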

Re. a solution, it would be great if there was an option to not automatically descend deeper than the specified path and make the subtree selection not dependent on the values in it.

@halaxa
Owner

halaxa commented Jun 17, 2023

TL;DR Just omit the dash at the end.

Hi. Your example works as expected. It seems that in your case the JSON Pointer (pointer option) is just not used correctly. The pointer option means "iterate over items in this element". If you only need to iterate over the items in the user-provided-parameters key, just use /user-provided-parameters as the pointer. The dash at the end means "any index", so it matches /user-provided-parameters/0, /user-provided-parameters/1, and so on, and then tries to iterate over what's inside the value at that index. If you need more explanation, let me know or have a second look at the JSON Machine documentation.

@jakajancar
Author

Thanks for the quick response! You're right.

I tried to reduce the case and did it incorrectly. Let me try again:

Let's say we have a number[][] matrix where we want to iterate through cells, same as:

function cells($matrix) {
    foreach ($matrix as $row) {
        foreach ($row as $cell) {
            yield $cell;
        }
    }
}
$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[1,2], [3,4]]}', $options);
// Expected: [1,2,3,4]
// Actual: same

Items::fromString('{"table": [[1,2], 3]}', $options);
// Expected: error
// Actual: [1,2,3]

Is this possible?

@jakajancar
Author

And the reason I was using /table/-/- was because then you get nice results in getCurrentJsonPointer():

1  -  /table/0/0
2  -  /table/0/1
3  -  /table/1/0
4  -  /table/1/1

@jakajancar
Author

What are your thoughts on an option "flatten" => false (default true), under which, taking your examples:

| JSON Pointer value | Will iterate through |
|---|---|
| (empty string - default) | ["this", "array"] or {"a": "this", "b": "object"} will be iterated (main level) |
| /result/items | {"result": {"items": ["this", "array", "will", "be", "iterated"]}} |
| /0/items | [{"items": ["this", "array", "will", "be", "iterated"]}] (supports array indices) |
| /results/-/status | {"results": [{"status": "iterated"}, {"status": "also iterated"}]} (a hyphen as an array index wildcard) |
| / (gotcha! - a slash followed by an empty string, see the spec) | {"":["this","array","will","be","iterated"]} |
| /quotes\" | {"quotes\"": ["this", "array", "will", "be", "iterated"]} |

all of them would return a single item, except /results/-/status (with its explicit wildcard), which would return the same as today?

@halaxa
Owner

halaxa commented Jun 18, 2023

I'm not sure what the question is now. Can you be more specific?

Anyway, let me just elaborate a little on the flatten topic. JSON Machine supports finding data in a JSON document down to a single scalar value if needed. It does that automatically. If it finds a scalar value at a pointer instead of an object or an array, it just yields it in a single iteration. So it might seem that it somehow flattens the structure when used in combination with - and when the structure is not rigid. But in reality, no such thing happens.

Try this and you'll see no deep flattening is happening:

$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[[1,2]], [3,4]]}', $options);
// Expected: [[1,2],3,4]

Also, this example is not expected to produce an error:

$options = ['pointer' => '/table/-'];
Items::fromString('{"table": [[1,2], 3]}', $options);

because at /table/0 there is [1,2] which is sequentially iterated, and at /table/1 there is 3 which is a scalar value and as such it's simply yielded as a single value.
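
The rule described here (iterate whatever iterable sits at each matched position, yield a scalar as a single item) can be emulated in plain PHP. This is a rough sketch of the behaviour as explained in this thread, not JSON Machine's streaming implementation; matchNodes and emulateItems are made-up names, object keys are dropped for brevity, and the "path not found" exception is not modelled.

```php
<?php
declare(strict_types=1);

/**
 * Collect every node matched by the pointer tokens; "-" matches any array index.
 */
function matchNodes(mixed $node, array $tokens): array
{
    if ($tokens === []) {
        return [$node];
    }
    $token = array_shift($tokens);
    $matches = [];
    if (is_array($node)) {
        foreach ($node as $index => $child) {
            if ($token === '-' || $token === (string) $index) {
                $matches = array_merge($matches, matchNodes($child, $tokens));
            }
        }
    } elseif ($node instanceof stdClass && property_exists($node, $token)) {
        $matches = matchNodes($node->{$token}, $tokens);
    }
    return $matches;
}

/**
 * Emulate the described rule: matched iterables are iterated one level deep,
 * matched scalars are yielded once.
 */
function emulateItems(string $json, string $pointer): array
{
    $tokens = $pointer === '' ? [] : explode('/', substr($pointer, 1));
    $items = [];
    foreach (matchNodes(json_decode($json), $tokens) as $node) {
        if (is_array($node) || $node instanceof stdClass) {
            foreach ($node as $value) {
                $items[] = $value;
            }
        } else {
            $items[] = $node;
        }
    }
    return $items;
}

echo json_encode(emulateItems('{"table": [[1,2], 3]}', '/table/-')), "\n";       // [1,2,3]
echo json_encode(emulateItems('{"table": [[[1,2]], [3,4]]}', '/table/-')), "\n"; // [[1,2],3,4]
```

Only one level below each matched position is iterated, which is why the inner [1,2] in the second call stays intact.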

@jakajancar
Author

jakajancar commented Jun 18, 2023

I would expect a behavior where:

  • For every non-wildcard pointer component:
    • Machine navigates into the object property/array element, and
    • the number of items does not increase.
  • For every wildcard pointer component:
    • Machine explodes the object properties/array elements,
    • the number of the items increases, and
    • the key/index is available using getCurrentJsonPointer().

Currently, even a non-wildcard component explodes the items (but has nowhere to indicate this in the path), if the element pointed to is an object/array. It is this behavior that I would like to have a way to disable.
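
The requested semantics can be sketched in plain PHP as follows. This is only an illustration of the proposal on an already decoded document, not an existing JSON Machine feature; selectNodes is a hypothetical name, the generated pointers are not escaped, and throwing on a non-iterable under a wildcard is one possible choice (skipping it would be the other).

```php
<?php
declare(strict_types=1);

/**
 * Yield "concrete pointer => value" pairs with no automatic descent:
 * only "-" components multiply the number of items.
 */
function selectNodes(mixed $node, array $tokens, string $path = ''): Generator
{
    if ($tokens === []) {
        yield $path => $node; // the matched node is returned whole, never iterated
        return;
    }
    $token = array_shift($tokens);
    if ($token === '-') {
        if (!is_array($node) && !($node instanceof stdClass)) {
            throw new UnexpectedValueException("Expected an array or object at '$path'");
        }
        // Wildcard: explode the elements and record each key in the path.
        foreach ($node as $key => $child) {
            yield from selectNodes($child, $tokens, "$path/$key");
        }
    } elseif (is_array($node) && array_key_exists($token, $node)) {
        // Non-wildcard: navigate only, the item count stays the same.
        yield from selectNodes($node[$token], $tokens, "$path/$token");
    } elseif ($node instanceof stdClass && property_exists($node, $token)) {
        yield from selectNodes($node->{$token}, $tokens, "$path/$token");
    }
}

$doc = json_decode('{"2d": [[1,2], [3,[4,5]]]}');
foreach (selectNodes($doc, ['2d', '-', '-']) as $pointer => $value) {
    echo $pointer, ' = ', json_encode($value), "\n";
}
// /2d/0/0 = 1
// /2d/0/1 = 2
// /2d/1/0 = 3
// /2d/1/1 = [4,5]
```

With this rule the nested [4,5] arrives as one item, and a document such as {"2d": [[1,2], false]} raises an exception once the generator reaches /2d/1.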


Below is (yet another) example, which demonstrates both my concerns (indexes in getCurrentJsonPointer() and unpredictable levels).

Say you have a two-level array mixed[][], where all of these are valid:

{"2d": [[1,2], [3]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
{"2d": [[1,2], [3,true]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
    $value['2d'][1][1] (/2d/1/1) = true
{"2d": [[1,2], [3,[4,5]]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
    $value['2d'][1][1] (/2d/1/1) = [4,5]

The following is not valid, because it's not really mixed[][]:

{"2d": [[1,2], false]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] = error

I would like to

  1. properly get the elements in the valid examples,
  2. know their indexes, and
  3. (ideally) somewhat gracefully handle the invalid example (error or ignore the non-matching value).

This cannot be currently achieved:

  • If you use /2d/-/-
    • ✅ You do get both indices.
    • ❌ Third valid example ([[1,2], [3,[4,5]]]) gets flattened (and you get 5 items)
    • ✅ The invalid example ignores the invalid element.
  • If you use /2d/-:
    • ❌ You do not get both indices, only the first.
    • ✅ Third valid example doesn't get flattened (properly get 4 elements)
    • ❌ The invalid example gets silently ignored (you get same items as first valid example)

@halaxa
Owner

halaxa commented Jun 18, 2023

  • If you use /2d/-/-

    • ✅ You do get both indices.
    • ❌ Third valid example ([[1,2], [3,[4,5]]]) gets flattened (and you get 5 items)
      • That's a feature, not a bug as explained earlier.
    • ✅ The invalid example ignores the invalid element.
  • If you use /2d/-:

    • ❌ You do not get both indices, only the first.
    • ✅ Third valid example doesn't get flattened (properly get 4 elements)
    • ❌ The invalid example gets silently ignored (you get same items as first valid example)
      • Not-found items get ignored. That's normal behavior. It's as if you wanted the find command to fail on every existing file in the searched dir that does not match searched string.

@halaxa
Owner

halaxa commented Jun 18, 2023

Sorry for being brief ;)

@jakajancar
Author

No worries, I appreciate your responses, responsiveness, and patience with me iterating on trying to get the best example.

  • If you use /2d/-/-

    • ❌ Third valid example ([[1,2], [3,[4,5]]]) gets flattened (and you get 5 items)

      • That's a feature, not a bug as explained earlier.

Yes, I understand. But disabling this feature is essentially my feature request! :D

  • If you use /2d/-:

    • ❌ You do not get both indices, only the first.

I'm not saying that the items do not get iterated over, just that in the getCurrentJsonPointer() return value you don't have both indices (which makes sense, since there is no "placeholder" for them).

  • ❌ The invalid example gets silently ignored (you get same items as first valid example)

    • Not-found items get ignored. That's normal behavior. It's as if you wanted the find command to fail on every existing file in the searched dir that does not match searched string.

By "silently ignored" I don't mean that it is not returned by the iterator (that's what happens with /2d/-/- and that's OK), but that it is returned exactly as it would be if it were in a different structure.


Perhaps I owe an explanation for this admittedly weird use-case:

I'm querying OpenAI's text completions AI with the new function calling/structured output mechanism, which returns JSON. JSON Machine is used to return results in a streaming fashion to the user live (see videos here if curious). That table should be string[][] and 95% of the time it is, but occasionally the model hallucinates and omits a level of nesting, adds a level of nesting, or returns the wrong number of rows or cells. So when iterating over /2d/-/- I check that both indexes increase monotonically with no gaps, that the values are indeed strings, and so on... very defensively.
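
A defensive check of this kind might look roughly like the sketch below. It is not the code used in the project described above; the function name validatedCells, the '/table/<row>/<col>' pointer shape and the exception type are assumptions made for illustration, and the input is any iterable of "pointer => value" pairs.

```php
<?php
declare(strict_types=1);

/**
 * Validate a stream of "pointer => value" pairs for a string[][] table.
 * Yields [row, column, value]; throws on gaps, reordering, or non-string cells.
 * Note: an empty row produces no cells, so this sketch reports it as a gap.
 */
function validatedCells(iterable $cells): Generator
{
    $expectedRow = 0;
    $expectedCol = 0;
    foreach ($cells as $pointer => $value) {
        if (!preg_match('#^/table/(\d+)/(\d+)$#D', (string) $pointer, $m)) {
            throw new UnexpectedValueException("Unexpected pointer '$pointer'");
        }
        [$row, $col] = [(int) $m[1], (int) $m[2]];

        if ($row === $expectedRow + 1 && $col === 0 && $expectedCol > 0) {
            // Moving to the next row is only valid once the current row has a cell.
            [$expectedRow, $expectedCol] = [$row, 0];
        }
        if ($row !== $expectedRow || $col !== $expectedCol) {
            throw new UnexpectedValueException("Gap or reordering at '$pointer'");
        }
        if (!is_string($value)) {
            throw new UnexpectedValueException("Non-string cell at '$pointer'");
        }
        yield [$row, $col, $value];
        $expectedCol = $col + 1;
    }
}

try {
    $cells = ['/table/0/0' => 'a', '/table/0/1' => 'b', '/table/1/0' => 'c'];
    foreach (validatedCells($cells) as [$row, $col, $value]) {
        echo "row $row, col $col: $value\n";
    }
} catch (UnexpectedValueException $e) {
    echo 'Rejected: ', $e->getMessage(), "\n";
}
```

Because the check is a generator, valid cells can still be streamed to the user until the first inconsistency is detected.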


To recap, I don't think path no. 2 (/2d/-) is the way forward. /2d/-/- is mostly there, but I would prefer not to have that auto-descent feature.

@halaxa
Owner

halaxa commented Jun 30, 2023

But disabling this feature is essentially my feature request! :D

Now it makes perfect sense 😁. Because in terms of JSON Machine there's no 'flattening', I'd suggest modifying the scalar parsing logic, which is what's actually behind your problem. Maybe an option named something like iterate_scalars, with three settings:

  • AUTO (current behavior, would remain the default)
  • ALWAYS/ONLY/FORCE (an iterable on the pointer position will throw)
  • NEVER (a scalar on the pointer position will throw)

This example of yours:

$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[1,2], 3]}', $options);
// Expected: error
// Actual: [1,2,3]

would then throw an error with the option 'iterate_scalars' => NEVER.
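
To make the three proposed settings concrete, here is a small plain-PHP sketch. The iterate_scalars option does not exist in JSON Machine at the time of this thread; the constants and the expandMatches function are hypothetical, and the input is assumed to be the list of nodes that the pointer has already matched.

```php
<?php
declare(strict_types=1);

// Hypothetical modes for the proposed 'iterate_scalars' option.
const ITERATE_SCALARS_AUTO = 'auto';     // current behaviour: accept both kinds
const ITERATE_SCALARS_ALWAYS = 'always'; // an iterable at the pointer position throws
const ITERATE_SCALARS_NEVER = 'never';   // a scalar at the pointer position throws

/**
 * Expand nodes already matched by a pointer according to the chosen mode.
 */
function expandMatches(array $matchedNodes, string $mode = ITERATE_SCALARS_AUTO): array
{
    $items = [];
    foreach ($matchedNodes as $node) {
        $isIterable = is_array($node) || $node instanceof stdClass;
        if ($isIterable && $mode === ITERATE_SCALARS_ALWAYS) {
            throw new UnexpectedValueException('Iterable found where a scalar is required');
        }
        if (!$isIterable && $mode === ITERATE_SCALARS_NEVER) {
            throw new UnexpectedValueException('Scalar found where an iterable is required');
        }
        if ($isIterable) {
            foreach ($node as $value) {
                $items[] = $value;
            }
        } else {
            $items[] = $node;
        }
    }
    return $items;
}

// The nodes that '/table/-' matches in {"table": [[1,2], 3]} are [1,2] and 3.
$matched = json_decode('{"table": [[1,2], 3]}')->table;
echo json_encode(expandMatches($matched)), "\n"; // [1,2,3]

try {
    expandMatches($matched, ITERATE_SCALARS_NEVER);
} catch (UnexpectedValueException $e) {
    echo 'NEVER: ', $e->getMessage(), "\n";
}
```

In this sketch the NEVER mode rejects the document because of the scalar 3, which corresponds to the "Expected: error" outcome in the example above.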

@halaxa
Owner

halaxa commented Jun 30, 2023

Also for a less predictable structure maybe #36 would help?
