Include array index #107

Open
Minitour opened this issue Dec 7, 2023 · 2 comments
Minitour commented Dec 7, 2023

Is your feature request related to a problem? Please describe.
I am streaming values from a large JSON file into a dataframe, but I cannot group related items together because the event prefixes do not say which array element each value belongs to.

Describe the solution you'd like
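Currently every array element is reported with the same prefix, for example A.item.B.item.C, so repeated elements cannot be told apart.

It would be great to have the array index in the prefix instead, something like: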
A.[0].B.[0].C

For example for the following object:

{
   "A": [
         {
              "B": [
                     { "C": "Test-1" },
                     { "C": "Test-2" },
                     { "C": "Test-3" }
               ]
         },
         {
              "B": [
                     { "C": "Test-4" },
                     { "C": "Test-5" },
                     { "C": "Test-6" }
               ]
         }
   ]
}

I would expect to see the following events:

Prefix          Name    Value
A.[0].B.[0].C   string  Test-1
A.[0].B.[1].C   string  Test-2
A.[0].B.[2].C   string  Test-3
A.[1].B.[0].C   string  Test-4
A.[1].B.[1].C   string  Test-5
A.[1].B.[2].C   string  Test-6

Describe alternatives you've considered
N/A

Minitour changed the title from "Include array depth" to "Include array index" on Dec 7, 2023

Minitour commented Dec 7, 2023

Update:

I hacked something real quick by modifying the common.py:

from ijson import utils  # only needed if this is used outside common.py


class Index:
    """Mutable position of the current element within an array."""

    def __init__(self, initial_value=-1):
        # -1 means "no element seen yet", so the first element gets index 0.
        self._value = initial_value

    def increment(self):
        self._value += 1

    def decrement(self):
        self._value -= 1

    def __str__(self):
        return f'{self._value}'


@utils.coroutine
def parse_basecoro(target):
    path = []
    while True:
        event, value = yield
        # Any value that starts directly inside an array (a map, a nested
        # array or a scalar) is the next element of that array.
        if (event not in ('map_key', 'end_map', 'end_array')
                and path and isinstance(path[-1], Index)):
            path[-1].increment()
        if event == 'map_key':
            prefix = '.'.join(map(str, path[:-1]))
            path[-1] = value
        elif event == 'start_map':
            prefix = '.'.join(map(str, path))
            path.append(None)  # placeholder until the first map_key
        elif event == 'end_map':
            path.pop()
            prefix = '.'.join(map(str, path))
        elif event == 'start_array':
            prefix = '.'.join(map(str, path))
            path.append(Index())
        elif event == 'end_array':
            path.pop()
            prefix = '.'.join(map(str, path))
        else:  # any scalar value
            prefix = '.'.join(map(str, path))
        target.send((prefix, event, value))

Although it is not the best solution, it certainly achieves what I am looking for. Please consider adding something similar, but in the meantime, I will be using patch to monkey-patch the library.
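For anyone who wants the same effect without monkey-patching, here is a minimal sketch of a wrapper that works on the low-level (event, value) pairs and builds indexed prefixes itself. The name indexed_events is hypothetical and not part of ijson; the event list below is written by hand so the snippet runs on its own, on the assumption that it mirrors the pairs a basic parser would produce for a shortened version of the example object.

```python
def indexed_events(basic_events):
    """Turn (event, value) pairs into (prefix, event, value) triples,
    using 0-based array indices where ijson uses the constant 'item'.

    Hypothetical helper, not part of ijson.
    """
    path = []  # map keys (str, or None before the first key) and array positions (int)
    for event, value in basic_events:
        # A value that starts directly inside an array is its next element.
        starts_value = event not in ('map_key', 'end_map', 'end_array')
        if starts_value and path and isinstance(path[-1], int):
            path[-1] += 1
        if event == 'map_key':
            prefix = '.'.join(map(str, path[:-1]))
            path[-1] = value
        elif event in ('end_map', 'end_array'):
            path.pop()
            prefix = '.'.join(map(str, path))
        else:
            prefix = '.'.join(map(str, path))
            if event == 'start_map':
                path.append(None)  # placeholder until the first map_key
            elif event == 'start_array':
                path.append(-1)    # no element seen yet
        yield prefix, event, value


# Hand-written events for {"A": [{"B": [{"C": "Test-1"}, {"C": "Test-2"}]},
#                                {"B": [{"C": "Test-4"}]}]}
events = [
    ('start_map', None), ('map_key', 'A'), ('start_array', None),
    ('start_map', None), ('map_key', 'B'), ('start_array', None),
    ('start_map', None), ('map_key', 'C'), ('string', 'Test-1'), ('end_map', None),
    ('start_map', None), ('map_key', 'C'), ('string', 'Test-2'), ('end_map', None),
    ('end_array', None), ('end_map', None),
    ('start_map', None), ('map_key', 'B'), ('start_array', None),
    ('start_map', None), ('map_key', 'C'), ('string', 'Test-4'), ('end_map', None),
    ('end_array', None), ('end_map', None),
    ('end_array', None), ('end_map', None),
]

strings = [(p, v) for p, e, v in indexed_events(events) if e == 'string']
for prefix, value in strings:
    print(prefix, value)
```

In real use the pairs would presumably come from something like ijson.basic_parse(f), which should keep this independent of the backend in use. Note that a map key that happens to be a number is indistinguishable from an array index in the joined prefix.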

rtobar commented Dec 13, 2023

Hi @Minitour, thanks for taking an interest in improving ijson!

I think the idea is good in principle, but the suggested implementation is not going to fly. In particular:

  • It breaks code for users of the items and kvitems calls, and that's an absolute no.
  • It also breaks code for users of the parser calls, and that's also an absolute no.
  • I'm not sure if you're aware, but modifying common.py applies the changes to all backends except yajl2_c, which is the default one (because it's 10x faster than the next one in the list).

If I implemented this, I'd do it at the items/kvitems level, where you could interpret the [n]s in the given prefix and match them to the nth appearance of item in the underlying path. Also, maybe instead of a.b.[0].c one could simply have a.b.0.c? The brackets seem unnecessary.

In any case, I'm in no hurry to implement this. Maybe if more people somehow upvote this I could give it some attention. It would also be an incentive if someone (you?) presented a modified version of items_basecoro that understood these numeric indices as indicated above, hopefully with tests -- then we could iterate into a final solution that covered all backends.
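As a starting point for that discussion, the matching rule described above could be sketched roughly as follows. This is only an illustration of the idea, not ijson code: prefix_matches is a hypothetical helper, and it assumes the underlying path is tracked as a list of map keys (str) and array positions (int). An 'item' component keeps its current meaning of "any element", so existing prefixes would continue to match as before, while a numeric component matches only that position.

```python
def prefix_matches(pattern, path):
    """Return True if an indexed path matches a dotted prefix pattern.

    pattern: e.g. 'A.item.B.1.C'; 'item' matches any array position,
             a number matches only that position.
    path:    list of map keys (str) and array positions (int).

    Hypothetical helper, not part of ijson.
    """
    parts = pattern.split('.') if pattern else []
    if len(parts) != len(path):
        return False
    for part, component in zip(parts, path):
        if isinstance(component, int):
            # Array position: accept the wildcard or the exact index.
            if part != 'item' and part != str(component):
                return False
        elif part != component:
            # Map key: must match literally.
            return False
    return True


print(prefix_matches('A.item.B.1.C', ['A', 0, 'B', 1, 'C']))
print(prefix_matches('A.0.B.1.C', ['A', 1, 'B', 1, 'C']))
```

A real implementation would have to live where the path is built (or be repeated per backend), and would need tests for keys that are literally named "item" or look like numbers.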
