Add a way to refer to multiple joined frames' columns #3421

Open
hallmeier opened this issue Feb 7, 2023 · 9 comments
Labels
improve Improvement of an existing functionality

Comments

@hallmeier

I want to select the rows of df whose column A matches jf, then join with jf2 and update column C with jf2's column D (naming that column C as well wouldn't help here).

from datatable import dt, f, join, g

df = dt.Frame("""A B C
                 a e 0
                 b e 0
                 b f 0
                 c f 0
                 d f 2""")

jf = dt.Frame("""A
                 b
                 c""")
jf.key = "A"

jf2 = dt.Frame("""B D
                  e 3
                  f 4""")
jf2.key = "B"

So after updating, df would be:

df_desired = dt.Frame("""A B C
                         a e 0
                         b e 3
                         b f 4
                         c f 4
                         d f 2""")

Joining works perfectly:

df[g[0] != None, :, join(jf), join(jf2)]
#    | A      B          C      D
#    | str32  str32  int32  int32
# -- + -----  -----  -----  -----
#  0 | b      e          0      3
#  1 | b      f          0      4
#  2 | c      f          0      4
# [3 rows x 4 columns]

But I can't update in the same step because column D cannot be accessed. I'd like to do something like this:

df[g[0] != None, dt.update(C=g["D"]), join(jf), join(jf2)]
# datatable.exceptions.KeyError: Column D does not exist in the Frame; did you mean A?

The columns of df are in f and the columns of jf are in g, but the columns of jf2 cannot be accessed in the j-statement.

While this is a feature request, I'd also appreciate good ideas for workarounds.
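To make the requested semantics concrete, here is a minimal plain-Python sketch of the update. It deliberately uses no datatable calls: the frames are modelled as lists and dicts, and the helper name `update_c` is hypothetical. It assumes that a row is updated only when its A value is present in jf and its B value has a match in jf2.

```python
# Plain-Python model of the requested update; not datatable API.
rows = [
    {"A": "a", "B": "e", "C": 0},
    {"A": "b", "B": "e", "C": 0},
    {"A": "b", "B": "f", "C": 0},
    {"A": "c", "B": "f", "C": 0},
    {"A": "d", "B": "f", "C": 2},
]
jf_keys = {"b", "c"}      # models jf, keyed on A
jf2 = {"e": 3, "f": 4}    # models jf2, keyed on B, mapping B -> D


def update_c(rows, jf_keys, jf2):
    """Return copies of rows where C is replaced by jf2's D for rows matched by both joins."""
    updated = []
    for row in rows:
        new_row = dict(row)
        if row["A"] in jf_keys and row["B"] in jf2:
            new_row["C"] = jf2[row["B"]]
        updated.append(new_row)
    return updated


result = update_c(rows, jf_keys, jf2)
```

Rows whose A value is not in jf keep their original C value.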

@oleksiyskononenko
Contributor

oleksiyskononenko commented Feb 8, 2023

The issue here is that you are doing two joins at once. While technically this is going to work, as we allow multiple join nodes internally, this is not something we ever guaranteed to work. If you look at the [i, j, ...] documentation, you will notice that there is only one join parameter; hence, there is only one g namespace.

While we could eventually add official support for multiple joins and multiple g namespaces (though it could be pretty cumbersome for users), for the moment I would not recommend doing [i, j, join(...), join(...), ...], because we don't even cover that in our tests.

As a workaround, I would propose to split your logic into several steps, i.e.

>>> DT = df[g[0] != None, :, join(jf)]
>>> DT[:, [f["A"], f["B"], g["D"].alias("C")], join(jf2)]
   | A      B          C
   | str32  str32  int32
-- + -----  -----  -----
 0 | b      e          3
 1 | b      f          4
 2 | c      f          4
[3 rows x 3 columns]

@hallmeier
Author

Okay, thank you for your answer. In the documentation of the join parameter it says "This parameter may be listed multiple times if you need to join with several frames.", so I thought it was intended functionality. I propose you clarify this a bit more, depending on how you plan to move forward regarding multiple join frames. While this would be cool functionality that has some uses, I understand that considering the technical implications everywhere makes development more complex.

@oleksiyskononenko
Contributor

Yes, you are right. But from the signature it is not obvious that one can do multiple joins, and we probably didn’t think it through with respect to addressing the other joined frames. I also don’t see even one test that covers the multiple-join functionality.

My feeling is that if we allow multiple joins, we must have a way to address each joined frame’s columns. The problem is what the new namespaces should look like:
- new letters;
- a list of namespaces.

The options I just listed are not really good from the user's perspective, I guess, though the second one could be acceptable to some extent.

@samukweku
Collaborator

@hallmeier do you mind pointing me to the link with the quote you referenced about calling the parameter multiple times?

@hallmeier
Author

It's right in the __getitem__ documentation oleksiys linked

@samukweku
Collaborator

samukweku commented Feb 9, 2023

Wow, I like the fact that you can join multiple frames... Keeping track of namespaces might be complex 🤷‍♂️. At any rate, I don't think it should be deprecated; probably update the docs to say that at the moment only two namespaces are supported, with an example.

@oleksiyskononenko
Contributor

Yeah, we definitely need to address this issue at some point, though it is not obvious to me how. The way we are doing it now with f and g is not really flexible when it comes to addressing an arbitrary joined frame.

@oleksiyskononenko oleksiyskononenko added the improve Improvement of an existing functionality label Feb 14, 2023
@oleksiyskononenko oleksiyskononenko changed the title Namespaces of subsequent joinframes Add a way to refer to multiple joined frames' columns Feb 14, 2023
@hallmeier
Author

The "list of namespaces" idea sounds good to me. f and g are pretty standard, so we shouldn't mess with them. But h could hold a list of namespaces for joined frames after the first one.

An alternative idea is to have a dict of namespaces populated by keyword arguments of join. Normally, it is empty, but if you pass a frame to join() as a keyword argument, you can retrieve its namespace by this name. If the join() API should stay extensible for future keyword arguments, named frames could be passed as a dict in the first argument.
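The two ideas can be compared with a small plain-Python model. This is only a sketch of the lookup behaviour: the `Namespace` class, the list `h`, and the dict `named` are all hypothetical and do not exist in datatable, and a real namespace would return a column expression rather than a tuple.

```python
# Hypothetical model only: none of these names exist in datatable.
class Namespace:
    """Stand-in for a column namespace such as f or g."""

    def __init__(self, frame_name, columns):
        self.frame_name = frame_name
        self.columns = tuple(columns)

    def __getitem__(self, column):
        if column not in self.columns:
            raise KeyError(f"Column {column} does not exist in {self.frame_name}")
        # A real namespace would return a column expression; a tuple is enough here.
        return (self.frame_name, column)


# Option 1: a list (h in the proposal) of namespaces for the joined
# frames after the first one, addressed by position.
h = [Namespace("jf2", ["B", "D"])]
by_position = h[0]["D"]

# Option 2: a dict of namespaces, keyed by the name given to each joined frame.
named = {
    "jf": Namespace("jf", ["A"]),
    "jf2": Namespace("jf2", ["B", "D"]),
}
by_name = named["jf2"]["D"]
```

In this model both lookups resolve to the same column. The difference is that the list requires remembering the order in which frames were joined, while the dict lets each frame be addressed by a name chosen at the join call.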

@oleksiyskononenko
Contributor

Yes, a dict is probably better, because it is complicated to keep track of the joined frames once there is more than one of them.


3 participants