`df1Row[df2Column] == df1Row[df2Column.name]` #445

Jolanrensen · 2023-08-22T14:04:38Z

Implemented fix for #442.

It used to be that df1Row[df2Column] == df2Column[df1Row.index], but IMO that's confusing. I can see it was done to make df1Row[df2Column] and df2Column[df1Row] return the same thing but across the library, this is only actually used once (in toList and was fixed by swapping row/cols).

It's also very common for the library to call df1Row[newColumn] for reasons I don't fully understand. So my fix checks if the name of newColumn already exists in the row, if so, return the value at that name. If not, do the magic you did before and let the column handle the resolving of the value.

This seems to work with all tests and fixes the specific example of the issue.

…Column.name]

koperagen · 2023-09-12T15:23:32Z

This behavior is defined by org.jetbrains.kotlinx.dataframe.DataColumn#resolveSingle. Logic here is that when you use DataColumn as a ColumnReference and it's resolved, you get values from the column itself, not the dataframe column with such name. I do not immediately recall where it works like that. Hm.. I think we need some more evidence to make a decision and change behavior in org.jetbrains.kotlinx.dataframe.DataColumn#resolveSingle

Jolanrensen · 2023-09-13T11:16:33Z

The only place I found it breaking in the library was here: core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/toList.kt. There, the column was taken from the row while the other way around was intended.

I also looked into forbidding such behavior, as taking a column from another dataframe from a row seems odd, but that's still used a lot across the implementation.

But purely from a user-standpoint, what would you expect df1Row[df2Column] to return?

koperagen · 2023-09-19T19:25:59Z

Ah, ok, i think i finally understood. Somehow this part was hard for me to understand, that's why i was suspicious of this fix

It used to be that df1Row[df2Column] == df2Column[df1Row.index], but IMO that's confusing. I can see it was done to make df1Row[df2Column] and df2Column[df1Row] return the same thing but across the library, this is only actually used once (in toList and was fixed by swapping row/cols).

Now i think we talk about the same thing, but differently. Yeah, indeed df1Row[df2Column] gives you values from df2Column if df2Column is pure DataColumn, unlike columns created by column accessors API:

val name by columnOf(1, 2, 3) // ResolvingValueColumn
df.select { name }.print()

val col = df.name.lastName.map { it.length } named "abc" // pure DataColumn
df.select { col }.print()

So yeah, to me the reason why it works like this is that we need this behavior for data schema api:

 df.select { 
    name.lastName // pure DataColumn
}

But it backfires in getValue(row: AnyRow) because it doesn't make sense. Hm, still not sure if i see the full picture, but i hope my thoughts made it clearer for you as well :D What do you think?
I can say that fix looks reasonable. I suggest to change if (column.name() in df.columnNames()) to something faster. Maybe check that ColumnReference is our problematic DataColumn, or something else

Jolanrensen · 2023-09-20T10:59:26Z

@koperagen Thanks! you just gave me another test case that this change indeed broke. I'll try to find another solution which is a) more efficient and b) doesn't break df.select { dataColumnWithNameAlsoExistingInDf }

Jolanrensen · 2023-09-20T18:27:35Z

That's interesting...

val aColumn = columnOf(3, 4, 5) named "a"
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }

This results in "1, 2, 3". So it takes the name from aColumn, since it exists in the dataframe already, instead of returning the new "3, 4, 5" column. But the funny thing is that this does not use my fix. This behavior was already there.

koperagen · 2023-09-20T18:41:22Z

That's interesting...
val aColumn = columnOf(3, 4, 5) named "a"
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }
This results in "1, 2, 3". So it takes the name from aColumn, since it exists in the dataframe already, instead of returning the new "3, 4, 5" column. But the funny thing is that this does not use my fix. This behavior was already there.

Right. Columns created by columnOf are "force resolved" and behave differently from columns created by let's say df["a"]

public inline fun <reified T> columnOf(vararg values: T): DataColumn<T> =
    createColumn(values.asIterable(), typeOf<T>(), true).forceResolve()

Jolanrensen · 2023-09-21T10:54:38Z

@koperagen hmm, you're right

val aColumn = dataFrameOf("a")(4, 5, 6)["a"]
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }

results in "4, 5, 6" indeed. I don't know entirely how to rectify this anymore... but I do think they should all behave the same

…speed.

Jolanrensen · 2023-09-22T11:51:04Z

But it backfires in getValue(row: AnyRow) because it doesn't make sense. Hm, still not sure if i see the full picture, but i hope my thoughts made it clearer for you as well :D What do you think? I can say that fix looks reasonable. I suggest to change if (column.name() in df.columnNames()) to something faster. Maybe check that ColumnReference is our problematic DataColumn, or something else

Okay, new implementation doesn't check all names anymore; it attempts to resolve both manners (from column by row index and from dataframe by name) and gives by name priority if it was able to resolve it and it was different from the other manner:

val fromColumnByRow = column.getValue(this)

val fromDfByName = df
    .let {
        try {
            it.getColumnOrNull(column.name())
        } catch (e: IllegalStateException) {
            return fromColumnByRow
        }
    }
    .let { it ?: return fromColumnByRow }
    .let {
        try {
            it[index]
        } catch (e: IndexOutOfBoundsException) {
            return fromColumnByRow
        }
    }
    .let {
        try {
            it as R
        } catch (e: ClassCastException) {
            return fromColumnByRow
        }
    }

return when {

    // Issue #442: df1Row[df2Column] should be df1Row[df2Column.name], not df2Column[df1Row(.index)]
    // so, give fromDfByName priority if it's not the same as fromColumnByRow
    fromDfByName != fromColumnByRow -> fromDfByName

    else -> fromColumnByRow
}

added test and small change such that df1Row[df2Column] == df1Row[df2…

4365c83

…Column.name]

Jolanrensen added bug Something isn't working invalid This doesn't seem right labels Aug 22, 2023

Jolanrensen requested a review from koperagen August 22, 2023 14:04

Jolanrensen self-assigned this Aug 22, 2023

not checking all column names anymore in DataRowImpl.get(column) for …

b91ce33

…speed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`df1Row[df2Column] == df1Row[df2Column.name]` #445

`df1Row[df2Column] == df1Row[df2Column.name]` #445

Jolanrensen commented Aug 22, 2023

koperagen commented Sep 12, 2023 •

edited

Jolanrensen commented Sep 13, 2023

koperagen commented Sep 19, 2023

Jolanrensen commented Sep 20, 2023

Jolanrensen commented Sep 20, 2023

koperagen commented Sep 20, 2023 •

edited

Jolanrensen commented Sep 21, 2023

Jolanrensen commented Sep 22, 2023 •

edited

df1Row[df2Column] == df1Row[df2Column.name] #445

Are you sure you want to change the base?

df1Row[df2Column] == df1Row[df2Column.name] #445

Conversation

Jolanrensen commented Aug 22, 2023

koperagen commented Sep 12, 2023 • edited

Jolanrensen commented Sep 13, 2023

koperagen commented Sep 19, 2023

Jolanrensen commented Sep 20, 2023

Jolanrensen commented Sep 20, 2023

koperagen commented Sep 20, 2023 • edited

Jolanrensen commented Sep 21, 2023

Jolanrensen commented Sep 22, 2023 • edited

`df1Row[df2Column] == df1Row[df2Column.name]` #445

`df1Row[df2Column] == df1Row[df2Column.name]` #445

koperagen commented Sep 12, 2023 •

edited

koperagen commented Sep 20, 2023 •

edited

Jolanrensen commented Sep 22, 2023 •

edited