apply coerces to matrix, inane design decision #21

Open
ifly6 opened this issue Apr 25, 2018 · 3 comments

Comments

ifly6 commented Apr 25, 2018

Let's be honest. Apply is just broken for data frames. Defending it by saying that the user just doesn't understand the language, that the language is just fine, and the function is functioning correctly is like saying that your toolbox of misshapen tools where the hammer is just the curved end on both sides is 'just fine'.

The 'correct' way to do this in R apparently is just to write out a for loop. Fortunately for you, you can't just make a for loop iterate over rows, like for _, row in df.iterrows() in Pandas; you have to explicitly index them.

And fortunately for you, you can't just make a range like 1:nrow(df) (also, who made the stupid choice to call it nrow when nrows makes more sense, there being more than one row...) because if nrow(df) == 0 then it returns the sequence (1, 0), which breaks your code when you try to run it. R is just built for robustness!

But if you're doing lots of manipulation with lists, so you're familiar with sapply, you can probably fix that issue by using apply with the proper functions, right? Wrong.

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
})

You get this. Because R inexplicably decides that the best way to deal with data frames is to turn them all into data matrices first. So, here, the a column turns into ' TRUE' and 'FALSE'. Silently. Fantastic behaviour.

> wtf
      a  b c d     huh
1  TRUE  a 1 0  hooray
2 FALSE  b 2 0  hooray
3  TRUE  c 3 0  hooray
4 FALSE de 4 0  hooray
5  TRUE  f 5 0    huh?
6  TRUE  g 6 1 a thing
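
The output above can be reproduced without R. What follows is a minimal sketch in plain Python that mimics the coercion described in this post: every non-character column is rendered as text padded to a common width, and the comparisons then happen between strings. The helper names `format_column` and `classify` are made up for illustration, and this is an emulation of the observed behaviour, not R's actual implementation.

```python
# Hypothetical emulation of the row coercion described above; not R's own code.
columns = {
    "a": [True, False, True, False, True, True],
    "b": ["a", "b", "c", "de", "f", "g"],
    "c": [1, 2, 3, 4, 5, 6],
    "d": [0, 0, 0, 0, 0, 1],
}


def format_column(values):
    """Render a column as strings, right-justified to a common width."""
    if all(isinstance(v, str) for v in values):
        return list(values)  # character columns pass through unchanged
    texts = [str(v).upper() if isinstance(v, bool) else str(v) for v in values]
    width = max(len(t) for t in texts)
    return [t.rjust(width) for t in texts]


as_text = {name: format_column(vals) for name, vals in columns.items()}
rows = [dict(zip(as_text, cells)) for cells in zip(*as_text.values())]


def classify(row):
    # Assumption: the non-string operand is turned into text before comparing,
    # so TRUE becomes "TRUE", 5 becomes "5" and 1 becomes "1".
    if row["a"] == "TRUE":  # " TRUE" (padded) never equals "TRUE"
        return "we win"
    if row["c"] < "5":      # string comparison, not numeric
        return "hooray"
    if row["d"] == "1":
        return "a thing"
    return "huh?"


huh = [classify(row) for row in rows]
print(as_text["a"])  # [' TRUE', 'FALSE', ' TRUE', 'FALSE', ' TRUE', ' TRUE']
print(huh)           # ['hooray', 'hooray', 'hooray', 'hooray', 'huh?', 'a thing']
```

The padded ' TRUE' is what makes the first branch unreachable, which is why no row ever gets 'we win'.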

But in a reasonable and sensibly constructed system like Pandas, you can run the exact same thing, like this:

import pandas as pd
df = pd.DataFrame({
    'a': [True, False, True, False, True, True],
    'b': ['a', 'b', 'c', 'de', 'f', 'g'],
    'c': [1, 2, 3, 4, 5, 6],
    'd': [0, 0, 0, 0, 0, 1]
})
def funct(row):
    if row['a']: return 'we win'
    if row['c'] < 5: return 'horray'
    # use == to compare values; `is` tests object identity and is never true here
    if row['d'] == 1: return 'a thing'
    return 'huh?'

df['huh'] = df.apply(funct, axis=1)
print(df)

And get reasonable answers like these that follow. Look what is possible when you don't make stupid design decisions!

       a   b  c  d     huh
0   True   a  1  0  we win
1  False   b  2  0  horray
2   True   c  3  0  we win
3  False  de  4  0  horray
4   True   f  5  0  we win
5   True   g  6  1  we win

ifly6 commented May 3, 2018

At university, I learned that one type of programming language specification is to simply take an implementation of the programming language and define that as the specification. That was mostly done as a thought experiment, before moving on to the actual serious definitions, because it would lead to the insane consequence that it is actually impossible for there to be bugs in the reference implementation, since any behavior is by definition in accordance with the specification!

From a thread discussing why PHP has a left-associative ternary operator for inconceivable reasons.

Given that the response to raising this issue on the R forums was 'this is correct behaviour', I guess we shouldn't complain about anything. There are no bugs.

Eluvias commented Jul 6, 2018

library(plyr)

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)

foo.huh <- function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
}


plyr::adply(wtf, 1, .fun = foo.huh, .expand = TRUE, .id = NULL)
#>       a  b c d     V1
#> 1  TRUE  a 1 0 we win
#> 2 FALSE  b 2 0 hooray
#> 3  TRUE  c 3 0 we win
#> 4 FALSE de 4 0 hooray
#> 5  TRUE  f 5 0 we win
#> 6  TRUE  g 6 1 we win

Created on 2018-07-06 by the reprex package (v0.2.0).

dwinsemius commented Oct 26, 2019

Running ifly6's code in R, I got the same result as the one offered as the "more correct" result in Python (and then also offered via the plyr construction by Eluvias).

This whole rant seems to ignore the fact that apply is designed (and documented as such) to be used for matrices. It's not "broken" for dataframes; it's just the wrong tool for dataframes. There are a bunch of other reasons NOT to use apply for dataframes, such as the coercion of each row to the "lowest common denominator" data type, so factors become, not character, but rather integers. (You didn't use the "b" column, but it would have been a factor unless your site profile sets the default value of the stringsAsFactors option to FALSE.) Furthermore, the R code used row['a'] == T while the Python code used just the value of a logical vector. Using just the value would have been correct in R too. Unnecessarily testing for equality to TRUE is a common, error-prone practice.

And the correct way to create a range that iterates over a sequence like rownames(df) is: seq_along(rownames(df)). And that is precisely because of the potential error mechanism you point out for zero length vectors.
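
For readers coming from the Python side of this thread, the zero-length pitfall can be mimicked in a few lines. This is only a sketch: `r_colon` and `r_seq_len` are hypothetical helpers that imitate the integer behaviour of R's `1:n` and `seq_len(n)` (the latter being the count-based sibling of `seq_along`), not real library functions.

```python
def r_colon(start, stop):
    """Imitate R's start:stop for integers, which counts down when start > stop."""
    step = 1 if start <= stop else -1
    return list(range(start, stop + step, step))


def r_seq_len(n):
    """Imitate R's seq_len(n): 1..n, and an empty sequence when n is 0."""
    return list(range(1, n + 1))


n_rows = 0  # a data frame with no rows
print(r_colon(1, n_rows))  # [1, 0] -> a loop body would run twice on bad indices
print(r_seq_len(n_rows))   # []     -> a loop body would not run at all
```

With a non-empty input the two agree (for example both give 1, 2, 3 for three rows), so the difference only shows up in the empty case.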
