Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer as.Date() coercion to R level #6107

Open
wants to merge 2 commits into
base: fread-date-idate
Choose a base branch
from

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Apr 29, 2024

Closes #6105

As seen in the tests, this might be a breaking change if any downstreams depend on the output being specifically Date (and not IDate). I am not sure it's possible -- inherits(., "Date") is still TRUE, and relevant methods should just back up to Date methods if not available for IDate. I lean towards just going ahead with this change unless revdeps finds anything concerning, WDYT @jangorecki?

cc also @HughParsonage who may have some extra insights.


One alternative I looked into that won't work is adding "Date" as an alias for "IDate" in the type hierarchy:

data.table/src/freadR.c

Lines 27 to 29 in d19bfef

static int typeSxp[NUT] = {NILSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, INTSXP, REALSXP, REALSXP, REALSXP, REALSXP, INTSXP, REALSXP, STRSXP, REALSXP, STRSXP };
static char typeRName[NUT][10]={"NULL", "logical", "logical", "logical", "logical", "logical", "integer", "integer64", "double", "double", "double", "IDate", "POSIXct", "character", "numeric", "CLASS" };
static int typeEnum[NUT] = {CT_DROP, CT_EMPTY, CT_BOOL8_N, CT_BOOL8_U, CT_BOOL8_T, CT_BOOL8_L, CT_INT32, CT_INT64, CT_FLOAT64, CT_FLOAT64_HEX, CT_FLOAT64_EXT, CT_ISO8601_DATE, CT_ISO8601_TIME, CT_STRING, CT_FLOAT64, CT_STRING};

The problem is that there are files where our rudimentary date parser doesn't detect the date, but as.Date() does, meaning fread.c returns a string --> the hierarchy is broken as there's an attempt to cast string->int32. See in particular this test:

https://github.com/Rdatatable/data.table/blob/d19bfef7026f25bb2d36c17879187d09bcb2b2c3/inst/tests/tests.Rraw#L11043

Copy link
Member Author

MichaelChirico commented Apr 29, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite Graphite

@MichaelChirico MichaelChirico changed the title C-level changes needed Defer as.Date() coercion to R level Apr 29, 2024
@MichaelChirico

This comment was marked as outdated.

Copy link

github-actions bot commented May 1, 2024

Comparison Plot

Generated via commit 5fc89d7

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 59 seconds

Time taken to run atime::atime_pkg on the tests: 4 minutes and 11 seconds

@jangorecki
Copy link
Member

Yes, let's see revdeps

@MichaelChirico MichaelChirico marked this pull request as ready for review May 3, 2024 19:21
.ci/atime/tests.R Outdated Show resolved Hide resolved
.ci/atime/tests.R Outdated Show resolved Hide resolved
setup = {
DT = data.table(date=.Date(sample(20000, N, replace=TRUE)))
tmp_csv = tempfile()
fwrite(DT, tmp_csv)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki I think we should run fread(DT) once here in the setup because of cacheing, right? Or do we need to run in a subprocess? Here the benchmark is really about what happens after the fread.c code is run, only care about differences emerging (1) in freadR.c and/or (2) fread.R post-processing. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(regardless, we see substantial improvement in all 3 cases already)

src/freadR.c Outdated
@@ -335,7 +335,7 @@ bool userOverride(int8_t *type, lenOff *colNames, const char *anchor, const int
type[i]=CT_STRING; // e.g. CT_ISO8601_DATE changed to character here so that as.POSIXct treats the date-only as local time in tests 1743.122 and 2150.11
SET_STRING_ELT(colClassesAs, i, tt);
}
} else {
} else if (type[i] != CT_ISO8601_DATE || tt != char_Date) {
type[i] = typeEnum[w-1]; // freadMain checks bump up only not down
if (w==NUT) SET_STRING_ELT(colClassesAs, i, tt);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the consequence of a dangling else here? (As in, if the new condition evaluates to false, lines 339-340 would previously be evaluated.) I think a comment for this line would be useful here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the call-out, this is something I wasn't understanding well myself.

The key is we just skip updating type[i] for that case. We might consider rewriting this then as:

type[i] = (type[i] == CT_ISO8601_DATE && tt == char_Date) ? type[i] : typeEnum[w-1];

The "trouble" comes in the next line, because w==NUT in the relevant case here (i.e., "Date" is not a known class for C-side coercion).

So with the simple fix, we're back to R-side as.Date() coercion.

As noted, we might prefer that (retaining fully back-compatible Date output columns), but it'll require a corresponding change in the colClasses=list() branch.

So it's down again to whether the result of this PR should be "cols parsed as IDate and requested as Date are returned as IDate, fully avoiding any coercion" or "we now skip the middle step in IDate->char->Date coercion; for maximum efficiency, request IDate instead of Date".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants