write_csv adds 15 decimal digits to numbers obtained by subtracting from 1 #1516

Open
sralchemab opened this issue Sep 13, 2023 · 4 comments

Comments

@sralchemab

Hi there.

I'm puzzled by how write_csv writes some values. In the example below, the variables first_value to fifth_value, which were obtained by subtracting from 1, look like this when written to a CSV file using write_csv:

# test_write_csv.csv
first_value,second_value,third_value,fourth_value,fifth_value,sixth_value,seventh_value
0.050000000000000044,0.09999999999999998,0.9999999999999998,0.09999999999999998,0.9999999999999998,3.1,310

When subtracting from a number other than 1, the values are written as expected (see the sixth and seventh variables).

Note 1: All of the variables are numeric.
Note 2: if you look at the fifth variable, you will notice that every value has a decimal digit, and it still fails.

Here's the code to reproduce the issue:

(first_value <- 1-0.95)
#> [1] 0.05
(second_value <- 1-0.9)
#> [1] 0.1
(third_value <- (1-0.9)*10)
#> [1] 1
(fourth_value <- 1.0-0.9)
#> [1] 0.1
(fifth_value <- (1.0-0.9)*10)
#> [1] 1
(sixth_value <- 4-0.9)
#> [1] 3.1
(seventh_value <- (4-0.9)*100)
#> [1] 310
(df <- data.frame(
    "first_value" = first_value,
    "second_value" = second_value,
    "third_value" = third_value,
    "fourth_value" = fourth_value,
    "fifth_value" = fifth_value,
    "sixth_value" = sixth_value,
    "seventh_value" = seventh_value
))
#>   first_value second_value third_value fourth_value fifth_value sixth_value
#> 1        0.05          0.1           1          0.1           1         3.1
#>   seventh_value
#> 1           310
readr::write_csv(df, "test_write_csv.csv")

Below is the sessionInfo() output from a laptop running macOS, although I have also tried it on a similar install on Debian:

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20.0.0 (64-bit)
Running under: macOS Monterey 12.6.6

Matrix products: default
BLAS/LAPACK: /Users/santiagorevale/miniconda3/envs/rcore/lib/libopenblas.0.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_2.1.4

loaded via a namespace (and not attached):
 [1] compiler_4.2.3  magrittr_2.0.3  R6_2.5.1        cli_3.6.1
 [5] hms_1.1.3       tools_4.2.3     pillar_1.9.0    glue_1.6.2
 [9] tibble_3.2.1    utf8_1.2.3      fansi_1.0.4     vctrs_0.6.3
[13] tzdb_0.4.0      lifecycle_1.0.3 pkgconfig_2.0.3 rlang_1.1.1

I would really appreciate any feedback on this matter.

Best,
Santiago

@joranE

joranE commented Sep 14, 2023

This is merely an artifact of floating point arithmetic as implemented on all computers and is not specific to R. Not all values can be represented exactly in floating point arithmetic.

Run options(digits = 17) and then df[[1]] and you'll see that the actual value is the long version being written by readr, and so that is actually the correct value.

@sralchemab
Author

sralchemab commented Sep 14, 2023

Hi @joranE.

If you do options(digits = 17), even the sixth variable has a long value (3.1000000000000001), although it's not being written as such. How does the function decide where to place the digits cutoff? I couldn't find a way to adjust it to avoid writing the values like this. How would you do it?

Other analogous functions I work with, like utils::write.csv or data.table::fwrite, and even your own readr::write_csv2, produce the output I would expect:

readr::write_csv(df, "test_write_csv.csv")
# first_value,second_value,third_value,fourth_value,fifth_value,sixth_value,seventh_value
# 0.050000000000000044,0.09999999999999998,0.9999999999999998,0.09999999999999998,0.9999999999999998,3.1,310

readr::write_csv2(df, "test_write_csv2.csv")
# first_value;second_value;third_value;fourth_value;fifth_value;sixth_value;seventh_value
# 0,05;0,1;1;0,1;1;3,1;310

utils::write.csv(df, "data.utils.csv", row.names = FALSE)
# "first_value","second_value","third_value","fourth_value","fifth_value","sixth_value","seventh_value"
# 0.05,0.1,1,0.1,1,3.1,310

data.table::fwrite(df, "data.fwrite.csv")
# first_value,second_value,third_value,fourth_value,fifth_value,sixth_value,seventh_value
# 0.05,0.1,1,0.1,1,3.1,310
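
The outputs above are consistent with write_csv emitting the shortest decimal string that reads back to exactly the same double, while write.csv and fwrite appear to stop at about 15 significant digits. The sketch below illustrates that idea in base R. `shortest_roundtrip` is a hypothetical helper written for this thread, not readr's actual code, and it assumes as.numeric() parses the candidate strings accurately.

```r
# Hypothetical helper (not part of readr): try 1 to 17 significant digits
# and return the first "%g"-style string that parses back to the same double.
shortest_roundtrip <- function(x) {
  for (d in 1:17) {
    s <- formatC(x, digits = d, format = "g")
    if (as.numeric(s) == x) return(s)
  }
  s  # 17 significant digits are always enough for a double
}

shortest_roundtrip(1 - 0.95)  # needs all 17 digits
shortest_roundtrip(1 - 0.9)   # needs 16 digits
shortest_roundtrip(4 - 0.9)   # "3.1" already round-trips

# 4 - 0.9 happens to land on the same double as the literal 3.1
(4 - 0.9) == 3.1
#> [1] TRUE

# Cutting at 15 significant digits hides the difference
formatC(1 - 0.95, digits = 15, format = "g")
#> [1] "0.05"
```

Under that reading, the sixth value is written as 3.1 because the short form is already exact for the stored double, whereas 1 - 0.95 is a different double from 0.05 and needs more digits to be told apart.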

On a different note, I noticed another odd behaviour. I created a column in which each row holds the number 0.23472354234923784023 cut to an increasing number of digits (up to 20 digits after the decimal separator).

df2 <- data.frame("numbers" = c(0.2, 0.23, 0.234, 0.2347, 0.23472, 0.234723, 0.2347235, 0.23472354, 0.234723542, 0.2347235423, 0.23472354234, 0.234723542349, 0.2347235423492, 0.23472354234923, 0.234723542349237, 0.2347235423492378, 0.23472354234923784, 0.234723542349237840, 0.2347235423492378402, 0.23472354234923784023))

If you look at how rows 15 to 20 are written, you'll see the following, with an odd difference between row 17 and the subsequent rows:

readr::write_csv(df2, "data2.readr.csv")
# 15 0.234723542349237
# 16 0.2347235423492378
# 17 0.23472354234923784
# 18 0.2347235423492378
# 19 0.2347235423492378
# 20 0.2347235423492378

data.table::fwrite(df2, "data2.fwrite.csv")
# ...
# 15 0.234723542349237
# 16 0.234723542349238
# 17 0.234723542349238
# 18 0.234723542349238
# 19 0.234723542349238
# 20 0.234723542349238
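
One way to investigate the difference between row 17 and rows 18 to 20 is to check whether those literals are stored as the same double in the first place. This is only a diagnostic sketch, not an explanation confirmed in this thread; the names `row17`, `row18`, `same_double` and `hex_bits` are made up for the example.

```r
# The literals used for rows 17 and 18 of df2
row17 <- 0.23472354234923784
row18 <- 0.234723542349237840

# TRUE would mean both rows hold exactly the same double
same_double <- row17 == row18
same_double

# Hexadecimal notation shows the exact stored value of each double
hex_bits <- sprintf("%a", c(row17, row18))
hex_bits
```

If the two values turn out to differ, the different output for row 17 would follow from the stored values rather than from the writer.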

Finally, I read the documentation about readr::write_csv and it says that it's analogous to write.csv with some improvements on performance. But if we get a different outcome, is it actually analogous?

Sorry for the lengthy reply. And thanks for the feedback.

@joranE

joranE commented Sep 15, 2023

First, I think you have mistaken me for an author of this package, which I am not, nor am I even a contributor. I'm sure one of the authors will weigh in eventually.

In general, due to the nuances involved in floating point arithmetic, if you want complete control over the decimal precision of data written out to a file, you will need to use something like format() to enforce your digits requirement first and then write the results to file.
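
A minimal sketch of that approach, assuming 15 significant digits is the precision you want to keep. `round_sig` is a hypothetical helper, not a readr function, and the two-column data frame is a cut-down version of the one in the original report.

```r
# Hypothetical helper: format to `digits` significant digits, then convert
# back to numeric so the column stays a double.
round_sig <- function(x, digits = 15) {
  as.numeric(formatC(x, digits = digits, format = "g"))
}

df <- data.frame(
  first_value = 1 - 0.95,
  third_value = (1 - 0.9) * 10
)

# Apply the helper to every double column
df_out <- df
is_dbl <- vapply(df_out, is.double, logical(1))
df_out[is_dbl] <- lapply(df_out[is_dbl], round_sig)

df_out$first_value == 0.05
#> [1] TRUE

# Write the cleaned data; this step only runs if readr is installed
if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_csv(df_out, file.path(tempdir(), "test_write_csv.csv"))
}
```

Converting back to numeric keeps the columns usable for further arithmetic. If you only care about the text in the file, you could also write the formatC() strings directly.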

Finally, readr::write_csv is a function for writing data to CSV files, just like utils::write.csv. They have similar arguments, and in the vast majority of cases they perform essentially identically. If that doesn't qualify it for the term "analogous" I would suggest perhaps you're being a tad picky.

@cjyetman

cjyetman commented Feb 7, 2024

Also not a maintainer or contributor here, but this is a rather succinct explanation of why the behavior you're seeing has little to do with {readr}...

print(1 - 0.95, digits = 16)
#> [1] 0.05000000000000004
print(0.05, digits = 16)
#> [1] 0.05
1 - 0.95 == 0.05
#> [1] FALSE
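
For the same reason, it is usually safer to compare such values within a tolerance than with ==. A short sketch using base R's all.equal(); the variable names and the 1e-9 threshold are arbitrary choices for illustration.

```r
x <- 1 - 0.95

# Exact comparison fails because the stored doubles differ
exact_match <- x == 0.05
exact_match
#> [1] FALSE

# all.equal() compares within a small numeric tolerance
close_match <- isTRUE(all.equal(x, 0.05))
close_match
#> [1] TRUE

# An explicit tolerance works too
manual_match <- abs(x - 0.05) < 1e-9
manual_match
#> [1] TRUE
```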
