Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel() converts date to number #716

Open
xinyongtian opened this issue Nov 10, 2022 · 3 comments
Open

read_excel() converts date to number #716

xinyongtian opened this issue Nov 10, 2022 · 3 comments

Comments

@xinyongtian
Copy link

xinyongtian commented Nov 10, 2022

test1.xlsx
image

I try to covert test1.xlsx to .csv file by using read_excel(), then write.csv()
The excel file test1.xlsx contains date column (DOB) with some typo. After read, the correct DOB date become number (in character type). The read_excel() should at least keep DOB in original text format. If I use MS excel to save as .csv the date format is preserved.

library(readxl)
#> Warning: package 'readxl' was built under R version 4.1.3
df=read_excel('c:/test1.xlsx')
print(df)
#> # A tibble: 2 x 2
#>   NAME  DOB     
#>   <chr> <chr>   
#> 1 Jack  17995   
#> 2 Tom   12/37/38
write.csv(df,'test1.csv',row.names = F)
print(read.csv('test1.csv'))
#>   NAME      DOB
#> 1 Jack    17995
#> 2  Tom 12/37/38
@matthewjnield
Copy link

matthewjnield commented Jan 8, 2023

I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your Excel file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output.

I created an imitation of your file by typing in the values that I see above by hand. I observed the following, and am making the assumption that these things are true for your file as well:

  • Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and was automatically assigned the Date type
  • Cell B3 could not be interpreted by Excel as anything other than just text, since it does not represent a valid date or numeric value, and so was kept as the default General type

Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not have made it into your data.frame.

Cell B2 is imported as "17995" because that is what is actually stored in the Excel spreadsheet. When Excel interpreted what you typed in as a date, it implicitly converted it to the number 17995, which is the number of days between December 30, 1899 and the date entered (this is Microsoft's convention for storing date data). With the Date type applied, Excel knows to display the value as a date, while still storing the number "under the hood".

This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html

Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R:

library(readxl)
library(dplyr)
library(stringr)
library(lubridate)

# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
    function(x){
        case_when(
            str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x),   # handles values imported as text values in the format "MM/DD/YYYY"
            str_detect(x, "^[0-9]{5}$")                         ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")),  # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
            TRUE                                                ~ NA_Date_  # default case, no applicable date format, returns a missing date value
        )
    }


read_excel("test1.xlsx", col_types = "text") |>  # instruct read_excel() to simply import all columns as text, so that I can make tricky type conversions myself
    mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#>   NAME   DOB      DOB_Converted_to_Date  
#>   <chr>  <chr>    <date>    
#> 1 Jack   17995    1949-04-07
#> 2 Tom    12/37/38 NA        
#> 3 George 02/17/80 1980-02-17

Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format representing a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number.

@xinyongtian
Copy link
Author

xinyongtian commented Jan 8, 2023 via email

@iamio87
Copy link

iamio87 commented Mar 29, 2024

read_excel() converts dates to numbers, because it's not reading the metadata in the Excel file.

read_excel() does properly interpret numbers and texts, but dates are not yet implemented. The comments in the code suggest that the maintainers reasonably believed that Excel used the ISO 8601 format, but it does not.

Excel Spreadsheets store Date and Time information in a numeric format. For example, “1/1/2024 15:30” is stored as the numeric value “45292.645833333336”.
CleanShot 2024-03-28 at 10 28 07@2x

The whole number portion (xxxxxx.0) represents the number of days since the Excel epoch, which begins on December 30, 1899. (They intended it to be January 1, 1900, but they calculated the days wrong in the first version of Excel).

The decimal portion (0.xxxxx) represents the time of day. This applies even when the time of day is the only information in the cell. A cell that only had time information “15:30” would still be stored as “0.645833333336” in the Excel file.

To understand the basic data in an Excel spreadsheet, the parser must consult with 3 separate locations.

  • xl/worksheets/sheet1.xml, sheet2.xml, etc.
  • xl/sharedstrings.xml
  • xl/styles.xml

CleanShot 2024-03-28 at 10 52 31@2x

The worksheet xml files contain all numeric data, and references for text and date information for each worksheet in the Excel file. The text values for all worksheets are stored in sharedstrings.xml. Dates, times, currency, and other numeric-like formatting for all worksheets is stored in the styles.xml.

In order to properly parse dates or formats like currency, read_excel would have to reference the following data structures in the styles.xml:

Each xf::numFmtId corresponds to a different format type. For example:

  • numFmtId="164" is represented as YYYY-MM-DD
  • numFmtId="165" is represented as MM/DD/YY
  • numFmtId="166" is represented as Dollar currency.

An example cell in the worksheet XML might look like this:

45369

The s="1" attribute uses a 0-base index to indicate which cellXfs -> xf node controls the formatting for that cell. Assuming the example stylesheet above, s="1" would indicate the second node in - - would apply. So it would be a date formatted as MM/DD/YY.

Submitting a patch is outside of my scope of expertise, but I did want to provide an explanation for why read_excel() casts Dates to Numbers as a resource for possible future development.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants