read_excel() converts date to number #716

xinyongtian · 2022-11-10T18:22:19Z

test1.xlsx

I try to covert test1.xlsx to .csv file by using read_excel(), then write.csv()
The excel file test1.xlsx contains date column (DOB) with some typo. After read, the correct DOB date become number (in character type). The read_excel() should at least keep DOB in original text format. If I use MS excel to save as .csv the date format is preserved.

library(readxl)
#> Warning: package 'readxl' was built under R version 4.1.3
df=read_excel('c:/test1.xlsx')
print(df)
#> # A tibble: 2 x 2
#>   NAME  DOB     
#>   <chr> <chr>   
#> 1 Jack  17995   
#> 2 Tom   12/37/38
write.csv(df,'test1.csv',row.names = F)
print(read.csv('test1.csv'))
#>   NAME      DOB
#> 1 Jack    17995
#> 2  Tom 12/37/38

matthewjnield · 2023-01-08T14:46:29Z

I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your Excel file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output.

I created an imitation of your file by typing in the values that I see above by hand. I observed the following, and am making the assumption that these things are true for your file as well:

Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and was automatically assigned the Date type
Cell B3 could not be interpreted by Excel as anything other than just text, since it does not represent a valid date or numeric value, and so was kept as the default General type

Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not have made it into your data.frame.

Cell B2 is imported as "17995" because that is what is actually stored in the Excel spreadsheet. When Excel interpreted what you typed in as a date, it implicitly converted it to the number 17995, which is the number of days between December 30, 1899 and the date entered (this is Microsoft's convention for storing date data). With the Date type applied, Excel knows to display the value as a date, while still storing the number "under the hood".

This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html

Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R:

library(readxl)
library(dplyr)
library(stringr)
library(lubridate)

# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
    function(x){
        case_when(
            str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x),   # handles values imported as text values in the format "MM/DD/YYYY"
            str_detect(x, "^[0-9]{5}$")                         ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")),  # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
            TRUE                                                ~ NA_Date_  # default case, no applicable date format, returns a missing date value
        )
    }


read_excel("test1.xlsx", col_types = "text") |>  # instruct read_excel() to simply import all columns as text, so that I can make tricky type conversions myself
    mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#>   NAME   DOB      DOB_Converted_to_Date  
#>   <chr>  <chr>    <date>    
#> 1 Jack   17995    1949-04-07
#> 2 Tom    12/37/38 NA        
#> 3 George 02/17/80 1980-02-17

Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format representing a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number.

xinyongtian · 2023-01-08T16:09:37Z

Thank you for the explanation and walk around. I wonder how MS excel keeps original character format for B2 ("4/7/1949") when saving test1.xslx as .csv file. "4/7/1949" should be stored somewhere in excel. It would be nice for R to retrieve it ("4/7/1949") when R realizes column B is character.

…

On Sun, Jan 8, 2023, 8:46 AM Matt Nield ***@***.***> wrote: I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output. I created an imitation of your file by typing in the values that I see above by hand and observed the following; I am making the assumption that these things are true for your file as well: - Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and was automatically assigned the *Date type* - Cell B3 could not be interpreted by Excel as anything other than just text, since it does not represent a valid date or numeric value, and so was kept as the default *General type* Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not make it into your data.frame. This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R: library(readxl) library(dplyr) library(stringr) library(lubridate) # a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately convert_excel_dates <- function(x){ case_when( str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x), # handles values imported as text values in the format "MM/DD/YYYY" str_detect(x, "^[0-9]{5}$") ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")), # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention) TRUE ~ NA_Date_ # default case, no applicable date format, returns a missing date value ) } read_excel("test1.xlsx", col_types = "text") |> # instruct read_exce() to simply import all columns as text, so that I can make tricky type conversions myself mutate(DOB_Converted_to_Date = convert_excel_dates(DOB)) #> # A tibble: 3 × 3 #> NAME DOB DOB_Converted_to_Date #> <chr> <chr> <date> #> 1 Jack 17995 1949-04-07 #> 2 Tom 12/37/38 NA #> 3 George 02/17/80 1980-02-17 Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format but actually represents a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number. — Reply to this email directly, view it on GitHub <#716 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC32KONZP6AFA36LCQX5RV3WRLHNBANCNFSM6AAAAAAR427YVI> . You are receiving this because you authored the thread.Message ID: ***@***.***>

iamio87 · 2024-03-29T15:47:59Z

read_excel() converts dates to numbers, because it's not reading the metadata in the Excel file.

read_excel() does properly interpret numbers and texts, but dates are not yet implemented. The comments in the code suggest that the maintainers reasonably believed that Excel used the ISO 8601 format, but it does not.

Excel Spreadsheets store Date and Time information in a numeric format. For example, “1/1/2024 15:30” is stored as the numeric value “45292.645833333336”.

The whole number portion (xxxxxx.0) represents the number of days since the Excel epoch, which begins on December 30, 1899. (They intended it to be January 1, 1900, but they calculated the days wrong in the first version of Excel).

The decimal portion (0.xxxxx) represents the time of day. This applies even when the time of day is the only information in the cell. A cell that only had time information “15:30” would still be stored as “0.645833333336” in the Excel file.

To understand the basic data in an Excel spreadsheet, the parser must consult with 3 separate locations.

xl/worksheets/sheet1.xml, sheet2.xml, etc.
xl/sharedstrings.xml
xl/styles.xml

The worksheet xml files contain all numeric data, and references for text and date information for each worksheet in the Excel file. The text values for all worksheets are stored in sharedstrings.xml. Dates, times, currency, and other numeric-like formatting for all worksheets is stored in the styles.xml.

In order to properly parse dates or formats like currency, read_excel would have to reference the following data structures in the styles.xml:

Each xf::numFmtId corresponds to a different format type. For example:

numFmtId="164" is represented as YYYY-MM-DD
numFmtId="165" is represented as MM/DD/YY
numFmtId="166" is represented as Dollar currency.

An example cell in the worksheet XML might look like this:

45369

The s="1" attribute uses a 0-base index to indicate which cellXfs -> xf node controls the formatting for that cell. Assuming the example stylesheet above, s="1" would indicate the second node in - - would apply. So it would be a date formatted as MM/DD/YY.

Submitting a patch is outside of my scope of expertise, but I did want to provide an explanation for why read_excel() casts Dates to Numbers as a resource for possible future development.

stitam mentioned this issue Feb 14, 2024

readxl fails with truncated dates #740

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_excel() converts date to number #716

read_excel() converts date to number #716

xinyongtian commented Nov 10, 2022 •

edited

matthewjnield commented Jan 8, 2023 •

edited

xinyongtian commented Jan 8, 2023 via email

iamio87 commented Mar 29, 2024 •

edited

read_excel() converts date to number #716

read_excel() converts date to number #716

Comments

xinyongtian commented Nov 10, 2022 • edited

matthewjnield commented Jan 8, 2023 • edited

xinyongtian commented Jan 8, 2023 via email

iamio87 commented Mar 29, 2024 • edited

xinyongtian commented Nov 10, 2022 •

edited

matthewjnield commented Jan 8, 2023 •

edited

iamio87 commented Mar 29, 2024 •

edited