This repository has been archived by the owner on May 14, 2018. It is now read-only.

test_na and fix_na for levels as white space #31

Open
jsonbecker opened this issue Sep 8, 2014 · 2 comments


@jsonbecker

One thing I run into a lot is a blank field (most often containing white space) used to mark missing data. This is especially annoying with factors, which then get a level for the blank space.

Currently, white space alone is not one of the NA_aliases (see here).

Should test_na and fix_na be updated to treat white space as missing, or perhaps should there be a new function that tests for empty levels or blank fields and the fix modifies to NA?

I'm happy to contribute an implementation of either.
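As a rough illustration of the second option, here is a minimal sketch of what a blank-field test and fix might look like. The names `is_blank` and `fix_blank` are hypothetical and are not functions in this package; the sketch assumes that "blank" means an empty string or a string made up only of white space.

```r
# Hypothetical helpers for illustration only; not part of this package.

# TRUE where a value is an empty or white-space-only string (NA is not blank).
is_blank <- function(x) {
  !is.na(x) & grepl("^[[:space:]]*$", x)
}

# Replace blank values with NA. For factors, the blank levels are dropped
# so that no level remains for the empty string.
fix_blank <- function(x) {
  if (is.factor(x)) {
    lv <- levels(x)
    # Setting a level to NA turns its entries into NA and removes the level.
    levels(x)[is_blank(lv)] <- NA
    return(x)
  }
  if (is.character(x)) {
    x[is_blank(x)] <- NA_character_
  }
  x  # other types are returned unchanged
}
```

Applied to a data frame, something like `df[] <- lapply(df, fix_blank)` would run it over every column while keeping the data frame structure.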

@karthik
Owner

karthik commented Sep 8, 2014

Good question @jasonpbecker
Can you give me an example of when this happens? By default R should fill in NAs whenever it encounters an empty cell.

x1,x2,x3
4,1,3
5,,
6,3,234

If I read this .csv file into R, it will automatically convert blank fields to NA.

> (x <- read.csv("~/Desktop/temp.csv"))
  x1 x2  x3
1  4  1   3
2  5 NA  NA
3  6  3 234

I would really appreciate an example of the case you describe, where a factor ends up with a level for the blank space.

@jsonbecker
Author

So if you read this file:

foo, bar,,,,2014-09-10, 50.00
baz, bat, ,,2014-09-10, 2014-09-09, 105.00
foo, bat,6103914,,,2014-09-10, 5.00

> read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE)
   V1   V2      V3 V4         V5          V6  V7
1 foo  bar      NA NA             2014-09-10  50
2 baz  bat      NA NA 2014-09-10  2014-09-09 105
3 foo  bat 6103914 NA             2014-09-10   5

Column classes, and the values in V5:

> sapply(read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE), class)
         V1          V2          V3          V4          V5 
"character" "character"   "integer"   "logical" "character" 
         V6          V7 
"character"   "numeric" 
> table(read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE)$V5)

           2014-09-10 
         2          1 

If you don't set stringsAsFactors=FALSE, you get a similar result, but the blank string is now a level in the factor for V5, and likewise for the other character columns.
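For comparison, the same rows can be read with blanks mapped to NA at import time. This is only a sketch of a read-time workaround using base R's `strip.white` and `na.strings` arguments, not a change to test_na or fix_na; the data is inlined through `text =` so the example does not depend on a file on disk.

```r
# The three rows from the example file, inlined so no file is needed.
csv_text <- "foo, bar,,,,2014-09-10, 50.00
baz, bat, ,,2014-09-10, 2014-09-09, 105.00
foo, bat,6103914,,,2014-09-10, 5.00"

# strip.white = TRUE trims leading and trailing blanks from each field;
# na.strings = c("", "NA") then treats the resulting empty strings as missing.
x <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE,
              strip.white = TRUE, na.strings = c("", "NA"))

# Count missing values per column; V5 should no longer hold empty strings.
colSums(is.na(x))
```

This only helps when you control the import step. Data that arrives already loaded, or that uses other white-space placeholders, would still need a check after the fact.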
