Be tolerant to cell formatter problems #969

Open · lujop wants to merge 1 commit into master

Conversation

@lujop (Contributor) commented Aug 13, 2021

Description

After the changes introduced in 74a54aa, where CellFormatters were used to convert types, some Excel files have format options that make POI throw errors, so the entire sheet cannot be parsed.
In those cases it is safer to log the error and discard the value that cannot be correctly interpreted and converted.
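A minimal sketch of that "log and discard" idea, written around POI's DataFormatter (this is an illustration, not the actual diff; the class name, logger, and message wording are made up for the example):

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class TolerantCellReader {

    private static final Logger logger = LoggerFactory.getLogger(TolerantCellReader.class);
    private final DataFormatter formatter = new DataFormatter();

    /**
     * Returns the cell's displayed text, or null (treated as a missing value)
     * when POI cannot handle the cell's format options.
     */
    String readDisplayText(Cell cell) {
        try {
            return formatter.formatCellValue(cell);
        } catch (RuntimeException e) {
            logger.warn("Skipping cell {}: format '{}' could not be applied ({})",
                    cell.getAddress(),
                    cell.getCellStyle().getDataFormatString(),
                    e.getMessage());
            return null;
        }
    }
}
```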

@lwhite1 (Collaborator) commented Aug 18, 2021

Hi @lujop, thanks for the PR.
I've been thinking about this, and I'm not completely sure it makes sense as the default behavior. In my work, I could never throw out rows and continue to use the data, so it strikes me that this might be too risky.

I noticed recently that the US CDC is using Tablesaw to analyze COVID cases, and I would not want them to get bad results because they forgot to check a log :). I think maybe it needs to be an option, with the default behavior being to fail, perhaps with an error message that suggests the "log and continue" option. What do you think?
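To make the suggestion concrete, the option could surface on the Excel read options and fail by default. A rough sketch against the tablesaw-excel API (the ignoreCellFormatErrors flag is hypothetical and does not exist today; the surrounding calls are approximate):

```java
import tech.tablesaw.api.Table;
import tech.tablesaw.io.xlsx.XlsxReadOptions;
import tech.tablesaw.io.xlsx.XlsxReader;

public class ReadWithOption {
    public static void main(String[] args) throws Exception {
        XlsxReadOptions options = XlsxReadOptions.builder("cases.xlsx")
                // hypothetical flag; the default (false) would keep today's fail-fast behavior,
                // and the failure message could point users at it:
                // .ignoreCellFormatErrors(true)
                .build();
        Table table = new XlsxReader().read(options);
        System.out.println(table.structure());
    }
}
```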

@lujop (Contributor, Author) commented Aug 19, 2021

Sure @lwhite1. It makes a lot of sense to have that option, as you said.
I will do that when I have some time and update the PR.

@lwhite1 (Collaborator) commented Aug 19, 2021 via email

@lwhite1 (Collaborator) commented Aug 23, 2021

@lujop The more I think about this option, the more of an edge case it seems to me. Do you analyze a lot of data where it's OK to throw away rows because they come from a spreadsheet with formatting issues? Or is the problem that cell formats are so frequently unreliable that it's hard to get many spreadsheets imported? If that's the case, I wonder whether using cell formats made things less reliable.

@lujop (Contributor, Author) commented Aug 24, 2021

I think it's neither of the two. My use case is quite complex and layered, and it's not that I can throw away all the incorrect data: most incorrect data can't be processed, and some can be, with warnings.
But an important requirement is that, in any case, a descriptive message is provided to the user that helps them see where the error is and correct it.

To give some context, what I do has several layers of error checking/parsing:

  • Some preprocessing is done to the Excel file before using Tablesaw. I tried to do as little as possible in that layer, but some things must be done because otherwise the file is not parseable at all (renaming duplicated columns, giving a name to columns without a header, ...); a rough sketch of that kind of cleanup follows this list.
  • Then the data is parsed in Tablesaw; some logical error checking is done using the Tablesaw API, along with some type conversions.
  • And finally, if everything is OK, several operations are applied to the data and several output files are generated with the converted and processed data.
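The sketch referenced in the first bullet: one way such header cleanup could be done before Tablesaw, using plain POI (the file names and the generated/deduplicated naming scheme are only illustrative, not what my code actually does):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class HeaderCleanup {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = WorkbookFactory.create(new File("input.xlsx"))) {
            Sheet sheet = wb.getSheetAt(0);
            Row header = sheet.getRow(0);
            Set<String> seen = new HashSet<>();
            for (int i = 0; i < header.getLastCellNum(); i++) {
                Cell cell = header.getCell(i);
                // assumes header cells are strings; blank or missing cells get a generated name
                String name = (cell == null || cell.getStringCellValue().trim().isEmpty())
                        ? "column_" + i
                        : cell.getStringCellValue().trim();
                // de-duplicate repeated header names
                while (!seen.add(name)) {
                    name = name + "_dup";
                }
                if (cell == null) {
                    cell = header.createCell(i);
                }
                cell.setCellValue(name);
            }
            try (FileOutputStream out = new FileOutputStream("input-clean.xlsx")) {
                wb.write(out);
            }
        }
    }
}
```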

I recognize that my use case is certainly not the most orthodox one, but I also don't think it's a completely strange one. When people use Excel, I think we can assume that in most cases the data has been filled in by humans and is therefore not perfect. I can assure you that they can be very creative in doing small things that completely break the parsing. In particular, before the type detection logic for Excel was improved, it was a complete nightmare because a lot of Excel files couldn't be parsed at all.
And I think giving the user a more or less clear indication is better than saying the data is bad with some internal low-level message that they can't understand.
In other formats like CSV, which are usually generated by machines, I think we can expect perfect data, but expecting that from Excel is not realistic.
And although I understand that a lot of error checking and validation is outside the scope of Tablesaw and needs to be done beforehand, I think having some tools for it is good, because it's much better to do that checking on a parsed Tablesaw table than to do it all beforehand.

Returning to your original question about cell formatters: errors are not very common. I checked the last day's logs and there weren't any, but I have run into some cases that I can't exactly remember now. If I remember correctly, it was something with a date format that was not supported by POI.

I really think that cell formatters and the current logic are a lot better than what we had before. Even in the worst cases they don't reduce parseability, because they are only used when some conversion must be done, and before this change those cases always failed. Excel doesn't let you read integer cells as strings, or string cells as numbers; you must do the conversions yourself, and I think the most robust way to do them is with cell formatters. For example, if you want to read a number as a string (say, some internal code that is sometimes numeric and sometimes not), the string the user wants is surely the one they see in Excel, and for that the formatters are needed.
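To make that last point concrete, here is a small, self-contained example of POI's DataFormatter returning the text a user would see in Excel rather than the stored numeric value (the "0000" format string is just an illustration):

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class FormatterDemo {
    public static void main(String[] args) throws Exception {
        try (XSSFWorkbook wb = new XSSFWorkbook()) {
            Sheet sheet = wb.createSheet();
            Row row = sheet.createRow(0);
            Cell cell = row.createCell(0);
            cell.setCellValue(7);                 // stored as the numeric value 7.0

            CellStyle style = wb.createCellStyle();
            style.setDataFormat(wb.createDataFormat().getFormat("0000"));
            cell.setCellStyle(style);             // displayed as "0007" in Excel

            DataFormatter formatter = new DataFormatter();
            // getNumericCellValue() would return 7.0; the formatter returns the displayed text
            System.out.println(formatter.formatCellValue(cell));  // prints 0007
        }
    }
}
```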

@lwhite1 (Collaborator) commented Aug 28, 2021

> I think it's neither of the two. My use case is quite complex and layered, and it's not that I can throw away all the incorrect data: most incorrect data can't be processed, and some can be, with warnings.
> But an important requirement is that, in any case, a descriptive message is provided to the user that helps them see where the error is and correct it.
>
> To give some context, what I do has several layers of error checking/parsing:
>
> • Some preprocessing is done to the Excel file before using Tablesaw. I tried to do as little as possible in that layer, but some things must be done because otherwise the file is not parseable at all (renaming duplicated columns, giving a name to columns without a header, ...)

FWIW, I think it should be possible to read with duplicate column names. There is an option called .allowDuplicateColumns() which may not work, or which maybe works but was not added to the Excel importer. That is something I think should be fixed.

Similarly, with CSVs, I believe the system adds names for unnamed columns (C1 to Cn, I think). Again, I think this is worth doing for Excel if it's possible using POI.
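For comparison, this is roughly how the CSV path handles a file without a header row today; the exact default names, and whether the .allowDuplicateColumns() option mentioned above exists under that name, should be checked against the version in use:

```java
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

public class CsvNoHeaderDemo {
    public static void main(String[] args) {
        CsvReadOptions options = CsvReadOptions.builder("people.csv")
                .header(false)   // no header row: Tablesaw generates default column names (C1 ... Cn style)
                .build();
        Table t = Table.read().csv(options);
        System.out.println(t.columnNames());
    }
}
```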

> • Then the data is parsed in Tablesaw; some logical error checking is done using the Tablesaw API, along with some type conversions.
> • And finally, if everything is OK, several operations are applied to the data and several output files are generated with the converted and processed data.
>
> I recognize that my use case is certainly not the most orthodox one, but I also don't think it's a completely strange one. When people use Excel, I think we can assume that in most cases the data has been filled in by humans and is therefore not perfect. I can assure you that they can be very creative in doing small things that completely break the parsing. In particular, before the type detection logic for Excel was improved, it was a complete nightmare because a lot of Excel files couldn't be parsed at all.

This is a problem with the way people create Excel files. It's not something Tablesaw can fix.

> And I think giving the user a more or less clear indication is better than saying the data is bad with some internal low-level message that they can't understand.

I certainly agree, but I think the preferred solution is to fix the error message rather than log and proceed. Is that not possible?

> In other formats like CSV, which are usually generated by machines, I think we can expect perfect data, but expecting that from Excel is not realistic.

I think many CSVs are created by people converting spreadsheets; I know I do that every time rather than use the Excel importer. The problem here is that Excel gives the user the ability to apply a cell format, but the method for doing so (selecting a range and changing it manually) is very error-prone. It's easy to add a row after you apply the format, for example.

The problem, more generally, is that people are sloppy (or overly fancy) in their spreadsheets in infinite ways. One case I had to fix manually yesterday was that they had written comments at the bottom of the spreadsheet. This (and adding calculations at the bottom or the side) is pretty common. So is using cell merging in a way that causes problems. Even Excel cannot reliably export a spreadsheet with merged cells to CSV.

I think also that this creates new inconsistencies with the other import methods, which is undesirable. Should all import methods log and continue on errors? Should it be all errors instead of just this one kind?

This is also an argument against the approach of specifying a range of cells to import.

> And although I understand that a lot of error checking and validation is outside the scope of Tablesaw and needs to be done beforehand, I think having some tools for it is good, because it's much better to do that checking on a parsed Tablesaw table than to do it all beforehand.

I agree with the concept, but I disagree with the approach of continually broadening the constraints on what is imported.

> Returning to your original question about cell formatters: errors are not very common. I checked the last day's logs and there weren't any, but I have run into some cases that I can't exactly remember now. If I remember correctly, it was something with a date format that was not supported by POI.

> I really think that cell formatters and the current logic are a lot better than what we had before. Even in the worst cases they don't reduce parseability, because they are only used when some conversion must be done, and before this change those cases always failed. Excel doesn't let you read integer cells as strings, or string cells as numbers; you must do the conversions yourself, and I think the most robust way to do them is with cell formatters. For example, if you want to read a number as a string (say, some internal code that is sometimes numeric and sometimes not), the string the user wants is surely the one they see in Excel, and for that the formatters are needed.

For my education, what happens if I create a spreadsheet of two columns, say country and population, and I don't apply any cell formatting? Does this method work?
