Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Data Import): custom csv delimiters, UTF-8 BOM handling #26183

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

vmatt
Copy link

@vmatt vmatt commented Apr 27, 2024

Hello,
Closes #26182
Closes #22151
The current Frappe Data Import only supports standard CSV files, leading to issues when importing data from other *sv file formats (e.g., TSV, semicolon-separated values).

image

This pull request introduces the following enhancements:

  1. Custom delimiters toggle: When its enabled, Delimiter options field is shown, where the user to extend the predefined delimiters, and extend with their own as well
  2. Auto-detection of delimiters: The code now utilizes the csv.Sniffer to automatically detect delimiters like tabs (\t), semicolons (;), and commas (,) in the uploaded file.
  3. Manual delimiter specification: A new field delimiter_options is added to the Data Import doctype, allowing users to manually specify the delimiter used in their file, ensuring successful import even when auto-detection fails. (default value is ",;\t|" - most common delimiter types)
  4. Improved error handling: Enhanced error messages provide more informative feedback to users when issues arise during the import process.
  5. +1 fix for UTF8 with BOM: If file is regular utf-8, it can be still be parsed with utf-8-sig encoding, however, there reverse is not true. In windows environments, Excel adds 3 bytes to the beginning of the file. If you try decode utf-8-sig encoded file with utf-8 encoding, the intial 3 bytes (\ufeff) will be not skipped, causing the column parsing to fail.
    These changes expand the capabilities of the Data Import Doctype, making it more flexible and user-friendly for a broader range of file formats and business contexts.
    1. Why is this change not in a separate pull request?
      Without fixing the UTF8 with BOM parsing, if i'm loading my semicolon separated, utf-8 BOM file, it will still fail to recognize the first column, as Frappe fails to strip the doublequotes properly, because the first row will start as \ufeff"col1", instead "col1":
image Introduction encoding list constant (order matters!): `from frappe.core.doctype.file.file import FILE_ENCODING_OPTIONS` Generate error message based on FILE_ENCODING_OPTIONS. image

Tested and validated, added two new test cases in the test_importer.py file.

docs: https://docs.erpnext.com/docs/user/manual/en/data-import

I look forward to your feedback and comments!

@vmatt vmatt requested a review from a team as a code owner April 27, 2024 17:19
@vmatt vmatt requested review from akhilnarang and removed request for a team April 27, 2024 17:19
@vmatt
Copy link
Author

vmatt commented Apr 27, 2024

Example files for the UTF8-BOM fix:
utf8.csv
utf8bom.csv

You can compare the two files using the xxd tool:

$ xxd utf8.csv
00000000: 2263 6f6c 3122 3b22 636f 6c32 220d 0a31  "col1";"col2"..1
00000010: 3b32                                     ;2

Note the first 3 bytes efbb bf

$ xxd utf8bom.csv
00000000: efbb bf22 636f 6c31 223b 2263 6f6c 3222  ..."col1";"col2"
00000010: 0d0a 313b 32                             ..1;2

@vmatt vmatt changed the title feat(Data Import): custom csv delimiters feat(Data Import): custom csv delimiters, UTF-8 BOM handling Apr 27, 2024
@vmatt vmatt force-pushed the data_import_delimiter branch 2 times, most recently from 23e1336 to e5f2878 Compare April 29, 2024 08:22
@vmatt
Copy link
Author

vmatt commented Apr 29, 2024

@akhilnarang, fixed commit messages

@vmatt
Copy link
Author

vmatt commented Apr 29, 2024

@akhilnarang, Docs linked fixed.
Semgrep issues one is fixed, the other is kind of a chicken-egg situation.

  1. frappe-manual-commit will pass, it was only raised in the test_importer.py
    1. I had to define a new method to test the semicolon file get_importer_semicolon()
    2. it's doing the same, as get_importer() (# Commit so that the first import failure does not rollback the Data Import insert.)
  2. frappe-modifying-but-not-comitting-other-method
    1. These rules are fired for code parts I did not modify in File.py,
    2. It ask me to check "if changes to ... are commited to database". Based on that, these values should be commited manually after setting? But then, if let's says I'd modify these to get manually commited to the database, then the commit command would be pick up by the linter and would gave us frappe-manual-commit error. Hence the chicken-egg situation. How should I resolve this?

@akhilnarang
Copy link
Member

@vmatt it's fine, semgrep picks up unchanged parts in modified files as well. After merge it won't bother again until the file is edited again, since it's intentional in some places.

@vmatt
Copy link
Author

vmatt commented Apr 29, 2024

@akhilnarang I just realised that i forgot to push the commit where I added # nosemgrep for commit in the test_importer.py.

Linter will still fail on frappe-modifying-but-not-comitting-other-method, not sure if it will go through the rest of the steps, if that step still failing.

But beside this, I'd say it's complete now, and safe to merge, but let me know, if I have to make any adjustments.

@akhilnarang
Copy link
Member

But beside this, I'd say it's complete now, and safe to merge, but let me know, if I have to make any adjustments.

Great, will test it and let you know.

@vmatt
Copy link
Author

vmatt commented Apr 30, 2024

Hey @akhilnarang, sorry, I messed up my branches late yesterday and accidentally combined two different pull request changes together (#26200). All fixed now, won't happen again. Please trigger the tests now again.

@vmatt
Copy link
Author

vmatt commented Apr 30, 2024

@akhilnarang, Docs linked fixed. Semgrep issues one is fixed, the other is kind of a chicken-egg situation.
2. frappe-modifying-but-not-comitting-other-method

  1. These rules are fired for code parts I did not modify in File.py,
  2. It ask me to check "if changes to ... are commited to database". Based on that, these values should be commited manually after setting? But then, if let's says I'd modify these to get manually commited to the database, then the commit command would be pick up by the linter and would gave us frappe-manual-commit error. Hence the chicken-egg situation. How should I resolve this?

@vmatt it's fine, semgrep picks up unchanged parts in modified files as well. After merge it won't bother again until the file is edited again, since it's intentional in some places.

Should I add semgrep exceptions for File.py to ignore these errors, so it would pass the last test, as these were already part of the code?

Copy link
Member

@akhilnarang akhilnarang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor things

frappe/core/doctype/data_import/data_import.py Outdated Show resolved Hide resolved
frappe/core/doctype/data_import_log/data_import_log.json Outdated Show resolved Hide resolved
frappe/utils/csvutils.py Outdated Show resolved Hide resolved
frappe/utils/csvutils.py Outdated Show resolved Hide resolved
vmatt and others added 4 commits May 7, 2024 21:20
Co-authored-by: Akhil Narang <me@akhilnarang.dev>
Avoid semgrep issue with translated string

Signed-off-by: Akhil Narang <me@akhilnarang.dev>
Signed-off-by: Akhil Narang <me@akhilnarang.dev>
frappe/core/doctype/data_import/test_importer.py Outdated Show resolved Hide resolved
frappe/core/doctype/data_import/test_importer.py Outdated Show resolved Hide resolved
frappe/core/doctype/data_import/test_importer.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants