feat(Data Import): custom csv delimiters, UTF-8 BOM handling #26183

vmatt · 2024-04-27T17:19:49Z

Hello,
Closes #26182
Closes #22151
The current Frappe Data Import only supports standard CSV files, leading to issues when importing data from other *sv file formats (e.g., TSV, semicolon-separated values).

This pull request introduces the following enhancements:

Custom delimiters toggle: When it is enabled, a Delimiter Options field is shown, where the user can adjust the predefined delimiters and add their own.
Auto-detection of delimiters: The code now uses csv.Sniffer to automatically detect delimiters such as tabs (\t), semicolons (;), and commas (,) in the uploaded file.
Manual delimiter specification: A new field, delimiter_options, is added to the Data Import doctype. It lets users specify the delimiters to consider for their file, so the import can succeed even when auto-detection fails. The default value is ",;\t|", which covers the most common delimiters.
Improved error handling: Error messages give more informative feedback when issues arise during the import process.
+1 fix for UTF-8 with BOM: A regular UTF-8 file can still be parsed with the utf-8-sig encoding, but the reverse is not true. On Windows, Excel adds 3 bytes to the beginning of the file. If you decode a utf-8-sig encoded file with the utf-8 encoding, the initial 3 bytes (EF BB BF, decoded as \ufeff) are not skipped, which causes column parsing to fail.
These changes expand the capabilities of the Data Import Doctype, making it more flexible and user-friendly for a broader range of file formats and business contexts.
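
The auto-detection described above can be sketched roughly as follows. This is a minimal illustration built on the standard library's csv.Sniffer, not the code from this PR: `detect_delimiter` and `DEFAULT_DELIMITER_OPTIONS` are hypothetical names, and falling back to a comma when sniffing fails is an assumption.

```python
import csv

# Assumed default, taken from the description above: comma, semicolon, tab, pipe.
DEFAULT_DELIMITER_OPTIONS = ",;\t|"


def detect_delimiter(sample: str, delimiter_options: str = DEFAULT_DELIMITER_OPTIONS) -> str:
    """Guess the delimiter of `sample`, considering only `delimiter_options`.

    Falls back to a comma when csv.Sniffer cannot decide (assumed behaviour).
    """
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=delimiter_options or ",")
        return dialect.delimiter
    except csv.Error:
        # Sniffer raises csv.Error when no candidate delimiter fits the sample.
        return ","


semicolon_sample = '"col1";"col2"\r\n1;2\r\n'
tab_sample = "col1\tcol2\r\n1\t2\r\n"
print(repr(detect_delimiter(semicolon_sample)))
print(repr(detect_delimiter(tab_sample)))
```

Restricting the sniffer to a user-editable set of candidates keeps detection predictable: a character outside `delimiter_options` is never chosen, even if it happens to appear consistently in the data.
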
1. Why is this change not in a separate pull request?
  Without the UTF-8 BOM fix, loading my semicolon-separated UTF-8 BOM file still fails to recognize the first column. Frappe cannot strip the double quotes properly, because the first row starts with \ufeff"col1" instead of "col1".
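
The quoting problem can be reproduced with the standard library alone. The snippet below is a small illustration, not Frappe's importer code; `read_header` is a hypothetical helper and the sample bytes mirror the two-column semicolon file discussed in this PR.

```python
import csv
import io

# Bytes of a semicolon-separated file saved as "UTF-8 with BOM".
raw = b'\xef\xbb\xbf"col1";"col2"\r\n1;2'


def read_header(data: bytes, encoding: str) -> list[str]:
    """Decode `data` with `encoding` and return the first CSV row."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return next(reader)


# Decoded as utf-8, the BOM survives as "\ufeff" in front of the opening quote,
# so the csv module no longer treats the first field as a quoted field.
header_utf8 = read_header(raw, "utf-8")

# Decoded as utf-8-sig, the BOM is skipped and the quotes are stripped as usual.
header_sig = read_header(raw, "utf-8-sig")

print(header_utf8)
print(header_sig)
```

Because a quote character only opens a quoted field when it is the first character of that field, the leading \ufeff turns the whole first header into a literal string that still contains its quotes.
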

Introduced an encoding list constant (order matters!): `from frappe.core.doctype.file.file import FILE_ENCODING_OPTIONS`. The error message is generated from FILE_ENCODING_OPTIONS.
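
To illustrate why the order of such a list matters, here is a rough sketch of an ordered decode loop. The tuple below is a stand-in with assumed values, not necessarily the contents of the real constant in frappe/core/doctype/file/file.py, and `decode_with_fallback` is a hypothetical helper.

```python
# Stand-in for the constant named above; the actual values may differ.
FILE_ENCODING_OPTIONS = ("utf-8-sig", "utf-8", "windows-1250", "windows-1252")


def decode_with_fallback(data: bytes, encodings=FILE_ENCODING_OPTIONS) -> str:
    """Return `data` decoded with the first encoding in `encodings` that works."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Build the error message from the same list that was tried.
    raise ValueError("Unable to decode file. Tried encodings: " + ", ".join(encodings))


bom_bytes = b"\xef\xbb\xbfcol1"
print(repr(decode_with_fallback(bom_bytes)))
# With utf-8 first, decoding also succeeds, but the BOM character is kept.
print(repr(decode_with_fallback(bom_bytes, ("utf-8", "utf-8-sig"))))
```

Putting utf-8-sig ahead of utf-8 is safe because utf-8-sig also decodes files without a BOM, while utf-8 silently keeps the BOM as a leading character.
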

Tested and validated; added two new test cases to test_importer.py.

docs: https://docs.erpnext.com/docs/user/manual/en/data-import

I look forward to your feedback and comments!

vmatt · 2024-04-27T21:28:27Z

Example files for the UTF8-BOM fix:
utf8.csv
utf8bom.csv

You can compare the two files using the xxd tool:

$ xxd utf8.csv
00000000: 2263 6f6c 3122 3b22 636f 6c32 220d 0a31  "col1";"col2"..1
00000010: 3b32                                     ;2

Note the first 3 bytes (efbb bf) of the BOM file:

$ xxd utf8bom.csv
00000000: efbb bf22 636f 6c31 223b 2263 6f6c 3222  ..."col1";"col2"
00000010: 0d0a 313b 32                             ..1;2
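
The same comparison can be made in Python. This is only a sketch: the byte strings are rebuilt from the hex dumps above, and `has_utf8_bom` is a hypothetical helper.

```python
import codecs

# Rebuilt from the xxd output above (bytes.fromhex ignores the spaces).
utf8_plain = bytes.fromhex("2263 6f6c 3122 3b22 636f 6c32 220d 0a31 3b32")
utf8_bom = bytes.fromhex("efbb bf22 636f 6c31 223b 2263 6f6c 3222 0d0a 313b 32")


def has_utf8_bom(data: bytes) -> bool:
    """Return True if `data` starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(utf8_plain))
print(has_utf8_bom(utf8_bom))
# Apart from the 3-byte prefix, the two files are identical.
print(utf8_bom[len(codecs.BOM_UTF8):] == utf8_plain)
```
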

vmatt · 2024-04-29T08:23:47Z

@akhilnarang, fixed commit messages

vmatt · 2024-04-29T09:02:20Z

@akhilnarang, docs link fixed.
One of the Semgrep issues is fixed; the other is a chicken-and-egg situation.

frappe-manual-commit will pass; it was only raised in test_importer.py.
1. I had to define a new method, get_importer_semicolon(), to test the semicolon file.
2. It does the same as get_importer() (# Commit so that the first import failure does not rollback the Data Import insert.)
frappe-modifying-but-not-comitting-other-method
1. These rules fire for code in File.py that I did not modify.
2. It asks me to check "if changes to ... are commited to database". Does that mean these values should be committed manually after being set? But if I changed the code to commit them manually, the linter would pick up the commit call and raise a frappe-manual-commit error. Hence the chicken-and-egg situation. How should I resolve this?

akhilnarang · 2024-04-29T09:04:28Z

@vmatt it's fine, semgrep picks up unchanged parts in modified files as well. After merge it won't bother again until the file is edited again, since it's intentional in some places.

vmatt · 2024-04-29T10:29:53Z

@akhilnarang I just realised that I forgot to push the commit where I added # nosemgrep for the commit in test_importer.py.

The linter will still fail on frappe-modifying-but-not-comitting-other-method; I'm not sure whether it will run the remaining steps while that step is still failing.

Apart from this, I'd say it's complete now and safe to merge, but let me know if I need to make any adjustments.

akhilnarang · 2024-04-29T10:44:21Z

> But beside this, I'd say it's complete now, and safe to merge, but let me know, if I have to make any adjustments.

Great, will test it and let you know.

vmatt · 2024-04-30T11:42:13Z

Hey @akhilnarang, sorry, I messed up my branches late yesterday and accidentally combined the changes of two different pull requests (#26200). All fixed now, it won't happen again. Please trigger the tests again.

vmatt · 2024-04-30T12:08:44Z

> @akhilnarang, Docs linked fixed. Semgrep issues one is fixed, the other is kind of a chicken-egg situation.
> 2. frappe-modifying-but-not-comitting-other-method
>
> These rules are fired for code parts I did not modify in File.py,
>
> It ask me to check "if changes to ... are commited to database". Based on that, these values should be commited manually after setting? But then, if let's says I'd modify these to get manually commited to the database, then the commit command would be pick up by the linter and would gave us frappe-manual-commit error. Hence the chicken-egg situation. How should I resolve this?

> @vmatt it's fine, semgrep picks up unchanged parts in modified files as well. After merge it won't bother again until the file is edited again, since it's intentional in some places.

Should I add Semgrep exceptions to File.py to ignore these errors, so that the last check passes? These findings were already part of the code before my changes.

akhilnarang

Some minor things

frappe/core/doctype/data_import/data_import.json

frappe/core/doctype/data_import/data_import.py

frappe/core/doctype/data_import_log/data_import_log.json

frappe/utils/csvutils.py

frappe/core/doctype/data_import/test_importer.py

vmatt requested a review from a team as a code owner April 27, 2024 17:19

vmatt requested review from akhilnarang and removed request for a team April 27, 2024 17:19

vmatt mentioned this pull request Apr 27, 2024

Add utf-8-sig when reading csv for import or do something else to remove BOM character #22151

Open

vmatt changed the title ~~feat(Data Import): custom csv delimiters~~ feat(Data Import): custom csv delimiters, UTF-8 BOM handling Apr 27, 2024

vmatt force-pushed the data_import_delimiter branch 2 times, most recently from 23e1336 to e5f2878 April 29, 2024 08:22

feat(Data Import): custom delimiters

45eabd3

vmatt force-pushed the data_import_delimiter branch from 98bd595 to 45eabd3 April 30, 2024 11:40

Merge branch 'develop' into data_import_delimiter

c662379

akhilnarang requested changes May 3, 2024

vmatt and others added 4 commits May 7, 2024 21:20

fix: fallback to ',' when delimiter_options not defined

d3cfa26

Co-authored-by: Akhil Narang <me@akhilnarang.dev>

feat: CSV import introduce FILE_ENCODING_OPTIONS constant in file.py,…

e69093c

… cleanup

chore(csvutils): update messages

123844b

Avoid semgrep issue with translated string Signed-off-by: Akhil Narang <me@akhilnarang.dev>

chore(doctype/file): update comments

1cebdb2

Signed-off-by: Akhil Narang <me@akhilnarang.dev>

akhilnarang requested changes May 8, 2024

frappe/core/doctype/data_import/test_importer.py (3 outdated review comments)

fix: comments

0f4e916

vmatt force-pushed the data_import_delimiter branch from 4ca4372 to 0f4e916 May 9, 2024 22:16

vmatt requested a review from akhilnarang May 9, 2024 22:17
