Populate metadata should support other encodings than utf-8 #323

Open
dominikl opened this issue Mar 18, 2022 · 4 comments

Comments

@dominikl
Member

populate_metadata.py assumes that the CSV files are encoded with utf-8. It fails if that's not the case. Maybe there should be an option to specify the encoding.

See https://forum.image.sc/t/populate-metadata-py-and-non-utf-8-csvs/64595
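
The failure mode described above can be reproduced outside OMERO with a few lines of plain Python. This is only an illustrative sketch: the sample bytes are made up, and it does not call any omero-scripts code.

```python
# Hypothetical sample: one CSV line as it would be stored in a latin-1 file.
raw = "name,städte\n".encode("latin-1")

# Decoding those bytes as utf-8 fails, because the latin-1 byte for "ä" (0xE4)
# is not followed by valid utf-8 continuation bytes.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# With the matching encoding, the same bytes decode cleanly.
text = raw.decode("latin-1")
print(failed, repr(text))
```

An option to pass the encoding would let the script take the second path directly instead of guessing.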

@dominikl
Member Author

After a quick glance, I can't really see how populate_metadata.py is affected, as the respective code which is suspected to cause the issue is in populate_roi.py.

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/populate-metadata-py-and-non-utf-8-csvs/64595/2

@will-moore
Member

@dominikl It's maybe not the most natural place for the code to live, but Populate_Metadata.py does import it:

https://github.com/ome/omero-scripts/blob/14c830099efe1f0d6b32a2b3914febd8ddcd89ea/omero/import_scripts/Populate_Metadata.py#L33

@JulianHn

@dominikl: Thanks for opening the issue again. Here is the diff of my modified populate_metadata.py, which defines its own FileProvider class to override the behaviour imported from omero.util.populate_roi.
At the moment it falls back to a fixed latin-1 encoding after it detects a UnicodeDecodeError, but it could of course easily be modified to use an arbitrary encoding provided by the user. Furthermore, the logic for truncating the tempfile had to be adjusted, since the size of the new file is no longer the same as that of the original file when the encoding is not utf-8.

I'm not sure whether it makes sense to redefine this within the metadata script, or whether it would be better to make this option available directly within omero.util.populate_roi.DownloadingOriginalFileProvider.

5,6d4
< 
< 
15d12
< 
19d15
< 
21d16
< 
30c25,26
< 
---
> import tempfile
> from past.utils import old_div
55d50
< 
60a56,84
> class OwnFileProvider(DownloadingOriginalFileProvider):
>     
>     def get_original_file_data(self, original_file):
>         """
>         Downloads an original file to a temporary file and returns an open
>         file handle to that temporary file seeked to zero. The caller is
>         responsible for closing the temporary file.
>         """
> 
>         
>         self.raw_file_store.setFileId(original_file.id.val)
>         temporary_file = tempfile.NamedTemporaryFile(mode='rt+',
>                                                      dir=str(self.dir),
>                                                      encoding="utf-8-sig")
>         size = original_file.size.val
>         size_new = 0
>         for i in range((old_div(size, self.BUFFER_SIZE)) + 1):
>             index = i * self.BUFFER_SIZE
>             data = self.raw_file_store.read(index, self.BUFFER_SIZE)
>             try:
>                 data_write = data.decode("utf-8").rstrip('\0')
>                 size_new += len(data_write.encode("utf-8-sig"))
>             except UnicodeDecodeError:
>                 data_write = data.decode("latin-1").rstrip('\0')
>                 size_new += len(data_write.encode("utf-8-sig"))
>             temporary_file.write(data_write)            
>         temporary_file.seek(0)
>         temporary_file.truncate(size_new)
>         return temporary_file
122c146
<     provider = DownloadingOriginalFileProvider(conn)
---
>     provider = OwnFileProvider(conn)
198a223,224
> 
> 
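
As a rough sketch of the "arbitrary encoding provided by the user" variant mentioned above, the chunked copy could use an incremental decoder instead of decoding each buffer on its own. Decoding fixed-size chunks separately can fail when a multi-byte character straddles a chunk boundary, and a per-chunk fallback can mix encodings within one file. Everything here is an assumption for illustration: `copy_decoded`, its parameters, and the `read_chunk` callable (standing in for `raw_file_store.read`) are hypothetical names, not part of omero-py or omero-scripts.

```python
import codecs
import tempfile


def copy_decoded(read_chunk, size, buffer_size, encoding="utf-8", directory=None):
    """Hypothetical helper: copy `size` bytes, fetched in chunks through
    `read_chunk(offset, length)`, into a utf-8 text temp file, decoding
    the source bytes with the caller-supplied `encoding`."""
    # The incremental decoder buffers incomplete multi-byte sequences
    # between calls, so chunk boundaries do not matter.
    decoder = codecs.getincrementaldecoder(encoding)()
    temporary_file = tempfile.NamedTemporaryFile(
        mode="w+", dir=directory, encoding="utf-8", newline="")
    for offset in range(0, size, buffer_size):
        # Request only the bytes that remain, so no padding is read.
        data = read_chunk(offset, min(buffer_size, size - offset))
        temporary_file.write(decoder.decode(data))
    # Flush the decoder; this raises if the input ended mid-character.
    temporary_file.write(decoder.decode(b"", final=True))
    temporary_file.seek(0)
    return temporary_file


# Usage with an in-memory stand-in for the raw file store (made-up data).
payload = "id,name\n1,Zürich\n2,Köln\n".encode("latin-1")
with copy_decoded(lambda off, n: payload[off:off + n],
                  len(payload), 4, encoding="latin-1") as handle:
    print(handle.read())
```

Because only decoded text is written, this version needs no `truncate` call and no separate size bookkeeping. Whether such a helper belongs in the script or in `DownloadingOriginalFileProvider` is the open question raised above.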
