Huggingface datasets integration? #55

Open
ChenchaoZhao opened this issue Jan 22, 2023 · 5 comments

@ChenchaoZhao

Any plans for Huggingface datasets integration?

Instead of using a pickled dictionary, it is probably better practice to use the Arrow or Parquet format. It should be fairly easy to convert the data to the Hugging Face format.

@jonathanking
Owner

Hi @ChenchaoZhao , thanks for your interest!

I have not considered either of those. I have found that the pickle format works well enough for my needs. Is there something in particular that makes using this format difficult? Also, if you are interested in contributing by converting the datasets to other formats, I would be happy to host them!

@ChenchaoZhao
Author

Hi @jonathanking thank you for the comment!

Pickle is not considered secure in production, because loading a pickle file from an untrusted source can execute arbitrary code. How should I contribute the Parquet files if I generate them?
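
For context on the security point, here is a minimal sketch of why loading an untrusted pickle is risky. It is not taken from SidechainNet: the `Payload` class is a hypothetical example, and `eval` on a harmless arithmetic string stands in for whatever an attacker might actually run.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form is a function call, not data."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of the object's state.
        # A real attacker could substitute any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Simply loading the bytes runs the call chosen by whoever wrote the file.
result = pickle.loads(blob)
print(result)  # 42
```

Parquet and Arrow files describe columns of data rather than calls to execute, so reading them does not have this failure mode.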

@jonathanking
Owner

I was thinking about how to proceed, and here are my thoughts.

I'm going to release an updated version of SidechainNet in a little while. I think we can wait on creating Parquet files until then. However, if you are really interested in contributing, you could perhaps write a function, or describe how you might convert the current format (a dictionary with keys and values of various types) into one compatible with Parquet. Then we could use that code or general idea when we move forward and release the next version of the code and data.

I'm just not familiar with the format myself, so I'd have to investigate how to reformat the existing data. I see something about formatting it into a DataFrame and then writing a parquet file, so maybe it's not so complicated. It would just need to be able to handle the different kinds of data stored in the dictionary currently (arrays, lists, strings). Let me know what you think!
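
As a rough sketch of the kind of conversion described here, assuming each data split is a dictionary that maps a key to one value per protein: the mixed arrays, lists, and strings can first be normalized into equal-length columns of plain Python values. The key names and toy values are illustrative only, not the actual SidechainNet schema, and `to_plain` and `to_columns` are hypothetical helpers. `array.array` stands in for NumPy arrays so the example needs only the standard library.

```python
from array import array


def to_plain(value):
    """Recursively turn array-likes into nested Python lists."""
    if hasattr(value, "tolist"):
        # numpy.ndarray and array.array both provide tolist().
        return value.tolist()
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    # Strings, numbers, and None pass through unchanged.
    return value


def to_columns(split):
    """Convert {key: per-protein values} into equal-length column lists."""
    columns = {key: [to_plain(v) for v in values] for key, values in split.items()}
    lengths = {len(column) for column in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return columns


# Toy stand-in for one split of the pickled dictionary (assumed layout).
toy_split = {
    "ids": ["protein_a", "protein_b"],
    "seq": ["MKV", "GA"],
    "ang": [array("d", [0.1, 0.2, 0.3]), array("d", [0.4, 0.5])],
}

columns = to_columns(toy_split)
print(columns["ang"])  # [[0.1, 0.2, 0.3], [0.4, 0.5]]
```

A dictionary of columns like this is the shape that DataFrame and Parquet writers generally expect, with variable-length entries stored as nested lists.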

@ChenchaoZhao
Author

Will there be additional features in the next release? Based on my understanding, the current version can probably be converted using the `from_dict` method of Hugging Face `datasets.Dataset`; see https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.from_dict

Then you can upload the dataset to the Hugging Face Hub for more visibility, or save it in Parquet format (usually the more compact option) or Arrow format. Both support nested fields.
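
A minimal sketch of that route, assuming the data has already been arranged as a dictionary of equal-length columns. `save_as_parquet` is a hypothetical helper name; `Dataset.from_dict` and `Dataset.to_parquet` are the library calls, and the `datasets` package must be installed for the function to run.

```python
def save_as_parquet(columns, path):
    """Build a Hugging Face Dataset from column lists and write it as Parquet.

    `columns` is assumed to map each column name to a list with one entry
    per example; entries may themselves be nested lists.
    """
    # Imported lazily so the dependency is only needed when the function is called.
    from datasets import Dataset

    dataset = Dataset.from_dict(columns)
    dataset.to_parquet(path)
    return dataset
```

For example, `save_as_parquet({"ids": ["a", "b"], "ang": [[0.1, 0.2], [0.3]]}, "toy.parquet")` would write a two-row file with a nested list column. Uploading instead would use `dataset.push_to_hub(...)`, which requires a Hub account and authentication.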

@jonathanking
Owner

Yes, I have a handful of features and data standardizations/improvements that I’ve been working on in my research branches and plan to add to the next release.

Thanks so much for pointing out that function! I didn’t think it would be that easy, but that sounds like a great option. I’ll keep that in mind for when I regenerate the data. I appreciate the help!
