Doc polish: "Data Loaders" --> "Datasets" #222

Open
ZhitingHu opened this issue Oct 4, 2019 · 6 comments

Comments

@ZhitingHu
Member

The section is titled "Data Loaders" https://texar-pytorch.readthedocs.io/en/latest/code/data.html#data-loaders

Would "Datasets" be better? Or does "Data Loaders" fit the PyTorch convention better?

@huzecong @AvinashBukkittu

@AvinashBukkittu
Collaborator

I personally like "Datasets" as the section heading here. All the classes described under it are datasets provided by Texar. Our data iterators are similar to PyTorch's data loaders. Also, I see that we are missing the doc for SingleDatasetIterator. I don't know if this was intentional.

@ZhitingHu ZhitingHu added enhancement New feature or request topic: docs Issue about docstrings and documentation labels Oct 4, 2019
@ZhitingHu
Member Author

@huzecong
Collaborator

huzecong commented Oct 4, 2019

I like Dataset as well. I think the terms people use to describe data-related modules are pretty messy, so as long as we're being consistent it's fine. Let me reiterate our definitions:

  • A data source is something that reads and returns raw data examples one by one. Typical data sources include Python lists and iterators (SequenceDataSource and IterDataSource), lines from text files (TextLineDataSource), and pickled objects from binary files (PickleDataSource).
  • A dataset (or data loader) defines how data examples are preprocessed into a format suitable for the task, and how these processed examples can be batched. These are named *Data in our framework for compatibility with the TF version (although I kind of prefer names like MonoTextData to MonoTextDataset because they're shorter and still to the point). Note that a dataset does not perform any of these operations by itself.
  • A data iterator executes the process and batch operations defined in a dataset. PyTorch calls this a "data loader".
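
To make the three roles concrete, here is a minimal pure-Python sketch. It is only an illustration of the split described above: the class names ListSource, TokenData and SimpleIterator are hypothetical stand-ins, not Texar's actual classes or signatures, and the whitespace tokenization and "<pad>" padding are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Sequence


class ListSource:
    """Data source (hypothetical): returns raw examples one by one."""

    def __init__(self, items: Sequence[str]) -> None:
        self._items = items

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)


class TokenData:
    """Dataset (hypothetical): only *defines* processing and batching."""

    def __init__(self, source: Iterable[str], batch_size: int) -> None:
        self.source = source
        self.batch_size = batch_size

    def process(self, raw: str) -> List[str]:
        # Assumed preprocessing: lowercase and split on whitespace.
        return raw.lower().split()

    def collate(self, examples: List[List[str]]) -> Dict[str, list]:
        # Pad every example to the longest one in the batch.
        max_len = max(len(ex) for ex in examples)
        return {
            "tokens": [ex + ["<pad>"] * (max_len - len(ex)) for ex in examples],
            "lengths": [len(ex) for ex in examples],
        }


class SimpleIterator:
    """Data iterator (hypothetical): executes what the dataset defines."""

    def __init__(self, dataset: TokenData) -> None:
        self.dataset = dataset

    def __iter__(self) -> Iterator[Dict[str, list]]:
        batch: List[List[str]] = []
        for raw in self.dataset.source:
            batch.append(self.dataset.process(raw))
            if len(batch) == self.dataset.batch_size:
                yield self.dataset.collate(batch)
                batch = []
        if batch:  # emit the final, possibly smaller, batch
            yield self.dataset.collate(batch)


source = ListSource(["Hello world", "A longer example sentence", "Bye"])
dataset = TokenData(source, batch_size=2)
batches = list(SimpleIterator(dataset))
```

Nothing is read or processed when TokenData is constructed; the work happens only once SimpleIterator is iterated, which mirrors the point that a dataset does not perform any operations by itself.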

It is intentional that we don't include the doc for SingleDatasetIterator. Users are expected to only use the DataIterator interface.

@ZhitingHu
Member Author

Thanks for the clarification. Can these definitions be added somewhere in the docs?

@ZhitingHu
Member Author

We can probably have an "Overview" page for each set of modules, to give an overview and highlight key features. Like in TF: https://www.tensorflow.org/api_docs/python/tf/data

@huzecong
Collaborator

huzecong commented Oct 4, 2019

Sure. I'll get on it.

@huzecong huzecong self-assigned this Oct 4, 2019