Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add no_break_image break mode: no partial image at the beginning of a… #740

Merged
merged 8 commits into from Jul 18, 2023

Conversation

violet-zct
Copy link

…n examples

Patch Description
Describe your changes

Testing steps
Describe how you tested your changes

metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/tasks/streaming_language_modeling.py Outdated Show resolved Hide resolved
@ArmenAg
Copy link
Contributor

ArmenAg commented Jul 12, 2023

@violet-zct Could we add some tests that cover the behavior we expect? Similar to https://github.com/facebookresearch/metaseq/blob/main/cpu_tests/test_streaming_token_block_dataset.py

…token of an image 3. fix get_document_boundaries()
@violet-zct
Copy link
Author

  1. I pushed a cpu test for no_image_break: https://github.com/facebookresearch/metaseq/blob/cm3v2/metaseq/data/document_to_sequence.py#L295
    During my tests, I fix two corner cases in get_document_boundaries() of the cm3 dataset: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/metaseq/data/cm3_dataset.py#L208 and #L211, we probably can hit these corner cases when the document is very long.

  2. In addition, I found that your get_document_boundaries() can't pass the test: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/cpu_tests/test_cm3_dataset.py#L97, unless you remove + 1 in this line: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/cpu_tests/test_cm3_dataset.py#L97

  3. I fixed image spans by also including the special token at the end of a image.

@ArmenAg
Copy link
Contributor

ArmenAg commented Jul 17, 2023

Tests look great to me. Behavior is as expected. Let's land!

@violet-zct violet-zct merged commit 929cf1a into cm3v2 Jul 18, 2023
1 of 3 checks passed
@violet-zct violet-zct deleted the cm3v2_chunting_dev branch July 18, 2023 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants