Support providing streams into mixer via CLI #130

Open
soldni opened this issue Feb 29, 2024 · 0 comments

soldni commented Feb 29, 2024

@IanMagnusson asks

I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this, but I can't figure out how to index the streams arg correctly:

dolma mix --streams[0].name "$name" \
            --streams[0].documents "$input_prefix/$file" \
            --streams[0].output.path "$output_prefix/$file" \
            --streams[0].output.max_size_in_bytes 1000000000 \
            --streams[0].attributes s2orc-eval \
            --streams[0].filter.exclude '$@.attributes[?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]'
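
To make the request concrete, here is a minimal sketch of how indexed flags such as `--streams[0].output.path` could be folded into a nested config. It is not dolma's actual argument parser: `parse_indexed_flags` and `_set_path` are hypothetical names, every value is kept as a string, and flags are assumed to come in strict `--key value` pairs.

```python
import re
from typing import Any, List

# Matches either a plain key ("streams", "output") or a list index ("[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def _set_path(root: dict, path: List[Any], value: Any) -> None:
    """Assign value at path, creating dicts and lists along the way."""
    node: Any = root
    for i, step in enumerate(path):
        last = i == len(path) - 1
        # The next step decides whether the child container is a list or a dict.
        child: Any = None if last else ([] if isinstance(path[i + 1], int) else {})
        if isinstance(step, int):
            while len(node) <= step:  # grow the list up to the requested index
                node.append(None)
            if last:
                node[step] = value
            else:
                if node[step] is None:
                    node[step] = child
                node = node[step]
        elif last:
            node[step] = value
        else:
            node = node.setdefault(step, child)


def parse_indexed_flags(argv: List[str]) -> dict:
    """Turn ['--streams[0].name', 'x'] into {'streams': [{'name': 'x'}]}."""
    config: dict = {}
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a --flag, got {flag!r}")
        # 'streams[0].output.path' becomes ['streams', 0, 'output', 'path'].
        path = [int(idx) if idx else key for key, idx in _TOKEN.findall(flag[2:])]
        _set_path(config, path, value)
    return config
```

Type coercion (for example, turning `max_size_in_bytes` into an integer or `documents` into a list) is deliberately left out; a real implementation would need to take that from the config schema.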

We should support this use case. As a stopgap, we should support echo '{...}' | dolma -c - mix, i.e. allow passing config through stdin.
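
A rough sketch of the stdin stopgap, assuming the usual convention that `-` as the config path means "read from standard input". The `load_config` helper is hypothetical and is not taken from the dolma codebase, and it assumes the piped config is JSON, as in the `echo '{...}'` example above.

```python
import json
import sys
from typing import Any, Dict, Optional, TextIO


def load_config(path: str, stdin: Optional[TextIO] = None) -> Dict[str, Any]:
    """Load a JSON config from a file, or from stdin when path is '-'."""
    if path == "-":
        # Resolve sys.stdin at call time so callers and tests can substitute a stream.
        stream = stdin if stdin is not None else sys.stdin
        return json.load(stream)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With something like this behind the `-c` option, the config could be generated by a script and piped in without writing a temporary file.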

@soldni soldni added the enhancement New feature or request label Feb 29, 2024
@soldni soldni self-assigned this Feb 29, 2024