Document level granularity of Paracrawl #259

Open
vince62s opened this issue Oct 5, 2023 · 1 comment

vince62s commented Oct 5, 2023

Hi,

Is there a release of Paracrawl somewhere with Bitextor granularity "Document" instead of sentences?

If not, what is the easiest way to reproduce it?

Cheers.

Member

ZJaume commented Oct 6, 2023

Hi @vince62s,

Unfortunately, there is no Paracrawl release that preserves that kind of information. In the raw file you will find the least filtered version we have. If you group its sentences by URL, keeping the order in which they appear in the file, and concatenate them, you will get some kind of "documents", but some sentences will be missing and the order might not be correct.

However, if you are interested in document-level data and not particularly in the languages of Paracrawl, the parallel data from https://macocu.eu was created with more recent Bitextor versions, so the latest version of each language pair has a doc.txt file available for download. In those files, columns 3 and 4 each contain a base64-encoded document. Note that you might need additional filtering, as this doc version is less filtered in order to preserve full documents.
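
As a rough illustration of both options, here is a minimal Python sketch. It rests on assumptions not confirmed in this thread: that both files are tab-separated, that the raw file carries the source and target URLs in its first two columns and the sentences in the next two, and that columns 3 and 4 of doc.txt are counted from 1. The function names and the default column indices are hypothetical, so check them against the actual files before relying on them.

```python
import base64


def decode_doc_line(line, src_col=2, tgt_col=3):
    """Decode the two base64 document columns of one doc.txt line.

    Assumes tab-separated fields; columns 3 and 4 (1-based) are
    indices 2 and 3 here. Adjust if the real layout differs.
    """
    fields = line.rstrip("\n").split("\t")
    src = base64.b64decode(fields[src_col]).decode("utf-8")
    tgt = base64.b64decode(fields[tgt_col]).decode("utf-8")
    return src, tgt


def group_by_url(lines, url_cols=(0, 1), text_cols=(2, 3)):
    """Concatenate sentence pairs that share a URL pair, in file order.

    The column positions are assumptions about the raw file layout.
    The result only approximates documents: sentences dropped during
    processing are missing and the order may not match the original.
    """
    docs = {}  # dicts keep insertion order (Python 3.7+)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[url_cols[0]], fields[url_cols[1]])
        src_sents, tgt_sents = docs.setdefault(key, ([], []))
        src_sents.append(fields[text_cols[0]])
        tgt_sents.append(fields[text_cols[1]])
    return {k: ("\n".join(s), "\n".join(t)) for k, (s, t) in docs.items()}
```

Both functions take lines of text, so they can be fed from an open file handle and the files can be streamed rather than loaded whole.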
