Having the number of translations in the advanced search #3090

Open
Guybrush88 opened this issue Dec 8, 2023 · 3 comments
Labels
enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba.

Comments

@Guybrush88

As reported by cojiluc on the wall:

Please consider adding "Number of translations" (at least, at most; link type: direct, indirect) to the search criteria in Advanced Search.

Some advantages:
(1) It could be useful for people who want to translate or find the sentences with the most translations; these are sometimes among the most popular/universal or the easiest sentences.
(2) It could be useful for people who want to translate or find sentences with few translations; these are sometimes among the most "virgin" sentences, the less noisy sentences, etc.
(3) Combining this criterion with some of the existing criteria could be very useful for helping users locate good sentences.

For the length of a sentence, the advanced search already has this useful feature: Length (At least, At most).

"Number of translations" is no less important than some other criteria.
Let us compare it with two existing criteria: "orphan" and "unapproved" sentences.
The statistics below (for the top 20 languages on Tatoeba) show that for most languages, orphan and unapproved sentences are not a big deal.
I am not saying the orphan/unapproved criteria are not useful; my only point is that if we have these criteria for filtering just a handful of sentences among tens of thousands of sentences, let us have "Number of translations" as well.

Language; number of all sentences; number of orphan sentences; number of unapproved sentences

English; 1.8M; 47,173; 5,221
Russian; 1M; 243; 78
Italian; 868K; 0; 18
Esperanto; 736K; 23; 61
Turkish; 732K; 281; 237
Kabyle; 696K; 16; 42
Berber; 651K; 29; 546
German; 634K; 7; 50
French; 587K; 295; 6,383
Portuguese; 424K; 1,156; 86
Spanish; 407K; 11; 2,773
Hungarian; 401K; 2,048; 25
Japanese; 241K; 100,369; 176
Hebrew; 201K; 307; 19
Ukrainian; 184K; 0; 13
Dutch; 179K; 0; 36
Finnish; 147K; 25; 7
Polish; 124K; 0; 38
Lithuanian; 99K; 325; 2
Macedonian; 78K; 6; 2

https://tatoeba.org/it/wall/show_message/40365#!#message_40365
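
To make the request concrete, here is a minimal sketch of an "at least / at most" filter on translation counts, modelled on how the existing Length criterion is described. It is not Tatoeba's search code: the function names and the `counts` mapping (sentence id to number of translations) are hypothetical, and the counts are assumed to be available already.

```python
from typing import Dict, List, Optional


def in_range(value: int, at_least: Optional[int] = None,
             at_most: Optional[int] = None) -> bool:
    """Return True if value satisfies the optional lower and upper bounds."""
    if at_least is not None and value < at_least:
        return False
    if at_most is not None and value > at_most:
        return False
    return True


def filter_by_translation_count(counts: Dict[int, int],
                                at_least: Optional[int] = None,
                                at_most: Optional[int] = None) -> List[int]:
    """Return the ids of sentences whose translation count is within bounds.

    `counts` is a hypothetical mapping of sentence id -> number of
    translations (direct or indirect, depending on what was counted).
    """
    return sorted(sid for sid, n in counts.items()
                  if in_range(n, at_least, at_most))
```

Leaving a bound as `None` means "no limit" on that side, so `at_most=0` alone would select untranslated sentences and `at_least=10` alone would select heavily translated ones.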

@Guybrush88 Guybrush88 added the enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba. label Dec 8, 2023
@ckjpn

ckjpn commented Dec 9, 2023

I suspect that if this were to be done, it might be best to not do this in real time, but to generate the number of direct links a sentence has only from time to time -- perhaps once a week, before the weekly downloadable files are created, and then also create a downloadable file with these numbers.
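
A rough sketch of what such a periodic batch job could look like, assuming a tab-separated links file with one `sentence_id<TAB>translation_id` row per direction of each link. The function names and the output layout are made up for illustration; this is not an existing Tatoeba script.

```python
import csv
from collections import Counter
from typing import Iterable, Sequence, TextIO


def count_direct_links(link_rows: Iterable[Sequence[str]]) -> Counter:
    """Count the distinct direct links of each sentence.

    Assumes every row is (sentence_id, translation_id) and that a link
    between A and B is listed once as (A, B) and once as (B, A).
    """
    seen = set()
    counts: Counter = Counter()
    for row in link_rows:
        pair = (int(row[0]), int(row[1]))
        if pair in seen:  # ignore accidental duplicate rows
            continue
        seen.add(pair)
        counts[pair[0]] += 1
    return counts


def write_counts(counts: Counter, out: TextIO) -> None:
    """Write `sentence_id<TAB>count` lines, sorted by sentence id."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for sentence_id in sorted(counts):
        writer.writerow([sentence_id, counts[sentence_id]])
```

Note that sentences with no links at all never appear in the input, so they would be missing from the output rather than listed with a count of 0.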

@LBeaudoux
Contributor

(1) It could be useful for people who want to translate or find the sentences with the most translations; these are sometimes among the most popular/universal or the easiest sentences.

From my experience, I've learned that the most linked sentences of a language are primarily those that are:

  1. older
  2. shorter
  3. translated/post-linked several times by a single Tatoeban

Surfacing the sentences with a high number of translations would reinforce these biases.

(2) It could be useful for people who want to translate or find sentences with few translations; these are sometimes among the most "virgin" sentences, the less noisy sentences, etc.

I doubt that there are many translators out there looking for these "virgin" sentences.

(3) Combining this criterion with some of the existing criteria could be very useful for helping users locate good sentences.

I don't think an extra filter is the proper way to help translators find better sentences to translate. Rather, we should measure the relative number of translators for a sentence compared to its closest peers of the same language, age and length. And then we could use this popularity score as a sorting option for the advanced search.
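
One possible reading of this idea, as a sketch only: group sentences by language, year of creation and a length bucket, then compare each sentence's number of translators with the median of its group. The field names (`id`, `lang`, `year`, `text`, `translators`), the bucket width and the +1 smoothing are all assumptions chosen for illustration, not a proposed final formula.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple


def peer_key(sentence: dict, length_bucket: int = 10) -> Tuple[str, int, int]:
    """Group key: same language, same creation year, similar length."""
    return (sentence["lang"], sentence["year"],
            len(sentence["text"]) // length_bucket)


def popularity_scores(sentences: List[dict]) -> Dict[int, float]:
    """Score each sentence relative to the median of its peer group.

    A score of 1.0 means "as many translators as a typical peer";
    the +1 on both sides avoids dividing by zero for untranslated groups.
    """
    groups = defaultdict(list)
    for s in sentences:
        groups[peer_key(s)].append(s["translators"])
    scores = {}
    for s in sentences:
        baseline = median(groups[peer_key(s)])
        scores[s["id"]] = (s["translators"] + 1) / (baseline + 1)
    return scores
```

Used as a sorting option, results would be ordered with something like `sorted(scores, key=scores.get, reverse=True)`, so that an old, short sentence only ranks high if it outperforms other old, short sentences.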

@ckjpn

ckjpn commented Dec 10, 2023

I, too, sort of doubt this would be all that useful, for the reasons mentioned above.

As for "virgin" sentences, those with no translations, these can already be found using the "exclude", "any language", and "direct link" or "any link" options.

Template (pre-filled form):
https://tatoeba.org/en/sentences/advanced_search?&trans_filter=exclude&trans_link=&sort=random

Currently 1,827,697 results, which is 15.5% of our sentences (1,827,697 / 11,779,865).
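
For anyone who wants to reproduce this, a small sketch that rebuilds the template link from its query parameters and recomputes the share. The parameter names and values are taken from the URL above; the helper names are made up, and which options the empty `trans_link` value maps to is defined by the search form, not by this snippet.

```python
from urllib.parse import urlencode

BASE = "https://tatoeba.org/en/sentences/advanced_search"


def untranslated_search_url(sort: str = "random") -> str:
    """Build the advanced-search link that excludes translated sentences.

    Parameters copied from the template link: trans_filter=exclude,
    an empty trans_link, and a sort order.
    """
    params = {"trans_filter": "exclude", "trans_link": "", "sort": sort}
    return f"{BASE}?{urlencode(params)}"


def share(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)
```

With the figures quoted above, `share(1827697, 11779865)` gives 15.5.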
