Request: translate this project #48

Open
Tracked by #51
oood opened this issue Mar 3, 2023 · 28 comments

@oood
Member

oood commented Mar 3, 2023

Rationale:

Since this project will mostly benefit non-native English speakers, it would be more helpful if translations were available in their languages.

Suggestion:

Here's an example translation from Japanese Wikipedia:

  • DB データベース (Database)

That way, when non-native speakers see an abbreviation in code that they don't understand, they can look up the original word. It would also help them create better abbreviations.

Before getting started, #46 needs to be considered to make contributing to the project easier.

This issue is a fork of a previous thread (#41); you may be interested in reading the earlier discussion.

@kisvegabor
Member

Do you mean having separate files for each language, or adding the translations of the original word next to that word? E.g.

  • software, 软件, ソフトウェア • 🟡 sw { computer science }

@oood
Member Author

oood commented Mar 4, 2023

Do you mean having separate files for each language

Of course, each language needs to have separate files ( README_ja-JP.md, README_it-IT.md, README_pt-BR.md ), otherwise that would be messy.

@kisvegabor
Member

I agree that it'd be more manageable this way.

I assume we will have max 1-2 new abbreviations or changes per week so it's only a little bit of extra work to maintain the translations.

However, we need determined contributors for each language, or else the translations will easily get out of sync.


Maybe we should store the abbreviations in a yaml/json file like:

{
	"software": {
		"translations": {
			"ZH": "软件",
			"JA": "ソフトウェア"
		},
		"recommended": "sw",
		"not recommended": ["softw", "sware"],
		"context based": {
			"some context": "blah",
			"some context2": "blah2"
		}
	}
}

and from this file we can easily generate the READMEs with CI.

@T1xx1
Member

T1xx1 commented Mar 4, 2023

I agree with @kisvegabor: generating the READMEs from JSON would be easier to maintain, and at the same time it would solve the JSON problem for creating the website.

@T1xx1
Member

T1xx1 commented Mar 4, 2023

We should also have something like an index file listing only the languages we provide.

[
"sp",
"it"
and so on
]

So we iterate over this array and only take the en abbr and the corresponding translation in the object.
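
As a rough illustration of iterating over a language index, here is a minimal shell sketch. The `LANGS` list and the function names are hypothetical, and the two translations are the ones quoted earlier in this thread. It only uses plain POSIX constructs, so it should run under bash as well.

```bash
# Hypothetical language index; the real list is not settled in this thread.
LANGS="ja zh"

# Return the translation of "software" for one language code.
translate_software() {
  case "$1" in
    ja) printf '%s\n' 'ソフトウェア' ;;
    zh) printf '%s\n' '软件' ;;
    *)  printf '%s\n' 'software' ;;   # fall back to the English word
  esac
}

# Emit one line per indexed language: translation, English word, abbreviation.
render_software() {
  for lang in $LANGS; do
    printf '%s: %s (software) • sw\n' "$lang" "$(translate_software "$lang")"
  done
}

render_software
```

Each indexed language yields one line, and a language code without a translation falls back to the English word.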

@kisvegabor
Member

kisvegabor commented Mar 4, 2023

Great! @oood suggested the same thing in #47 so it seems we are on the same page.

Let's find a format then! You can see my idea above, but feel free to suggest something completely different.

@oood
Member Author

oood commented Mar 4, 2023

However, we need determined contributors for each language, or else the translations will easily get out of sync.

When we start a new language, we assume the first contributor has translated 100% of the current abbreviations/words, so it doesn't matter if it falls out of sync later. Words added later that are not yet translated will make it easier for people reading pages in that language to realize there is a need to contribute, like the red links on Wikipedia that don't have pages yet. Untranslated words could also link to a contribution guide or something similar to encourage people to submit translations.

DB データベース (Database)
software, [translate it, link] • 🟡 sw { computer science }

Let's find a format then! You can see my idea above, but feel free to suggest something completely different.

I think it would be great to do that.

We need to first determine what content a word may require.

@oood
Member Author

oood commented Mar 4, 2023

We should also have something like an index file listing only the languages we provide.

[
"sp",
"it"
and so on
]

So we iterate over this array and only take the en abbr and the corresponding translation in the object.

Yes, initially we only maintain EN.

We now need to determine what information an abbreviation contains:

Word, abbreviation, context, recommendation, link to issue and ?

BTW, our sorting should be based on abbreviations rather than words themselves, because an abbreviation may correspond to many words, and putting them together is helpful for retrieval.

@T1xx1
Member

T1xx1 commented Mar 5, 2023

BTW, our sorting should be based on abbreviations rather than words themselves, because an abbreviation may correspond to many words, and putting them together is helpful for retrieval.

Nope. We just swapped the list a few PRs ago. When someone is searching for an abbr, they only know the word, so it makes more sense to leave it like that.

I assure you that sorting them by abbr was a mess.

@kisvegabor
Member

Word, abbreviation, context, recommendation, link to issue and ?

We need not recommended as well.

Nope. We just swapped the list a few PRs ago. When someone is searching for an abbr, they only know the word, so it makes more sense to leave it like that.

I just wanted to comment the same 🙂

@T1xx1
Member

T1xx1 commented Mar 5, 2023

En README.
• _word_ • _recommendation degree_ [_abbr_](link to discussion) { _context if context-sensitive_ }
Sorted by word.

Any other lang README.
• (translation) _word_ • _recommendation degree_ [_abbr_](_link to discussion_) { _context if context-sensitive_ }
Sorted by translation.
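
To make the line format above concrete, here is a minimal sketch that renders entries from pipe-separated records and sorts them by word. The record layout, the sample values (including the degree names and issue links), and the `render_readme` name are assumptions for illustration, not an agreed format.

```bash
# Hypothetical records: word|degree|abbr|link|context (context may be empty).
records='software|yellow|sw|#123|computer science
database|green|db|#45|'

render_readme() {
  # Sorting whole lines sorts by word, because the word is the first field.
  printf '%s\n' "$records" | sort | while IFS='|' read -r word degree abbr link context; do
    line="• $word • $degree [$abbr]($link)"
    # Append the braces only for context-sensitive abbreviations.
    if [ -n "$context" ]; then
      line="$line { $context }"
    fi
    printf '%s\n' "$line"
  done
}

render_readme
```

A translated README could reuse the same loop with the translation as the first field, so that sorting by line sorts by translation.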

@T1xx1
Member

T1xx1 commented Mar 5, 2023

Realized that we can also use

[
"sp",
"it"
and so on
]

to create an index README with the available langs.
So if someone is searching for a specific lang, they can go there without browsing the code.
We should add to the en README a link like
[Other translations](...)

@oood
Member Author

oood commented Mar 5, 2023

As this project grows, I'm not sure if a readme file can be as long as a dictionary.

Maybe alphabetically?

abbreviations-in-code
├─── readme.md
├─── en-US
|	├─── A.md
|	├─── B.md
|	├─── ...
|	└─── Z.md
├─── it-IT
|	├─── A.md
|	├─── ...
|	├─── Z.md
|	└─── readme.md
├─── raw
|	└─── raw
└─── docs
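
As a sketch of how a generator script might pick the per-letter file in a layout like this, the hypothetical helper below maps a word to its target path. It assumes the word starts with an ASCII letter; translations in other scripts would need a different bucketing rule.

```bash
# Map a word to the per-letter file it would live in, e.g. en-US/S.md.
# target_file is a hypothetical helper, not part of the project.
target_file() {
  lang=$1
  word=$2
  # Take the first character and upper-case it to get the file name.
  first=$(printf '%s' "$word" | cut -c1 | tr '[:lower:]' '[:upper:]')
  printf '%s/%s.md\n' "$lang" "$first"
}

target_file en-US software
target_file it-IT database
```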

@oood
Member Author

oood commented Mar 5, 2023

I assure you that sorting them by abbr was a mess.

I just wanted to comment the same 🙂

OK, I get it: words are unique, abbreviations are not. I now agree with that idea.

We need not recommended as well.

of course.

En README.
• _word_ • _recommendation degree_ [_abbr_](link to discussion) { _context if context-sensitive_ }
Sorted by word.

Another improvement I think we can make is to the context: some words don't make sense even when written out in full, so they need a description or a link to a Wikipedia page for people to understand what they mean.

to create an index README with the available langs.
So if someone is searching for a specific lang, they can go there without browsing the code.
We should add to the en README a link like

that would be great.

How about this?
If a language's page doesn't exist, we copy an English page under that language to encourage people to contribute.

@oood
Member Author

oood commented Mar 5, 2023

Maybe we should store the abbreviations in a yaml/json file like:

Is it possible to create separate json for each language?

Because I think it will be complicated to maintain a huge JSON file as the project grows; after all, that requires manual merging.

Or just simply a plain text file, separated by commas, one word per line.

Because raw data has to be maintained by humans, we don't have to make it machine readable, machines can adapt with scripts.

@T1xx1 T1xx1 mentioned this issue Mar 5, 2023
23 tasks
@kisvegabor
Member

kisvegabor commented Mar 6, 2023

As this project grows, I'm not sure if a readme file can be as long as a dictionary.

I agree.

├─── en-US
| ├─── A.md
| ├─── B.md
| ├─── ...
| └─── Z.md

It depends on whether we add the acronyms or not. Without them we could have only 1 file/language which would be easier to read and search.

Is it possible to create separate json for each language?

I don't think it's a good idea because this way we need to maintain and keep in sync the whole _recommendation degree_ [_abbr_](_link to discussion_) { _context if context-sensitive_ } part for each language.

I suggest having a single DB file with all words, abbreviations, links, translations, etc. It can be large, but it has a simple structure which is easy to follow and understand.

Because raw data has to be maintained by humans, we don't have to make it machine readable, machines can adapt with scripts.

I agree that we should pick something that works well for people and write a script for this special format. E.g.

# software
- issue: 123  
- translations
   - zh: ...   
   - ja: ... 
- abbreviations
   - recommended: sw
   - context sensitive: foo1 (context1), foo2 (context2)
   - not recommended: softw, sware

It wasn't my intention but it looks like Markdown, which seems like a good compromise 🙂
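
To give a feel for how a script could read a list format like this, here is a minimal sketch that reports words with no translation for a given language. The sample entries and the `missing_translations` name are assumptions for illustration, and the parsing relies only on the `# word` and `- lang: ...` line shapes shown above.

```bash
# Illustrative data in the list format sketched above; the entries are samples.
db='# software
- issue: 123
- translations
   - zh: 软件
   - ja: ソフトウェア
- abbreviations
   - recommended: sw
# database
- translations
   - ja: データベース
- abbreviations
   - recommended: db'

# Print every word that has no translation for the given language code.
missing_translations() {
  lang=$1
  word=''
  found=1
  printf '%s\n' "$db" | {
    while IFS= read -r line; do
      case "$line" in
        '# '*)
          # A new heading starts: report the previous word if it had no match.
          if [ -n "$word" ] && [ "$found" -eq 0 ]; then printf '%s\n' "$word"; fi
          word=${line#'# '}
          found=0
          ;;
        *"- $lang: "*) found=1 ;;
      esac
    done
    # Report the last word in the data as well.
    if [ -n "$word" ] && [ "$found" -eq 0 ]; then printf '%s\n' "$word"; fi
  }
}

missing_translations zh
```

With the sample data, only `database` lacks a `zh` entry, so a generated page could flag that word as needing a translation.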

@T1xx1
Member

T1xx1 commented Mar 6, 2023

# software
- issue: 123  
- translations
   - zh: ...   
   - ja: ... 
- abbreviations
   - recommended: sw
   - context sensitive: foo1 (context1), foo2 (context2)
   - not recommended: softw, sware

It wasn't my intention but it looks like Markdown, which seems like a good compromise 🙂

I prefer having only one DB; it's easier to maintain.

# **word**
- translations only if has translations
	- it: ...
	- sp: ...
- abbr: **abbr**
	- degree: red/purple/yellow/green
	- context: **context** only if needed

We need to make it as simple and short as possible, so that when we have 1000s of abbrs the DB won't be 10 GB.

@oood
Member Author

oood commented Mar 6, 2023

It wasn't my intention but it looks like Markdown, which seems like a good compromise 🙂
I prefer having only one DB; it's easier to maintain.

I don't have a lot of experience with database formats, so I can't give good advice.

Remember how I found this project by searching for "db"? I'm working on a function that dumps a database format into a human-readable form, which is how I found this project.

However, I do want it to be human-readable; we don't need to make it machine-readable unless we make a tool/robot in the future that can automatically import from issues. But this is just my personal opinion; it depends on how you define future needs and find the best solution.

We need to make it as simple and short as possible, so that when we have 1000s of abbrs the DB won't be 10 GB.

I don't think it will ever be 10 GB since this is just plain text, but documents over 100 MB are often difficult for a text editor to load. I also don't think it will be more than 100 MB, maybe 30 MB at most including all languages and 5000+ words.

@oood
Member Author

oood commented Mar 6, 2023

# software
- issue: 123  
- translations
   - zh: ...   
   - ja: ... 
- abbreviations
   - recommended: sw
   - context sensitive: foo1 (context1), foo2 (context2)
   - not recommended: softw, sware
# **word**
- translations only if has translations
	- it: ...
	- sp: ...
- abbr: **abbr**
	- degree: red/purple/yellow/green
	- context: **context** only if needed

Honestly, the format you guys are thinking about looks a lot like yaml. I like this format, except it's whitespace sensitive.

Yaml also supports comments, which can be helpful, especially if you include comments in your database.

We may not have to reinvent the wheel, yaml is fine with me.

@T1xx1
Member

T1xx1 commented Mar 6, 2023

I don't see it as yaml but text. JSON needs {}, [] and "" to be valid, and after a bit your eyes go on vacation, not to mention the space and the format the database will have (no thanks). I don't like yaml and I don't have a good explanation. So that can be our abbr format. Maybe the db name can be:
Main.abbr.txt

@oood
Member Author

oood commented Mar 6, 2023

I don't see it as yaml but text.

JSON needs {}, [] and "" to be valid, and after a bit your eyes go on vacation, not to mention the space and the format the database will have (no thanks).

Yes, and JSON needs a proper reader to be easy to read, so I really don't like it. I often use nano to edit text in the terminal and, to put it bluntly, I hate JSON. Given that this is a GitHub repository, we'll probably be editing directly with GitHub's web editor, and typing in a browser would suck.

I don't like yaml and I don't have a good explanation. So that can be our abbr format.

My biggest pet peeve with yaml is indentation, especially space indentation, which is very error-prone.

Maybe the db name can be. Main.abbr.txt

Yes, this can be our own format, no need to follow yaml or json.

I like the freedom and manageability of the format. Anyway we can write a script to convert it to any format, so no problem.

@T1xx1
Member

T1xx1 commented Mar 6, 2023

Talking about scripts and conversions: in which lang should we write the scripts?
We need a lang that can run locally to generate files. Previously the script was written in Perl, but I suggest Node.js or bash. Preferably an interpreted one so we don't have to compile anything.

@oood
Member Author

oood commented Mar 6, 2023

# software
# **word**

Please consider not using # for headings, so that we can use it for comments in the format and our script can skip commented lines.

Or use /* */ to delimit comments.

It seems to me that this is the yaml 👇, maybe the spaces are out of specification, I didn't double check.

# this is a comment.
word:
- translations:
	- it: ...
- abbr:
	- degree: red/purple/yellow/green
	- context: **context** only if needed
- abbr2:
	- ...:
# new one
word2:
- ......
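
As a minimal sketch of a script that skips commented lines, the snippet below lists only the top-level words from data shaped like the example above. The sample entries and the `list_words` name are assumptions, and the sample is indented with spaces because YAML does not allow tabs for indentation.

```bash
# Illustrative data shaped like the example above; the values are samples.
db='# this is a comment.
software:
- translations:
  - ja: ソフトウェア
- sw:
  - degree: yellow
# new one
database:
- db:
  - degree: green'

# Print the top-level words, ignoring comments and nested list items.
list_words() {
  printf '%s\n' "$db" | while IFS= read -r line; do
    case "$line" in
      '#'*) ;;                                  # comment line: skip it
      [A-Za-z]*:) printf '%s\n' "${line%:}" ;;  # top-level "word:" key
    esac
  done
}

list_words
```

Nested items start with `-` or a space, so they match neither pattern and are ignored along with the comments.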

@oood
Member Author

oood commented Mar 6, 2023

We need a lang that can run locally to generate files. Previously the script was written in Perl, but I suggest Node.js or bash. Preferably an interpreted one so we don't have to compile anything.

I like bash. It can be run directly in GitHub CI.

@T1xx1
Member

T1xx1 commented Mar 6, 2023

# this is a comment.
word:
- translations:
	- it: ...
- abbr:
	- degree: red/purple/yellow/green
	- context: **context** only if needed
- abbr2:
	- ...:
# new one
word2:
- ......

Like this format.

I like bash. It can be run directly in GitHub CI.

Great.

@oood
Member Author

oood commented Mar 6, 2023

Like this format.

That's yaml, lol.

Great.

Yep, but in bash you can't match a string directly, you have to rely on external programs like awk.

@T1xx1
Member

T1xx1 commented Mar 6, 2023

ChatGPT says otherwise.

In Bash, you can use pattern matching to check if a string matches a particular format. Here are some examples:

To check if a string matches a specific word:

bash
if [[ "$string" == "hello" ]]; then
  echo "The string is 'hello'"
fi

To check if a string starts with a particular prefix:

bash
if [[ "$string" == "prefix"* ]]; then
  echo "The string starts with 'prefix'"
fi

To check if a string ends with a particular suffix:

bash
if [[ "$string" == *"suffix" ]]; then
  echo "The string ends with 'suffix'"
fi

To check if a string contains a particular substring:

bash
if [[ "$string" == *"substring"* ]]; then
  echo "The string contains 'substring'"
fi

To check if a string matches a regular expression pattern:

bash
if [[ "$string" =~ ^[0-9]+$ ]]; then
  echo "The string consists of only digits"
fi

Note that the [[ and ]] are special operators in Bash that allow for more advanced pattern matching than the single [ and ] brackets used in the older test command.

@oood
Member Author

oood commented Mar 6, 2023

ChatGPT says otherwise.

No, I mean, you can't get matching words directly from the database, you have to rely on awk or something.

In bash you do something like this:

if [ "$db_word" = "$word" ]; then
    # use awk here to extract "$abbr" from the matching entry
    :
fi

So I'm not sure it will be as efficient as a Perl script, since bash has to call a second program: its builtins are not well suited to getting the abbr directly from the word.
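
For comparison, here is a minimal sketch of both approaches on a comma-separated file with one word per line, as floated earlier in the thread. The sample data and the function names are hypothetical, and nothing here measures which variant is faster on a large file.

```bash
# Hypothetical data: one "word,abbr" pair per line.
db='database,db
software,sw'

# Builtins only: read each line, split on the comma, compare the word.
lookup_builtin() {
  printf '%s\n' "$db" | while IFS=, read -r word abbr; do
    if [ "$word" = "$1" ]; then
      printf '%s\n' "$abbr"
      break
    fi
  done
}

# Same lookup delegated to awk, as described in the comment above.
lookup_awk() {
  printf '%s\n' "$db" | awk -F, -v w="$1" '$1 == w { print $2; exit }'
}

lookup_builtin software
lookup_awk database
```

The builtin loop avoids starting a second program, while the awk version is shorter; for a nested format the trade-off may look different.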

@T1xx1 T1xx1 added this to the Dictionary project milestone Mar 7, 2023
@T1xx1 T1xx1 self-assigned this Nov 21, 2023

3 participants