
experimental multilingual idea #171

Draft · wants to merge 1 commit into main
Conversation

@richard-rogers (Contributor) commented Oct 26, 2023

Uses proposed schema chaining 1380 to support a schema per language for each metric module. Multiple languages can be selected when initializing a metric collection. Metrics are prefixed with the language code.
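A minimal sketch of the naming scheme described above, assuming hypothetical helper names (`metric_name`, `init_metrics`) that are not the actual LangKit API:

```python
# Sketch only: metrics are prefixed with the language code, and one
# schema (modeled here as a list of metric names) is built per language.
from typing import Dict, List

def metric_name(base: str, language: str) -> str:
    # e.g. "fr.response.relevance_to_prompt"
    return f"{language}.{base}"

def init_metrics(languages: List[str]) -> Dict[str, List[str]]:
    # Base metric names are illustrative, not the real module list.
    base_metrics = ["response.relevance_to_prompt", "prompt.toxicity"]
    return {
        lang: [metric_name(m, lang) for m in base_metrics]
        for lang in languages
    }

schemas = init_metrics(["en", "fr"])
print(schemas["fr"][0])  # -> "fr.response.relevance_to_prompt"
```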

@jamie256 (Collaborator) left a comment


Some initial comments. It's ok if we don't have other language models plugged in, but we should stub out how to swap them, or at least validate that they match the configured language.

_transformer_model = Encoder(transformer_name, custom_encoder)
register_dataset_udf(
    [_prompt, _response],
    f"{language}.{_response}.relevance_to_{_prompt}",
Collaborator

Prefixing the language onto the metric name will create a discontinuity with existing integrations and break backward compatibility.

We shouldn't prefix the localization in the metric name, at least not for the original English-only launch of LangKit. It would be better to put the language in metadata, or in the platform via something like the column entity schema?
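A sketch of the metadata alternative suggested here, keeping the metric name stable and carrying the language alongside it (the `MetricSpec` type and its fields are hypothetical, not an existing LangKit structure):

```python
# Sketch: metric name stays unchanged for back-compat; the language
# travels as metadata that the platform/schema layer can read.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MetricSpec:
    name: str                                   # e.g. "response.relevance_to_prompt"
    metadata: Dict[str, str] = field(default_factory=dict)

spec = MetricSpec("response.relevance_to_prompt", {"language": "fr"})
print(spec.name)                   # name is unprefixed
print(spec.metadata["language"])   # language lives in metadata
```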

Contributor (Author)

Do you want, for example, to track English and French toxicity in the same column?

Contributor

Maybe we could keep the original name for English, and add the language prefix only for other languages?
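The compromise above could look like this (a sketch; `metric_name` is a hypothetical helper, not part of the PR):

```python
def metric_name(base: str, language: str, default: str = "en") -> str:
    # Keep the bare metric name for the default language so existing
    # English-only integrations see no change; prefix everything else.
    return base if language == default else f"{language}.{base}"

print(metric_name("response.toxicity", "en"))  # -> "response.toxicity"
print(metric_name("response.toxicity", "fr"))  # -> "fr.response.toxicity"
```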

@@ -41,6 +39,16 @@ def init(lexicon: Optional[str] = None, config: Optional[LangKitConfig] = None):
_nltk_downloaded = True
Collaborator

The lexicon being downloaded is, I believe, language specific; we can't just rename the metric but still download the English-based corpus from NLTK, right? At the least, we should add a check that raises an error or logs a warning in the metrics whose existing models don't target languages other than English.
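The suggested guard could be sketched like this (function name, signature, and the supported-language set are assumptions for illustration; each module would supply its own):

```python
import logging

logger = logging.getLogger(__name__)

def check_language(language: str, metric: str,
                   supported: frozenset = frozenset({"en"})) -> bool:
    # Warn instead of silently scoring other languages with an
    # English-only model or lexicon. Return False so callers can
    # choose to skip registration or raise.
    if language not in supported:
        logger.warning(
            "%s: underlying model/corpus targets %s, not %r; "
            "results may be meaningless",
            metric, sorted(supported), language,
        )
        return False
    return True

print(check_language("en", "sentiment_nltk"))  # -> True
print(check_language("fr", "sentiment_nltk"))  # -> False
```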

input_output.init(config=config)
text_schema = udf_schema()
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
    for language in langauges:
Contributor

typo here? "langauges"

textstat.init(config=config)
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in languages:
regexes.init(language, config=config)
Contributor

Looks like the indentation is wrong here.

Contributor

Considering that the modules are imported before init is called with the desired languages, does that mean English will always be applied, and the others will be additional language-specific metrics?
