
Allow load custom HMM model #48

Open
messense opened this issue Jun 29, 2019 · 2 comments · May be fixed by #92
Labels
enhancement New feature or request

Comments

@messense (Owner)

No description provided.

@messense added the enhancement label on Jun 29, 2019
@messense (Owner, Author)

Ref #23

@name1e5s linked a pull request on Dec 18, 2022 that will close this issue
@awong-dev (Collaborator)

API-wise, Jieba should expose a struct HmmModel as proposed in #92.

However, unlike in the PR, it might be better to go further and take the HmmModel ownership/reference out of the Jieba struct entirely.

The reason is that, looking at a few (very random, possibly not representative) real-world usages, people seem to be confused about whether to set hmm to true or false and what the effect of that choice is. In fact, it feels like people think of "Jieba" simply as a segmentation method, not as "Jieba with HMM", "Jieba with additional dictionaries", or "Jieba with the default HMM dictionary".

Semantically, it might be clearer for the non-expert user of Jieba if the API encoded how these options are used.

For example, let's assume the HmmModel was removed from Jieba's construction. You could imagine:

```rust
trait NoMatchCutter {
    fn cut_with_allocated_memory<'a>(
        &self,
        sentence: &'a str,
        words: &mut Vec<&'a str>,
        v: &mut Vec<f64>,
        prev: &mut Vec<Option<Status>>,
        path: &mut Vec<Status>,
    );
}

fn cut<'a>(&self, sentence: &'a str, no_match_cutter: Option<&dyn NoMatchCutter>) -> Vec<&'a str> {
    ...
}
```
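To make the shape of this concrete, here is a self-contained toy sketch of the `Option<&dyn NoMatchCutter>` idea. Every name in it (`Dict`, `CharCutter`, `cut_unmatched`, the greedy matcher) is a hypothetical stand-in, not the real jieba-rs API; an actual `HmmModel` would implement the fallback trait in place of `CharCutter`.

```rust
/// Fallback invoked only for runs of characters the dictionary cannot match
/// (hypothetical trait, standing in for the proposal above).
trait NoMatchCutter {
    fn cut_unmatched<'a>(&self, fragment: &'a str, words: &mut Vec<&'a str>);
}

/// Trivial fallback: emit each unmatched character as its own word.
struct CharCutter;

impl NoMatchCutter for CharCutter {
    fn cut_unmatched<'a>(&self, fragment: &'a str, words: &mut Vec<&'a str>) {
        for (i, c) in fragment.char_indices() {
            words.push(&fragment[i..i + c.len_utf8()]);
        }
    }
}

/// Stand-in for the Jieba dictionary: knows a fixed set of words.
struct Dict {
    known: Vec<&'static str>,
}

impl Dict {
    /// Greedy longest-prefix match; unmatched runs go to the fallback (if any).
    fn cut<'a>(&self, sentence: &'a str, fallback: Option<&dyn NoMatchCutter>) -> Vec<&'a str> {
        let mut words = Vec::new();
        let mut rest = sentence;
        while !rest.is_empty() {
            let best = self
                .known
                .iter()
                .filter(|w| rest.starts_with(**w))
                .max_by_key(|w| w.len());
            match best {
                Some(w) => {
                    words.push(&rest[..w.len()]);
                    rest = &rest[w.len()..];
                }
                None => {
                    // Unmatched run: extends to the next position where a known word starts.
                    let end = rest
                        .char_indices()
                        .map(|(i, _)| i)
                        .skip(1)
                        .find(|&i| self.known.iter().any(|w| rest[i..].starts_with(*w)))
                        .unwrap_or(rest.len());
                    match fallback {
                        Some(f) => f.cut_unmatched(&rest[..end], &mut words),
                        None => words.push(&rest[..end]),
                    }
                    rest = &rest[end..];
                }
            }
        }
        words
    }
}
```

The caller decides per call whether a fallback applies, so the "hmm or not" question becomes "did you pass a fallback or not", which is self-describing at the call site.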

And then the calling code would pass in an HmmModel, decoupling the HmmModel state from the Jieba dictionary state.

As a variant of the above, you could consider doing

```rust
fn cut<'a>(&self, sentence: &'a str) -> Vec<&'a str> { ... }
fn hmm_cut<'a>(&self, sentence: &'a str, hmm_model: &HmmModel) -> Vec<&'a str> { ... }
```

But you'd have to add an hmm_xxx variant for each of the cut methods.

Lastly, you could imagine doing a hybrid where there is a

```rust
trait Cutter {
    fn cut<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn cut_all<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn cut_for_search<'a>(&self, sentence: &'a str) -> Vec<&'a str>;
    fn tag<'a>(&'a self, sentence: &'a str) -> Vec<Tag>;
    fn tokenize<'a>(&self, sentence: &'a str, mode: TokenizeMode) -> Vec<Token<'a>>;
}
```

Then Jieba would implement Cutter but also expose a

```rust
fn cut_with_fallback<'a>(&self, sentence: &'a str, cutter: &dyn NoMatchCutter) -> Vec<&'a str> { ... }
fn cut_all_with_fallback<'a>(&self, sentence: &'a str, cutter: &dyn NoMatchCutter) -> Vec<&'a str> { ... }
```

And you could have a

```rust
struct HmmFallbackCutter<'a> {
    jieba: &'a Jieba,
    hmm_model: &'a HmmModel,
}

/// WARNING: This should stay a very thin adapter that is easily constructed.
/// Do not add state or complex initialization or it will make it difficult for
/// other languages to interface due to the lifetime bound.
impl<'a> HmmFallbackCutter<'a> {
    pub fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        self.jieba.cut_with_fallback(sentence, self.hmm_model)
    }
    ...
}
```
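For concreteness, here is a compilable end-to-end toy of that adapter shape. `Jieba`, `HmmModel`, and `cut_with_fallback` are placeholder stubs (the stub just splits on whitespace), not the real jieba-rs types; the point is only that the adapter borrows both pieces of state and costs nothing to build.

```rust
// Placeholder stubs, not the real jieba-rs types.
struct HmmModel;
struct Jieba;

impl Jieba {
    // Stand-in for a fallback-aware cut; a real one would consult the
    // dictionary and hand unmatched runs to the HMM model.
    fn cut_with_fallback<'s>(&self, sentence: &'s str, _hmm: &HmmModel) -> Vec<&'s str> {
        sentence.split_whitespace().collect()
    }
}

/// Thin adapter: borrows both, owns nothing, trivially cheap to construct.
struct HmmFallbackCutter<'a> {
    jieba: &'a Jieba,
    hmm_model: &'a HmmModel,
}

impl<'a> HmmFallbackCutter<'a> {
    fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        self.jieba.cut_with_fallback(sentence, self.hmm_model)
    }
}
```

Calling code can build the adapter on the stack right before use and let it drop immediately after, so the lifetime bound never has to escape the call site.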

The last one, with the proxy pattern, feels the "cleanest".
The first one, with the Option<NoMatchCutter> parameter, seems the simplest.

Both seem better than passing an opaque true/false to the cut() function, where you can't tell at first glance whether hmm somehow affects the dictionary lookup or only applies to terms that were not found in the dictionary.
