Public Rust API #28

Open
eiennohito opened this issue Sep 21, 2021 · 4 comments
Comments

@eiennohito
Collaborator

eiennohito commented Sep 21, 2021

We want to design the public API so that Sudachi can be used as follows.
The syntax may not be strictly valid, and all names are open for discussion.

let model = JapaneseModel::from_cfg("...")?;
let mut analyzer = model.new_analyzer();

for line in data {
  // analyze() splits the line into sentences and analyzes each of them
  for sentence in analyzer.analyze(line)? {
    for token in sentence {
      println!("{}", token.surface);
    }
  }
}

Key points of the API

  • Model should be immutable and safe to share between threads.
  • It contains the dictionary, the connection data, and the constant data needed for preprocessing.
  • Analyzer contains the mutable data for analysis (e.g. the lattice) and tries to reuse allocations as much as possible.
  • In the long run, the analyzer should perform O(1) allocations.

Because of the Python API and lifetime considerations, Model should be a thin wrapper around Arc<RealModel> or something similar.
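
The split between a shareable model and a per-thread analyzer could look roughly like the sketch below. It is only an illustration: the fields of RealModel, the count_known method and the count_in_threads helper are hypothetical stand-ins, not part of the proposal.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the immutable data (dictionary, connection data, ...).
struct RealModel {
    words: Vec<String>,
}

/// Thin handle; cloning it only bumps the reference count.
#[derive(Clone)]
struct Model(Arc<RealModel>);

/// Mutable per-thread state; the buffer is reused between calls.
struct Analyzer {
    model: Model,
    buffer: Vec<usize>, // stand-in for the lattice
}

impl Model {
    fn new(words: Vec<String>) -> Model {
        Model(Arc::new(RealModel { words }))
    }

    fn new_analyzer(&self) -> Analyzer {
        Analyzer { model: self.clone(), buffer: Vec::new() }
    }
}

impl Analyzer {
    /// Toy "analysis": counts whitespace-separated tokens known to the model.
    /// `clear()` keeps the capacity, so repeated calls stop allocating once
    /// the buffer has grown to fit the longest input seen so far.
    fn count_known(&mut self, input: &str) -> usize {
        self.buffer.clear();
        for (i, tok) in input.split_whitespace().enumerate() {
            if self.model.0.words.iter().any(|w| w == tok) {
                self.buffer.push(i);
            }
        }
        self.buffer.len()
    }
}

/// Shares one model between several threads, each with its own analyzer.
fn count_in_threads(model: &Model, inputs: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let mut analyzer = model.new_analyzer();
            thread::spawn(move || analyzer.count_known(&input))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because the handle owns an Arc, it carries no lifetime parameter, which is what makes it easy to hand to Python bindings.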

Layering

We have a Rust API and a Python API with different lifetime considerations.
The Rust API should use lifetimes to safeguard against misuse and mostly use references for sharing data. Python, on the other hand, can't use Rust lifetimes and should mostly use Arc for sharing data.

The design proposal is to have pointer-generic internals with thin wrappers for the API types; the wrappers exist mostly to instantiate concrete types.
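
One possible reading of "pointer-generic internals" is an internal type that is generic over anything that dereferences to the dictionary data, with type aliases as the thin wrappers. The sketch below assumes that reading; DictData, TokenizerImpl, lookup and both aliases are hypothetical names.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Hypothetical immutable dictionary data.
struct DictData {
    entries: Vec<String>,
}

/// Internal tokenizer, generic over how it points at the dictionary.
struct TokenizerImpl<D: Deref<Target = DictData>> {
    dict: D,
    lookups: usize, // stand-in for mutable tokenizer state
}

impl<D: Deref<Target = DictData>> TokenizerImpl<D> {
    fn new(dict: D) -> Self {
        TokenizerImpl { dict, lookups: 0 }
    }

    /// Toy lookup: index of the word in the dictionary, if present.
    fn lookup(&mut self, word: &str) -> Option<usize> {
        self.lookups += 1;
        self.dict.entries.iter().position(|e| e == word)
    }
}

/// Rust-facing instantiation: borrows the dictionary, checked by lifetimes.
type BorrowedTokenizer<'a> = TokenizerImpl<&'a DictData>;

/// Python-facing instantiation: owns a reference-counted handle, no lifetime.
type SharedTokenizer = TokenizerImpl<Arc<DictData>>;
```

Both aliases share one implementation, so the lifetime-checked and the Arc-based variants cannot drift apart.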

API Surface (Types)

  • Dictionary - stores immutable data for tokenization
  • Tokenizer - stores mutable state for tokenization
  • InputBuffer - handles zero-copy input, sentence splitting and streaming of input data (eventually)
  • MorphemeList - analysis result of a single block of input data
  • Morpheme - unit of analysis result
@eiennohito
Collaborator Author

Names and semantics should be as close to the Java version as possible.
(comment from Takaoka-san)

@eiennohito
Collaborator Author

eiennohito commented Sep 21, 2021

TL;DR

A nice API for multiple sentences is currently blocked in stable Rust by the in-progress GATs (generic associated types) feature. See also http://lukaskalbertodt.github.io/2018/08/03/solving-the-generalized-streaming-iterator-problem-without-gats.html.

Scratch prototype impl

Want to have:

  • The analyzer keeps internal mutable state for one sentence (and no more) for performance reasons
  • The iterator over analyzed sentences needs to borrow the analyzer mutably, so that it is impossible to access the next sentence before consuming the current one

Problems:

  • Current Rust iterators can't yield items that borrow from the iterator itself without GATs, which would probably require a new-ish Iterator API as well

What to do

  • Provide a non-allocating API for
    • sentence splitting, returning an Iterator of &str
    • sentence-based analysis
  • Optionally provide an allocating combined API, which copies the needed information (mostly POS) from the analyzer
  • Another option is to implement an Iterator-like pattern without the standard library traits to iterate over analyzed sentences (consumed in a while loop).
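
The last option (an Iterator-like type that does not implement the standard trait) can be sketched as follows. All names are hypothetical, and the splitting and "analysis" are toys: a sentence ends after '。', and every character becomes a token.

```rust
/// Hypothetical analyzer that holds the result of exactly one sentence.
struct Analyzer {
    tokens: Vec<String>, // reused for every sentence
}

/// Cursor over the sentences of one input; borrows the analyzer mutably.
struct Sentences<'a, 'i> {
    analyzer: &'a mut Analyzer,
    rest: &'i str,
}

impl Analyzer {
    fn new() -> Analyzer {
        Analyzer { tokens: Vec::new() }
    }

    fn analyze<'a, 'i>(&'a mut self, input: &'i str) -> Sentences<'a, 'i> {
        Sentences { analyzer: self, rest: input }
    }
}

impl<'a, 'i> Sentences<'a, 'i> {
    /// Similar to `Iterator::next`, but the result borrows from `self`, so the
    /// previous sentence must be released before the next one is requested.
    fn next_sentence(&mut self) -> Option<&[String]> {
        if self.rest.is_empty() {
            return None;
        }
        // Toy splitter: a sentence ends right after the first '。'.
        let end = self
            .rest
            .find('。')
            .map_or(self.rest.len(), |i| i + '。'.len_utf8());
        let (sentence, rest) = self.rest.split_at(end);
        self.rest = rest;
        // Toy analysis: one token per character, written into the reused buffer.
        self.analyzer.tokens.clear();
        self.analyzer.tokens.extend(sentence.chars().map(|c| c.to_string()));
        Some(self.analyzer.tokens.as_slice())
    }
}

/// Consumes the cursor in a `while let` loop; returns token counts per sentence.
fn token_counts(input: &str) -> Vec<usize> {
    let mut analyzer = Analyzer::new();
    let mut sentences = analyzer.analyze(input);
    let mut counts = Vec::new();
    while let Some(tokens) = sentences.next_sentence() {
        counts.push(tokens.len());
    }
    counts
}
```

The price of this pattern is that for loops and iterator adapters are unavailable; callers have to write the while let loop themselves.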

@eiennohito
Collaborator Author

eiennohito commented Sep 22, 2021

Splitting the API into sentence splitter / analysis

for sentence in analyzer.split_sentences(line)? {
  let result = analyzer.analyze_sentence(sentence)?;
  for token in result.tokens() {
    // process token
  }
}
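
A compilable toy version of this split is sketched below. It makes one assumption that differs from the snippet above: the splitter is a free function, so its iterator borrows only the input line and the analyzer stays free for the mutable borrow in analyze_sentence. The types and the error type (String) are hypothetical.

```rust
/// Splits `text` after each '。' without copying; yields sub-slices of `text`.
fn split_sentences(text: &str) -> impl Iterator<Item = &str> {
    text.split_inclusive('。').filter(|s| !s.trim().is_empty())
}

/// Result of one sentence; borrows the analyzer's reused buffer.
struct SentenceResult<'a> {
    tokens: &'a [String],
}

impl<'a> SentenceResult<'a> {
    fn tokens(&self) -> std::slice::Iter<'a, String> {
        self.tokens.iter()
    }
}

/// Hypothetical analyzer with a buffer that is reused for every sentence.
struct Analyzer {
    buffer: Vec<String>,
}

impl Analyzer {
    fn new() -> Analyzer {
        Analyzer { buffer: Vec::new() }
    }

    /// Toy analysis: one token per character of the sentence.
    fn analyze_sentence(&mut self, sentence: &str) -> Result<SentenceResult<'_>, String> {
        if sentence.is_empty() {
            return Err("empty sentence".to_string());
        }
        self.buffer.clear();
        self.buffer.extend(sentence.chars().map(|c| c.to_string()));
        Ok(SentenceResult { tokens: &self.buffer })
    }
}

/// Mirrors the loop above and copies the surfaces out of each result.
fn surfaces_per_sentence(line: &str) -> Result<Vec<Vec<String>>, String> {
    let mut analyzer = Analyzer::new();
    let mut out = Vec::new();
    for sentence in split_sentences(line) {
        let result = analyzer.analyze_sentence(sentence)?;
        out.push(result.tokens().cloned().collect());
    }
    Ok(out)
}
```

Each result must be dropped before the next analyze_sentence call, which the borrow checker enforces because the result borrows the analyzer mutably.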

@eiennohito eiennohito added this to the 0.1 milestone Oct 1, 2021
@eiennohito eiennohito removed this from the 0.1 milestone Oct 8, 2021
@eiennohito
Collaborator Author

eiennohito commented Nov 8, 2021

Morpheme's part_of_speech should not return an Option of the POS array; it should instead panic when given an invalid POS id.
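
A minimal sketch of that behavior, assuming the POS table is a list indexed by POS id (Grammar, pos_list and the field layout of Morpheme are hypothetical):

```rust
/// Hypothetical POS table: one entry (list of POS components) per POS id.
struct Grammar {
    pos_list: Vec<Vec<String>>,
}

/// Hypothetical morpheme that only stores its POS id and a grammar reference.
struct Morpheme<'a> {
    grammar: &'a Grammar,
    pos_id: u16,
}

impl<'a> Morpheme<'a> {
    /// Returns the POS components directly instead of an Option.
    /// Panics if `pos_id` is not a valid index into the POS table.
    fn part_of_speech(&self) -> &'a [String] {
        self.grammar
            .pos_list
            .get(self.pos_id as usize)
            .unwrap_or_else(|| panic!("invalid POS id: {}", self.pos_id))
    }
}
```

Callers then get a plain slice and do not have to unwrap on every access.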

@eiennohito eiennohito changed the title from "Public API" to "Public Rust API" Feb 18, 2022