Support languages that need a tokenizer (Chinese, Japanese, etc.) #123

Open
eromoe opened this issue Feb 6, 2017 · 7 comments
@eromoe

eromoe commented Feb 6, 2017

I think iepy needs a common interface for plugging in a tokenizer, to support languages like Chinese and Japanese.

There is an older IE project with a GUI named GATE. It contains a pre-trained model and dataset, which may be helpful:
https://gate.ac.uk/sale/tao/splitch15.html#sec:misc-creole:language-plugins:chinese

@francolq
Contributor

francolq commented Feb 7, 2017

Hello. The preprocessing pipeline can be customized to introduce a different tokenizer. See for instance:

https://github.com/awolfmann/PLN-2015/blob/practico4/information_extraction/resoluciones-unc/bin/preprocess.py

@eromoe
Author

eromoe commented Feb 8, 2017

Hello @francolq,
I have seen how to customise it in the docs:

    pipeline = PreProcessPipeline([
        CustomTokenizer(),
        CustomSentencer(),
        CustomLemmatizer(),
        CustomPOSTagger(),
        CustomNER(),
        CustomSegmenter(),
    ], docs)
    pipeline.process_everything()

Then I looked into the code. preprocess.tokenizer.TokenizeSentencerRunner does not seem to be used anywhere. I also found:

  • one pipeline may have multiple runners
  • a runner may or may not have a step

As far as I can see, it is not as simple as adding a tokenizer, since some runners are related to each other. It is a little hard to customise without knowing the input and output format of each runner and step, and the design principles of the runner API. (Currently I have to read the code and try to understand what it does, but because of my limited knowledge and English, I may get stuck in places.) I would like to help make iepy compatible with CJK languages if anyone could describe the API principles for writing runners. @machinalis @jmansilla
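
As an illustration of the kind of step being asked about, here is a minimal sketch in Python of a callable tokenizer step for Chinese text. It does not import iepy: `FakeDocument`, `MaxMatchTokenizerRunner`, the field names, and the representation of sentences as token-index boundaries are all assumptions made for this example, not iepy's actual API, so the real document methods would need to be checked in the iepy source. The segmentation is plain forward maximum matching against a tiny hand-made lexicon, standing in for a real segmenter.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for an iepy document (field names are assumed)."""
    text: str
    tokens: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    sentences: list = field(default_factory=list)


class MaxMatchTokenizerRunner:
    """Callable step that tokenizes doc.text by forward maximum matching.

    Hypothetical class: it only mimics the idea of "a runner is called
    with a document and stores its result on that document".
    """

    SENTENCE_END = set("。！？!?")

    def __init__(self, lexicon, max_len=4):
        self.lexicon = set(lexicon)
        self.max_len = max_len

    def tokenize(self, text):
        """Return a list of (character offset, token) pairs, skipping whitespace."""
        pairs = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; a single character always matches.
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in self.lexicon:
                    pairs.append((i, candidate))
                    i += size
                    break
        return pairs

    def __call__(self, doc):
        pairs = self.tokenize(doc.text)
        doc.offsets = [offset for offset, _ in pairs]
        doc.tokens = [token for _, token in pairs]
        # Assumed format: token indexes where sentences start, plus the final end.
        bounds = []
        if doc.tokens:
            bounds.append(0)
            for idx, token in enumerate(doc.tokens):
                if token in self.SENTENCE_END and idx + 1 < len(doc.tokens):
                    bounds.append(idx + 1)
            bounds.append(len(doc.tokens))
        doc.sentences = bounds
        return doc
```

For example, calling `MaxMatchTokenizerRunner({"我们", "喜欢"})(FakeDocument("我们喜欢"))` fills `tokens` with the two lexicon words. In practice the lexicon lookup would be replaced by a call to a proper Chinese or Japanese segmenter, and the results would be stored through whatever methods the real document class provides.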

@jmansilla
Contributor

Sorry for the delay in replying. Can I still help here, @eromoe?

@YanWenqiang

@eromoe Right now I want to adapt iepy to Chinese. Could you give me a hand?

@eromoe
Author

eromoe commented Sep 25, 2017

@YanWenqiang Sorry, I only needed the annotator and object binding from iepy. Since it was not easy to integrate Chinese, I have already built my own.

@YanWenqiang

@eromoe All right, thanks a lot. I have run into the same problem, and I really need someone who could help me.

@hwaking

hwaking commented Dec 9, 2017

@eromoe I am doing information extraction on Chinese EMR data. Can I use iepy for entity relation extraction?
