ouwei2013/medqa_colbert

ColBERT trained on the Chinese cmedqa-v2

This repo is a copy of the original ColBERT repo by the Stanford NLP research team.

I made some minor changes to the original code to make it compatible with Chinese BERT.

I trained ColBERT on the cmedqa-v2 dataset, and it performed well.

I deployed the model on a HuggingFace Space: https://huggingface.co/spaces/diagaiwei/ir_chinese_medqa

If you want to train it on your own data, organize your data in the following format:

  • Three files are required: queries.tsv, doc.tsv, and triplets.jsonl
    • queries.tsv should look like this (no header row):
      query_1_id \t 高血压吃什么药?
      query_2_id \t 糖尿病吃什么药?

    • doc.tsv should look like this (no header row):
      doc_1_id \t 高血压应该吃药A
      doc_2_id \t 糖尿病应该吃药B

    • each line of triplets.jsonl is a JSON list of the form [query_id, pos_doc_id, neg_doc_id], for example:
      [query_1_id, doc_1_id, doc_2_id]
      [query_1_id, doc_1_id, doc_3_id]
      [...]
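The three files above can be generated from any existing QA pairs. A minimal sketch (the ids and texts below are made-up examples, not part of cmedqa-v2):

```python
# Write queries.tsv, doc.tsv, and triplets.jsonl in the expected format:
# headerless TSVs of "id \t text", and one JSON list per line for triplets.
import json

queries = {"q1": "高血压吃什么药?", "q2": "糖尿病吃什么药?"}
docs = {"d1": "高血压应该吃药A", "d2": "糖尿病应该吃药B"}
# Each triplet is [query_id, positive_doc_id, negative_doc_id].
triplets = [["q1", "d1", "d2"], ["q2", "d2", "d1"]]

with open("queries.tsv", "w", encoding="utf-8") as f:
    for qid, text in queries.items():
        f.write(f"{qid}\t{text}\n")

with open("doc.tsv", "w", encoding="utf-8") as f:
    for did, text in docs.items():
        f.write(f"{did}\t{text}\n")

with open("triplets.jsonl", "w", encoding="utf-8") as f:
    for triplet in triplets:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
```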

After preparing the data, run python colbert_train.py to train the model.
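Before launching training, it can save time to sanity-check that every triplet references ids that actually exist in queries.tsv and doc.tsv. A minimal sketch (file names follow the format above; `validate` is a hypothetical helper, not part of this repo):

```python
# Check that every triplet in triplets.jsonl references valid query/doc ids.
import json

def load_ids(tsv_path):
    """Return the set of ids in the first column of a headerless TSV."""
    with open(tsv_path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0] for line in f if line.strip()}

def validate(queries_path="queries.tsv", docs_path="doc.tsv",
             triplets_path="triplets.jsonl"):
    """Raise AssertionError on the first bad triplet; return triplet count."""
    qids, dids = load_ids(queries_path), load_ids(docs_path)
    count = 0
    with open(triplets_path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            qid, pos, neg = json.loads(line)
            assert qid in qids, f"line {n}: unknown query id {qid}"
            assert pos in dids, f"line {n}: unknown positive doc id {pos}"
            assert neg in dids, f"line {n}: unknown negative doc id {neg}"
            count = n
    return count
```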

After training, run python colbert_index.py to build the search index.

After indexing, run python colbert_search.py to test the model.
