How to run a pretrained model on unlabeled data? #30

Open
serenalotreck opened this issue Sep 13, 2021 · 5 comments

@serenalotreck

Hi,

I'm looking to apply your pretrained models to a new, unlabeled dataset. I have my dataset in DyGIE format. Looking at the script, it's unclear to me how to do this, because there are only two blocks of code in the script. The first is if args.do_train:, where the model is trained, and the second is if args.do_eval:, where the model is evaluated.

I don't want to train, since I'm using a pre-trained model, but I also don't want to evaluate, since my data don't have labels, which makes my use case different than the example of applying the pretrained scibert models to the scierc dataset.

Wondering if you have pointers on how to do this?

Thanks!

@a3616001
Member

Hi! I guess the easiest way for you to do this is to still create the "ner" and "relations" fields in your unlabeled dataset, but set them to be empty for each sentence. For example, if a document contains 4 sentences, you can set "ner" and "relations" as {..., "ner": [[], [], [], []], "relations": [[], [], [], []], ...}. After that, you can use --do_eval to generate the prediction file (and ignore the evaluation results in that case).

Thanks for pointing this out! I plan to add a --do_predict feature soon. For now, I think this could be an easy way to do only prediction.
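
For anyone who wants to script the workaround above, here is a minimal sketch of adding the empty fields. It assumes the input is a DyGIE-style file with one JSON document per line and a "sentences" field holding a list of tokenized sentences; the function names add_empty_labels and convert_file are hypothetical helpers for illustration, not part of this repository.

```python
import json
from pathlib import Path


def add_empty_labels(doc: dict) -> dict:
    """Return a copy of a document with empty "ner" and "relations" fields.

    Assumes doc["sentences"] is a list of tokenized sentences; one empty
    list is created per sentence for each field.
    """
    n_sentences = len(doc["sentences"])
    labeled = dict(doc)  # shallow copy so the input dict is not modified
    labeled["ner"] = [[] for _ in range(n_sentences)]
    labeled["relations"] = [[] for _ in range(n_sentences)]
    return labeled


def convert_file(src: Path, dst: Path) -> int:
    """Read one JSON document per line from src and write the labeled
    version to dst. Returns the number of documents written."""
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            fout.write(json.dumps(add_empty_labels(json.loads(line))) + "\n")
            count += 1
    return count
```

Calling convert_file(Path("unlabeled.json"), Path("dev.json")) (both paths are placeholders) would produce a file that can then be passed through --do_eval as described above; the evaluation scores it reports are meaningless because the gold labels are empty.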

@serenalotreck
Author

I just wanted to check in to see whether you thought the --do_predict feature would be available soon!

@serenalotreck
Author

serenalotreck commented Jan 17, 2022

Just wanted to leave an update for anyone trying this: your data should be in a file called dev.json. I originally had mine in test.json and couldn't get it to work, but it ran once I renamed it to dev.json!

Edit: I had typed test.dev, but it should be test.json

@Hubotcoder

Hi,
Allow me to ask a simple question: what is doc_key? According to 'please make sure doc_key can be used to identify a certain document', should I find any document in the sciERC processed data?

@Shike-Cheng

> Hi! I guess the easiest way for you to do this is to still create the "ner" and "relations" field in your unlabeled dataset, but set them to be empty for each sentence. For example, if a document contains 4 sentences, you can set the "ner" and "relations" as {..., "ner": [[], [], [], []], "relations":[[], [], [], []], ...}. After that, you can use --do_eval to generate the prediction file (and ignore the evaluation results in that case).
>
> Thanks for pointing this out! I plan to add a --do_predict feature soon. For now, I think this could be an easy way to do only prediction.

I would like to know whether prediction on unlabeled datasets has been added to the model yet, and where I can find the relevant code. Thank you very much.
