Aspect-classification for different data #3

Closed
kabirwalia8300 opened this issue Jul 9, 2020 · 6 comments
Comments
@kabirwalia8300 commented Jul 9, 2020

Hi again, I'm trying to use my own data set in the pipeline following the steps you have listed. I've split the documents into JSON files. When I run the argument_classification script, I get the following error:

$ python argument_classification.py --topic culture --index arguana
Start classifying sentences for topic "culture" from doc_id_start 0 with MAX_FILE_SIZE 200000, and FILTER_TOPIC set "True"". Writing to ../../training_data/arguana/culture/
0%| | 0/170 [00:00<?, ?it/s]
string indices must be integers
Crashed at doc_id 0

This error tends to arise from indexing with a string key incorrectly, e.g. treating a raw JSON string as if it were an already-parsed dictionary. Is there anything I need to change in the construction of the JSON files?
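For illustration, this is roughly how the error shows up when a raw JSON string is indexed as if it were a parsed dict (a minimal sketch, not the actual pipeline code):

import json

# Minimal sketch: indexing the raw JSON *string* instead of the parsed dict.
doc = '{"sents": ["first sentence", "second sentence"]}'   # raw file contents
# doc["sents"]              # TypeError: string indices must be integers
parsed = json.loads(doc)    # parse first ...
print(parsed["sents"])      # ... then string keys work as expected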

EDIT
Fixed the issue. I was creating the JSON files incorrectly.

import json
import spacy

# Load the English model once; spaCy does the sentence splitting.
nlp = spacy.load('en_core_web_sm')

def makeFile(lst):
    # Write the sentence list in the format the pipeline expects: {"sents": [...]}
    d = {'sents': lst}
    with open('doc.json', 'w') as filehandle:
        json.dump(d, filehandle)

def makeSentList(var):
    # Split the raw document text into a list of sentence strings.
    about_doc = nlp(var)
    sentences = list(about_doc.sents)
    sentences = [str(x) for x in sentences]
    return sentences

One can use the above functions to make a JSON file for this task: var is the document text to process, and lst is the output of makeSentList.
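A minimal usage sketch (the file name my_document.txt is only a placeholder):

# Hypothetical end-to-end usage: read one raw document and write doc.json.
with open('my_document.txt') as f:
    raw_text = f.read()

makeFile(makeSentList(raw_text))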

@v1nc3nt27
Contributor

Hi, thanks for sharing the solution; this should create documents in the correct format. Is the pipeline working now?

@kabirwalia8300
Author

Hello. I'm running into RAM-related memory issues while fine-tuning the model. I'm working on Colab with a 25GB RAM allocation. Any suggestions on changing the TFRecord pipelining for training? And what number of iterations (and hence batch size) would you recommend?

@v1nc3nt27
Contributor

Hey, with 25GB a batch size of 3 should work fine.

Regarding your second question, do you mean the number of iteration steps? That depends on how many samples you have. If you have lots of samples (100k+), I'd suggest running just one epoch first and seeing how the results turn out.
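As a rough sketch of how batch size and sample count translate into steps per epoch (the sample count below is an assumed example):

# Back-of-the-envelope steps-per-epoch calculation (assumed numbers).
num_samples = 100_000        # assumed training-set size
batch_size = 3               # as suggested above for ~25GB RAM
steps_per_epoch = -(-num_samples // batch_size)   # ceiling division
print(steps_per_epoch)       # -> 33334 steps for one full epoch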

@kabirwalia8300
Author

I see, thank you for your response. I think I'm not facing a memory issue anymore, but I got the following:

FailedPreconditionError: Error while reading resource variable encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad/N10tensorflow3VarE does not exist.

I looked it up and found the solution(s) given in this issue: tensorflow/tensorflow#28287

Since the training here is a little different, can you suggest what would be the right way to apply the solution given there?

@v1nc3nt27
Contributor

Hey, do you mean you changed the training code? Without the context or seeing the modified code, I can't really help with this. If this appears without changing the code, please make sure you have the correct TF version given in requirements.txt.
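A quick way to check the installed version against requirements.txt (a trivial sketch):

# Print the installed TensorFlow version to compare against requirements.txt.
import tensorflow as tf
print(tf.__version__)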

@kabirwalia8300
Author

Hey, thank you for the feedback. I was training on Colab and forgot to apply the patch. That fixed it.
