Aspect-classification for different data #3
Hi, thanks for sharing the solution; this should create documents in the correct format. Is the pipeline working now?
Hello. I'm having trouble fine-tuning the model due to memory (RAM) issues. I'm working on Colab with a 25GB RAM allocation. Any suggestions on changing the pipelining of the TF-records for training? And what number of iterations (and hence batch size) would you recommend?
Hey, with 25GB a batch size of 3 should work fine. As for your second question, you mean the number of iteration steps? It depends on how many samples you have. If you have lots of samples (100k+), I'd suggest running just one epoch first and seeing how the results turn out.
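The relationship between these numbers can be sketched quickly (the 100k sample count below is just an illustration taken from the advice above, not a value from this project):

```python
import math

num_samples = 100_000   # assumed sample count, for illustration only
batch_size = 3          # fits in ~25 GB on Colab per the advice above

# Iteration steps needed to see every sample once (one epoch).
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)  # 33334
```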
I see. Thank you for your response. I think I'm no longer facing a memory issue, but I got the following:
FailedPreconditionError: Error while reading resource variable encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad/N10tensorflow3VarE does not exist.
I looked it up and found the solution(s) given in this issue: tensorflow/tensorflow#28287. Since the training here is a little different, can you suggest the right way to apply that solution?
Hey, you mean you changed the training code? Without context or seeing the modified code I can't really help with this. If the error appears without changing the code, please make sure you have the correct TensorFlow version given in requirements.txt.
Hey, thank you for the feedback. I was training on Colab and forgot to apply the patch; applying it fixed the problem.
Hi again. I'm trying to use my own dataset in the pipeline, following the steps you have listed, and have split the documents into JSON files. When I run the argument-classification script, I get the following error:
$ python argument_classification.py --topic culture --index arguana
Start classifying sentences for topic "culture" from doc_id_start 0 with MAX_FILE_SIZE 200000, and FILTER_TOPIC set "True". Writing to ../../training_data/arguana/culture/
0%| | 0/170 [00:00<?, ?it/s]
string indices must be integers
Crashed at doc_id 0
This error tends to arise from indexing a string as if it were a dictionary/JSON object. Is there anything I need to change in how I construct the JSON files?
EDIT
Fixed the issue. I was creating the JSON files incorrectly.
```python
import json

import spacy

nlp = spacy.load('en_core_web_sm')

def makeFile(lst):
    # Dump the sentence list in the {"sents": [...]} shape
    # the classification script expects.
    d = {'sents': lst}
    with open('doc.json', 'w') as filehandle:
        json.dump(d, filehandle)

def makeSentList(var):
    # Segment the document into sentences with spaCy.
    about_doc = nlp(var)
    sentences = [str(x) for x in about_doc.sents]
    return sentences
```
One can use the above functions to make a JSON file for this task: var is the document to process, and lst is the output of makeSentList.
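Putting the two steps together, a minimal end-to-end sketch (here a naive regex split stands in for spaCy's segmenter so the snippet is self-contained, and the output path is a hypothetical temp-file location):

```python
import json
import os
import re
import tempfile

def make_sent_list(text):
    # Naive stand-in for spaCy's sentence segmenter: split on
    # sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def make_file(sents, path):
    # Write the sentences in the {"sents": [...]} shape the
    # argument-classification script reads.
    with open(path, 'w') as fh:
        json.dump({'sents': sents}, fh)

doc = "Culture shapes policy. Policy shapes culture."
path = os.path.join(tempfile.gettempdir(), 'doc.json')
make_file(make_sent_list(doc), path)

with open(path) as fh:
    print(json.load(fh)['sents'])
# -> ['Culture shapes policy.', 'Policy shapes culture.']
```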