Aspect-classification for different data #3

Closed
kabirwalia8300 opened this issue Jul 9, 2020 · 6 comments
Comments
@kabirwalia8300 commented Jul 9, 2020

Hi again, I'm trying to use my own data set in the pipeline following the steps you have listed. I've split the documents into JSON files. When I run the argument_classification script, I get the following error:

$ python argument_classification.py --topic culture --index arguana
Start classifying sentences for topic "culture" from doc_id_start 0 with MAX_FILE_SIZE 200000, and FILTER_TOPIC set "True"". Writing to ../../training_data/arguana/culture/
0%| | 0/170 [00:00<?, ?it/s]
string indices must be integers
Crashed at doc_id 0

This error tends to arise from indexing with a string key incorrectly, e.g. treating a raw JSON string as if it were an already-parsed dictionary. Is there anything I need to change in the construction of the JSON files?
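For illustration, this is roughly how the error shows up when a raw JSON string is indexed as if it were a parsed dict (a minimal sketch, not the actual pipeline code):

import json

# Minimal sketch: indexing the raw JSON *string* instead of the parsed dict.
doc = '{"sents": ["first sentence", "second sentence"]}'   # raw file contents
# doc["sents"]              # TypeError: string indices must be integers
parsed = json.loads(doc)    # parse first ...
print(parsed["sents"])      # ... then string keys work as expected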

EDIT
Fixed the issue. I was creating the JSON files incorrectly.

import json
import spacy

# Load the English model once; spaCy does the sentence splitting.
nlp = spacy.load('en_core_web_sm')

def makeFile(lst):
    # Write the sentence list in the format the pipeline expects: {"sents": [...]}
    d = {'sents': lst}
    with open('doc.json', 'w') as filehandle:
        json.dump(d, filehandle)

def makeSentList(var):
    # Split the raw document text into a list of sentence strings.
    about_doc = nlp(var)
    sentences = list(about_doc.sents)
    sentences = [str(x) for x in sentences]
    return sentences

One can use the above functions to make a JSON file for this task: var is the document text to process, and lst is the output of makeSentList.
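A minimal usage sketch (the file name my_document.txt is only a placeholder):

# Hypothetical end-to-end usage: read one raw document and write doc.json.
with open('my_document.txt') as f:
    raw_text = f.read()

makeFile(makeSentList(raw_text))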

@v1nc3nt27
Contributor

Hi, thanks for sharing the solution; this should create documents in the correct format. Is the pipeline working now?

@kabirwalia8300
Author

Hello. I'm running into RAM-related memory issues while fine-tuning the model. I'm working on Colab with a 25GB RAM allocation. Any suggestions on changing the TFRecord pipelining for training? And what number of iterations (and hence batch size) would you recommend?

@v1nc3nt27
Contributor

Hey, with 25GB a batch size of 3 should work fine.

Regarding your second question, do you mean the number of iteration steps? That depends on how many samples you have. If you have lots of samples (100k+), I'd suggest running just one epoch first and seeing how the results turn out.
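As a rough sketch of how batch size and sample count translate into steps per epoch (the sample count below is an assumed example):

# Back-of-the-envelope steps-per-epoch calculation (assumed numbers).
num_samples = 100_000        # assumed training-set size
batch_size = 3               # as suggested above for ~25GB RAM
steps_per_epoch = -(-num_samples // batch_size)   # ceiling division
print(steps_per_epoch)       # -> 33334 steps for one full epoch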

@kabirwalia8300
Author

I see, thank you for your response. I think I'm not facing a memory issue anymore, but I got the following:

FailedPreconditionError: Error while reading resource variable encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/encoder_layer_7/multi_head_attention_7/dense_44/kernel/Adagrad/N10tensorflow3VarE does not exist.

I looked it up and found the solution(s) given in this issue: tensorflow/tensorflow#28287

Since the training here is a little different, can you suggest what would be the right way to apply the solution given there?

@v1nc3nt27
Contributor

Hey, do you mean you changed the training code? Without the context or seeing the modified code, I can't really help with this. If this appears without changing the code, please make sure you have the correct TF version given in requirements.txt.
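A quick way to check the installed version against requirements.txt (a trivial sketch):

# Print the installed TensorFlow version to compare against requirements.txt.
import tensorflow as tf
print(tf.__version__)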

@kabirwalia8300
Author

Hey, thank you for the feedback. I was training on Colab and forgot to apply the patch. That fixed it.
