Hit this warning: "Token indices sequence length is longer than the specified maximum sequence length for this model (3015 > 1024). Running this sequence through the model will result in indexing errors."
Fixed it by passing truncation=True (with max_length=1024) to the tokenizer so inputs get clipped to the model's limit.
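A minimal sketch of the fix, assuming a Hugging Face tokenizer (the model name and the sample text here are stand-ins; the 1024 limit matches GPT-2):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model; its limit is 1024
    long_text = "word " * 5000  # placeholder for a real over-long input

    enc = tokenizer(
        long_text,
        truncation=True,      # clip anything past max_length instead of warning
        max_length=1024,      # the model's maximum sequence length
        return_tensors="pt",
    )
    print(enc["input_ids"].shape)  # torch.Size([1, 1024]), no indexing errors downstream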
---
The hidden states come back with a different length per input: (1, 1024, 768) for one sequence, (1, 7, 768) for another. Taking the mean across the varying (token) axis gives a fixed-size feature vector:
data_list[0][4].mean(axis=1).shape
gives (1, 768) every time!
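A quick sketch of that pooling step, using random NumPy arrays as stand-ins for the hidden states (the shapes are the ones I observed above):

    import numpy as np

    # Hidden states are (batch, tokens, hidden); only the token axis varies.
    h_long = np.random.randn(1, 1024, 768)   # stand-in for a truncated long input
    h_short = np.random.randn(1, 7, 768)     # stand-in for a short input

    # Mean over axis 1 collapses the varying token axis to one fixed-size vector.
    print(h_long.mean(axis=1).shape)   # (1, 768)
    print(h_short.mean(axis=1).shape)  # (1, 768)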
---
Linear regression on 18 examples gets 50% accuracy, which is chance level for two classes, so pretty poor performance.
Let's bump the data up to more than 18 examples, then start looking at more flexible models!
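For reference, a minimal baseline sketch with scikit-learn. I'm substituting logistic regression as the linear classifier here, since plain least-squares regression doesn't output class labels directly, and the data is a random stand-in for the pooled vectors:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-in data: in the real run, X holds one mean-pooled 768-dim vector per tweet.
    X = np.random.randn(18, 768)
    y = np.array([0, 1] * 9)  # balanced binary labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=6, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))  # hovers around chance with this little data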
---
I cleaned up the data to make sure I was splitting each line properly and had a nice balanced dataset. Then I double-checked the data to make sure it was loaded and labeled correctly.
There was a strange issue: I had initialized a list of empty lists as the structure for aggregating my feature vectors, but as I looped through each class, appends landed in BOTH lists instead of just the corresponding class's list, so I had duplicate data for each class.
I had never seen that before. I fixed it by using a dictionary to aggregate each class separately. That's nice because it scales to more classes with no code change beyond adding a new key to the dict.
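A sketch of the likely culprit: I'm assuming the list of empty lists was built with list multiplication, which is the classic way to end up with two names for the same inner list. The dict version below is the structure I switched to:

    # Buggy: [[]] * 2 puts TWO REFERENCES to the SAME inner list in the outer list,
    # so an append "for class 0" also shows up under class 1.
    features = [[]] * 2
    features[0].append("vec_a")
    print(features)  # [['vec_a'], ['vec_a']]  duplicated across both classes!
    # (A per-class list comprehension, [[] for _ in range(2)], would also avoid this.)

    # Fix: a dict keyed by class label aggregates each class separately,
    # and adding a class is just adding a key.
    features = {0: [], 1: []}
    features[0].append("vec_a")
    print(features)  # {0: ['vec_a'], 1: []}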
Now: perfect accuracy on 100 tweets for each of the 2 classes.
---