Hit this warning: "Token indices sequence length is longer than the specified maximum sequence length for this model (3015 > 1024). Running this sequence through the model will result in indexing errors."
Fixed it by passing truncation=True (with max_length=1024) to the tokenizer so inputs get clipped to the model's limit.
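A minimal sketch of the fix, assuming a Hugging Face tokenizer (the model name and the sample text here are stand-ins; the 1024 limit matches GPT-2):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model; its limit is 1024
    long_text = "word " * 5000  # placeholder for a real over-long input

    enc = tokenizer(
        long_text,
        truncation=True,      # clip anything past max_length instead of warning
        max_length=1024,      # the model's maximum sequence length
        return_tensors="pt",
    )
    print(enc["input_ids"].shape)  # torch.Size([1, 1024]), no indexing errors downstream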
---
The hidden states come back with a different length per input: (1, 1024, 768) for one sequence, (1, 7, 768) for another. Taking the mean across the varying (token) axis gives a fixed-size feature vector:
data_list[0][4].mean(axis=1).shape
gives (1, 768) every time!
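A quick sketch of that pooling step, using random NumPy arrays as stand-ins for the hidden states (the shapes are the ones I observed above):

    import numpy as np

    # Hidden states are (batch, tokens, hidden); only the token axis varies.
    h_long = np.random.randn(1, 1024, 768)   # stand-in for a truncated long input
    h_short = np.random.randn(1, 7, 768)     # stand-in for a short input

    # Mean over axis 1 collapses the varying token axis to one fixed-size vector.
    print(h_long.mean(axis=1).shape)   # (1, 768)
    print(h_short.mean(axis=1).shape)  # (1, 768)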
---
Linear regression on 18 examples gets 50% accuracy, which is chance level for two classes, so pretty poor performance.
Let's bump the data up to more than 18 examples, then start looking at more flexible models!
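For reference, a minimal baseline sketch with scikit-learn. I'm substituting logistic regression as the linear classifier here, since plain least-squares regression doesn't output class labels directly, and the data is a random stand-in for the pooled vectors:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-in data: in the real run, X holds one mean-pooled 768-dim vector per tweet.
    X = np.random.randn(18, 768)
    y = np.array([0, 1] * 9)  # balanced binary labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=6, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))  # hovers around chance with this little data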
---
I cleaned up the data to make sure I was splitting each line properly and had a nice balanced dataset. Then I double-checked the data to make sure it was loaded and labeled correctly.
There was a strange issue: I had initialized a list of empty lists as the structure for aggregating my feature vectors, but as I looped through each class, appends landed in BOTH lists instead of just the corresponding class's list, so I had duplicate data for each class.
I had never seen that before. I fixed it by using a dictionary to aggregate each class separately. That's nice because it scales to more classes with no code change beyond adding a new key to the dict.
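A sketch of the likely culprit: I'm assuming the list of empty lists was built with list multiplication, which is the classic way to end up with two names for the same inner list. The dict version below is the structure I switched to:

    # Buggy: [[]] * 2 puts TWO REFERENCES to the SAME inner list in the outer list,
    # so an append "for class 0" also shows up under class 1.
    features = [[]] * 2
    features[0].append("vec_a")
    print(features)  # [['vec_a'], ['vec_a']]  duplicated across both classes!
    # (A per-class list comprehension, [[] for _ in range(2)], would also avoid this.)

    # Fix: a dict keyed by class label aggregates each class separately,
    # and adding a class is just adding a key.
    features = {0: [], 1: []}
    features[0].append("vec_a")
    print(features)  # {0: ['vec_a'], 1: []}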
Now: perfect accuracy on 100 tweets for each of the 2 classes.
---