Retraining, but accuracy is too low on unknown data. #106

Open
Sydeboy opened this issue Jul 14, 2023 · 0 comments

My configuration is as follows:
batch_size=384, epoch=20, val_check_interval=500, GPU: 3090; all other settings are the defaults.
charset_train = 62_mixed-case
charset_test = string.digits + string.ascii_lowercase + string.ascii_uppercase
I have a few questions:

  1. My dataset strictly follows your format. The labels contain only digits and uppercase and lowercase English letters. My dataset split ratio is 8:1:1. The directory layout is as follows:
data
----train
--------real
------------D001
----------------train
----------------val
----val
--------D004
----test
--------D001
--------D004

In your paper, the real training data and the validation data do not come from the same dataset. Am I right that the data under data/val is used only for validation and does not take part in training? Working on that assumption, I put my training split in the real directory, my test split in the test directory, and a different dataset in data/val. For example, I trained on D001 and placed D004 under data/val, then tested on both D001 and D004. Accuracy on D001 is high, but accuracy on D004 is very low. I don't fully understand the roles of the two val directories (data/train/real/D001/val and data/val). Could you explain them? Thank you!

  2. Can I train on all of your datasets plus my own, using charset_train=62_mixed-case?
    My concern is that in the Hugging Face demo, your pre-trained weights predicted punctuation marks for my images, but my dataset contains no punctuation.
    [two screenshots attached]

What should I do about it?
  3. Does the charset used for testing have to be 36_lowercase?
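
To make the charset questions concrete, here is a minimal sketch of the 62-character set from the configuration above, together with two hypothetical helpers (`restrict_to_charset` and `out_of_charset`) that drop or report characters outside it, such as punctuation predicted by a model trained on a larger charset. This is plain post-processing for illustration and is not the project's API.

```python
import string

# The 62-character test charset from the configuration: digits + a-z + A-Z.
CHARSET_62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def restrict_to_charset(text, charset=CHARSET_62):
    """Return text with every character outside charset removed (hypothetical helper)."""
    allowed = set(charset)
    return "".join(ch for ch in text if ch in allowed)


def out_of_charset(text, charset=CHARSET_62):
    """Return the sorted list of distinct characters in text that are not in charset."""
    allowed = set(charset)
    return sorted({ch for ch in text if ch not in allowed})


if __name__ == "__main__":
    prediction = "Ab-1!"  # made-up prediction containing punctuation
    print(restrict_to_charset(prediction))
    print(out_of_charset(prediction))
```

Running `out_of_charset` over the labels of a dataset is also a quick way to confirm that they really contain only digits and English letters.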
