
In stream mode, English words have no spaces after detokenization and Chinese characters are garbled #197

Open
lucasjinreal opened this issue Jun 1, 2023 · 14 comments
Labels
question Further information is requested

Comments

@lucasjinreal

[screenshot]

How can I resolve this problem?

peakji added the question label on Jun 1, 2023
@peakji
Member

peakji commented Jun 1, 2023

Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.

May I ask which model you are using? Are you using it through the API or through Python?

@lucasjinreal
Author

@peakji I think it's not related to the model. I'm simply using LLaMA.

The reason is that decoding token IDs one by one can give different text than decoding the whole sequence at once.

For instance, for the IDs [34, 56, 656], the tokenizer would decode the sequence as: I love u

But if you decode them one by one, you get: Iloveu

Decoding one by one doesn't preserve the spaces, and Chinese characters fare even worse.

However, I'm not sure whether this is the real cause.

But the above is indeed the problem I'm seeing.

What do you think? (Mainly: plain English words lose their spaces compared to the original, and Chinese characters are decoded incorrectly.)
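
A minimal sketch of what I mean (the checkpoint name is just an example; any SentencePiece-based tokenizer such as LLaMA's shows this behavior):

# Compare decoding the whole sequence vs. decoding token by token.
# "huggyllama/llama-7b" is only an example checkpoint; substitute your own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
ids = tokenizer.encode("I love u", add_special_tokens=False)

whole = tokenizer.decode(ids)
piecewise = "".join(tokenizer.decode([i]) for i in ids)

print(repr(whole))      # 'I love u'  -- spaces preserved
print(repr(piecewise))  # 'Iloveu'    -- each token's leading space is lost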

@lucasjinreal
Author

Or maybe there is something missing inside your StreamTokenizer (like ignoring some IDs)? Can you try decoding the IDs one by one and printing them?

# Decode each generated token ID individually and print it
# without a trailing newline.
outputs = []
for oid in output_ids:
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
print()
outputs = ''.join(outputs)

Never mind, I was wrong about this.

@peakji
Member

peakji commented Jun 1, 2023

> The reason is that decoding token IDs one by one can give different text than decoding the whole sequence at once.

StreamTokenizer is specifically designed to handle this properly.

There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:

https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
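
For intuition, here is a rough sketch of the general technique (this is not Basaran's actual implementation, just an illustration of the idea): keep the IDs seen so far, re-decode the whole buffer on every step, and emit only the new suffix, holding back output while the tail is still an incomplete byte sequence.

# Illustrative only -- NOT Basaran's actual code. A stateful detokenizer
# that re-decodes the growing ID buffer and emits just the new suffix,
# so inter-word spaces and multi-byte characters survive streaming.
class NaiveStreamDetokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.ids = []    # all token IDs received so far
        self.offset = 0  # length of text already emitted

    def decode(self, token_id):
        self.ids.append(token_id)
        text = self.tokenizer.decode(self.ids)
        # An incomplete multi-byte character decodes to U+FFFD;
        # wait for more tokens instead of emitting it.
        if text.endswith("\ufffd"):
            return ""
        new_text = text[self.offset:]
        self.offset = len(text)
        return new_text

(Re-decoding the full buffer every step is quadratic and only meant to illustrate; a real implementation would avoid re-decoding everything.)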

@lucasjinreal
Author

lucasjinreal commented Jun 1, 2023

@peakji Thanks. I'm just using the tokenizer from StreamModel, and the Chinese decoding errors still exist.

[screenshot]

And I still can't get spaces between English words.

I think the output stream has some problems. How can I combine the model and tokenizer to print the correct words in the terminal?

@peakji
Member

peakji commented Jun 2, 2023

@lucasjinreal
Author

lucasjinreal commented Jun 2, 2023

I get no spaces, and the Chinese is wrong too (try print(word, end='')).

I don't want a line break after every word, and I don't want unexpected spaces in non-English text.

@peakji
Member

peakji commented Jun 2, 2023

Could you please provide some example code for us to reproduce the issue?

The output in your first screenshot is apparently not from StreamTokenizer.

@lucasjinreal
Author

@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

Can you provide a demo that prints the values correctly without a line break after each one? (i.e. printing all the words one by one, correctly)

@peakji
Member

peakji commented Jun 2, 2023

> I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

You shouldn't use model.tokenizer directly because it's not a stateful StreamTokenizer but a stateless Huggingface tokenizer.

The correct way could be either:

a. Call the model directly without the need for manual detokenization:

for choice in model("once upon a time"):
    print(choice)

b. Create an instance of StreamTokenizer and use that instead:

detokenizer = StreamTokenizer(tokenizer)
expected = "hello world ABC \n 你好"
tokens = tokenizer.encode(expected)
actual = ""
for token in tokens:
    actual += detokenizer.decode(token)
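
If you want to stream the text to the terminal without a newline after every piece, something like this should work (a sketch assuming each yielded choice carries a "text" field, as in OpenAI-style completion choices; check the exact shape in your version):

# Sketch: print streamed pieces on one line as they arrive.
for choice in model("once upon a time"):
    print(choice["text"], end="", flush=True)
print()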

@lucasjinreal
Author

@peakji Thank you! I have solved the first problem.

The English seems OK now, but the Chinese is still not OK:

[screenshot]

[screenshot]

Some of the Chinese characters are fine, but some still come out with a weird encoding.

@lucasjinreal
Author

Some \n characters that are actually needed seem to be trimmed:

[screenshot]

@lucasjinreal
Author

I resolved the \n issue, but the Chinese clearly doesn't always work:

[screenshot]

Please test this more deeply!

@peakji
Member

peakji commented Jun 2, 2023

We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.

Could you please provide the code you are testing for us to reproduce?
