
In stream mode, English words have no spaces after detokenization and Chinese characters are garbled #197

Open
lucasjinreal opened this issue Jun 1, 2023 · 14 comments
Labels
question Further information is requested

Comments

@lucasjinreal

[screenshot]

How can I resolve this problem?

peakji added the question label on Jun 1, 2023
@peakji
Member

peakji commented Jun 1, 2023

Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.

May I ask which model you are using? Are you using it through the API or through Python?

@lucasjinreal
Author

@peakji I think it's not related to the model. I'm simply using LLaMA.

The reason is that decoding token IDs one by one can give different text than decoding the whole sequence at once.

For instance, for the IDs [34, 56, 656], the tokenizer would decode the sequence as: I love u

But if you decode them one by one, you get: Iloveu

Decoding one by one doesn't preserve the spaces, and Chinese characters fare even worse.

However, I'm not sure whether this is the real cause.

But the above is indeed the problem I'm seeing.

What do you think? (Mainly: plain English words lose their spaces compared to the original, and Chinese characters are decoded incorrectly.)
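
A minimal sketch of what I mean (the checkpoint name is just an example; any SentencePiece-based tokenizer such as LLaMA's shows this behavior):

# Compare decoding the whole sequence vs. decoding token by token.
# "huggyllama/llama-7b" is only an example checkpoint; substitute your own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
ids = tokenizer.encode("I love u", add_special_tokens=False)

whole = tokenizer.decode(ids)
piecewise = "".join(tokenizer.decode([i]) for i in ids)

print(repr(whole))      # 'I love u'  -- spaces preserved
print(repr(piecewise))  # 'Iloveu'    -- each token's leading space is lost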

@lucasjinreal
Author

Or maybe there is something missing inside your StreamTokenizer (like ignoring some IDs)? Can you try decoding the IDs one by one and printing them?

# Decode each generated token ID individually and print it
# without a trailing newline.
outputs = []
for oid in output_ids:
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
print()
outputs = ''.join(outputs)

Never mind, I was wrong about this.

@peakji
Member

peakji commented Jun 1, 2023

> The reason is that decoding token IDs one by one can give different text than decoding the whole sequence at once.

StreamTokenizer is specifically designed to handle this properly.

There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:

https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
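
For intuition, here is a rough sketch of the general technique (this is not Basaran's actual implementation, just an illustration of the idea): keep the IDs seen so far, re-decode the whole buffer on every step, and emit only the new suffix, holding back output while the tail is still an incomplete byte sequence.

# Illustrative only -- NOT Basaran's actual code. A stateful detokenizer
# that re-decodes the growing ID buffer and emits just the new suffix,
# so inter-word spaces and multi-byte characters survive streaming.
class NaiveStreamDetokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.ids = []    # all token IDs received so far
        self.offset = 0  # length of text already emitted

    def decode(self, token_id):
        self.ids.append(token_id)
        text = self.tokenizer.decode(self.ids)
        # An incomplete multi-byte character decodes to U+FFFD;
        # wait for more tokens instead of emitting it.
        if text.endswith("\ufffd"):
            return ""
        new_text = text[self.offset:]
        self.offset = len(text)
        return new_text

(Re-decoding the full buffer every step is quadratic and only meant to illustrate; a real implementation would avoid re-decoding everything.)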

@lucasjinreal
Author

lucasjinreal commented Jun 1, 2023

@peakji Thanks. I'm just using the tokenizer from StreamModel, and the Chinese decoding errors still exist.

[screenshot]

And I still can't get spaces between English words.

I think the output stream has some problems. How can I combine the model and tokenizer to print the correct words in the terminal?

@peakji
Member

peakji commented Jun 2, 2023

@lucasjinreal
Author

lucasjinreal commented Jun 2, 2023

I get no spaces, and the Chinese is wrong too (try print(word, end='')).

I don't want a line break after every word, and I don't want unexpected spaces in non-English text.

@peakji
Member

peakji commented Jun 2, 2023

Could you please provide some example code for us to reproduce the issue?

The output in your first screenshot is apparently not from StreamTokenizer.

@lucasjinreal
Author

@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

Can you provide a demo that prints the values correctly without a line break after each one? (i.e. printing all the words one by one, correctly)

@peakji
Member

peakji commented Jun 2, 2023

> I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

You shouldn't use model.tokenizer directly because it's not a stateful StreamTokenizer but a stateless Huggingface tokenizer.

The correct way could be either:

a. Call the model directly without the need for manual detokenization:

for choice in model("once upon a time"):
    print(choice)

b. Create an instance of StreamTokenizer and use that instead:

detokenizer = StreamTokenizer(tokenizer)
expected = "hello world ABC \n 你好"
tokens = tokenizer.encode(expected)
actual = ""
for token in tokens:
    actual += detokenizer.decode(token)
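
If you want to stream the text to the terminal without a newline after every piece, something like this should work (a sketch assuming each yielded choice carries a "text" field, as in OpenAI-style completion choices; check the exact shape in your version):

# Sketch: print streamed pieces on one line as they arrive.
for choice in model("once upon a time"):
    print(choice["text"], end="", flush=True)
print()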

@lucasjinreal
Author

@peakji Thank you! I have solved the first problem.

The English seems OK now, but the Chinese is still not OK:

[screenshot]

[screenshot]

Some of the Chinese characters are fine, but some still come out with a weird encoding.

@lucasjinreal
Author

Some \n characters that are actually needed seem to be trimmed:

[screenshot]

@lucasjinreal
Author

I resolved the \n issue, but the Chinese clearly doesn't always work:

[screenshot]

Please test this more deeply!

@peakji
Member

peakji commented Jun 2, 2023

We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.

Could you please provide the code you are testing for us to reproduce?
