Unicode Character training issue #40

Open
rahat10120141 opened this issue Jul 29, 2022 · 5 comments

Comments

@rahat10120141

I tried to train a model to translate English to Bengali. After training, when I run the code, the output is not in Unicode Bengali characters.

I Eat Rice (eng)=> আমি ভাত খাই (Bn)

The model is given this type of data during training. After training completed, I tested the model with the input "I Eat Rice", expecting "আমি ভাত খাই" as output. Instead, the model gave me "Ich esse Reis." I don't know what language this is. It is not related to Bengali.
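
A quick way to check whether a prediction is actually in Bengali script is to count how many of its letters fall in the Unicode Bengali block (U+0980 to U+09FF). This is only a minimal sketch using the standard library; the helper name `bengali_ratio` is made up for illustration and is not part of simpleT5.

```python
import unicodedata


def bengali_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks that are in the Bengali block."""
    # Keep only letters (L*) and marks (M*); ignore spaces, digits, punctuation.
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    bengali = [ch for ch in letters if "\u0980" <= ch <= "\u09ff"]
    return len(bengali) / len(letters)


print(bengali_ratio("আমি ভাত খাই"))     # 1.0
print(bengali_ratio("Ich esse Reis."))  # 0.0
```

A ratio near 0 for a prediction that should be Bengali suggests the model is producing another language, as in the example above.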

@rahat10120141
Author

I checked the output. It is in German. But why is it in German?

@rahat10120141
Author

    import pandas as pd
    from simplet5 import SimpleT5
    from sklearn.model_selection import train_test_split

    # Start from the pretrained t5-base checkpoint
    model = SimpleT5()
    model.from_pretrained(model_type="t5", model_name="t5-base")

    path = "D:\\Python\\Quilbot\\Dataset\\translation.csv"
    df = pd.read_csv(path, encoding='utf8', quotechar="'")
    # simpleT5 expects the columns "source_text" and "target_text"
    df = df.rename(columns={"headlines": "source_text", "text": "target_text"})
    df = df[['source_text', 'target_text']]
    # T5 expects a task prefix; this is a translation task, so prepend "tn2bn: "
    df['source_text'] = "tn2bn: " + df['source_text']
    print(df)

    train_df, test_df = train_test_split(df, test_size=0.2)
    print(train_df.shape, test_df.shape)

    model.train(train_df=train_df,
                eval_df=test_df,
                source_max_token_len=128,
                target_max_token_len=50,
                batch_size=8,
                max_epochs=3,
                use_gpu=False
                )
    # Load a saved checkpoint directory
    model.load_model("t5", "outputs/translate", use_gpu=False)

    # Note: this prefix differs from the "tn2bn: " prefix used for training
    text_to_translate = "translate: I eat rice."
    print(model.predict(text_to_translate))

@rahat10120141
Author

I have tested it with the task prefix "tn2bn".
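
Since the training code prepends "tn2bn: " to `source_text` while the test input in the snippet starts with "translate: ", it may help to route both training and inference text through one helper so the prefixes cannot drift apart. A rough sketch follows; `TASK_PREFIX`, `with_task_prefix` and `has_expected_prefix` are illustrative names, not simpleT5 functions.

```python
TASK_PREFIX = "tn2bn: "  # assumed to be the prefix applied to source_text at training time


def with_task_prefix(text: str, prefix: str = TASK_PREFIX) -> str:
    """Prepend the task prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def has_expected_prefix(text: str, prefix: str = TASK_PREFIX) -> bool:
    """Report whether an inference input uses the same prefix as training."""
    return text.strip().startswith(prefix)


print(with_task_prefix("I eat rice."))                # tn2bn: I eat rice.
print(has_expected_prefix("translate: I eat rice."))  # False
```

The same helper could be applied to the training column (for example `df['source_text'].map(with_task_prefix)`) and to every string passed to `model.predict`.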

@Shivanandroy
Owner

@rahat10120141: What does your train_df look like before it is fed to the model?

@rahat10120141
Author

T5 doesn't have English-to-Bengali translation. From the beginning, it was giving me German results.
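
One way to test whether a checkpoint's vocabulary can represent Bengali text at all is to tokenize a sample sentence and count how many tokens come back as the unknown token. The sketch below assumes a Hugging Face style tokenizer (callable, returning `input_ids`, with an `unk_token_id` attribute); the function name `unknown_token_ratio` is made up for illustration. I have not run it against t5-base here, but a high ratio would suggest the target text is mostly lost before training even starts.

```python
def unknown_token_ratio(tokenizer, text: str) -> float:
    """Return the fraction of token ids that equal the tokenizer's unknown-token id."""
    # add_special_tokens=False keeps end-of-sequence markers out of the count.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0
    unk_id = tokenizer.unk_token_id
    return sum(1 for token_id in ids if token_id == unk_id) / len(ids)


# Usage, assuming transformers and sentencepiece are installed:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("t5-base")
#   print(unknown_token_ratio(tok, "আমি ভাত খাই"))
```

If the ratio is high for Bengali sentences, a checkpoint whose vocabulary covers Bengali would likely be a better starting point than t5-base.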
