Fine tuning a T5 model with another language #110

Open

vabatista opened this issue Jul 3, 2023 · 0 comments
Hi,

I'm trying to figure out how to prepare data and fine-tune this T5 base model (https://huggingface.co/unicamp-dl/ptt5-base-t5-vocab) on this SQuAD dataset (https://huggingface.co/datasets/squad_v1_pt).

I downloaded the data from Hugging Face to a local folder:

(screenshot of the local data folder)

Then I ran the following command:

python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 

But I got this error:

(qagenerator) Apptainer> python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
07/03/2023 08:55:55 - INFO - nlp.load -   Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 08:55:55 - INFO - nlp.load -   Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 08:55:55 - INFO - nlp.load -   Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4
07/03/2023 08:55:55 - INFO - nlp.load -   Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.py
07/03/2023 08:55:55 - INFO - nlp.load -   Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/dataset_infos.json
07/03/2023 08:55:55 - INFO - nlp.load -   Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.json
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 536, in load_dataset
    builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable
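This first TypeError is raised inside `nlp.load_dataset` because `builder_cls` came back as `None`: the old `nlp` library looks for a processing script named after the dataset folder (here `data/squad_v1_pt/squad_v1_pt.py`) and, when it cannot resolve a `DatasetBuilder` class from it, ends up calling `None`. A minimal sketch of that failure mode (the `builder_cls = None` stand-in mirrors what `load.py` resolved at the line shown in the traceback):

```python
# Sketch of the failure mode: nlp.load_dataset() resolved no builder class
# for data/squad_v1_pt, so builder_cls is None and calling it raises.
builder_cls = None  # stand-in for what nlp's module resolution returned

try:
    builder_cls(name="highlight_qg_format")  # mimics load.py line 536
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)  # -> 'NoneType' object is not callable
```

So the folder either needs a script the `nlp` library can import as a builder, or the copied-and-renamed `squad_multitask` script described below.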

I also tried copying the data/squad_multitask directory and modifying these lines to point to my URLs:

    _URL = "https://github.com/nunorc/squad-v1.1-pt/raw/master/"
    _DEV_FILE = "dev-v1.1-pt.json"
    _TRAINING_FILE = "train-v1.1-pt.json"

This time I got a different error:

(qagenerator) Apptainer> python prepare_data.py     --task e2e_qg     --model_type t5     --dataset_path data/squad_v1_pt     --qg_format highlight_qg_format     --max_source_length 512     --max_target_length 32     --train_file_name train_data_e2e_qg_t5_ptbr.pt     --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 
[same tokenizer FutureWarning as in the first run]
07/03/2023 09:07:01 - INFO - nlp.load -   Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 09:07:02 - INFO - nlp.load -   Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 09:07:02 - INFO - nlp.load -   Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.load -   Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py
07/03/2023 09:07:02 - INFO - nlp.load -   Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/dataset_infos.json
07/03/2023 09:07:02 - INFO - nlp.load -   Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.json
[nltk_data] Downloading package punkt to /home/U4VN/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
07/03/2023 09:07:02 - INFO - nlp.info -   Loading Dataset Infos from /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.builder -   Generating dataset squad_multitask (/tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec)
Downloading and preparing dataset squad_multitask/highlight_qg_format (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec...
07/03/2023 09:07:02 - INFO - nlp.builder -   Dataset not on Hf google storage. Downloading and preparing it from source
07/03/2023 09:07:04 - INFO - nlp.utils.info_utils -   Unable to verify checksums.
07/03/2023 09:07:04 - INFO - nlp.builder -   Generating split train
0 examples [00:00, ? examples/s]07/03/2023 09:07:04 - INFO - root -   generating examples from = /tmp/u4vn/huggingface/datasets/downloads/6bf2e2bfc0769ed6e47c7935079d8584fb3201dd7915b637bbcf0fe3409710a0.4d4fd5bfbda09cd172db9f6f025e9bbf6d4d7d20cd53cef625822e1f2a34dd1f
Traceback (most recent call last):  
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 810, in _prepare_split
    for key, record in utils.tqdm(generator, unit=" examples", total=split_info.num_examples, leave=False):
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 239, in _generate_examples
    yield count, self.process_qg_text(context, question, qa["answers"][0])
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 144, in process_qg_text
    start_pos, end_pos = self._get_correct_alignement(context, answer)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 131, in _get_correct_alignement
    raise ValueError()
ValueError
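The final ValueError comes from the `_get_correct_alignement` helper in the copied squad_multitask script, which verifies that the answer text actually sits at `answer_start` in the context (trying small offset corrections) and raises when it doesn't. In machine-translated SQuAD data, `answer_start` offsets often still refer to the English context, so many answers fail this check. One possible workaround, sketched below under that assumption (`realign_answer` is a hypothetical helper, not part of the repo), is to recompute `answer_start` with `str.find` in a preprocessing pass and drop examples whose answer text no longer occurs in the translated context:

```python
def realign_answer(context: str, answer: dict):
    """Recompute answer_start for a (possibly translated) SQuAD answer.

    Returns a corrected answer dict, or None when the answer text does not
    occur in the context and the QA pair should be dropped.
    """
    text = answer["text"].strip()
    start = answer.get("answer_start", -1)
    # Trust the stored offset if it already lines up with the context.
    if 0 <= start and context[start:start + len(text)] == text:
        return {"text": text, "answer_start": start}
    # Otherwise locate the answer span in the (translated) context.
    found = context.find(text)
    if found == -1:
        return None  # unalignable: skip this QA pair
    return {"text": text, "answer_start": found}


# Example: a Portuguese context whose stored offset is stale
# (left over from the English original).
ctx = "O Rio Amazonas é o maior rio do mundo em volume de água."
ans = {"text": "Rio Amazonas", "answer_start": 40}
print(realign_answer(ctx, ans))  # -> {'text': 'Rio Amazonas', 'answer_start': 2}
```

Running such a pass over the translated train/dev JSON files before `prepare_data.py` (or applying the same logic inside the copied script instead of `raise ValueError()`) should let generation proceed, at the cost of silently dropping the answers that cannot be re-aligned.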