A problem when using the pecos model to train xtransformer #218

Open
xiaokening opened this issue May 9, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@xiaokening

xiaokening commented May 9, 2023

Description

When I train XTransformer with the PECOS model, a training error occurs in the matcher stage.
The dataset has 108457 instances, and the hierarchical label tree is [32, 1102]. In the matcher stage, while training the second layer of the label tree (there is no problem when training the first layer), the matcher fine-tuning completed but the run got stuck when predicting on the training data; see pecos.xmc.xtransformer.matcher.

I think this is caused by my training data set being too large, so I modified the following code snippet in pecos.xmc.xtransformer.matcher:

P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=train_params.batch_gen_workers,
    max_pred_chunk=30000,  # added: predict in chunks of 30000 instances
)

But another problem occurred; see the training log below.

05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
  File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in <module>
    do_train(args)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 548, in do_train
    xtf = XTransformer.train(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 447, in train
    res_dict = TransformerMatcher.train(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 1402, in train
    P_trn, inst_embeddings = matcher.predict(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 662, in predict
    cur_P, cur_embedding = self._predict(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 812, in _predict
    cur_act_labels = csr_codes_next[inputs["instance_number"].cpu()]
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
    row, col = self._validate_indices(key)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 159, in _validate_indices
    row = self._asindices(row, M)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 191, in _asindices
    raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (30255) out of range

I'm not sure if this is a bug; can you give me some advice? Thanks!

Environment

  • Operating system: Ubuntu 20.04.4 LTS container
  • Python version: Python 3.8.16
  • PECOS version: libpecos 1.0.0
@xiaokening added the bug (Something isn't working) label on May 9, 2023
@jiong-zhang
Contributor

Hi xiaokening, the issue is caused by the pre-tensorized prob.X_text carrying instance indices larger than the partitioned chunk size (30000). This should not happen if prob.X_text is not tensorized (i.e., it is a list of str).
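To illustrate the mechanism, here is a minimal sketch with shrunken sizes (not actual PECOS code): slicing a chunk-sized sparse matrix with an absolute instance index from the full dataset fails exactly like the traceback above.

import numpy as np
from scipy.sparse import csr_matrix

# Minimal sketch of the failure mode (sizes shrunk for illustration).
# A prediction chunk of csr_codes_next has only max_pred_chunk rows,
# but a pre-tensorized dataset still yields absolute instance indices.
max_pred_chunk = 30                      # stands in for 30000
chunk = csr_matrix((max_pred_chunk, 5))  # one chunk of csr_codes_next
absolute_index = np.array([42])          # stands in for index 30255
chunk[absolute_index]                    # IndexError: index (42) out of range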

If you want to manually chunk the prediction, one simple workaround is to turn off train_params.pre_tokenize so that every chunk of data is tensorized independently.
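Concretely, a sketch of that workaround, reusing the modified matcher.predict call from your snippet above (the assignment assumes pre_tokenize is a plain attribute on the train params object):

# Suggested workaround (sketch): disable pre-tokenization so each
# prediction chunk is tensorized independently and instance indices
# stay local to the chunk.
train_params.pre_tokenize = False

# The manual chunking can then remain in place; with pre_tokenize off,
# prob.X_text stays a list of str and is tokenized per chunk.
P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=train_params.batch_gen_workers,
    max_pred_chunk=30000,
)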

@xiaokening
Author

Thanks! @jiong-zhang

@xiaokening
Author

xiaokening commented Aug 24, 2023

@jiong-zhang When I train XTransformer with the PECOS model, the same training error occurs in the matcher stage. At first I thought my data volume was too large, but the problem still appears even after I increased the memory. It can occur at any matcher stage (I do not manually chunk the prediction).

I used the top and free commands to monitor the running program and found that the number of processes suddenly increased and then disappeared. I suspect it is a problem with the dataloader; you can refer to this link.
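One way to isolate the dataloader hypothesis is sketched below; it assumes batch_gen_workers is forwarded to torch's DataLoader num_workers, where 0 keeps batch generation in the main process.

# Debugging sketch: force single-process data loading during prediction.
# Assumes batch_gen_workers maps to torch.utils.data.DataLoader's
# num_workers, so 0 disables worker subprocesses entirely.
P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=0,  # no worker processes; slower but easier to debug
)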

Note: after the matcher fine-tuning completed, it got stuck at the first step of predicting the training data; see pecos.xmc.xtransformer.matcher.

Can you give me some advice? Thanks!

Environment

  • Operating system: Ubuntu 20.04.4 LTS container
  • Python version: Python 3.8.16
  • PECOS version: libpecos 1.0.0
  • PyTorch version: pytorch==1.11.0
  • GPU: 4 x NVIDIA V100 16GB
  • CUDA version: cudatoolkit=11.3
