Random 'Segmentation fault (core dumped)' error when training for long spancat #13026

Open
belalsalih opened this issue Sep 28, 2023 · 7 comments
Labels
bug Bugs and behaviour differing from documentation feat / spancat Feature: Span Categorizer

Comments

@belalsalih

Hi,
I am getting 'Segmentation fault (core dumped)' when trying to train a SpanCat model for long spans. I know this error can be related to OOM issues, but that does not seem to be the case here. I reduced [nlp] batch_size and [training.batcher.size], as shown in the attached config file, and used a VM with a very large amount of RAM to make sure we are not running out of memory.
During training, the VM's memory usage never goes above 40%. Even when I reduce the [components.spancat.suggester] min_size and max_size, memory usage does not exceed 20%, yet training still exits with 'Segmentation fault (core dumped)'.

Note: when training with low [components.spancat.suggester] values, the training completes, but F, P and R are all zero.
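
As a rough illustration of why the suggester's min_size and max_size matter for long spans, the sketch below counts how many candidate spans would be proposed for a single document. It assumes the attached config uses an n-gram range suggester that proposes every contiguous span of each length from min_size to max_size (the config itself is not shown in this thread). The function name and the example numbers are hypothetical.

```python
def count_candidate_spans(doc_len: int, min_size: int, max_size: int) -> int:
    """Count contiguous spans with lengths min_size..max_size in a doc of doc_len tokens.

    Assumption: the suggester proposes every n-gram of every size in the range.
    """
    if min_size < 1 or max_size < min_size:
        raise ValueError("expected 1 <= min_size <= max_size")
    total = 0
    for size in range(min_size, max_size + 1):
        # A doc of doc_len tokens contains doc_len - size + 1 spans of this size,
        # and none at all if the size is longer than the doc.
        total += max(0, doc_len - size + 1)
    return total


if __name__ == "__main__":
    # Hypothetical 500-token document with increasingly wide size ranges.
    for max_size in (3, 20, 100):
        print(max_size, count_candidate_spans(500, 1, max_size))
```

While max_size is much smaller than the document length, the count grows roughly in proportion to max_size, so a wide range on long documents means many candidates to score per document.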

This is the command I am using for training:
python -m spacy train config_spn.cfg --output ./output_v3_lg_1.3 --paths.train ./spacy_models_v3/train_data.spacy --paths.dev ./spacy_models_v3/test_data.spacy --code functions.py -V

This is the training output:

[2023-09-28 09:25:08,461] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: output_v3_lg_1.3
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-09-28 09:25:08,610] [INFO] Set up nlp object from config
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
[2023-09-28 09:25:08,619] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-09-28 09:25:08,621] [INFO] Created vocabulary
[2023-09-28 09:25:09,450] [INFO] Added vectors: en_core_web_lg
[2023-09-28 09:25:09,450] [INFO] Finished initializing nlp object
[2023-09-28 09:25:16,150] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
[2023-09-28 09:25:16,158] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:16,159] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0      98109.47      19535.08        0.00        0.00        4.58    0.00
  0     200        528.73        781.51        0.00        0.00        3.75    0.00
Segmentation fault (core dumped)

Environment:

Operating System: Ubuntu 20.04.6 LTS
Python Version Used: 3.8.10
spaCy Version Used: 3.6.0
config_spn.cfg.txt

Thanks in advance!

@shadeMe shadeMe added the feat / spancat Feature: Span Categorizer label Sep 28, 2023
@shadeMe
Contributor

shadeMe commented Sep 29, 2023

A segmentation fault shouldn't be happening under any circumstances. Could you post the output of the following command?

pip list

Furthermore, I'd appreciate it if you could try the following for me:

  • Create a new virtualenv with just spacy (and its automatically installed dependencies).
  • Create a minimal training/eval set (with a small number of examples).
  • Try to reproduce the crash in this virtualenv.
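
The second step above could be sketched as follows. This is only one possible approach, assuming the data is stored in spaCy's DocBin format (.spacy files, as in the training command). The function names, file paths, and subset size are hypothetical.

```python
import random


def pick_subset(n_docs, k, seed=0):
    """Return k distinct document indices (or all of them if k >= n_docs), sorted."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_docs), min(k, n_docs)))


def write_subset(src_path, dst_path, k, seed=0):
    """Copy a sample of k docs from one .spacy file into a new, smaller one."""
    # Imported here so pick_subset can be used without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # assumption: English data, as in the thread
    docs = list(DocBin().from_disk(src_path).get_docs(nlp.vocab))
    small = DocBin()
    for i in pick_subset(len(docs), k, seed):
        small.add(docs[i])
    small.to_disk(dst_path)


if __name__ == "__main__":
    # Hypothetical paths; adjust to the actual corpus location.
    write_subset("train_data.spacy", "train_small.spacy", k=20)
```

Shrinking the subset step by step while the crash still reproduces can help narrow the problem down to a few documents.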

@belalsalih
Author

belalsalih commented Sep 30, 2023

Thanks for the reply,
I have created a new venv with only spacy, but I am still getting the same error, so this is not related to the installed pip packages. I am using a small data sample (300 docs) for training and validation.

I noticed one thing:
Changing the width in [components.tok2vec.model.encode] from the default 96 to 128 lets the training command complete one iteration before crashing; changing the value back to 96 causes the command to fail without completing any iterations.

Attached is the debug data output for your reference.
debug_data.txt

pip list output:

Package             Version
------------------- ---------
attrs               23.1.0
azure-core          1.28.0
azure-storage-blob  12.17.0
blis                0.7.9
catalogue           2.0.8
certifi             2023.5.7
cffi                1.15.1
charset-normalizer  3.2.0
click               8.1.5
confection          0.1.0
contourpy           1.1.0
cryptography        41.0.2
cycler              0.11.0
cymem               2.0.7
en-core-web-lg      3.6.0
en-core-web-sm      3.6.0
fonttools           4.41.1
fuzzysearch         0.7.3
fuzzywuzzy          0.18.0
idna                3.4
importlib-resources 6.0.0
isodate             0.6.1
Jinja2              3.1.2
joblib              1.3.1
kiwisolver          1.4.4
langcodes           3.3.0
Levenshtein         0.21.1
MarkupSafe          2.1.3
matplotlib          3.7.2
murmurhash          1.0.9
numpy               1.24.4
packaging           23.1
pandas              2.0.3
pathy               0.10.2
Pillow              10.0.0
pip                 23.2
pkg_resources       0.0.0
preshed             3.0.8
pycparser           2.21
pydantic            1.10.11
pyodbc              4.0.39
pyparsing           3.0.9
python-dateutil     2.8.2
python-Levenshtein  0.21.1
pytz                2023.3
rapidfuzz           3.2.0
regex               2023.8.8
requests            2.31.0
scikit-learn        1.3.0
scipy               1.10.1
setuptools          68.0.0
six                 1.16.0
sklearn             0.0.post7
smart-open          6.3.0
spacy               3.6.0
spacy-legacy        3.0.12
spacy-loggers       1.0.4
srsly               2.4.6
thefuzz             0.20.0
thinc               8.1.10
threadpoolctl       3.2.0
tqdm                4.65.0
typer               0.9.0
typing_extensions   4.7.1
tzdata              2023.3
urllib3             2.0.3
wasabi              1.1.2
wheel               0.40.0
zipp                3.16.2

@shadeMe
Contributor

shadeMe commented Oct 4, 2023

Thanks for the info - We'll investigate.

@shadeMe shadeMe added the bug Bugs and behaviour differing from documentation label Oct 4, 2023
@belalsalih
Author

To anyone facing this issue: I used NER instead of SpanCat and had no issues.
For overlapping spans, I trained one model to extract the high-level details and separate models to extract sub-details from complex data.
I still believe SpanCat would be the right way to do this if it worked as intended.

Regards.

@shadeMe
Contributor

shadeMe commented Oct 26, 2023

Hi, can you share the training/dev data and the custom code you were using to train the SpanCat model? We'd need that to reproduce the crash and debug the issue.

@belalsalih
Author

Hi,
I ran into this issue while building a CV parser for our clients, so unfortunately we cannot share the data, since it contains live applicant data.
We are not using any custom code to train the model; we generate the training data and save training.spacy/dev.spacy on the fly.
The same data that causes this error works fine when using NER instead of SpanCat, so I don't think this is a data issue, as you can see in the debug data shared earlier.
You can also check discussion thread 13012, which is related to this issue.

Regards.

@shadeMe shadeMe added the more-info-needed This issue needs more information label Oct 30, 2023
@shadeMe
Contributor

shadeMe commented Oct 30, 2023

That's understandable. The issue is likely a bug in the SpanCat component's code, but we still need to consistently reproduce the crash in order to identify the cause and fix it. If you run into this issue in the future where you can share the data that triggers the crash, please let us know.
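
One way to collect more detail about such a crash without sharing the data is Python's standard-library faulthandler module, which dumps the Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is a suggestion rather than something proposed in this thread, and the helper name and log path are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_trace.log"):
    """Write Python tracebacks for all threads to log_path on a fatal signal."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    trace_file = enable_crash_traceback()
    # ... run the training code here; if the process segfaults, the Python
    # frames that were active at the time are written to segfault_trace.log.
```

The same handler can be switched on for an unmodified training command by adding `-X faulthandler` right after `python`, in which case the traceback goes to stderr. It shows only the Python-level frames, not the native ones, but that can already indicate which component was running when the crash occurred.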

@github-actions github-actions bot removed the more-info-needed This issue needs more information label Oct 30, 2023
2 participants