
[FEAT] Support NLP text regression #2625

Open
j-adamczyk opened this issue Jul 9, 2023 · 3 comments
Labels
ds (Tasks suited for Data Scientists), linear, nlp (Affects deepchecks.nlp package)

Comments

@j-adamczyk

Is your feature request related to a problem? Please describe.

Currently only token classification and text classification are supported for NLP. However, there are important cases for text regression, for example:

  • CTR prediction for advertisements
  • sentiment magnitude prediction, e.g. GCP sentiment analysis predicts continuous values instead of classes
  • ordinal regression for texts, e.g. predicting number of stars from 1 to 5 based on review text

Describe the solution you'd like

Support for text regression, similar to tabular regression, but for NLP models, e.g. checking regression error distribution or train-test degradation for regression metrics.
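To illustrate the kind of check being requested, here is a minimal plain-Python sketch of a train-test degradation condition on a regression metric. The function names and the default threshold are illustrative assumptions, not the deepchecks API:

```python
# Illustrative sketch of a train-test degradation check for regression.
# Names and threshold are hypothetical, not the deepchecks API.

def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def train_test_degradation(train_true, train_pred, test_true, test_pred,
                           max_relative_degradation=0.2):
    """Compare MAE on train vs. test and flag excessive relative degradation."""
    train_score = mae(train_true, train_pred)
    test_score = mae(test_true, test_pred)
    degradation = (test_score - train_score) / train_score
    return {
        "train_mae": train_score,
        "test_mae": test_score,
        "degradation": degradation,
        "passed": degradation <= max_relative_degradation,
    }
```

A real check would of course wrap this in the deepchecks check/condition machinery and support configurable metrics; the sketch only shows the computation such a check would perform.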

@github-actions github-actions bot added needs triage Issue needs to be labeled and prioritized linear labels Jul 9, 2023
@noamzbr noamzbr added nlp Affects deepchecks.nlp package ds Tasks suited for Data Scientists and removed needs triage Issue needs to be labeled and prioritized labels Jul 19, 2023
@noamzbr (Collaborator) commented Jul 21, 2023

Thanks for the suggestion @j-adamczyk! Any other features you'd suggest for these task types?

@j-adamczyk (Author)

@noamzbr thank you for the fast response.

This requires a mix of regression tests and NLP tests.

From the tabular quickstart, interesting checks are:

  • train-test performance
  • regression error distribution
  • prediction drift
  • simple model comparison

Specifically, the tabular regression checks that don't make sense for NLP are weak segments performance (segments are not well defined for raw text), boosting overfit (NLP models typically don't use boosting), and model inference time (which is naturally long for NLP models).
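As a sketch of the regression error distribution check mentioned above: in the tabular version this summarizes the residuals (e.g. via their kurtosis) to detect systematic bias or heavy tails. A minimal stdlib-only illustration, with hypothetical names that are not the deepchecks API:

```python
# Illustrative sketch (not the deepchecks API): summarize the distribution
# of regression residuals, similar in spirit to the tabular
# RegressionErrorDistribution check.

def error_distribution(y_true, y_pred):
    """Return mean error and excess kurtosis of the residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # variance
    m4 = sum((r - mean) ** 4 for r in residuals) / n  # fourth moment
    kurtosis = m4 / m2 ** 2 - 3 if m2 else 0.0        # excess kurtosis
    return {"mean_error": mean, "kurtosis": kurtosis}
```

A nonzero mean error indicates systematic over- or under-prediction; strongly negative or positive excess kurtosis hints at an unusual residual shape worth inspecting.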

From the NLP text classification quickstart, interesting checks are:

  • text property outliers
  • unknown tokens
  • under annotated property segments
  • under annotated metadata segments
  • text duplicates
  • special characters
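Combining the two directions above, a property-segment view of regression error could bucket residuals by a text property. A hypothetical stdlib-only sketch, bucketing MAE by text length (names and bucket size are illustrative, not the deepchecks API):

```python
# Hypothetical sketch: regression error segmented by a text property
# (here, character length), in the spirit of property-segment checks.

def mae_by_text_length(texts, y_true, y_pred, bucket_size=50):
    """Return MAE per text-length bucket (bucket index = len(text) // bucket_size)."""
    buckets = {}
    for text, t, p in zip(texts, y_true, y_pred):
        key = len(text) // bucket_size
        buckets.setdefault(key, []).append(abs(t - p))
    return {k: sum(errs) / len(errs) for k, errs in sorted(buckets.items())}
```

A segment with much higher MAE than the rest (e.g. very short texts) would be the NLP-regression analogue of a weak segment in the tabular checks.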

Image regression could also be added in a very similar way (but that is outside the scope of this issue).

@j-adamczyk (Author)

@noamzbr any news on this? As far as I understand, this mostly combines two existing capabilities, so little genuinely new code should be needed.
