MultiClassifierDLApproach not transforming every row of my dataset #14218

Open
1 task done
AntoineF3006 opened this issue Mar 27, 2024 · 1 comment

AntoineF3006 commented Mar 27, 2024

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am working on a multi-output classification task: classifying customer comments into several categories. I am using MultiClassifierDLApproach for this task, trained on already labeled data.
I followed this tutorial: https://www.johnsnowlabs.com/mastering-text-classification-with-spark-nlp.

Current Behavior

After fitting my pipeline (described below) on my train set, I transform my train and test sets with it. The results are pretty good overall, but on some rows the category column is empty and no probabilities are computed for any category.

Expected Behavior

I expected every row to get a probability for every category. Since I set the threshold to 0.5, some rows may have no selected category, but the value for each category should still be there.
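
The expectation above can be sketched in plain Python. This is not Spark NLP code and says nothing about how MultiClassifierDLApproach works internally: the category names and scores are made up for illustration, and `apply_threshold` is a hypothetical helper. It only shows the expected shape of the result, where the scores remain available for every category even when none reaches the threshold.

```python
from typing import Dict, List, Tuple


def apply_threshold(
    scores: Dict[str, float], threshold: float = 0.5
) -> Tuple[List[str], Dict[str, float]]:
    """Return the categories whose score reaches the threshold, plus all scores.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    selected = [label for label, score in scores.items() if score >= threshold]
    return selected, scores


# Made-up scores for a single comment; no category reaches 0.5.
row_scores = {"delivery": 0.31, "pricing": 0.12, "support": 0.44}
selected, all_scores = apply_threshold(row_scores, threshold=0.5)

print(selected)    # categories that reached the threshold (none here)
print(all_scores)  # scores are still present for every category
```

In this sketch an empty selection and missing scores are two separate things, which is the distinction the rows with an empty category column do not seem to preserve.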

Steps To Reproduce

https://drive.google.com/file/d/1tmJYwZKBVZoHtLcuyWtWhsu6nbonKG-S/view?usp=sharing

The zip contains a .ipynb notebook that recreates the steps I used to build my pipeline, some sample data with their results, and the fitted pipeline.
The input column is texte_sw, the label is niveau_2_MC, and the output is category.
The issue seems to occur uniformly across my data: the date and time, the text length, and the number of words do not seem to be the cause.

Spark NLP version and Apache Spark

sparknlp.version(): 5.2.3
spark.version: 3.2.0.3.2.7170.1008-2

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

@AntoineF3006
Author

Hello @maziyarpanahi, is my issue complete enough, or do I need to add more context or data to discuss it?
Kind regards,
