.setReadMonthFirst always reads month first #14098

Open
1 task done
KyriakosAseto opened this issue Dec 18, 2023 · 1 comment

KyriakosAseto commented Dec 18, 2023

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am using the example provided by Spark NLP with customized settings, and I am trying to configure the matchers so that they do not read the month first:

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 

Current Behavior

Setting the parameter to False has no effect: the matchers always read the month first from the input.
image

Please see the example "I was born at 01/03/98", which is intended to be the 1st of March 1998.

Expected Behavior

With setReadMonthFirst(False), my example 01/03/98 should be read day first (1 March 1998), not month first (3 January 1998).
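To make the two readings concrete, here is a small sketch using only Python's standard `datetime` module. It does not touch Spark NLP; it only shows what each interpretation of the string would give when formatted as dd/MM/yyyy. The variable names are mine, chosen for illustration.

```python
from datetime import datetime

raw = "01/03/98"

# Day-first reading: what I expect with setReadMonthFirst(False).
day_first = datetime.strptime(raw, "%d/%m/%y")

# Month-first reading: what the matchers appear to apply regardless of the flag.
month_first = datetime.strptime(raw, "%m/%d/%y")

print(day_first.strftime("%d/%m/%Y"))    # 01/03/1998
print(month_first.strftime("%d/%m/%Y"))  # 03/01/1998
```

The first line of output is the value I would expect in the `date` column for this sentence.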

Steps To Reproduce

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

spark = sparknlp.start()
spark


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 


pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    multiDate
    ])

text_list = ["See you on next monday.", 
             "I was born at 01/03/98", 
             "She was born on 02/03/1966.", 
             "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", 
             "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)
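To check the output above without reading the table by eye, a helper along these lines could compute the day-first value each numeric date should produce. This is only a sketch: `expected_day_first` is a hypothetical helper, not part of Spark NLP, and it handles only plain d/m/y tokens such as the two in `text_list`.

```python
from datetime import datetime

def expected_day_first(token):
    """Return the dd/MM/yyyy string that a day-first reading of `token` gives."""
    # Try a 4-digit year first, then fall back to a 2-digit year.
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(token, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"not a numeric d/m/y date: {token!r}")

# The two numeric dates that appear in text_list.
expected = {tok: expected_day_first(tok) for tok in ("01/03/98", "02/03/1966")}
print(expected)
```

Rows whose `date` or `multi_date` value differs from these expected strings are the ones where the flag was ignored.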

Spark NLP version and Apache Spark

spark-nlp==5.2.0

Type of Spark Application

Python Application

Java Version

openjdk version "11.0.21" 2023-10-17

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64

Setup and installation

numpy==1.26.2
py4j==0.10.9.7
pyspark==3.5.0
spark-nlp==5.2.0

Operating System and Version

Ubuntu 22.04

Link to your project (if available)

No response

Additional Information

No response

kaniosm commented Dec 19, 2023

I'm facing the same issue.
The screenshot below demonstrates the issue:
image
