.setReadMonthFirst always reads month first #14098

KyriakosAseto · 2023-12-18T10:05:24Z

Is there an existing issue for this?

I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am using the example provided by Spark NLP with customized settings, and I am trying to configure the matchers so that they do not read the month first:

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy")

Current Behavior

Setting the parameter to False has no effect: the month is always read first from the input.

For example, "I was born at 01/03/98" is intended to mean the 1st of March 1998.

Expected Behavior

The example should be read day first, giving 01/03/1998 (the 1st of March 1998) in the dd/MM/yyyy output format.
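To make the two readings concrete, here is a minimal sketch using only Python's standard `datetime` module. It does not use Spark NLP and says nothing about how `DateMatcher` parses dates internally; it only shows what a day-first and a month-first interpretation of the same string produce when formatted as dd/MM/yyyy.

```python
from datetime import datetime

raw = "01/03/98"

# Day-first reading: the result expected from setReadMonthFirst(False)
day_first = datetime.strptime(raw, "%d/%m/%y")

# Month-first reading: the interpretation this report says is always applied
month_first = datetime.strptime(raw, "%m/%d/%y")

# Format both the same way as setOutputFormat("dd/MM/yyyy")
print(day_first.strftime("%d/%m/%Y"))    # 01/03/1998
print(month_first.strftime("%d/%m/%Y"))  # 03/01/1998
```

So the expected value in the `date` column for this sentence is 01/03/1998, while a month-first reading would produce 03/01/1998.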

Steps To Reproduce

import sparknlp
from sparknlp.annotator import DocumentAssembler, DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

spark = sparknlp.start()
spark


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 


pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    multiDate
    ])

text_list = ["See you on next monday.", 
             "I was born at 01/03/98", 
             "She was born on 02/03/1966.", 
             "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", 
             "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)

Spark NLP version and Apache Spark

spark-nlp==5.2.0

Type of Spark Application

Python Application

Java Version

openjdk version "11.0.21" 2023-10-17

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64

Setup and installation

numpy==1.26.2
py4j==0.10.9.7
pyspark==3.5.0
spark-nlp==5.2.0

Operating System and Version

Ubuntu 22.04

Link to your project (if available)

No response

Additional Information

No response

kaniosm · 2023-12-19T08:56:23Z

I'm facing the same issue.
You can find a screenshot demonstrating the issue

KyriakosAseto added the question label Dec 18, 2023

KyriakosAseto assigned maziyarpanahi Dec 18, 2023

maziyarpanahi assigned wolliq Dec 19, 2023
