[BUG] writeToAzureSearch fails when the index has custom analyzers or tokenizers since 0.11.0 #2143

Open
3 of 19 tasks
DBundred-cfc opened this issue Dec 1, 2023 · 1 comment

Comments

DBundred-cfc commented Dec 1, 2023

SynapseML version

0.11.0-1.0.2

System information

  • Python standalone: Python 3.11.3, Scala 2.12, Spark 3.4.1, SynapseML 1.0.2
  • Azure Synapse 3.3, SynapseML 0.11.4-spark3.3

Describe the problem

When SynapseML is used to load data into an index that has a custom analyzer or tokenizer (and possibly other custom objects, but they haven't been tested), it fails with the following error:

Py4JJavaError: An error occurred while calling z:com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter.write.
: spray.json.DeserializationException: Expected String as JsString, but got {"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer","charFilters":[],"name":"keyword_analyzer","tokenFilters":["lowercase"],"tokenizer":"keyword_v2"}

This happens regardless of which apiVersion is set, and seemingly with any SynapseML version from 0.11.0 onward. It works correctly against the same index when run on a Spark 3.2 Azure Synapse cluster, which uses SynapseML version 0.10.2.
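The error message suggests the parser received a JSON object where it expected a plain string. As a rough way to tell whether a given index is likely to hit this, here is a minimal sketch that lists the object-valued custom components in an index definition. It assumes the definition has already been loaded into a Python dict (for example from the index JSON returned by the service); `find_custom_components` and `example-index` are hypothetical names, not part of SynapseML.

```python
# Top-level index properties that can hold custom (object-valued) components
# in an Azure AI Search index definition.
CUSTOM_COMPONENT_KEYS = ("analyzers", "tokenizers", "tokenFilters", "charFilters")


def find_custom_components(index_definition):
    """Return (property, name) pairs for each custom component in the index."""
    found = []
    for key in CUSTOM_COMPONENT_KEYS:
        # A missing or null property means no custom components of that kind.
        for entry in index_definition.get(key) or []:
            if isinstance(entry, dict):
                found.append((key, entry.get("name")))
    return found


index_definition = {
    "name": "example-index",  # hypothetical index name
    "fields": [{"name": "Id", "type": "Edm.String", "key": True}],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "keyword_analyzer",
            "tokenizer": "keyword_v2",
            "tokenFilters": ["lowercase"],
        }
    ],
}

print(find_custom_components(index_definition))
# [('analyzers', 'keyword_analyzer')]
```

Only the `analyzers` case has been confirmed to fail in this report; the other three properties are included on the assumption that they are parsed the same way.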

Code to reproduce issue

Create an index with a custom analyzer

This needs to be done through the API:
POST https://{{service-name}}.search.windows.net/indexes?api-version={{api-version}}

{
  "name": "{{index-name}}",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "Id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "synonymMaps": []
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "keyword_analyzer",
      "tokenizer": "keyword_v2",
      "tokenFilters": ["lowercase"]
    }
  ]
}
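
For anyone who prefers to script the index creation, the request above could be built from Python roughly as follows. This is a sketch using only the standard library; `build_create_index_request`, the service name, and the key are placeholders, and the API version is deliberately left as a placeholder because the report does not pin one. The request is built but not sent.

```python
import json
import urllib.request


def build_create_index_request(service_name, api_version, index_definition, admin_key):
    """Build (but do not send) the POST request that creates the index."""
    url = f"https://{service_name}.search.windows.net/indexes?api-version={api_version}"
    body = json.dumps(index_definition).encode("utf-8")
    # Azure AI Search authenticates admin operations with the api-key header.
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Trimmed version of the index body above, keeping the custom analyzer.
index_body = {
    "name": "example-index",  # placeholder for {{index-name}}
    "fields": [{"name": "Id", "type": "Edm.String", "key": True}],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "keyword_analyzer",
            "tokenizer": "keyword_v2",
            "tokenFilters": ["lowercase"],
        }
    ],
}

request = build_create_index_request(
    "my-service",        # placeholder for {{service-name}}
    "YOUR-API-VERSION",  # placeholder for {{api-version}}
    index_body,
    "YOUR-ADMIN-KEY",    # placeholder admin key
)
print(request.get_method(), request.full_url)
# To actually create the index: urllib.request.urlopen(request)
```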

Try and load the index

Run the following PySpark code on Spark 3.4:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.2") \
    .getOrCreate()

import synapse.ml
from synapse.ml.services import writeToAzureSearch
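from pyspark.sql.functions import lit, col

# Placeholders: replace with your own service details
admin_key = "<search-admin-key>"
search_service = "<service-name>"
search_index = "<index-name>"
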
df = spark.range(10) \
    .withColumn("Id", col("id").cast("string")) \
    .withColumn("action", lit("upload"))

x = writeToAzureSearch(df, 
        subscriptionKey=admin_key,
        actionCol="action",
        serviceName=search_service,
        indexName=search_index,
        keyCol="Id")

You can also run the same code (without the Spark session creation) on Azure Synapse 3.3 and get the same result. I imagine this will also happen on Databricks and Synapse 3.4, but I haven't tested either.
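
Since the platform seems to matter less than the SynapseML version, a quick check like the one below may help decide whether an environment falls in the range reported here. It is only a sketch: the 0.11.0 boundary comes from this report (0.10.2 works; 0.11.4 and 1.0.2 fail), versions later than 1.0.2 are untested, and the helper names are hypothetical.

```python
import re

# First version reported as failing in this issue (assumption: boundary taken
# from the report, not from the SynapseML changelog).
FIRST_AFFECTED = (0, 11, 0)


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '0.11.4-spark3.3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_failing_range(version):
    """True if the version is at or above the first version reported to fail."""
    return parse_version(version) >= FIRST_AFFECTED


for v in ("0.10.2", "0.11.4-spark3.3", "1.0.2"):
    print(v, in_reported_failing_range(v))
# 0.10.2 False
# 0.11.4-spark3.3 True
# 1.0.2 True
```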

Other info / logs

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
c:\Users\dbundred\Documents\Projects\OSM\OtherStuff\synapseMlTest.ipynb Cell 2 line 1
     11 from pyspark.sql.functions import lit, col
     14 df = spark.range(10) \
     15     .withColumn("Id", col("id").cast("string")) \
     16     .withColumn("action", lit("upload"))
---> 18 x = writeToAzureSearch(df, 
     19         subscriptionKey=admin_key,
     20         actionCol="action",
     21         serviceName=search_service,
     22         indexName=search_index,
     23         keyCol="Id")

File ~\AppData\Local\Temp\spark-c09947f2-255d-45b5-a241-6a7165bbac06\userFiles-b3991b65-4cea-4012-980d-108b02406dcc\com.microsoft.azure_synapseml-cognitive_2.12-1.0.2.jar\synapse\ml\services\search\AzureSearchWriter.py:28, in writeToAzureSearch(df, **options)
     26 jvm = SparkContext.getOrCreate()._jvm
     27 writer = jvm.com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter
---> 28 writer.write(df._jdf, options)

File c:\Users\dbundred\AppData\Local\Programs\Python\Python311\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File c:\Users\dbundred\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\errors\exceptions\captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File c:\Users\dbundred\AppData\Local\Programs\Python\Python311\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling z:com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter.write.
: spray.json.DeserializationException: Expected String as JsString, but got {"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer","charFilters":[],"name":"keyword_analyzer","tokenFilters":["lowercase"],"tokenizer":"keyword_v2"}
	at spray.json.package$.deserializationError(package.scala:23)
	at spray.json.ProductFormats.fromField(ProductFormats.scala:63)
	at spray.json.ProductFormats.fromField$(ProductFormats.scala:51)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchProtocol$.fromField(AzureSearchSchemas.scala:67)
	at spray.json.ProductFormatsInstances$$anon$11.read(ProductFormatsInstances.scala:341)
	at spray.json.ProductFormatsInstances$$anon$11.read(ProductFormatsInstances.scala:319)
	at spray.json.JsValue.convertTo(JsValue.scala:33)
	at com.microsoft.azure.synapse.ml.services.search.IndexParser.parseIndexJson(AzureSearchAPI.scala:25)
	at com.microsoft.azure.synapse.ml.services.search.IndexParser.parseIndexJson$(AzureSearchAPI.scala:24)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.parseIndexJson(AzureSearch.scala:147)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.getVectorColConf(AzureSearch.scala:325)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.prepareDF(AzureSearch.scala:269)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.write(AzureSearch.scala:432)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.write(AzureSearch.scala:440)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter.write(AzureSearch.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations

github-actions bot commented Dec 1, 2023

Hey @DBundred-cfc 👋!
Thank you so much for reporting the issue/feature request 🚨.
Someone from SynapseML Team will be looking to triage this issue soon.
We appreciate your patience.

github-actions bot added the triage label Dec 1, 2023