Is Redshift supported as a data source? #522

Open
jbleduigou opened this issue Nov 27, 2023 · 0 comments
Labels
question Further information is requested

Comments

@jbleduigou

Hello,

I have been testing Deequ.
So far I have had mixed results when using Redshift as a data source.

I am using the Spark Redshift library (spark-redshift community build 6.1.0 for Spark 3.4) to load a DataFrame from Redshift.
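
For context, a read through this connector generally looks like the sketch below. It is illustrative only: the JDBC URL, S3 staging path, and IAM role are placeholders, and the table name `event` is an assumption (the TICKIT sample table that contains `eventid`), not something taken from the actual setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("deequ-redshift-check").getOrCreate()

// Every option value here is a placeholder, not a value from the report.
val df: DataFrame = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev?user=<user>&password=<password>")
  .option("dbtable", "event")                     // assumed sample table containing eventid
  .option("tempdir", "s3a://<bucket>/<prefix>/")  // S3 staging location the connector unloads to
  .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<role>")
  .load()
```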

One example of the problems I ran into is a uniqueness check on a single column.
It fails with the following error:

java.sql.SQLException: Exception thrown in awaitResult: 
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:172) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:145) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.UnloadDataToS3(RedshiftRelation.scala:328) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.$anonfun$buildScanFromSQL$1(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at scala.Option.orElse(Option.scala:447) ~[scala-library-2.12.17.jar:?]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.buildScanFromSQL(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:53) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:49) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: cannot cast type boolean to double precision
        at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2613) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2281) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1878) ~[redshift-jdbc42-2.1.0.23.jar:?]

The data itself is the Sample Database provided by AWS.
The verification code is as follows:

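    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel}

    // df is the DataFrame loaded from Redshift
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "Data Quality Checks")
          .isUnique("eventid") // every eventid value must occur exactly once
      )
      .run()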
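
The trace passes through the connector's pushdown package (`RedshiftScanExec`, then `UnloadDataToS3`), which suggests the failing cast is in SQL generated for Redshift rather than in Spark itself. One possible way to narrow this down, sketched below and not verified against this setup, is to repeat the check on a DataFrame read with query pushdown disabled. `loadWithoutPushdown` is a hypothetical helper, and the sketch assumes the connector's `autopushdown` option behaves as its README describes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: reads with the caller's usual connector options,
// but asks the connector not to push Spark query plans down to Redshift.
def loadWithoutPushdown(spark: SparkSession, options: Map[String, String]): DataFrame =
  spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .options(options)                  // url, dbtable, tempdir, credentials, ...
    .option("autopushdown", "false")   // assumption: disables query pushdown
    .load()
```

If the uniqueness check passes on the resulting DataFrame, the problem most likely lies in how the aggregate is translated to Redshift SQL, not in the check itself.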

Does Deequ support Redshift as a data source?

@jbleduigou jbleduigou added the question Further information is requested label Nov 27, 2023