Unable to supersede IdentityToZeroTransformation and NullToZeroTransformation #224

Open
Jiaweihu08 opened this issue Oct 26, 2023 · 1 comment

Comments

@Jiaweihu08
Member

What went wrong?

Both IdentityToZeroTransformation and NullToZeroTransformation handle special cases in which LinearTransformer is used to map numeric columns but the column values are either all identical or all null. Ideally, both should be superseded by LinearTransformation instances when "regular" data is appended. Currently, that does not happen.

How to reproduce?

For example, with IdentityToZeroTransformation (NullToZeroTransformation behaves similarly):

import org.apache.spark.sql.delta.DeltaLog
import io.qbeast.spark.delta.DeltaQbeastSnapshot
import io.qbeast.core.transform.IdentityToZeroTransformation
import spark.implicits._

case class IdentityCls(col1: String, col2: Int, col3: Double)

val idTestPath = "/tmp/test1/"
val identityData = (1 to 1000).map(_ => IdentityCls("1", 1, 1d)).toDS()
(identityData
  .write
  .mode("overwrite")
  .option("columnsToIndex", "col2")
  .option("cubeSize", "10000")
  .format("qbeast")
  .save(idTestPath)
)

(DeltaQbeastSnapshot(DeltaLog.forTable(spark, idTestPath)
  .update())
  .loadLatestRevision
  .transformations
  .head
  .isInstanceOf[IdentityToZeroTransformation]
) // true

// Appending regular data fails with scala.MatchError at io.qbeast.core.transform.IdentityToZeroTransformation.transform(Transformation.scala:56)
((1 to 1000)
  .map(i => IdentityCls(s"$i", i, i.toDouble))
  .toDS()
  .write
  .mode("append")
  .format("qbeast")
  .save(idTestPath)
)
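
One way such a MatchError can arise is a `transform` that pattern-matches only on the single value seen at indexing time. The following is a minimal sketch of that failure mode under that assumption; `IdentityToZeroSketch` is a hypothetical stand-in and not the qbeast implementation:

```scala
// Hypothetical stand-in, not the actual qbeast class: it maps the one
// value seen at indexing time to 0.0 and has no case for anything else.
case class IdentityToZeroSketch(identityValue: Any) {
  def transform(value: Any): Double = value match {
    case v if v == identityValue => 0.0
    // No fallback case: any other value throws scala.MatchError.
  }
}

object MatchErrorDemo {
  def main(args: Array[String]): Unit = {
    val t = IdentityToZeroSketch(1)
    // The identity value itself is handled.
    println(t.transform(1))
    // A "regular" value, like those in the appended data, is not.
    val result = scala.util.Try(t.transform(2))
    println(result.isFailure)
  }
}
```

If the real class behaves like this sketch, the append fails as soon as a row carries a value other than the original identity value, unless the transformation has been superseded first.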

2. Branch and commit id:

main, f066acf

3. Spark version:

3.4.1

4. Hadoop version:

3.3.4

5. How are you running Spark?

Locally

@Jiaweihu08 Jiaweihu08 added the bug Something isn't working label Oct 26, 2023
@Jiaweihu08 Jiaweihu08 self-assigned this Oct 26, 2023
@osopardo1
Member

My initial thoughts on this:

  1. IdentityTransformation should NOT be superseded by another IdentityTransformation. (By definition, the space of Identity A, its single value a, is not covered by Identity B unless a and b are the same.)
  2. IdentityTransformation should NOT be superseded by a NullToZeroTransformation, for the same reason as in case 1.
  3. IdentityTransformation might be superseded by a LinearTransformation if its min and max cover the identity value.

Now, in cases 1 and 2, we might need to trigger a different type of transformation, such as a LinearTransformation whose range includes the values from both A and B.
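
A rough sketch of how these rules could look, using a simplified, hypothetical model. The `Sketch` types and the `merge` function are illustrative only and are not the qbeast API; keeping the identity in case 2 and widening the range when the identity value falls outside it are assumptions:

```scala
object SupersedeSketch {
  // Simplified stand-ins for the transformations discussed above.
  sealed trait Sketch
  case class Identity(value: Double) extends Sketch
  case object NullToZero extends Sketch
  case class Linear(min: Double, max: Double) extends Sketch

  // Returns the transformation to use after appending data described by `incoming`.
  def merge(current: Sketch, incoming: Sketch): Sketch = (current, incoming) match {
    // Case 1: the same value keeps the identity; different values
    // trigger a Linear range that includes both of them.
    case (Identity(a), Identity(b)) =>
      if (a == b) current else Linear(math.min(a, b), math.max(a, b))
    // Case 2 (assumption): all-null data adds no values, so the identity is kept.
    case (Identity(_), NullToZero) => current
    // Case 3: a Linear range that covers the identity value supersedes it;
    // otherwise (assumption) the range is widened to include that value.
    case (Identity(a), Linear(lo, hi)) =>
      if (lo <= a && a <= hi) incoming
      else Linear(math.min(a, lo), math.max(a, hi))
    // The all-null placeholder gives way to whatever arrives next.
    case (NullToZero, other) => other
    // Combinations not covered in this discussion are left unchanged here.
    case (Linear(_, _), _) => current
  }
}
```

For the reproduction in this issue, `SupersedeSketch.merge(Identity(1.0), Linear(1.0, 1000.0))` yields the Linear range, which is the behavior the append would need.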

Does it make sense?
