
Match Type NULL_OR_BLANK causing zingg.block.Block NPE #818

Open
TXAggie2000 opened this issue Apr 12, 2024 · 70 comments

TXAggie2000 commented Apr 12, 2024

I am using 0.3.3 to train and dedupe a very simple dataset. The initial results matched too many incorrect values due to null fields. I went back and added NULL_OR_BLANK to the field definition and now I can't even get through training without failure. Here is the current field definition:

fieldDefinition = [
    {'fieldName':'id', 'matchType':'DONT USE', 'fields':'id', 'dataType':'"integer"'},
    {'fieldName':'email', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'email', 'dataType':'"string"'},
    {'fieldName':'firstname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'firstname', 'dataType':'"string"'},
    {'fieldName':'lastname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'lastname', 'dataType':'"string"'},
    {'fieldName':'phone', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'phone', 'dataType':'"string"'}
    ]

The 'phone' field was initially NUMERIC, but adding NULL_OR_BLANK to that caused failure. The above would sometimes get through a single training and labeling, but I was never able to train/label enough data before a failure would occur.

All we want to do is have null values not count as a match. How do I proceed?

Thanks,
Scott

@sonalgoyal (Member)

Can you please share the error message?

@TXAggie2000 (Author)

Here is the stacktrace:

Driver stacktrace:
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:240)
	... 100 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 16.0 failed 4 times, most recent failure: Lost task 14.3 in stage 16.0 (TID 167) (10.139.64.13 executor 5): java.lang.NullPointerException
	at zingg.block.Block$BlockFunction.call(Block.java:403)
	at zingg.block.Block$BlockFunction.call(Block.java:393)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:174)
	at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:126)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1697)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

@sonalgoyal (Member)

Thanks. Is the match type for the id field DONT_USE or DONT USE?

@TXAggie2000 (Author)

Sorry, it is DONT_USE

@sonalgoyal (Member)

Ok, thanks. Changing the field type of phone from numerical to string seems to be causing this, as the unmarked and/or marked data from earlier rounds would be a number and is now a string. How much training data do you have? Is it possible to start from scratch on a new model? Or, if you want, you could change the training data under the model folder through pyspark.

Hope that helps


TXAggie2000 commented Apr 15, 2024

I have essentially started from scratch each time. We currently aren't setting this up for incremental runs. I have cleared the directory each time and am creating a new database on Databricks.

EDIT: Just to clarify, this has been done several times, so anytime I have made any changes to the model, I start over. Again, I started from a new directory and am still getting the same error.

@sonalgoyal (Member)

Ok. Then this may be a bug in the code that is triggered by certain values in the data. Is it possible for you to share a test case and your config so we can reproduce this issue at our end?

@TXAggie2000 (Author)

Certainly. Besides the config, what exactly do you need from me? A sample set of data? I am using the Databricks Solution Accelerator for this.

@sonalgoyal (Member)

Yes, a sample dataset and config/python code should be good enough to get started on reproducing this.

@vikasgupta78 fyi

@TXAggie2000 (Author)

Sorry for the delay. Since this is personal data, I am having to generate mock data with the same fields and then run that through to make sure the error still occurs.


TXAggie2000 commented Apr 18, 2024

Here is the mock dataset:
MOCK_DATA.csv

Here is the field definition:

fieldDefinition = [
    {'fieldName':'id', 'matchType':'DONT_USE', 'fields':'id', 'dataType':'"integer"'},
    {'fieldName':'email', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'email', 'dataType':'"string"'},
    {'fieldName':'firstname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'firstname', 'dataType':'"string"'},
    {'fieldName':'lastname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'lastname', 'dataType':'"string"'},
    {'fieldName':'phone', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'phone', 'dataType':'"string"'}
    ]

The code is the same as the Databricks Solution Accelerator, with the exception that I removed the loading of the incremental data in 00.1 and copied the attached dataset into both downloads and initial. If following that document, the failure is during 01_Initial at the step 'Get Data (Run Once Per Cycle)', or sometimes it will get through that and fail during the next step, 'Perform Labeling (Run Repeatedly Until All Candidate Pairs Labeled)'. With NULL_OR_BLANK, I have never gotten past two iterations of those two steps. Without it, I was able to run and label repeatedly until I had enough matches to proceed.

Thanks,
Scott

@sonalgoyal (Member)

Thanks a lot @TXAggie2000, will take a look today.

@sonalgoyal (Member)

One question @TXAggie2000 - have you tried with Zingg 0.4.0?

@TXAggie2000 (Author)

One question @TXAggie2000 - have you tried with Zingg 0.4.0?

I have not. I had followed the Solution Accelerator which uses 0.3.3 and has success de-duping a few different datasets, but only started having issues when adding the extra match type.

@sonalgoyal (Member)

I see. I cannot locate the NULL_OR_BLANK type in 0.3.3. I would suggest trying on 0.4.0 to see if this problem persists.

@TXAggie2000 (Author)

I see. I cannot locate the NULL_OR_BLANK type in 0.3.3. I would suggest trying on 0.4.0 to see if this problem persists.

Understood. I will try it with 0.4.0 and I will let you know!

Thanks,
Scott

@TXAggie2000 (Author)

Tried the same code with 0.4.0 and am now getting the error:

Error: Failed to load class zingg.client.Client.

@vikasgupta78 (Collaborator)

Can you please share the steps you used to install 0.4.0 and also the spark/java version you are using?

@TXAggie2000 (Author)

@vikasgupta78 - I had modified the notebooks (config/setup) in the solution accelerator to download that version. I did notice that I was a minor Spark version off for that, so I am re-testing with Databricks runtime version 14.3 LTS.

@sonalgoyal (Member)

Cool. Please use dbr 14.2 and spark 3.5.0 with Zingg 0.4.0.


TXAggie2000 commented Apr 19, 2024

Okay, I ran it in 14.2, spark 3.5.0 with Zingg 0.4.0 and still have the same error:

Error: Failed to load class zingg.client.Client

Here is the code for the findTrainingData where it is failing:

def run_find_training_job():
  '''
  The purpose of this function is to run the Zingg findTraining job that generates
  candidate pairs from the initial set of data specified in the job's configuration
  '''
  
  # identify the find training job
  find_training_job = ZinggJob( config['job']['initial']['findTrainingData'], config['job']['databricks workspace url'], config['job']['api token'])
  
  # run the job and wait for its completion
  find_training_job.run_and_wait()

  return

config['job']['initial']['findTrainingData'] = zingg_initial_findTrainingData

Could this code have changed from 0.3.3 to 0.4.0?

Keep in mind that if I download 0.3.3 and remove the NULL_OR_BLANK everything runs as expected. I just update 8 lines of code to switch.

@vikasgupta78 (Collaborator)

You seem to be using an older notebook, please try https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb
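
For reference, that 0.4.0 notebook drives Zingg through the Python API (ZinggWithSpark) inside the notebook itself, rather than spark-submitting against the old zingg.client.Client entry point. A minimal sketch of that pattern -- the model id, directory and sample size below are placeholders, and the input/output pipes are omitted:

from zingg.client import Arguments, ClientOptions, ZinggWithSpark

# Placeholder arguments -- substitute your own model id, directory and data pipes
args = Arguments()
args.setModelId("contacts")
args.setZinggDir("/mnt/data/raw/zingg/models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Run the findTrainingData phase directly through the Python client
options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()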

@TXAggie2000 (Author)

@vikasgupta78 - Thank you! I will review and test over the weekend and let you know by Monday!

@TXAggie2000 (Author)

@vikasgupta78 - I got through one round of training/labeling. On the second pass at training the data, I got the following exception:

Py4JJavaError: An error occurred while calling o497.execute.
: zingg.common.client.ZinggClientException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 0 in stage 206.0 failed 4 times, most recent failure: Lost task 0.3 in stage 206.0 (TID 264) (10.139.64.10 executor driver): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.ml.stat.SummarizerBuffer.add(Summarizer.scala:476)
	at org.apache.spark.ml.stat.SummarizerBuffer.add(Summarizer.scala:552)
	at org.apache.spark.ml.stat.Summarizer$.$anonfun$getClassificationSummarizers$1(Summarizer.scala:235)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
	at scala.collection.TraversableOnce.aggregate(TraversableOnce.scala:260)
	at scala.collection.TraversableOnce.aggregate$(TraversableOnce.scala:260)
	at scala.collection.AbstractIterator.aggregate(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$4(RDD.scala:1316)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$6(RDD.scala:1317)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:896)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:896)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:936)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:103)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:939)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:831)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at zingg.common.core.executor.TrainingDataFinder.execute(TrainingDataFinder.java:139)
	at zingg.common.client.Client.execute(Client.java:251)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.lang.Thread.run(Thread.java:750)

@TXAggie2000 (Author)

These are my field definitions. I didn't see any examples in the docs for multiple match types, but it said it takes an array:

id = FieldDefinition("id", "integer", MatchType.DONT_USE)
email = FieldDefinition("email", "string", [MatchType.EMAIL,MatchType.NULL_OR_BLANK])
fname = FieldDefinition("firstname", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])
lname = FieldDefinition("lastname", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])
phone = FieldDefinition("phone", "string", [MatchType.NUMERIC,MatchType.NULL_OR_BLANK])
fieldDefs = [id, email, fname, lname, phone]
args.setFieldDefinition(fieldDefs)

@vikasgupta78 (Collaborator)

Did you change the definition after a round of training / labelling?

@vikasgupta78 (Collaborator)

Can you try with phone = FieldDefinition("phone", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])?

@vikasgupta78 (Collaborator)

I would be happy to try it out locally if there is test data you could share.

@sonalgoyal (Member)

I would be happy to try it out locally if there is test data you could share.

@vikasgupta78 #818 (comment)

@vikasgupta78 (Collaborator)

I tried the training samples; on running a fresh findTrainingData I am seeing the following in the logs:

24/04/29 14:59:15 WARN TrainingDataFinder: Read training samples 33 neg 0

=> the training samples are being used

@sonalgoyal (Member)

If you want to book time and the issue is still not resolved, please use the link in the docs, @TXAggie2000.

@TXAggie2000 (Author)

Thanks everyone. I am struggling to get decent results with it, and it seems the results are either not using the training data or not ignoring null or blank values. I've redone it several times. I will continue to train, but I am seeing a lot of records in a single cluster that were matched on just one column.

@vikasgupta78 (Collaborator)

I tried to replicate/fix the issue:

  1. Changed the match type for email from FUZZY to EMAIL:
    email = FieldDefinition("email", "string", MatchType.EMAIL,MatchType.NULL_OR_BLANK)

  2. Changed the order to:
    fieldDefs = [id, fname, lname, phone, email]

  3. Ran 10-15 rounds of findTrainingData and label

  4. In all those rounds only 1 pair came up that could be called a match
    => MOCK_DATA.csv is by and large free of duplicates

  5. On running match,
    all rows ended up in their own individual clusters
    => no dupes found
    => maybe the data we have is only partial, so that's why no matches are being found

I am attaching the Python file I used (renamed to txt) and the final output I got.

Let me know how we can help further.

output.csv

FebrlExample818.txt

@vikasgupta78 (Collaborator)

Also worth mentioning that in the final few rounds the model started converging: Zingg's predictions were in line with the input I gave during labelling.

@vikasgupta78 (Collaborator)

@TXAggie2000 did you get a chance to look into the results?

@TXAggie2000 (Author)

I am running it again. I had switched the email and phone order because those are more important for determining a match. There could be two different people within a company with the same email domain and phone number that we would consider to be a match. That would make those two fields more significant in that case, correct?

@vikasgupta78 (Collaborator)

@TXAggie2000 From what you are describing, does that mean you won't consider first name and last name in such cases? If in some cases you consider certain fields like first name and last name a match, but in other cases you don't, it will not work out. The model has to be consistent. Fields you don't want to consider should be don't-match.

E.g. if you consider 'x@y.com vikas gupta' a match with 'b@y.com sonal goyal',
but don't consider 'x@y.com vikas gupta' a match with 'b@z.com sonal goyal', it will just confuse the model and it will not work.

If you want to treat the domain as the company, it is better to split it into a separate field (a rough sketch is below).

In summary, be consistent in your training, otherwise you will not get good matches.
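
A rough sketch of that split, purely illustrative: it assumes the email column from the mock dataset and a PySpark DataFrame that feeds Zingg, and the email_domain field name is made up.

from pyspark.sql import functions as F

# Read the input data, e.g. the mock dataset attached above
df = spark.read.option("header", True).csv("MOCK_DATA.csv")

# Derive a separate domain column from the email address
df = df.withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))

# Give the domain its own field definition so it carries its own weight in matching
# (FieldDefinition / MatchType imported from the zingg client, as in the field definitions above)
email_domain = FieldDefinition("email_domain", "string", MatchType.EXACT, MatchType.NULL_OR_BLANK)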

@TXAggie2000 (Author)

I had some issues that I corrected, and stuck with your suggestion. I went through 4 training cycles over the course of the day (findTrainingData took about an hour each cycle). When done, I had 3 matches and 83 labeled as not matching. With the ~40 records of matching manual training data, trainMatch failed with not enough training data. After an additional round, it increased to 103 not matching. trainMatch did not error out, but the results were still subpar. For example, one cluster had 41 matches for John Smith where the phone numbers and emails were all different. Another cluster, where the first name and last name were both Mike, had 282 matches, but none of the phone numbers or emails matched. I am not sure if I just need to spend several days training the model to get different results, or whether there is a way to weight these fields differently, or at least equally.

@vikasgupta78 (Collaborator)

Did any of the pairs you marked as a match in training have different phone numbers and emails but the same name?

If you did this for any of them, Zingg would learn it the same way.

Please run --phase generateDocs and send me the files so that I can check if there is a problem with the training data.
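
A minimal sketch of kicking that phase off from the notebook, assuming the same args object you already build for the other phases:

# Switch the phase to generateDocs and re-run the client
options = ClientOptions([ClientOptions.PHASE, "generateDocs"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()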


TXAggie2000 commented May 7, 2024

The generateDocs phase ran with no issues, but there is no docs directory in my model directory. I ran the following block:


#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

It's interesting, because if I run the following:

displayHTML(open(DOCS_DIR+"model.html", 'r').read())

I can see the results, but if I print DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/, this location does not exist.

java.io.FileNotFoundException: No such file or directory /mnt/data/raw/zingg/models/contacts/docs

The model.html shows: Unmarked 0/144, Marked 144/144 (8 Matches, 103 Non-Matches, 33 Unsure)

No sign of the training data...


vikasgupta78 commented May 7, 2024

Training samples won't appear in generateDocs. Its purpose is to let you review the labelled data.

Regarding DOCS_DIR:

If you look at https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb

you need to assign DOCS_DIR = zinggDir + "/" + modelId + "/docs/"
and check that the docs were successfully generated with
dbutils.fs.ls('file:'+DOCS_DIR)

Please share model.html so that I can review it.

@TXAggie2000 (Author)

Yes, that example is what I ran. I showed what the path was when I ran print(DOCS_DIR), so you could see the path was set:

I can see the results, but if I print DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/, this location does not exist.

@vikasgupta78 (Collaborator)

What's your zinggDir?

@vikasgupta78 (Collaborator)

And what do you get when you run dbutils.fs.ls('file:'+DOCS_DIR)?

@TXAggie2000 (Author)

DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/
zinggDir: /mnt/data/raw/zingg/models
modelId: contacts

dbutils.fs.ls('file:'+DOCS_DIR): java.io.FileNotFoundException: No such file or directory /mnt/data/raw/zingg/models/contacts/docs


vikasgupta78 commented May 7, 2024

And if you do dbutils.fs.ls('file:'+zinggDir)?

@TXAggie2000 (Author)

[FileInfo(path='file:/mnt/data/raw/zingg/models/contacts/', name='contacts/', size=4096, modificationTime=1715090590510)]

@vikasgupta78 (Collaborator)

Also, if you are able to see model.html as you said above, can you please share it?

@TXAggie2000 (Author)

I believe the issue with saving it is that the notebook is referencing a mounted directory. If I run the following, it creates it in the mounted directory:

with open(DOCS_DIR+"model.html", 'r') as source_file:
    file_contents = source_file.read()

with open('/dbfs/'+zinggDir + "/" + modelId+"/docs/model.html", 'w') as destination_file:
    destination_file.write(file_contents)
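
For what it's worth, the same copy can presumably be done in one step with dbutils.fs.cp, using the same paths as above:

# Copy the generated docs from the driver-local path onto DBFS
dbutils.fs.cp('file:' + DOCS_DIR + "model.html",
              'dbfs:' + zinggDir + "/" + modelId + "/docs/model.html")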

@TXAggie2000 (Author)

How can I send this to you privately?

@TXAggie2000 (Author)

I did see that on one of my matches, one record does not have a phone number and the other does not have an email. This should have been marked as uncertain. Is there a way to correct that?

@TXAggie2000 (Author)

Also, at what point is the training data used? trainMatch?

@vikasgupta78 (Collaborator)

How can I send this to you privately?

You can send it 1-1 on Slack.

@vikasgupta78 (Collaborator)

Also, at what point is the training data used? trainMatch?

It is used in findTrainingData and trainMatch.

@vikasgupta78 (Collaborator)

I did see that on one of my matches, one record does not have a phone number and the other does not have an email. This should have been marked as uncertain. Is there a way to correct that?

Use the updateLabel phase.

@TXAggie2000 (Author)

It appears this phase requires interactive input, so I am guessing I will need to build out a notebook widget for the cluster id and feed that input into this phase, since we are running a notebook and not the command line. I did not see an example of this in the source code, unless I missed it.
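
The widget half of that idea might look something like the sketch below; the widget name is made up, and whether the updateLabel phase can actually consume input this way from a notebook is the open question:

# Hypothetical Databricks widget for supplying a cluster id
dbutils.widgets.text("cluster_id", "", "Cluster ID")
cluster_id = dbutils.widgets.get("cluster_id")
print(f"Cluster selected for relabeling: {cluster_id}")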

@vikasgupta78 (Collaborator)

How can I send this to you privately?

You can send it 1-1 on Slack: zinggai.slack.com

@TXAggie2000 (Author)

I wasn't able to log in to Slack. I thought I had an account at one point, but apparently not. I also wasn't able to get the --updateLabel phase to work in a notebook, so I just started over. I'm still not getting good results, so I will keep training.

@vikasgupta78 (Collaborator)

Just ensure that you are consistent in your training and also with the training samples. Prefer to use actual data for the training samples instead of handcrafted data.
