
Match Type NULL_OR_BLANK causing zingg.block.Block NPE #818

Open
TXAggie2000 opened this issue Apr 12, 2024 · 70 comments

TXAggie2000 commented Apr 12, 2024

I am using 0.3.3 to train and dedupe a very simple dataset. The initial results matched too many incorrect values due to null fields. I went back and added NULL_OR_BLANK to the field definition and now I can't even get through training without failure. Here is the current field definition:

fieldDefinition = [
    {'fieldName':'id', 'matchType':'DONT USE', 'fields':'id', 'dataType':'"integer"'},
    {'fieldName':'email', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'email', 'dataType':'"string"'},
    {'fieldName':'firstname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'firstname', 'dataType':'"string"'},
    {'fieldName':'lastname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'lastname', 'dataType':'"string"'},
    {'fieldName':'phone', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'phone', 'dataType':'"string"'}
    ]

The 'phone' field was initially NUMERIC, but adding NULL_OR_BLANK to that caused failure. The above would sometimes get through a single training and labeling, but I was never able to train/label enough data before a failure would occur.

All we want to do is have null values not count as a match. How do I proceed?

Thanks,
Scott

@sonalgoyal (Member)

Can you please share the error message?

@TXAggie2000 (Author)

Here is the stacktrace:

Driver stacktrace:
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:240)
	... 100 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 16.0 failed 4 times, most recent failure: Lost task 14.3 in stage 16.0 (TID 167) (10.139.64.13 executor 5): java.lang.NullPointerException
	at zingg.block.Block$BlockFunction.call(Block.java:403)
	at zingg.block.Block$BlockFunction.call(Block.java:393)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:174)
	at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:126)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1697)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

@sonalgoyal (Member)

Thanks. Is the match type for the id field DONT_USE or DONT USE?

@TXAggie2000 (Author)

Sorry, it is DONT_USE

@sonalgoyal (Member)

Ok, thanks. Changing the field type of phone from numerical to string seems to be causing this, as the unmarked and/or marked data from earlier rounds would be a number and is now a string. How much training data do you have? Is it possible to start from scratch on a new model? Or, if you want, you could change the training data under the model folder through pyspark.

Hope that helps


TXAggie2000 commented Apr 15, 2024

I have essentially started from scratch each time. We currently aren't setting this up for incremental runs. I have cleared the directory each time and am creating a new database on Databricks.

EDIT: Just to clarify, this has been done several times, so anytime I have made any changes to the model, I start over. Again, I started from a new directory and am still getting the same error.

@sonalgoyal (Member)

Ok. Then this may be a bug in the code that is triggered by certain values in the data. Is it possible for you to share a test case and your config so we can reproduce this issue at our end?

@TXAggie2000 (Author)

Certainly. Besides the config, what exactly do you need from me? A sample set of data? I am using the Databricks Solution Accelerator for this.

@sonalgoyal (Member)

Yes, a sample dataset and config/python code should be good enough to get started on reproducing this.

@vikasgupta78 fyi

@TXAggie2000 (Author)

Sorry for the delay. Since this is personal data, I am having to generate mock data with the same fields and then run that through to make sure the error still occurs.


TXAggie2000 commented Apr 18, 2024

Here is the mock dataset:
MOCK_DATA.csv

Here is the field definition:

fieldDefinition = [
    {'fieldName':'id', 'matchType':'DONT_USE', 'fields':'id', 'dataType':'"integer"'},
    {'fieldName':'email', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'email', 'dataType':'"string"'},
    {'fieldName':'firstname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'firstname', 'dataType':'"string"'},
    {'fieldName':'lastname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'lastname', 'dataType':'"string"'},
    {'fieldName':'phone', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'phone', 'dataType':'"string"'}
    ]

The code is the same as the Databricks Solution Accelerator, with the exception that I removed the loading of the incremental data in 00.1 and copied the attached dataset into both downloads and initial. If following that document, the failure is during 01_Initial at the step 'Get Data (Run Once Per Cycle)', or sometimes it will get through that and fail during the next step, 'Perform Labeling (Run Repeatedly Until All Candidate Pairs Labeled)'. With NULL_OR_BLANK, I have never gotten past two iterations of those two steps. Without it, I was able to run and label repeatedly until I had enough matches to proceed.

Thanks,
Scott

@sonalgoyal (Member)

Thanks a lot @TXAggie2000, will take a look today.

@sonalgoyal (Member)

One question @TXAggie2000 - have you tried with Zingg 0.4.0?

@TXAggie2000 (Author)

One question @TXAggie2000 - have you tried with Zingg 0.4.0?

I have not. I had followed the Solution Accelerator which uses 0.3.3 and has success de-duping a few different datasets, but only started having issues when adding the extra match type.

@sonalgoyal (Member)

I see. I cannot locate the NULL_OR_BLANK type in 0.3.3. I would suggest trying on 0.4.0 to see if this problem persists.

@TXAggie2000 (Author)

I see. I cannot locate the NULL_OR_BLANK type in 0.3.3. I would suggest trying on 0.4.0 to see if this problem persists.

Understood. I will try it with 0.4.0 and I will let you know!

Thanks,
Scott

@TXAggie2000 (Author)

Tried the same code with 0.4.0 and am now getting the error:

Error: Failed to load class zingg.client.Client.

@vikasgupta78 (Collaborator)

Can you please share the steps you used to install 0.4.0 and also the spark/java version you are using?

@TXAggie2000 (Author)

@vikasgupta78 - I had modified the notebooks (config/setup) in the solution accelerator to download that version. I did notice that I was a minor Spark version off for that, so I am re-testing with Databricks runtime version 14.3 LTS.

@sonalgoyal (Member)

Cool. Please use dbr 14.2 and spark 3.5.0 with Zingg 0.4.0.


TXAggie2000 commented Apr 19, 2024

Okay, I ran it in 14.2, spark 3.5.0 with Zingg 0.4.0 and still have the same error:

Error: Failed to load class zingg.client.Client

Here is the code for the findTrainingData where it is failing:

def run_find_training_job():
  '''
  The purpose of this function is to run the Zingg findTraining job that generates
  candidate pairs from the initial set of data specified in the job's configuration
  '''
  
  # identify the find training job
  find_training_job = ZinggJob( config['job']['initial']['findTrainingData'], config['job']['databricks workspace url'], config['job']['api token'])
  
  # run the job and wait for its completion
  find_training_job.run_and_wait()

  return

config['job']['initial']['findTrainingData'] = zingg_initial_findTrainingData

Could this code have changed from 0.3.3 to 0.4.0?

Keep in mind that if I download 0.3.3 and remove the NULL_OR_BLANK everything runs as expected. I just update 8 lines of code to switch.

@vikasgupta78 (Collaborator)

You seem to be using an older notebook, please try https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb
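
For reference, that 0.4.0 notebook drives Zingg through the Python API (ZinggWithSpark) inside the notebook itself, rather than spark-submitting against the old zingg.client.Client entry point. A minimal sketch of that pattern -- the model id, directory and sample size below are placeholders, and the input/output pipes are omitted:

from zingg.client import Arguments, ClientOptions, ZinggWithSpark

# Placeholder arguments -- substitute your own model id, directory and data pipes
args = Arguments()
args.setModelId("contacts")
args.setZinggDir("/mnt/data/raw/zingg/models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Run the findTrainingData phase directly through the Python client
options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()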

@TXAggie2000 (Author)

@vikasgupta78 - Thank you! I will review and test over the weekend and let you know by Monday!

@TXAggie2000 (Author)

@vikasgupta78 - I got through one round of training/labeling. On the second pass at training the data, I got the following exception:

Py4JJavaError: An error occurred while calling o497.execute.
: zingg.common.client.ZinggClientException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 0 in stage 206.0 failed 4 times, most recent failure: Lost task 0.3 in stage 206.0 (TID 264) (10.139.64.10 executor driver): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.ml.stat.SummarizerBuffer.add(Summarizer.scala:476)
	at org.apache.spark.ml.stat.SummarizerBuffer.add(Summarizer.scala:552)
	at org.apache.spark.ml.stat.Summarizer$.$anonfun$getClassificationSummarizers$1(Summarizer.scala:235)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
	at scala.collection.TraversableOnce.aggregate(TraversableOnce.scala:260)
	at scala.collection.TraversableOnce.aggregate$(TraversableOnce.scala:260)
	at scala.collection.AbstractIterator.aggregate(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$4(RDD.scala:1316)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$6(RDD.scala:1317)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:896)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:896)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:936)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:103)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:939)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:831)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at zingg.common.core.executor.TrainingDataFinder.execute(TrainingDataFinder.java:139)
	at zingg.common.client.Client.execute(Client.java:251)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.lang.Thread.run(Thread.java:750)

@TXAggie2000 (Author)

These are my field definitions. I didn't see any examples in the docs for multiple match types, but it said it takes an array:

id = FieldDefinition("id", "integer", MatchType.DONT_USE)
email = FieldDefinition("email", "string", [MatchType.EMAIL,MatchType.NULL_OR_BLANK])
fname = FieldDefinition("firstname", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])
lname = FieldDefinition("lastname", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])
phone = FieldDefinition("phone", "string", [MatchType.NUMERIC,MatchType.NULL_OR_BLANK])
fieldDefs = [id, email, fname, lname, phone]
args.setFieldDefinition(fieldDefs)

@vikasgupta78 (Collaborator)

Did you change the definition after a round of training / labelling?

@vikasgupta78 (Collaborator)

Can you try with phone = FieldDefinition("phone", "string", [MatchType.FUZZY,MatchType.NULL_OR_BLANK])?

@vikasgupta78 (Collaborator)

I would be happy to try it out locally if there is test data you could share.

@sonalgoyal (Member)

I would be happy to try it out locally if there is test data you could share.

@vikasgupta78 #818 (comment)

@vikasgupta78 (Collaborator)

I tried the training samples; on running a fresh findTrainingData I am seeing the following in the logs:

24/04/29 14:59:15 WARN TrainingDataFinder: Read training samples 33 neg 0

=> the training samples are being used

@sonalgoyal (Member)

If you want to book time and the issue is still not resolved, please use the link in the docs, @TXAggie2000.

@TXAggie2000 (Author)

Thanks everyone. I am struggling to get decent results with it, and it seems the results are either not using the training data or not ignoring null or blank values. I've redone it several times. I will continue to train, but I am seeing a lot of records in a single cluster that were matched on just one column.

@vikasgupta78 (Collaborator)

I tried to replicate/fix the issue:

  1. Changed the match type for email from FUZZY to EMAIL:
    email = FieldDefinition("email", "string", MatchType.EMAIL,MatchType.NULL_OR_BLANK)

  2. Changed the order to:
    fieldDefs = [id, fname, lname, phone, email]

  3. Ran 10-15 rounds of findTrainingData and label

  4. In all those rounds only 1 pair came up that could be called a match
    => MOCK_DATA.csv is by and large free of duplicates

  5. On running match,
    all rows ended up in their own individual clusters
    => no dupes found
    => maybe the data we have is only partial, so that's why no matches are being found

I am attaching the Python file I used (renamed to txt) and the final output I got.

Let me know how we can help further.

output.csv

FebrlExample818.txt

@vikasgupta78 (Collaborator)

Also worth mentioning that in the final few rounds the model started converging: Zingg's predictions were in line with the input I gave during labelling.

@vikasgupta78 (Collaborator)

@TXAggie2000 did you get a chance to look into the results?

@TXAggie2000 (Author)

I am running it again. I had switched the email and phone order because those are more important for determining a match. There could be two different people within a company with the same email domain and phone number that we would consider to be a match. That would make those two fields more significant in that case, correct?

@vikasgupta78 (Collaborator)

@TXAggie2000 From what you are describing, does that mean you won't consider first name and last name in such cases? If in some cases you consider certain fields like first name and last name a match, but in other cases you don't, it will not work out. The model has to be consistent. Fields you don't want to consider should be don't-match.

E.g. if you consider 'x@y.com vikas gupta' a match with 'b@y.com sonal goyal',
but don't consider 'x@y.com vikas gupta' a match with 'b@z.com sonal goyal', it will just confuse the model and it will not work.

If you want to treat the domain as the company, it is better to split it into a separate field (a rough sketch is below).

In summary, be consistent in your training, otherwise you will not get good matches.
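
A rough sketch of that split, purely illustrative: it assumes the email column from the mock dataset and a PySpark DataFrame that feeds Zingg, and the email_domain field name is made up.

from pyspark.sql import functions as F

# Read the input data, e.g. the mock dataset attached above
df = spark.read.option("header", True).csv("MOCK_DATA.csv")

# Derive a separate domain column from the email address
df = df.withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))

# Give the domain its own field definition so it carries its own weight in matching
# (FieldDefinition / MatchType imported from the zingg client, as in the field definitions above)
email_domain = FieldDefinition("email_domain", "string", MatchType.EXACT, MatchType.NULL_OR_BLANK)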

@TXAggie2000 (Author)

I had some issues that I corrected, and stuck with your suggestion. I went through 4 training cycles over the course of the day (findTrainingData took about an hour each cycle). When done, I had 3 matches and 83 labeled as not matching. With the ~40 records of matching manual training data, trainMatch failed with not enough training data. After an additional round, it increased to 103 not matching. trainMatch did not error out, but the results were still subpar. For example, one cluster had 41 matches for John Smith where the phone numbers and emails were all different. Another cluster, where the first name and last name were both Mike, had 282 matches, but none of the phone numbers or emails matched. I am not sure if I just need to spend several days training the model to get different results, or whether there is a way to weight these fields differently, or at least equally.

@vikasgupta78 (Collaborator)

Did any of the pairs you marked as a match in training have different phone numbers and emails but the same name?

If you did this for any of them, Zingg would learn it the same way.

Please run --phase generateDocs and send me the files so that I can check if there is a problem with the training data.
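
A minimal sketch of kicking that phase off from the notebook, assuming the same args object you already build for the other phases:

# Switch the phase to generateDocs and re-run the client
options = ClientOptions([ClientOptions.PHASE, "generateDocs"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()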


TXAggie2000 commented May 7, 2024

The generateDocs phase ran with no issues, but there is no docs directory in my model directory. I ran the following block:


#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

It's interesting, because if I run the following:

displayHTML(open(DOCS_DIR+"model.html", 'r').read())

I can see the results, but if I print DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/, this location does not exist.

java.io.FileNotFoundException: No such file or directory /mnt/data/raw/zingg/models/contacts/docs

The model.html shows: Unmarked 0/144, Marked 144/144 (8 Matches, 103 Non-Matches, 33 Unsure)

No sign of the training data...


vikasgupta78 commented May 7, 2024

Training samples won't appear in generateDocs. Its purpose is to let you review the labelled data.

Regarding DOCS_DIR:

If you look at https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb

you need to assign DOCS_DIR = zinggDir + "/" + modelId + "/docs/"
and check that the docs were successfully generated with
dbutils.fs.ls('file:'+DOCS_DIR)

Please share model.html so that I can review it.

@TXAggie2000 (Author)

Yes, that example is what I ran. I showed what the path was when I ran print(DOCS_DIR), so you could see the path was set:

I can see the results, but if I print DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/, this location does not exist.

@vikasgupta78 (Collaborator)

What's your zinggDir?

@vikasgupta78 (Collaborator)

And what do you get when you run dbutils.fs.ls('file:'+DOCS_DIR)?

@TXAggie2000 (Author)

DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/
zinggDir: /mnt/data/raw/zingg/models
modelId: contacts

dbutils.fs.ls('file:'+DOCS_DIR): java.io.FileNotFoundException: No such file or directory /mnt/data/raw/zingg/models/contacts/docs


vikasgupta78 commented May 7, 2024

And if you do dbutils.fs.ls('file:'+zinggDir)?

@TXAggie2000 (Author)

[FileInfo(path='file:/mnt/data/raw/zingg/models/contacts/', name='contacts/', size=4096, modificationTime=1715090590510)]

@vikasgupta78 (Collaborator)

Also, if you are able to see model.html as you said above, can you please share it?

@TXAggie2000 (Author)

I believe the issue with saving it is that the notebook is referencing a mounted directory. If I run the following, it creates it in the mounted directory:

with open(DOCS_DIR+"model.html", 'r') as source_file:
    file_contents = source_file.read()

with open('/dbfs/'+zinggDir + "/" + modelId+"/docs/model.html", 'w') as destination_file:
    destination_file.write(file_contents)
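
For what it's worth, the same copy can presumably be done in one step with dbutils.fs.cp, using the same paths as above:

# Copy the generated docs from the driver-local path onto DBFS
dbutils.fs.cp('file:' + DOCS_DIR + "model.html",
              'dbfs:' + zinggDir + "/" + modelId + "/docs/model.html")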

@TXAggie2000 (Author)

How can I send this to you privately?

@TXAggie2000 (Author)

I did see that on one of my matches, one record does not have a phone number and the other does not have an email. This should have been marked as uncertain. Is there a way to correct that?

@TXAggie2000 (Author)

Also, at what point is the training data used? trainMatch?

@vikasgupta78 (Collaborator)

How can I send this to you privately?

You can send it 1-1 on Slack.

@vikasgupta78 (Collaborator)

Also, at what point is the training data used? trainMatch?

It is used in findTrainingData and trainMatch.

@vikasgupta78 (Collaborator)

I did see that on one of my matches, one record does not have a phone number and the other does not have an email. This should have been marked as uncertain. Is there a way to correct that?

Use the updateLabel phase.

@TXAggie2000 (Author)

It appears this phase requires interactive input, so I am guessing I will need to build out a notebook widget for the cluster id and feed that input into this phase, since we are running a notebook and not the command line. I did not see an example of this in the source code, unless I missed it.
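
The widget half of that idea might look something like the sketch below; the widget name is made up, and whether the updateLabel phase can actually consume input this way from a notebook is the open question:

# Hypothetical Databricks widget for supplying a cluster id
dbutils.widgets.text("cluster_id", "", "Cluster ID")
cluster_id = dbutils.widgets.get("cluster_id")
print(f"Cluster selected for relabeling: {cluster_id}")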

@vikasgupta78 (Collaborator)

How can I send this to you privately?

You can send it 1-1 on Slack: zinggai.slack.com

@TXAggie2000 (Author)

I wasn't able to log in to Slack. I thought I had an account at one point, but apparently not. I also wasn't able to get the --updateLabel phase to work in a notebook, so I just started over. I'm still not getting good results, so I will keep training.

@vikasgupta78 (Collaborator)

Just ensure that you are consistent in your training and also with the training samples. Prefer to use actual data for the training samples instead of handcrafted data.
