If I understand the question correctly, I think the approach you want to take is:
I don't think this methodology is documented anywhere, but here's a script that hopefully illustrates the idea. Sorry, I wanted to write some working code for you but ran out of time, so this is more of a sketch:

```python
from splink.spark.linker import SparkLinker

# do your two linking models here
# my_nodes is then a list of all nodes from both models
# my_edges is then the UNION ALLed table of predictions
# You might need to dedupe it
linker = SparkLinker(
    my_nodes,
    settings,
    break_lineage_method="parquet",
    set_up_basic_logging=False,
)
# Register the combined edges as if they were the output of linker.predict()
df_predict = linker.register_table_predict(my_edges, overwrite=True)
# v is whatever match probability threshold you want to cluster at
linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=v
)
```
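To make the "UNION ALL, then dedupe" step more concrete, here is a minimal sketch of how the prediction tables might be combined. It uses pandas rather than Spark purely so that it runs standalone, and it rests on several assumptions that are not from the script above: the helper name `combine_edges` is hypothetical, every record's `unique_id` is assumed to be unique across both datasets, and when the same pair appears more than once the row with the highest `match_probability` is kept (one possible dedupe rule, not necessarily the right one for your data).

```python
import pandas as pd

PAIR_COLS = ["unique_id_l", "unique_id_r"]


def combine_edges(*edge_tables: pd.DataFrame) -> pd.DataFrame:
    """Stack pairwise prediction tables and keep one row per unordered pair.

    Hypothetical helper: assumes unique_id values do not collide across datasets.
    """
    # Equivalent of UNION ALL over the prediction tables
    edges = pd.concat(edge_tables, ignore_index=True)

    # Put each pair in a canonical order so (a, b) and (b, a) count as the same edge
    left, right = edges["unique_id_l"], edges["unique_id_r"]
    in_order = left <= right
    edges = edges.assign(
        unique_id_l=left.where(in_order, right),
        unique_id_r=right.where(in_order, left),
    )

    # Assumed dedupe rule: keep the highest-probability row for each pair
    return (
        edges.sort_values("match_probability", ascending=False)
        .drop_duplicates(subset=PAIR_COLS, keep="first")
        .sort_values(PAIR_COLS)
        .reset_index(drop=True)
    )


# Toy predictions from the two dedupe models and the link model
dedupe_a = pd.DataFrame(
    {"unique_id_l": ["a1"], "unique_id_r": ["a2"], "match_probability": [0.90]}
)
dedupe_b = pd.DataFrame(
    {"unique_id_l": ["b1"], "unique_id_r": ["b2"], "match_probability": [0.80]}
)
link_ab = pd.DataFrame(
    {
        "unique_id_l": ["a1", "b1"],
        "unique_id_r": ["b1", "a1"],
        "match_probability": [0.95, 0.70],
    }
)

combined = combine_edges(dedupe_a, dedupe_b, link_ab)
print(combined)
```

In this toy example the link model scores the same pair twice (once in each direction), and only the higher-scoring row survives. The same logic should translate to a Spark `unionByName` followed by a grouped maximum, but I haven't tested that here.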
-
Hi,
I'm looking at applying separate dedupe and link models, as recommended in #1601 (reply in thread), as a way to perform an initial clustering of two datasets and then a link between them, and I was hoping for a little more insight into how this 3-model pipeline would work in practice.
I want to go down this route as I have two datasets where some specific rules are easier to implement with two respective `dedupe_only` models, and these separate dedupes have provided nice pairwise predictions and clusters within the respective datasets. The aim here isn't really to remove duplicates in the datasets, I suppose, but rather to provide clusters of records which then need linking by the subsequent linker.
How can I actually apply the `link_only` model here? I build a `link_only` model and pass in the two datasets, but how could that take into account the previously identified clusters and predictions during training? Could I also instantiate the m/u parameters based on the two prior models? Can I use the outputs (pairwise or clusters) from the two previous models in some way?