Point out gephi as a debugger #1096

NickCrews · 2022-09-14T22:03:26Z

This isn't a bug, I just wanted to point out a tool that has been really useful for me for debugging and analyzing the performance of my dedupe.

My workflow is:

1. Generate pairs, score them, and cluster them.
2. Create a networkx graph. The records are the nodes in the graph and the scores are the edge weights. I save the fields of each record as node attributes. I also include the label from cluster() for each node. I try different runs with different threshold values to see what changes, and I've also implemented my own clustering to compare against.
3. Filter the network down to a manageable size. Too big and my computer can't handle the following analysis. What I've done is keep only the nodes that are in large components, as they are some of the more interesting ones.
4. Export the graph to the .gexf format using networkx.write_gexf(my_graph, path).
5. Open the .gexf file in gephi and analyze.
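
For anyone who wants to try the graph-building and export steps above, here is a minimal sketch. The inputs (`records`, `scored_pairs`, `cluster_labels`) and the helper `build_graph` are hypothetical stand-ins, not part of dedupe's API; they assume you have already turned your scores and cluster() output into plain Python structures. Field values are coerced to strings because GEXF attributes need simple scalar types.

```python
import os
import tempfile

import networkx as nx

# Hypothetical inputs standing in for real dedupe output.
records = {
    "a": {"name": "Jane Doe", "city": "Chicago"},
    "b": {"name": "Jane M. Doe", "city": "Chicago"},
    "c": {"name": "John Roe", "city": "Springfield"},
}
scored_pairs = [(("a", "b"), 0.93), (("b", "c"), 0.12)]  # ((id, id), score)
cluster_labels = {"a": 0, "b": 0, "c": 1}  # record id -> cluster label


def build_graph(records, scored_pairs, cluster_labels):
    """Records become nodes, pair scores become weighted edges."""
    graph = nx.Graph()
    for record_id, fields in records.items():
        # Coerce field values to str so the GEXF writer accepts them.
        attrs = {str(key): str(value) for key, value in fields.items()}
        # -1 marks records that did not end up in any cluster.
        attrs["cluster"] = cluster_labels.get(record_id, -1)
        graph.add_node(record_id, **attrs)
    for (id_a, id_b), score in scored_pairs:
        graph.add_edge(id_a, id_b, weight=float(score))
    return graph


graph = build_graph(records, scored_pairs, cluster_labels)

# write_gexf needs the graph and a destination path (or file object).
out_path = os.path.join(tempfile.gettempdir(), "dedupe_debug.gexf")
nx.write_gexf(graph, out_path)
```

In gephi, the `cluster` attribute can then be used for node coloring, and the edge weight carries the pair score.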

I can color each node by its label to see how clustering does:

and hover over individual nodes to see what the fields were for each node:

This has been indispensable for figuring out:

- Is my scoring sane? Are there pairs that are obviously wrong, meaning I need to adjust my metrics?
- How does the clustering perform on these scores? What should I set my threshold to?

Here is the .gexf file from the screenshot if you want to download gephi and play around. It comes from publicly available campaign donation data; the records are the donors for individual donations. I thresholded the scores to be either 0 or 1 to make things simpler. The file only includes records in components larger than 140.
components_larger_than_140.gexf.zip
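
The thresholding and component filtering described above could look roughly like the sketch below. The function name `keep_large_components` and its parameters are hypothetical, and it is one reading of "thresholded the scores to either be 0 or 1": edges scoring at or above the threshold are kept with weight 1.0, and everything else is dropped. It assumes edge scores are stored under the `weight` attribute.

```python
import networkx as nx


def keep_large_components(graph, score_threshold, min_size):
    """Binarize edge scores, then keep components with more than min_size nodes."""
    binarized = nx.Graph()
    binarized.add_nodes_from(graph.nodes(data=True))
    for u, v, data in graph.edges(data=True):
        # Edges below the threshold are treated as score 0, i.e. no edge.
        if data.get("weight", 0.0) >= score_threshold:
            binarized.add_edge(u, v, weight=1.0)

    keep = set()
    for component in nx.connected_components(binarized):
        if len(component) > min_size:
            keep.update(component)
    # copy() gives an independent graph rather than a read-only view.
    return binarized.subgraph(keep).copy()


# Tiny illustrative graph; real component sizes would be far larger.
example = nx.Graph()
example.add_weighted_edges_from(
    [(1, 2, 0.9), (2, 3, 0.8), (3, 4, 0.2), (4, 5, 0.95)]
)
filtered = keep_large_components(example, score_threshold=0.5, min_size=2)
```

With `min_size=140` this would mirror the "components larger than 140" cutoff used for the attached file.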

tendres · 2022-09-16T02:57:06Z

Nick - Thanks for writing up this post. I'm a long-time gephi/dedupe user, and kudos on a brilliant use case. Using gephi to look at scoring and clustering is brilliant; glad to see third-party tools being used to strengthen dedupe. Thanks again. -tom

NickCrews · 2022-09-17T19:56:47Z

Thanks Tom. Glad that someone found it useful!

Depending on what you or Forest or others think about how universally valuable this is, perhaps we could add support for this sort of debugging? E.g., a dedupe.to_networkx(records: Iterable[Mapping[RecordID, RecordDict]], scores: np.ndarray | np.memmap) -> networkx.Graph utility method, and a short write-up in the docs, maybe on the troubleshooting page?

I'm not sure how else this issue is actionable, so it should get closed if we can't come up with something we want to change.
