Point out gephi as a debugger #1096

Open
NickCrews opened this issue Sep 14, 2022 · 2 comments

@NickCrews
Contributor

This isn't a bug, I just wanted to point out a tool that has been really useful for me for debugging and analyzing the performance of my dedupe.

My workflow is

  1. generate pairs, score them, and cluster them
  2. create a networkx graph. The records are the nodes and the scores are the edges. I save the fields of each record as node attributes, and I also include the label from cluster() for each node. I try different runs with different threshold values to see what changes, or compare against my own clustering implementation.
  3. filter the network down to a manageable size. If it's too big, my computer can't handle the following analysis. What I've done is keep only nodes that are in large components, as they are some of the more interesting ones.
  4. export the graph to the .gexf format using networkx.write_gexf(my_graph, path)
  5. open the .gexf file in gephi and analyze
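
Steps 2 through 4 can be sketched roughly as follows. This is a minimal sketch, not dedupe's API: it assumes the records are a dict of record ID to field dict, the scored pairs are plain (id_a, id_b, score) tuples, and the cluster labels are a dict of record ID to cluster ID. The function names and the "cluster" attribute name are made up for illustration.

```python
import networkx as nx


def build_graph(records, scored_pairs, labels):
    """Build a graph with records as nodes and pair scores as edge weights.

    Assumed (hypothetical) input shapes:
      records:      {record_id: {field_name: value}}
      scored_pairs: iterable of (id_a, id_b, score)
      labels:       {record_id: cluster_id}
    """
    graph = nx.Graph()
    for record_id, fields in records.items():
        graph.add_node(record_id)
        # Replace None with "" so every attribute has a simple type
        # when the graph is exported to GEXF.
        attrs = {key: ("" if value is None else value) for key, value in fields.items()}
        graph.nodes[record_id].update(attrs)
        # -1 marks records that did not receive a cluster label.
        graph.nodes[record_id]["cluster"] = labels.get(record_id, -1)
    for id_a, id_b, score in scored_pairs:
        graph.add_edge(id_a, id_b, weight=float(score))
    return graph


def keep_large_components(graph, min_size):
    """Return a copy containing only connected components with >= min_size nodes."""
    keep = set()
    for component in nx.connected_components(graph):
        if len(component) >= min_size:
            keep |= component
    return graph.subgraph(keep).copy()


def export_for_gephi(graph, path, min_size):
    """Filter to large components and write the result as a .gexf file."""
    nx.write_gexf(keep_large_components(graph, min_size), path)
```

Building the graph once and re-running only the filter and export makes it cheap to try different component-size cut-offs.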

I can color each node by its label to see how the clustering performs:
image

and hover over individual nodes to see what the fields were for each node:
image

This has been indispensable for figuring out:

  1. Is my scoring sane? Are there pairs that are obviously wrong, meaning I need to adjust my metrics?
  2. How does the clustering perform on these scores? What should I set my threshold to?

Here is the .gexf file from the screenshot, if you want to download gephi and play around. This is from publicly available campaign donation data; the records are the donors for individual donations. I thresholded the scores to be either 0 or 1 to make things simpler. This only includes records in components larger than 140.
components_larger_than_140.gexf.zip
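
For anyone wanting to reproduce that kind of file, here is a rough sketch of the thresholding step. It assumes a networkx graph whose edges carry the score in a "weight" attribute, and it drops edges below the threshold rather than keeping them with weight 0 (an assumption; the attached file may have been produced differently). The function names are illustrative only.

```python
from collections import Counter

import networkx as nx


def binarize(graph, threshold):
    """Return a new graph keeping only edges scoring >= threshold, with weight 1.

    Edges below the threshold are dropped (assumed equivalent to a score of 0).
    """
    binary = nx.Graph()
    binary.add_nodes_from(graph.nodes(data=True))
    for id_a, id_b, score in graph.edges(data="weight", default=0.0):
        if score >= threshold:
            binary.add_edge(id_a, id_b, weight=1)
    return binary


def component_size_histogram(graph):
    """Map component size -> number of connected components of that size."""
    return Counter(len(component) for component in nx.connected_components(graph))
```

The histogram helps pick a component-size cut-off that leaves a graph small enough for gephi to handle.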

@tendres

tendres commented Sep 16, 2022 via email

@NickCrews
Contributor Author

NickCrews commented Sep 17, 2022

Thanks Tom. Glad that someone found it useful!

Depending on what you or Forest or others think about how universally valuable this is, perhaps we could add support for this sort of debugging? E.g. a dedupe.to_networkx(records: Iterable[Mapping[RecordID, RecordDict]], scores: np.ndarray | np.memmap) -> networkx.Graph utility method, and a short write-up in the docs, maybe on the troubleshooting page?
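
To make the idea concrete, here is one possible shape for such a utility. It is purely a sketch: no such function exists in dedupe, the records argument is simplified to a single mapping of record ID to field dict, and the scores are assumed to be a structured array with a two-element "pairs" field and a "score" field.

```python
from typing import Any, Mapping

import networkx as nx
import numpy as np


def to_networkx(records: Mapping[Any, Mapping[str, Any]], scores: np.ndarray) -> nx.Graph:
    """Hypothetical helper: records become nodes, scored pairs become weighted edges.

    Assumes `scores` is a structured array with fields "pairs" (two record IDs)
    and "score"; a np.memmap with the same dtype should work the same way.
    """
    graph = nx.Graph()
    for record_id, fields in records.items():
        graph.add_node(record_id)
        graph.nodes[record_id].update(fields)
    for (id_a, id_b), score in zip(scores["pairs"], scores["score"]):
        # .item() converts numpy scalars to plain Python values so the node
        # keys match the keys used in `records`.
        graph.add_edge(id_a.item(), id_b.item(), weight=float(score))
    return graph
```

Returning a plain networkx.Graph would leave filtering and export (for example with networkx.write_gexf) up to the caller.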

I'm not sure how else this issue is actionable, so it should get closed if we can't come up with something we want to change.
