Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linker.estimate_u_using_random_sampling fails with default arguments, with no clear indication why #2165

Closed
ADBond opened this issue May 3, 2024 · 3 comments
Labels
bug Something isn't working splink4

Comments

@ADBond
Copy link
Contributor

ADBond commented May 3, 2024

Splink 4:

from splink import DuckDBAPI, Linker, SettingsCreator, splink_datasets
import splink.comparison_library as cl


settings = SettingsCreator("dedupe_only", [cl.ExactMatch("first_name"), cl.ExactMatch("surname")])


linker = Linker(splink_datasets.fake_1000, settings, DuckDBAPI())

linker.estimate_u_using_random_sampling()

Get the error: TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

Some options for behaviour:

  • max_pairs = None means we set proportion as 1, similar to if we have more max_pairs than possible record pairs
  • explicit error message if it is left as None, forcing user to set a non-default value making this argument non-optional
  • a directly valid default value
@ADBond ADBond added bug Something isn't working splink4 labels May 3, 2024
@RobinL
Copy link
Member

RobinL commented May 3, 2024

Tricky, especially since it's unlikely the user will have an intuition for the ''correct'/'best' parameter (it wouldn't be very good if they set it to like, 100 or something!)

I'd possibly set the default value to 1e6 which will be quick to compute and will produce 'reasonably decent' estimates. And make sure we try to ensure in examples/tutorials we always explicitly set it to sensible values

@ADBond
Copy link
Contributor Author

ADBond commented May 7, 2024

I'd possibly set the default value to 1e6 which will be quick to compute and will produce 'reasonably decent' estimates. And make sure we try to ensure in examples/tutorials we always explicitly set it to sensible values

My only concern with that is that we don't really have a mechanism for flagging that estimates may be unreliable - I'd worry some users might not look at the docs and just use the default, and then get (somewhat) unreliable estimates. I guess we could emit a warning if it's left at this value that they may want to increase the number for more serious estimation?

@RobinL
Copy link
Member

RobinL commented May 8, 2024

Agreed - i think a warning if it's left as the default should suffice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working splink4
Projects
None yet
Development

No branches or pull requests

2 participants