Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create_export_csv rounding should be sigfigs, not just digits #1945

Open
dsweber2 opened this issue Feb 15, 2024 · 2 comments
Open

create_export_csv rounding should be sigfigs, not just digits #1945

dsweber2 opened this issue Feb 15, 2024 · 2 comments
Labels
future-solution Solutions to problems we don't have yet but still dread good first issue

Comments

@dsweber2
Copy link
Contributor

Specifically, this line:

export_df = export_df.round({"val": 7, "se": 7})

Not currently a problem as far as I can tell, but if we ever get a signal whose data is of the size 1.423e-8, then this will round to zero.

I hacked together a function that does this correctly from some stack overflow code for nwss

def sig_digit_round(value, n_digits):
"""Round the number of significant digits (x.xxe5 would be 3) to `N`."""
in_value = value
value = np.asarray(value).copy()
zero_mask = (value == 0) | np.isinf(value) | np.isnan(value)
value[zero_mask] = 1.0
sign_mask = value < 0
value[sign_mask] *= -1
exponent = np.ceil(np.log10(value))
result = 10**exponent * np.round(value * 10 ** (-exponent), n_digits)
result[sign_mask] *= -1
result[zero_mask] = in_value[zero_mask]
return result

but it would be good to understand if there are good reasons this was written the way it is before simply replacing it.

@dsweber2 dsweber2 added good first issue future-solution Solutions to problems we don't have yet but still dread labels Feb 15, 2024
@melange396
Copy link
Contributor

Good call.

This rounding behavior was asked for in #769 and implemented in #779 (nearly exactly 3 years ago!), but neither the issue nor the PR give hints as to the reason or rationale for the decision. I was around at the time but not involved in this effort, and i cant remember why this might've happened... I have to imagine that it was done for a specific purpose instead of on a whim, but it could have been a premature optimization. Who knows?! Some possibilities for "why" could include:

  • to try to save some disk space by shrinking the CSV files a bit?
  • a part of the system somewhere else was using a weird representation for these values, like a fixed-point type (in python and/or MySQL), or storing them in strings of a bounded length?
  • similar to the above point: potential differences in the default float format or encoding between machine architectures or languages?
  • concerns about loss of precision from numerical instability?

@melange396
Copy link
Contributor

A nice intermediate step for this could be to move the sig_digit_round() method into a new delphi_utils module (in https://github.com/cmu-delphi/covidcast-indicators/tree/main/_delphi_utils_python/delphi_utils ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
future-solution Solutions to problems we don't have yet but still dread good first issue
Projects
None yet
Development

No branches or pull requests

2 participants