Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default dtypes & core.construction #188

Open
smitkadvani opened this issue Feb 12, 2024 · 4 comments
Open

default dtypes & core.construction #188

smitkadvani opened this issue Feb 12, 2024 · 4 comments

Comments

@smitkadvani
Copy link
Contributor

smitkadvani commented Feb 12, 2024

looking at tests, there is a lot of boilerplate that could be reduced, and tests could be made more readable, if we could specify dtypes for functions in core.construction (including from_any, from_list, and from_series).

for example:

    df1 = pd.DataFrame(
        [
            ['chr1', 1, 1]
        ],
        columns=['chrom','start','end']
    ).astype({"start": pd.Int64Dtype(), "end": pd.Int64Dtype()})

would become

df1 = bf.from_any(['chr1', 1, 1])

We provide a dictionary for default columns names in core.specs, however there does not seem to be a dictionary (or other specification) for default dtypes.

One option would be to add them right after the default column names in core.specs:
https://github.com/open2c/bioframe/blob/main/bioframe/core/specs.py#L11C1-L12C1

If added, should they be int, pd.Int64Dtype(), or something else for start and end?

@gfudenberg
Copy link
Member

@nvictus @golobor any suggestions?
I guess the idea would be:

_rc = {"colnames": {"chrom": "chrom", "start": "start", "end": "end"},
    "col_dtpyes": {"chrom": str, "start": pd.Int64Dtype(), "end": pd.Int64Dtype()}
}

and something like:

def from_any(regions, fill_null=False, name_col="name", cols=None, col_dtypes=None):
    ck1_dtype, sk1_dtype, ek1_dtype = _get_default_col_dtypes() if cols is None else cols

@nvictus
Copy link
Member

nvictus commented Mar 20, 2024

Why col_dtypes as opposed to dtypes?

@gfudenberg
Copy link
Member

gfudenberg commented Mar 20, 2024

I guess we could say the same thing about colnames vs cols... as keys in the _rc dictionary (which could perhaps also get a less cryptic name)?

@golobor
Copy link
Member

golobor commented Mar 20, 2024

rc is borrowed from matplotlib which borrowed it from Linux: https://stackoverflow.com/questions/11030552/what-does-rc-mean-in-dot-files . It doesn't have to be named like that, can be, for example, _conf.

pd.Int64Dtype() is probably better than int, right? I think we use it quite extensively throughout the library, so we might as well make it the default - this way we'll have more consistent NaNs.

names/dtypes is fine with me! We can also use a nested dict or a SimpleNamespace/NamedTuple (not bad, since we only need to read/modify variables with existing keys):

_conf = dict(
    names = dict(chrom='chrom', start='start', end='end'),
    dtypes = dict(...)
)

or:

import types
_conf = types.SimpleNamespace()
_conf.col =  types.SimpleNamespace()
_conf.col.names =  types.SimpleNamespace()
_conf.col.names.chrom=chrom
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants