updates for schemas #123

Open
4 tasks
gfudenberg opened this issue Sep 28, 2022 · 8 comments

Comments

@gfudenberg
Member

gfudenberg commented Sep 28, 2022

  • consider adding a rename mapper to read_table(), since we would often do the following:
rmsk.rename(
    columns={"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
    inplace=True,
)
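A minimal sketch of what such a rename mapper could look like. The wrapper name `read_table_with_rename` and its `rename` parameter are hypothetical and not part of bioframe; the sketch reads with plain `pandas.read_csv` rather than `read_table()`, and the sample row is made up for illustration.

```python
import io

import pandas as pd


def read_table_with_rename(filepath_or, rename=None, **kwargs):
    """Hypothetical wrapper: read a tab-separated table, then rename columns."""
    kwargs.setdefault("sep", "\t")  # assume tab-separated input unless told otherwise
    df = pd.read_csv(filepath_or, **kwargs)
    if rename:
        # Columns that are not in the mapping keep their original names.
        df = df.rename(columns=rename)
    return df


# Mapping taken from the snippet above.
UCSC_TO_BED = {"genoName": "chrom", "genoStart": "start", "genoEnd": "end"}

# Made-up single-row table standing in for an rmsk file.
rmsk_text = "genoName\tgenoStart\tgenoEnd\trepName\nchr1\t100\t200\tL1\n"
rmsk = read_table_with_rename(io.StringIO(rmsk_text), rename=UCSC_TO_BED)
```

Returning a renamed copy avoids the `inplace=True` call at every call site.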
@nvictus
Member

nvictus commented Sep 28, 2022

Instead of manually adding all UCSC table schemas, we should consider building the schema "database" automatically, by sourcing table schemas from UCSC and translating them into something Python- and user-friendly. An alternative would be to fetch data and schemas dynamically from UCSC's SQL database: see cruzdb.

Since the UCSC schemas are all typed SQL schemas, we could consider optionally adding things like:

  • numpy/pandas translations of those types
  • field exclusion rules (e.g. "bin")
  • renaming rules (e.g. chromStart --> start)

But here we risk doing a lot of manual curation again.

# schemas.py --> schemas.json ??
[
...
{
    "name": "rmsk",
    "family": "ucsc",
    "fields": [...],
    "rename": {....},
    "exclude": [...],
},
...
]
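To make the idea concrete, here is a rough sketch of how one such entry could be resolved into column names and dtypes. Everything in it is an assumption for illustration: the field list is a hand-picked subset, the SQL-to-NumPy mapping is deliberately tiny, and `translate_type` and `resolve_schema` are hypothetical helpers rather than existing bioframe functions.

```python
import json

# Illustrative mapping from SQL base types to NumPy/pandas dtype names (assumed).
SQL_TO_NUMPY = {
    "smallint": "int16",
    "int": "int32",
    "float": "float32",
    "char": "object",
    "varchar": "object",
}


def translate_type(sql_type):
    """Translate e.g. 'int unsigned' or 'varchar(255)' into a dtype name."""
    tokens = sql_type.lower().split()
    base = tokens[0].split("(")[0]  # drop a length suffix such as "(255)"
    dtype = SQL_TO_NUMPY[base]
    if "unsigned" in tokens[1:] and dtype.startswith("int"):
        dtype = "u" + dtype
    return dtype


def resolve_schema(entry):
    """Apply exclusion and renaming rules; return {column name: dtype name}."""
    exclude = set(entry.get("exclude", []))
    rename = entry.get("rename", {})
    dtypes = {}
    for field in entry["fields"]:
        if field["name"] in exclude:
            continue
        dtypes[rename.get(field["name"], field["name"])] = translate_type(field["type"])
    return dtypes


# Hypothetical entry in the shape proposed above, with a made-up field subset.
entry = json.loads("""
{
    "name": "rmsk",
    "family": "ucsc",
    "fields": [
        {"name": "bin", "type": "smallint unsigned"},
        {"name": "genoName", "type": "varchar(255)"},
        {"name": "genoStart", "type": "int unsigned"},
        {"name": "genoEnd", "type": "int unsigned"},
        {"name": "strand", "type": "char(1)"}
    ],
    "rename": {"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
    "exclude": ["bin"]
}
""")
resolved = resolve_schema(entry)
```

Only the `rename` and `exclude` parts would need manual curation here; the type translation could be generated from the SQL schema.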

@gfudenberg
Member Author

another option for implementing a renaming dictionary would be following the format for colname remapping in core.specs

def _get_default_colnames():

@agalitsyna
Member

GFF/GTF reader proposal

Parsing GFF/GTF files is usually painful because the "attributes" column is a key-value dictionary stored as a string.
I do not see a simple way to handle that when reading with the GFF/GTF schema.
I usually end up either using some Python-based GTF parser (which frequently produces errors and is not easy to debug) or using a custom column expansion like this:

pd.DataFrame.from_records(df_genes.attributes.apply(lambda x: {y[0]: y[1] for y in re.findall(r'([^\s]*) "([^\s]*)"; ', x)}))

Does it make sense to add an option to do a similar expansion on any column with user-specified regex, e.g. here?

def read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs):

GTF/GFF defaults can be added as examples.

Pros:

  • automated gtf parser in bioframe
  • customizable regex reader (there might be variations in what people understand as GTF/GFF)

Cons:

  • Heavy computations with pandas + regex
  • Additional piece of code that requires maintenance.

Maybe it's already in UCSC @nvictus and it's simple to re-use?
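As a sketch of the customizable-regex idea, the parser below takes the pattern as an argument, so GTF-style and GFF3-style attributes differ only in the default passed in. The function name `parse_attributes` and both patterns are assumptions for illustration; they cover only the simple cases shown and do not handle unquoted GTF values such as `exon_number 1;`.

```python
import re

# Assumed GTF-style pattern: key "value"; with double-quoted values.
GTF_PATTERN = r'(\S+) "([^"]*)";?'
# Assumed GFF3-style pattern: key=value pairs separated by semicolons.
GFF3_PATTERN = r"([^=;\s]+)=([^;]*)"


def parse_attributes(text, pattern=GTF_PATTERN):
    """Return a dict of key/value pairs found by a two-group regex."""
    # re.findall yields (key, value) tuples because the pattern has two groups.
    return dict(re.findall(pattern, text))


gtf_attrs = parse_attributes('gene_id "ENSG1"; gene_name "DDX11L1";')
gff3_attrs = parse_attributes("ID=gene1;Name=DDX11L1", GFF3_PATTERN)
```

Any regex with exactly two capture groups can be passed as `pattern`, which is where the format variations mentioned above would be absorbed.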

@nvictus
Member

nvictus commented Oct 19, 2022

Btw, there's this which got sandboxed. It's slow, as you might expect:

https://github.com/open2c/bioframe/blob/fbd129c1444cef7c34edce067027ab5f65172fe8/bioframe/sandbox/gtf_io.py

@agalitsyna
Member

agalitsyna commented Oct 19, 2022

Ah, I did not notice that one. Its disadvantage is that it is not generalized and cannot be easily customized if there are no spaces in the annotation or there's a mix of quote chars / item separators.

I would upvote moving it out of the sandbox.

@nvictus
Member

nvictus commented Oct 19, 2022

Agreed that there's plenty of room for improvement.

I would advocate for now keeping it as a function downstream from read_table, to be applied on an unparsed dataframe column or series.
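A rough sketch of what such a downstream function might look like, assuming the attributes column has already been read as plain strings. The name `expand_attributes` and the default pattern are hypothetical, and the two-row frame is made up for illustration.

```python
import re

import pandas as pd


def expand_attributes(series, pattern=r'(\S+) "([^"]*)";?'):
    """Expand a Series of attribute strings into one column per key."""
    records = [dict(re.findall(pattern, text)) for text in series]
    # Reuse the input index so the result can be joined back onto the table.
    return pd.DataFrame(records, index=series.index)


# Made-up unparsed table, as it might come out of a plain read.
df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "attributes": ['gene_id "G1"; gene_name "N1";', 'gene_id "G2";'],
    }
)
expanded = expand_attributes(df["attributes"])
df_full = df.drop(columns="attributes").join(expanded)
```

Keys missing from a row come out as NaN, and the Python-level loop over rows is the slow part noted earlier in the thread.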

@gfudenberg
Member Author

gfudenberg commented Oct 19, 2022

there's also gtfparse -- worth considering adding it as a dependency?
(with key function read_gtf)

@agalitsyna
Member

I was unable to parse some public GTF files with it; only a custom solution worked.
