updates for schemas #123

gfudenberg · 2022-09-28T21:47:16Z

add rmsk schema
https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=rep&hgta_track=rmsk&hgta_table=rmsk&hgta_doSchema=describe+table+schema
"""bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id"""

Line 171 in ccb8e70

consider adding a rename mapper to read_table(), since we would often do the following:

rmsk.rename(
    columns={"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
    inplace=True,
)

consider adding a set of columns to be dropped: tables that come from a database sometimes include indexing columns, like bin, that are not very useful
think about how to add dtypes to schemas, as was done in cooltools.cli.pileup:
https://github.com/open2c/cooltools/blob/1212cf0757741951a6be15bb7351cf35240493a0/cooltools/cli/pileup.py#L138
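
One way to picture the three ideas above (rename mapper, dropped columns, dtypes) is a helper that turns a schema into keyword arguments for pandas.read_csv. This is only a sketch under assumptions: build_read_kwargs and its parameters are hypothetical names that do not exist in bioframe, and the dtypes passed in are illustrative.

```python
# Hypothetical helper: none of these names exist in bioframe.
RMSK_FIELDS = (
    "bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd "
    "genoLeft strand repName repClass repFamily repStart repEnd repLeft id"
).split()


def build_read_kwargs(fields, rename=None, drop=(), dtypes=None):
    """Translate a schema into keyword arguments for pandas.read_csv.

    Returns (kwargs, final_columns): final_columns holds the renamed
    column names, in file order, with dropped columns left out.
    """
    rename = rename or {}
    dtypes = dtypes or {}
    dropped = set(drop)
    unknown = (set(rename) | dropped | set(dtypes)) - set(fields)
    if unknown:
        raise ValueError(f"not in schema: {sorted(unknown)}")
    kept = [f for f in fields if f not in dropped]
    kwargs = {
        "names": list(fields),  # original column names, in file order
        "usecols": kept,        # skip dropped columns while parsing
        "dtype": {f: dtypes[f] for f in kept if f in dtypes},
    }
    final_columns = [rename.get(f, f) for f in kept]
    return kwargs, final_columns


kwargs, columns = build_read_kwargs(
    RMSK_FIELDS,
    rename={"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
    drop=["bin"],
    dtypes={"genoStart": "int64", "genoEnd": "int64"},  # illustrative dtypes
)
```

With pandas available, the result could be used as df = pd.read_csv(path, sep="\t", header=None, **kwargs) followed by df.columns = columns. That works because read_csv keeps the file order of columns regardless of the order given in usecols.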


nvictus · 2022-09-28T23:25:54Z

Instead of manually adding all UCSC table schemas, we should consider building the schema "database" automatically, by sourcing table schemas from UCSC and translating them into something Python- and user-friendly. An alternative would be to grab data and schemas directly from UCSC's SQL database dynamically: see cruzdb.

Since the UCSC schemas are all typed SQL schemas, we could consider optionally adding things like:

numpy/pandas translations of those types
field exclusion rules (e.g. "bin")
renaming rules (e.g. chromStart --> start)

But here we risk doing a lot of manual curation again.
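
To make the type-translation idea concrete, here is a minimal sketch that maps SQL column type strings (in the style "int(10) unsigned") to NumPy/pandas dtype names. The mapping tables and the function name sql_to_dtype are assumptions for illustration, not part of bioframe or any UCSC tooling, and the mapping is deliberately incomplete.

```python
import re

# Assumed mapping from MySQL integer types to NumPy dtype names.
_INT_TYPES = {
    "tinyint": "int8",
    "smallint": "int16",
    "mediumint": "int32",  # 3-byte type, stored in the next larger dtype
    "int": "int32",
    "bigint": "int64",
}
# Assumed mapping for a few non-integer types; anything else becomes object.
_OTHER_TYPES = {
    "float": "float32",
    "double": "float64",
    "char": "object",
    "varchar": "object",
    "enum": "category",
}

# base type, optional "(...)" arguments, optional "unsigned" modifier
_TYPE_RE = re.compile(r"^\s*([a-z]+)\s*(?:\([^)]*\))?\s*(unsigned)?", re.IGNORECASE)


def sql_to_dtype(sql_type):
    """Return a dtype name for a SQL column type string."""
    match = _TYPE_RE.match(sql_type)
    if match is None:
        raise ValueError(f"unrecognized SQL type: {sql_type!r}")
    base = match.group(1).lower()
    unsigned = match.group(2) is not None
    if base in _INT_TYPES:
        dtype = _INT_TYPES[base]
        return "u" + dtype if unsigned else dtype
    return _OTHER_TYPES.get(base, "object")


# Illustrative type strings, written in the style of a typed SQL schema.
examples = ["smallint(5) unsigned", "int(10) unsigned", "int(11)", "varchar(255)"]
dtypes = [sql_to_dtype(t) for t in examples]
```

Falling back to object for unknown types keeps the translation permissive, at the cost of hiding types that might deserve their own rule.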

# schemas.py --> schemas.json ??
[
...
{
    "name": "rmsk",
    "family": "ucsc",
    "fields": [...],
    "rename": {....},
    "exclude": [...],
},
...
]
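
A rough sketch of how such a JSON registry could be loaded and sanity-checked. Everything here is an assumption: load_schemas is a hypothetical name, the field list is abbreviated, and the rename/exclude values are example entries rather than an agreed format.

```python
import json

# Example registry text following the layout sketched above.
# The field list is abbreviated and the values are illustrative only.
SCHEMAS_JSON = """
[
  {
    "name": "rmsk",
    "family": "ucsc",
    "fields": ["bin", "swScore", "genoName", "genoStart", "genoEnd", "strand"],
    "rename": {"genoName": "chrom", "genoStart": "start", "genoEnd": "end"},
    "exclude": ["bin"]
  }
]
"""


def load_schemas(text):
    """Parse registry JSON into a dict keyed by (family, name).

    Raises ValueError if an entry is duplicated or if its rename/exclude
    rules mention a field that the entry does not declare.
    """
    registry = {}
    for entry in json.loads(text):
        fields = set(entry["fields"])
        rules = set(entry.get("rename", {})) | set(entry.get("exclude", []))
        unknown = rules - fields
        if unknown:
            raise ValueError(f"{entry['name']}: unknown fields {sorted(unknown)}")
        key = (entry["family"], entry["name"])
        if key in registry:
            raise ValueError(f"duplicate schema: {key}")
        registry[key] = entry
    return registry


registry = load_schemas(SCHEMAS_JSON)
rmsk = registry[("ucsc", "rmsk")]
```

Validating at load time means a typo in a curated rename or exclude rule fails early instead of silently doing nothing when a table is read.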

gfudenberg · 2022-09-29T22:40:59Z

another option for implementing a renaming dictionary would be following the format for colname remapping in core.specs

bioframe/bioframe/core/specs.py

Line 13 in ccb8e70

def _get_default_colnames():

agalitsyna · 2022-10-19T19:17:25Z

GFF/GTF reader proposal

Parsing GFF/GTF files is usually painful because of the "attributes" column, which is a key-value dictionary stored as a string.
I do not see a simple way to handle that when reading with the GFF/GTF schema.
I usually end up either using some Python-based GTF parser (which frequently produces errors and is not easy to debug) or using a custom column expansion like this:

pd.DataFrame.from_records(df_genes.attributes.apply(lambda x: {y[0]: y[1] for y in re.findall(r'([^\s]*) "([^\s]*)"; ', x)}))

Does it make sense to add an option to do a similar expansion on any column with user-specified regex, e.g. here?

bioframe/bioframe/io/fileops.py

Line 43 in fbd129c

def read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs):

GTF/GFF defaults can be added as examples.
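
As a sketch of what a regex-driven expansion with GTF/GFF defaults might look like, using only the standard library. The names expand_attributes, GTF_ATTR_PATTERN and GFF3_ATTR_PATTERN are hypothetical, and the default patterns are my assumptions about common GTF and GFF3 attribute styles, not patterns taken from bioframe.

```python
import re

# Assumed default patterns; each must capture (key, value) in that order.
GTF_ATTR_PATTERN = r'(\S+)\s+"([^"]*)"\s*;'   # key "value";
GFF3_ATTR_PATTERN = r'([^=;\s]+)=([^;]*)'     # key=value;


def expand_attributes(values, pattern=GTF_ATTR_PATTERN):
    """Parse each attribute string into a dict using a key/value regex.

    values: iterable of attribute strings (e.g. one column of a table).
    pattern: regex with exactly two capture groups, key then value.
    """
    regex = re.compile(pattern)
    if regex.groups != 2:
        raise ValueError("pattern must have exactly two capture groups")
    # findall yields (key, value) tuples; dict keeps the last duplicate key
    return [dict(regex.findall(value)) for value in values]


gtf_records = expand_attributes(['gene_id "ENSG01"; gene_name "ABC";'])
gff3_records = expand_attributes(["ID=gene1;Name=ABC"], pattern=GFF3_ATTR_PATTERN)
```

The returned list of dicts can be handed to pd.DataFrame.from_records to get one column per attribute key. Repeated keys within one row (which some GTF files use) are collapsed to the last value in this sketch.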

Pros:

automated gtf parser in bioframe
customizable regex reader (there might be variations in what people understand as GTF/GFF)

Cons:

Heavy computations with pandas + regex
Additional piece of code that requires maintenance.

Maybe it's already in UCSC @nvictus and it's simple to re-use?

nvictus · 2022-10-19T19:36:32Z

Btw, there's this which got sandboxed. It's slow, as you might expect:

https://github.com/open2c/bioframe/blob/fbd129c1444cef7c34edce067027ab5f65172fe8/bioframe/sandbox/gtf_io.py

agalitsyna · 2022-10-19T19:49:34Z

Ah, I did not notice that one. The disadvantage of this one is that it's not generalized and cannot be simply customized if there are no spaces in the annotation or there's a mix of quote chars / item separators.
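
To illustrate the kind of variation I mean, here is a minimal, tolerant parser sketch that accepts items with or without a space after the separator, single or double quotes, and key=value pairs in the same string. It is not the sandboxed implementation; parse_attributes is a hypothetical name, and unbalanced quotes are not handled.

```python
import re

# One item = a run of characters up to the next ';' outside of quotes.
_ITEM_RE = re.compile(r"""((?:[^;"']|"[^"]*"|'[^']*')+)""")
# key, then '=' or whitespace, then the rest of the item as the value.
_KV_RE = re.compile(r"^\s*([^\s=]+)\s*(?:=|\s)\s*(.*?)\s*$", re.DOTALL)


def parse_attributes(text):
    """Parse one attribute string into a dict, tolerating mixed styles."""
    attrs = {}
    for item in _ITEM_RE.findall(text):
        match = _KV_RE.match(item)
        if match is None:
            continue  # blank item or a key without a value
        key, value = match.groups()
        # strip one pair of matching quotes around the value, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        attrs[key] = value
    return attrs


mixed = """gene_id "A;B";gene_name 'X'; ID=g1;"""
parsed = parse_attributes(mixed)
```

Splitting into items first and then splitting each item into key and value keeps the two sources of variation (separators and quoting) independent, which is what makes it easier to customize than a single findall pattern.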

I would upvote dissandboxing it.

nvictus · 2022-10-19T19:58:44Z

Agreed that there's plenty of room for improvement.

I would advocate for now keeping it as a function downstream from read_table, to be applied on an unparsed dataframe column or series.

gfudenberg · 2022-10-19T20:02:26Z

there's also gtfparse -- worth considering adding it as a dependency?
(with key function read_gtf)

agalitsyna · 2022-10-19T20:29:52Z

I was unable to parse some public GTF files with it; only a custom solution worked.

gfudenberg added the enhancement label Sep 28, 2022

gfudenberg mentioned this issue Jan 19, 2024

Bring back GTF attributes parser? #141

Open


