Feature Request: Tidy Output #28

nlubock · 2018-12-05T00:48:22Z

I was wondering if it would be possible to output clusters in a tidy format rather than the existing wide format. For example:

> cat test.tsv
GGGG	50
AGGG	10
AAGG	5
AAAG	20
TGGG	20
TTTT	100

> starcode -q -d1 --sphere --print-clusters -i test.tsv
TTTT	100	TTTT
GGGG	80	GGGG,AGGG,TGGG
AAAG	25	AAAG,AAGG

> starcode -q -d1 --sphere --print-tidy-clusters -i test.tsv
TTTT    100     TTTT    100
GGGG    80      GGGG    50
GGGG    80      AGGG    10
GGGG    80      TGGG    20
AAAG    25      AAAG    20
AAAG    25      AAGG    5

Obviously this is not critical, but I think it would be a useful feature.
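
Until something like this exists, the same table can be produced by joining the `--print-clusters` output back to the input counts. Below is a minimal Python sketch, assuming the tab-separated layouts shown above (sequence and count in the input; centroid, cluster total, and comma-separated members in the output). `tidy_clusters` is a hypothetical helper name, not part of starcode, and `--print-tidy-clusters` is only the flag name proposed in this request.

```python
def tidy_clusters(input_tsv, clusters_tsv):
    """Return (centroid, cluster_total, member, member_count) rows.

    input_tsv:    text of the starcode input file, "sequence<TAB>count" per line
    clusters_tsv: text of the --print-clusters output,
                  "centroid<TAB>total<TAB>member1,member2,..." per line
    Both layouts are assumptions taken from the example in this issue.
    """
    # Look up each input sequence's own count.
    counts = {}
    for line in input_tsv.splitlines():
        if not line.strip():
            continue
        seq, count = line.split("\t")
        counts[seq] = int(count)

    # Emit one row per cluster member instead of one row per cluster.
    rows = []
    for line in clusters_tsv.splitlines():
        if not line.strip():
            continue
        centroid, total, members = line.split("\t")
        for member in members.split(","):
            rows.append((centroid, int(total), member, counts[member]))
    return rows


if __name__ == "__main__":
    example_input = "GGGG\t50\nAGGG\t10\nAAGG\t5\nAAAG\t20\nTGGG\t20\nTTTT\t100\n"
    example_clusters = "TTTT\t100\tTTTT\nGGGG\t80\tGGGG,AGGG,TGGG\nAAAG\t25\tAAAG,AAGG\n"
    for row in tidy_clusters(example_input, example_clusters):
        print("\t".join(str(field) for field in row))
```

Doing the join outside starcode keeps the wide output untouched, at the cost of reading the input file a second time.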

ezorita · 2018-12-05T08:35:29Z

Hi nlubock, thanks for your request.

Indeed, it's an interesting feature that could be implemented without too much effort, given the latest updates to the clustering algorithms.

We will consider it and get back to you soon.

darachm · 2021-01-26T18:47:29Z

Hey folks, I came here to ask for something similar, so I figured I'd just add a comment here.
I currently use starcode in bioinformatics pipelines to cluster sequences to centroids, and later count combinations of 4 sets of barcodes. So I would be very interested in a mode that could output tables of tuples, something like:

[ input line number, clustered centroid sequence ]
[ unique input sequence, clustered centroid sequence ]

My current solution is to use AWK :grimacing: on the cluster-id-containing output file, like:

mawk '{ n = split($3, a, ","); for (i = 1; i <= n; i++) print a[i] "," $1 }'

Do y'all think the first option (input line number, clustered centroid sequence) would be easy to implement? I took an intro to C course 10 years ago and mainly use shell/R/Python. Do you think it'd be helpful for me to sketch out a prototype? I assume the code that writes the output would be easy to find?
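
As a rough illustration of the first option done as a post-processing step rather than inside starcode, here is a Python sketch. It assumes the reads are available as one sequence per input line and that the `--print-clusters` output has the tab-separated layout shown earlier in this thread (centroid, count, comma-separated members). `centroid_per_read` is a made-up name for illustration only.

```python
def centroid_per_read(reads, clusters_tsv):
    """Return (input line number, centroid) for every read.

    reads:        list of sequences, in input-file order (assumed one per line)
    clusters_tsv: text of the --print-clusters output,
                  "centroid<TAB>count<TAB>member1,member2,..." per line
    Reads that appear in no cluster are paired with None.
    """
    # Invert the cluster table: member sequence -> centroid sequence.
    member_to_centroid = {}
    for line in clusters_tsv.splitlines():
        if not line.strip():
            continue
        centroid, _count, members = line.split("\t")
        for member in members.split(","):
            member_to_centroid[member] = centroid

    # Line numbers are 1-based to match what a text editor or awk's NR shows.
    return [
        (line_number, member_to_centroid.get(read))
        for line_number, read in enumerate(reads, start=1)
    ]


if __name__ == "__main__":
    example_clusters = "TTTT\t100\tTTTT\nGGGG\t80\tGGGG,AGGG,TGGG\nAAAG\t25\tAAAG,AAGG\n"
    for line_number, centroid in centroid_per_read(["AGGG", "TTTT", "AAGG"], example_clusters):
        print(line_number, centroid, sep="\t")
```

Keeping the line number rather than the sequence makes it straightforward to join several barcode sets from the same reads afterwards.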

darachm · 2021-09-23T15:00:20Z

Hey @nlubock, a similar feature is now ready for testing in the feature/tidy branch, as discussed on this pull request: #40 (comment)
So if you're still working with this, maybe give it a spin? It's different from what you describe, but should still work.

ezorita self-assigned this Dec 5, 2018

darachm mentioned this issue Sep 5, 2021

Mocked-up suggestion - outputting canonical cluster sequence per-read for use in bioinformatics pipelines #40

Merged
