Feature Request: Tidy Output #28

Open
nlubock opened this issue Dec 5, 2018 · 3 comments

nlubock commented Dec 5, 2018

I was wondering if it would be possible to output clusters in a tidy format rather than the existing wide format. For example:

> cat test.tsv
GGGG	50
AGGG	10
AAGG	5
AAAG	20
TGGG	20
TTTT	100

> starcode -q -d1 --sphere --print-clusters -i test.tsv
TTTT	100	TTTT
GGGG	80	GGGG,AGGG,TGGG
AAAG	25	AAAG,AAGG

> starcode -q -d1 --sphere --print-tidy-clusters -i test.tsv
TTTT    100     TTTT    100
GGGG    80      GGGG    50
GGGG    80      AGGG    10
GGGG    80      TGGG    20
AAAG    25      AAAG    20
AAAG    25      AAGG    5

Obviously this is not critical, but I think it would be a useful feature.
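Until such an option exists, the tidy table can be derived from the existing wide output. The following is a minimal Python sketch, assuming the two-column input file and the three-column `--print-clusters` output shown above. The function name `tidy_clusters` is hypothetical and not part of starcode, and `--print-tidy-clusters` is only the flag proposed in this request.

```python
def tidy_clusters(input_lines, cluster_lines):
    """Expand wide cluster output into one row per cluster member.

    input_lines:   lines of the input file, "sequence<TAB>count"
    cluster_lines: lines of the --print-clusters output,
                   "centroid<TAB>cluster_count<TAB>member1,member2,..."
    Returns a list of (centroid, cluster_count, member, member_count) tuples.
    """
    # Record each input sequence's own count so it can be attached to its row.
    counts = {}
    for line in input_lines:
        if not line.strip():
            continue
        seq, count = line.split()
        counts[seq] = int(count)

    rows = []
    for line in cluster_lines:
        if not line.strip():
            continue
        centroid, total, members = line.split()
        # One output row per member, in the order the members are listed.
        for member in members.split(","):
            rows.append((centroid, int(total), member, counts[member]))
    return rows


if __name__ == "__main__":
    # Data copied from the example in this issue.
    input_lines = ["GGGG\t50", "AGGG\t10", "AAGG\t5",
                   "AAAG\t20", "TGGG\t20", "TTTT\t100"]
    cluster_lines = ["TTTT\t100\tTTTT",
                     "GGGG\t80\tGGGG,AGGG,TGGG",
                     "AAAG\t25\tAAAG,AAGG"]
    for row in tidy_clusters(input_lines, cluster_lines):
        print("\t".join(map(str, row)))
```

Run on the example data, this should print the same six rows as the proposed tidy output, separated by tabs.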

Collaborator

ezorita commented Dec 5, 2018

Hi nlubock, thanks for your request.

Indeed it's an interesting feature that could be implemented without too much effort given the latest updates in the clustering algorithms.

We will consider it and get back to you soon.

ezorita self-assigned this Dec 5, 2018

darachm commented Jan 26, 2021

Hey folks, I came here to ask for something similar, so I figured I'd just add a comment here.
I am currently using starcode in bioinformatics pipelines to cluster to centroids, and then counting combinations of 4 sets of barcodes later. So I would be very interested in a mode that could output tables of tuples, something like:

  • [ input line number, clustered centroid sequence ]
  • [ unique input sequence, clustered centroid sequence ]

My current solution is to use AWK 😬 on the cluster-id-containing output file like:

mawk '{ split($3,a,","); for (i in a){ print a[i] "," $1 } }'

Do y'all think the first option (input line number, clustered centroid sequence) would be easy to implement? I took an intro to C course 10 years ago and mainly use shell/R/Python. Do you think it'd be helpful for me to sketch out a prototype? I assume the output-making code would be easy to find?
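For reference, the first option can be approximated outside starcode by post-processing the existing cluster output. This is a rough Python sketch, assuming the input has one sequence as the first field of each line and the cluster output has the three-column `--print-clusters` layout. The function names `member_to_centroid` and `line_number_to_centroid` are hypothetical helpers, not starcode code.

```python
def member_to_centroid(cluster_lines):
    """Map every clustered sequence to its centroid.

    cluster_lines: "centroid<TAB>count<TAB>member1,member2,..." per line.
    """
    mapping = {}
    for line in cluster_lines:
        if not line.strip():
            continue
        centroid, _count, members = line.split()
        for member in members.split(","):
            mapping[member] = centroid
    return mapping


def line_number_to_centroid(input_lines, cluster_lines):
    """Return (1-based input line number, centroid) pairs.

    Blank input lines are skipped. A sequence that appears in no cluster
    gets None as its centroid.
    """
    mapping = member_to_centroid(cluster_lines)
    pairs = []
    for number, line in enumerate(input_lines, start=1):
        fields = line.split()
        if not fields:
            continue
        # The sequence is assumed to be the first field on the line.
        pairs.append((number, mapping.get(fields[0])))
    return pairs


if __name__ == "__main__":
    clusters = ["TTTT\t100\tTTTT", "GGGG\t80\tGGGG,AGGG,TGGG"]
    reads = ["GGGG", "AGGG", "GGGG", "TTTT"]
    for number, centroid in line_number_to_centroid(reads, clusters):
        print(number, centroid, sep="\t")
```

The dictionary returned by `member_to_centroid` is the second option from the list (unique input sequence, clustered centroid sequence), which is the same mapping the mawk one-liner prints.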


darachm commented Sep 23, 2021

Hey @nlubock, a similar feature is now ready for testing in the feature/tidy branch, as discussed in this pull request: #40 (comment)
So if you're still working with this, maybe give it a spin? It's different from what you describe, but should still work.
