
removing sample(s) from the talon database #104

Open
ew367 opened this issue May 4, 2022 · 8 comments
Comments

@ew367

ew367 commented May 4, 2022

Hi

How does talon process datasets that have only partially been added to the input database? When running talon again using the same config file and input database, will it continue from partway through the partial dataset, or skip the dataset since part of it is already present? My HPC job was interrupted during processing, and I want to know how I can tell whether the sample that was being added at the time was completed, or if data is still missing after running talon again.

Thanks

@callumparr

I'm also interested to know. When this happened, I just deleted the database and initialized a new one, as there isn't a --resume flag.

@ew367
Author

ew367 commented May 4, 2022

I'm also interested to know. When this happened, I just deleted the database and initialized a new one, as there isn't a --resume flag.

Yes, that has been my approach in the past too, but my new dataset is HUGE and had already been running for nearly 2 weeks before the interruption, so I really don't want to do that this time!

@fairliereese
Member

You can use this Python code to check whether the dataset has been added to your database:

import sqlite3
import pandas as pd

db = 'database_name.db'
with sqlite3.connect(db) as conn:
    # List every dataset currently recorded in the database
    q = 'SELECT dataset_name FROM dataset'
    datasets = pd.read_sql_query(q, conn)

print(datasets.dataset_name.tolist())

TALON is pretty good about discarding or not pushing incomplete changes to the database, but this is not a surefire method. What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup. I'm sorry there's not a better way to do this, but this is definitely something I learned after getting burned in the past as well.
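
A minimal sketch of that backup step, with placeholder filenames, could look like this:

import shutil

# Placeholder filenames; substitute the path to your own TALON database.
db = 'database_name.db'
backup = 'database_name_backup.db'

# Copy the database (including file metadata) before launching a new talon run,
# so a failed run can be rolled back by restoring the copy.
shutil.copy2(db, backup)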

@callumparr

What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup.

That's simple and ingenious. Not sure why I did not think to do that. TY for replies as always.

@ew367
Author

ew367 commented May 6, 2022

Thanks for your input everyone.

Does anyone know if there is a way to remove an individual dataset from a database? If so, I could just remove the dataset that it was partway through processing and then re-add it...

The partially processed dataset definitely exists in the database, but I'm not convinced that it has been fully added. I used talon_filter_transcripts on a new db that I created just from the 'suspect' dataset in question, then did the same on the global db after additionally specifying --datasets=suspectDataset, and the results were not comparable. The output from the global db contains fewer rows than the output from the db created using the single suspect dataset alone.
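
For what it's worth, a rough sketch of deleting just the entry in the dataset table is below (placeholder names; other TALON tables presumably reference the dataset too, so this alone is unlikely to be a complete removal, and restoring from a backup remains the safer route):

import sqlite3

db = 'database_name.db'          # placeholder database path
bad_dataset = 'suspectDataset'   # placeholder name of the partially added dataset

with sqlite3.connect(db) as conn:
    # This only drops the entry from the dataset table; any other tables that
    # record per-dataset information would need equivalent DELETE statements.
    conn.execute('DELETE FROM dataset WHERE dataset_name = ?', (bad_dataset,))
    conn.commit()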

@rb520826

Hi, thanks for the above suggestions. I am also interested in removing a sample from the database. Is there any update on whether this is possible, please?

Thanks!

@callumparr

You can use this Python code to check whether the dataset has been added to your database:

import sqlite3
import pandas as pd

db = 'database_name.db'
with sqlite3.connect(db) as conn:
    # List every dataset currently recorded in the database
    q = 'SELECT dataset_name FROM dataset'
    datasets = pd.read_sql_query(q, conn)

print(datasets.dataset_name.tolist())

TALON is pretty good about discarding or not pushing incomplete changes to the database, but this is not a surefire method. What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup. I'm sorry there's not a better way to do this, but this is definitely something I learned after getting burned in the past as well.

Is it possible to extract the sample description, i.e. the second column in the config file?

I've been playing around with the sqlite3 module, trying to get the column headers of the dataset table in the database, but it's a bit beyond me.

@fairliereese
Member

Is it possible to extract the sample description, i.e. the second column in the config file?

I've been playing around with the sqlite3 module, trying to get the column headers of the dataset table in the database, but it's a bit beyond me.

You should be able to pull that info out using the following SQL query: SELECT DISTINCT sample FROM dataset

As an aside, if you're interested in navigating the contents of the TALON database, I'd definitely recommend downloading a DB viewer such as this one. You can look through the tables and write and test out queries so you don't have to open up Python and sqlite3 every time you want to poke around.

As another aside, I am much more comfortable in pandas in Python than I am in manipulating these tables through sqlite3. If that's more your speed, there are functions that will literally dump a table from a SQLite database into a pandas DataFrame (see here for an example) to make it easy to work on.
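
A quick sketch along those lines, reusing the placeholder database name from the snippet above:

import sqlite3
import pandas as pd

db = 'database_name.db'   # placeholder database path
with sqlite3.connect(db) as conn:
    # Pull the whole dataset table into a pandas DataFrame
    dataset_df = pd.read_sql_query('SELECT * FROM dataset', conn)

# Column headers of the dataset table
print(dataset_df.columns.tolist())

# Sample descriptions (the 'sample' column used in the query above)
print(dataset_df['sample'].unique().tolist())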
