Urgent Help: Cannot Speed up Metric Extraction script #164
Comments
There are many considerations here, but I will start with the simplest steps to speed up metric extraction. The first is that I would recommend using the

The next recommendation would be to increase the window size. Generally, a value close to the median read length is a good choice. If you go much larger, the regions will contain too much empty space, and sparse matrices are not currently supported. If you go smaller, you end up processing the same read several times in order to extract the reference-anchored metrics. This is likely what is happening in the script posted here. If one wanted a blazing-fast metric extraction, each read could be processed once and all of its metrics written out in an efficient format for the intended downstream purpose.

I think these updates should get you most of the way to a much more efficient metric extraction code snippet. Note also that the sample rate is likely the same for the whole run, so you could extract the dwell matrix and then divide the whole matrix by the sample rate at the end.
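As a rough illustration of the last two points, the sketch below picks a window size from the median read length and converts a dwell matrix from samples to seconds with a single division. It uses plain NumPy only; the read lengths, the dwell values, and the 4000 Hz sample rate are made-up placeholders, not values taken from this thread or from the Remora API.

```python
import numpy as np

# Hypothetical read lengths (in bases) standing in for a real dataset.
read_lengths = np.array([850, 1200, 990, 1430, 1010])

# Choose a window size close to the median read length.
window_size = int(np.median(read_lengths))

# Hypothetical dwell matrix in raw signal samples (reads x positions).
# NaN marks positions that a read does not cover.
dwell_samples = np.array([
    [12.0, 8.0, np.nan],
    [20.0, np.nan, 4.0],
])

# Assumed sample rate in Hz, taken to be constant for the whole run.
sample_rate = 4000.0

# One vectorized division at the end instead of dividing per read.
dwell_seconds = dwell_samples / sample_rate
```

NaN entries stay NaN after the division, so uncovered positions remain distinguishable from real dwell times.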
@marcus1487 thank you so much for the prompt reply, I really appreciate it. These fixes are great; however, I am running into an issue where a Remora Error (Bam record not found in POD5) stops the execution of the entire
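One generic way to keep a long extraction loop running when a single region fails is to catch the error per region and record what was skipped. The sketch below is only an illustration of that pattern: `MissingRecordError` and `extract_region` are hypothetical stand-ins defined here, not Remora names, and this is not necessarily how the library handles the problem.

```python
class MissingRecordError(Exception):
    """Stand-in for the library error raised when a record is missing."""


def extract_region(region):
    # Hypothetical extraction step; it fails for one region to mimic the error.
    if region == "chr1:100-200":
        raise MissingRecordError(f"record not found for {region}")
    return {"region": region, "n_reads": 3}


regions = ["chr1:0-100", "chr1:100-200", "chr1:200-300"]
results, skipped = [], []
for region in regions:
    try:
        results.append(extract_region(region))
    except MissingRecordError as err:
        # Record the failure and move on instead of aborting the whole run.
        skipped.append((region, str(err)))
```

Keeping the `skipped` list makes it possible to check afterwards how many regions were dropped and why.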
I've just pushed a branch (
Thank you, Marcus, that issue seems to be resolved now. My one final question: once I have the numpy arrays, how do I access a specific read by its read_id? For example, if the shape of this call
You'll have to dig a level deeper into the API to extract the read_ids for each row in the metrics arrays. If you look at the
After extracting the read IDs, though, how would you go about adding the values to the dictionary? I am struggling to understand how I can change the existing code
How would I go about adding these read_ids so that I can index the numpy arrays with them? Or were you talking about creating a different numpy array per read? I am very confused by this, my apologies.
I do not know of any way to index directly into the rows of a numpy array with string keys. I am thinking of a dict that maps each read_id to the numpy array row index for that read. So something like
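A minimal sketch of such a mapping, assuming the read IDs are already available as a list in the same order as the rows of the metrics array. The names `read_ids`, `metrics`, `read_id_to_row`, and `metrics_for_read` are illustrative and do not come from the Remora API; the values are made up.

```python
import numpy as np

# Hypothetical read IDs, in the same order as the rows of the metrics array.
read_ids = ["read-a", "read-b", "read-c"]

# Hypothetical metrics array with one row per read.
metrics = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

# Map each read ID to the index of its row.
read_id_to_row = {read_id: row for row, read_id in enumerate(read_ids)}


def metrics_for_read(read_id):
    """Return the metrics row for one read, looked up by its ID."""
    return metrics[read_id_to_row[read_id]]


row = metrics_for_read("read-b")
```

The array itself stays a single contiguous block; only the lookup goes through the dict, so there is no need for a separate array per read.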
Hello, I have been trying to extract metrics with remora on a very large nanopore dataset, and I am running into issues with the speed of the extraction. I am essentially attempting to extract metrics for 100bp windows that I have created. Below is the code I am using; it works, but it takes an extremely long time. Is there any way to achieve what I want more efficiently?