
Question about Readfish basecalling performance for messy samples #343

Closed
jamesemery opened this issue Mar 26, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation), Stale

@jamesemery

We are trying to use readfish on some different sorts of library extractions on our PromethION instrument, and we have noticed that for some of our library preps the performance of readfish/the basecaller falls well outside our expectations from testing on other libraries. Specifically, for some libraries with many short fragments, readfish is overwhelmed, sending very large batches of 1,000+ reads to the basecaller and causing the round-trip time for reads to creep up. We have observed this pushing the mean length of unblocked reads into the 1,500-2,000 bp range. An example of the slow performance:

2024-03-18 16:07:17,736 readfish.targets 1225R/2.3572s; Avg: 1224R/2.3997s; Seq:15,751; Unb:1,448,632; Pro:219,374; Slow batches (>1.00s): 1375/1375
2024-03-18 16:07:20,439 readfish.targets 1259R/2.6369s; Avg: 1224R/2.3999s; Seq:15,767; Unb:1,449,703; Pro:219,546; Slow batches (>1.00s): 1376/1376
2024-03-18 16:07:23,057 readfish.targets 1330R/2.5835s; Avg: 1224R/2.4001s; Seq:15,781; Unb:1,450,843; Pro:219,722; Slow batches (>1.00s): 1377/1377
2024-03-18 16:07:25,400 readfish.targets 1251R/2.3066s; Avg: 1224R/2.4000s; Seq:15,791; Unb:1,451,935; Pro:219,871; Slow batches (>1.00s): 1378/1378
2024-03-18 16:07:27,622 readfish.targets 1226R/2.1911s; Avg: 1224R/2.3998s; Seq:15,801; Unb:1,452,981; Pro:220,041; Slow batches (>1.00s): 1379/1379
2024-03-18 16:07:30,019 readfish.targets 1169R/2.3623s; Avg: 1224R/2.3998s; Seq:15,815; Unb:1,453,983; Pro:220,194; Slow batches (>1.00s): 1380/1380
2024-03-18 16:07:32,693 readfish.targets 1201R/2.6497s; Avg: 1224R/2.4000s; Seq:15,824; Unb:1,455,036; Pro:220,333; Slow batches (>1.00s): 1381/1381
2024-03-18 16:07:35,041 readfish.targets 1253R/2.3117s; Avg: 1224R/2.3999s; Seq:15,838; Unb:1,456,122; Pro:220,486; Slow batches (>1.00s): 1382/1382
2024-03-18 16:07:37,686 readfish.targets 1236R/2.6143s; Avg: 1224R/2.4001s; Seq:15,843; Unb:1,457,208; Pro:220,631; Slow batches (>1.00s): 1383/1383
2024-03-18 16:07:39,899 readfish.targets 1226R/2.1606s; Avg: 1224R/2.3999s; Seq:15,862; Unb:1,458,270; Pro:220,776; Slow batches (>1.00s): 1384/1384
2024-03-18 16:07:42,855 readfish.targets 1205R/2.9359s; Avg: 1224R/2.4003s; Seq:15,874; Unb:1,459,300; Pro:220,939; Slow batches (>1.00s): 1385/1385
2024-03-18 16:07:45,034 readfish.targets 1263R/2.1481s; Avg: 1224R/2.4001s; Seq:15,883; Unb:1,460,405; Pro:221,088; Slow batches (>1.00s): 1386/1386
2024-03-18 16:07:47,709 readfish.targets 1244R/2.6334s; Avg: 1224R/2.4003s; Seq:15,890; Unb:1,461,499; Pro:221,231; Slow batches (>1.00s): 1387/1387
2024-03-18 16:07:50,251 readfish.targets 1249R/2.4927s; Avg: 1224R/2.4003s; Seq:15,901; Unb:1,462,602; Pro:221,366; Slow batches (>1.00s): 1388/1388
2024-03-18 16:07:52,575 readfish.targets 1223R/2.2922s; Avg: 1224R/2.4003s; Seq:15,910; Unb:1,463,664; Pro:221,518; Slow batches (>1.00s): 1389/1389
2024-03-18 16:07:54,578 readfish.targets 1155R/1.9720s; Avg: 1224R/2.4000s; Seq:15,916; Unb:1,464,649; Pro:221,682; Slow batches (>1.00s): 1390/1390
2024-03-18 16:07:56,655 readfish.targets 1113R/2.0184s; Avg: 1224R/2.3997s; Seq:15,924; Unb:1,465,612; Pro:221,824; Slow batches (>1.00s): 1391/1391

I wanted to ask whether you have any tips/advice for making the basecaller more performant. On the tower we have (4 A800 GPUs, all averaging ~20-40% utilization with several trials active) we still see these very long batch times when we have a lot of short reads. Here is the command we are using to launch the basecaller:

/opt/ont/guppy/bin/guppy_basecall_server --port 5556 --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg --log_path /home/prom/targetedscripts/guppy_basecaller_logs  --ipc_threads 5 --max_queued_reads 3000 --chunks_per_runner 48 --num_callers 32 -x cuda:all

[Screenshot attached: 2024-03-25, 3:07 PM]

In terms of versions:
Version Guppy Basecall Service Software, (C)Oxford Nanopore Technologies plc. Version 6.5.7+ca6d6af, client-server API version 15.0.0

Have you seen behavior like this before with readfish? What is the bottleneck here? Passing the data to the GPUs? Clearly the actual processing is not saturating the GPUs. Is there any advice we are missing on how to run the basecaller/readfish to improve the response time, so that fewer of the batches in these messy samples are slow?

I also notice in the readfish code that you wait for all reads in a batch to return before proceeding to align and unblock. Could this issue be alleviated by aligning/unblocking reads before the entire basecaller batch has finished? If it matters, this is a barcoded run, and as the trial keeps running and more of the pores degrade, the unblock time becomes tolerable and fewer of the batches are slow to round-trip.


We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ. Otherwise, please give us a little time to review.

This is an automated reply, generated by FAQtory

@mattloose
Contributor

Hi,

You are running a slightly strange setup here. Firstly, you have configured a second instance of the guppy basecall service on the same GPUs that the original service is running on. This may well reduce your performance, as guppy is not designed to work in this way.

You have two options. One would be to point readfish at the existing guppy server that the PromethION is already running on port 5555, and not set up your own dedicated instance of guppy. This should give reasonable performance and should not suffer from bottlenecks.

If you want to ensure that you do not get bottlenecked, then you should consider configuring your system so that MinKNOW only utilises a subset of the GPUs and one GPU is dedicated to the adaptive sampling pathway. On your four-GPU system, for example, you would modify the standard guppy server to use three of the devices (e.g. cuda:0-2) and run your own guppy server completely independently on the remaining device (e.g. cuda:3).
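
As a sketch (the device indices are illustrative, and the other flags are simply copied from your command above), the dedicated adaptive sampling server might then be started with:

/opt/ont/guppy/bin/guppy_basecall_server --port 5556 --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg --log_path /home/prom/targetedscripts/guppy_basecaller_logs --ipc_threads 5 --max_queued_reads 3000 --chunks_per_runner 48 --num_callers 32 -x cuda:3

while the MinKNOW-managed server is restricted to the remaining devices (e.g. -x cuda:0,1,2).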

However, I also note that you are running a guppy server rather than the more recent dorado basecall server; you should consider updating your system.

My final comment on the setup: have you considered using the ONT built-in adaptive sampling as a comparator here, to see whether this is an issue with readfish or something else?

I would definitely not run two guppy servers on the same physical GPUs.

With respect to the latter question, about waiting for the basecalling batch to complete before aligning and unblocking: the aligning step is not the bottleneck. The issue is the time it takes to basecall and get the data back. If that is slower than the time taken to generate the data, then the whole process will lag regardless of the alignment and unblock times.
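
As a rough back-of-the-envelope check (all numbers here are assumptions: ~400 bases/s translocation implied by the 400bps condition, a ~1 s read chunk before a decision can start, and the ~2.4 s average batch time in your log):

# Extra sequence that accumulates while a basecall/decision batch is in flight.
translocation_bps = 400      # nominal bases per second per pore (assumed)
first_chunk_s = 1.0          # assumed chunk length before a decision can start
batch_round_trip_s = 2.4     # average batch time from the log excerpt above

bases_before_unblock = translocation_bps * (first_chunk_s + batch_round_trip_s)
print(f"~{bases_before_unblock:.0f} bases sequenced before an unblock can land")
# ~1360 bases, i.e. the same ballpark as the 1,500-2,000 bp unblocked reads you report.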

I will make a note to update the documentation on how best to set up on these systems. But in short, my advice would be:

  1. Only run one guppy/dorado server on each physical GPU.
  2. If you wish to run a dedicated server for readfish basecalling, separate from MinKNOW basecalling (a good idea in my mind), you can, but they should be on physically separate GPUs.
  3. Consider updating to dorado if your workflows allow it.

mattloose added the documentation label on Mar 26, 2024
@jamesemery
Author

Hello @mattloose. Thank you for the reply. We have some testing to do on our end to try to run this to ground.

There are a few factors at play in why things are being run this way. Specifically, we have barcodes on these reads and are testing barcode-dependent sequencing (though for these samples it is trivially applied), which rules out the built-in MinKNOW adaptive sampling in the long run, as it still doesn't support that feature as far as I have been able to tell. It's worth testing, however. How do I configure the basecall server that MinKNOW spins up for its own purposes? I don't see anything in the UI for that.

We certainly want to give Dorado a try, though based on ONT's own admission I'm not expecting anything too drastic in terms of performance between Guppy and Dorado.

We have run successfully under these exact conditions for many samples in the past, and it is only with these recent lower-quality samples that we are seeing a problem. I don't know for certain where the second basecalling server comes from; I was under the impression that it was spun up by MinKNOW for its own purposes. In general we do not run these adaptive sampling runs with MinKNOW basecalling enabled in the first place, as we didn't want to risk hurting performance.

It should be noted that we ran 3 samples in parallel, all pointed at the same basecall server; only 2 of the samples (which happened to be of lower quality in the first place) were hitting more than a few slow batches, with the third sample behaving normally. From the info logs it seems that the batch time is roughly linear in the number of reads being fed to the basecaller per batch. I don't know how to measure whether the samples are interfering with each other, but it's possible; that's why I commented on the relatively low GPU utilization. By that logic, if I wanted to run this more efficiently I would divide my flowcells across multiple readfish instances to reduce that batching time, which seems wrong. Would you recommend parallelizing these experiments across different basecall servers pointed at separate GPUs?

I am surprised by your suggestion that we pin the basecaller to only one GPU. I would have assumed that the guppy/dorado basecall servers schedule dynamically across the available GPUs, which should cut down on potential processing or bus bottlenecks. Is it a known limitation that we should not be using the scheduler feature of the basecaller? We are willing to entertain, and at least try, this mode of running.

@mattloose
Contributor

Hi,

Thanks for using the barcoding feature!

So the preferred way of running this on your system is to use the onboard basecaller service that is started by MinKNOW. For this you would not need to start the basecaller service yourself, so the command below is not needed:

/opt/ont/guppy/bin/guppy_basecall_server --port 5556 --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg --log_path /home/prom/targetedscripts/guppy_basecaller_logs --ipc_threads 5 --max_queued_reads 3000 --chunks_per_runner 48 --num_callers 32 -x cuda:all

Instead, you would tell readfish to use the already running server on port 5555.
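
As a sketch only (the section and key names depend on your readfish version, and the IPC address of the MinKNOW-managed server is an assumption, so check both against your working TOML and the readfish documentation), the caller section would point at the existing server rather than your own instance:

[caller_settings.guppy]
# Basecalling model, taken from your command above
config = "dna_r10.4.1_e8.2_400bps_5khz_fast"
# Assumed address of the MinKNOW-managed server; on PromethION this is
# typically an IPC socket rather than a TCP port
address = "ipc:///tmp/.guppy/5555"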

You are correct that dorado/guppy should handle performance well across all the GPUs. When we want to guarantee that is the case, we set up a dedicated GPU for readfish. If you want to do that, you will need to change the configuration of the default guppy service set up by MinKNOW. Most likely this is available via the guppyd or doradod systemd service.
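
A sketch of how that might look (the service name, unit contents, and device strings are assumptions and will vary with your MinKNOW/guppy version, so inspect the installed unit first):

# Find the MinKNOW-managed basecall service and its current ExecStart line
systemctl cat guppyd.service

# Add a drop-in override restricting it to a subset of GPUs, e.g.
sudo systemctl edit guppyd.service
#   [Service]
#   ExecStart=
#   ExecStart=/opt/ont/guppy/bin/guppy_basecall_server <original flags> -x cuda:0,1,2

sudo systemctl daemon-reload
sudo systemctl restart guppyd.service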

However, from the additional information you have given above, it is possible that you are running libraries that simply have too much short material, although it would have to be a real spike of short material to have this strong an effect, I think. One thing you could try is running the library for a few minutes without readfish and checking the read-length distributions. You could compare between the "good" and the "bad" library as well.
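
A minimal sketch of that comparison (assuming you have sequencing_summary.txt files from a few minutes of control sequencing; the file paths are placeholders):

# Compare read-length distributions from MinKNOW/guppy sequencing summaries
import numpy as np
import pandas as pd

def length_stats(summary_path):
    df = pd.read_csv(summary_path, sep="\t", usecols=["sequence_length_template"])
    lengths = np.sort(df["sequence_length_template"].to_numpy())[::-1]
    # N50: length at which the cumulative sum first reaches half the total yield
    n50 = lengths[np.cumsum(lengths) >= lengths.sum() / 2][0]
    return {
        "reads": len(lengths),
        "median_bp": float(np.median(lengths)),
        "N50_bp": int(n50),
        "frac_under_1kb": float((lengths < 1000).mean()),
    }

# Placeholder paths for the "good" and "bad" libraries
for name, path in [("good", "good_library/sequencing_summary.txt"),
                   ("bad", "bad_library/sequencing_summary.txt")]:
    print(name, length_stats(path))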


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Apr 27, 2024

github-actions bot commented May 2, 2024

This issue was closed because there has been no response for 5 days after becoming stale.

github-actions bot closed this as not planned on May 2, 2024