Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to create custom databases #11

Open
NamenloseHeldin opened this issue Oct 6, 2023 · 19 comments
Open

How to create custom databases #11

NamenloseHeldin opened this issue Oct 6, 2023 · 19 comments

Comments

@NamenloseHeldin
Copy link

Hi!
I would like to use your tool with my own reference database.

I installed using option 2 after reading through #10

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

However, when I try to run the StrainScan_build.py command with your test data, I get the following error:
~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small

2023-10-06 12:14:33,705 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:14:33,822 - Hierarchical clustering
2023-10-06 12:14:34,117 - Constructing the tree
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 127, in main
    31, params])
  File "/home/StrainScan/library/Build_tree.py", line 385, in build_tree
    kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
  File "/home/StrainScan/library/Build_tree.py", line 115, in extract_kmers
    reverse = seqpy.revcomp(forward)
AttributeError: module 'seqpy' has no attribute 'revcomp'

When I try to use it with my own genomes, I get the following different error:
python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB

2023-10-06 12:23:53,950 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:23:54,539 - Hierarchical clustering
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 116, in main
    cls_file, cls_res)
  File "/home/StrainScan/library/select_rep.py", line 46, in pick_rep
    final_res[ele[2]]=int(ele[0])
IndexError: list index out of range

Any ideas?

@liaoherui
Copy link
Owner

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version.
conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

@dugala239
Copy link

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version. conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

I have the same error when I used ~2000 genomes to build a customized database. Is it a problem when the genome number is too high?

@liaoherui
Copy link
Owner

Hi,

It's hard to say. Have you tried using StrainScan to build a database with a subset of your 2000 genomes? (e.g. with 100 genomes)? If it works, then it may be a result of large number genomes. Otherwise, it may be caused by other reasons.

However, according to my experience, the large number of genomes usually lead to memory issue rather than "index" error. Need to check. Sorry for the inconvenience.

@dugala239
Copy link

Hey,

Thanks for your reply! !
A subset of genomes worked. I am using genome & MAGs for the database building. Do you think the poor quality of MAGs can cause this "index" problem? I did not encounter memory problems using 80 CPU and 600GB memory.

@liaoherui
Copy link
Owner

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

@dugala239
Copy link

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

Dear Ray,

I have attached the "hclsMap_95.txt" file. Thanks for checking!

hclsMap_95.txt

@liaoherui
Copy link
Owner

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

@dugala239
Copy link

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

This is the error log I got
"2024-05-13 11:58:13,080 - Constructing matrix with dashing (jaccard index)
2024-05-13 11:58:51,360 - Hierarchical clustering
2024-05-13 11:59:00,113 - Constructing the tree
Traceback (most recent call last):
File "/lisc/user/ye/.conda/envs/strainscan/bin/strainscan_build", line 10, in
sys.exit(main())
File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 128, in main
31, params])
File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 281, in build_tree
for i in temp[2].split(","):
IndexError: list index out of range
strainscan_build-1505346.err (END)"

@liaoherui
Copy link
Owner

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

@dugala239
Copy link

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

hey, I have attached the file. Thanks
hclsMap_95_recls.txt
hclsMap_95_Rep.txt

@liaoherui
Copy link
Owner

liaoherui commented May 13, 2024

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

image

@dugala239
Copy link

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

image

Dear Ray,

Thank you for your immediate assistance with troubleshooting.

My genome names are in "*__bin.number.fna" format and the pipeline seems to cut off whatever is after __bin, so some MAGs ended up with the same strain name causing the issue as mentioned above. The database is working at the moment. Nice pipeline!

@liaoherui
Copy link
Owner

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

@dugala239
Copy link

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

Dear Ray,

I would like to hear your opinion about the analysis. Since I have ~2000 genomes, it can end up with many clusters. So I decided to define the clusters first and build a customized database.
Can I compare the relative abundance of defined clusters in a metagenomic sample? The Strainscan output sums up the abundance of all strains to 1. Does it mean that I can only compare the composition of all clusters in a sample, but not the relative abundance.

@liaoherui
Copy link
Owner

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like
ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr
1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452
2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876
3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506
4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

@dugala239
Copy link

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

Hey Ray,

Thanks for the detailed explanation.

@dugala239
Copy link

dugala239 commented May 19, 2024

Dear Ray,

One more question popped up. I ended up with C146, C154, ... C2225, which are not continuous cluster numbers. Did the pipeline filter out C1, C2, C3 etc.? And how?

Another thing is that I used ~2400 genomes and got 2200 clusters according to the the database. Does it make sense?

Thanks!
hclsMap_95_recls.txt

@liaoherui
Copy link
Owner

Hi,

For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.

For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.

@dugala239
Copy link

dugala239 commented May 20, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants