Providing example files #1

cmarcuscy · 2019-01-07T11:53:29Z

Hi developers of novoCaller,

I tried running the first layer of novoCaller with the following command, but the program kept running for over 24 hours without generating any output. I am new to bioinformatics, so please correct me if I have made any mistakes.

Command:
novoCaller -I input.vcf -O step_1_out.txt -T sample_id.txt -X 1 -P 0.005 -E 0.008

vcf:
example.vcf.gz

sample ID file:
sample_id.txt

It would be very helpful if you could provide example files for the program.

Thanks a lot!

Marcus

anwoy · 2019-01-07T12:09:23Z

Hi Marcus,

Thank you for your question. The 'sample_id.txt' file should contain the sample names exactly as they appear in the vcf file. In the vcf file the sample names are AGG0030, AGG0031 and AGG0032, but in the 'sample_id.txt' file they are sample1, sample2 and sample3.

novoCaller also needs unrelated control samples to be present, which the algorithm uses to judge the quality of the calls. The example vcf file contains only the three samples that make up the trio. Please try using a vcf file with a larger number of samples.

Best Regards,
Anwoy
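The name mismatch described above can be caught before a long run. The following is a minimal sketch, not part of novoCaller: the function names `vcf_sample_names` and `missing_ids` are hypothetical, and it assumes the ID file is a plain whitespace-separated list of sample names.

```python
import gzip


def vcf_sample_names(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed; sample names follow.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def missing_ids(id_path, vcf_path):
    """Return IDs listed in id_path that do not appear in the VCF header."""
    samples = set(vcf_sample_names(vcf_path))
    with open(id_path) as handle:
        # Assumption: IDs are separated by any whitespace (spaces or newlines).
        ids = handle.read().split()
    return [name for name in ids if name not in samples]
```

An empty list from `missing_ids("sample_id.txt", "example.vcf.gz")` would mean every listed ID was found in the header. It does not check that enough unrelated control samples are present.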

cmarcuscy · 2019-01-11T02:45:14Z

Hi Anwoy,

Thank you for your answers. Following your suggestions, I have included more samples (261 in total) in the run and made sure that the sample names and sample IDs match. Still, the program has not generated any output after running for 2 days, and no error message has appeared. Do you have any suggestions on how I should troubleshoot?

Thanks a lot!

Regards,
Marcus

anwoy · 2019-01-11T06:43:31Z

Can you please send me the vcf file and the samples.txt file?

cmarcuscy · 2019-01-13T04:10:00Z

Dear Anwoy,

Please find the vcf (first 1000 lines) and samples.txt files below. Thanks!

novocaller_sample.vcf.gz

novoCaller_samples.txt

Marcus

anwoy · 2019-01-13T13:02:57Z

Thanks Marcus, I will get back to you soon.

anwoy · 2019-01-16T04:53:47Z

Hi Marcus,

The caller was written to read the output of VEP (Variant Effect Predictor), which is stored in the INFO field under the key 'CSQ'. Since VEP had not been run on the vcf file, the caller did not work. Thanks for finding this bug. I will change the caller so that it gives an error when it does not find the 'CSQ' key. You can try running VEP on the file and then running the caller again.

Best Regards,
Anwoy
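Whether a vcf carries VEP annotations can be checked from its header, where VEP declares the annotation in a `##INFO=<ID=CSQ,...>` line whose description lists the sub-fields after `Format:`. The sketch below is a standalone pre-flight check, not the caller's own logic, and the function name `csq_fields` is hypothetical.

```python
import gzip
import re


def csq_fields(vcf_path):
    """Return the CSQ sub-field names declared in the header, or None if absent."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines are over
            if line.startswith("##INFO=<ID=CSQ,"):
                # VEP lists the pipe-separated sub-fields after "Format: ".
                match = re.search(r'Format: ([^">]+)', line)
                return match.group(1).strip().split("|") if match else []
    return None
```

A return value of `None` means the header has no CSQ declaration, so VEP has most likely not been run on the file.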

cmarcuscy · 2019-01-18T01:17:32Z

Hi Anwoy,

Thank you for your work to fix the bug. I will try running novocaller after running VEP.

Regards,
Marcus

cmarcuscy · 2019-02-11T03:33:40Z

Hi Anwoy,
I have annotated the vcf with VEP and the program now runs successfully. However, I am getting some unexpected results.

infilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276.recaliecalls_kggseq_samprm_vep.vcf
trio_ID_filename=/home/ramsar1971/project/asd/Reannotation/ASD88_Trio_novocaller.txt
outfilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276_step1_out.txt
X_choice=1
PP_thresh=0.005
ExAC_thresh=0.008
vcf_line_cols:

0 1 2 3 4 5 6 7 8
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
total_candidates=261
end_col=260
number of parents = 258
number of children = 3
parent_cols=
3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111:112:113:114:115:116:117:118:119:120:121:122:123:124:125:126:127:128:129:130:131:132:133:134:135:136:137:138:139:140:141:142:143:144:145:146:147:148:149:150:151:152:153:154:155:156:157:158:159:160:161:162:163:164:165:166:167:168:169:170:171:172:173:174:175:176:177:178:179:180:181:182:183:184:185:186:187:188:189:190:191:192:193:194:195:196:197:198:199:200:201:202:203:204:205:206:207:208:209:210:211:212:213:214:215:216:217:218:219:220:221:222:223:224:225:226:227:228:229:230:231:232:233:234:235:236:237:238:239:240:241:242:243:244:245:246:247:248:249:250:251:252:253:254:255:256:257:258:259:260:
trio_set=
1:2:0:
CSQ_ExAC_AF_col=32

It seems that the program recognizes only three of the 88 trios included.
Another point to note is that the output contains only 1 candidate DN mutation.

Do you have any idea why? Thanks!
Input vcf:
1000_novocaller.vcf.gz

Input txt file:
pedigree.txt

Output file:
novocaller_step1_out.txt

Marcus

anwoy · 2019-02-23T13:33:40Z

Hi Marcus,

Sorry for the late reply. Yes, the caller was made for a Mendelian disease research team, which generally works on cases consisting of a single trio in which a de-novo mutation is suspected. The code can, however, be modified to give output for all the trios. The expected number of de-novo mutations per trio in the coding region (which is where the software looks) is around 1 to 3, so I would say that 1 call is within the expected range. If you are interested in running the caller for a large-scale de-novo study, the code will have to be modified slightly.

Best Regards,
Anwoy
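Until the code is modified, one possible workaround is to run the caller once per trio. This is only a sketch and has not been confirmed by the author. It assumes that each trio can be written as one line of three whitespace-separated IDs, that novoCaller accepts a trio file holding a single such line, and that all remaining samples in the vcf are acceptable as controls. The helper `per_trio_jobs` and the output file names are hypothetical; the flags and threshold values are copied from the command at the top of this thread.

```python
import os
import shlex


def per_trio_jobs(trio_lines, vcf_path, out_dir):
    """Build one novoCaller job description per three-ID line."""
    jobs = []
    for line in trio_lines:
        ids = line.split()
        if len(ids) != 3:
            continue  # skip blank or malformed lines
        tag = ids[0]  # used only to name the per-trio files
        trio_file = os.path.join(out_dir, tag + "_trio.txt")
        out_file = os.path.join(out_dir, tag + "_step1_out.txt")
        # Flags and values as used earlier in this thread; adjust as needed.
        cmd = ["novoCaller", "-I", vcf_path, "-O", out_file, "-T", trio_file,
               "-X", "1", "-P", "0.005", "-E", "0.008"]
        jobs.append({
            "trio_file": trio_file,               # path to write the IDs to
            "trio_text": " ".join(ids) + "\n",    # contents of that file
            "command": shlex.join(cmd),           # shell-ready command line
        })
    return jobs
```

Each job's `trio_text` would be written to `trio_file` before its `command` is run. The function only builds the commands and does not execute anything.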

aojielian · 2019-03-04T02:02:45Z

Hi Anwoy,

I already have a vcf with CSQ annotations, i.e. VEP has been run on it. Here is my command to run novoCaller: "./novoCaller -I 11.vcf -O SSC02220.txt -T trio_ids.txt -X 1 -P 0.5 -E 0.008"

The trio_ids.txt file looks like this: "SSC02220 SSC02219 SSC02217 "

11.vcf is a quad vcf, i.e. it contains 4 individuals. Can novoCaller work on quad VCFs, or is something wrong with my command line?

Sorry to ask you so many trivial questions.

Best Regards,

Aojie

ghost · 2019-03-18T10:09:58Z

Hi Anwoy,

I am confused about the unrelated control samples.
Should they be samples with a normal phenotype, samples with other diseases, or different samples that have the same phenotype?

I am new to bioinformatics. There's so much that I don't understand.
Sorry to ask you so many trivial questions.

Thanks a lot!
Liangdy

anwoy · 2019-03-18T13:47:59Z

Hi Liangdy, the unrelated samples can be samples with normal phenotype, or samples with other diseases. Best Regards, Anwoy

anwoy · 2019-03-18T14:14:23Z

The unrelated samples should also not be relatives of the proband (cousins, uncles, aunts, etc. of the proband are not preferred).

ghost · 2019-03-19T02:08:53Z

Hi Anwoy,

Thank you for your answers.

If we merge multiple vcf files with vcftools or bcftools, the unrelated sample columns of the merged file may look like this:

#CHR POS ... AGG0002 AGG0003 AGG0001
Q X ... 1/0:10,0:10:27:0,27,405 .:.:.:.:. .:.:.:.:.

AGG0003 and AGG0001 lose information such as DP, PQ and so on.

When the samples are instead merged at the BAM level with GATK, the information above is preserved, but the computational cost is obviously higher.

#CHR POS ... AGG0002 AGG0003 AGG0001
Q X ... 1/0:10,0:10:27:0,27,405 2/2:10,0:10:27:0,27,405 3/3:12,0:12:30:0,30,450

Which approach is more suitable for DNM calling, in order to maximize accuracy and eliminate false negatives?
Or do these differences have almost no effect on the final result?

Sorry to ask you so many trivial questions, just like before.

Thanks a lot!
Liangdy

anwoy · 2019-03-22T11:26:22Z

The AD (allele depth) information is needed in as many unrelated samples as possible, since it is used to judge the quality of the de-novo call.
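To see how much AD information survives a given merge strategy, the sample columns of a record can be inspected directly. The sketch below counts the samples in one vcf data line that carry a numeric AD value. The function name `count_samples_with_ad` is hypothetical, and it assumes a tab-separated record whose FORMAT column lists AD as one of its keys (for example `GT:AD:DP:GQ:PL`, which is a guess at the layout of the lines quoted above).

```python
def count_samples_with_ad(vcf_line):
    """Return (samples with a numeric AD value, total samples) for one record."""
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")  # FORMAT column lists the per-sample keys
    samples = cols[9:]
    if "AD" not in keys:
        return 0, len(samples)
    ad_index = keys.index("AD")
    present = 0
    for sample in samples:
        values = sample.split(":")
        # Trailing fields may be omitted, and "." marks a missing value.
        if ad_index < len(values):
            depths = values[ad_index].split(",")
            if all(depth.isdigit() for depth in depths):
                present += 1
    return present, len(samples)
```

Applied to every record of a merged file, a low ratio of samples with AD to total samples would suggest that the merge dropped the depth information the caller relies on.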

olenamarchenko1234 · 2022-10-18T15:45:47Z

@anwoy Thank you for the tips! Can you provide an example of the runtime for an exome trio? For a full-genome trio? Can it be scaled to run on a pVCF with 50K samples?
