Providing example files #1

Open
cmarcuscy opened this issue Jan 7, 2019 · 16 comments

@cmarcuscy

Hi developers of novoCaller,

I have tried running the first layer of novoCaller with the following command, but the program just keeps running for over 24 hours without generating any output data. I am new to bioinformatics, so please correct me if I made any mistakes.

Command:
novoCaller -I input.vcf -O step_1_out.txt -T sample_id.txt -X 1 -P 0.005 -E 0.008

vcf:
example.vcf.gz

sample ID file:
sample_id.txt

It would be very helpful if you could provide example files for the program.

Thanks a lot!

Marcus

@anwoy
Collaborator

anwoy commented Jan 7, 2019 via email

@cmarcuscy
Author

Hi Anwoy,

Thank you for your answers. Following your suggestions, I have included more samples (261 samples) in the run and made sure the sample names and sample IDs match, but the program is still unable to generate any data (after running for 2 days), and no error message appears. Do you have any suggestions on how I should troubleshoot?

Thanks a lot!

Regards,
Marcus
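
The name-matching check mentioned above can be scripted. The following is a minimal sketch, not part of novoCaller: it assumes the ID file is whitespace-separated and that the VCF has a standard `#CHROM` header line. The function names `vcf_sample_names` and `ids_missing_from_vcf` are hypothetical helpers introduced only for illustration.

```python
import gzip


def vcf_sample_names(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF (plain or gzipped)."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first 9 columns are fixed (CHROM..FORMAT); sample names follow.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + vcf_path)


def ids_missing_from_vcf(id_path, vcf_path):
    """Return IDs from the whitespace-separated ID file that are absent from the VCF header."""
    samples = set(vcf_sample_names(vcf_path))
    missing = []
    with open(id_path) as handle:
        for line in handle:
            for sample_id in line.split():
                # Report each unknown ID once, in order of first appearance.
                if sample_id not in samples and sample_id not in missing:
                    missing.append(sample_id)
    return missing
```

An empty list from `ids_missing_from_vcf("sample_id.txt", "input.vcf")` would suggest that every listed ID appears in the VCF header. It says nothing about whether the file layout is what novoCaller expects.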

@anwoy
Collaborator

anwoy commented Jan 11, 2019 via email

@cmarcuscy
Author

Dear Anwoy,

Please find the vcf (first 1000 lines) and samples.txt files below. Thanks!

novocaller_sample.vcf.gz

novoCaller_samples.txt

Marcus

@anwoy
Collaborator

anwoy commented Jan 13, 2019 via email

@anwoy
Collaborator

anwoy commented Jan 16, 2019 via email

@cmarcuscy
Author

Hi Anwoy,

Thank you for your work to fix the bug. I will try running novocaller after running VEP.

Regards,
Marcus

@cmarcuscy
Author

Hi Anwoy,
I have annotated the VCF with VEP and the program now runs successfully. Nonetheless, I encountered some unexpected results.

infilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276.recaliecalls_kggseq_samprm_vep.vcf
trio_ID_filename=/home/ramsar1971/project/asd/Reannotation/ASD88_Trio_novocaller.txt
outfilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276_step1_out.txt
X_choice=1
PP_thresh=0.005
ExAC_thresh=0.008
vcf_line_cols:


0 1 2 3 4 5 6 7 8
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
total_candidates=261
end_col=260
number of parents = 258
number of children = 3
parent_cols=
3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111:112:113:114:115:116:117:118:119:120:121:122:123:124:125:126:127:128:129:130:131:132:133:134:135:136:137:138:139:140:141:142:143:144:145:146:147:148:149:150:151:152:153:154:155:156:157:158:159:160:161:162:163:164:165:166:167:168:169:170:171:172:173:174:175:176:177:178:179:180:181:182:183:184:185:186:187:188:189:190:191:192:193:194:195:196:197:198:199:200:201:202:203:204:205:206:207:208:209:210:211:212:213:214:215:216:217:218:219:220:221:222:223:224:225:226:227:228:229:230:231:232:233:234:235:236:237:238:239:240:241:242:243:244:245:246:247:248:249:250:251:252:253:254:255:256:257:258:259:260:
trio_set=
1:2:0:
CSQ_ExAC_AF_col=32

It seems that the program only recognizes three sets of trios among the 88 trios included.
Another point to note is that the output only contains 1 candidate DN mutation:

Do you have any idea? Thanks!
Input vcf:
1000_novocaller.vcf.gz

Input txt file:
pedigree.txt

Output file:
novocaller_step1_out.txt

Marcus
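
The log above reports `CSQ_ExAC_AF_col=32`. Assuming this is the position of the `ExAC_AF` sub-field inside the VEP `CSQ` annotation (an assumption, not confirmed in this thread), the value can be cross-checked against the VCF header. VEP lists the CSQ sub-fields after `Format:` in the `##INFO=<ID=CSQ,...>` header line. The sketch below uses a hypothetical helper, `csq_field_index`, and returns a 0-based index; whether novoCaller counts from 0 or 1 is not known here.

```python
import re


def csq_field_index(header_lines, field="ExAC_AF"):
    """Return the 0-based position of `field` in the VEP CSQ format, or None if absent."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            # VEP writes the sub-field order as 'Format: A|B|C' inside the Description.
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                fields = match.group(1).strip().split("|")
                return fields.index(field) if field in fields else None
    return None
```

A result of `None` would indicate that the VCF has no CSQ header or that VEP was run without the option that adds `ExAC_AF`.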

@anwoy
Collaborator

anwoy commented Feb 23, 2019 via email

@aojielian

Hi Anwoy,

I have already run VEP on the VCF, so it now carries the CSQ annotation. Here is the command I use to run novoCaller: "./novoCaller -I 11.vcf -O SSC02220.txt -T trio_ids.txt -X 1 -P 0.5 -E 0.008"

The trio_ids.txt looks like "SSC02220 SSC02219 SSC02217 "

11.vcf is a quad VCF, i.e. it contains 4 individuals. Can novoCaller work on quad VCFs, or is something wrong with my command line?

Sorry to ask you so many trivial questions

Best Regards,

Aojie

@ghost

ghost commented Mar 18, 2019

Hi Anwoy,

I am perplexed about unrelated control samples.
Are the unrelated samples those with a normal phenotype, those with another disease, or different samples that have the same phenotype?

I am new to bioinformatics. There's so much that I don't understand.
Sorry to ask you so many trivial questions

Thanks a lot!
Liangdy

@anwoy
Collaborator

anwoy commented Mar 18, 2019 via email

@anwoy
Collaborator

anwoy commented Mar 18, 2019 via email

@ghost

ghost commented Mar 19, 2019

Hi Anwoy,

Thank you for your answers.

If we merge multiple VCF files with vcftools or bcftools, the unrelated-sample information in the merged file may look as follows:

#CHR POS ... AGG0002 AGG0003 AGG0001
Q X ... 1/0:10,0:10:27:0,27,405 .:.:.:.:. .:.:.:.:.

AGG0003 and AGG0001 lose information such as DP, PQ and so on.

When the merge is done at the BAM level with GATK, the information above is preserved, but the computational cost obviously increases.

#CHR POS ... AGG0002 AGG0003 AGG0001
Q X ... 1/0:10,0:10:27:0,27,405 2/2:10,0:10:27:0,27,405 3/3:12,0:12:30:0,30,450

Which approach is more suitable for DNM calling, in order to maximize accuracy and eliminate false negatives?
Or do these adjustments have almost no effect on the final result?

Sorry to ask you so many trivial questions just like before

Thanks a lot!
Liangdy
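
To gauge how much information a file-level merge loses, one can count, per sample, the records whose genotype is missing (the `.:.:.:.:.` entries shown above). The following is a minimal sketch under stated assumptions: tab-separated VCF lines and `GT` as the first FORMAT key, as the VCF specification requires when GT is present. `missing_genotype_counts` is a hypothetical helper name, not a vcftools, bcftools or novoCaller function.

```python
def missing_genotype_counts(vcf_lines):
    """Count, per sample, the records whose genotype (GT) is '.', './.' or '.|.'."""
    samples = []
    counts = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the 10th column of the header line.
            samples = cols[9:]
            counts = {name: 0 for name in samples}
            continue
        for name, value in zip(samples, cols[9:]):
            genotype = value.split(":")[0]  # GT is the first FORMAT sub-field
            if genotype in (".", "./.", ".|."):
                counts[name] += 1
    return counts
```

The function accepts any iterable of lines, for example an open file handle. Comparing the counts for a file-level merge with those for a joint call shows how many sites each control sample would contribute no read information for.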

@anwoy
Collaborator

anwoy commented Mar 22, 2019 via email

@olenamarchenko1234

@anwoy Thank you for the tips! Can you provide an example of the runtime for an exome trio? For a whole-genome trio? Can it be scaled to run on a pVCF with 50K samples?
