BUG: [config.yaml error] #1018

Open
bioinfolabmu opened this issue Nov 26, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@bioinfolabmu

Describe the bug
My raw paired-end FASTQ files are named like this:

Sample1_Treat2_Replicate1_1.fq.gz
Sample1_Treat2_Replicate1_2.fq.gz

In my configuration file, I have the following entries:

fqsuffix: fq
fqext1: _1
fqext2: _2

Consequently, the program looked for a file named Sample1_Treat2_Replicate1__1.fq.gz (with two underscores instead of one).
I then changed the configuration file to:

fqsuffix: fq
fqext1: 1
fqext2: 2

Next, I got this error:

Error validating config file.
ValidationError: 1 is not of type 'string'

Failed validating 'type' in schema['properties']['fqext1']:
OrderedDict([('description',
'filename suffix when handling paired-end data, '
'describing the forward read'),
('default', 'R1'),
('type', 'string')])

On instance['fqext1']:
1

@bioinfolabmu bioinfolabmu added the bug Something isn't working label Nov 26, 2023
@bioinfolabmu
Author

Alright, I think I know what the problem is:

fqsuffix: fq
fqext1: '1'
fqext2: '2'

This modification has solved my configuration file problem.
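The two failures above can be reproduced with a minimal sketch. It assumes, based only on the file name reported in this issue, that the expected name is built as `<sample>_<fqext>.<fqsuffix>.gz`; the helper `expected_fastq` is hypothetical and is not seq2science code. The type check imitates the schema rule `'type': 'string'` from the error message, since an unquoted `1` in YAML is loaded as an integer.

```python
def expected_fastq(sample, fqext, fqsuffix="fq"):
    """Build a paired-end file name as <sample>_<fqext>.<fqsuffix>.gz (assumed pattern)."""
    # Imitate the schema check "'type': 'string'" seen in the validation error.
    if not isinstance(fqext, str):
        raise TypeError(f"{fqext!r} is not of type 'string'")
    return f"{sample}_{fqext}.{fqsuffix}.gz"


sample = "Sample1_Treat2_Replicate1"

# fqext1: _1  -> the underscore separator ends up in the name twice
double_underscore = expected_fastq(sample, "_1")

# fqext1: '1' -> quoted, so it stays a string and matches the file on disk
correct = expected_fastq(sample, "1")

# fqext1: 1   -> unquoted, so YAML hands over an integer and validation fails
try:
    expected_fastq(sample, 1)
    unquoted_ok = True
except TypeError:
    unquoted_ok = False

print(double_underscore)
print(correct)
print(unquoted_ok)
```

Quoting the value in config.yaml is what keeps it a string, which is why `fqext1: '1'` passes validation.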

@bioinfolabmu
Author

A related question about the samples.tsv file.

On the second line of this file, I have the following tab-delimited column names:

sample assembly dev_stage treatment biological_replicates

My question is: are these column names fixed keywords in your tool? For example, what happens if I use "developmental_stage" instead of "dev_stage", or "genome" instead of "assembly"?

I did not find this requirement in your documentation. I am sorry if I missed something.

@Maarten-vd-Sande
Member

Great that you fixed it. You can have any number of columns and name them whatever you want. However, certain column names have a specific meaning, such as sample, assembly, biological_replicates, and descriptive_name; seq2science uses these columns internally. Moreover, you can use your own column names in the differential peak/gene calling step: https://vanheeringen-lab.github.io/seq2science/content/DESeq2.html#contrast-in-the-samples-tsv

Specifically:

  • "developmental_stage" -> "dev_stage" does not change anything in how seq2science runs, as those columns are ignored (except when they are used to define contrasts)
  • "assembly" -> "genome" makes seq2science stop working, because the assembly column is required. This column specifies which assembly is used

@bioinfolabmu
Author

Thank you for the quick response.

(1) So, for the DEG analysis, the column names in samples.tsv should be consistent with your requirements. In fact, "dev_stage" should be "stages". That way, we can make sure to get proper results from seq2science. Am I correct?

(2) Column names such as "sample", "assembly", "stages", "treatments", "biological_replicates", "technical_replicates" and "condition" apply to many researchers' data analysis needs, so that should be sufficient. I noticed that "descriptive_name" must be unique across the rows of samples.tsv. Am I right? I am curious how seq2science uses "descriptive_name" internally.

@Maarten-vd-Sande
Member

(1) I don't understand the question. For the DEG analysis you can use any column(s) in the samples file. You can use dev_stage or stages. Just make sure that your contrast specification in the config.yaml reflects the correct column name.

If you use the column dev_stage, the contrast is dev_stage_one_two; if you use the column stages, it is stages_one_two. It can be any column you want. You can even combine multiple columns for batch effect correction: https://vanheeringen-lab.github.io/seq2science/content/DESeq2.html#batch-effect-correction

(2) descriptive_name is one of the special columns that seq2science uses internally, just like, for example, sample, assembly, and biological_replicates. It is used for the count table and for the final multiqc report.
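The contrast naming from point (1) can be sketched as follows. This assumes only what the examples in this thread show, namely that a contrast is written as the column name followed by two of its values, joined with underscores. The helper `make_contrast` and the argument names `first` and `second` are hypothetical; which of the two values acts as the reference is not stated here, so see the linked DESeq2 documentation for that.

```python
def make_contrast(column, first, second, sample_columns):
    """Build a contrast string of the form <column>_<first>_<second> (assumed format)."""
    # The contrast has to name a column that actually exists in samples.tsv.
    if column not in sample_columns:
        raise ValueError(f"column {column!r} is not in samples.tsv: {sample_columns}")
    return f"{column}_{first}_{second}"


columns = ["sample", "assembly", "dev_stage", "treatment", "biological_replicates"]

contrast = make_contrast("dev_stage", "one", "two", columns)
print(contrast)

# A contrast on a column that was renamed or never existed is rejected.
try:
    make_contrast("stages", "one", "two", columns)
    stale_column_ok = True
except ValueError:
    stale_column_ok = False
print(stale_column_ok)
```

The point of the check is that renaming a column in samples.tsv is harmless only as long as the contrast in config.yaml is renamed with it.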

@bioinfolabmu
Author

Thanks. My question was: should we use 'stage' instead of 'dev_stage' or 'developmental_stages'? You said that it does not matter, because they are not keywords used in seq2science. My guess is that 'condition' is also not a keyword used by seq2science, so we can use variations of it, such as 'conditions' or 'my_conditions'. Right?

@bioinfolabmu
Author

When I ran my own data through alignment with STAR, I encountered a bug, which I am debugging now. I noticed that Salmon, as the quantifier, is not affected at all and generates its own output. This means that Salmon does its own alignment to perform the quantification. My next question: after I fix the STAR bug, how can I feed the STAR alignment results to Salmon for quantification?

@Maarten-vd-Sande
Member

Yes you are right! I guess that's not entirely clear from the docs.

sample, assembly, descriptive_name, biological_replicates, and technical_replicates are column names used by seq2science internally. Any other column name is basically ignored, unless you use it for DESeq2
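As a rough illustration of this rule, the sketch below sorts a samples.tsv header into reserved and free columns. The reserved names are the ones listed above. Two things are assumptions: that `sample` is required alongside `assembly` (only `assembly` is called required in this thread), and that lines starting with `#` are comments. The function `classify_columns` is hypothetical and is not part of seq2science.

```python
import csv

# Column names listed in this thread as used internally by seq2science.
RESERVED = {"sample", "assembly", "descriptive_name",
            "biological_replicates", "technical_replicates"}
# Assumption: 'assembly' is stated as required; 'sample' is assumed to be as well.
REQUIRED = {"sample", "assembly"}


def classify_columns(tsv_text):
    """Return (missing_required, reserved_present, free_columns) for a samples.tsv header."""
    # Assumption: '#' lines are comments and the first remaining line is the header.
    lines = [line for line in tsv_text.splitlines()
             if line.strip() and not line.startswith("#")]
    header = next(csv.reader(lines, delimiter="\t"))
    columns = [name.strip() for name in header]
    missing = sorted(REQUIRED - set(columns))
    reserved = [c for c in columns if c in RESERVED]
    free = [c for c in columns if c not in RESERVED]
    return missing, reserved, free


original = "# comment\nsample\tassembly\tdev_stage\ttreatment\tbiological_replicates\n"
renamed = "sample\tgenome\tdevelopmental_stage\ttreatment\tbiological_replicates\n"

print(classify_columns(original))
print(classify_columns(renamed))
```

With the renamed header, `assembly` shows up as missing while `developmental_stage` simply lands among the free columns, which matches the two cases discussed earlier in the thread.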

@bioinfolabmu
Author

Here is what is in my config.yaml:

aligner:
  star:
    align: --quantMode GeneCounts --outSAMtype BAM

Seq2science gives me a fatal error in the log file saying "Duplicate parameter". I am trying to solve this problem.

@Maarten-vd-Sande
Member

Yeah that's perhaps unclear on our side (again). Almost all rules have sensible defaults, so you don't have to tune them. So you could just say: aligner: star.

We force STAR to output a BAM file by default, because we need a BAM as its output, so we always pass --outSAMtype BAM Unsorted. Adding --outSAMtype yourself therefore gives a duplicate parameter.

see: https://vanheeringen-lab.github.io/seq2science/content/all_rules.html#star-align
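A quick way to spot this kind of clash before running is to compare the flags in your own `align` string against the flags the rule already sets. The sketch below is only an illustration: the `defaults` string is an assumption based on the comment above, not copied from the seq2science rule, and `duplicate_flags` is a hypothetical helper.

```python
import shlex


def duplicate_flags(default_args, user_args):
    """Return the '--' flags that appear in both argument strings."""
    def flags(args):
        # Keep only tokens that look like long options, e.g. --outSAMtype.
        return {token for token in shlex.split(args) if token.startswith("--")}
    return sorted(flags(default_args) & flags(user_args))


# Assumption: what the rule always passes, per the maintainer's comment.
defaults = "--outSAMtype BAM Unsorted"
# The align string from the config.yaml in this issue.
user = "--quantMode GeneCounts --outSAMtype BAM"

dups = duplicate_flags(defaults, user)
print(dups)

# Dropping --outSAMtype from the user string removes the clash.
print(duplicate_flags(defaults, "--quantMode GeneCounts"))
```

Any flag reported here is one to remove from config.yaml, since the default already supplies it.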

@Maarten-vd-Sande
Member

Also, I'm not sure whether the downstream steps work when you change --quantMode.

@bioinfolabmu
Author

I remember reading somewhere that you do not support 2-pass STAR alignment yet. What do you recommend if we want to do the two passes and then come back to seq2science?

@bioinfolabmu
Author

I also encountered a problem running Trim Galore, but no problem with fastp. My guess is that it is a similar configuration problem with default versus non-default parameters. I will debug that later.

@Maarten-vd-Sande
Member

Maarten-vd-Sande commented Nov 28, 2023

I'm not familiar with STAR's 2-pass alignment, so I can't comment on that... What does it do? What changes? The sample fastqs, the genome assembly, or the index?

@siebrenf
Member

Hey bioinfolabmu,

I'm trying to read your questions but I get a bit confused. Please keep to one question per git Issue (I really don't mind if you open multiple 👼 )

I'll open some new issues for each question here that we haven't answered yet, and then try to answer them there!

This was referenced Nov 29, 2023
3 participants