Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Calling variants using IsoPhase from Iso-Seq data #198

Open
Yexin-Zhang opened this issue Feb 28, 2022 · 1 comment
Open

Calling variants using IsoPhase from Iso-Seq data #198

Yexin-Zhang opened this issue Feb 28, 2022 · 1 comment

Comments

@Yexin-Zhang
Copy link

Yexin-Zhang commented Feb 28, 2022

Hi,

I was trying to use IsoPhase do call SNPs from the Pacbio Isoseq data for a single sample. However, when I run the select_loci_to_phase.py with all the files required, I was keeping getting 0 loci.

  • The GFF file is obtained after running SQANTI3 filtering
11      PacBio  transcript      63617628        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63617628        63619094        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63620065        63620182        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63622845        63623057        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63623737        63623954        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63627298        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
12      PacBio  transcript      20211007        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20211007        20211486        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20214033        20214764        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20215256        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
  • The FASTQ file is obtained after doing CCS analyses - LIMA - Isoseq refine, but before clustering. The headers of each sequence are in the format @m64319e_211227_144417/3211320/ccs.

  • The read_stat file is the count information after collapse obtained by running get_abundance_post_collapse.py

id      is_fl   stat    pbid
m64319e_211227_144417/3211320/ccs       Y       unique  PB.1.1
m64319e_211227_144417/65145123/ccs      Y       unique  PB.1.1
m64319e_211227_144417/111935771/ccs     Y       unique  PB.2.1
m64319e_211227_144417/67240066/ccs      Y       unique  PB.2.1
m64319e_211227_144417/123208230/ccs     Y       unique  PB.3.1
m64319e_211227_144417/160236121/ccs     Y       unique  PB.3.1
m64319e_211227_144417/42860688/ccs      Y       unique  PB.3.1

I have also manually checked that the coverage at some loci is over 40, even some isoforms (e.g. P.B.78.13) are having support from more than 40 reads.

It was very mysterious since when I used the exactly same reference genome and GFF file, but changed the FASTQ file to sample1..FL.RE.CL.hq.fasta (obtained after doing clustering, headers are formated as >transcript/0 full_length_coverage=4;length=5054), as well as changed the read_stat file into a new file converted using python <path_to>/phasing/utils/convert_group_to_read_stat_file.py, everything worked fine.

The new read_stat file is like:

id      pbid    is_fl   stat
transcript/304  PB.1.1  Y       unique
transcript/3111 PB.2.1  Y       unique
transcript/3713 PB.3.1  Y       unique
transcript/3605 PB.4.1  Y       unique
transcript/2086 PB.5.1  Y       unique
transcript/2546 PB.6.1  Y       unique
transcript/1800 PB.7.1  Y       unique

I was very confused, since I was hoping the phasing step would take the information of the full-length read sequences before clustering into isoform sequences. It would be very helpful if you could help with this problem. Thank you.

Monica

@cpaisie
Copy link

cpaisie commented Mar 4, 2024

@Yexin-Zhang did you use the --fq flag in the command when you ran select_loci_to_phase.py? I had the same problem as you, but when I added --fq to the commands, everything ran as expected.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants