xTea on plant species #20

agolicz · 2021-09-24T07:33:38Z

Hi,
We wanted to try xTea on a plant species. Is that possible or does it require a human reference?

Agnieszka

simoncchu · 2021-09-24T21:10:14Z

Hi, the current repeat library is prepared only for human. You need to prepare a library for the plant species you want to work on. In general, you need to know which type (family) of repeats you are working on, their consensus sequences, the reference genome of the species, and the repeat annotation (from RepeatMasker or other tools). With these, we can generate the library for the plant species and run xTea on the alignments. I haven't tried this before. I think it should work, but preparing the library will take extra effort.

agolicz · 2021-09-28T11:19:38Z

Thanks! We will give it a try and see how it goes.

mnshgl0110 · 2021-11-12T16:06:15Z

It would be really helpful to have a guide on how to generate a repeat library for non-human species.

@agolicz did you manage to run xTea on plants? Could you please share how it went?

agolicz · 2021-11-13T17:13:17Z

Unfortunately I have not had the time yet. Hopefully at the beginning of next year.
EDTA (https://github.com/oushujun/EDTA) seems like a good candidate for plant repeat library creation.

simoncchu · 2021-12-13T22:38:17Z

I put a readme on preparing the repeat library here: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. Would you like to give it a try? Note that xTea only works for TE insertions of a known type, and it needs the RepeatMasker annotation of the TEs you are interested in as well as a consensus sequence.

agolicz · 2021-12-14T10:39:07Z

Would it also be possible to add fasta header format (>xxx) for the TE library and the repeatmasker command to be used with that library?

simoncchu · 2021-12-14T17:55:46Z

What do you mean by fasta header format? For which file?
The RepeatMasker output format is explained here: https://www.repeatmasker.org/webrepeatmaskerhelp.html. To run RepeatMasker on your assembled genome, you need to construct a consensus sequence library and then feed it to RepeatMasker (which will call BLAST) to do the annotation. You can run tools like RepeatScout to construct the consensus from the assembled genome. You can check the parameters by running RepeatMasker --help. If the species you are working on has a published reference genome, there is a good chance other people have already annotated it, and you can use those annotations directly.

agolicz · 2021-12-14T18:22:45Z

For the files:
TE-type.consensus.fa and TE_copies_with_flank.fa
Per the RepeatMasker documentation: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html
The recommended format for IDs in a custom library is:

repeatname#class/subclass
or simply
repeatname#class

I just wanted to confirm that xTea expects the same.
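
To make the naming convention above concrete, here is a minimal Python sketch that splits a FASTA header of the form repeatname#class/subclass into its parts. It only illustrates the RepeatMasker convention quoted above; whether xTea expects the same format is exactly the open question here. The function name `parse_repeat_id` and the example IDs are hypothetical.

```python
from typing import NamedTuple, Optional


class RepeatId(NamedTuple):
    name: str
    rep_class: Optional[str]
    subclass: Optional[str]


def parse_repeat_id(header: str) -> RepeatId:
    """Split a header such as '>L1HS#LINE/L1' into name, class and subclass."""
    # Drop the leading '>' and any description after the first whitespace.
    token = header.lstrip(">").split()[0]
    name, _, classification = token.partition("#")
    if not classification:
        # No '#': the header carries a repeat name only.
        return RepeatId(name, None, None)
    rep_class, _, subclass = classification.partition("/")
    return RepeatId(name, rep_class, subclass or None)
```

Running this over every header in a custom library is a quick way to spot entries that do not follow the repeatname#class pattern before handing the library to RepeatMasker.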

For running pipelines like that on non-model species it is very helpful to have toy datasets, so we can ensure everything is formatted as expected. Some formatting conventions are not the same for human/animal/plant genomics.

simoncchu · 2021-12-15T13:56:21Z

You only need to extract the TE type you want to work on (each type separately). For example, if you have a RepeatMasker output for the whole genome named species_rmsk.out, and you are interested in, say, LINE1, then you could run grep "LINE1" species_rmsk.out > species_rmsk_L1.out.

When generating TE_copies_with_flank.fa, you only need to feed in the full-length copies, so you need to select them by length from the generated species_rmsk_L1.out.
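
The two steps described above (pull out one TE type, then keep only the long copies) can be sketched in Python as follows. This is a minimal sketch and not part of xTea: the function name `select_full_length` is hypothetical, the keyword match simply mirrors the grep example, and it assumes that "length" means the span of the copy on the genome (columns 6 and 7 of the RepeatMasker .out file), which is not specified here.

```python
from typing import Iterable, Iterator


def select_full_length(lines: Iterable[str], keyword: str, min_len: int) -> Iterator[str]:
    """Yield RepeatMasker .out rows that mention `keyword` and span >= `min_len` bp."""
    for line in lines:
        if keyword not in line:  # same effect as: grep "keyword"
            continue
        fields = line.split()
        if len(fields) < 15:  # not a complete annotation row
            continue
        # Columns 6 and 7 (1-based) hold the begin and end of the copy on the genome.
        start, end = int(fields[5]), int(fields[6])
        if end - start + 1 >= min_len:
            yield line.rstrip("\n")
```

The `min_len` value is whatever counts as full length for the TE of interest; it differs between TE types and species, so it has to be chosen per library.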

I am thinking of writing a script to generate this automatically, but it's not easy to settle on one fixed procedure. Different species/TEs have different lengths and different IDs (some are customized sets).

agolicz · 2021-12-15T14:32:01Z

OK, thanks, that makes sense. I will try it in January.

adriaaanarcillo · 2022-03-01T09:09:24Z

Hello, @agolicz! I would like to ask if you have successfully used x-Tea on plants already? If so, how did it go? Thanks.

DR-genomics · 2022-05-10T23:03:38Z

Hello,
I tried to create a plant repeat library using your instructions given in: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. However, I received an error: xtea: error: no such option: -P

I don't see a -P option in the xtea help page either. What does -P stand for here?

Thanks!

simoncchu · 2022-05-11T13:30:42Z

Could you try again by replacing xtea with python full-path/x_TEA_main.py ?

DR-genomics · 2022-05-11T22:16:06Z

I tried the following and got this error:
python xtea/x_TEA_main.py -P -K -p ./ -r ../refgenome.fasta -a RMasker.out -o /home/xTea/TE_copies_with_flank.fa -e 100

Traceback (most recent call last):
  File "xtea/x_TEA_main.py", line 345, in <module>
    x_annotation.load_rmsk_annotation()
  File "/gpfs20/scratch/dramacha/xTea/xtea/x_annotation.py", line 241, in load_rmsk_annotation
    start_pos = int(fields[5])
ValueError: invalid literal for int() with base 10: 'position'
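
A note on the error above: in the standard RepeatMasker .out layout, 'position' is the sixth whitespace-separated field of the first header line ("SW perc perc perc query position in query ..."). One plausible cause, not confirmed in this thread, is therefore that the header lines were still present in RMasker.out when it was passed to -a. A minimal Python sketch for dropping them follows; the function name `strip_rmsk_header` is hypothetical and not part of xTea.

```python
from typing import Iterable, Iterator


def strip_rmsk_header(lines: Iterable[str]) -> Iterator[str]:
    """Yield only annotation rows, skipping header and blank lines."""
    for line in lines:
        fields = line.split()
        # Annotation rows start with an integer Smith-Waterman score;
        # header rows start with "SW" or "score", and blank lines have no fields.
        if fields and fields[0].isdigit():
            yield line
```

Filtering the whole-genome file this way, or extracting a single TE type with grep as suggested earlier in the thread, leaves only rows whose sixth field is an integer.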

zhuxf-lab · 2022-06-19T08:44:22Z

I am trying to prepare xTea repeat library using the chm13 genome.
I got the TE-type_rmsk.out, but currently have trouble getting the full-length-TE-type_rmsk.out.
Do you have any suggestions for how to get the full-length TE? By structure, or by length?

simoncchu · 2022-06-19T15:53:12Z

@zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

zhuxf-lab · 2022-06-21T13:54:07Z

> @zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

Hi, I tried using >5900bp as the cutoff for full-length L1. I ran hg38 first to see whether I could reproduce the provided hg38 rep_lib_annotation data. It turned out that my result was much larger than the provided annotation file. For example, the hg38_FL_L1_flanks.fa file I got is 53MB (using -e 100), while hg38_FL_L1_flanks_3k.fa in the provided rep_lib_annotation data is 2MB. My code is below; any idea what is incorrect? The hg38 reference genome and RepeatMasker output file are both from UCSC.

#########
grep "LINE1" hg38.fa.out > hg38.fa_L1.out

# Select the LINE1 copies >5900bp. Column 9 is the strand; column 13 is the
# repeat end; the repeat start is column 14 on the "C" strand, else column 12.
while read -r line
do
    eval "$(echo "${line}" | awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}')"
    if [ "$var_9" = "C" ]; then
        i_length=$(($var_13 - $var_14))
    else
        i_length=$(($var_13 - $var_12))
    fi
    if [ "$i_length" -gt 5900 ]; then
        echo "$line"
    fi
done < hg38.fa_L1.out > hg38.fa_L1_full_length.out

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100
#########

Also, is it reasonable to set the full-length cutoffs for Alu, SVA, and HERV to 250bp, 1900bp, and 8900bp?

It would be super helpful if you could kindly add chm13 into the rep_lib_annotation data. Thank you!

simoncchu · 2022-06-22T15:31:05Z

@zhuxf-lab I moved your question to a new issue #50, I'll work on it asap.

simoncchu · 2022-06-23T17:21:02Z

@zhuxf-lab while I am working on this issue: the size difference (53M vs 2M) is because I selected only L1HS (the reported active L1) rather than all the L1 subfamilies.

> Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

For SVA, I set 700bp.
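
To see how much each subfamily contributes and to restrict a RepeatMasker .out file to a single one such as L1HS, a minimal Python sketch could look like this. It is not the script used to build the provided libraries; the function names `count_subfamilies` and `keep_subfamily` are hypothetical, and it assumes the repeat name sits in column 10 as in the standard .out layout.

```python
from collections import Counter
from typing import Iterable, List


def count_subfamilies(lines: Iterable[str]) -> Counter:
    """Count annotation rows per repeat name (column 10 of the .out file)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip header and blank lines: data rows start with an integer score.
        if len(fields) >= 15 and fields[0].isdigit():
            counts[fields[9]] += 1
    return counts


def keep_subfamily(lines: Iterable[str], name: str) -> List[str]:
    """Keep rows whose repeat name equals `name` exactly, e.g. "L1HS"."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 15 and fields[9] == name:
            kept.append(line)
    return kept
```

Matching the repeat name exactly, rather than grepping for a substring, keeps related subfamilies out of the selection.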

zhuxf-lab · 2022-07-01T05:05:21Z

> @zhuxf-lab while I am working on this issue, the size difference (53M vs 2M) is because I only select L1HS (reported active L1) rather than all the L1 subfamilies.
>
> > Alu, SVA, HERV as 250bp, 1900bp, 8900bp?
>
> For SVA, I set 700bp.

Ok, Thanks!

bismarck1008 · 2022-09-05T07:09:30Z

I'm very curious to know if the process for the custom repeat library is available now.
https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep

simoncchu · 2022-09-05T15:10:53Z

@bismarck1008 it should work

bismarck1008 · 2022-09-07T23:11:04Z

xtea -P -K -p ./ -r path-of-reference-genome.fa -a path-to-rep-lib-folder/full-length-TE-type_rmsk.out -o path-output-folder/TE_copies_with_flank.fa -e 100
I tried the command above. The -P, -K, and -e parameters are not recognized.

simoncchu · 2022-09-08T12:45:12Z

@bismarck1008 could you try python your-xtea-folder/x_TEA_main.py instead of xtea?

adriludwig · 2022-12-09T15:01:37Z

Considering that other species have elements other than just L1, Alu, SVA, and HERV, would xTea identify them? In that case, which option should be used for the -y parameter? Thanks

simoncchu · 2022-12-09T21:56:49Z

@adriludwig, use "-y 32". Here is a readme: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep (at the bottom). It's not convenient, as you can only run one repeat type at a time. I'll try to write a new version/wrapper for this.

adriludwig · 2022-12-12T09:51:51Z

Thanks very much, @simoncchu. We are currently using mobster, but we would also like to test other tools. So I'll keep an eye on xTea updates.

simoncchu added the enhancement New feature or request label Sep 25, 2021

simoncchu mentioned this issue Jun 22, 2022

User request CHM13 libs #50

Closed

simoncchu mentioned this issue Nov 1, 2022

Using xTEa for Drosophila #63

Closed

simoncchu mentioned this issue Jun 21, 2023

no such option: --bamsnap #84

Closed
