Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xTea on plant species #20

Open
agolicz opened this issue Sep 24, 2021 · 27 comments
Open

xTea on plant species #20

agolicz opened this issue Sep 24, 2021 · 27 comments
Labels
enhancement New feature or request

Comments

@agolicz
Copy link

agolicz commented Sep 24, 2021

Hi,
We wanted to try xTea on a plant species. Is that possible or does it require a human reference?

Agnieszka

@simoncchu
Copy link
Collaborator

Hi, the current repeat library is only prepared for human. You need to prepare a library for the plant species that you want to work on. Generally, you need to know what type (family) of repeats you are working on, the consensus sequence of them, the reference genome of the species, and also the repeat annotation (from RepeatMasker or other tools). With this, we can generate the library for the plant species and run on the alignments. I didn't try this before. I think it should work, but need extra effort for library prepare.

@simoncchu simoncchu added the enhancement New feature or request label Sep 25, 2021
@agolicz
Copy link
Author

agolicz commented Sep 28, 2021

Thanks! We will give it a try see how it goes.

@mnshgl0110
Copy link

It would be really helpful if there is a guide on how to generate repeat library for non-humans.

@agolicz did you manage to run xTea on plants? Could you please share how did it go?

@agolicz
Copy link
Author

agolicz commented Nov 13, 2021

Unfortunately I have not had the time yet. Hopefully at the beginning of next year.
EDTA https://github.com/oushujun/EDTA, seems like a good candidate for plant repeat library creation.

@simoncchu
Copy link
Collaborator

I put a readme to prepare the repeat library here: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. Would you like to have a try? Note, xTea only works for TE insertions of known type, and need the repatmasker annotation of the TEs you are interested, and also a consensus sequence.

@agolicz
Copy link
Author

agolicz commented Dec 14, 2021

Would it also be possible to add fasta header format (>xxx) for the TE library and the repeatmasker command to be used with that library?

@simoncchu
Copy link
Collaborator

What do you mean by fasta header format? For which file?
The repeatmasker output format is explained here: https://www.repeatmasker.org/webrepeatmaskerhelp.html. To run RepeatMasker on your assembled genome, you need to construct a consensus sequence library, and then feed in RepeatMasker (will call blast) to do the annotation. You can run tools like repeatscout to construct the consensus from the assembled genome. You can check the parameters by running RepeatMasker --help. If the species you are working on has some published reference genome, with big chance other people had annotated the genome, and you can directly use those.

@agolicz
Copy link
Author

agolicz commented Dec 14, 2021

For the file:
TE-type.consensus.fa and TE_copies_with_flank.fa
Per repeat masker documentation: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html
The recommended format for IDs in a custom library is:

repeatname#class/subclass
or simply
repeatname#class

I just wanted to confirm that xTea expects the same.

For running pipelines like that on non-model species it is very helpful to have toy datasets, so we can ensure everything is formatted as expected. Some formatting conventions are not the same for human/animal/plant genomics.

@simoncchu
Copy link
Collaborator

simoncchu commented Dec 15, 2021

You only need to extract the TE type you wanted to work on (each type separately). For example, if you have a repeatmasker output for the whole genome named species_rmsk.out, and you are interested in say LINE1, then you could run grep "LINE1" species_rmsk.out > species_rmsk_L1.out.

When generate the TE_copies_with_flank.fa, you only need to feed in the full length copies, so you need to select based on length from the generated pecies_rmsk_L1.out.

I am thinking of having a script to automatically generate this, but it's not easy to have a fix mode. Different species/TEs are of different length and different ids (some are customized set).

@agolicz
Copy link
Author

agolicz commented Dec 15, 2021

Ok, thanks that makes sense. I will try to give it a try in January.

@adriaaanarcillo
Copy link

Hello, @agolicz! I would like to ask if you have successfully used x-Tea on plants already? If so, how did it go? Thanks.

@DR-genomics
Copy link

Hello,
I tried to create a plant repeat library using your instructions given in: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. However, I received an error: xtea: error: no such option: -P

And I don't see -P option in the xtea help page as well. What does -P stand for here?

Thanks!

@simoncchu
Copy link
Collaborator

Could you try again by replacing xtea with python full-path/x_TEA_main.py ?

@DR-genomics
Copy link

I tried the following and got this error:
python xtea/x_TEA_main.py -P -K -p ./ -r ../refgenome.fasta -a RMasker.out -o /home/xTea/TE_copies_with_flank.fa -e 100

Traceback (most recent call last):
File "xtea/x_TEA_main.py", line 345, in
x_annotation.load_rmsk_annotation()
File "/gpfs20/scratch/dramacha/xTea/xtea/x_annotation.py", line 241, in load_rmsk_annotation
start_pos = int(fields[5])
ValueError: invalid literal for int() with base 10: 'position'

@zhuxf-lab
Copy link

I am trying to prepare xTea repeat library using the chm13 genome.
I got the TE-type_rmsk.out, but currently have trouble getting the full-length-TE-type_rmsk.out.
Do you have any suggestions for how to get the full-length TE? By structure, or by length?

@simoncchu
Copy link
Collaborator

@zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

@zhuxf-lab
Copy link

@zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

Hi, I tried using >5900bp as the cutoff for the full length L1. I run hg38 first to see whether I can reproduce the result in the provided hg38 rep_lib_annotation data. It turned out that the result I got was much larger than the annotation file provided. For example, the hg38_FL_L1_flanks.fa file I got is 53MB (using -e 100), while the size of hg38_FL_L1_flanks_3k.fa in the provided rep_lib_annotation file is 2MB. I attached my code here, any idea where is incorrect? The hg38 reference genome and repeatmasker output file are all from UCSC.

#########
grep "LINE1" hg38.fa.out > hg38.fa_L1.out
cat hg38.fa_L1.out | while read line
do
eval $(echo ${line}|awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}')
if [ $var_9 == "C" ];then
i_length=$(($var_13 - $var_14))
else
i_length=$(($var_13 - $var_12))
fi
if [ $i_length -gt 5900 ];then
echo "$line"
fi
done >hg38.fa_L1_full_length.out ### this is to select out the LINE1 >5900bp

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100
#########

And is it reasonable to set cutoff for full-length Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

It would be super helpful if you could kindly add chm13 into the rep_lib_annotation data. Thank you!

@simoncchu
Copy link
Collaborator

@zhuxf-lab I moved your question to a new issue #50, I'll work on it asap.

@simoncchu
Copy link
Collaborator

simoncchu commented Jun 23, 2022

@zhuxf-lab while I am working on this issue, the size difference (53M vs 2M) is because I only select L1HS (reported active L1) rather than all the L1 subfamilies.

Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

For SVA, I set 700bp.

@zhuxf-lab
Copy link

@zhuxf-lab while I am working on this issue, the size difference (53M vs 2M) is because I only select L1HS (reported active L1) rather than all the L1 subfamilies.

Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

For SVA, I set 700bp.

Ok, Thanks!

@bismarck1008
Copy link

I'm very curious to know if the process for the custom repeat library is available now.
https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep

@simoncchu
Copy link
Collaborator

@bismarck1008 it should work

@bismarck1008
Copy link

xtea -P -K -p ./ -r path-of-reference-genome.fa -a path-to-rep-lib-folder/full-length-TE-type_rmsk.out -o path-output-folder/TE_copies_with_flank.fa -e 100
I tried the command line above. P, K, and e parameters are not identified

@simoncchu
Copy link
Collaborator

simoncchu commented Sep 8, 2022

try python your-xtea-folder/x_TEA_main.py instead of xtea? @bismarck1008

@adriludwig
Copy link

adriludwig commented Dec 9, 2022

Considering that other species have other elements than just L1, Alu, SVA and HERV, would xTea identify them? In this case which would be the option for y parameter? Thanks

@simoncchu
Copy link
Collaborator

@adriludwig , use "-y 32". Here is a readme: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep (at the bottom). It's not convenient as you can only run one repeat type at a time. I'll try to write up a new version/wrapper for this.

@adriludwig
Copy link

Thanks very much, @simoncchu. We are currently using mobster, but we would also like to test other tools. So I'll keep an eye on xTea updates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants