This repository contains details on the initial project to deploy the software Genes Linked by Underlying Evolution (GLUE) in the Antiviral Unit (AVU) in order to integrate HCV whole genome sequencing with clinical and epidemiological data. The project also uses the Maximum-Likelihood Clade Assignment (MLCA) method provided by the software to assign an HCV classification based on phylogenetic relationships. A prospective extension of this initial project would be the integration of an antiviral drug resistance analysis tool developed by the University of Glasgow Centre for Virus Research in collaboration with the AVU at Public Health England (PHE).
The project repository has two different branches:
- an initial branch called msc_bioinformatics_QMUL, which stores the original version v1.0 submitted as Juan's research project for the MSc in Bioinformatics at QMUL in September 2021. No new commits have been added since and no future modifications are planned for this branch.
- the main branch, used to improve some features of the project and fix issues discovered during the development of the MSc research project, which are discussed in section 3.2 (Adding and managing the AVU custom tables for the GLUE project). This branch holds version v2.0.
To clone a specific branch of the project repository, please use:
git clone -b <branch-name> https://github.com/juanledesma78/HCV-GLUE-AVU
The project has been tested using FASTA sequence data generated by a Sequence Capture Enrichment Next Generation Sequencing (NGS) protocol used by AVU to deliver its routine service. However, the clinical and epidemiological data were generated randomly, so no real patient information has been used to test the project.
The GLUE software can be installed in two different ways, described at http://glue-tools.cvr.gla.ac.uk/#/installation, and it can be used on MS Windows, Linux and Mac OSX. The current project has been tested on a Linux Ubuntu 18.04.6 LTS machine.
The following packages were tested for the use of the GLUE software in this project:
- GLUE v1.1.107
- Java: openjdk version "1.8.0_282", OpenJDK Runtime Environment
- MySQL v8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 (Ubuntu)
- BLAST 2.2.31
- MAFFT v7.475
- RAxML v8.2.12
Python 3.8 and libraries such as pandas v1.2.5, SQLite v3.33.0, SQLAlchemy v1.4.22 and Alembic v1.7.4 were used to manage the database and other steps in the project. The full list of requirements can be found in the file gluedatabase_environment.yml, which can also be used to create a conda environment called gluedatabase to manage Python and its dependencies in the project by typing:
conda env create -f gluedatabase_environment.yml
In order to activate the environment, please use
conda activate gluedatabase
Before creating the environment, modify the prefix in the yml file by replacing /home/phe.gov.uk/juan.ledesma/miniconda3/
with the path where conda is installed on your computer. A simple way to get that path is to type in a terminal:
conda env list
which returns the path to base, i.e. where conda is located:
# conda environments:
#
base * /home/another.user/miniconda3
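For convenience, the replacement can also be done with a short Python snippet. This is only an illustration; the new path below must be adapted to the local machine.

```python
# Example only: replace the original conda prefix in the environment file with
# the local installation path returned by `conda env list`.
from pathlib import Path

env_file = Path("gluedatabase_environment.yml")
old_prefix = "/home/phe.gov.uk/juan.ledesma/miniconda3/"  # prefix shipped with the repository
new_prefix = "/home/another.user/miniconda3/"             # adapt to your own machine

env_file.write_text(env_file.read_text().replace(old_prefix, new_prefix))
```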
For details on how to use and get familiar with GLUE, please visit the official GLUE documentation. For specific details on the code used to build the initial GLUE project, please see the Introduction and Materials and Methods sections of the MSc Bioinformatics project dissertation.
Once the repository is successfully cloned, a main project directory is created containing several subdirectories that store different data:
- sources/, sequence data in FASTA and/or GenBank XML format
- glue/, scripts written with GLUE commands needed to create the project in the console, and JavaScript programs
- modules/, configuration of the modules (tools) in GLUE
- alignments/, alignment of the reference sequences
- trees/, phylogenetic trees and other related files
- tabular/, dummy databases to test the project
- zDrive/, raw consensus sequence data in FASTA format
Additional subdirectories with other functions for the development of the project are:
- logs/, logs, reports and queries for different steps of the project
- database_model/, Python script with the final model of the project database
- migrations/, scripts to modify the initial GLUE MySQL database
Once the user is familiar with the project and BEFORE building the new project, it is strongly recommended to delete the following:
- all the subdirectories named NGSXX in sources/. Never remove ncbi-refseqs/, as this is needed for building the GLUE project hcv_glue_avu.
- all the files contained in the epi/, fasta/, genotyping/, molis/ and queries/ subdirectories of logs/
- the subdirectory migrations/ and the file alembic.ini
All these files were used to test the project during its implementation, so they are no longer needed if the project is intended to be used.
The GLUE software and the project must always be run from the main project directory. From a terminal, run the following command to start the GLUE console:
gluetools.sh
Once the GLUE console is displayed, run the master project file (avuHcvProject.glue) to build the project hcv_glue_avu by typing:
run file avuHcvProject.glue
After completion and validation that the project has been successfully created, the user can optionally confirm the existence of the project by entering the following command in the GLUE console:
project hcv_glue_avu
The console will display:
Mode path: /project/hcv_glue_avu
indicating the mode where the user is currently working. To check that the reference sequences and some other information exist in the project, type for instance:
list sequence sequenceID subtype genotype
and it will display a list with information related to the reference sequences (press N to show more information or Q to quit).
Mode path: /project/hcv_glue_avu
GLUE> list sequence sequenceID subtype genotype
+===============================+=========+=================================================+
| sequenceID | subtype | genotype |
+===============================+=========+=================================================+
| AB030907 | b | 2 |
| AB031663 | k | 2 |
| AB047639 | a | 2 |
| AB622121 | - | RF2b_1b_2 |
| AB661382 | b | 2 |
| AB661388 | b | 2 |
| AB677527 | - | RF2b_1b_4 |
| AB677530 | - | RF2b_1b_3 |
| AF009606 | a | 1 |
| AF064490 | a | 5 |
| AJ851228 | - | 1 |
| AM408911 | - | RF2_5 |
| AM910652 | g | 1 |
| AY051292 | c | 1 |
| AY587845 | - | RF2k_1b |
| AY651061 | c | 1 |
| AY859526 | a | 6 |
| D00944 | a | 2 |
| D10988 | b | 2 |
| D14853 | c | 1 |
| D17763 | a | 3 |
| D28917 | a | 3 |
| D49374 | b | 3 |
| D50409 | c | 2 |
| D63821 | k | 3 |
| D63822 | g | 6 |
| D84262 | b | 6 |
| D84263 | d | 6 |
| D84264 | k | 6 |
| D84265 | h | 6 |
| D90208 | b | 1 |
| DQ155560 | - | RF2i_6p |
| DQ155561 | i | 2 |
| DQ278891 | - | 6 |
| DQ278892 | w | 6 |
| DQ278893 | - | 6 |
| DQ278894 | n | 6 |
| DQ314805 | e | 6 |
| DQ314806 | g | 6 |
| DQ364460 | - | RF2b_1b_1 |
| DQ418786 | d | 4 |
| DQ418789 | a | 4 |
| DQ835760 | f | 6 |
| DQ835761 | j | 6 |
| DQ835762 | i | 6 |
| DQ835766 | m | 6 |
| DQ835767 | m | 6 |
| DQ835768 | n | 6 |
| DQ835769 | j | 6 |
| DQ835770 | i | 6 |
| DQ988074 | a | 4 |
| EF108306 | a | 7 |
+===============================+=========+=================================================+
Sequences 1 to 52 of 694 [F:first, L:last, P:prev, N:next, Q:quit]
Another way of checking that the project has been successfully created using the GLUE console is to type exit to leave the mode /project/hcv_glue_avu and then type:
schema-project hcv_glue_avu
Once in Mode path: /schema-project/hcv_glue_avu, type table and then use the tab key to list all the objects created in the project:
alignment alignment_member feature feature_location member_floc_note reference sequence
var_almt_note variation who_country who_intermediate_region who_region who_sub_region
Information for each new table object can be visualised by specifying the name:
table sequence list field
which will display whether new fields customised by the user have been added to the table sequence in the project hcv_glue_avu.
GLUE> table sequence list field
+==========+==========================+=========+===========+
| table | name | type | maxLength |
+==========+==========================+=========+===========+
| sequence | earliest_collection_year | INTEGER | - |
| sequence | gb_accession_version | VARCHAR | 50 |
| sequence | gb_collection_month | VARCHAR | 50 |
| sequence | gb_collection_month_day | INTEGER | - |
| sequence | gb_collection_year | INTEGER | - |
| sequence | gb_country | VARCHAR | 50 |
| sequence | gb_create_date | DATE | - |
| sequence | gb_genotype | VARCHAR | 50 |
| sequence | gb_gi_number | VARCHAR | 50 |
| sequence | gb_host | VARCHAR | 50 |
| sequence | gb_isolate | VARCHAR | 300 |
| sequence | gb_lab_construct | BOOLEAN | - |
| sequence | gb_length | INTEGER | - |
| sequence | gb_locus | VARCHAR | 50 |
| sequence | gb_organism | VARCHAR | 50 |
| sequence | gb_place_sampled | VARCHAR | 200 |
| sequence | gb_primary_accession | VARCHAR | 50 |
| sequence | gb_pubmed_id | VARCHAR | 50 |
| sequence | gb_recombinant | BOOLEAN | - |
| sequence | gb_subtype | VARCHAR | 50 |
| sequence | gb_taxonomy | VARCHAR | 200 |
| sequence | gb_update_date | DATE | - |
| sequence | genotype | VARCHAR | 50 |
| sequence | genotyping_core | VARCHAR | 50 |
| sequence | genotyping_e1 | VARCHAR | 50 |
| sequence | genotyping_e2 | VARCHAR | 50 |
| sequence | genotyping_method | VARCHAR | 200 |
| sequence | genotyping_ns2 | VARCHAR | 50 |
| sequence | genotyping_ns3 | VARCHAR | 50 |
| sequence | genotyping_ns4a | VARCHAR | 50 |
| sequence | genotyping_ns4b | VARCHAR | 50 |
| sequence | genotyping_ns5a | VARCHAR | 50 |
| sequence | genotyping_ns5b | VARCHAR | 50 |
| sequence | genotyping_p7 | VARCHAR | 50 |
| sequence | genotyping_polyprotein | VARCHAR | 50 |
| sequence | hcv_wg_pipeline | VARCHAR | 100 |
| sequence | latest_collection_year | INTEGER | - |
| sequence | pipeline_version | VARCHAR | 50 |
| sequence | reference_status | VARCHAR | 20 |
| sequence | sample_id | VARCHAR | 50 |
| sequence | subtype | VARCHAR | 50 |
+==========+==========================+=========+===========+
To exit GLUE, type exit in the console until Mode path: / is reached and then exit again, or simply type quit to leave the GLUE console from any Mode path.
An alternative way of checking that the project and its tables have been created is to use MySQL Workbench to inspect the relational MySQL database that GLUE generates when the project hcv_glue_avu is built.
Even though GLUE allows new user-defined tables to be added to the project by extending the schema (see the example of the WHO tables), some issues were discovered in the initial project (v1.0) when dealing with custom tables (see the Results and Discussion section of the MSc project dissertation).
The custom table hcv_glue_avu_sample was supposed to be connected to the default table hcv_glue_avu_sequence by using the GLUE command:
create link sequence sample --multiplicity MANY_TO_ONE
in line 40 of the file avuCustomTables.glue of version 1.0. Similar behaviour was expected for the other tables defined in avuCustomTables.glue, as well as for the WHO custom tables created using SchemaExtensions.glue. However, the relationships and foreign keys that these custom tables have with each other, or even with the fixed GLUE tables, were missing, as shown in Figure 1 and confirmed by inspection of the SQL script for the database of the project v1.0.
Figure 1. ER diagram of the database schema of the project hcv_glue_avu v1.0 created by MySQL Workbench. Note that GLUE prefixes each table with the project name hcv_glue_avu followed by the specific name of the table.
Another issue identified in the initial project v1.0 arose when querying the database (see page 24 of the MSc project dissertation), which made it impossible to query multiple custom tables at the same time. In addition, this could create a potential problem if new custom tables were required in the project and needed to be connected to existing ones. Python scripts were used to perform this step in the initial project version 1.0.
As an alternative for creating and querying the custom tables needed in the project (rather than using the standard way in GLUE) and to get full control of them, SQLAlchemy (SQLA) was used in the project hcv_glue_avu version 2.0. A database model called hcv_glue_avu_sqla_model.py was developed to define the tables created in step 3.1 (Building the initial project) and to add the new custom tables required for the GLUE project. This was done using declarative mapping, relationship definitions and other parameters provided by the SQLA ORM. Once the database model was completed, Alembic 1.7.4 was used to "restore" the foreign keys missing for the WHO tables and to add the foreign keys between the new AVU custom tables and the default GLUE tables.
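For orientation, the declarative mapping in hcv_glue_avu_sqla_model.py follows a pattern along these lines. This is only a minimal sketch: the class names, columns and keys are simplified assumptions, and the real model defines many more fields.

```python
# Minimal sketch of the declarative mapping approach (SQLAlchemy 1.4 ORM).
# Class names, columns and keys are illustrative only; the real model in
# database_model/hcv_glue_avu_sqla_model.py defines many more fields.
from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Patient(Base):
    __tablename__ = "hcv_glue_avu_patient"
    id = Column(String(50), primary_key=True)
    samples = relationship("Sample", back_populates="patient")

class Sample(Base):
    __tablename__ = "hcv_glue_avu_sample"
    id = Column(String(50), primary_key=True)
    patient_id = Column(String(50), ForeignKey("hcv_glue_avu_patient.id"))
    patient = relationship("Patient", back_populates="samples")
    sequences = relationship("Sequence", back_populates="sample")

class Sequence(Base):
    __tablename__ = "hcv_glue_avu_sequence"
    id = Column(String(50), primary_key=True)  # simplified; the real GLUE table differs
    sample_id = Column(String(50), ForeignKey("hcv_glue_avu_sample.id"))
    sample = relationship("Sample", back_populates="sequences")
```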
Firstly, Alembic must be initialised by using the command:
alembic init migrations
which creates a file alembic.ini and a directory called migrations, which contains the files env.py, README, script.py.mako and the subdirectory versions.
Alembic is then connected to the initial GLUE MySQL database by replacing # sqlalchemy.url = driver://user:pass@localhost/dbname
in the file alembic.ini with:
sqlalchemy.url = mysql+pymysql://gluetools:glue12345@localhost/GLUE_TOOLS
To allow Alembic to compare the GLUE MySQL database with the database model created by SQLA, access to the SQLA model is added to env.py by replacing target_metadata = None
with:
from database_model import hcv_glue_avu_sqla_model
target_metadata = hcv_glue_avu_sqla_model.Base.metadata
Once these parameters have been added, a "comparison" can be made by running:
alembic revision --autogenerate -m "name"
where name is any informative message describing the migration (e.g. "initial test", "adding AVU custom tables and foreign key for who tables"). In this step, the new custom tables (patient, sample, drug, treatment and hospital) defined by the SQLA model are added to the GLUE MySQL database, along with a new table called alembic_version, which is initially empty. All the foreign keys between the AVU custom tables are added to the schema, but the keys linking them to the default GLUE tables are still missing (e.g. columns sample.id and sequence.sample_id). A Python script (e.g. 460d623af7cb_adding AVU custom tables and foreign key for who tables), created and saved in migrations/versions/, lists suggested changes to make the GLUE MySQL database match the SQLA model.
"""adding AVU custom tables and foreign key for who tables
Revision ID: 460d623af7cb
Revises:
Create Date: 2021-11-17 14:42:11.994401
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql
# revision identifiers, used by Alembic.
revision = '460d623af7cb'
down_revision = None
branch_labels = None
depends_on = None
def upgrade():
# ### commands auto generated by Alembic - please adjust! ###
op.drop_table('field')
op.drop_table('schema_version')
op.drop_index('TABLE_NAME', table_name='AUTO_PK_SUPPORT')
op.drop_table('AUTO_PK_SUPPORT')
op.drop_table('project')
op.drop_table('link')
op.drop_table('custom_table')
op.create_index('sample_id', 'hcv_glue_avu_sequence', ['sample_id'], unique=False)
op.create_index('who_country_id', 'hcv_glue_avu_sequence', ['who_country_id'], unique=False)
op.create_foreign_key(None, 'hcv_glue_avu_sequence', 'hcv_glue_avu_sample', ['sample_id'], ['id'])
op.create_foreign_key(None, 'hcv_glue_avu_sequence', 'hcv_glue_avu_who_country', ['who_country_id'], ['id'])
op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_sub_region', ['who_sub_region_id'], ['id'])
op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_region', ['who_region_id'], ['id'])
op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_intermediate_region', ['who_intermediate_region_id'], ['id'])
# ### end Alembic commands ###
def downgrade():
# ### commands auto generated by Alembic - please adjust! ###
op.drop_constraint(None, 'hcv_glue_avu_who_country', type_='foreignkey')
op.drop_constraint(None, 'hcv_glue_avu_who_country', type_='foreignkey')
op.drop_constraint(None, 'hcv_glue_avu_who_country', type_='foreignkey')
op.drop_constraint(None, 'hcv_glue_avu_sequence', type_='foreignkey')
op.drop_constraint(None, 'hcv_glue_avu_sequence', type_='foreignkey')
op.drop_index('who_country_id', table_name='hcv_glue_avu_sequence')
...
The Python file is then edited to keep or add the specific operations in the upgrade() function (or downgrade() if needed) so that the GLUE MySQL database matches the SQLA model; a pared-down example is sketched below, after the note on downgrading. Once the final changes are made, the following command is used to apply them:
alembic upgrade 460d623af7cb
Conversely, if the user wants to apply the changes included in downgrade(), upgrade must be replaced with downgrade.
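As an illustration of such an edit, the autogenerated upgrade() shown above might be pared down so that only the desired index and foreign-key additions are kept, removing the op.drop_table() calls that would delete GLUE's own bookkeeping tables. This is a sketch of one possible edit, not a copy of the final revision file.

```python
# Sketch of a hand-edited upgrade(): the autogenerated drops of GLUE's internal
# tables (field, project, link, custom_table, ...) are removed and only the
# index and foreign-key additions are kept.
def upgrade():
    op.create_index('sample_id', 'hcv_glue_avu_sequence', ['sample_id'], unique=False)
    op.create_index('who_country_id', 'hcv_glue_avu_sequence', ['who_country_id'], unique=False)
    op.create_foreign_key(None, 'hcv_glue_avu_sequence', 'hcv_glue_avu_sample',
                          ['sample_id'], ['id'])
    op.create_foreign_key(None, 'hcv_glue_avu_sequence', 'hcv_glue_avu_who_country',
                          ['who_country_id'], ['id'])
    op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_sub_region',
                          ['who_sub_region_id'], ['id'])
    op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_region',
                          ['who_region_id'], ['id'])
    op.create_foreign_key(None, 'hcv_glue_avu_who_country', 'hcv_glue_avu_who_intermediate_region',
                          ['who_intermediate_region_id'], ['id'])
```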
The alembic_version table, newly added to the GLUE MySQL database, is updated with the migration identifier 460d623af7cb. Optionally, a SQL script can be generated to record all the parameters used to create the database using the command:
alembic upgrade 460d623af7cb --sql > migrations/460d623af7cb_initial_migration.sql
As a result, the GLUE MySQL database is updated with the new AVU tables, and the foreign keys of the WHO tables are "restored" so that the WHO tables are linked to each other and to the table sequence (Figure 2).
Figure 2. ER diagram of the database schema of the project hcv_glue_avu v2.0 created by MySQL Workbench after performing the migration with Alembic and SQLAlchemy.
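Besides MySQL Workbench, a quick optional check of the restored relationships is to list the foreign keys with SQLAlchemy's inspector, using the same connection string as in alembic.ini:

```python
# Optional check: list the foreign keys of the sequence and WHO country tables
# after the migration, using the connection string configured in alembic.ini.
from sqlalchemy import create_engine, inspect

engine = create_engine("mysql+pymysql://gluetools:glue12345@localhost/GLUE_TOOLS")
inspector = inspect(engine)

for table in ("hcv_glue_avu_sequence", "hcv_glue_avu_who_country"):
    for fk in inspector.get_foreign_keys(table):
        print(table, fk["constrained_columns"], "->",
              fk["referred_table"], fk["referred_columns"])
```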
The script import_fasta_seqs_to_GLUE.py is used to import the nucleotide FASTA sequences generated by NGS and saved in zDrive/ into the GLUE project. As a first step, the FASTA header and the file name of each sequence are renamed to the same ID, so that any potential issues with GLUE are avoided. The NGS id is also added to the final sequence ID for convenience, and the edited sequences are then saved in sources/ (a simplified sketch of this renaming step is shown below). The script calls GLUE to import the edited sequences into the project hcv_glue_avu and generates a report that is saved in logs/fasta.
The script takes an argument -i/--input, which is the path to the directory in zDrive/ containing the FASTA sequences to be imported (e.g. zDrive/NGS91/FASTAs):
python3 import_fasta_seqs_to_GLUE.py -i <path-to-fasta-sequences-in-zDrive/>
Additional information on how to use the script can be accessed by using:
python3 import_fasta_seqs_to_GLUE.py --help
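The renaming step described above boils down to something like the following sketch. The naming convention and paths are assumptions made for illustration; the actual script handles the AVU conventions and also drives the GLUE import.

```python
# Simplified sketch of the renaming step: give each FASTA file and its header
# the same identifier (with the NGS run id appended) before import into GLUE.
# The naming convention below is illustrative, not the one used by the script.
from pathlib import Path

def rename_fastas(run_dir: str, ngs_id: str, out_dir: str = "sources") -> None:
    out = Path(out_dir, ngs_id)
    out.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(run_dir).glob("*.fas*")):
        new_id = f"{fasta.stem}_{ngs_id}"                  # e.g. sample123_NGS91 (hypothetical)
        lines = fasta.read_text().splitlines()
        lines[0] = f">{new_id}"                            # rewrite the FASTA header
        (out / f"{new_id}.fasta").write_text("\n".join(lines) + "\n")

rename_fastas("zDrive/NGS91/FASTAs", "NGS91")
```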
GLUE provides a method called Maximum-Likelihood Clade Assignment (MLCA) to perform a genetic classification based on the phylogenetic relationships between new sequences and the reference sequences (for more details check the original publication). The components and instructions that GLUE needs to perform this analysis are stored in a JavaScript program called genotyping_by_genes.js. Although most of the NGS sequences consist of whole genomes, the program performs the classification based on single genes. Once the process is finished, it updates the information in the specific columns of the table sequence (e.g. genotyping_core, genotyping_e1, ...) and also consolidates all the results in the columns sequence.genotype and sequence.subtype.
The script genotyping_of_NGS_sequences.py is used to call GLUE to run the JavaScript program in the console taking as an argument a specific NGS run of interest saved in sources/:
python3 genotyping_of_NGS_sequences.py -run <path-to-specific-run-in-sources/>
Once the process is completed, a report with the details and results of this step is generated and saved in logs/genotyping.
Additional information on how to use the script can be accessed by using:
python3 genotyping_of_NGS_sequences.py --help
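For orientation, a wrapper script of this kind can drive the GLUE console non-interactively. The sketch below assumes the console accepts commands piped on standard input, and it only uses commands already shown in this document; it is not a copy of the actual script.

```python
# Sketch only: one way a Python wrapper could drive the GLUE console,
# assuming gluetools.sh accepts commands piped on standard input.
import subprocess
from typing import List

def run_glue_commands(commands: List[str]) -> str:
    """Pipe a list of GLUE console commands into gluetools.sh and return the output."""
    result = subprocess.run(
        ["gluetools.sh"],
        input="\n".join(commands + ["quit"]) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

print(run_glue_commands([
    "project hcv_glue_avu",
    "list sequence sequenceID subtype genotype",
]))
```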
Even though the new tables and relationships have been added to the database using SQLA/Alembic (see Figure 2), GLUE does not recognise these changes. Therefore, the population of data from the testing SQLite databases (molis.db and epi_database) is also carried out using SQLA.
The script populate_molis_metadata_using_sqlalchemy.py is used to add new records from the testing molis database into the custom tables patient and sample of the GLUE MySQL database. It also takes the FASTA headers of new sequences and extracts information from them, which is then written into the extended columns sequence.sample_id, sequence.hcv_wg_pipeline and sequence.pipeline_version. The data for the table hospital are also populated using a CSV file with a list of hospital codes and names. This list includes only a few hospitals and artificial codes for testing purposes, but new hospitals/clinics/institutions/codes can be added and modifications made if needed. Once the population of the tables hospital, patient, sample and sequence has finished, a report for the current NGS run is generated in logs/molis.
The script takes as an argument -s/--sources the path to the NGS run in sources/ containing the new sequences whose information is to be updated:
python3 populate_molis_metadata_using_sqlalchemy.py -s <path-to-specific-run-in-sources/>
Additional information on how to use the script can be accessed by using:
python3 populate_molis_metadata_using_sqlalchemy.py --help
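Conceptually, this population step amounts to an ORM session writing rows into the custom tables. The minimal sketch below reuses the hypothetical Patient and Sample classes from the earlier model sketch; the actual class names, fields and source of the records may differ.

```python
# Minimal sketch of populating the custom tables with a SQLAlchemy session.
# The Patient/Sample classes and fields are the hypothetical ones from the
# earlier model sketch; the real script reads its records from molis.db.
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from database_model.hcv_glue_avu_sqla_model import Patient, Sample  # hypothetical class names

engine = create_engine("mysql+pymysql://gluetools:glue12345@localhost/GLUE_TOOLS")

with Session(engine) as session:
    patient = Patient(id="P000001")                        # illustrative record
    sample = Sample(id="S000001", patient_id=patient.id)   # illustrative record
    session.add_all([patient, sample])
    session.commit()
```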
Additional clinical and epidemiological information from the testing epi_database is added to the tables patient and treatment, respectively, using the script add_epidata_to_patient_and_treatment.py. In this process, the table drug is also populated with records from a CSV file. For testing purposes, the CSV file contains only a few records, but the list can be updated with new antiviral drugs if needed. Once the data transfer is completed, a report for the current NGS run is generated in logs/epi.
As with the previous Python script, this one also takes as an argument -s/--sources the path to the NGS run in sources/ containing the new sequences:
python3 add_epidata_to_patient_and_treatment.py -s <path-to-specific-run-in-sources/>
Additional information on how to use the script can be accessed by using:
python3 add_epidata_to_patient_and_treatment.py --help
The querying system for the GLUE MySQL database consists of two elements. A module called query_engine.py contains some initial functions used to consult the database and generate reports, stored as CSV files in logs/queries. A second script, database_query_tool.py, is used to call the functions from that module and run the queries of interest.
At the moment, the system only takes four arguments, --patient, --run, --drug and --sampledate, which are independent and must be used one at a time to process a query. For instance, to consult and retrieve information from the database related to runs NGS91 and NGS92, the following command should be used:
python3 database_query_tool.py --run sources/NGS91 sources/NGS92
Additional information on how to use the script can be accessed by using:
python3 database_query_tool.py --help
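Internally, the query functions in query_engine.py amount to joins across the custom tables. The following is only a sketch of that idea, using the hypothetical class and column names from the earlier model sketch rather than the real ones.

```python
# Sketch of the kind of query the query engine runs: join sequences to their
# samples and patients for a given NGS run and save the result as a CSV file.
# Class and column names are the hypothetical ones from the earlier model sketch.
import pandas as pd
from sqlalchemy import create_engine, select

from database_model.hcv_glue_avu_sqla_model import Patient, Sample, Sequence  # hypothetical names

engine = create_engine("mysql+pymysql://gluetools:glue12345@localhost/GLUE_TOOLS")

stmt = (
    select(Sequence.id, Sample.id.label("sample_id"), Patient.id.label("patient_id"))
    .join(Sample, Sequence.sample_id == Sample.id)
    .join(Patient, Sample.patient_id == Patient.id)
    .where(Sequence.id.like("%NGS91%"))                    # run id is part of the sequence id
)

with engine.connect() as conn:
    rows = conn.execute(stmt).all()

pd.DataFrame(rows, columns=["sequence_id", "sample_id", "patient_id"]).to_csv(
    "logs/queries/NGS91_query.csv", index=False
)
```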