v0.2.4
What's Changed
- slicing and check contigs by @zktuong in #159
- add new functions and rework github actions by @zktuong in #161
New features
slicing functionality
- the `Dandelion` object can now be sliced like an `AnnData` or a pandas `DataFrame`!

      vdj[vdj.data['productive'] == 'T']
      Dandelion class object with n_obs = 38 and n_contigs = 94
          data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
          metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

      vdj[vdj.metadata['productive_VDJ'] == 'T']
      Dandelion class object with n_obs = 17 and n_contigs = 36
          data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
          metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

      vdj[vdj.metadata_names.isin(['cell1', 'cell2', 'cell3', 'cell4', 'cell5'])]
      Dandelion class object with n_obs = 5 and n_contigs = 20
          data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
          metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

      vdj[vdj.data_names.isin(['contig1','contig2','contig3','contig4','contig5'])]
      Dandelion class object with n_obs = 2 and n_contigs = 5
          data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
          metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

  - not sure implementing it like `adata[:, adata.var.something]` makes sense, as it's not really row information in the data slot?
  - also, the base slot in `Dandelion` is `.data`, and it doesn't make sense for `.metadata` to be the 'row'
  - maybe scverse/scirpy#327 can come up with a better strategy and I can adopt that later on.
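
To make the behaviour in the examples above concrete, here is a minimal sketch of how one boolean mask can subset both the contig table and the cell table. `ToyVDJ` and its mask format are hypothetical stand-ins written for illustration; this is not how `Dandelion` is implemented, only a toy with the same two-table shape.

```python
class ToyVDJ:
    """Hypothetical stand-in: contig rows in .data, cell rows in .metadata."""

    def __init__(self, data, metadata):
        self.data = data          # {contig_id: {"cell_id": ..., ...}}
        self.metadata = metadata  # {cell_id: {...}}

    def __getitem__(self, mask):
        # mask maps ids to booleans; its keys decide which table it addresses
        keep = {key for key, flag in mask.items() if flag}
        if set(mask) <= set(self.data):
            # contig-level mask: keep contigs, then the cells they belong to
            data = {c: r for c, r in self.data.items() if c in keep}
            cells = {r["cell_id"] for r in data.values()}
            metadata = {k: r for k, r in self.metadata.items() if k in cells}
        elif set(mask) <= set(self.metadata):
            # cell-level mask: keep cells, then every contig of those cells
            metadata = {k: r for k, r in self.metadata.items() if k in keep}
            data = {c: r for c, r in self.data.items() if r["cell_id"] in keep}
        else:
            raise KeyError("mask ids match neither contigs nor cells")
        return ToyVDJ(data, metadata)


vdj = ToyVDJ(
    data={
        "c1": {"cell_id": "cell1", "productive": "T"},
        "c2": {"cell_id": "cell1", "productive": "F"},
        "c3": {"cell_id": "cell2", "productive": "F"},
    },
    metadata={
        "cell1": {"productive_VDJ": "T"},
        "cell2": {"productive_VDJ": "F"},
    },
)

by_contig = vdj[{c: r["productive"] == "T" for c, r in vdj.data.items()}]
by_cell = vdj[{k: r["productive_VDJ"] == "T" for k, r in vdj.metadata.items()}]
```

In this toy, slicing on contigs drops cells that lose all their contigs, and slicing on cells keeps every contig of the retained cells, which is why both `n_obs` and `n_contigs` change in the outputs shown above.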
`ddl.pp.check_contigs`
- created a new function `ddl.pp.check_contigs` as a way to just check if contigs are ambiguous, rather than outright removing them. I envisage that this will eventually replace `simple` mode in `ddl.pp.filter_contigs`.
  - new column in `.data`: `ambiguous`, T/F to indicate whether a contig is considered ambiguous or not (different from cell-level ambiguous).
  - the `.metadata` and several other functions ignore any contigs marked as `T` to maintain the same behaviour
  - The largest difference between `ddl.pp.check_contigs` and `ddl.pp.filter_contigs` is that with `check_contigs` the onus is on the user to remove any 'bad' cells from the GEX data (illustrated in the tutorial), whereas this happens semi-automatically with `filter_contigs`.
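
The flag-instead-of-filter idea can be sketched in a few lines. Everything here is illustrative: `flag_contigs`, `usable` and especially the `mixed_locus` rule are hypothetical and are not the criteria `ddl.pp.check_contigs` applies; the point is only that flagged rows stay in the table and downstream steps skip them.

```python
def flag_contigs(contigs, is_ambiguous):
    """Return copies of the contig records with an 'ambiguous' T/F field."""
    return [
        {**contig, "ambiguous": "T" if is_ambiguous(contig) else "F"}
        for contig in contigs
    ]


def usable(contigs):
    """Contigs a downstream summary would still consider (flag == 'F')."""
    return [contig for contig in contigs if contig["ambiguous"] == "F"]


def mixed_locus(contig):
    # Placeholder rule for illustration only: V and J calls from different loci.
    return contig["v_call"][:3] != contig["j_call"][:3]


contigs = [
    {"sequence_id": "contig1", "cell_id": "cell1",
     "v_call": "IGHV1-2", "j_call": "IGHJ4"},
    {"sequence_id": "contig2", "cell_id": "cell1",
     "v_call": "IGKV1-5", "j_call": "IGHJ4"},
]

flagged = flag_contigs(contigs, mixed_locus)   # nothing is removed
kept = usable(flagged)                         # ambiguous rows are skipped
```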
`ddl.update_metadata` now comes with a 'by_celltype' option
- This brings a new feature: B cell, alpha-beta T cell and gamma-delta T cell associated columns for the V, D, J, C and productive columns!
  - this is achieved through a new `.retrieve_celltype` subfunction in the `Query` class, which breaks up the retrieval into the 3 major groups if `by_celltype = True`.
  - No longer the need to guess which belongs to which, and it allows for easy slicing! This does cause a bit of `.obs` bloating.
  - Which leads to the removal of `constant_status_VDJ`, `constant_status_VJ`, `productive_status_VDJ`, `productive_status_VJ`, as the metadata is getting bloated with the slight rework of the Dandelion metadata slot to account for the new B/abT/gdT columns
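
As a rough illustration of how per-group columns such as `v_call_B_VDJ` or `v_call_abT_VJ` can be derived, the sketch below routes each contig by its locus. `split_by_celltype`, the `LOCUS_GROUPS` table and the `|` separator for multiple calls are assumptions made for this example; they are not taken from `.retrieve_celltype`.

```python
# Locus -> (cell type group, chain type); standard IG/TR locus names.
LOCUS_GROUPS = {
    "IGH": ("B", "VDJ"), "IGK": ("B", "VJ"), "IGL": ("B", "VJ"),
    "TRB": ("abT", "VDJ"), "TRA": ("abT", "VJ"),
    "TRD": ("gdT", "VDJ"), "TRG": ("gdT", "VJ"),
}


def split_by_celltype(contigs, field):
    """Collect one contig field per cell into <field>_<group>_<chain> columns."""
    cells = {}
    for contig in contigs:
        group, chain = LOCUS_GROUPS[contig["locus"]]
        column = f"{field}_{group}_{chain}"
        cells.setdefault(contig["cell_id"], {}).setdefault(column, []).append(
            contig[field]
        )
    # Assumption: several calls landing in one column are joined with '|'.
    return {
        cell: {column: "|".join(values) for column, values in columns.items()}
        for cell, columns in cells.items()
    }


contigs = [
    {"cell_id": "cell1", "locus": "IGH", "v_call": "IGHV3-23"},
    {"cell_id": "cell1", "locus": "IGK", "v_call": "IGKV1-5"},
    {"cell_id": "cell2", "locus": "TRB", "v_call": "TRBV7-2"},
    {"cell_id": "cell2", "locus": "TRG", "v_call": "TRGV9"},
]

columns = split_by_celltype(contigs, "v_call")
```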
`tl.productive_ratio`
- Calculates a cell-level representation of productive vs non-productive contigs.
  - Plotting is achieved through `pl.productive_ratio`
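
One way to picture a cell-level productive vs non-productive summary is the fraction of a cell's contigs marked `T`. The helper below is a hypothetical sketch under that assumption; the exact quantity and grouping that `tl.productive_ratio` reports may differ.

```python
from collections import Counter


def productive_fraction(contigs):
    """Per cell, the share of contigs whose 'productive' field is 'T'."""
    per_cell = {}
    for contig in contigs:
        per_cell.setdefault(contig["cell_id"], Counter())[contig["productive"]] += 1
    return {
        cell: counts["T"] / sum(counts.values())
        for cell, counts in per_cell.items()
    }


contigs = [
    {"cell_id": "cell1", "productive": "T"},
    {"cell_id": "cell1", "productive": "T"},
    {"cell_id": "cell1", "productive": "F"},
    {"cell_id": "cell2", "productive": "F"},
]

fractions = productive_fraction(contigs)
```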
`tl.vj_usage_pca`
- Computes PCA on a cell-level representation of V/J gene usage across designated groupings
  - uses `scanpy.pp.pca` internally
  - Plotting can be achieved through `scanpy.pl.pca`
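
The input to such a PCA is a groups-by-genes usage matrix. The sketch below covers only that first step, in plain Python; `vj_usage` and the per-group frequency normalisation are assumptions for illustration, and the PCA itself (done with `scanpy.pp.pca` in the release) is left out.

```python
from collections import Counter


def vj_usage(cells, group_key):
    """Build a groups x genes table of V/J usage frequencies.

    Each frequency is the number of cells in the group using the gene,
    divided by the number of cells in the group (an assumed normalisation).
    """
    gene_counts, group_sizes = {}, Counter()
    for cell in cells:
        group = cell[group_key]
        group_sizes[group] += 1
        row = gene_counts.setdefault(group, Counter())
        row[cell["v_call"]] += 1
        row[cell["j_call"]] += 1
    groups = sorted(gene_counts)
    genes = sorted({gene for row in gene_counts.values() for gene in row})
    matrix = [
        [gene_counts[group][gene] / group_sizes[group] for gene in genes]
        for group in groups
    ]
    return groups, genes, matrix


cells = [
    {"sample": "A", "v_call": "IGHV1", "j_call": "IGHJ4"},
    {"sample": "A", "v_call": "IGHV3", "j_call": "IGHJ4"},
    {"sample": "B", "v_call": "IGHV3", "j_call": "IGHJ6"},
]

groups, genes, matrix = vj_usage(cells, "sample")
```

Each row of `matrix` is one observation for the PCA, so the resulting embedding has one point per grouping rather than one per cell.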
bug fixes
- fix cell ordering issue scverse/scirpy#347
- small refactor of `ddl.pp.filter_contigs`
- moved some of the repetitive loops into callable functions
- deprecated the `filter_vj_chains` argument and replaced it with `filter_extra_vdj_chains` and `filter_extra_vj_chains` to hopefully enable more interpretable behaviour. fixes #158
- umi adjustment step was buggy but I have now made the behaviour consistent with how it functions in `ddl.pp.check_contigs`
- `rearrangement_status_VDJ` and `rearrangement_status_VJ` (renamed from `rearrangement_VDJ_status` and `rearrangement_VJ_status`) now give a single value for whether a chimeric rearrangement occurred, e.g. TRDV pairing with TRAJ and TRAC as in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267242/
- fixed issues with progress bars getting out of hand
- fixed issue with `ddl.tl.find_clones` crashing if more than 1 type of locus is found in the data.
  - now a `B`, `abT` or `gdT` prefix will be added to BCR/TR-ab/TR-gd clones.
- check_contigs, find_clones and define_clones were removing non-productive contigs even though there's no need to. May cause issues with filter_contigs... but that's a problem for next time.
- fix issue with min_size in network not behaving as intended. Switched to using connected components to find which nodes to trim
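
For the `min_size` fix, trimming by connected components can be sketched as below. This is a plain-Python illustration, not the library's code: `connected_components` and `nodes_to_trim` are hypothetical helpers, and it assumes that every node in a component with fewer than `min_size` nodes gets trimmed.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first walk of one component
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


def nodes_to_trim(nodes, edges, min_size):
    """Nodes belonging to components smaller than min_size (assumed rule)."""
    return {
        node
        for component in connected_components(nodes, edges)
        if len(component) < min_size
        for node in component
    }


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

trim_3 = nodes_to_trim(nodes, edges, min_size=3)
trim_2 = nodes_to_trim(nodes, edges, min_size=2)
```

Working on whole components avoids the case where a node with few direct neighbours is dropped even though it sits inside a large connected cluster.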
other changes
- new column `chain_status`, to summarise the reworked `locus_status` column.
  - Should contain values like `ambiguous`, `Orphan VDJ`, `Single pair` etc, similar to `chain_pairing` in scirpy.
- Also fixed the ordering of metadata to make it more presentable, instead of just randomly slotting into the data frame.
- `ddl.concat` now allows for custom suffix/prefix - only operates on `sequence_id`
- remove `.edges` from Dandelion class because this doesn't get used anywhere and it's also stored in the `networkx` graphs
- minimum spanning tree construction performed using `networkx` directly so that I don't have to keep changing the adjacency matrices from `pandas` to `networkx` back and forth
- clean up documentation slightly
Full Changelog: v0.2.2...v0.2.4