Implement SBOL sequence feature glyphs #64

wilkox · 2023-07-11T07:19:43Z

Implement the Synthetic Biology Open Standard (SBOL) sequence feature glyphs. The current rough plan for this is:

Add a geom_<glyph>() function for each sequence feature glyph
geom_gene_arrow() becomes a wrapper for geom_CDS() and soft deprecated
geom_subgene_arrow() becomes a wrapper for geom_polypeptide_region() and soft deprecated
geom_feature() becomes a convenience geom that wraps all the glyph geoms and accepts a type aesthetic. This allows a user to draw different types of sequence feature with a single layer, rather than having to add a new geom layer for each type, which could get tedious. This function will lack the flexibility of having different aesthetic mappings for different glyphs, as well as fine-tune control of glyph geometry, but it will still probably cover a large proportion of use cases. To maintain backward compatibility, if the type aesthetic is not mapped, it should draw promoters or locations with the current geom_feature() interface, but this functionality will be soft-deprecated
Continue the pattern of using the xmin, xmax and forward aesthetics to control the direction of directional glyphs
Continue the pattern of geom_gene_arrow()/geom_gene_label() by having a separate geom_<glyph>_label() for each glyph (geom_CDS_label(), geom_intron_label() etc.). This is necessary to preserve the ggplot2 grammar, as a user might want to use different aesthetic mappings for glyphs and labels. geom_gene_label(), geom_feature_label(), and geom_subgene_label() would become wrappers and soft deprecated
Each of the geom_<glyph>() layers will also accept a label aesthetic which draws a text label for the feature with sensible defaults

I've opened a SBOL_glyphs branch to work on this, and added geom_aptamer() and geom_aptamer_label() as a starting point:

library(tidyverse)
library(gggenes)
aptamers <- data.frame(molecule = c("Genome1", "Genome1", "Genome1", "Genome2", "Genome2", "Genome2"), location = c(50, 71, 13, 8, 12, 91), name = paste0("Apt", 1:6))
ggplot(aptamers, aes(x = location, y = molecule, label = name)) +
  geom_aptamer(inherit.aes = TRUE, height = grid::unit(10, "mm")) +
  geom_aptamer_label(inherit.aes = TRUE, height = grid::unit(10, "mm"))

^{Created on 2023-07-11 with reprex v2.0.2}

To get the coordinates for the aptamer glyph, I downloaded the SVG files for the glyphs from the latest SBOL release then converted them to grid-compatible coordinates with the svgparser package:

aptamer <- read_svg("~/Downloads/glyphs-svg/aptamer.svg", obj_type = "data.frame")

This has the pleasing benefit that paths expressed as Bézier curves in the SVG are automatically converted into a series of short line segments, which sidesteps the trouble of transforming Béziers into polar coordinates. I think this method of extracting the glyph coordinates from the SVG assets should be the rule, though no doubt there will be some exceptions where this is not be best choice.

The text was updated successfully, but these errors were encountered:

wilkox · 2023-07-11T07:56:36Z

ccb1482 introduced a new internal approach to drawing the same geom in multiple coordinate systems which eliminates the need to have separate functions or makeContent methods for each coordinate system. The 'standard' process for defining how a glyph geom is drawn should be:

In the geom function, the data are transformed into grid viewport coordinates with the data_to_grid() function. This introduces some coordinate system-agnostic terms:
- 'along' refers to displacement along the molecular backbone; this is x in normal Cartesian coordinates, y in flipped Cartesian coordinates, and theta in polar coordinates. 'alongness' means coordinate system-agnostic 'width'
- 'away' refers to displacement perpendicular from the molecular backbone; this is x in normal Cartesian coordinates, y in flipped Cartesian coordinates, and r in polar coordinates. 'awayness' means coordinate system-agnostic 'height'
The geometry of the glyph is then defined in terms of along and away coordinates. In places where this involves a fixed grid::unit() distance, this distance is converted into the appropriate along or away value with unit_to_alaw()
For polar coordinate systems only, the glyph geometry is segmented with segment_polygon() or segment_polargon(): long lines or curves are segmented into a large number of small lines. This is what allows e.g. the top and bottom edges of the arrow in geom_gene_arrow() to become smooth circle segments in polar coordinates.
Finally, the along/away coordinates are converted into absolute grid viewport coordinates (x/y) with alaw_to_grid(), and the geom is drawn with either grid::polygonGrob() (for closed glyphs), grid::polylineGrob() (for open glyphs), or a grid::gTree() containing some mix of the two.

For label geoms, the process is slightly different, because makecontent.fittexttree() from ggfittext handles the final transformation into the appropriate coordinate system:

First, as with glyphs, the data are transformed into grid viewport coordinates with data_to_grid()
The geometry of the label bounding box is defined in along/away coordinates
The along/away columns in the data frame are then renamed to whatever combination of xmin/xmax/ymin/ymax is appropriate to the plot coordinate system, and the data frame is handed off to make content.fittexttree() to be drawn. This step could probably be captured in a utility function as it is likely to be generic for glyph labels.

zdk123 · 2023-07-12T01:11:24Z

Should features still have xmin/xmax to indicate reverse complements? For SBOL glyphs, features on the reverse strand should be below the origin?

In #62, you introduced the 'forward' aesthetic - maybe this should be 'direction' instead and accept forward/reverse as inputs.

wilkox · 2023-07-23T03:02:21Z

Good question. I think there are a few interlocking issues here.

The `forward` aesthetic

There are two main issues that I see with the forward aesthetic in its current form. The first is that it conflates the concept of the forward/reverse strand with a forward/reverse 'direction', and conflates three different 'directions' into one:

In the 5' → 3' direction on the forward strand
Towards the right side of the plot (or towards the top for flipped coordinates, or clockwise for polar coordinates)
In the direction of increasing x-axis values

It's a fairly safe assumption that 1. and 2. are the same thing, because by convention molecules are drawn such that the forward strand runs 5' → 3' left to right. The SBOL Visual specification section 5.2 says:

Nucleic acid features in a sequential relationship SHOULD be drawn from 5’ left to 3’ right on the inline strand and from 5’ right to 3’ left on the reverse complement strand. In terms of the SBOL 3 data model, this indicates a Constraint on the relative ordering of two features.

So, I think it's OK to leave it up to the user to manually change things if this assumption is not correct.

(1./2.) and 3. are usually but not always the same thing, and when this assumption is wrong (e.g. when scale_x_reverse() is used) this is not handled consistently by gggenes. With scale_x_reverse(), geom_gene_arrow() geoms will flip horizontally, but oriented geom_feature()s will not. I am happy to leave this as an open problem for now as it seems like an uncommon use case, but I think ultimately the correct behaviour is for forward-strand features to always point to the right (or top, or clockwise) regardless of the order of the x-axis.

The second main issue with forward is that I bungled the implementation such that it doesn't mean the same thing in different geoms. For geom_gene_arrow(), forward == TRUE just flips the direction from that implied by xmin/xmax, which means it could potentially mean 'on the reverse strand' if xmin > xmax. This is different from the behaviour with geom_feature(), where forward == TRUE does always mean 'on the forward strand'. Intuitively it seems that forward == TRUE should always mean 'on the forward strand' regardless of the xmin/xmax values.

Glyphs where the meaning of 'forward' is not obvious

For sequence features that are tightly associated with transcription (e.g. CDS, primer binding site, promoter), it is obvious what 'forward' means, i.e. on the forward strand and oriented towards the 3' end. For overhang sites and sticky-ended restriction sites this is not so clear. My feeling is that the least confusing way to do this is to have forward == TRUE mean 'with a 3' overhang' (i.e. 'with a forward strand overhang') for these glyphs, and to make sure this is documented well for these geoms.

Varients of reverse strand glyphs

The SBOL visual specification section 5.2 point 2 provides for variants for some glyphs associated with the reverse strand:

The recommendation is that these glyphs should be flipped both horizontally and vertically. gggenes currently draws reverse-oriented promoters (geom_feature()) by flipping horizontally only, and does not accept any forward aesthetic at all for terminators. Both geom_feature() and geom_terminator() can be used to draw a vertically flipped glyph by setting feature_height or terminator_height respectively to a negative value, but this is clunky and turns what should properly be an aesthetic into a geometry setting.

There is an additional problem here if gggenes implements double-stranded backbones in future: the baseline of the glyph would be slightly offset from the ggplot2 panel.grid.major.y line.

My proposed solution

Consistent with the SBOL specification, the forward strand is defined as the strand that:
- Is always the strand drawn when a single-stranded backbone is drawn
- Runs from from 5’ on the left to 3’ on the right (or bottom to top in flipped coordinates, or from the 12 o'clock position clockwise in polar coordinates)
- When a double-stranded backbone is drawn, is always drawn above (flipped coordinates: to the right of; polar coordinates: radial outwards from) the reverse strand
forward == TRUE should consistently mean 'on the forward strand' in the new geoms, and should override the strand inferred from xmin/xmax for those geoms that accept xmin/xmax. Mutatis mutandis for forward == FALSE and 'on the reverse strand'. Features that only accept x are assumed to be on the forward strand unless otherwise specified with forward == FALSE. This introduces an unfortunate discrepancy between the old geom_<gene|subgene>_arrow() and new SBOL glyph geoms until the former are fully deprecated, but ultimately I think this is the least worst solution.¹
When forward is not mapped, geoms that accept xmin/xmax should infer the strand from xmin/xmax. When xmin < xmax, the feature is inferred to be on the forward strand, and when xmin > xmax it is inferred to be on the reverse strand. In the future it would be useful to implement detection of a flipped x-axis and to swap the inferred stand in this situation, but this doesn't need to happen now
For glyphs with directional overhangs (overhang sites and sticky-ended restriction sites), forward == TRUE means 'with a 3’ overhang' or 'with the overhang on the forward strand', and mutatis mutandis for forward == FALSE.
The glyph for any geom on the reverse strand should be by default flipped both horizontally and vertically from the forward-strand variant
For appropriate glyphs (e.g. for geom_promoter()), geoms should accept a variant argument to specify a variant form of the glyph. This argument should accept an atomic character value. Initially the only supported variant will be 'reverse_above', which indicates that features on the reverse strand should be drawn horizontally but not vertically flipped (the 'MAY' version of the promotor glyph in Figure 10 above). Later this variant argument may be expanded to support other variant glyphs, for example glyphs offset from the panel.grid.major.y line if double-stranded backbones are added in a future version
The *_height arguments to the new geoms should return an error if given negative values
All documentation should refer to the forward and reverse strands, rather than forward and reverse 'orientation' or 'direction'

¹ I've considered renaming the forward aesthetic in the new glyphs to provide a clean break from the old terminology, but I just can't come up with a good alternative name. 'Sense' and 'coding' don't work, as the forward strand won't always be the sense or coding strand for different CDSs. The SBOL specification unfortunately does not require nor use consistent terminology; I found the names 'forward strand', '+ strand', and 'inline strand' used interchangeably in the specification. '+' or 'plus' are both syntactically and semantically confusing, and 'inline' seems to be entirely an invention of SBOL that nobody else uses. The best I have come up with so far is 'Watson' (as in 'the Watson strand') which introduces a problematic capital letter and some even more problematic political issues.

wilkox added the enhancement label Jul 11, 2023

wilkox mentioned this issue Jul 11, 2023

Terminator point feature #62

Closed

wilkox mentioned this issue Dec 6, 2023

would be nice for protein people #70

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement SBOL sequence feature glyphs #64

Implement SBOL sequence feature glyphs #64

wilkox commented Jul 11, 2023

wilkox commented Jul 11, 2023 •

edited

zdk123 commented Jul 12, 2023 •

edited

wilkox commented Jul 23, 2023

Implement SBOL sequence feature glyphs #64

Implement SBOL sequence feature glyphs #64

Comments

wilkox commented Jul 11, 2023

wilkox commented Jul 11, 2023 • edited

zdk123 commented Jul 12, 2023 • edited

wilkox commented Jul 23, 2023

The forward aesthetic

Glyphs where the meaning of 'forward' is not obvious

Varients of reverse strand glyphs

My proposed solution

wilkox commented Jul 11, 2023 •

edited

zdk123 commented Jul 12, 2023 •

edited

The `forward` aesthetic