Filtering implementation epic #305

Open
1 of 3 tasks
Adoni5 opened this issue Oct 17, 2023 · 16 comments
Labels: enhancement (New feature or request) · epic (Feature being added to readfish) · needs discussion (A topic/feature that needs discussion from maintainers and users)
@Adoni5
Contributor

Adoni5 commented Oct 17, 2023

Feature being added - Filtering!

Filter reads and alignments based on a flexible "mini language" specified in the TOML.
This would go into the config TOML under the respective section for the thing being filtered, i.e. caller_settings/mapper_settings.
Applied in the tight targets loop.

# for base calling
filter = [
  "metadata.sequence_length > 0",
  "metadata.sequence_length < 1000",
]

# for alignment
filter = [
  "is_primary",
  "mapq > 40",
  "strand == -1",
]

This is parsed into magic Enums and Classes in _filter.py

chunks = read_until_client.get_read_batch(...)
filtered_calls, calls = partition(basecall_filter, basecall(chunks))
filtered_aligns, aligns = partition(alignment_filter, align(calls))

for result in aligns:
    print("boo these alignments are trash")

for filtered_item in filtered_calls + filtered_aligns:
    print("Woohoo we have great success in filtering")

I suppose we would store these on the respective classes? _PluginModule or something I forget

Ideas

  • Extend language to startsWith/endsWith
  • and/or/not logical operators

Issues that need resolving/clarification

  • VERY footgunny - for example sequence.metadata.length < 0, mapq < 0 and goodbye all reads. How can we safeguard against this? I suggest maybe starting only with PAF, and maybe adding some checks in validation.
  • Where do we add the tracking of filtering status? Do we add it directly to the Result object, or straight into the plugin basecall/map_reads methods (which would involve writing separate implementations for new plugins) and have the plugin return two Iterables, one of reads/Results instances that passed and one that failed?
  • What do we do with Results that fail validations, unblock or proceed?
    • Fails basecalling filtering
    • Fails alignment filtering
  • Do we add a fails_validation to the toml/Conditions section, which defaults to proceed? This then relies on the exceeded max chunk behaviour
  • How and where do we log this?
  • What will it look like/where will it be placed in the config?
  • What will the API between targets and plug-ins look like?
  • How will we ensure that targets doesn’t miss any data?

Tasks

  1. Stale work-item (assigned: Adoni5)
@Adoni5 Adoni5 added the epic Feature being added to readfish label Oct 17, 2023
@Adoni5 Adoni5 self-assigned this Oct 17, 2023
@alexomics alexomics added the needs discussion A topic/feature that needs discussion from maintainers and users label Oct 19, 2023
@mattloose
Contributor

VERY footgunny - for example sequence.metadata.length < 0, mapq < 0 and goodbye all reads. How can we safeguard against this? I suggest maybe starting only with PAF, and maybe adding some checks in validation.

For this, I suggest we define acceptable filters for each component and disallow footguns where we can think of them. We will need good documentation for each filter.

@mattloose
Contributor

Where do we add the tracking of filtering status? Do we add it directly to the Result object, or straight into the plugin basecall/map_reads methods (which would involve writing separate implementations for new plugins) and have the plugin return two Iterables, one of reads/Results instances that passed and one that failed?

I think the idea is that we have a flag on the results object that tracks if the result has failed a filter? We might need to think about how we track which filters failed?
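
A minimal sketch of what that might look like on the Result object (the field names are assumptions, not the current readfish definition):

# Hypothetical fields on the Result object; names are assumptions
from dataclasses import dataclass, field

@dataclass
class Result:
    read_id: str
    seq: str = ""
    passed_filters: bool = True                                # set False when any filter line fails
    failed_filters: list[str] = field(default_factory=list)    # e.g. ["mapq > 40"], so the reason can be logged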

@mattloose
Contributor

mattloose commented Oct 19, 2023

What do we do with Results that fail validations, unblock or proceed?
Fails basecalling filtering
Fails alignment filtering

This is complex.

In my view each filter line should be independent - i.e. if you have:

# for alignment
filter = [
  "is_primary",
  "mapq > 40",
  "strand == -1",
]

failing on any one of these conditions would result in the alignment being nulled. In this state the molecule would be treated as a no_map and sequencing would proceed as specified in the toml for a no_map read.

One can also imagine the possibility of wanting to take a specific action when a filter evaluates to false:

# for alignment
filter = [
  "mapq > 40:stop_receiving,unblock",
  "strand == -1",
]

In this case, a read would stop_receiving if mapq > 40 evaluated true, and unblock if false. If the strand was +1 it would be treated as no_map.
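
A rough sketch of how that "condition:action_if_true,action_if_false" form could be parsed (the default actions and the string handling are assumptions):

# Hypothetical parser for the extended syntax; defaults are assumptions
def parse_filter_line(line):
    """Split a filter line into (condition, action_if_true, action_if_false)."""
    condition, _, actions = line.partition(":")
    if not actions:
        # plain condition: keep the alignment if it passes, treat as no_map if it fails
        return condition.strip(), "proceed", "no_map"
    action_true, _, action_false = actions.partition(",")
    return condition.strip(), action_true.strip(), action_false.strip() or "no_map"

# parse_filter_line("mapq > 40:stop_receiving,unblock")
# -> ("mapq > 40", "stop_receiving", "unblock")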

This is a fairly complex filtering language and would need to be well documented with some use cases.

Note this would also address:

Do we add a fails_validation to the toml/Conditions section, which defaults to proceed? This then relies on the exceeded max chunk behaviour

@mattloose
Contributor

How and where do we log this?

Perhaps this should go in the chunk log (as was) with a tag concept à la SAM files?

@mattloose
Contributor

What will it look like/where will it be placed in the config?

I think this will have to be a property of a region? I can envision scenarios with barcoded samples where you want different filters depending on sample type.

@mattloose
Contributor

What will the API between targets and plug-ins look like?

In my view, available filters for a plugin should be defined and whitelisted by the plugin. Any filter passed to a plugin that is not explicitly defined by the plugin should be ignored. On validation of a TOML, any such filters should be flagged to the user.

In this case, a plugin can be valid with no defined filters. Therefore filter availability is optional.
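
As a sketch, that could be as simple as an optional attribute on the plugin that TOML validation checks against (the attribute and function names here are assumptions):

# Hypothetical plugin-side whitelist; attribute/function names are assumptions
class MyAligner:
    # empty or missing means the plugin offers no filters at all
    SUPPORTED_FILTERS = {"mapq", "is_primary", "strand"}

def validate_filters(plugin, filter_lines, logger):
    """Warn about, and drop, any filter the plugin does not explicitly define."""
    supported = getattr(plugin, "SUPPORTED_FILTERS", set())
    kept = []
    for line in filter_lines:
        attribute = line.split()[0].split(".")[-1]  # crude: first token, last dotted component
        if attribute in supported:
            kept.append(line)
        else:
            logger.warning("Ignoring unsupported filter %r for %s", line, type(plugin).__name__)
    return kept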

@mattloose
Contributor

How will we ensure that targets doesn’t miss any data?

I don't quite understand this - I think targets will see every result regardless of filtering if we follow the above suggestions.

@mattloose
Contributor

I think any read which has been filtered out should be labelled as "filtered" in the chunks log. This is different to an action_overridden (or whatever). In the context of action overridden, a decision was made about a specific read which matches the toml but has been changed for a reason outside the toml spec (e.g. first read or duplex).

If a read has been filtered, the behaviour is specified by the toml so readfish has not overridden what the toml asked for - it has done as it should have done.

@Adoni5
Contributor Author

Adoni5 commented Oct 25, 2023

After chatting with @alexomics on Friday - we came to a couple of conclusions, decided to write them up here after a few days to give me time to forget some of them.

  1. The concept of filtering in its initial form will do two things:
    • If a Base-called read doesn't meet a filtering threshold, it should be excluded from alignment. This will mean that filtering can increase the speed of the loop.
    • If an alignment fails filtering it should be excluded from decision making. This is NOT the same as overriding an action.
  2. For alignments the filtering step should be applied for each alignment in the case of multiple alignments. If ANY alignment passes, that alignment moves on.
  3. We shall track the filtering status on the Results object, with a bool.
  4. We perform the filtering outside of the Plugin but inside the loop. This would mean the targets.py workflow would look something like
signal -> basecalling(signal) -> filter_basecalls(calls) -> align(filtered_calls)
       -> filter_alignments(alignments) -> decide(filtered_alignments)
  5. No guardrails - too much work and rigidity. If you want to shoot yourself in the foot, go ahead.
  6. No region-based filtering at this point.
  7. Add a new filters table to the config TOML. This would then have alignment and basecalling sections, and would be accessed as config.filters in the code base (a rough sketch is below).
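
A possible layout for that table, purely as a sketch (the section and key names are assumptions, not a final schema):

# Hypothetical layout for the new filters table; names are assumptions
[filters.basecalling]
filter = [
  "metadata.sequence_length > 200",
]

[filters.alignment]
filter = [
  "is_primary",
  "mapq > 40",
]

In code this would then be reachable as config.filters.basecalling / config.filters.alignment.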

Think that's it for now!

@mattloose
Contributor

I agree with most of the above - 1-3 are all good.

4 - I'm not convinced about where filtering should be performed. Could you expand on the reasoning for filtering in targets rather than the plugin?

With respect to 5-7 I broadly agree, but I wonder if there should be some suggestions or guidance about foot guns?

@Adoni5
Contributor Author

Adoni5 commented Oct 25, 2023

With respect to 5-7 I broadly agree, but I wonder if there should be some suggestions or guidance about foot guns?

I think we should make it clear that this is experimental and that no verification on the actual filtering conditions will be performed, and state clearly that if a user doesn't double check their filtering statements, they could ruin their experiment, much in the same way we disclaim using readfish at all!

If we filter in targets rather than the plugin, it means we don't have to:

  • Find a way to access the filters in the Plugins.
  • Work out whether we return separate iterables of Results, etc. - we also ensure we don't miss any read chunks as they go through the pipeline, so we can easily make sure they show up in the debug_log.
  • Maintain it inside each plugin - we can keep it as a separate module, as it's experimental.

One thing it would be very useful to do now is to come up with some key, useful use cases that we can use as tests during development!

That way we can not lose sight of why we are doing this. Do you have any ideas @mattloose, @alexomics ?

I'm thinking:
* mapq > 40 - Reason: only proceed on confident mappings
* sequence.metadata.len > 300 - Reason: Confident of no edge effects

But I can't think of any others off the top of my head

@mattloose
Contributor

I think we should make it clear that this is experimental and that no verification on the actual filtering conditions will be performed, and state clearly that if a user doesn't double check their filtering statements, they could ruin their experiment, much in the same way we disclaim using readfish at all!

I do agree - but I thought we had considered adding a whitelist for acceptable filters? However - all this is definitely at the user's own risk!

I think your examples are good - I will have a think about more of them....

@Adoni5
Contributor Author

Adoni5 commented Oct 25, 2023

The reason I don't want to add a whitelist of acceptable filters is that we will end up having to maintain that list and edit it. If Nanopore changes the layout of the base-call results returned from dorado v4.3, filtering base-calls will work for v4.2 but not v4.3, and we will then have to manually edit our whitelist. But then do we have two sets? One for v4.3 and one for v4.2?

Whilst letting anyone filter anything does have the added danger of people making mistakes or doing something silly, I think that the amount of work that comes with a manual whitelist simply isn't worth it.

@mattloose
Contributor

Suggested filters:

  • mapq
  • primary/secondary?
  • sequence length
  • sequence quality (only analyse reads whose basecalling quality is above a certain threshold?)
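
For reference, these could be expressed in the mini-language roughly as below; the attribute names are assumptions about what the basecaller and aligner results expose:

# for alignment (attribute names are assumptions)
filter = [
  "mapq > 40",
  "is_primary",
]

# for base calling (attribute names are assumptions)
filter = [
  "metadata.sequence_length > 300",
  "metadata.mean_qscore > 9",
]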

@mattloose
Contributor

I think we should consider having filters at the region level - if there is a filter for a region, you ignore the global filter. If there isn't a region filter, you apply the global filter.

This would allow testing of filter types on a single flowcell through different regions - e.g. mapQual > 0, > 20, > 40 all on one flowcell.
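
A hypothetical example of that region-level override (the keys and the [[regions]] integration are assumptions):

# Hypothetical region-level override; keys are assumptions
[filters.alignment]
filter = ["mapq > 0"]        # global fallback

[[regions]]
name = "mapq_20"
filter = ["mapq > 20"]       # this region ignores the global filter

[[regions]]
name = "mapq_40"
filter = ["mapq > 40"]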

@Adoni5 Adoni5 added the enhancement New feature or request label Nov 28, 2023