Plot does not show outliers at most zoomed-out level #234

Open
vinay-hebb opened this issue Jul 7, 2023 · 3 comments

@vinay-hebb

This may be a noob question.

Setup: I am using plotly-resampler with dynamic aggregation.
Requirement: I want to see outliers at the zoomed-out level, and I want to zoom in to visualize the data (without resampling) and understand its nuances.
Questions:

  1. Will Dash Plotly always show "outliers" at the most zoomed-out level?
  2. Is this even possible with this setup?

Problem:
Zoomed-out screenshot
image

Zoomed-in screenshot
image

The green circled points appear only after zooming in. Is it possible to see them at the zoomed-out level as well? Since they are missing at the zoomed-out level, I might miss them in my data analysis.

@vinay-hebb vinay-hebb changed the title Missing Outliers Plot does not show outliers at most zoomed-out level Jul 7, 2023
@jonasvdd jonasvdd added the question Further information is requested label Jul 7, 2023
@jonasvdd
Member

jonasvdd commented Jul 7, 2023

@vinay-hebb,

The purpose of Plotly-resampler, as implied by its name, is to resample (or aggregate) data in order to improve the scalability of time-series visualization. This aggregation process involves selecting a fixed number of data points within a given range. You can think of this as selecting single data points for sub-intervals within this range.

When zoomed out, a larger interval is used to select the data points, which may result in the omission of certain interesting points. However, these data-aggregation algorithms are designed to capture the general trend and extreme values. On the other hand, when zooming in, the interval decreases, leading to a more detailed representation of the data.

I'm rather curious why you wouldn't want the resample-on-zoom functionality.

So what can you do:

  • Accept that this resample-on-zoom is how plotly-resampler works, and possibly increase the number of data points that will be selected (see the docs: the max_n_samples argument of add_trace, or default_n_shown_samples in the FigureResampler constructor).
  • Alternatively, if the amount of data is manageable, you can choose to display the raw data without using plotly-resampler.

Hope this answers your question,
Kind regards,
Jonas

@vinay-hebb
Author

> The purpose of Plotly-resampler, as implied by its name, is to resample (or aggregate) data in order to improve the scalability of time-series visualization. This aggregation process involves selecting a fixed number of data points within a given range. You can think of this as selecting single data points for sub-intervals within this range.

I was under the impression that "somehow" plotly-resampler will always show at least 1 point for a cluster of points which are "close enough". In that sense, I thought that there is always a representative and that more points can be visualized by zooming in.

> When zoomed out, a larger interval is used to select the data points, which may result in the omission of certain interesting points. However, these data-aggregation algorithms are designed to capture the general trend and extreme values. On the other hand, when zooming in, the interval decreases, leading to a more detailed representation of the data.

I understand that there will be information loss in the zoomed-out view. Just to confirm: information loss is possible such that there is no representative for a cluster of points, right?

> I'm rather curious why you wouldn't want the resample-on-zoom functionality.

I am fine with the resample-on-zoom functionality. I just wanted to ensure that if I zoom "sufficiently enough", I should be able to view complete information.

I will try the max_n_samples argument. Can you kindly point me to the limitations/caveats of this algorithm where information loss could be crucial?

@jonasvdd
Member

jonasvdd commented Jul 26, 2023

Hi @vinay-hebb,

I hope you are doing well, and sorry for this late reply, I had some holidays! 🌴

> at least 1 point for a cluster of points which are "close enough".

A: You could certainly implement such an algorithm, but such algorithms often require more than one pass over the data (which can be time-consuming). As of now, all supported aggregation algorithms use only a single (and sometimes even a parallelizable single) pass over the raw data to select a data point for each bin. The bin size can be defined as:
$$size_{bin} = \frac{N}{n_{out}}$$
with $N$ the data size and $n_{out}$ the number of selected samples.
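
To make the formula concrete, here is a small sketch of the arithmetic. The helper name `bin_size` and the sample counts are hypothetical, chosen only for illustration; they are not part of the plotly-resampler API.

```python
def bin_size(n_points: int, n_out: int) -> float:
    """Average number of raw points per aggregation bin (N / n_out)."""
    if n_out <= 0:
        raise ValueError("n_out must be positive")
    return n_points / n_out


# Hypothetical trace of one million samples, aggregated to 1000 points.
print(bin_size(1_000_000, 1000))  # 1000.0
# The same n_out applied to a narrower range of 50,000 samples.
print(bin_size(50_000, 1000))     # 50.0
```

With a fixed $n_{out}$, the bin size shrinks in proportion to the number of raw samples in the visible range.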

It is this use of linear, bin-wise data aggregators that ensures plotly-resampler can scale to even billions of data points per trace! 🐎

> I thought that there is always a representative and more points can be visualized by zooming in.

A: Indeed, when you zoom in, the aggregation algorithm will re-run and select the same number of data points (i.e. $n_{out}$, which is 1000 by default) for this smaller data range, resulting in smaller bin sizes! :)

> I understand that there will be information loss in zoomed out view. Just to confirm, information loss is possible such that there is no representative for a cluster of points right?

A: As we use bins, from each of which we select only a fixed number of data points, and there may be more than one cluster per bin, you may indeed lose some points that are representative of other clusters.
However, as you can zoom in (i.e. resulting in aggregation on smaller bins) or increase the number of shown data points via $n_{out}$, we do not believe that this is a major limitation of plotly-resampler.
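
As a rough illustration of this effect, the sketch below uses a simplified per-bin min/max selector. This is an assumption made for demonstration and is not the library's actual MinMaxLTTB implementation; the function name `minmax_downsample` and the toy signal are hypothetical.

```python
def minmax_downsample(values, n_out):
    """Return sorted indices of the min and max of each of n_out // 2 equal bins."""
    n = len(values)
    n_bins = max(1, n_out // 2)
    selected = set()
    for b in range(n_bins):
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        if start == stop:  # empty bin (more bins than points)
            continue
        bin_idx = range(start, stop)
        selected.add(min(bin_idx, key=lambda i: values[i]))
        selected.add(max(bin_idx, key=lambda i: values[i]))
    return sorted(selected)


# Hypothetical signal: flat baseline with three unusual points in one bin.
values = [0.0] * 1000
values[100] = 10.0   # bin maximum
values[105] = 5.0    # mid-range outlier, neither min nor max of its bin
values[110] = -10.0  # bin minimum

overview = minmax_downsample(values, n_out=20)  # 10 bins of 100 points
print(105 in overview)  # False: the bin is represented by indices 100 and 110

# "Zooming in" re-runs the same aggregation on a narrower slice.
lo, hi = 100, 120
zoomed = [lo + i for i in minmax_downsample(values[lo:hi], n_out=20)]
print(105 in zoomed)  # True: bins now hold 2 points each
```

The extremes survive at the overview level, while the mid-range point only shows up once the bins are small enough for it to become the maximum of its own bin.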

> I am fine with resample on zoom functionality. I just wanted to ensure that if I zoom "sufficiently enough", I should be able to view complete information.

This should always be true! Also note how the orange [R] in the legend will disappear when the raw (non-aggregated) data is shown.

> I will try with max_n_samples argument. Can you kindly point me to limitations/caveats of this algorithm where information loss could be crucial?

Good question, there is no straightforward answer; I would suggest using an $n_{out}$ of 3x your figure canvas width in pixels. However, as your "cluster" might consist of a single data point between other extrema, this still does not guarantee that it will be selected. Also bear in mind that increasing this parameter increases the network payload size (and the rendering time on refresh), which may reduce interactivity.
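
The rule of thumb above amounts to a single multiplication. In this sketch the helper name `suggested_n_out` and the 1200 px canvas width are hypothetical values picked for illustration.

```python
def suggested_n_out(canvas_width_px: int, factor: int = 3) -> int:
    """n_out heuristic: a fixed multiple of the canvas width in pixels."""
    return factor * canvas_width_px


n_out = suggested_n_out(1200)  # hypothetical 1200 px wide figure
print(n_out)                   # 3600
print(n_out / 1000)            # 3.6, i.e. 3.6x the points of the default n_out
```

The second number gives a feel for the trade-off: each view update would carry roughly that many times more points per trace than with the default of 1000.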

I can maybe point you to #247, in which I elaborate more on the ideal number of samples and on the default downsampler (which is MinMaxLTTB).

Hope this helps you further,
Kind regards,
Jonas

@jonasvdd jonasvdd self-assigned this Jul 26, 2023