The KDE transform creates values where there are none when used with {"resolve": "shared"} #3815

Open
joelostblom opened this issue Oct 28, 2023 · 6 comments
Assignees
Labels
enhancement For enhancement of existing features

Comments

@joelostblom
Contributor

joelostblom commented Oct 28, 2023

If {"resolve": "shared"} is set, the extent of a grouped density transform incorrectly uses the min/max of the entire dataset instead of the min/max of each group. This results in long lines where there are no observations at all, instead of the density stopping at the last data point in the group. I noticed this in Vega-Lite, but wonder if it could be fixed directly in the KDE transform in Vega instead of doing some post-processing, such as dropping zeros, in Vega-Lite. I understand that the computation needs to happen over the same domain to enable stacking, but would it be possible to trim the densities after that to only include values that exist within each group? This would also be helpful for the violin plot implementation.

This chart was created in Altair 5.1.2, which uses VL 5.15.1, and shows the undesired behavior:

image
Open the Chart in the Vega Editor

The desired behavior would look like this, where each density is cut at the min/max values of its group:

image

Altair code
import altair as alt
from vega_datasets import data

source = data.iris.url

alt.Chart(source, height=100).transform_density(
    'petalWidth',
    groupby=['species']
).mark_area(stroke='black').encode(
    alt.X('value:Q'),
    alt.Y('density:Q').stack(False),
    alt.Facet('species:N', columns=1, title=None).header(labelFontWeight='bold', labelFontSize=12)
)

Ref vega/vega-lite#9078
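
To make the requested trimming concrete, here is a minimal sketch in plain Python. It is not how Vega's kde transform is implemented; the function names (`gaussian_kde`, `shared_grid`, `clipped_densities`), the fixed bandwidth, and the toy data are all assumptions for illustration. It evaluates every group on one grid spanning the whole dataset (as the shared resolve does) and then keeps only the sample points that fall inside each group's own min/max.

```python
import math


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
        for x in grid
    ]


def shared_grid(groups, steps):
    """Evenly spaced sample points spanning the min/max of all groups combined."""
    lo = min(min(values) for values in groups.values())
    hi = max(max(values) for values in groups.values())
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def clipped_densities(groups, steps=50, bandwidth=0.1):
    """Compute each group's density on the shared grid, then drop the
    sample points that lie outside that group's own data range."""
    grid = shared_grid(groups, steps)
    result = {}
    for name, samples in groups.items():
        density = gaussian_kde(samples, grid, bandwidth)
        lo, hi = min(samples), max(samples)
        result[name] = [(x, d) for x, d in zip(grid, density) if lo <= x <= hi]
    return result


# Made-up toy data standing in for two groups with non-overlapping ranges.
groups = {"a": [0.1, 0.2, 0.3, 0.4], "b": [1.5, 1.8, 2.0, 2.5]}
clipped = clipped_densities(groups)
```

Because the surviving points still come from the shared grid, the x values of different groups stay aligned where their ranges overlap, which is the property stacking depends on.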

@joelostblom
Contributor Author

@jheer Do you think this is something that is suitable for implementation on the Vega side of things or does it belong in the Vega-Lite repo?

A related issue stemming from this is that setting the x-scale to "independent" does not have the intended effect. Take for example this chart, where I would like the axis in each subplot to be adjusted to span only the range of the data, so that I can see both distributions clearly:

image
Open the Chart in the Vega Editor

@mattijn
Contributor

mattijn commented Nov 10, 2023

If I add "y": "independent" to the scale resolver in the VL spec and, in the Vega spec, remove the impute transform and set the kde transform's resolve to independent, the result seems to be what you are after:
image

Open the Chart in the Vega Editor

In my opinion this is something for the VL-repository.

@joelostblom
Contributor Author

Thanks @mattijn, setting the transform resolve to independent would fix both of the specs above, but it would lead to a jagged appearance when two densities are in the same chart, as described in vega/vega-lite#9078. So we would either need another way to fix that (maybe setting the steps + the extent?) so that we can use "resolve": "independent", or make the shared resolve more flexible so that it works with the examples in this issue. I'm happy with whichever solution is easiest to implement and supports these use cases.
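
As a rough illustration of why setting the steps + the extent might help, here is a small plain-Python sketch. It rests on an assumption about the cause: that the jagged look comes from the two densities being sampled at different x positions when each group uses its own extent. The helper names (`sample_grid`, `data_extent`) and the toy data are hypothetical and are not part of Vega or Vega-Lite.

```python
def sample_grid(extent, steps):
    """Evenly spaced sample points across an extent given as (lo, hi)."""
    lo, hi = extent
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def data_extent(samples):
    """The min/max of one group's data, as an independent resolve would use."""
    return (min(samples), max(samples))


# Made-up toy data for two groups.
a = [0.1, 0.2, 0.3, 0.4]
b = [1.5, 1.8, 2.0, 2.5]

# Independent extents: each group gets its own grid, so the x positions differ
# and values from the two groups cannot be paired up point by point.
grid_a = sample_grid(data_extent(a), 5)
grid_b = sample_grid(data_extent(b), 5)

# Explicit extent + steps: both groups are sampled at identical x positions.
fixed_extent = (0.0, 3.0)  # assumed value chosen by the chart author
shared_a = sample_grid(fixed_extent, 5)
shared_b = sample_grid(fixed_extent, 5)
```

With a fixed extent the grids line up again, but each density once more spans the full extent, so this alone would not stop a density at its group's last data point.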

@joelostblom
Contributor Author

After investigating this further, I can give a more comprehensive explanation of what is going on. Here is a single spec that contains both issues. I can't find any combination of parameters that supports each density ending at the min/max value of the data AND lets the two grouped/colored densities display properly on top of each other.

Step 1: Coloring by one variable and faceting by another. You can see how the lower facet ("Open") is extended all the way to the x-axis min around 3.5, although there are no data points there. (Also note that by default Vega-Lite now stacks areas, which is not ideal for distribution densities since it makes them harder to compare, but this is a separate issue, vega/vega-lite#9170.)

image
Open the Chart in the Vega Editor

Step 2: I can fix the issue with the extension to zero if I set the resolve to independent and remove the impute transform as you suggested. However, that automatically unstacks the areas (which often is a good default but would be unexpected to someone who explicitly specified a stacked density):

image
Open the Chart in the Vega Editor

In other words, I can't find a combination of parameters that allows me to create this chart (correctly stacked on top, and correct extent on the bottom):

image

@domoritz domoritz added enhancement For enhancement of existing features and removed bug For bugs or other software errors labels Dec 18, 2023
@domoritz domoritz self-assigned this Dec 18, 2023
@domoritz
Member

It looks like the first case can be resolved by explicitly setting the kde resolve, which we add in vega/vega-lite#9172.

#3815 (comment) is a bit trickier but could be addressed with a clip property that removes density values outside the original data domain per group. This could be a useful feature anyway (for both shared and independent density computation).

@domoritz
Member

I'm working on this now
