Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rectY Mark is not updating bin step size as data is filtered #203

Open
RyanChapman2x opened this issue Oct 18, 2023 · 2 comments
Open

rectY Mark is not updating bin step size as data is filtered #203

RyanChapman2x opened this issue Oct 18, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@RyanChapman2x
Copy link

I have a data set with long min/max tails that results in step=100,000 (from the bins() function in bin.js). As I filter data with a shared Selection, the min/max tails disappear and the resulting stats should reduce to step=10,000 or less. However, the Mark retains its original min/max stats for this channel and re-uses these in bin.js, always resulting in the same (now inappropriate) step=100,000. My bar plot therefore becomes a single large bar that covers my entire filtered data set.

What I expect to happen:
As data is filtered, the min and max statistics for that channel on that Mark should be recalculated.

How do I set the Mark to recalculate min/max for its channel as a result of filtering on the Selection?

So far I've tried updating Mark class with a filterIndexable function that always returns true, but this has no effect.

@jheer jheer added the enhancement New feature or request label Oct 18, 2023
@jheer
Copy link
Member

jheer commented Oct 18, 2023

Thanks for raising the issue! You are correct that currently the bins do not update: for now the [min, max] field stats are calculated once upon initialization and then reused. To get dynamic bins we need to update the internal mark implementation to update the range extent stats in response to filtering selections. I'd like to find a clean/elegant way to do this, such that it is properly reused across a range of mark types and their corresponding data transformations.

However, please note that with upcoming travel and other obligations it may be weeks or even 1-2 months before I have time to do this work. In the meantime I'd be happy to help provide guidance in anyone is interested in starting a PR.

Here is where the bin transform accesses the mark's statistics:
https://github.com/uwdata/mosaic/blob/main/packages/vgplot/src/transforms/bin.js#L35

A first question is if we should update these stats dynamically, or develop an alternative abstraction for getting the current field extents.

@RyanChapman2x
Copy link
Author

Here's my monkeypatch.

I passed the filter variable down from FilterGroup.defaultUpdate() into a new Coordinator.updateCatalog() function (the updateCatalog() function performs the same client.fieldInfo update that was originally only in the Coordinator.connect() function, only this function is designed to be called prior to each updateClient() call so that the statistics are available when each client queries for new data). We continue to pass filter down into Catalog.queryFields(), which again passes filter into Catalog.fieldInfo(). This function is where the actual logic occurs: we take the original query output from the summarize function and we just patch filter into query.where.

This probably isn't ideal, but it works decently in my limited testing. Note that this does result in NaN errors in DuckDB in some cases; I think this happens when the selection causes the filter to return no results and therefore the min/max statistics turn into NaNs somewhere, which then get passed into SQL where they have no meaning. But it doesn't stop my use case from running so I'll leave that solution for someone else to figure out.

FilterGroup.js

function defaultUpdate(mc, clients, selection) {
  return Promise.all(Array.from(clients).map(client => {
    const filter = selection.predicate(client);
    if (filter != null) {
      mc.updateCatalog(client, filter);
      return mc.updateClient(client, client.query(filter));
    }
  }));
}

Coordinator.js

  async updateCatalog(client, filter) {
    const { catalog } = this;

    // retrieve field statistics
    const fields = client.fields();
    if (fields?.length) {
      client.fieldInfo(await catalog.queryFields(fields, filter));
    }
  }

Catalog.js

  async queryFields(fields, filter={}) {
    const list = await resolveFields(this, fields);
    const data = await Promise.all(list.map(f => this.fieldInfo(f, filter)));
    return data.filter(x => x);
  }

Catalog.js

async fieldInfo({ table, column, stats }, filter={}) {  
  const tableInfo = await this.tableInfo(table);  
  const colInfo = tableInfo[column];  
  
  // column does not exist  
  if (colInfo == null) return;  
  
  // no need for summary statistics  
  if (!stats?.length) return colInfo;  
  
  let query = summarize(colInfo, stats)  
  query.query.where = filter || query.where;  
  const result = await this.mc.query(  
      query,  
      { persist: true }  
  );  
  const info = { ...colInfo, ...(Array.from(result)[0]) };  
  
  // coerce bigint to number  
  for (const key in info) {  
    const value = info[key];  
    if (typeof value === 'bigint') {  
      info[key] = Number(value);  
    }  
  }  
  
  return info;  
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants