Hi! I had an issue similar to #653, in that I needed to plot a large 1D data set over time. Something like df.viz.scatter(), but able to handle large data sets. I want to see potential single outliers in the data, so straight downsampling was not an option. After some experimentation, I found a simple way to do this using the capabilities provided by vaex.
I am asking whether you would be interested in adding native support for something like this.
The idea is to provide matplotlib with a rasterized image of the scatter plot instead of all the data points. All data points that fall into the same screen pixel can be treated as one. I suppose this is a kind of downsampling, but in a way that doesn't affect the scatter plot. In practice, I achieve this by using vaex.dataframe.DataFrame.count() to generate a heatmap of the counts, and then limit the value of all nonzero bins to 1. This effectively generates a black/white scatter plot where each data point is one pixel large. We can then apply a marker to the scatter with scipy.ndimage.grey_dilation(), and set the marker color. I have provided a code example below.
import vaex
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage  # the submodule must be imported explicitly

df = vaex.open('data_8M_samples.hdf5')  # Open data

fig = plt.figure(1)  # Create figure
ax = fig.subplots()
plt.grid(True)  # Enable gridlines

# Fetch size of plot area in pixels
bbox = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
plot_width_pixels = round(bbox.width * fig.dpi)
plot_height_pixels = round(bbox.height * fig.dpi)

xlims = df['time'].minmax()
ylims = df['data_1'].minmax()

# Make a heatmap of the data counts with bins equal to the screen resolution of the figure
counts = df.count(None,
                  binby=[df['time'], df['data_1']],
                  shape=[plot_width_pixels, plot_height_pixels],
                  limits=[xlims, ylims])
monochrome_scatter_plot = np.minimum(counts, 1)
monochrome_scatter_plot = np.rot90(monochrome_scatter_plot)

# Create the marker that represents a data point
marker_radius = 5
marker_color = [0, 0.4470, 0.7410]
xx, yy = np.mgrid[-marker_radius:marker_radius + 1, -marker_radius:marker_radius + 1]
footprint = np.logical_or(xx - yy == 0, xx + yy == 0)  # This particular marker is for a cross

# Apply the marker to the scatter plot
monochrome_scatter_plot_with_markers = scipy.ndimage.grey_dilation(monochrome_scatter_plot, footprint=footprint)

# Convert monochrome to RGB with alpha channel
color_scatter_plot = np.stack([monochrome_scatter_plot_with_markers * marker_color[0],  # Red channel
                               monochrome_scatter_plot_with_markers * marker_color[1],  # Green channel
                               monochrome_scatter_plot_with_markers * marker_color[2],  # Blue channel
                               monochrome_scatter_plot_with_markers], axis=2)  # Alpha channel
plt.imshow(color_scatter_plot, extent=[xlims[0], xlims[1], ylims[0], ylims[1]], aspect='auto')
plt.show(block=True)
This example is non-interactive, but some modifications can add full interactivity similar to the interactive widgets in Jupyter notebook. I don't use notebooks, so I made an interactive version in a traditional python script to test the concept by recalculating the counts whenever zooming/panning/resizing the window (source code below). This also supports showing multiple data series in the same scatter plot due to the alpha channel of the rasterized image.
Legends could be added manually, but I have not done so here. Lines between the data points are not possible, though, as there could be data points outside the plot area that we have not accounted for.
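As a rough idea of what a manual legend could look like: since the rasterized image carries no per-series artists, one option is to hand Matplotlib "proxy" artists that only exist to populate the legend. This is a minimal sketch, not part of the code in this issue; the `series` list and its labels are made up for illustration, and the Matplotlib marker symbols only approximate the rasterized footprints.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line in an interactive script
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Hypothetical series descriptions: (label, Matplotlib marker symbol, RGB color).
# The colors match the two PlotMarker colors used in this issue.
series = [
    ("data_1", "o", (0, 0.4470, 0.7410)),
    ("data_2", "x", (0.8500, 0.3250, 0.0980)),
]

fig, ax = plt.subplots()

# Proxy artists: empty lines that draw nothing in the axes
# and are only used to build the legend entries.
handles = [
    Line2D([], [], linestyle="none", marker=symbol, color=color, label=label)
    for label, symbol, color in series
]
legend = ax.legend(handles=handles, loc="upper right")
```

The legend is independent of the image data, so it would not need to be rebuilt when the counts are recalculated after zooming or panning.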
Source code for the interactive version:
import vaex
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage  # the submodule must be imported explicitly


def main():
    df = vaex.open('data_8M_samples.hdf5')
    df.my_viz.my_scatter(df['time'], df['data_1'], PlotMarker(shape='filled-circle', radius=5, color=[0, 0.4470, 0.7410]))
    df.my_viz.my_scatter(df['time'], df['data_2'], PlotMarker(shape='cross', radius=5, color=[0.8500, 0.3250, 0.0980]))
    plt.grid(True)
    plt.title('No downsampling!')
    plt.show(block=True)


# Marker used to represent a data point in the scatter plot
class PlotMarker:
    def __init__(self, shape='filled-circle', radius=5, color=None):
        if color is None:
            color = [0, 0.4470, 0.7410]
        self.shape = shape
        self.radius = radius
        self.color = color


# Custom interactive scatter plot
@vaex.register_dataframe_accessor('my_viz', override=True)
class ScatterPlot(object):
    def __init__(self, df):
        self.df = df

    def my_scatter(self, x, y, marker=None):
        if marker is None:
            marker = PlotMarker()

        # Get data limits
        x_lims = x.minmax()
        y_lims = y.minmax()

        # Get axis limits
        ax = plt.gca()
        fig = ax.figure
        if len(ax.get_images()) == 0:
            # Zoom slightly out on the y-axis, to ensure all data is easily visible
            ylim_range = (y_lims[1] - y_lims[0])
            y_lims = [y_lims[0] - 0.05 * ylim_range, y_lims[1] + 0.1 * ylim_range]
        else:
            # If another scatter is already plotted, zoom out if necessary, but don't zoom in
            ylim_range = max(y_lims[1], ax.get_ylim()[1]) - min(y_lims[0], ax.get_ylim()[0])
            y_lims[0] = min(y_lims[0] - 0.05 * ylim_range, ax.get_ylim()[0])
            y_lims[1] = max(y_lims[1] + 0.05 * ylim_range, ax.get_ylim()[1])
        ax.set_xlim(x_lims)
        ax.set_ylim(y_lims)

        im, panning = None, None

        def update_plot(_=None):
            nonlocal im, panning
            if panning:  # Panning will result in a constant stream of callbacks. Updating each time is laggy.
                return

            # Fetch size of plot area in pixels
            bbox = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
            width_pixels, height_pixels = round(bbox.width * fig.dpi), round(bbox.height * fig.dpi)

            # Get axis limits to calculate scatter plot with
            xlims, ylims = ax.get_xlim(), ax.get_ylim()

            # Make a heatmap of the data counts with bins equal to the screen resolution of the figure
            counts = self.df.count(None, binby=[x, y], shape=[width_pixels, height_pixels], limits=[xlims, ylims])
            color_plot = _make_scatter_plot_image(counts, marker)

            # Show image
            if im is None:
                im = ax.imshow(color_plot, extent=[xlims[0], xlims[1], ylims[0], ylims[1]], aspect='auto')
            else:
                # When refreshing the plot, update the old image
                im.set(data=color_plot, extent=[xlims[0], xlims[1], ylims[0], ylims[1]])
            ax.figure.canvas.draw()

        update_plot()
        ax.callbacks.connect('xlim_changed', update_plot)
        ax.callbacks.connect('ylim_changed', update_plot)
        fig.canvas.mpl_connect('resize_event', update_plot)

        # When panning the view, the constant callbacks to update_plot cause lots of lag.
        # Only update when panning is finished.
        def panning_started(_=None):
            nonlocal panning
            panning = True

        def panning_stopped(_=None):
            nonlocal panning
            panning = False
            update_plot()

        fig.canvas.mpl_connect('button_press_event', panning_started)
        fig.canvas.mpl_connect('button_release_event', panning_stopped)


def _make_scatter_plot_image(counts, marker):
    xx, yy = np.mgrid[-marker.radius:marker.radius + 1, -marker.radius:marker.radius + 1]
    if marker.shape == 'filled-circle':  # Circle (filled in)
        footprint = xx ** 2 + yy ** 2 < (marker.radius + 0.5) ** 2
    elif marker.shape == 'hollow-circle':  # Circle (not filled in)
        footprint = np.logical_and((marker.radius - 0.5) ** 2 < xx ** 2 + yy ** 2, xx ** 2 + yy ** 2 < (marker.radius + 0.5) ** 2)
    elif marker.shape == 'filled-square':  # Square (filled in)
        footprint = np.ones(shape=xx.shape)
    elif marker.shape == 'hollow-square':  # Square (not filled in)
        footprint = np.logical_or(np.logical_or(xx == marker.radius, xx == -marker.radius), np.logical_or(yy == marker.radius, yy == -marker.radius))
    elif marker.shape == 'cross':  # Cross
        footprint = np.logical_or(xx - yy == 0, xx + yy == 0)
    else:
        raise ValueError(f'Marker {marker.shape} in _make_scatter_plot_image not recognized')

    monochrome_scatter_plot = np.minimum(counts, 1)
    monochrome_scatter_plot = np.rot90(monochrome_scatter_plot)
    monochrome_scatter_plot_with_markers = scipy.ndimage.grey_dilation(monochrome_scatter_plot, footprint=footprint)
    color_plot = np.stack([monochrome_scatter_plot_with_markers * marker.color[0],
                           monochrome_scatter_plot_with_markers * marker.color[1],
                           monochrome_scatter_plot_with_markers * marker.color[2],
                           monochrome_scatter_plot_with_markers], axis=2)
    return color_plot


if __name__ == '__main__':
    main()
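For anyone who wants to try the core idea without the HDF5 file or vaex, the count, clip and dilate steps can be sketched on synthetic data. This is only an illustration under stated assumptions: `np.histogram2d` stands in for `df.count(binby=...)`, and the data, pixel dimensions and outlier position are made up. It checks that a single outlier still occupies a pixel after rasterization, which is the property that plain downsampling would not guarantee.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: a dense band plus one far outlier.
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = rng.normal(0.0, 0.01, n)
x[0], y[0] = 0.503, 0.903  # the single outlier that must stay visible

width_px, height_px = 320, 240      # assumed plot area in pixels
limits = [[0.0, 1.0], [-1.0, 1.0]]  # assumed axis limits

# Count points per pixel; np.histogram2d plays the role of df.count(binby=...).
counts, _, _ = np.histogram2d(x, y, bins=[width_px, height_px], range=limits)

# Clip to 0/1 so a bin holding one point looks the same as a bin holding thousands.
mono = np.minimum(counts, 1)
mono = np.rot90(mono)  # (x, y) bins -> image rows/columns, with y pointing up

# Stamp a cross-shaped marker onto every occupied pixel.
r = 3
xx, yy = np.mgrid[-r:r + 1, -r:r + 1]
footprint = np.logical_or(xx == yy, xx == -yy)
stamped = ndimage.grey_dilation(mono, footprint=footprint)

# Convert to RGBA: the marker color in RGB, the stamped mask as alpha.
color = (0, 0.4470, 0.7410)
rgba = np.stack([stamped * c for c in color] + [stamped], axis=2)

# Locate the outlier's pixel in image coordinates (row 0 is the top of the image).
col = int((x[0] - limits[0][0]) / (limits[0][1] - limits[0][0]) * width_px)
row_from_bottom = int((y[0] - limits[1][0]) / (limits[1][1] - limits[1][0]) * height_px)
row = height_px - 1 - row_from_bottom
```

Here `rgba` could be passed to `imshow` with `extent` set from `limits`, in the same way as in the examples above.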