Add a Tabular data block that can handle simple CSVs and text files #592

Open
wants to merge 13 commits into
base: main
from

Conversation

ml-evs
Member

@ml-evs ml-evs commented Feb 16, 2024

This PR adds a simple block that passes a data file through pd.read_csv() and allows you to make a selectable scatter plot of its columns. It also adds a base component BokehBlock that can probably be used by several other blocks with minor tweaking of the supported extensions (which should be dynamic soon anyway).

We could make this increasingly useful by being more robust with read_csv args, and maybe by wrapping our own file reader. E.g., I took some random SECM data and realised that the header is formatted in such a way that pandas fails to read it. I think one nice idea could be to write a wrapper around read_csv that either a) does a binary search to detect the header 'properly', or b) reads a CSV file in reverse, populates the values until the first "broken" line, then treats that line as the column headers (this could actually be very useful generally, e.g., within pandas itself).
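To make option b) concrete, here is a minimal sketch of the reverse-reading idea, using only the standard library. It is not the code in this PR: the helper name `split_header_from_bottom` is hypothetical, and it assumes that data rows are purely numeric and that the first non-numeric row above them, if it has the same number of fields, holds the column names.

```python
import csv
import io


def _is_number(token: str) -> bool:
    """Return True if the token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def split_header_from_bottom(text: str, delimiter: str = ","):
    """Hypothetical helper: walk the rows from the bottom of the file upwards.

    Numeric rows with the same width as the last row are collected as data.
    The first row that breaks this pattern stops the scan; it is used as the
    header only if it has the same number of fields as the data rows.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return [], []

    width = len(rows[-1])
    header, data = [], []
    for row in reversed(rows):
        if len(row) == width and all(_is_number(tok.strip()) for tok in row):
            data.append([float(tok) for tok in row])
            continue
        # First "broken" line: keep it as the header if the widths match.
        if len(row) == width:
            header = [tok.strip() for tok in row]
        break

    data.reverse()  # restore the original top-to-bottom order
    return header, data
```

The number of lines above the detected header could then be passed to read_csv as skiprows, so that pandas still does the actual parsing.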


cypress bot commented Feb 16, 2024

Passing run #1835 ↗︎

0 40 0 0 Flakiness 0

Details:

Merge d3b9938 into 4ed1cf2...
Project: datalab Commit: ab33c878de ℹ️
Status: Passed Duration: 02:11 💡
Started: May 27, 2024 3:05 PM Ended: May 27, 2024 3:07 PM

Review all test suite changes for PR #592 ↗︎

@ml-evs ml-evs added the enhancement (New feature or request) and datablock (An issue pertaining to a specific datablock) labels Feb 16, 2024
@jdbocarsly
Member

Awesome!! This is going to be super useful.

Already works for me for a basic csv file, but some thoughts on possible future developments:

I agree that a UI to set the parameters for read_csv would be great. I've been thinking of something with live feedback (i.e., you can see the file, and see how changing the parameters splits it up differently, like some spreadsheet applications can do). That's probably a longer-term goal, but for now, just having options for sep, skiprows, and comment would probably work for the majority of cases. (Your idea of reading CSVs backwards is interesting too...)

Adding a UI for naming/renaming columns and potentially dropping spurious columns would be the other useful feature, but would take more work. Maybe add a button to open in Google Drive?

Incidentally, by adding sep=None to the read_csv call, pandas automatically tries to guess the separator using csv.Sniffer().sniff(). I tested this out, and it worked pretty well and let me read a variety of files, including some .csv cycling data and some .xye synchrotron data (tab-delimited). If we want to add the UI for specifying skiprows, etc., we can explicitly use Sniffer to guess starting parameters.
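As a rough illustration of using Sniffer explicitly, the sketch below guesses a separator from a text sample and returns keyword arguments that could seed a read_csv call. The function name `guess_read_csv_kwargs` and the restricted candidate list are assumptions for this example, not part of the PR.

```python
import csv


def guess_read_csv_kwargs(sample: str, candidates: str = ",\t;|") -> dict:
    """Hypothetical helper: guess starting read_csv arguments from a sample.

    The candidate delimiters are restricted (an assumption of this sketch) so
    that characters such as "." in numeric data are never considered.
    """
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        # Sniffer could not decide: let pandas guess with its Python engine.
        return {"sep": None, "engine": "python"}
    return {"sep": dialect.delimiter}
```

The result could then be used as `pd.read_csv(path, **guess_read_csv_kwargs(sample))`, where `sample` is the first few kilobytes of the file, and shown in the UI as editable defaults.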

@jdbocarsly
Member

A few more immediate comments:

  • When I tried an invalid file, it threw an error and made it so that I couldn't reload the sample ever again without erroring.
  • Unfortunately our selectable_axis_plot doesn't save the selected columns anywhere, so you have to reselect the correct columns every time you load the page. I wonder if we can figure out an elegant way to save those. Honestly, it would probably involve moving the select components out of bokeh and generating them directly in vue so that they behave like all the other inputs in a block.
  • Can you rename this to something like "CSVBlock", instead of "PlotterBlock"? The other blocks are named by what kind of data they deal with, rather than how they display the data.

Member

@jdbocarsly jdbocarsly left a comment

some thoughts

pydatalab/pydatalab/blocks/common.py (outdated, resolved)
webapp/src/resources.js (outdated, resolved)
webapp/src/components/datablocks/BokehBlock.vue (outdated, resolved)
pydatalab/pydatalab/blocks/common.py (outdated, resolved)
@ml-evs ml-evs changed the title Add a simple plotter block that can handle simple CSVs and text files Add a Tabular data block that can handle simple CSVs and text files Feb 16, 2024
@ml-evs
Member Author

ml-evs commented Feb 16, 2024

Should be merged after #590

@ml-evs ml-evs requested a review from jdbocarsly May 18, 2024 14:58
@ml-evs
Member Author

ml-evs commented May 18, 2024

I think this is good enough for a first hash, would be good to get in before 0.4.0


codecov bot commented May 18, 2024

Codecov Report

Attention: Patch coverage is 60.52632% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 67.18%. Comparing base (4ed1cf2) to head (d3b9938).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #592      +/-   ##
==========================================
- Coverage   67.27%   67.18%   -0.09%     
==========================================
  Files          62       62              
  Lines        3746     3782      +36     
==========================================
+ Hits         2520     2541      +21     
- Misses       1226     1241      +15     
Files Coverage Δ
pydatalab/pydatalab/blocks/__init__.py 100.00% <100.00%> (ø)
pydatalab/pydatalab/bokeh_plots.py 76.76% <100.00%> (ø)
pydatalab/pydatalab/blocks/common.py 60.81% <58.33%> (-2.35%) ⬇️

Member

@jdbocarsly jdbocarsly left a comment

Works well, but I get a few unnecessary warning displays when using:

  1. When making a new block, I get the following warning before starting: [Screenshot 2024-05-19 at 3 48 05 PM]
  2. When loading a sample: [Screenshot 2024-05-19 at 3 47 58 PM]

pydatalab/pydatalab/blocks/common.py (outdated, resolved)
@ml-evs
Member Author

ml-evs commented May 27, 2024

Works well, but I get a few unnecessary warning displays when using:

1. When making a new block, I get the following warning before starting: [Screenshot 2024-05-19 at 3 48 05 PM]
2. When loading a sample: [Screenshot 2024-05-19 at 3 47 58 PM]

I've fixed these warnings, and also tweaked the pandas settings so that we can at least read the awkward Raman text files in our repo and a simple CSV (with tests). Obviously there's no global solution for this unless we build out a whole UI, but I think this should at least fail nicely...

@ml-evs
Member Author

ml-evs commented May 27, 2024

A few more immediate comments:

* When I tried an invalid file, it threw an error and made it so that I couldn't reload the sample ever again without erroring.

Also I couldn't repro this, so hopefully I fixed it through other changes

Labels
datablock (An issue pertaining to a specific datablock), enhancement (New feature or request)

2 participants