Add the fuse option to the data entered linearly #3011

Open
SoftTools59654 opened this issue Feb 22, 2024 · 1 comment

@SoftTools59654

A short review of the size and speed differences between ZNG and CSV files:

I imported a 10 GB CSV file into Zui in Line mode because the file had problems with its structure.

The result was interesting to me: the stored data was reduced to about 30% of the original CSV's size, in some cases even 25%. (Of course, importing a well-formed CSV or JSON file shrinks it even further, but when a file has no proper structure at all, the Line option is the best solution.)

The search speed is also acceptable, thanks to the roughly 75% reduction in size.

This makes it a suitable option for very large files, because CSV files take up a lot of space.

I have a request:

Is it possible to activate the fuse option, which appears in the query display section for standard CSV/JSON data, for Line-mode data as well? If the lines contain separator characters (as in the image below), or if a JSON file that meets minimum standards is imported as lines because it has problems, it would be useful if fuse could be enabled for those search results too.

It would be a suitable option for data that meets minimum standards, such as having separators, even for problematic JSON. I saw no limitations when importing data line by line, whereas CSV and JSON files that fall short of the standards often run into many problems on import.

[screenshot]

[screenshot]

@philrz
Contributor

philrz commented Feb 23, 2024

@SoftTools59654: Sorry, but I'm having difficulty understanding your question. I wonder if maybe you have some misunderstanding about what the fuse operator does, and/or based on your mention of "some problems with the file structure" maybe you're asking in a different way for the "Fault tolerant data input" enhancement tracked in brimdata/zed#4546?

In any case, I'll try to read back my best guesses of what you might be getting at. It's easiest to have these discussions if we're both working off the same sample data, so I'll point to some public samples I found. If I don't manage to reproduce the effect you're seeing with your data, it would be ideal if you could attach a sample here or point to a URL of publicly available data that shows the problem.

I did some web searches for "movies" CSV data since that seemed to be what you were showing in your screenshots. I didn't find any 10 GB in size, but I did find a smaller one at https://gist.github.com/jheer/4dee9b65d5f4cab64235e28d0e4010dc. In my demo videos below I've downloaded that to my desktop and am using it as my starting point. I'm using Zui Insiders 1.6.1-13 to reproduce these.

For well-formed CSV like this, you can drag it in and the auto-detect will read it into the pool. The multiple "shapes" highlighted in this case are a side effect of frequent null values, which result in many different Zed record types. If this poses a problem, you can apply fuse by clicking the link as shown first in the video or by manually adding it in the editor. Or, if you want the data fused immediately as it lands in the pool, you can apply fuse as a shaper during data load, as I show near the end of the video.

Demo1.mp4
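
For reference, the Zed in both of those spots is as minimal as it gets. Whether typed into the query editor or pasted in as the shaper during load, it's just the operator by itself:

fuse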

Question: Is there something you're looking to accomplish with well-formed CSV and fuse different than what I show here?

Since you spoke of "problems to import", I suspect where you're looking for help is something more like this next example. Here I've created a separate small test file, gist_bad.csv, that has just the first handful of lines from the original CSV with a "bad" line inserted:

$ cat gist_bad.csv 
Title,US Gross,Worldwide Gross,US DVD Sales,Production Budget,Release Date,MPAA Rating,Running Time (min),Distributor,Source,Major Genre,Creative Type,Director,Rotten Tomatoes Rating,IMDB Rating,IMDB Votes
The Land Girls,146083,146083,,8000000,12-Jun-98,R,,Gramercy,,,,,,6.1,1071
"First Love, Last Rites",10876,10876,,300000,7-Aug-98,R,,Strand,,Drama,,,,6.9,207
bad line
I Married a Strange Person,203134,203134,,250000,28-Aug-98,,,Lionsgate,,Comedy,,,,6.8,865

I try to import this data in the next video. As it shows, the auto-detect finds the "bad line" and therefore recognizes that it can't import the file as structured CSV, so I revert to the Line input mode, which sounds like it might be similar to what you're doing.

Demo2.mp4

This is where I wonder if maybe you're misunderstanding fuse. Once that data is imported in Line mode it's just a sequence of strings: you can certainly search it as strings, and/or use functions like split() to chop it up or grok() to parse it into record fields. But the way you mention "separator characters" makes me wonder if you think the mere presence of commas in these strings will somehow allow fuse to turn the data into columns like we saw in the first video, and that's not the case. If that's what you were seeking, you'd effectively be asking for what's described in brimdata/zed#4546, e.g., some way for the loading step to recognize the bad lines, skip over them, and load the rest as well-formed data that can be fused into columns. Or perhaps you'd be asking for a new feature that could parse loaded strings into columns after they're already in the pool (e.g., something like what's described in brimdata/zed#3758, where you could specify the names/types to give the columns and each line would then be read as comma-separated values using that schema).
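
To make the split() idea concrete, here's a quick sketch of my own (not something shown in the videos): with the Line-mode strings in the pool, a query like the one below chops each line on commas. Note that it only yields an array of strings per line, with none of the column names or types that real CSV parsing would assign.

yield {fields: split(this, ",")}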

Assuming I'm on the right track with that, next let's look at the example of JSON data that has problems importing, because you've got more options here. I've created a small test file, gist_bad.json, that's similar to the CSV example:

$ cat gist_bad.json 
{"Title":"The Land Girls","US Gross":146083,"Worldwide Gross":146083,"US DVD Sales":null,"Production Budget":8000000,"Release Date":"12-Jun-98","MPAA Rating":"R","Running Time (min)":null,"Distributor":"Gramercy","Source":null,"Major Genre":null,"Creative Type":null,"Director":null,"Rotten Tomatoes Rating":null,"IMDB Rating":6.1,"IMDB Votes":1071}
{"Title":"First Love, Last Rites","US Gross":10876,"Worldwide Gross":10876,"US DVD Sales":null,"Production Budget":300000,"Release Date":"7-Aug-98","MPAA Rating":"R","Running Time (min)":null,"Distributor":"Strand","Source":null,"Major Genre":"Drama","Creative Type":null,"Director":null,"Rotten Tomatoes Rating":null,"IMDB Rating":6.9,"IMDB Votes":207}
bad line
{"Title":"I Married a Strange Person","US Gross":203134,"Worldwide Gross":203134,"US DVD Sales":null,"Production Budget":250000,"Release Date":"28-Aug-98","MPAA Rating":null,"Running Time (min)":null,"Distributor":"Lionsgate","Source":null,"Major Genre":"Comedy","Creative Type":null,"Director":null,"Rotten Tomatoes Rating":null,"IMDB Rating":6.8,"IMDB Votes":865}

In the final video, I take similar steps to read it as Line because the auto-detect once again notices the bad line. But here we have some more options. Specifically, the parse_zson() function will read any string and, if it parses as a ZSON or JSON value, return that value. So here it creates a valid record for each valid input line, and a Zed error value for the one bad line. I filter that error out and am then able to apply fuse to see the "good data" in columns.

Demo3.mp4

Here's the Zed I applied:

parse_zson(this)      // parse each string; the bad line becomes an error value
| !has_error(this)    // filter out the error values
| fuse                // merge the remaining records into unified columns

And like we did in the first video, you could choose to use that as your "shaper" at load time if you want to avoid having the string representation loaded into your pool first.

Let me know if any of those help you with what you're trying to achieve and if there's still something missing in there that's needed for your specific use case. Being able to show it with your own sample data or the ones I've been using here would help. Thanks.
