Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame distinct with multiple columns in the selector is not working if the name of the columns is variable #155

Open
marcodisa opened this issue Jan 13, 2023 · 4 comments

Comments

@marcodisa
Copy link

Hi, I'm not sure if this is the proper section to post this question,
First of all I’d like to congratulate you on the amazing work you’ve done with Data Forge npm and Data Forge Notebook. It’s a very extraordinary package and the Notebook is an extremely useful tool.
I’m seriously considering the upgrade to the pro version for my further projects.

Anyway, I’m stuck on a DataFrame distinct with two parameters in the selector.

If I run the code below it perfectly does what I want, “filiale” and “CAP” are two columns of df.

const safeJoinChar = '$';
var distinct_join = df.distinct(value => [value.filiale, value.CAP].join(safeJoinChar));
console.log(distinct_join.toString());

Actually I need to put the execution of the select distinct into a loop. I need the select distinct of every column with all the other columns, two columns at a time.
Long story short, I have a loop that in every iteration assigns the value to two variables serie1_name and serie2_name, in every iteration I correctly get my two variables like the code below.
Now I need the dataframe distinct to work with variable names not with the explicit names of the columns.
How the code below should be to work fine with variable column names?

var serie1_name = "filiale";
var serie2_name = "CAP";
const safeJoinChar = '$';

var distinct_join = df.distinct(value => [value.serie1_name, value.serie2_name].join(safeJoinChar));

console.log(distinct_join.toString());

I’ve tried many ways to make it work but I wasn’t able to handle it myself.
Is there something I’m missing out?
Can I get any hints or suggestions?

Thank you very much!

@marcodisa
Copy link
Author

marcodisa commented Jan 13, 2023

After many attempts I solved the issue using the javascript function constructor to create a function from a string

var serie1_name = "filiale";
var serie2_name = "CAP";
const safeJoinChar = '$';

var distinctFormula = new Function(serie1_name, serie2_name, safeJoinChar, "return value => [value."+ serie1_name + ", value." + serie2_name + "].join("+safeJoinChar+")");

var distinct_join = df.distinct(distinctFormula());

console.log(distinct_join.toString());

Now it works perfectly!

Anyway, is there a more clever solution?

@ashleydavis
Copy link
Member

That's strange. You shouldn't have to use a string.

Are you able to publish a runnable code example in GitHub of the broken version? Then I can take a look at it.

@marcodisa
Copy link
Author

marcodisa commented Jan 17, 2023

Hi, thanks for the reply.
Well, I need to run an automatic exploratory analysis on datasets that I don't know, no human will look at the dataset but it all has to be handled by an algorithm.
For this reason I loop through all the columns and due to the fact I don't know their names in advance I need them to be variable.
Moreover I need to compare the distinct count of every columns with the join distinct count of each columns with all the others, two columns at a time.

Here is an example code and how I solved the "issue"

const dataForge = require('data-forge');
require('data-forge-fs');

csvString = `col0,col1,col2,col3
0,Aa,Ba,Aa
1,Ab,Bb,Bb
2,Ac,Bc,Cc
3,Ad,Bd,Dd
4,Ae,Be,Ee`

const df = dataForge.fromCSV(csvString);
var columns = df.getColumns();
var series_distincts = [];
const safeJoinChar = '$';

for (const column of columns) {            //loop through all the columns
    var name = column.name;
    const count_distinct = column.series.distinct().filter(value => value != "").count();     //single column distinct count
    
    let element = {"name": name, "count_distinct": count_distinct};
    series_distincts.push(element);
}
//console.log(series_distincts);

for (var i = 0; i < series_distincts.length; i++) {         //loop to compare the single distinct count and the join distinct count

    var serie1_name = series_distincts[i].name;
    var distinct1 = series_distincts[i].count_distinct;
    console.log("Column to compare with all the others in this iteration: " + serie1_name);

    for (var j = 0; j < series_distincts.length; j++) {
        if (i == j) {
            continue;           //don't need to compare a column with itself
        } else {
            var serie2_name = series_distincts[j].name;
            var distinct2 = series_distincts[j].count_distinct;
            console.log("Column compared: " + serie2_name);

            var distinctFormula = new Function(serie1_name, serie2_name, safeJoinChar, "return value => [value." + serie1_name + ", value." + serie2_name + "].join(" + safeJoinChar + ")");         //solution I found after many attempts

            var distinct_join = df.distinct(distinctFormula()).count();

            console.log(serie1_name + " count distinct: " + distinct1);
            console.log(serie2_name + " count distinct: " + distinct2);
            console.log(serie1_name + " + " + serie2_name + " join distinct count: " + distinct_join);
        }
    }

}

Obviously this dataset doesn't make sense, it's just an example of what the algorithm should do. In fact, the algorithm then reads the join distinct count and goes on with its tasks.

Another "issue" that I experienced is that before executing a join distinct I have to remove all the spaces from the column names. (This step is not present in the example above but I do it in my algorithm).
Is there a way to bypass dataframe renameSeries()?

Hope this example helps to clarify what I meant in the original question, anyway now I solved it using the javascript function constructor but I'd like to know if there is a more clever solution.

@ashleydavis
Copy link
Member

Hey @marcodisa any chance you could fork the repo then get your code working as a failing test? Then I can work towards fixing the code for you and making your test pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants