ParseSetup and ParserInfo to utilize supplied domains. #16165

Open
monil1334 opened this issue Apr 19, 2024 · 5 comments
@monil1334

Currently, when parsing a file, the domains for categorical columns are populated from the file data. I have an issue (or a suggestion) regarding how the parser sets these domains. I have predefined categorical domain values based on some enums and want to preserve the order in which each domain is defined. The suggested solution is to use the supplied order of values for categorical domains.

In the code, there is a TODO that says to use _setup.domains instead of categoricals. I have tried populating the categoricals with the supplied domains, but even for categoricals with defined domains, the ordering still gets changed by the sorting in GatherCategoricalDomainsTask. I have also tried modifying the CSVParser to update the isDomainProvided flag as a way around it, but I haven't been able to make that work.
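
To make the request concrete, here is a minimal plain-Java sketch of the difference between a domain derived by sorting the observed values and a domain taken from a supplied list. It does not use any H2O classes: `SuppliedDomainOrder`, `resolve`, and `sortedObserved` are hypothetical names for illustration only, and the sorting step merely stands in for the behavior described above rather than reproducing the parser's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustration only; none of these names exist in H2O. */
final class SuppliedDomainOrder {

  /** Domain as it would look if it were built by sorting the values seen in the data. */
  static String[] sortedObserved(String[] observed) {
    String[] copy = observed.clone();
    Arrays.sort(copy);
    return copy;
  }

  /**
   * Domain taken from the supplied list, in the supplied order.
   * Fails fast if the data contains a value the supplied list does not cover.
   */
  static String[] resolve(String[] supplied, String[] observed) {
    Set<String> known = new HashSet<>(Arrays.asList(supplied));
    for (String value : observed) {
      if (!known.contains(value)) {
        throw new IllegalArgumentException("Value not in supplied domain: " + value);
      }
    }
    return supplied.clone();
  }

  public static void main(String[] args) {
    String[] supplied = {"C", "B", "A"};  // order defined by an enum, for example
    String[] observed = {"B", "A"};       // values that happen to appear in one file
    System.out.println(Arrays.toString(sortedObserved(observed)));
    System.out.println(Arrays.toString(resolve(supplied, observed)));
  }
}
```

With the supplied list, the result is the same for every file, regardless of which values a particular file happens to contain.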

@wendycwong
Contributor

Hi @monil1334:

You can use relevel to set the base level of a categorical column.

Say you have a categorical column 2 with levels 'a', 'b', 'c'. However, you want the ordering to be 'c','b','a'. This is what you can do:

data[2] = data[2].relevel('b') # new domain is 'b', 'a', 'c'
data[2] = data[2].relevel('c') # now you have the correct level 'c', 'b', 'a'.

If you like, you can also look at relevel_by_frequency, which orders the domain by how frequently each categorical level appears.
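
For readers working in Java, the effect of the two relevel calls above can be mimicked on a plain String[] domain. This is only a sketch of the semantics stated in the comments (the chosen level moves to the front and the remaining levels keep their relative order). `RelevelSketch.relevel` is a hypothetical helper, not an H2O Java API.

```java
import java.util.Arrays;

/** Illustration of the ordering effect only; not H2O's implementation. */
final class RelevelSketch {

  /** Returns a new domain with {@code base} first and all other levels in their original order. */
  static String[] relevel(String[] domain, String base) {
    int baseIndex = Arrays.asList(domain).indexOf(base);
    if (baseIndex < 0) {
      throw new IllegalArgumentException("Level not in domain: " + base);
    }
    String[] result = new String[domain.length];
    result[0] = base;
    int next = 1;
    for (int i = 0; i < domain.length; i++) {
      if (i != baseIndex) {
        result[next++] = domain[i];
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String[] domain = {"a", "b", "c"};
    String[] step1 = relevel(domain, "b");  // [b, a, c]
    String[] step2 = relevel(step1, "c");   // [c, b, a]
    System.out.println(Arrays.toString(step1));
    System.out.println(Arrays.toString(step2));
  }
}
```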

@monil1334
Author

monil1334 commented Apr 22, 2024

@wendycwong If I understand correctly, this refers to the R and Python APIs. My code base is in Java, so here is a sample setup of what I am trying to do.

public static final int ACTION = 6;
public static final String[] DOMAIN = new String[]{"A", "B", "C", "D", "E", "F"};

PFile file = new PFile("/Users/jaeger/Downloads/part-00000-ca8e6d00-0b91-4f42-8987-f12ac891077e.c0001.csv.gz");
Key[]     fvKeys    = new Key[1];
FileVec[] fvs       = new FileVec[1];
File[]    tempFiles = new File[1];
try (FileVecManager fvm = new FileVecManager(file, false)) {
  fvs      [0] = fvm._fv;
  fvKeys   [0] = fvm._fv._key;
  tempFiles[0] = fvm._tempFile;
}

ParseSetup ps            = ParseSetup.guessSetup(fvKeys, false, 1);
// Bootstrap.setDomain sets the domain values for the columns in the ParseSetup
String[][] updateDomains = Bootstrap.setDomain(ps, file.getName());
Bootstrap.setWFETypes(ps.getColumnTypes(), ps.getColumnNames());
Key<Frame> frameKey = Key.make(file.getName());
Frame frame = ParseDataset.parse(frameKey, fvKeys, true, ps);

So when I do
String[] domain = frame.vec(ACTION).domain();
I get the domain below, because the CSV does not contain all the possible values from the domain. This varies from file to file.
domain = new String[]{"B", "D", "E", "A"};

But what I want is to have the same DOMAIN as declared above so it is consistent.

@wendycwong
Contributor

@monil1334:

You want to be able to set the domain of the categorical columns that you parse, right? If the dataset you ingest does not contain a given categorical value, the resulting domain will be missing that value.

The team is very busy right now. I have been trying to find a workaround for you ahead of a real implementation, but so far I have not succeeded.

I will try my best to check it out.

@wendycwong
Contributor

@monil1334:

I thought about this and decided that it is difficult to do. The main reason we have datasets is to use them to build machine learning models. However, if you add extra domain levels that are not in the dataset and then try to build a model using GLM, there will be a problem. GLM has a coefficient for each categorical level. For extra domain levels that are not found in the dataset, there is no way to determine the coefficient. Hence the Gram matrix will not be invertible, and the model building process will fail.

I think this is the main reason that we did not allow more domain levels than the ones found in the dataset in the first place.

Wendy
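
The singularity argument above can be checked on a toy case. The sketch below is a deliberately simplified setting and not H2O's GLM: it assumes a full one-hot encoding with no intercept and no regularization, and the class and method names are hypothetical. A level that never occurs in the data produces an all-zero column in the design matrix X, so the Gram matrix X^T X has an all-zero row and cannot be inverted.

```java
import java.util.Arrays;

/** Toy illustration; simplified assumptions (no intercept, no regularization). */
final class UnusedLevelGram {

  /** One-hot encodes integer category codes into an n-by-cardinality design matrix. */
  static double[][] oneHot(int[] codes, int cardinality) {
    double[][] x = new double[codes.length][cardinality];
    for (int r = 0; r < codes.length; r++) {
      x[r][codes[r]] = 1.0;
    }
    return x;
  }

  /** Computes the Gram matrix X^T X. */
  static double[][] gram(double[][] x) {
    int p = x[0].length;
    double[][] g = new double[p][p];
    for (double[] row : x) {
      for (int i = 0; i < p; i++) {
        for (int j = 0; j < p; j++) {
          g[i][j] += row[i] * row[j];
        }
      }
    }
    return g;
  }

  /** A square matrix with an all-zero row has determinant zero, so it is singular. */
  static boolean hasZeroRow(double[][] m) {
    for (double[] row : m) {
      boolean allZero = true;
      for (double v : row) {
        if (v != 0.0) allZero = false;
      }
      if (allZero) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    int[] codes = {0, 1, 1, 0};              // level 2 is declared but never observed
    double[][] g = gram(oneHot(codes, 3));
    System.out.println(Arrays.deepToString(g));
    System.out.println("singular: " + hasZeroRow(g));
  }
}
```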

@monil1334
Author

@wendycwong Thank you for trying. I have a temporary workaround, which is a manual process of updating the vecs after the frame is built, for the columns whose domains I want to keep as declared. I was mostly curious because there has been a TODO in the code since 2016 or 2017 to use the domains from the ParseSetup.
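
The thread does not show the workaround itself. As a rough sketch of the core step such a post-parse fix-up would need, the plain-Java example below maps category codes from the parsed domain onto the declared domain. `DomainRemap`, `buildMap`, and `remap` are hypothetical names, the code arrays stand in for column data, and writing the result back into an H2O Vec is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Illustration only; operates on plain arrays rather than H2O Vecs. */
final class DomainRemap {

  /** Returns map[i] = index in {@code declared} of the level {@code parsed[i]}. */
  static int[] buildMap(String[] parsed, String[] declared) {
    Map<String, Integer> declaredIndex = new HashMap<>();
    for (int i = 0; i < declared.length; i++) {
      declaredIndex.put(declared[i], i);
    }
    int[] map = new int[parsed.length];
    for (int i = 0; i < parsed.length; i++) {
      Integer target = declaredIndex.get(parsed[i]);
      if (target == null) {
        throw new IllegalArgumentException("Parsed level not in declared domain: " + parsed[i]);
      }
      map[i] = target;
    }
    return map;
  }

  /** Rewrites codes that index the parsed domain so that they index the declared domain. */
  static int[] remap(int[] codes, int[] map) {
    int[] out = new int[codes.length];
    for (int i = 0; i < codes.length; i++) {
      out[i] = map[codes[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    String[] declared = {"A", "B", "C", "D", "E", "F"};
    String[] parsed = {"B", "D", "E", "A"};       // domain reported for one file
    int[] map = buildMap(parsed, declared);       // [1, 3, 4, 0]
    int[] codes = {0, 1, 2, 3, 0};                // codes against the parsed domain
    System.out.println(Arrays.toString(map));
    System.out.println(Arrays.toString(remap(codes, map)));
  }
}
```

After remapping, every file's codes refer to the same declared domain, including levels that a given file never contains.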
