Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Population variables lose case distinctions #546

Open
riastradh-probcomp opened this issue May 2, 2017 · 9 comments
Open

Population variables lose case distinctions #546

riastradh-probcomp opened this issue May 2, 2017 · 9 comments

Comments

@riastradh-probcomp
Copy link
Contributor

This isn't itself quite a bug, but downstream things sometimes expect the case to match from queries, where sqlite3 preserves the original case of the variable in the underlying table, and from the results of bayesdb_variable_names.

We should fix everything downstream to do case-insensitive comparisons (e.g.: probcomp/iventure#38), but until then, we can work around those mistakes by using the original case in the bayesdb_variable table. Won't help with existing bdb files, but will suppress downstream mistakes for any new bdb files.

@fsaad
Copy link
Collaborator

fsaad commented Jun 2, 2017

Just hit this issue as a bug proper.

        bdb.execute('''
            CREATE POPULATION satellites_p for satellites_t WITH SCHEMA (
                GUESS STATTYPES FOR (*)
            )
        ''')
        bdb.execute('''
            CREATE METAMODEL satellites_m FOR satellites_p
                    WITH BASELINE crosscat (
                OVERRIDE GENERATIVE MODEL FOR
                    anticipated_lifetime
                GIVEN
                    class_of_orbit,
                    perigee_km,
                    apogee_km,
                    period_minutes,
                    date_of_launch,
                    launch_mass_kg,
                    dry_mass_kg
                USING ordinary_least_squares
            )
        ''')
bayeslite.exception.BQLError: Unmodellable variables:
set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

And the failure happens here:
https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1378-L1382
where printing needed and modeled gives:

modeled = set([u'Power_watts', u'longitude_radians_of_geo', u'Contractor',
    u'Period_minutes', u'Country_of_Operator', u'Launch_Vehicle', u'Users',
    u'Apogee_km', u'Operator_Owner', 'anticipated_lifetime',
    u'Anticipated_Lifetime', u'Purpose', u'Dry_Mass_kg', u'Launch_Site',
    u'Perigee_km', u'Date_of_Launch', u'Class_of_Orbit', u'Eccentricity',
    u'Inclination_radians', u'Type_of_Orbit', u'Source_Used_for_Orbital_Data',
    u'Launch_Mass_kg', u'Country_of_Contractor'
])

needed = set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

Also notice duplicate of anticipated_lifetime and Anticipated_lifetime (the regression variable). Not sure why this bug has not manifested before.

@fsaad
Copy link
Collaborator

fsaad commented Jun 2, 2017

Actually, what broke this behavior was @riastradh-probcomp's fix d7c5b3f to this issue, which appears to have caused the regression.

Update: git revert d7c5b3f indeed fixed the issue with Latent variables reported above. Surprised why nothing in the test suite failed after the commit in question (probably because all cases were lower uniformly).

@fsaad
Copy link
Collaborator

fsaad commented Jun 5, 2017

@riastradh-probcomp I'm going to revert this commit --- it has completely broken the create schema logic in https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1043, and refactoring that complex state machine to is likely to be more effort and introduce more bugs than updating downstream clients to either use the variable name from the base table and/or do case insensitive comparisons.

@riastradh-probcomp
Copy link
Contributor Author

This state machine already appears to be problematic from a cursory glance. But it doesn't look like it needs refactoring -- it looks like it just needs a few casefolds inserted judiciously: replace clause.var(s) by casefold(clause.var)(s), and casefold the output of bayesdb_variable_names.

@riastradh-probcomp
Copy link
Contributor Author

Here's an example I expect to fail with the state machine as is:

create table t(x, y);
create population p for t;
create metamodel m for p with baseline crosscat(
    latent Z categorical;
    override model for y, z given x using piecewise
)

@jar600
Copy link
Contributor

jar600 commented Aug 29, 2017

I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?

@riastradh-probcomp
Copy link
Contributor Author

riastradh-probcomp commented Aug 29, 2017 via email

@jar600
Copy link
Contributor

jar600 commented Aug 29, 2017

I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.

Is = case insensitive as well?

@riastradh-probcomp
Copy link
Contributor Author

riastradh-probcomp commented Aug 29, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants