Vectorize operations for propensity score matching #1179

rahulbshrestha · 2024-05-12T21:49:02Z

This PR addresses this issue by introducing vectorized operations instead of the existing for-loops. This should speed up operations for large datasets.
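As a rough illustration of the kind of change described above, here is a minimal sketch that computes an ATT over matched pairs once with a row-by-row loop and once with array operations. The toy data, the column name `y`, and the helper names `att_loop` and `att_vectorized` are hypothetical and are not taken from the PR diff; `indices` is assumed to have shape `(n_treated, 1)`, as a 1-nearest-neighbour query would return.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name "y" and the match indices are made up.
treated = pd.DataFrame({"y": [5.0, 7.0, 9.0]})
control = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0]})
# indices[i] holds the row position in `control` matched to treated row i.
indices = np.array([[3], [0], [2]])


def att_loop(treated, control, indices, outcome="y"):
    """Row-by-row version: one pair of .iloc lookups per treated unit."""
    total = 0.0
    for i in range(treated.shape[0]):
        treated_outcome = treated.iloc[i][outcome]
        control_outcome = control.iloc[indices[i]][outcome].item()
        total += treated_outcome - control_outcome
    return total / treated.shape[0]


def att_vectorized(treated, control, indices, outcome="y"):
    """Array version: select all matched control rows in one .iloc call."""
    treated_outcomes = treated[outcome].to_numpy()
    control_outcomes = control.iloc[indices.flatten()][outcome].to_numpy()
    # Plain arrays subtract by position, not by index label.
    return (treated_outcomes - control_outcomes).mean()


print("ATT (loop):      ", att_loop(treated, control, indices))
print("ATT (vectorized):", att_vectorized(treated, control, indices))
```

Converting both columns to arrays keeps the subtraction positional. Subtracting two pandas Series directly would align them on index labels, which would pair the wrong rows here.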

This PR is a work in progress, and the remaining tasks include:

Add test cases to verify matching still works properly
Further vectorize one of the arrays
Cleanup

amit-sharma · 2024-05-13T16:46:28Z

Thanks for starting this, @rahulbshrestha. Let us know once the PR is ready for review.

rahulbshrestha · 2024-05-14T12:14:26Z

I ran some tests to check if the values of att and atc are the same before and after changes made in this PR:


### PREVIOUS IMPLEMENTATION
        att = 0
        numtreatedunits = treated.shape[0]
        treated_outcomes_old = []
        control_outcomes_old = []

        for i in range(numtreatedunits):

            treated_outcome = treated.iloc[i][self._target_estimand.outcome_variable[0]].item()
            control_outcome = control.iloc[indices[i]][self._target_estimand.outcome_variable[0]].item()
            treated_outcomes_old.append(treated_outcome)
            control_outcomes_old.append(control_outcome)
            att += treated_outcome - control_outcome

        att /= numtreatedunits


        print('Checking values of ATT: ')
        print('ATT (before): ', att)

        ### NEW IMPLEMENTATION (vectorized)
        outcome_variable = self._target_estimand.outcome_variable[0]
        treated_outcomes = treated[outcome_variable]
        # Convert to a list so the subtraction below is positional, not aligned on index labels
        control_outcomes = list(control.iloc[indices.flatten()][outcome_variable])

        att = (treated_outcomes - control_outcomes).mean()

        print('ATT (after): ', att)
        # Compare as plain lists so each check prints a single bool
        print('Treated outcomes ', treated_outcomes_old == list(treated_outcomes))
        print('Control outcomes', control_outcomes_old == control_outcomes)

and the results when running on some test data:

Checking values of ATT: 
ATT (before):  10.923190922091228
ATT (after):  10.923190922091242
Treated outcomes  True
Control outcomes True
Checking values of ATC: 
ATC (before):  10.506587873468016
ATC (after):  10.506587873468012
Treated outcomes  True
Control outcomes True

Both lists (treated outcomes and control outcomes) are the same before and after the changes I made. The ATT and ATC differ only in the last few digits after averaging (a difference on the order of 1e-14 in the example above), which is probably a floating-point rounding error. Is this a problem, @amit-sharma?
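A difference in the last few digits is what one would expect if the loop and the vectorized mean add the same terms in a different order, since floating-point addition is not associative. The sketch below is only an illustration of that effect and of a tolerance-based comparison using the standard library. The ATT and ATC values are copied from the output above, and the `rel_tol` value is an arbitrary choice, not something specified in the PR.

```python
import math

# Floating-point addition is not associative: grouping changes the last digit.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)              # exact comparison
print(math.isclose(left_to_right, right_to_left))  # tolerance-based comparison

# Values reported in this thread, before and after the change.
att_before, att_after = 10.923190922091228, 10.923190922091242
atc_before, atc_after = 10.506587873468016, 10.506587873468012

# rel_tol=1e-12 is an illustrative tolerance, far looser than the observed gap.
att_matches = math.isclose(att_before, att_after, rel_tol=1e-12)
atc_matches = math.isclose(atc_before, atc_after, rel_tol=1e-12)
print("ATT matches within tolerance:", att_matches)
print("ATC matches within tolerance:", atc_matches)
```

For test cases, comparing with a tolerance (for example `math.isclose` or `numpy.isclose`) rather than `==` avoids failures caused purely by summation order.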

rahulbshrestha marked this pull request as draft May 12, 2024 21:49

add vector operations

2a0a358

added todo comment Signed-off-by: Rahul Shrestha <rahulshrestha0101@gmail.com> formatting fix Signed-off-by: Rahul Shrestha <rahulshrestha0101@gmail.com> bug fix with string name Signed-off-by: rahulbshrestha <rahulshrestha0101@gmail.com>

rahulbshrestha force-pushed the vectorize branch from 4c635cd to 2a0a358 Compare May 14, 2024 08:56

vectorize remaining list

635c2f7

Signed-off-by: rahulbshrestha <rahulshrestha0101@gmail.com>

rahulbshrestha marked this pull request as ready for review May 14, 2024 12:17
