Correlation speedup #123

Hilly12 · 2021-09-09T13:21:58Z

Vectorise fairlens.unified.correlation_matrix
Add stress tests for correlation matrix generation
Concatenate columns before dropping nulls in the correlation matrix helper
fairlens.plot.heatmap.two_column_heatmap is now fairlens.plot.correlation.heatmap

Hilly12 · 2021-09-09T15:56:36Z

src/fairlens/metrics/unified.py

+        if df[col].dtype.kind == "O":
+            df[col] = pd.factorize(df[col])[0]
+
+    df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)}))



The idea here: the helper cannot know which column corresponds to which distribution type, so we append each column's index in the data frame as a final row. In the helper, we use that value to look up the precomputed distribution type, then drop the row. There might be a better way of doing this.
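A minimal sketch of the index-row trick described above, not the PR's actual helper. The `types` list and the `metric` function are hypothetical stand-ins, and `pd.concat` is used in place of `DataFrame.append`, which was removed in pandas 2.0. It relies on `DataFrame.corr` accepting a callable that receives two 1-D arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23.0, 35.0, 47.0, 52.0, 61.0],
    "income": [20.0, 31.0, 45.0, 50.0, 58.0],
    "group": [0.0, 1.0, 0.0, 1.0, 0.0],
})

# Hypothetical precomputed distribution types, one per column position.
types = ["continuous", "continuous", "categorical"]

# Append each column's positional index as the final row.
index_row = pd.DataFrame({col: [float(i)] for i, col in enumerate(df.columns)})
tagged = pd.concat([df, index_row], ignore_index=True)


def metric(a: np.ndarray, b: np.ndarray) -> float:
    # The last element of each array is the column index added above.
    type_a, type_b = types[int(a[-1])], types[int(b[-1])]
    # Drop the index row before computing anything.
    a, b = a[:-1], b[:-1]
    if type_a == type_b == "continuous":
        return float(np.corrcoef(a, b)[0, 1])
    return 0.0  # placeholder for the other type combinations


matrix = tagged.corr(method=metric)
```

The index row survives null handling because it never contains a null itself, so it is still the last element of whatever arrays the callable receives.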


Hilly12 · 2021-09-14T09:08:23Z

src/fairlens/metrics/unified.py

-        columns_x (Optional[List[str]]):
-            The column names that determine the rows of the matrix.
-        columns_y (Optional[List[str]]):
-            The column names that determine the columns of the matrix.

    Returns:
        pd.DataFrame:
            The correlation matrix to be used in heatmap generation.
    """



The correlation matrix is generated using df.corr(). Since df.corr() only works on numerical data, we need to encode all the columns first. The issue is that we use infer_distr_type() to decide which metric is suitable, and it gives different results on the encoded numerical data. The only way to resolve this is to infer the types beforehand (which is probably more efficient anyway). The problem then becomes building a binary function (a, b) -> float that already knows the types of a and b.

Just to add, using df.corr() provides a major performance improvement.
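To illustrate why the types have to be inferred before encoding, here is a small sketch. `infer_kind` is a hypothetical, much simplified stand-in for the library's type inference (it only looks at the dtype); the encoding loop mirrors the `pd.factorize` step in the diff.

```python
import pandas as pd


def infer_kind(sr: pd.Series) -> str:
    # Hypothetical stand-in: object columns are treated as categorical.
    return "categorical" if sr.dtype.kind == "O" else "continuous"


df = pd.DataFrame({
    "colour": pd.Series(["red", "blue", "red", "green"], dtype=object),
    "height": [1.5, 1.7, 1.6, 1.8],
})

# Infer the types on the raw data, before any encoding.
types_before = {col: infer_kind(df[col]) for col in df.columns}

# Encode object columns as integer codes so df.corr() can handle them.
encoded = df.copy()
for col in encoded.columns:
    if encoded[col].dtype.kind == "O":
        encoded[col] = pd.factorize(encoded[col])[0]

# The same inference on the encoded frame no longer sees a categorical column.
types_after = {col: infer_kind(encoded[col]) for col in encoded.columns}
```

Here `types_before` marks `colour` as categorical while `types_after` marks it as continuous, which is the mismatch described above.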

Hilly12 · 2021-09-14T09:13:45Z

src/fairlens/metrics/unified.py

+    sr_a = pd.Series(a[:-1])
+    sr_b = pd.Series(b[:-1])
+
+    df = pd.DataFrame({"a": sr_a, "b": sr_b}).dropna().reset_index()



The two columns need to be joined first so that any row with a null in either column is dropped from both before the correlation metric is applied.
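A small sketch of the difference, using made-up data. Dropping nulls per column leaves the two series misaligned, while joining first keeps only the rows that are complete in both. Note that `reset_index(drop=True)` is used here so no extra `index` column is added; that is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

sr_a = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])
sr_b = pd.Series([2.0, 4.0, np.nan, 8.0, 10.0])

# Dropping nulls separately keeps different rows in each series.
separate_a = sr_a.dropna()  # rows 0, 2, 3, 4
separate_b = sr_b.dropna()  # rows 0, 1, 3, 4

# Joining first drops every row that has a null in either column.
joined = pd.DataFrame({"a": sr_a, "b": sr_b}).dropna().reset_index(drop=True)
```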

Hilly12 · 2021-09-14T09:15:33Z

tests/test_correlation.py

@@ -133,3 +144,38 @@ def test_cn_unequal_series_corr():
    sr_b = pd.Series([100, 200, 99, 101, 201, 199, 299, 300, 301, 500, 501, 505, 10, 12, 1001, 1050])

    assert distance_cn_correlation(sr_a, sr_b) > 0.7
+
+


Stress test to make sure correlation matrix generates properly without the encoding step.

src/fairlens/plot/correlation.py

simonhkswan · 2021-09-14T05:13:32Z

src/fairlens/metrics/unified.py

+        if df[col].dtype.kind == "O":
+            df[col] = pd.factorize(df[col])[0]
+
+    df = df.append(pd.DataFrame({col: [i] for i, col in enumerate(df.columns)}))



I think another way to do this would be to revert to using utils.infer_distribution_type, but add a functools.lru_cache to it so that we avoid repeated calculations.
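One way the caching suggestion could look, as a sketch under assumptions rather than what the PR ended up doing. A `pd.Series` is not hashable, so `functools.lru_cache` cannot key on the series itself; this version caches by column name instead. `infer_distribution_type` below is a hypothetical stand-in for the real utility, and `make_cached_inferrer` is a name invented for the example.

```python
from functools import lru_cache

import pandas as pd

calls = []  # records each real inference, to show the cache working


def infer_distribution_type(sr: pd.Series) -> str:
    # Hypothetical stand-in for utils.infer_distribution_type.
    calls.append(sr.name)
    return "categorical" if sr.dtype.kind == "O" else "continuous"


def make_cached_inferrer(df: pd.DataFrame):
    # Cache on the column name, since the Series itself cannot be hashed.
    @lru_cache(maxsize=None)
    def infer(col: str) -> str:
        return infer_distribution_type(df[col])

    return infer


df = pd.DataFrame({
    "colour": pd.Series(["red", "blue", "red"], dtype=object),
    "height": [1.5, 1.7, 1.6],
    "weight": [60.0, 72.0, 65.0],
})

infer = make_cached_inferrer(df)

# Every pair of columns asks for both types, as a pairwise metric would.
for col_a in df.columns:
    for col_b in df.columns:
        infer(col_a), infer(col_b)
```

With three columns the loop asks for a type 18 times, but the underlying inference runs only once per column.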

codecov · 2021-09-16T17:01:46Z

Codecov Report

Merging #123 (12769bc) into main (432b120) will increase coverage by 4.99%.
The diff coverage is 89.39%.

@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
+ Coverage   74.16%   79.16%   +4.99%     
==========================================
  Files          15       15              
  Lines         871      864       -7     
  Branches      186      184       -2     
==========================================
+ Hits          646      684      +38     
+ Misses        181      135      -46     
- Partials       44       45       +1

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `79.16% <89.39%> (+4.99%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/fairlens/metrics/__init__.py | `100.00% <ø> (ø)` | |
| src/fairlens/sensitive/correlation.py | `82.22% <44.44%> (-2.23%)` | ⬇️ |
| src/fairlens/metrics/correlation.py | `60.86% <90.47%> (+12.48%)` | ⬆️ |
| src/fairlens/metrics/unified.py | `83.33% <100.00%> (+36.39%)` | ⬆️ |
| src/fairlens/plot/__init__.py | `100.00% <100.00%> (ø)` | |
| src/fairlens/plot/correlation.py | `100.00% <100.00%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 432b120...12769bc.

sonarcloud · 2021-09-17T13:59:56Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

92.2% Coverage
0.0% Duplication

Hilly12 added 7 commits September 6, 2021 17:52

speed up correlation matrix generation

6232b51

add kendall tau, spearman

4ae6e9f

update cramers v

9ff13c4

revert cramers v

a187231

Merge branch 'main' into correlation-speedup

0292e00

remove kendall tau, spearman rank

b6f4edc

remove references to kendall tau, spearman

45c7caf

Hilly12 commented Sep 9, 2021

add stress tests for correlation matrix, join columns before dropping nulls

b8fb0a3

Hilly12 marked this pull request as ready for review September 13, 2021 13:39

Hilly12 requested review from simonhkswan, jamied157 and tonbadal September 13, 2021 13:44

Hilly12 commented Sep 14, 2021

simonhkswan reviewed Sep 15, 2021

simonhkswan assigned Hilly12 Sep 15, 2021

simonhkswan added the type:enhancement New feature or request label Sep 15, 2021

bogdansurdu and others added 6 commits September 16, 2021 11:16

update first proxy detection example

a00f237

remove kwargs from heatmap

2eb380d

Merge branch 'main' into correlation-speedup

4a9c094

use infer_distr_type instead of old check, fix order of type checks in proxies

23dbbcb

extend proxy tests, add correct results

4f48f5e

Merge branch 'correlation-speedup' of https://github.com/synthesized-io/fairlens into correlation-speedup

06528fd

Merge branch 'main' into correlation-speedup

12769bc

simonhkswan removed request for jamied157 and tonbadal April 4, 2024 11:32
