Add LSA Component #1022

eccabay · 2020-08-05T14:34:56Z

Should fix #940 and close #980 by moving the implementation of LSA into evalml, instead of making the changes within nlp_primitives. The new LSA component can function as an independent component but is also called within the TextFeaturizer component to maintain its previous behavior.
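For readers unfamiliar with LSA, here is a minimal sketch of what a standalone component of this kind might do. It assumes the common formulation (TF-IDF followed by a two-component truncated SVD) built on scikit-learn. The function name `lsa_features` is hypothetical and this is not the code merged in this PR; only the `LSA(col)[0]` / `LSA(col)[1]` output naming is taken from the diff discussed below.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def lsa_features(X, text_columns, random_state=0):
    """Replace each text column with two LSA features (hypothetical helper)."""
    X_t = X.copy()
    for col in text_columns:
        # Assumption: LSA = TF-IDF weighting followed by a rank-2 truncated SVD.
        pipeline = make_pipeline(
            TfidfVectorizer(),
            TruncatedSVD(n_components=2, random_state=random_state),
        )
        transformed = pipeline.fit_transform(X[col])
        X_t = X_t.drop(columns=[col])
        # Assign plain arrays so values are placed by position, not by index label.
        X_t["LSA({})[0]".format(col)] = transformed[:, 0]
        X_t["LSA({})[1]".format(col)] = transformed[:, 1]
    return X_t


X = pd.DataFrame({
    "col_1": [
        "the quick brown fox",
        "jumped over the lazy dog",
        "a quick brown dog",
        "the lazy fox slept",
    ],
    "n": [1, 2, 3, 4],
})
X_t = lsa_features(X, ["col_1"])
```

Non-text columns pass through unchanged, and each text column is replaced by exactly two numeric columns.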

codecov · 2020-08-05T14:35:47Z

Codecov Report

Merging #1022 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main    #1022    +/-   ##
========================================
  Coverage   99.90%   99.91%            
========================================
  Files         181      183     +2     
  Lines        9998    10143   +145     
========================================
+ Hits         9989    10134   +145     
  Misses          9        9

Impacted Files	Coverage Δ
evalml/pipelines/components/__init__.py	`100.00% <ø> (ø)`
...alml/pipelines/components/transformers/__init__.py	`100.00% <100.00%> (ø)`
.../components/transformers/preprocessing/__init__.py	`100.00% <100.00%> (ø)`
...lines/components/transformers/preprocessing/lsa.py	`100.00% <100.00%> (ø)`
...ents/transformers/preprocessing/text_featurizer.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_lsa.py	`100.00% <100.00%> (ø)`
...alml/tests/component_tests/test_text_featurizer.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_utils.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 49556bb...14c50ac.

eccabay · 2020-08-06T13:57:53Z

evalml/pipelines/components/transformers/preprocessing/text_featurizer.py

-        if len(text_columns) == 0:
-            warnings.warn("No text columns were given to TextFeaturizer, component will have no effect", RuntimeWarning)


Moved this warning from __init__ to fit to temporarily resolve #1017
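A minimal sketch of the change described in this comment, assuming a simplified stand-in class (`TextFeaturizerSketch` is hypothetical, not the evalml component): the check runs in `fit` rather than in `__init__`, so merely constructing the component no longer emits the warning.

```python
import warnings


class TextFeaturizerSketch:
    """Hypothetical stand-in used only to illustrate where the warning is raised."""

    def __init__(self, text_columns=None):
        # No warning here: constructing the component stays silent.
        self.text_columns = list(text_columns or [])

    def fit(self, X, y=None):
        # The check moved from __init__ to fit, as described in the comment above.
        if len(self.text_columns) == 0:
            warnings.warn(
                "No text columns were given to TextFeaturizer, component will have no effect",
                RuntimeWarning,
            )
        return self
```

A later commit in this PR ("Raise warnings using logger instead of warnings package") replaces the `warnings` call; the sketch reflects only the state described at this point in the review.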

evalml/pipelines/components/transformers/preprocessing/lsa.py

angela97lin

Looking good! Left a few non-blocking nit-picky comments :)

evalml/pipelines/components/transformers/preprocessing/lsa.py

angela97lin · 2020-08-06T15:38:02Z

evalml/tests/component_tests/test_lsa.py

+                              'LSA(col_2)[1]'])
+    X_t = lsa.transform(X)
+    assert set(X_t.columns) == expected_col_names
+    assert len(X_t.columns) == 4


Nit-pick: I feel like this line is covered by set(X_t.columns) == expected_col_names so maybe not necessary? (same with other tests!)

I thought so as well at first, but this line actually helped me catch a bug yesterday! Since we take the set of X_t.columns, any columns with duplicate names will not cause that line to fail -- checking the number of columns explicitly prevents that from slipping through the cracks.

Ooo huh, I didn't even know duplicate names were allowed but makes sense! 😊

Yeah, @angela97lin , you can do fancy stuff in pandas.

df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=['a', 'a'])

produces a df with two columns which happen to have the same name, although they occupy different positions in the column index.
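A small sketch of why the set comparison alone misses this case, using only pandas and NumPy. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=["a", "a"])

expected_col_names = {"a"}

# The set collapses the two identically named columns into one element,
# so this comparison succeeds even though there is a duplicate.
set_check_passes = set(df.columns) == expected_col_names

# The explicit count still sees both columns.
n_columns = len(df.columns)

# pandas can also report duplicated labels directly.
has_duplicates = bool(df.columns.duplicated().any())
```

This is the scenario the `len(X_t.columns)` assertion guards against.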

For these tests, let's do a direct comparison of the column names:

expected_col_names = np.array(...)  # expected str values
np.testing.assert_equal(X_t.columns, expected_col_names)

This has the added benefit of covering the column name order.

Unfortunately, the column order as outputted by featuretools changes, and as far as I can tell there's no option to fix it. @dsherry would you rather I enforce a column order by sorting, say, alphabetically, or leave this test as is?

Oh, that's good to know. Your call.

I'm going to leave this as is, since enforcing an order makes the test bulkier.
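For reference, a sketch of the two alternatives discussed in this thread: a direct ordered comparison, and an order-independent comparison on sorted names. Both catch duplicated column names, unlike a set comparison. The helper `assert_columns_equal` is hypothetical and is not part of the PR.

```python
import numpy as np
import pandas as pd


def assert_columns_equal(X_t, expected, ordered=True):
    """Compare column names as sequences so duplicates are not collapsed."""
    actual = list(X_t.columns)
    expected = list(expected)
    if not ordered:
        # Sorting removes the dependence on output order but keeps duplicates.
        actual, expected = sorted(actual), sorted(expected)
    np.testing.assert_array_equal(
        np.array(actual, dtype=object), np.array(expected, dtype=object)
    )


expected_col_names = ["LSA(col_1)[0]", "LSA(col_1)[1]", "LSA(col_2)[0]", "LSA(col_2)[1]"]

# Same names as expected, but in a different order.
X_shuffled = pd.DataFrame(
    np.zeros((1, 4)),
    columns=["LSA(col_2)[0]", "LSA(col_2)[1]", "LSA(col_1)[0]", "LSA(col_1)[1]"],
)

# Passes: order is ignored, the multiset of names matches.
assert_columns_equal(X_shuffled, expected_col_names, ordered=False)
```

The sorted variant costs one extra line per test, which is the bulk the author chose to avoid.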

evalml/pipelines/components/transformers/preprocessing/lsa.py

dsherry

@eccabay looks pretty close! I left a bunch of questions and some suggestions.

A few of my questions and comments have to do with the conversion of the feature names to / from str, and the way we're indexing into the input DFs. Unless there's a detail I'm missing so far, I don't think we need any conversion. It's good that we're validating that the provided feature names exist in the input DFs, but I think we can assume whatever format the DF column index is in, the feature names will be provided in that format, str, int or whatever else. LMK if you want to talk this through rather than responding via text.

I also left a comment about the warnings which we should resolve, ideally before merging.

More unit tests to add:

Input DF has two features with the same name
Input DF has non-str column names, i.e. df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])
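A sketch of the two inputs requested above, together with one way to validate feature names without any str conversion. The helpers `missing_columns` and `duplicated_columns` are hypothetical and only illustrate pandas behavior; how the LSA component should respond to these inputs is left to the PR.

```python
import numpy as np
import pandas as pd


def missing_columns(X, text_columns):
    """Return the requested names absent from X, compared in their given type."""
    return [col for col in text_columns if col not in X.columns]


def duplicated_columns(X):
    """Return each column label that occurs more than once in X."""
    return list(X.columns[X.columns.duplicated()].unique())


# Case 1: two features with the same name.
X_dup = pd.DataFrame(data=[["some text", "more text"]], columns=["a", "a"])

# Case 2: non-str column names.
X_int = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])

# Selecting a duplicated label returns a DataFrame rather than a Series.
dup_selection_is_frame = isinstance(X_dup["a"], pd.DataFrame)
```

With integer labels, `missing_columns(X_int, [42])` is empty while `missing_columns(X_int, ["42"])` reports `"42"`, which is why names are best passed in the same type as the column index.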

docs/source/release_notes.rst

evalml/pipelines/components/transformers/preprocessing/lsa.py

evalml/tests/component_tests/test_text_featurizer.py

dsherry · 2020-08-07T16:17:40Z

evalml/pipelines/components/transformers/preprocessing/lsa.py

+                X_t = X_t.drop(labels=int(col), axis=1)
+
+            X_t['LSA({})[0]'.format(col)] = pd.Series(transformed[:, 0])
+            X_t['LSA({})[1]'.format(col)] = pd.Series(transformed[:, 1])


Not blocking: what do you think of doing this for the naming: LSA(my_feature, 0) and LSA(my_feature, 1) ?

I like it! I only kept this formatting to mirror what the primitives' generated column names look like, but I can change this if you'd prefer.
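The two naming schemes under discussion, side by side. `lsa_column_names` and its `style` argument are hypothetical, introduced only to make the comparison concrete.

```python
def lsa_column_names(col, style="brackets"):
    """Build the two output column names for one input column."""
    if style == "brackets":
        # Current format in the diff, mirroring the primitives' generated names.
        return ["LSA({})[{}]".format(col, i) for i in range(2)]
    if style == "args":
        # Format suggested in the review comment above.
        return ["LSA({}, {})".format(col, i) for i in range(2)]
    raise ValueError("unknown style: {}".format(style))


current_names = lsa_column_names("my_feature")
suggested_names = lsa_column_names("my_feature", style="args")
```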

evalml/tests/component_tests/test_lsa.py

dsherry

Looks good! I left one comment about addressing the str-to-int column name conversion and the transform try/except in a separate PR. I also didn't see coverage for the two cases I mentioned previously:

Input DF has two features with the same name
Input DF has non-str column names, i.e. df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])

eccabay added 3 commits August 4, 2020 16:04

Add LSA component

0ba5627

Integrate LSA component into TextFeaturizer

da83205

Standardize LSA transform output

19c355e

eccabay and others added 3 commits August 5, 2020 10:40

Update release notes

6d1b5c8

Merge branch 'main' into 980_940_lsa_component

ed44a33

Fix outdated docstring

c9fe512

eccabay marked this pull request as ready for review August 5, 2020 15:35

eccabay requested review from dsherry, angela97lin, freddyaboulton and jeremyliweishih and removed request for freddyaboulton August 5, 2020 15:35

auto-assign bot assigned eccabay Aug 5, 2020

eccabay requested a review from freddyaboulton August 5, 2020 16:57

Remove runtime warnings from init functions

9bf7427

eccabay commented Aug 6, 2020

jeremyliweishih reviewed Aug 6, 2020

evalml/pipelines/components/transformers/preprocessing/lsa.py

angela97lin reviewed Aug 6, 2020

evalml/pipelines/components/transformers/preprocessing/lsa.py (outdated)

angela97lin reviewed Aug 6, 2020

evalml/pipelines/components/transformers/preprocessing/lsa.py

angela97lin reviewed Aug 6, 2020

evalml/pipelines/components/transformers/preprocessing/lsa.py (outdated)

eccabay and others added 2 commits August 6, 2020 15:34

Clean up unnecessary code

cb5617a

Merge branch 'main' into 980_940_lsa_component

e360a1c

dsherry suggested changes Aug 7, 2020

eccabay added 2 commits August 7, 2020 13:45

Address PR comments

a1d9a98

Raise warnings using logger instead of warnings package

e383731

eccabay requested a review from dsherry August 10, 2020 16:23

dsherry approved these changes Aug 10, 2020

PR comments

0ad6b0a

Merge branch 'main' into 980_940_lsa_component

14c50ac

eccabay merged commit dd784a2 into main Aug 11, 2020

jeremyliweishih mentioned this pull request Aug 12, 2020

Text featurizer warning on evalml import #1017

Closed

dsherry mentioned this pull request Aug 25, 2020

Release v0.13.1 #1101

Merged

eccabay deleted the 980_940_lsa_component branch November 2, 2020 16:33
