feat: Add LumpFeatures transfomer #1941

dylanw-oss · 2023-04-26T02:19:00Z

Related Issues/PRs

What changes are proposed in this pull request?

Adds a transformer that handles high-cardinality, skewed categorical features before other featurization steps.

How is this patch tested?

Unit tests.

Does this PR add a new feature? If so, have you added samples on website?

I will add documentation in the next commit (I'd like to confirm the feature makes sense before taking that step).

Find the corresponding markdown file for your new feature in the website/docs/documentation folder.
Make sure you choose the correct class (estimators/transformers) and namespace.
Follow the pattern in the markdown file and add another section for your new API, including pyspark, scala (and potentially .NET) samples.
Make sure the DocTable points to the correct API link.
Navigate to the website folder, and run yarn run start to make sure the website renders correctly.
Don't forget to add the required marker before each Python code block to enable auto-tests for Python samples.
Make sure the WebsiteSamplesTests job passes in the pipeline.

github-actions · 2023-04-26T02:19:14Z

Hey @dylanw-oss 👋!
Thank you so much for contributing to our repository 🙌.
Someone from the SynapseML team will review this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

fix: Fix LightGBM crashes with empty partitions
feat: Make HTTP on Spark back-offs configurable
docs: Update Spark Serving usage
build: Add codecov support
perf: improve LightGBM memory usage
refactor: make python code generation rely on classes
style: Remove nulls from CNTKModel
test: Add test coverage for CNTKModel
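
The prefix rule can be checked locally with a short script. This is a minimal sketch, assuming the accepted prefixes are exactly the ones in the examples above; the name `has_semantic_prefix` and the regular expression are hypothetical and are not the repository's actual check.

```python
import re

# Prefixes copied from the example list above; the real check may accept others.
SEMANTIC_PREFIXES = ("fix", "feat", "docs", "build", "perf", "refactor", "style", "test")

# A title must start with "<prefix>: " followed by at least one non-space character.
_PATTERN = re.compile(r"^(?:%s): \S" % "|".join(SEMANTIC_PREFIXES))


def has_semantic_prefix(title: str) -> bool:
    """Return True if the title starts with a lowercase semantic prefix and a colon."""
    return _PATTERN.match(title) is not None
```

Under these assumptions, a title such as "feat, Add ..." fails because of the comma, while "feat: Add ..." passes.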

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

dylanw-oss · 2023-04-28T21:29:37Z

If there are no objections to this feature, I'll add the documentation. @sarahshy @memoryz

github-actions

Summary by GPT-4

The LumpFeatures transformer is a custom transformer that takes a DataFrame and a list of lumping rules as input. It returns a DataFrame with the original columns, except that each column named in the lumping rules is indexed and lumped to its top k levels. This transformer can be used to handle high-cardinality, skewed categorical features before encoding.

In the given code, the LumpFeatures class extends Transformer and implements the following methods:

transform: This method takes an input dataset and applies the lumping rules to it. It first creates a pipeline with StringIndexer transformers for each column specified in the lumping rules. Then, it fits and transforms the input dataset using this pipeline. Finally, it keeps only the top k levels for each categorical column according to the lumping rules.
transformSchema: This method returns the schema of the output DataFrame after applying the transformation.
copy: This method creates a copy of this instance with extra parameters.
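
The lumping idea behind transform can be illustrated in plain Python. This is only a sketch under stated assumptions: "top k" is taken to mean the k most frequent levels, and every remaining level is collapsed into a single placeholder. The function name `lump_top_k` and the "other" label are hypothetical; the PR itself works on Spark DataFrames with StringIndexer, and its exact output encoding is not shown in this thread.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def lump_top_k(values: Sequence[Hashable], k: int, other: Hashable = "other") -> List[Hashable]:
    """Keep the k most frequent levels and replace every other level with `other`."""
    if k < 0:
        raise ValueError("k must be non-negative")
    counts = Counter(values)
    # Rank levels by descending frequency; ties are broken by string form
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda level: (-counts[level], str(level)))
    keep = set(ranked[:k])
    return [v if v in keep else other for v in values]


# Hypothetical example column: two frequent levels and two rare ones.
colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
lumped = lump_top_k(colors, k=2)
```

With k=2, the two rare levels ("green" and "teal") are merged into one bucket, which keeps the encoded feature space small.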

The test suite LumpFeaturesSuite tests this transformer's basic functionality by creating an input DataFrame with categorical columns, applying lumping rules using an instance of LumpFeatures, and comparing the output DataFrame with an expected result.

In summary, this custom transformer helps in handling high cardinality skewed categorical features by indexing and lumping them according to specified rules before encoding them.

Suggestions

The changes in this PR look good and no suggestions are needed.

Add LumpFeatures transfomer

fb86806

dylanw-oss requested a review from mhamilton723 as a code owner April 26, 2023 02:19

dylanw-oss changed the title ~~Feat, Add LumpFeatures transfomer~~ feat, Add LumpFeatures transfomer Apr 26, 2023

dciborow changed the title ~~feat, Add LumpFeatures transfomer~~ feat: Add LumpFeatures transfomer Apr 27, 2023

dylanw-oss requested a review from memoryz April 28, 2023 19:51

Merge branch 'master' into lump

fa1b9a5

github-actions bot reviewed May 18, 2023


Merge branch 'master' into lump

76da69c
