feat: Add LumpFeatures transfomer #1941

Open
wants to merge 3 commits into
base: master
from

Conversation

dylanw-oss
Contributor

Related Issues/PRs

#1891

What changes are proposed in this pull request?

A transformer that handles high-cardinality, skewed categorical features before other featurization steps are applied.

How is this patch tested?

unit test

Does this PR add a new feature? If so, have you added samples on website?

I will add documentation in the next commit (I'd like to make sure the feature makes sense before taking that step).

  1. Find the corresponding markdown file for your new feature in website/docs/documentation folder.
    Make sure you choose the correct class estimators/transformers and namespace.
  2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to correct API link.
  4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
  6. Make sure the WebsiteSamplesTests job passes in the pipeline.

@github-actions

Hey @dylanw-oss 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@dylanw-oss dylanw-oss changed the title Feat, Add LumpFeatures transfomer feat, Add LumpFeatures transfomer Apr 26, 2023
@dciborow dciborow changed the title feat, Add LumpFeatures transfomer feat: Add LumpFeatures transfomer Apr 27, 2023
@dylanw-oss dylanw-oss requested a review from memoryz April 28, 2023 19:51
@dylanw-oss
Contributor Author

If there is no objection to this feature, I'll add the documentation. @sarahshy @memoryz

@github-actions github-actions bot left a comment

Summary by GPT-4

The LumpFeatures transformer is a custom transformer that takes a DataFrame and a list of lumping rules as input. It returns a DataFrame with the original columns, except that each column named in the lumping rules is indexed and lumped to its top k levels. This transformer can be used to handle high-cardinality, skewed categorical features before encoding.

In the given code, the LumpFeatures class extends Transformer and implements the following methods:

  1. transform: This method takes an input dataset and applies the lumping rules to it. It first creates a pipeline with StringIndexer transformers for each column specified in the lumping rules. Then, it fits and transforms the input dataset using this pipeline. Finally, it keeps only the top k levels for each categorical column according to the lumping rules.

  2. transformSchema: This method returns the schema of the output DataFrame after applying the transformation.

  3. copy: This method creates a copy of this instance with extra parameters.
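
To make the indexing and lumping step described in item 1 concrete, here is a minimal pure-Python sketch of the idea. It is not the PR's Scala implementation and does not use Spark. The names `lump_column` and `lump_features`, the `{column: k}` rule format, and the choice to assign index `k` to every level outside the top k are all assumptions made for illustration. The frequency-descending ordering with alphabetical tie-breaking is meant to mirror StringIndexer's default ordering.

```python
from collections import Counter


def lump_column(values, top_k):
    """Index string values by descending frequency, then lump rare levels.

    Hypothetical helper: the most frequent value gets index 0, the next 1,
    and so on. Every level outside the top_k most frequent ones shares the
    single index top_k (an assumed convention, not taken from the PR).
    """
    counts = Counter(values)
    # Sort by frequency (descending), breaking ties alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    index_of = {value: i for i, value in enumerate(ordered)}
    return [min(index_of[v], top_k) for v in values]


def lump_features(rows, rules):
    """Apply lumping rules to a list of row dicts.

    `rules` maps a column name to the number of levels to keep (assumed
    format). Columns not named in `rules` are passed through unchanged.
    """
    out = [dict(row) for row in rows]  # copy so the input is not mutated
    for column, top_k in rules.items():
        lumped = lump_column([row[column] for row in rows], top_k)
        for row, index in zip(out, lumped):
            row[column] = index
    return out


rows = [
    {"city": "NYC", "amount": 10},
    {"city": "NYC", "amount": 20},
    {"city": "LA", "amount": 30},
    {"city": "LA", "amount": 40},
    {"city": "NYC", "amount": 50},
    {"city": "SF", "amount": 60},
    {"city": "Boise", "amount": 70},
]
result = lump_features(rows, {"city": 2})
```

With these rows, "NYC" and "LA" keep their own indices (0 and 1), while "SF" and "Boise" fall outside the top 2 and share index 2. The `amount` column is untouched.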

The test suite LumpFeaturesSuite tests this transformer's basic functionality by creating an input DataFrame with categorical columns, applying lumping rules using an instance of LumpFeatures, and comparing the output DataFrame with an expected result.

In summary, this custom transformer helps in handling high cardinality skewed categorical features by indexing and lumping them according to specified rules before encoding them.

Suggestions

The changes in this PR look good and no suggestions are needed.
