feat: Add LumpFeatures transformer #1941
base: master
Hey @dylanw-oss 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
Summary by GPT-4
The `LumpFeatures` transformer is a custom transformer that takes a DataFrame and a list of lumping rules as input and returns a DataFrame comprised of the original columns, but the columns defined in lumping rules will be indexed and lumped to top k. This transformer can be used to handle high cardinality skewed categorical features before doing encoding.
In the given code, the `LumpFeatures` class extends `Transformer` and implements the following methods:
- `transform`: This method takes an input dataset and applies the lumping rules to it. It first creates a pipeline with StringIndexer transformers for each column specified in the lumping rules. Then, it fits and transforms the input dataset using this pipeline. Finally, it keeps only the top k levels for each categorical column according to the lumping rules.
- `transformSchema`: This method returns the schema of the output DataFrame after applying the transformation.
- `copy`: This method creates a copy of this instance with extra parameters.
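The index-then-lump step described for `transform` can be sketched in plain Python, without Spark. This is only an illustration of the idea, not the PR's code: the function name `lump_top_k` is hypothetical, and it assumes that labels are indexed by descending frequency and that every label outside the top k shares the single index k. The actual transformer may encode the lumped levels differently.

```python
from collections import Counter


def lump_top_k(values, k):
    """Index labels by descending frequency and lump labels outside the top k.

    Assumption (not confirmed by the PR): all lumped labels share index k.
    """
    counts = Counter(values)
    # Most frequent label gets index 0; ties are broken alphabetically
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    index = {label: i for i, label in enumerate(ranked)}
    # Any label ranked k or lower collapses into the shared bucket k.
    return [min(index[v], k) for v in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
lumped = lump_top_k(colors, k=2)
print(lumped)  # [0, 0, 0, 1, 1, 2, 2]
```

Here `red` and `blue` keep their own indices, while the rare `green` and `teal` collapse into one bucket, which is what keeps a later encoding step from producing one column per rare level.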
The test suite `LumpFeaturesSuite` tests this transformer's basic functionality by creating an input DataFrame with categorical columns, applying lumping rules using an instance of `LumpFeatures`, and comparing the output DataFrame with an expected result.
In summary, this custom transformer helps in handling high cardinality skewed categorical features by indexing and lumping them according to specified rules before encoding them.
Suggestions
The changes in this PR look good and no suggestions are needed.
Related Issues/PRs
#1891
What changes are proposed in this pull request?
This PR adds a transformer that can be used to handle data with high-cardinality, skewed categorical features before other featurization processing.
How is this patch tested?
unit test
Does this PR add a new feature? If so, have you added samples on website?
Will add documentation in the next commit (I'd like to ensure the approach makes sense before taking that step).
- Add the samples under the `website/docs/documentation` folder. Make sure you choose the correct class (`estimators/transformers`) and namespace.
- Make sure the `DocTable` points to the correct API link.
- Run `yarn run start` to make sure the website renders correctly.
- Add `<!--pytest-codeblocks:cont-->` before each Python code block to enable auto-tests for Python samples.
- Make sure the `WebsiteSamplesTests` job passes in the pipeline.