Commit

Use SimpleImputer for categorical values with strategy=most_frequent
victorvrv committed May 4, 2020
1 parent 0322dce commit a6329e8
Showing 1 changed file with 3 additions and 19 deletions.
22 changes: 3 additions & 19 deletions 03_classification.ipynb
@@ -2957,23 +2957,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We will also need an imputer for the string categorical columns (the regular `SimpleImputer` does not work on those):"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 115,
-"metadata": {},
-"outputs": [],
-"source": [
-"# Inspired from stackoverflow.com/questions/25239958\n",
-"class MostFrequentImputer(BaseEstimator, TransformerMixin):\n",
-"    def fit(self, X, y=None):\n",
-"        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],\n",
-"                                        index=X.columns)\n",
-"        return self\n",
-"    def transform(self, X, y=None):\n",
-"        return X.fillna(self.most_frequent_)"
+"We will also need an imputer for the string categorical columns; we can again use the `SimpleImputer`, but with `strategy=\"most_frequent\"`:"
]
},
{
@@ -3011,7 +2995,7 @@
"source": [
"cat_pipeline = Pipeline([\n",
" (\"select_cat\", DataFrameSelector([\"Pclass\", \"Sex\", \"Embarked\"])),\n",
-" (\"imputer\", MostFrequentImputer()),\n",
+" (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
" (\"cat_encoder\", OneHotEncoder(sparse=False)),\n",
" ])"
]
@@ -4477,4 +4461,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
-}
+}
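
The replacement made in this commit can be checked in isolation. The following is a minimal sketch, assuming a scikit-learn version in which `SimpleImputer` accepts object-dtype (string) columns with `strategy="most_frequent"`. The toy values are hypothetical; only the column names `Sex` and `Embarked` are taken from the pipeline in the diff.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: string categorical columns with missing entries.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", np.nan],
    "Embarked": ["S", np.nan, "S", "C"],
})

# most_frequent fills each column's missing entries with that column's mode.
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)  # NumPy array by default, not a DataFrame

# statistics_ holds the per-column fill values learned during fit.
print(imputer.statistics_)
print(filled)
```

With the default output configuration, `fit_transform` returns a NumPy array rather than a DataFrame, whereas the removed `MostFrequentImputer` returned the result of `X.fillna(...)`. This should not matter for the pipeline in the diff, since the next step is `OneHotEncoder`, which accepts arrays.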
