Add label encoder #5067

Merged
8 commits merged into shogun-toolbox:develop on Jul 1, 2020
Conversation

LiuYuHui (Contributor)

Related to #5054
Add LabelEncoder/MulticlassLabelsEncoder/BinaryLabelEncoder

transformed_vec.begin(), transformed_vec.end(),
transformed_vec.begin(), [](float64_t e) {
if (std::abs(e - 0.0) <=
std::numeric_limits<float64_t>::epsilon())
Member

Math::fequals does this

Member

please don't change it to Math:: ;) we want to kill that thing sooner or later

Member

hmm I guess we need to replace this with a utility function somewhere then? Because this is quite a common operation

Member

could be an issue, and then while we are at it we can also replace all the CMath::fequals calls

Member

I think we should either introduce this utility function now, or just use CMath::fequals and include this in a larger refactor (it is copy paste)
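
A shared helper could look something like this (a minimal sketch; the name float_equal and the absolute-epsilon semantics are assumptions, not existing shogun API):

    #include <cmath>
    #include <limits>

    // Hypothetical shared utility: absolute-epsilon comparison of two floats.
    // Note: for values far from zero, a relative-epsilon or ULP-based
    // comparison may be more appropriate.
    template <typename T>
    bool float_equal(T a, T b, T eps = std::numeric_limits<T>::epsilon())
    {
        return std::abs(a - b) <= eps;
    }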

"BinaryLabel should contain only two elements");

return SGVector<float64_t>(
unique_labels.begin(), unique_labels.end());
Member

isn't this the result of fit_impl?

TEST(BinaryLabelEncoder, fit_transform)
{
auto label_encoder = std::make_shared<BinaryLabelEncoder>();
SGVector<int32_t> vec{-1, -1, 1, -1, 1};
Member

what about something with values that are not {-1,1}. Could you also test for something with more than two labels to make sure the exception is thrown?

Contributor Author

If the values are not {-1, 1}, they will be transformed to {-1, 1}. I will add more unit tests.

Member

I guess you want to test (a sketch follows the list):

  • binary labels are passed (no-op)
  • labels with two unique values are passed (and transformed into binary)
  • labels with fewer or more than two unique values are passed (error)
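
Something along these lines, following the style of the fit_transform test above (assuming the encoder exposes fit_transform, as the test name suggests; the exact exception type is an assumption, hence the generic EXPECT_ANY_THROW):

    TEST(BinaryLabelEncoder, corner_cases)
    {
        auto label_encoder = std::make_shared<BinaryLabelEncoder>();

        // 1) already-binary labels: encoding should leave the values unchanged
        SGVector<float64_t> binary{-1, 1, -1, 1};
        auto same = label_encoder->fit_transform(
                        std::make_shared<BinaryLabels>(binary))
                        ->as<BinaryLabels>()
                        ->get_labels();
        for (int i = 0; i < binary.vlen; ++i)
            EXPECT_EQ(binary[i], same[i]);

        // 2) two unique values other than {-1, 1}: mapped onto {-1, 1}
        SGVector<float64_t> two_values{-100, 200, -100, 200};
        auto mapped = label_encoder->fit_transform(
                          std::make_shared<BinaryLabels>(two_values))
                          ->as<BinaryLabels>()
                          ->get_labels();
        for (int i = 0; i < mapped.vlen; ++i)
            EXPECT_TRUE(mapped[i] == -1 || mapped[i] == 1);

        // 3) more than two unique values: fitting should error out
        SGVector<float64_t> three_values{0, 1, 2};
        EXPECT_ANY_THROW(label_encoder->fit_transform(
            std::make_shared<MulticlassLabels>(three_values)));
    }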

TEST(MulticlassLabelsEncoder, fit_transform)
{
auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
SGVector<float64_t> vec{1, 2, 2, 6};
Member

I am assuming something with negative values and 0 would also work?


auto inv_result = label_encoder->inverse_transform(result_labels)
->as<MulticlassLabels>()
->get_labels();
Member

could you also test for something that isn't the result? For example something like {2,0,1,0}?

Member

Also what happens when you transform with something that wasn't fitted? For example if now there was a label 3, but your label space is {1,2,6}?

@gf712 (Member) left a comment

Good start! I have some questions. But I think I just misunderstood the design. It would be good to add more tests and think of all possible corner cases

std::transform(
result_vector.begin(), result_vector.end(),
original_vector.begin(),
[& normalized_to_origin = normalized_to_origin](const auto& e) {
@gf712 (Member) commented Jun 15, 2020

btw if you don't rename a variable you can just write it as [&normalized_to_origin]

Contributor Author

normalized_to_origin is a class member, so if I want to use normalized_to_origin in a lambda, I guess I have to write [&normalized_to_origin = normalized_to_origin]?

Member

ah yes, you're right!
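
For the record, a minimal standalone illustration of the two options (generic C++, not code from this PR):

    #include <unordered_map>

    struct Encoder
    {
        std::unordered_map<double, double> normalized_to_origin;

        void demo()
        {
            // error: a data member cannot be captured by name directly
            // auto bad = [&normalized_to_origin] { return normalized_to_origin.size(); };

            // OK: init-capture binds a reference to the member
            auto ok1 = [&normalized_to_origin = normalized_to_origin] {
                return normalized_to_origin.size();
            };

            // OK: capture this and access the member through it
            auto ok2 = [this] { return normalized_to_origin.size(); };

            ok1();
            ok2();
        }
    };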

require(
std::set<float64_t>(result_vector.begin(), result_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Member

could you please re-use the code we have for these transformations? I put some links in the issue.
Or at least delete the old code then.

require(
std::set<float64_t>(result_vector.begin(), result_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Member

"BinaryLabel" -> "Binary labels".
And also please print what the values are so that the user can see what is wrong. See also the existing code for that

std::set<float64_t>(
normalized_vector.begin(), normalized_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Member

See comment above

auto result_labels = fit_impl(result_vector);
require(
unique_labels.size() == 2,
"Binary Labels should contain only two elements");
Member

nitpick: "Binary labels" (the idea is to not replicate class/variable/type names in user-facing error messages).
It would still be good to print the unique labels in this error message.

Contributor Author

but the number of unique labels may exceed two, e.g. if {0,1,2,3,4,5,6,7,8} is passed to the method; should we print all the values?

Member

yeah, the idea is that the user gets feedback on what the problem with their input is, i.e. you print out that we've detected {0,1,2} as unique labels in their data set... this should help them figure out/fix the input...

Member

++ what viktor says here.
Also keep in mind that the number of unique labels will be relatively small (unless someone passes regression labels...), so maybe you could cap them at length 10 or so and add a "...", although I don't really think that is necessary.

std::unordered_set<float64_t>(
normalized_vector.begin(), normalized_vector.end())
.size() == 2,
"Binary Labels should contain only two elements");
Member

could we maybe put this string somewhere so it is not replicated so many times.... as well as the check code itself?

auto inv_test = label_encoder->inverse_transform(test_labels)
->as<BinaryLabels>()
->get_labels();
EXPECT_EQ(-100, inv_test[0]);
Member

could you use a loop (or macro) and the original vector to make this less verbose?
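
E.g. roughly (a sketch; expected is a hypothetical vector holding the original values in the same order):

    // compare element-wise instead of one EXPECT_EQ per element
    ASSERT_EQ(expected.vlen, inv_test.vlen);
    for (int i = 0; i < expected.vlen; ++i)
        EXPECT_EQ(expected[i], inv_test[i]);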

@karlnapf (Member) left a comment

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldn't be mapped to {-1,+1} without warning.

@LiuYuHui (Contributor Author)

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldn't be mapped to {-1,+1} without warning.

but it seems like the Labels class uses SGVector<float64_t> to store the values, even when the values are SGVector<int32_t>, so I think there is no way to figure out what the original values were, because they have been converted to float64_t.

@gf712 (Member) commented Jun 16, 2020

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldn't be mapped to {-1,+1} without warning.

but it seems like the Labels class uses SGVector<float64_t> to store the values, even when the values are SGVector<int32_t>, so I think there is no way to figure out what the original values were, because they have been converted to float64_t.

But you can cast the elements to int and then compare the difference between the element and the integer cast, right? If that difference is larger than epsilon then there is an issue, i.e. it does not represent an integer. You could do this when you populate the set. It will slow down fitting the encoder massively, but this shouldn't be the performance-critical part anyway.
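
I.e. something like this (a generic sketch, not shogun code):

    #include <cmath>
    #include <limits>
    #include <vector>

    // true iff every value represents an integer up to machine epsilon,
    // e.g. {2.0, -1.0} but not {2.13434, 3.42342343}
    bool all_integers(const std::vector<double>& values)
    {
        for (double v : values)
        {
            if (std::abs(v - std::round(v)) >
                std::numeric_limits<double>::epsilon())
                return false;
        }
        return true;
    }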

unique_set.size() == 2,
"Binary labels should contain only two elements, ({}) have "
"been detected.",
fmt::join(unique_set, ", "));
Member

neat! Didn't know this was a thing :D

Member

nit: "Cannot interpret {} as binary labels"

@karlnapf (Member) commented Jun 27, 2020

typo can not -> cannot

[](auto&& e1, auto&& e2) {
return std::abs(e1 - e2) >
std::numeric_limits<float64_t>::epsilon();
});
Member

I think you need to rename this function, it is a bit confusing what you get back. Imo it should be can_convert_float_to_int; then you just have to invert the logic to std::abs(e1 - e2) < std::numeric_limits<float64_t>::epsilon(); and fix the logic in each call.
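
I.e. roughly (a sketch of the suggested rename, reusing the std::transform/std::equal pattern that appears elsewhere in this diff):

    bool can_convert_float_to_int(const SGVector<float64_t>& vec) const
    {
        SGVector<int32_t> converted(vec.vlen);
        std::transform(
            vec.begin(), vec.end(), converted.begin(),
            [](auto&& e) { return static_cast<int32_t>(e); });
        return std::equal(
            vec.begin(), vec.end(), converted.begin(),
            [](auto&& e1, auto&& e2) {
                return std::abs(e1 - e2) <
                       std::numeric_limits<float64_t>::epsilon();
            });
    }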

{
auto label_encoder = std::make_shared<BinaryLabelEncoder>();
SGVector<int32_t> vec{-100, 200, -100, 200, -100, 42};
auto origin_labels = std::make_shared<BinaryLabels>(vec);
Member

@karlnapf doesn't BinaryLabels throw in this situation? And if not shouldn't it?
Also shouldn't this test be done with MulticlassLabels?

Member

I think it doesn't atm, but I am not sure.

auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
SGVector<float64_t> vec{-100.1, 200.4, -2.868, 6.98, -2.868};
auto origin_labels = std::make_shared<MulticlassLabels>(vec);
auto unique_vec = label_encoder->fit(origin_labels);
Member

so will this give the user a warning? Because they are using non-integer representations?

Member

I think it should (though maybe it should be possible to de-activate it? not sure if this can be spammy)

Member

I assume this would only happen once, when you fit, though, no? And the user can also change the io level.

Member

in grid search it would happen quite a few times. But yes, with the log level that is best taken care of.

return "LabelEncoder";
}

void set_print_warning(bool print_warning){
Member

Imo this shouldn't be needed. The user should set this with the log level, e.g. switch from warn to error.

Member

++

std::transform(
normalized_vector.begin(), normalized_vector.end(),
normalized_vector.begin(), [](float64_t e) {
if (std::abs(e + 1.0) <=
Member

Math::fequals or utility func

[](auto&& e) { return static_cast<int32_t>(e); });
return std::make_shared<BinaryLabels>(result_vev);
}
/** Fit label encoder and return encoded labels.
Member

base class

std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
unique_set.size() == 2,
"Binary labels should contain only two elements, can not interpret ({}) as binary labels",
Member

Suggested change
"Binary labels should contain only two elements, can not interpret ({}) as binary labels",
"Cannot interpret ({}) as binary labels, need exactly two classes.",

const auto unique_set =
std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
unique_set.size() == 2,
Member

@gf712 I think it might not be good to assert this, as sometimes labels might only contain one class. But I guess this will pop up if it's a problem and we can change it then :)

Member

hmm, in what situation would there only be one label?

Member

When predicting, although I am not sure this is ever called in that case.

Member

seems like this is only called in fit and transform, so should be fine

Member

yep, +1, and if it becomes a problem we just change it

* @return original encoding labels
*/
std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>& labs) override
Member

don't we also need a validity check here? Something that ensures that the labels are contiguous, [0,1,2,3,4, ...] with no gaps.
I wrote some code for this in the links I posted. Either re-use or remove my code :)

Contributor Author

I am wondering whether we need a validity check here, as inverse_transform maps from the internal encoding back to the original encoding. For example, {100, 100, 200, 300} -> {0, 0, 1, 2}: the {0, 0, 1, 2} is produced by the internal encoding, while the original labels themselves are not contiguous.

Member

ah you are right of course :)

namespace shogun
{

class MulticlassLabelsEncoder : public LabelEncoder
Member

same comments as for binary labels class

return std::equal(
vec.begin(), vec.end(), converted.begin(),
[](auto&& e1, auto&& e2) {
return std::abs(e1 - e2) <
Member

CMath::fequals or utility please :)

Member

@LiuYuHui could you do this change so that when CMath::fequals is replaced we can just find and replace these things?

}

std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> normalized_to_origin;
Member

inverse_mapping

@karlnapf (Member) left a comment

Made some more comments. Nice work :)

@@ -16,37 +16,33 @@
#include <unordered_set>
namespace shogun
{

/** @brief Implements a reversible mapping from
Member

The whitespace is weird here. Could you clean it up?

/** @brief Implements a reversible mapping from any
* form of labels to one of Shogun's target label spaces
* (binary, multi-class, etc).
*
Member

nit: remove this line

@karlnapf (Member) left a comment

Cool, looks good. What is missing?

@LiuYuHui (Contributor Author)

Cool, looks good. What is missing?

I think everything has been done.

return original_vector;
}

bool can_convert_float_to_int(const SGVector<float64_t>& vec) const
Member

I wonder whether this should be templated (for both the float and the int type) and live somewhere where other conversion tools (safe_convert) live... this might be useful elsewhere.
@gf712 thoughts?

Member

Yes, ideally this would be templated
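
A templated variant might look like this (a sketch; the name and where it should live are exactly the open questions above):

    #include <algorithm>
    #include <cmath>
    #include <iterator>
    #include <limits>

    // Hypothetical generic helper: can every element in [first, last)
    // be represented exactly as IntType?
    template <typename IntType, typename It>
    bool can_convert_to_integral(It first, It last)
    {
        using Float = typename std::iterator_traits<It>::value_type;
        return std::all_of(first, last, [](Float v) {
            return std::abs(v - static_cast<Float>(static_cast<IntType>(v))) <=
                   std::numeric_limits<Float>::epsilon();
        });
    }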


namespace shogun
{
/** @brief Implements a reversible mapping from
Member

some whitespace glitches

@karlnapf (Member) left a comment

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format (i.e. -1,1 or 0,1,2,3,4,...,num_classes-1)? Or is that a pointless optimization that makes no difference anyway?

@LiuYuHui (Contributor Author)

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format (i.e. -1,1 or 0,1,2,3,4,...,num_classes-1)? Or is that a pointless optimization that makes no difference anyway?

I think we should not add this optimization: when {-1, 1} or {0, 1, 2, ..., num_classes-1} are passed to fit/transform, we still need to maintain the mapping, as we don't know what values will be passed to inverse_transform.

@karlnapf (Member)

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format (i.e. -1,1 or 0,1,2,3,4,...,num_classes-1)? Or is that a pointless optimization that makes no difference anyway?

I think we should not add this optimization: when {-1, 1} or {0, 1, 2, ..., num_classes-1} are passed to fit/transform, we still need to maintain the mapping, as we don't know what values will be passed to inverse_transform.

Well that could be dealt with ... say if no inverse mapping was computed, the labels are not mapped backwards, but simply used as returned by the machine. @gf712 @vigsterkr what are your thoughts?

@karlnapf (Member)

Otherwise, I think we can merge this. Any objections (apart from the question above)?

@gf712 (Member) commented Jun 28, 2020

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format (i.e. -1,1 or 0,1,2,3,4,...,num_classes-1)? Or is that a pointless optimization that makes no difference anyway?

I think we should not add this optimization: when {-1, 1} or {0, 1, 2, ..., num_classes-1} are passed to fit/transform, we still need to maintain the mapping, as we don't know what values will be passed to inverse_transform.

Well that could be dealt with ... say if no inverse mapping was computed, the labels are not mapped backwards, but simply used as returned by the machine. @gf712 @vigsterkr what are your thoughts?

Yes, it makes sense to me to have this, but then you need to check whether the encoder has been fitted (boolean class member) and then you can see if the map is empty -> if empty, noop.
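
A minimal sketch of that control flow, with std containers standing in for the shogun types (the m_fitted flag and the empty-map convention are the proposal here, not existing API):

    #include <stdexcept>
    #include <unordered_map>
    #include <vector>

    struct EncoderSketch
    {
        bool m_fitted = false;
        std::unordered_map<double, double> inverse_mapping;

        std::vector<double> inverse_transform(const std::vector<double>& labs) const
        {
            if (!m_fitted)
                throw std::logic_error("inverse_transform called before fit");

            // fit() left no mapping because the labels were already in
            // canonical form, so pass them through unchanged
            if (inverse_mapping.empty())
                return labs;

            std::vector<double> result;
            result.reserve(labs.size());
            for (double l : labs)
                result.push_back(inverse_mapping.at(l));
            return result;
        }
    };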

@karlnapf (Member)

BTW once the above discussion is resolved, we could merge this.
But there is a follow-up thing to do (could go into a single PR):

  • embed this into Machine
  • remove the old conversion methods
  • all machine implementations can then perform dynamic casts and assume the correct label type is given to them

And then another PR for using all this within xvalidation


std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> inverse_mapping;
const float64_t eps = std::numeric_limits<float64_t>::epsilon();
Member

Suggested change
const float64_t eps = std::numeric_limits<float64_t>::epsilon();
constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();

Contributor Author

when I changed const to constexpr, I got an error: non-static data member ‘eps’ declared ‘constexpr’

Member

ah yes, it should be static constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();

@gf712 (Member) commented Jul 1, 2020

I think this can be merged now :)

@karlnapf merged commit 1f1f7d8 into shogun-toolbox:develop on Jul 1, 2020
@karlnapf (Member) commented Jul 1, 2020

Great! :)

As a next step, can I suggest a PR that

  • uses this class inside Machine::train
  • removes all the conversion code in BinaryLabels and MulticlassLabels
  • replaces all usages of binary_labels and multiclass_labels conversion methods in train_machine methods with ->as<BinaryLabels>() etc? I.e. a simple dynamic cast
  • Fixes all the examples/tests that will break through that :)
