Replies: 1 comment 1 reply
This is the opposite of the usual machine learning conventions (both for scikit-learn and for neural network libraries): usually the samples are stored as the rows of a dataframe. That seems to be the source of the confusion for your problem.

In scikit-learn parlance, "a sample" is a synonym for "an example" or "an observation" (in your train or test set). When you do cross-validation or a train-test split, each sample is allocated either to the train side or to the test/validation side of the split, and the goal of machine learning is to learn from the statistical properties of the samples on the train side of the data in order to make good predictions for each sample of the test set.

A feature is a kind of descriptor whose value you can measure for all the samples in your dataset. For instance it could be a temperature measurement in Celsius degrees, a price in yen, a weight in grams, a count of inhabitants, the brightness value between 0 and 255 of a pixel at a given location of an image... If this is not clear, feel free to edit your question to explain what the "physical" meaning of the features is in your case. You said they are integer counts. But what do they count?
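For instance, here is a minimal sketch of that convention (the data and feature meanings are made up, reusing the examples above):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical toy data: 4 samples (rows) x 3 features (columns),
# e.g. temperature in Celsius, price in yen, weight in grams.
X = np.array([
    [20.0, 150.0,  500.0],
    [25.0, 300.0,  750.0],
    [30.0, 450.0, 1000.0],
    [35.0, 600.0, 1250.0],
])

# With the default axis=0, each column (feature) is rescaled
# independently to [0, 1] using that column's own min and max.
X_scaled = minmax_scale(X, feature_range=(0, 1))
print(X_scaled)
# Each column now runs from 0.0 (its smallest value) to 1.0 (its largest).
```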
Hi! I have posted this on Stack Overflow as well, but thought I would also ask in the dedicated scikit-learn discussions forum!

I am building a neural network with Keras and need clarification on the pre-processing step.

I have a dataframe whose rows are features (for the machine learning algorithm to learn from) and whose columns are samples.
My data is already correctly log-transformed and I simply need to squash it to between 0 and 1. I am using `minmax_scale` from sklearn, and my processing is as follows:

```python
from sklearn import preprocessing

## transpose counts so that rows are samples and cols are features (the correct format for the NN)
counts = normCounts.transpose()

## scale counts
scaled = preprocessing.minmax_scale(counts, feature_range=(0, 1))
```
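As a quick sanity check of the orientation (shapes shown for the layout described above):

```python
print(normCounts.shape)  # (n_features, n_samples) before transposing
print(counts.shape)      # (n_samples, n_features) after transposing
```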
I need clarification on which way around the dataframe needs to be. The scikit-learn documentation for `minmax_scale` says the data are scaled along axis=0. Does this mean that each column is scaled independently, using that column's own min and max?
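To make the question concrete, here is a toy sketch of what I understand axis=0 versus axis=1 to do (the counts array is made up):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical toy counts: 2 rows x 3 columns.
counts = np.array([
    [1.0, 10.0, 100.0],
    [3.0, 30.0, 300.0],
])

# axis=0 (the default): each COLUMN is rescaled with its own min/max.
print(minmax_scale(counts, axis=0))
# [[0. 0. 0.]
#  [1. 1. 1.]]

# axis=1: each ROW is rescaled with its own min/max.
print(minmax_scale(counts, axis=1))
# [[0.         0.09090909 1.        ]
#  [0.         0.09090909 1.        ]]
```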
Basically, what I need to ensure is that low counts in the original dataframe end up close to 0 and high counts end up close to 1. However, I am now unsure whether it should actually be the other way around, i.e. not transposing the dataframe at the start, so that scaling happens for each feature across all samples, and not for each sample across its features as above.
Please note that the features are genomic features with a well-defined global range: for any feature, a small value means a lowly expressed gene and a high value a highly expressed gene. This global relationship needs to be preserved.
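Here is a minimal sketch of the layout I mean, with made-up gene and sample names and values:

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical normCounts: genes (features) as rows, samples as columns,
# matching the layout described above. Values are made up.
normCounts = pd.DataFrame(
    {"sample_1": [2.1, 5.0, 0.3],
     "sample_2": [2.5, 4.2, 0.1],
     "sample_3": [1.9, 6.1, 0.4]},
    index=["gene_A", "gene_B", "gene_C"],
)

# Transpose so samples are rows and genes are columns.
counts = normCounts.transpose()

# With the default axis=0, each gene (column) is scaled across all
# samples, so each gene's lowest count maps to 0 and its highest to 1.
scaled = preprocessing.minmax_scale(counts, feature_range=(0, 1))
print(scaled.shape)  # (3, 3): 3 samples x 3 genes
```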
Would really appreciate clarification here!
Thank you.