Replies: 1 comment 1 reply
This is the opposite of the usual machine learning conventions (both for scikit-learn and for neural network libraries): usually the samples are stored as the rows of a dataframe. That seems to be the source of the confusion for your problem.

In scikit-learn parlance, "a sample" is a synonym for "an example" or "an observation" (in your train or test set). When you do cross-validation or a train-test split, each sample is allocated either to the train side or to the test/validation side of the split, and the goal of machine learning is to learn from the statistical properties of the samples on the train side of the data in order to make good predictions for each sample of the test set.

A feature is a kind of descriptor whose value you can measure for all the samples in your dataset. For instance it could be a temperature measurement in Celsius degrees, a price in yen, a weight in grams, a count of inhabitants, the brightness value between 0 and 255 of a pixel at a given location of an image... If this is not clear, feel free to edit your question to explain what the "physical" meaning of the features is in your case. You said they are integer counts. But what do they count?
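For instance, here is a minimal sketch of that convention (the data and feature meanings are made up, reusing the examples above):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical toy data: 4 samples (rows) x 3 features (columns),
# e.g. temperature in Celsius, price in yen, weight in grams.
X = np.array([
    [20.0, 150.0,  500.0],
    [25.0, 300.0,  750.0],
    [30.0, 450.0, 1000.0],
    [35.0, 600.0, 1250.0],
])

# With the default axis=0, each column (feature) is rescaled
# independently to [0, 1] using that column's own min and max.
X_scaled = minmax_scale(X, feature_range=(0, 1))
print(X_scaled)
# Each column now runs from 0.0 (its smallest value) to 1.0 (its largest).
```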
Hi! I have posted this on Stack Overflow as well, but thought I would also ask in the dedicated scikit-learn discussions forum!

I am building a neural network with Keras and need clarification on the pre-processing step.

I have a dataframe whose rows are features (for the machine learning algorithm to learn from) and whose columns are samples.
My data is already correctly log-transformed and I simply need to squash it to between 0 and 1. I am using `minmax_scale` from sklearn, and my processing is as follows:

```python
from sklearn import preprocessing

## transpose counts so that rows are samples and cols are features (the correct format for the NN)
counts = normCounts.transpose()

## scale counts
scaled = preprocessing.minmax_scale(counts, feature_range=(0, 1))
```
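As a quick sanity check of the orientation (shapes shown for the layout described above):

```python
print(normCounts.shape)  # (n_features, n_samples) before transposing
print(counts.shape)      # (n_samples, n_features) after transposing
```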
I need clarification on which way around the dataframe needs to be. The scikit-learn documentation for `minmax_scale` says the data are scaled along axis=0. Does this mean that each column is scaled independently, using that column's own min and max?
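To make the question concrete, here is a toy sketch of what I understand axis=0 versus axis=1 to do (the counts array is made up):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical toy counts: 2 rows x 3 columns.
counts = np.array([
    [1.0, 10.0, 100.0],
    [3.0, 30.0, 300.0],
])

# axis=0 (the default): each COLUMN is rescaled with its own min/max.
print(minmax_scale(counts, axis=0))
# [[0. 0. 0.]
#  [1. 1. 1.]]

# axis=1: each ROW is rescaled with its own min/max.
print(minmax_scale(counts, axis=1))
# [[0.         0.09090909 1.        ]
#  [0.         0.09090909 1.        ]]
```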
Basically, what I need to ensure is that low counts in the original dataframe end up close to 0 and high counts end up close to 1. However, I am now unsure whether it should actually be the other way around, i.e. not transposing the dataframe at the start, so that scaling happens for each feature across all samples, and not for each sample across its features as above.
Please note that the features are genomic features with a well-defined global range: for any feature, a small value means a lowly expressed gene and a high value a highly expressed gene. This global relationship needs to be preserved.
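Here is a minimal sketch of the layout I mean, with made-up gene and sample names and values:

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical normCounts: genes (features) as rows, samples as columns,
# matching the layout described above. Values are made up.
normCounts = pd.DataFrame(
    {"sample_1": [2.1, 5.0, 0.3],
     "sample_2": [2.5, 4.2, 0.1],
     "sample_3": [1.9, 6.1, 0.4]},
    index=["gene_A", "gene_B", "gene_C"],
)

# Transpose so samples are rows and genes are columns.
counts = normCounts.transpose()

# With the default axis=0, each gene (column) is scaled across all
# samples, so each gene's lowest count maps to 0 and its highest to 1.
scaled = preprocessing.minmax_scale(counts, feature_range=(0, 1))
print(scaled.shape)  # (3, 3): 3 samples x 3 genes
```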
Would really appreciate clarification here!
Thank you.