
Reshape if single value when having several values #29

Open
AlexAdolfoKohan opened this issue Jan 8, 2024 · 8 comments

Comments

@AlexAdolfoKohan

Hi, sometimes when trying to predict some values it gives me the following error:

Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
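
(For reference, the reshape calls the message suggests look like this in plain numpy; the array here is just an illustration:)

import numpy as np

arr = np.array([1.0, 2.0, 3.0])  # shape (3,): the 1D case sklearn complains about
col = arr.reshape(-1, 1)         # shape (3, 1): a single feature, one value per row
row = arr.reshape(1, -1)         # shape (1, 3): a single sample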

It says my array is 1D, even though the .csv I read with pandas contains more than 5000 values.

I don't know what could be wrong. My data are names of authors, and there's nothing strange about them. I fit with more than 60000 rows and then try to predict; it works with some values and not with others. Could it be some kind of bug?

@fritshermans
Owner

Could you check whether the column names in the data frame you want to predict on are the same as in the one used for training? Could you also check the shape of your data frame ('df.shape')?
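
Something like this should confirm both (df_train is a stand-in name for the frame you passed to fit):

# df_train is a stand-in for the frame used for training
print(df_train.columns.tolist())
print(df.columns.tolist())
print(df.shape)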

@fritshermans
Owner

Btw, if there was something wrong with your data frame, please let me know as well. Then I can raise a warning with some explanation; that might be useful for future users.

@AlexAdolfoKohan
Author

The shape is (10934, 2), so I don't understand the problem. I'm trying to track it down, but it's a really large data frame.

@AlexAdolfoKohan
Author

AlexAdolfoKohan commented Jan 9, 2024

So I discovered a workaround, but I think it's better if I describe my case.

I have a csv with two columns ("id", "name"); the file has 500000 rows and weighs 15MB.

My problem was that when trying to predict on the entire file, memory usage climbed to 70GB and the process was killed instantly. To get around this I wrote a script that divides the dataframe into N pieces and then predicts on each pair of pieces: first piece 1 joined with piece 2, then 1 with 3, then 2 with 3, and so on (e.g. dividing into 3 gives the pairs 1-2, 1-3, 2-3), roughly like the sketch below...
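
(A minimal sketch of that pairwise-chunks approach, assuming df and the fitted myDedupliPy from the script further down; the chunk count is illustrative:)

import itertools
import numpy as np
import pandas as pd

# Split the frame into N pieces and predict on every pair of pieces,
# so each predict() call only sees a fraction of the rows
N = 3
pieces = np.array_split(df, N)
results = []
for a, b in itertools.combinations(range(N), 2):
    pair = pd.concat([pieces[a], pieces[b]], ignore_index=True)
    results.append(myDedupliPy.predict(pair, score_threshold=0.1))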

That started working, but in the third outer iteration the error from my first message suddenly appeared.

My current workaround is to join the pieces, save them to a csv and reopen it, but instead of taking 3 seconds each iteration now takes about 20.

Is there any way to predict on the entire 15MB file without dividing it and without consuming more than 60GB of memory?

Because right now I'm doing this, and I don't know if there's something I can improve:

import pandas as pd
import pickle

# Load the fitted DedupliPy model
with open("./scripts/model.pkl", 'rb') as f:
    myDedupliPy = pickle.load(f)

# Predict on the entire tab-separated file at once
df = pd.read_csv("./scripts/file.csv", sep='\t')
res = myDedupliPy.predict(df, score_threshold=0.1)
print(res.sort_values('deduplication_id').head(60))

@fritshermans
Owner

I think the size shouldn't be a problem. Most likely the selected blocking rules result in blocks that generate too many pairwise comparisons. Could you give some more details on what you're comparing? I can probably guide you in selecting useful blocking rules that won't blow up memory usage.
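
(For intuition: a block of n rows yields n * (n - 1) / 2 candidate pairs, so a single oversized block can dominate memory. A rough diagnostic, using a hypothetical blocking key on the first word of the name:)

import pandas as pd

# Hypothetical diagnostic: block sizes under a candidate blocking key.
# Each block of n rows contributes n * (n - 1) / 2 pairwise comparisons.
df = pd.read_csv("./scripts/file.csv", sep='\t')
block_sizes = df['name'].str.lower().str.split().str[0].value_counts()
pairs = (block_sizes * (block_sizes - 1) // 2).sum()
print(block_sizes.head(10))
print(f"total candidate pairs: {pairs}")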

@AlexAdolfoKohan
Author

AlexAdolfoKohan commented Jan 10, 2024

Yes, these are some examples from the dataset:

id      name
3004    Arturo Pérez Reverte
49441   Reverte, Arturo Perez
49441   A. P. Reverte
15309   Mozart, Wolfgang Amadeus
50640   Wolfgang A. Mozart
352997  W. A. Mozart

They are mostly names of authors, sometimes as "surname, name", sometimes as "name surname", and sometimes with the name abbreviated.
I'm only looking at the name. Also, is there a way to show the original id column when predicting? I didn't find one in the documentation.
Thank you a lot for the help and the fast replies :)

@fritshermans
Owner

You could try to use blocking rules like the ones below:

def sorted_three_letter_abbreviation(name):
    # Blocking key: first letter of each word, sorted and truncated to three
    # characters, so word order ("Mozart, Wolfgang" vs "Wolfgang Mozart")
    # doesn't matter
    words = name.lower().split()
    first_letters = [x[0] for x in words]
    return "".join(sorted(first_letters)[:3])

or

def probable_surname(name):
    # Blocking key: the part before the comma for "surname, name",
    # otherwise the last word for "name surname"
    if ',' in name:
        return name.split(',')[0]
    else:
        return name.split()[-1]

The advanced example notebook shows how to use custom blocking rules. Moreover, I would create a custom string similarity metric for the id column, as you don't want to do fuzzy matching on that field. Two IDs either match or they don't; there's no approximate match. The similarity metric for this field should look like this:

def exact_match(x, y):
    # IDs either match exactly (1) or not at all (0); no fuzzy matching
    return int(x == y)

The advanced example notebook also shows how to use custom similarity metrics for specific fields. Good luck!
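
(Putting the pieces together, a rough sketch; the field_info and rules keyword arguments follow the advanced example notebook, so treat them as assumptions and verify against the notebook for your version:)

from thefuzz.fuzz import ratio
from deduplipy.deduplicator import Deduplicator

# Assumed wiring, per the advanced example notebook: field_info maps each
# column to its similarity metrics, rules maps a column to its blocking rules
myDedupliPy = Deduplicator(
    field_info={'name': [ratio], 'id': [exact_match]},
    rules={'name': [sorted_three_letter_abbreviation, probable_surname]},
)
myDedupliPy.fit(df)
res = myDedupliPy.predict(df)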

@AlexAdolfoKohan
Author

Oh perfect thank you so much :D
