
Reshape if single value when having several values #29

Open
AlexAdolfoKohan opened this issue Jan 8, 2024 · 8 comments

Comments

@AlexAdolfoKohan

Hi, sometimes when trying to predict some values it gives me the following error:

Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
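
(For reference, the reshape calls the message suggests look like this in plain numpy; the array here is just an illustration:)

import numpy as np

arr = np.array([1.0, 2.0, 3.0])  # shape (3,): the 1D case sklearn complains about
col = arr.reshape(-1, 1)         # shape (3, 1): a single feature, one value per row
row = arr.reshape(1, -1)         # shape (1, 3): a single sample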

It says my array is 1D, even though the .csv I read with pandas contains more than 5000 values.

I don't know what could be wrong. My data are names of authors, and there's nothing strange about them. I fit with more than 60000 rows and then try to predict; it works with some values and not with others. Could it be some kind of bug?

@fritshermans
Owner

Could you check whether the column names in the data frame you want to predict on are the same as in the one used for training? Could you also check the shape of your data frame ('df.shape')?
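
Something like this should confirm both (df_train is a stand-in name for the frame you passed to fit):

# df_train is a stand-in for the frame used for training
print(df_train.columns.tolist())
print(df.columns.tolist())
print(df.shape)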

@fritshermans
Owner

Btw, if there was something wrong with your data frame, please let me know as well. Then I can raise a warning with some explanation; that might be useful for future users.

@AlexAdolfoKohan
Author

The shape is (10934, 2), so I don't understand the problem. I'm trying to track it down, but it's a really large data frame.

@AlexAdolfoKohan
Author

AlexAdolfoKohan commented Jan 9, 2024

So I discovered a workaround, but I think it's better if I describe my case.

I have a csv with two columns ("id", "name"); the file has 500000 rows and weighs 15MB.

My problem was that when trying to predict on the entire file, memory usage climbed to 70GB and the process was killed instantly. To get around this I wrote a script that divides the dataframe into N pieces and then predicts on each pair of pieces: first piece 1 joined with piece 2, then 1 with 3, then 2 with 3, and so on (e.g. dividing into 3 gives the pairs 1-2, 1-3, 2-3), roughly like the sketch below...
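
(A minimal sketch of that pairwise-chunks approach, assuming df and the fitted myDedupliPy from the script further down; the chunk count is illustrative:)

import itertools
import numpy as np
import pandas as pd

# Split the frame into N pieces and predict on every pair of pieces,
# so each predict() call only sees a fraction of the rows
N = 3
pieces = np.array_split(df, N)
results = []
for a, b in itertools.combinations(range(N), 2):
    pair = pd.concat([pieces[a], pieces[b]], ignore_index=True)
    results.append(myDedupliPy.predict(pair, score_threshold=0.1))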

That started working, but in the third outer iteration the error from my first message suddenly appeared.

My current workaround is to join the pieces, save them to a csv and reopen it, but instead of taking 3 seconds each iteration now takes about 20.

Is there any way to predict on the entire 15MB file without dividing it and without consuming more than 60GB of memory?

Because right now I'm doing this, and I don't know if there's something I can improve:

import pandas as pd
import pickle

# Load the fitted DedupliPy model
with open("./scripts/model.pkl", 'rb') as f:
    myDedupliPy = pickle.load(f)

# Predict on the entire tab-separated file at once
df = pd.read_csv("./scripts/file.csv", sep='\t')
res = myDedupliPy.predict(df, score_threshold=0.1)
print(res.sort_values('deduplication_id').head(60))

@fritshermans
Owner

I think the size shouldn't be a problem. Most likely the selected blocking rules result in blocks that generate too many pairwise comparisons. Could you give some more details on what you're comparing? I can probably guide you in selecting useful blocking rules that won't blow up memory usage.
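
(For intuition: a block of n rows yields n * (n - 1) / 2 candidate pairs, so a single oversized block can dominate memory. A rough diagnostic, using a hypothetical blocking key on the first word of the name:)

import pandas as pd

# Hypothetical diagnostic: block sizes under a candidate blocking key.
# Each block of n rows contributes n * (n - 1) / 2 pairwise comparisons.
df = pd.read_csv("./scripts/file.csv", sep='\t')
block_sizes = df['name'].str.lower().str.split().str[0].value_counts()
pairs = (block_sizes * (block_sizes - 1) // 2).sum()
print(block_sizes.head(10))
print(f"total candidate pairs: {pairs}")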

@AlexAdolfoKohan
Author

AlexAdolfoKohan commented Jan 10, 2024

Yes, these are some examples from the dataset:

id      name
3004    Arturo Pérez Reverte
49441   Reverte, Arturo Perez
49441   A. P. Reverte
15309   Mozart, Wolfgang Amadeus
50640   Wolfgang A. Mozart
352997  W. A. Mozart

They are mostly names of authors, sometimes as "surname, name", sometimes as "name surname", and sometimes with the name abbreviated.
I'm only looking at the name. Also, is there a way to show the original id column when predicting? I didn't find one in the documentation.
Thank you a lot for the help and the fast replies :)

@fritshermans
Owner

You could try to use blocking rules like the ones below:

def sorted_three_letter_abbreviation(name):
    # Blocking key: first letter of each word, sorted and truncated to three
    # characters, so word order ("Mozart, Wolfgang" vs "Wolfgang Mozart")
    # doesn't matter
    words = name.lower().split()
    first_letters = [x[0] for x in words]
    return "".join(sorted(first_letters)[:3])

or

def probable_surname(name):
    # Blocking key: the part before the comma for "surname, name",
    # otherwise the last word for "name surname"
    if ',' in name:
        return name.split(',')[0]
    else:
        return name.split()[-1]

The advanced example notebook shows how to use custom blocking rules. Moreover, I would create a custom string similarity metric for the id column, as you don't want to do fuzzy matching on that field. Two IDs either match or they don't; there's no approximate match. The similarity metric for this field should look like this:

def exact_match(x, y):
    # IDs either match exactly (1) or not at all (0); no fuzzy matching
    return int(x == y)

The advanced example notebook also shows how to use custom similarity metrics for specific fields. Good luck!
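
(Putting the pieces together, a rough sketch; the field_info and rules keyword arguments follow the advanced example notebook, so treat them as assumptions and verify against the notebook for your version:)

from thefuzz.fuzz import ratio
from deduplipy.deduplicator import Deduplicator

# Assumed wiring, per the advanced example notebook: field_info maps each
# column to its similarity metrics, rules maps a column to its blocking rules
myDedupliPy = Deduplicator(
    field_info={'name': [ratio], 'id': [exact_match]},
    rules={'name': [sorted_three_letter_abbreviation, probable_surname]},
)
myDedupliPy.fit(df)
res = myDedupliPy.predict(df)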

@AlexAdolfoKohan
Author

Oh perfect thank you so much :D
