.loc[...] = value returns SettingWithCopyWarning #17476

Closed
NadiaRom opened this issue Sep 8, 2017 · 10 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Usage Question

Comments

@NadiaRom

NadiaRom commented Sep 8, 2017

Code Sample

# My code
df.loc[0, 'column_name'] = 'foo bar'

Problem description

This code in pandas 0.20.3 raises SettingWithCopyWarning and suggests:

"Try using .loc[row_indexer,col_indexer] = value instead".

I am already doing so, so this looks like a small bug. I use Jupyter.
Thank you! :)

Output of pd.show_versions()


commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

TomAugspurger commented Sep 8, 2017

@NadiaRom Can you provide a full example? It's hard to say for sure, but I suspect that df came from an operation that may be a view or copy. For example:

In [8]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})

In [9]: df1 = df[['A', 'B']]

In [10]: df1.loc[0, 'A'] = 5
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py:180: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6

So we're updating df1 correctly. The ambiguity is whether or not df will be updated as well. I think a similar thing is happening to you, but without a reproducible example it's hard to say for sure.
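The same situation can be sketched as a standalone script. This is a minimal illustration, assuming only pandas and the standard-library `warnings` module. It records whatever warnings the assignment emits instead of assuming one, because whether SettingWithCopyWarning appears depends on the pandas version in use.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})
df1 = df[["A", "B"]]  # selecting a list of columns from df

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, do not filter
    df1.loc[0, "A"] = 5

# Names of the warning categories emitted by the assignment (may be empty).
print([w.category.__name__ for w in caught])

print(df1.loc[0, "A"])  # the value in df1 after the assignment
print(df.loc[0, "A"])   # the value in the original frame, for comparison
```

Comparing the two printed values shows directly whether the assignment reached the original frame, which is the question the warning is raising.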

@NadiaRom
Author

NadiaRom commented Sep 8, 2017

@TomAugspurger Here is the code. In general, I never assign values in pandas without .loc.

df = pd.read_csv('df_unicities.tsv', sep='\t')
df.replace({'|': '--'}, inplace=True)

df_c = df.loc[df.encountry == country, : ]

df_c['sort'] = (df_c.encities_ua == 'all').astype(int) # new column
df_c['sort'] += (df_c.encities_foreign == 'all').astype(int)
df_c.sort_values(by='sort', inplace=True)

# ---end of chunk, everything is fine ---

if df_c.encities_foreign.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = 'other'
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = 'інші'
else:
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = country
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = df_c.country.iloc[0]
    
if df_c.encities_ua.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'other'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'інші'
else:
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'Ukraine'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'Україна'
	
# Warning after it

Thank you for the rapid answer!

@CRiddler

CRiddler commented Sep 8, 2017

The issue here is that you're first slicing your dataframe with .loc in line 4, then attempting to assign values to that slice.

df_c = df.loc[df.encountry == country, :]

pandas isn't 100% sure whether you want to assign values to just your df_c slice or have them propagate all the way back up to the original df. To avoid this, when you first assign df_c, tell pandas that it is its own dataframe (and not a slice) by using:

df_c = df.loc[df.encountry == country, :].copy()

Doing this will resolve the warning. I'll tack on a brief example to help explain the above, since I've noticed that a lot of users get confused by this aspect of pandas.

Example with made up data

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C
>>> df.loc[df['B'] == 'Q', 'new_col'] = 'hello'
>>> df
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q   hello
3  4  C     NaN
4  5  C     NaN

So the above works as we expect! Now let's try an example that mirrors what you attempted to do with your data.

>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df_q = df.loc[df['B'] == 'Q']
>>> df_q
   A  B
0  1  Q
1  2  Q
2  3  Q
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
/Users/riddellcd/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py:337: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)

>>> df_q
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

Looks like we hit the same warning! But df_q changed as we expected. This is because df_q is a slice of df, so even though we're using .loc[] on df_q, pandas is warning us that it won't propagate the changes up to df. To avoid this, we need to be more explicit and declare that df_q is its own dataframe, separate from df.

Let's start back from df_q, but use .copy() this time.

>>> df_q = df.loc[df['B'] == 'Q'].copy()
>>> df_q
   A  B
0  1  Q
1  2  Q
2  3  Q

Let's try to assign our value again:
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
>>> df_q
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

This works without a warning because we've told pandas that df_q is separate from df.

If you do in fact want these changes to df_c to propagate up to df, that's another point entirely, and I will answer it if you want.

@NadiaRom
Author

NadiaRom commented Sep 9, 2017

@CRiddler Great, thank you!
As you mentioned, chained .loc has never returned unexpected results. As I understand it, .copy() assures pandas that we treat the selected df_sliced_once as a separate object and do not intend to change the initial full df. Please correct me if I have mixed something up.
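That understanding can be checked with a short script. This is a minimal sketch reusing the made-up data from the example above; the `original` snapshot is an illustrative name introduced only for the comparison.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})
original = df.copy()  # snapshot of df, kept only for comparison

# Explicitly make df_q an independent frame, then modify it.
df_q = df.loc[df["B"] == "Q"].copy()
df_q.loc[df_q["A"] < 3, "new_col"] = "hello"

print(df_q)
print(df.equals(original))  # whether df still matches its snapshot
```

The mask here is built from df_q itself rather than from df, so the indexer and the frame being assigned to always share the same index.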

@gfyoung gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 9, 2017
@jreback
Contributor

jreback commented Sep 9, 2017

The documentation is here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy and @CRiddler has a nice explanation. In general, you should NOT use inplace at all.
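As a rough illustration of the advice about inplace, the two inplace calls in the reported code could be written as plain reassignments. This is a sketch on a small made-up frame; the column names and values are assumptions chosen for the example, not the reporter's data.

```python
import pandas as pd

df_c = pd.DataFrame({"city": ["a", "b", "c"], "sort": [2, 0, 1]})

# Instead of df_c.sort_values(by="sort", inplace=True):
df_c = df_c.sort_values(by="sort")

# Instead of df_c.replace({...}, inplace=True):
df_c = df_c.replace({"a": "A"})

print(df_c)
```

Each call returns a new frame that is bound back to the same name, so there is never a question of which object was modified.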

@jreback jreback closed this as completed Sep 9, 2017
@jreback jreback added this to the No action milestone Sep 9, 2017
@persep

persep commented Sep 12, 2020

"If you do in fact want these changes to df_c to propagate up to df, that's another point entirely, and I will answer it if you want."

@CRiddler Thanks, your answer is better than the ones on Stack Overflow. Could you add when you would want to propagate changes to the initial dataframe, or give an indication of how it is done?

@CRiddler

CRiddler commented Sep 14, 2020

@persep In general I don't like turning issues into Stack Overflow threads for help, but this issue seems to have gotten a fair bit of attention since the last post, so I'll go ahead and post my method of tackling this type of problem in pandas. I typically avoid subsetting the dataframe into separate variables. Instead, I store the masks in variables, combine them as needed, and set values based on those masks. This ensures that the changes happen in the original dataframe and not in some copy floating around.

Original data:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C

Remember that creating a temporary dataframe will NOT propagate changes
As shown in the previous example, this changes only df_q and raises a pandas warning (not copied here). It does NOT propagate any changes to df.

>>> df_q = df.loc[df["B"] == "Q"]
>>> df_q.loc[df["A"] < 3, "new_column"] = "hello"

# df remains unchanged because we only made changes to `df_q`
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C

To my knowledge, there is no way to use the same code as above and force changes to propagate back to the original dataframe.

However, if we change our thinking a bit and work with masks instead of full-on subsets, we can achieve the desired result. While this isn't necessarily "propagating" changes to the original dataframe from a subset, it ensures that any changes we make happen in the original dataframe df. To do this, we create the masks first, then apply them when we want to change that subset of df.

>>> q_mask = df["B"] == "Q"
>>> a_mask = df["A"] < 3

# Combine masks (in this case we used "&") to achieve what a nested subset would look like
#  In the same step we add in our item assignment. Instructing pandas to create a new column in `df` and assign
#  the value "hello" to the rows in `df` where `q_mask` & `a_mask` overlap.
>>> df.loc[q_mask & a_mask, "new_col"] = "hello"

# Successful "propagation" of new values to the original dataframe
>>> df
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN
3  4  C     NaN
4  5  C     NaN

Lastly, if we ever want to see what df_q would look like, we can always subset it from the original dataframe using our q_mask.

>>> df.loc[q_mask, :]
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

While this isn't necessarily "propagating" changes from df_q to df, we achieve the same result. Actual propagation would need to be done explicitly and would be less efficient than just working with masks.
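For comparison, one way explicit propagation might look is sketched below. This is an illustrative assumption rather than a method from the thread: it works on an independent copy and then writes the new column back by index alignment.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})

# Work on an independent copy of the "Q" rows.
df_q = df.loc[df["B"] == "Q"].copy()
df_q.loc[df_q["A"] < 3, "new_col"] = "hello"

# Explicit write-back: column assignment aligns on the index, so rows
# that are missing from df_q (the "C" rows) receive NaN.
df["new_col"] = df_q["new_col"]

print(df)
```

This takes an extra copy and an extra alignment step, which is the overhead the mask-based approach avoids.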

@persep

persep commented Sep 15, 2020

@CRiddler Thanks, you've been very helpful

@linehammer

The first thing you should understand is that SettingWithCopyWarning is a warning, and not an error. You can safely disable this warning with the following assignment.

pd.options.mode.chained_assignment = None

The real problem behind the warning is that it is generally difficult to predict whether a view or a copy is returned. When filtering pandas DataFrames, it is possible to slice or index a frame so that it returns either a view or a copy. A "view" is a view of the original data, so modifying the view may modify the original data. A "copy" is a replication of data from the original: changes made to the copy will not affect the original data, and changes made to the original data will not affect the copy.
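The view/copy distinction itself can be illustrated at the NumPy level, where the rules are simple. This is only a sketch of the concept: pandas' own rules for when a view or a copy is returned are more involved and depend on the pandas version.

```python
import numpy as np

base = np.arange(5)

view = base[1:3]        # basic slicing returns a view of base
view[0] = 99            # writes through to base

copied = base[[3, 4]]   # integer-array ("fancy") indexing returns a copy
copied[0] = -1          # does not touch base

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, copied))
```

`np.shares_memory` reports whether two arrays overlap in memory, which is a direct way to tell a view from a copy.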

@ntjess

ntjess commented Jul 16, 2021

@CRiddler thanks for the detailed explanation. What happens if the original dataframe is out of scope? I.e.

def update_values(filtered):
  # Filtered is the result of a 'loc' call
  new_value = result_from_function_body()
  set_indexes = some_computation()
  filtered.loc[set_indexes, 'new_col'] = new_value

Does this mean there is no way for update_values to work? In this setup, a mask can't be used since we don't have access to the reference dataframe, right?
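One possible arrangement, following the mask approach described earlier in the thread, is to pass the original frame together with a mask instead of a pre-filtered frame. This is a hypothetical sketch, not an answer given in the thread: the `row_mask` parameter, the `A < 3` condition, and the "hello" value are stand-ins for the unspecified computations in the question.

```python
import pandas as pd


def update_values(frame, row_mask):
    """Set new_col on the rows of frame selected by row_mask (hypothetical helper)."""
    # Stand-in for some_computation(): narrow the incoming mask further.
    set_rows = row_mask & (frame["A"] < 3)
    # Stand-in for result_from_function_body(): a fixed value.
    frame.loc[set_rows, "new_col"] = "hello"


df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})
update_values(df, df["B"] == "Q")

print(df)
```

Because the function assigns through .loc on the frame it was given, the caller's dataframe is the object being modified, and no intermediate slice is involved.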
