Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ticker.history return messy tz info #208

Open
ericwbzhang opened this issue Jul 19, 2023 · 10 comments
Open

Ticker.history return messy tz info #208

ericwbzhang opened this issue Jul 19, 2023 · 10 comments

Comments

@ericwbzhang
Copy link

ericwbzhang commented Jul 19, 2023

Hi I notice the price history returns messy tz info when working with more than one ticker in different tz

Example:
tmp_ticker= yq.Ticker( ['CLK24.NYM', '6618.HK'])
a= tmp_ticker.history(interval='5m', start='2023-07-10').reset_index()

part of returns:

6618.HK,2023-07-13 15:45:00+08:00,55.50000,55.80000,55.50000,55.70000,147000.00000
6618.HK,2023-07-13 15:50:00+08:00,55.80000,55.90000,55.70000,55.75000,157050.00000
6618.HK,2023-07-13 15:55:00+08:00,55.80000,55.80000,55.55000,55.60000,86600.00000
6618.HK,2023-07-13 04:05:00-04:00,55.60000,55.70000,55.60000,55.70000,359100.00000
6618.HK,2023-07-14 09:30:00+08:00,55.70000,56.15000,55.35000,55.90000,0.00000
6618.HK,2023-07-14 09:35:00+08:00,55.70000,55.75000,55.10000,55.35000,80950.00000

pd.version =1.5.3
yq.version= 2.3.2

@maread99
Copy link
Contributor

maread99 commented Jul 19, 2023

Hi @ericwbzhang, thanks for raising this. The bug's here, in how the dataframes for each symbol are concatenated:

try:
df = pd.concat(d, names=["symbol", "date"], sort=False)

Rather than retain the timezone info of the labels being concatenated, it'll use the first label it comes across that represents the same timestamp, regardless of the timezone. In your example you'll see that the labels that seem out for the HK symbol are correct as timestamps, they are just represented in a different timezone to what would be expected. This happens wherever the label represents a timestamp that's already present in the data for the NYM symbol.

I'm going to offer a PR with a fix that will retain the expected timezones (EDIT - no I'm not, see next comment).

It's probably worth noting that when you request data for symbols trading in different timezones AND have the adj_timezone as True (the default) the date level of the index will not be a pd.DatetimeIndex, but rather of dtype 'object'. This is because pd.DatetimeIndex requires all indices to have the same timezone. You might need to do quite a bit of post-processing. Depending on what you're doing with the data, you may find any of the following useful:

  • pass the adj_timezone parameter as False to return all indices in terms of UTC. The date level will then be a pd.DatetimeIndex which can be easily converted to another timezone.
  • have a look at the market_prices package - by default it uses yahooquery to get the price data although adds a host of options to query and post-process the data. For symbols trading on different exchanges it provides for combining all symbols in a dataframe indexed with a pd.DatetimeIndex (not a pd.MultiIndex) with unique labels in terms of your elected timezone.
  • (EDIT) separate out your ticker instances if it works for you to have the data in different dataframe, each with its own timezone, i.e. tick1 = yq.Ticker('CLK24.NYM'), tick2 = yq.Ticker('6618.HK').

@maread99
Copy link
Contributor

Right, this is annoying, I'm not able to create the required pd.MultiIndex. Tried various ways to no avail.

It might be bug in pandas (I've raised it with them here) but I think it's just as likely that it can't be done given how pd.MultiIndex work.

Assuming it's not a bug in pandas, options that occur to me are:

  • leave it as it is. All the labels are accurate, they just aren't all in the timezone that might be expected. (I'm not comfortable with this option given the potential for confusion.)
  • require that adj_timezone is explicitly passed as False whenever symbols correspond to more than one timezone (if adj_timezone is True in this case then raise an exception explaining why adj_timezone needs to be passed as False). This would force all indices to UTC and the user would know to expect this and why.

@dpguthrie, what do you reckon? I'm happy to offer up a PR for the second option if you agree that's the way to go.

As a related aside, I think the default for adj_timezone would be better as False, not least because it seems it's not possible to reliably represent more than one timezone in the returned dataframe. Although such a change would probably need to wait for any v3.

@ericwbzhang
Copy link
Author

@maread99 thanks for looking into this issue. I am thinking maybe returning an extra column that states the tzinfo for each symbol if adj_timezone= True. The date col will always be utc datetime
How do you like this?

@maread99
Copy link
Contributor

Hi @ericwbzhang.

IMHO, and the idea I get from having followed the package's evolution, yahooquery is about providing an API to get data from Yahoo Finance and it should not be concerned with doing much more than a bare minimum of post-processing. Rather post-processing should be left to the client according to their needs (although I might be off-the-mark with this - @dpguthrie ?).

The extra column you suggest could be easily added by any user who would find it useful (a symbol's timezone is available through the API). I'm not sure if you're suggesting that this column should have dtype 'str' and show the timezone (e.g. "America/New_York") or have dtype 'object' and hold pd.Timestamp showing the row datetime in the timezone of the corresponding symbol. (The latter is certainly possible and more useful - you CAN have a column, dtype 'object', with pd.Timestamp that represent the same underlying timestamp in different timezones. Just appears you can't do this within a pd.MultiIndex.)

@dpguthrie
Copy link
Owner

  1. Definitely agree that this package should be a wrapper around the API and return, for the most part, what the underlying data is and not do much more.
  2. @maread99 Maybe we don't need a pd.MultiIndex - pretty sure people end up resetting the index anyway after getting the results.

@ericwbzhang
Copy link
Author

ericwbzhang commented Jul 19, 2023 via email

@maread99
Copy link
Contributor

Just to clarify, are we suggesting having the dataframe with index as the default integer index (first row 0) and adding a 'symbol' and a 'date' column, i.e. as currently retuned by df.reset_index()?

I think this would certainly resolve the issue. I'm happy to offer the PR and revise the doc and affected tests - let me know.

Only thing, I suspect the change would break pretty much every dependant that uses the history method. Maybe one for 2.4, even 3.0?

@dpguthrie
Copy link
Owner

@maread99 As always, appreciate the work and timely responses!

I've definitely gotten feedback in the past that the MultiIndex is somewhat undesirable and that most people do in fact reset_index.

It is definitely a breaking change. In my opinion, it's more of a minor breaking change (so upgrading to 2.4) instead of a major one (upgrading to 3.0). Would love to hear your opinion, and anyone else's, on what the appropriate code path and semver path should be.

@ValueRaider
Copy link

Definitely agree that this package should be a wrapper around the API and return, for the most part, what the underlying data is and not do much more.

Then why combine the tables at all? Just return a dict of individual tables.

It is definitely a breaking change.

My suggestion: put this change in a new major version like 3.0, but continue supporting 2.* for a good while. Let users migrate in their own time.

@maread99
Copy link
Contributor

maread99 commented Jul 24, 2023

Then why combine the tables at all? Just return a dict of individual tables.

I think this is a good point. At market-prices I just end up splitting the DataFrame back into individual tables. I suspect at least some users will be doing something similar. Seems to make sense to offer the rawer option and leave it to the user to decide what they want to do with it. This would also nicely solve the tz matter that started off this issue - there would be no room for conflict between 'same timestamp different timezone', rather if adj_timezone is True then each DataFrame could simply be indexed in terms of the symbol's local timezone. This would also allow the columns to cleanly represent OHLCV, i.e. as now and as I think users would most-likely expect (personally I'm not keen on polluting the columns with the likes of 'date' and 'symbol').

So there's a couple of options to resolve the tz conflicts - either as if the index were reset or offer the return as a dict[str, DataFrame] with each DataFrame representing the prices for a single symbol (as identified by the key). I'd elect the latter for reasons of being cleaner and rawer (and so potentially more performant). @dpguthrie, all that said, I'd be happy to resolve this by offering a PR for the route you'd prefer to go with.

Versioning

Whichever way this is resolved it's going to break the code of almost everyone who uses's the .history method. However, the fixes that users will be required to implement should be pretty minimal. Couple of thoughts:

  • Whilst the nature of the change might warrant v3, going to v3 for this alone seems a bit excessive - 'history' is after all just one method offered offered by yahooquery. Users who don't use it would see absolutely no changes despite a new major release.
  • Whilst going to v3 and temporarily maintaining a v2 branch might be the ideal approach to give users time to migrate, I think simply going to v2.4 could be more than justified given time restraints on owners / maintainers of open-source. Users could pin to the prior version until they get a chance to implement their fix. On this basis, might be worth considering resolving the cookies issue first, releasing as 2.4, then fix this and release as 2.5 so that users have the option to pin to a stable 2.4.
    • Bundling the whole lot (cookies and this) as v3 might appear to be a clean option, although it would leave users without a stable prior version to pin to.
  • I think what's more important than 3.0 or 2.* is ensuring good documentation accompanies the changes / release so that users are aware there's been a breaking change and it's clear what that change is.

If it were me I'd probably go 2.4 then 2.5 - stable prior version to pin to and no responsibility to maintain a separate branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants