Parameters start_date and end_date not working as expected. #859

Open
TheAnalystx opened this issue Jan 29, 2023 · 14 comments

@TheAnalystx

TheAnalystx commented Jan 29, 2023

Describe the bug
The start_date and end_date parameters do not work as expected: no station data is returned even though data should be available.

To Reproduce

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            end_date=now  # <- comment out and it works
        )
# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))
assert not station.values.all().df.empty, 'empty dataframe found'  # <- error

You can see that data is available for the specific interval if you comment out start_date and end_date.

Expected behavior
Should return data for only the time frame specified.

Desktop (please complete the following information):

  • OS: [Windows]
  • Python-Version [3.9]
@gutzbenj
Member

gutzbenj commented Jan 29, 2023

Dear @TheAnalystx ,

apparently when I run the code I get actual data:

station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99880.0     2.0
1        05490  temperature_air  ...  99850.0     2.0
2        05490  temperature_air  ...  99820.0     2.0
3        05490  temperature_air  ...  99810.0     2.0
4        05490  temperature_air  ...  99800.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      0.0     2.0
359      05490    precipitation  ...      NaN     NaN

Could you give more details on your environment and the request? What would now look like when you get empty data? And at what time did you run the code?

@TheAnalystx
Author

Interesting, it works for me now too. It was late at night; I will try to reproduce it. Maybe I can find a pattern. Thanks for your fast feedback!

@TheAnalystx
Author

TheAnalystx commented Jan 29, 2023

The issue reappeared; the time of execution was 23:11 Berlin time.
I added the timestamps to the example:

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod

now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
# datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)

r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            # start_date = datetime.datetime(2023, 1, 29, 16, 11, 39, 501404, tzinfo=datetime.timezone.utc)
            end_date=now  # <- comment out and it works
            # end_date = datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)
        )

# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))

assert not station.values.all().df.empty, 'empty dataframe found'  # <- error
df = station.values.all().df
df['date'].min()  # Timestamp('2023-01-29 00:00:00+0000', tz='UTC')
df['date'].max()  # Timestamp('2023-01-29 21:50:00+0000', tz='UTC')

@gutzbenj
Member

Thanks for the report! I also did a request just now and still got values:

station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99330.0     2.0
1        05490  temperature_air  ...  99310.0     2.0
2        05490  temperature_air  ...  99310.0     2.0
3        05490  temperature_air  ...  99290.0     2.0
4        05490  temperature_air  ...  99280.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      NaN     NaN
357      05490    precipitation  ...      NaN     NaN
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

Did you switch off the cache for once and try the same request?

@TheAnalystx
Author

Yes, the issue persisted when I turned it off by using Settings.cache = False. I will see if I can identify the issue at some point.

@gutzbenj
Member

Thanks for the feedback! I just ran it again and again got values:

 station_id          dataset  ...    value quality
0        05490  temperature_air  ...  98630.0     2.0
1        05490  temperature_air  ...  98620.0     2.0
2        05490  temperature_air  ...  98620.0     2.0
3        05490  temperature_air  ...  98620.0     2.0
4        05490  temperature_air  ...  98590.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

@TheAnalystx
Author

@gutzbenj

Hi, a short question: does this code work for you?

from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.HISTORICAL,
        )
# search weather for coordinates:
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
station = r.filter_by_rank(rank=1, latlon=(hannover['latitude'], hannover['longitude']))
df = station.values.all().df
assert not df.empty, 'empty dataframe found'  # <- error
print(df)

Because I receive:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 755, in all
    for result in tqdm(self.query(), total=len(self.sr.station_id), file=tqdm_out):
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 449, in query
    parameter_df = self._collect_station_parameter(station_id, parameter, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 132, in _collect_station_parameter
    date_ranges = self._get_historical_date_ranges(station_id, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 340, in _get_historical_date_ranges
    interval = pd.Interval(from_date_min, to_date_max, closed="both")
  File "pandas\_libs\interval.pyx", line 325, in pandas._libs.interval.Interval.__init__
  File "pandas\_libs\interval.pyx", line 345, in pandas._libs.interval.Interval._validate_endpoint
ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Inte

I tried to create a completely new conda env because of the other error.

As for the other error:
It is still persisting (I even manually deleted the cache files and set up a new environment with Python 3.10 instead of 3.11). Could you maybe also try a different location?
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706} I am not sure whether this is a different buggy location or whether I just made a mistake when copy-pasting the buggy location.

@gutzbenj
Member

The given code also throws an error for me. The cause is that this method merges the metadata of all requested datasets, and some stations do not exist in every dataset. I'll try to figure out a solution as soon as possible. Until then, try to request the datasets separately!
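The workaround of issuing one request per dataset could be sketched as follows. The wetterdienst calls are the same ones used in the snippets in this thread; `collect_per_dataset` and `fetch_dwd_dataset` are hypothetical helper names, and catching `ValueError` is an assumption based on the traceback above rather than documented library behaviour.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def collect_per_dataset(
    datasets: Iterable[Any],
    fetch: Callable[[Any], Any],
) -> Tuple[Dict[Any, Any], Dict[Any, str]]:
    """Run one request per dataset and return (results, errors)."""
    results: Dict[Any, Any] = {}
    errors: Dict[Any, str] = {}
    for dataset in datasets:
        try:
            results[dataset] = fetch(dataset)
        except ValueError as exc:
            # Assumption: a failing dataset raises ValueError, as in the traceback.
            errors[dataset] = str(exc)
    return results, errors


def fetch_dwd_dataset(dataset, latlon=(52.39197954397832, 9.80360833506706)):
    """Fetch 10-minute historical values for a single dataset (hypothetical helper)."""
    # Imported lazily so that collect_per_dataset works without wetterdienst.
    from wetterdienst.provider.dwd.observation import (
        DwdObservationPeriod,
        DwdObservationRequest,
        DwdObservationResolution,
    )

    request = DwdObservationRequest(
        parameter=[dataset],
        resolution=DwdObservationResolution.MINUTE_10,
        period=DwdObservationPeriod.HISTORICAL,
    )
    station = request.filter_by_rank(rank=1, latlon=latlon)
    return station.values.all().df
```

With wetterdienst installed, `collect_per_dataset([DwdObservationDataset.TEMPERATURE_AIR, DwdObservationDataset.WIND, DwdObservationDataset.PRECIPITATION], fetch_dwd_dataset)` would issue three independent requests. Note that the nearest station may then differ from one dataset to the next.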

@TheAnalystx
Author

@gutzbenj Thank you for your answer. Should I create a separate ticket for this issue?

@gutzbenj
Member

@TheAnalystx I think it falls into the same category as the skip_empty issue, so I guess no separate ticket is required.

@gutzbenj
Member

Dear @TheAnalystx ,

I tried working on some improvements regarding the problem of empty data in #889. The main idea is that, with skip_empty set in the settings, we want to get all data for all stations: we iterate over all stations and, during collection, increase a counter whenever a station has enough data; if a station does not, we drop it and continue with the next one.
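The skip_empty idea of walking through the stations, counting the good ones and dropping the rest can be illustrated with a small standalone sketch. This is not the code from #889; `pick_stations`, `has_enough_data` and the station names and availabilities are made up for illustration.

```python
from typing import Callable, Iterable, List


def pick_stations(
    station_ids: Iterable[str],
    has_enough_data: Callable[[str], bool],
    wanted: int,
) -> List[str]:
    """Walk stations in order and keep the first `wanted` ones with enough data."""
    kept: List[str] = []
    for station_id in station_ids:
        if len(kept) >= wanted:
            break  # the counter has reached the requested number of stations
        if has_enough_data(station_id):
            kept.append(station_id)
        # otherwise the station is dropped and the loop continues with the next one
    return kept


# Made-up availabilities, ordered from nearest to farthest station.
availability = {"station_a": 0.0, "station_b": 0.95, "station_c": 0.4}
nearest_first = ["station_a", "station_b", "station_c"]

print(pick_stations(nearest_first, lambda s: availability[s] >= 0.8, wanted=1))
```

In this sketch the nearest station is skipped because it has no data, so the second-nearest one is returned instead.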

There are currently some problems:
1.) If you request multiple datasets, we calculate the availability of data per parameter across all collected parameters of all datasets (e.g. temperature_air, precipitation and other datasets). If a station does not have all datasets available, we normally try to return a completely empty dataset for the missing one; however, we don't do that if start_date and end_date are not given in the request.

As a result, those parameters are not counted as empty when the data availability is calculated, simply because they are not present in the resulting dataframe. The rate of available data therefore tends to be higher than it should be, because empty datasets are not taken into account. This does not happen for parameters in the available datasets, because there the date range of that dataset is used even for completely empty parameters.
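A small worked example may make the bias easier to see. The sketch below is plain Python, not the library's implementation: the function names are hypothetical, the values are invented, and `wind_speed` stands in for a parameter from a dataset the station does not provide.

```python
from typing import Dict, List, Optional, Sequence


def share_available(values: Sequence[Optional[float]]) -> float:
    """Share of non-missing values; an absent or empty parameter counts as 0.0."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def mean_availability(
    frame: Dict[str, List[Optional[float]]],
    parameters: Sequence[str],
) -> float:
    """Average availability over the given parameters."""
    shares = [share_available(frame.get(p, [])) for p in parameters]
    return sum(shares) / len(shares)


# Invented data: the third requested parameter is missing from the result.
frame = {
    "temperature_air_mean_200": [1.0, 2.0, None, 4.0],  # 3 of 4 values present
    "precipitation_height": [0.0, 0.0, 0.0, 0.0],       # 4 of 4 values present
}
requested = ["temperature_air_mean_200", "precipitation_height", "wind_speed"]

# Only parameters present in the result are counted: the rate looks too good.
naive = mean_availability(frame, list(frame))
# All requested parameters are counted: the missing one pulls the rate down.
strict = mean_availability(frame, requested)
print(naive, strict)
```

Here the rate computed only over the parameters that are present is clearly higher than the rate over everything that was requested, which is the overestimation described above.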

The whole topic is quite complex to explain, should we have another chat?

@TheAnalystx
Author

Dear @gutzbenj, yes, let's have another chat. Is there such a functionality on GitHub? It sounds complicated, but it would be a huge boost in accessibility in my opinion.

@gutzbenj
Member

I found a solution, but if you like we can still have a chat and I can introduce you to the whole library. There's no such functionality on GitHub, but we could just do a Skype (or similar) call.

@gutzbenj
Member

I just pushed some changes to main, so if you now install the latest version from GitHub, the code should work for you.
Just be careful: if you request too many parameters, it will probably run seemingly endlessly while fetching all station data, because probably only a few stations will have the data.

You can select the criteria for missing values like skip_threshold=0.8 and skip_criteria="min" (or "mean" or "max"),
where "min" would be the lowest availability of all parameters, "mean" would be the average of availabilities of all parameters and "max" the highest availability.

E.g. if you request "precipitation_height" and "temperature_air_mean_200" and have the following availabilities

parameter                 perc
precipitation_height      0.7
temperature_air_mean_200  0.9

the station would fail for the above setting with skip_criteria="min" (because precipitation_height is below the threshold), and the search would continue with the next station. With skip_criteria="mean" or skip_criteria="max" the station would be accepted.
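The three criteria can be checked against this example with a few lines of plain Python. This is only a sketch of the rule as described in this comment, not the library's code: `station_passes` is a hypothetical name, and comparing with `>=` is an assumption (the statement that "mean" works at a threshold of 0.8 suggests that equality counts as passing).

```python
from typing import Dict


def station_passes(
    availabilities: Dict[str, float],
    skip_threshold: float = 0.8,
    skip_criteria: str = "min",
) -> bool:
    """Return True if the station meets the threshold under the given criteria."""
    values = list(availabilities.values())
    if skip_criteria == "min":
        score = min(values)  # lowest availability of all parameters
    elif skip_criteria == "mean":
        score = sum(values) / len(values)  # average availability
    elif skip_criteria == "max":
        score = max(values)  # highest availability of all parameters
    else:
        raise ValueError(f"unknown skip_criteria: {skip_criteria!r}")
    # Assumption: a score exactly at the threshold is accepted.
    return score >= skip_threshold


availabilities = {
    "precipitation_height": 0.7,
    "temperature_air_mean_200": 0.9,
}

for criteria in ("min", "mean", "max"):
    print(criteria, station_passes(availabilities, 0.8, criteria))
```

For the availabilities above, only "min" rejects the station; "mean" (0.8) and "max" (0.9) both accept it.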

2 participants