Parameters start_date and end_date not working as expected. #859

TheAnalystx · 2023-01-29T02:10:41Z

Describe the bug
The start_date and end_date parameters are not working as expected. They don't return station data even though data should be available.

To Reproduce

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            end_date=now  # <- comment out and it works
        )
# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))
assert not station.values.all().df.empty, 'empty dataframe found'  # <- error

You can see that data is available for the specific interval if you comment out start_date and end_date.

Expected behavior
Should return data for only the time frame specified.

Desktop (please complete the following information):

OS: [Windows]
Python-Version [3.9]


gutzbenj · 2023-01-29T10:17:34Z

Dear @TheAnalystx ,

apparently when I run the code I get actual data:

    station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99880.0     2.0
1        05490  temperature_air  ...  99850.0     2.0
2        05490  temperature_air  ...  99820.0     2.0
3        05490  temperature_air  ...  99810.0     2.0
4        05490  temperature_air  ...  99800.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      0.0     2.0
359      05490    precipitation  ...      NaN     NaN

Could you give more details on your environment and the request? What would now look like when you get empty data? And at what time did you run the code?

TheAnalystx · 2023-01-29T14:47:28Z

Interesting, it works now for me too. It was late in the night, I will try to reproduce it. Maybe I can find a pattern. Thanks for your fast feedback!

TheAnalystx · 2023-01-29T22:17:54Z

The issue re-appeared; the time of execution was 23:11 Berlin time.
I added the timestamps to the example:

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod

now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
# datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)

r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            # start_date = datetime.datetime(2023, 1, 29, 16, 11, 39, 501404, tzinfo=datetime.timezone.utc)
            end_date=now  # <- comment out and it works
            # end_date = datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)
        )

# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))

assert not station.values.all().df.empty, 'empty dataframe found'  # <- error
df = station.values.all().df
df['date'].min()  # Timestamp('2023-01-29 00:00:00+0000', tz='UTC')
df['date'].max()  # Timestamp('2023-01-29 21:50:00+0000', tz='UTC')
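
To make a failing run easier to reproduce, the request window can be pinned to fixed values. Below is a minimal sketch using only the standard library: it takes the timestamp from the run above and aligns both bounds to the 10-minute grid of the data. The helper names floor_to_10_minutes and request_window are made up for this example, and whether grid alignment has any influence on the empty result is an untested assumption. Note that datetime.now(timezone.utc) yields the same instant as the Berlin-then-UTC conversion, so pytz is not needed here.

```python
from datetime import datetime, timedelta, timezone


def floor_to_10_minutes(ts: datetime) -> datetime:
    """Round a timestamp down to the previous full 10-minute mark."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def request_window(now: datetime, minutes: int = 360) -> tuple[datetime, datetime]:
    """Return (start_date, end_date), both aligned to the 10-minute grid."""
    end = floor_to_10_minutes(now)
    return end - timedelta(minutes=minutes), end


# Timestamp of the failing run reported above (hard-coded for reproducibility).
now = datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=timezone.utc)
start_date, end_date = request_window(now)
print(start_date.isoformat(), end_date.isoformat())
```

The resulting start_date and end_date can be passed to DwdObservationRequest in place of the values computed from datetime.now().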

gutzbenj · 2023-01-29T22:45:02Z

Thanks for the report! I also did a request just now and still got values:

    station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99330.0     2.0
1        05490  temperature_air  ...  99310.0     2.0
2        05490  temperature_air  ...  99310.0     2.0
3        05490  temperature_air  ...  99290.0     2.0
4        05490  temperature_air  ...  99280.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      NaN     NaN
357      05490    precipitation  ...      NaN     NaN
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

Did you switch off the cache for once and try the same request?

TheAnalystx · 2023-01-31T12:59:45Z

Yes, the issue persisted when I turned it off using Settings.cache = False. I will see if I can identify the issue at some point.

gutzbenj · 2023-01-31T22:17:42Z

Thanks for the feedback! I just ran it again and again got values:

    station_id          dataset  ...    value quality
0        05490  temperature_air  ...  98630.0     2.0
1        05490  temperature_air  ...  98620.0     2.0
2        05490  temperature_air  ...  98620.0     2.0
3        05490  temperature_air  ...  98620.0     2.0
4        05490  temperature_air  ...  98590.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

TheAnalystx · 2023-02-06T13:56:46Z

@gutzbenj

Hi, a short question: does this code work for you?

from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.HISTORICAL,
        )
# search weather for coordinates:
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
station = r.filter_by_rank(rank=1, latlon=(hannover['latitude'], hannover['longitude']))
df = station.values.all().df
assert not df.empty, 'empty dataframe found'  # <- error
print(df)

Because I receive:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 755, in all
    for result in tqdm(self.query(), total=len(self.sr.station_id), file=tqdm_out):
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 449, in query
    parameter_df = self._collect_station_parameter(station_id, parameter, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 132, in _collect_station_parameter
    date_ranges = self._get_historical_date_ranges(station_id, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 340, in _get_historical_date_ranges
    interval = pd.Interval(from_date_min, to_date_max, closed="both")
  File "pandas\_libs\interval.pyx", line 325, in pandas._libs.interval.Interval.__init__
  File "pandas\_libs\interval.pyx", line 345, in pandas._libs.interval.Interval._validate_endpoint
ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Inte

I tried to create a completely new conda env because of the other error.

As for the other error:
The error is still persisting (I even manually deleted the cache files and set up a new environment with Python 3.10 instead of 3.11). Could you maybe also try a different location?
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
I am not sure if it's now a different buggy location or if I just made a mistake when copy-pasting the buggy location.

gutzbenj · 2023-02-12T21:46:45Z

The given code also throws an error for me. It happens because, for this specific method, we merge the metadata of all requested datasets, and some stations do not exist in all datasets. I'll try to figure out a solution as soon as possible. Until then, try to request the datasets separately!
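
A minimal sketch of what requesting the datasets separately could look like. The helper fetch_each and the function fetch_hannover are made-up names for this example; the wetterdienst calls inside fetch_hannover are the ones already used in this thread, just with a single dataset per request. Catching ValueError is based on the traceback above and is an assumption about which error to expect.

```python
from typing import Any, Callable, Dict, Iterable


def fetch_each(datasets: Iterable[Any], fetch_one: Callable[[Any], Any]) -> Dict[Any, Any]:
    """Run one request per dataset and store either the result or the error."""
    results = {}
    for dataset in datasets:
        try:
            results[dataset] = fetch_one(dataset)
        except ValueError as exc:  # error type seen in the traceback above
            results[dataset] = exc
    return results


def fetch_hannover(dataset):
    """Single-dataset request for the Hannover coordinates from this thread."""
    # Imported lazily so fetch_each stays usable without wetterdienst installed.
    from wetterdienst.provider.dwd.observation import (
        DwdObservationPeriod,
        DwdObservationRequest,
        DwdObservationResolution,
    )

    request = DwdObservationRequest(
        parameter=[dataset],
        resolution=DwdObservationResolution.MINUTE_10,
        period=DwdObservationPeriod.HISTORICAL,
    )
    station = request.filter_by_rank(rank=1, latlon=(52.39197954397832, 9.80360833506706))
    return station.values.all().df
```

It would be called as fetch_each([DwdObservationDataset.TEMPERATURE_AIR, DwdObservationDataset.WIND, DwdObservationDataset.PRECIPITATION], fetch_hannover). Keep in mind that the rank=1 station may differ between datasets, and that historical 10-minute data can be a large download.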

TheAnalystx · 2023-02-14T13:09:49Z

@gutzbenj Thank you for your answer, should I create a different ticket for the issue?

gutzbenj · 2023-02-19T14:51:36Z

@TheAnalystx I think it falls into the same category as the skip_empty issue, so I guess no separate ticket is required.

gutzbenj · 2023-02-19T21:50:59Z

Dear @TheAnalystx ,

I tried working on some improvements regarding the problem of empty data in #889. The main idea is that when you set skip_empty in the settings, we still want to get data for all stations, meaning we iterate over all stations. During collection we increase a counter if a station has enough data; if not, we drop that station and continue with the next one.

There are currently some problems:
1.) If you request multiple datasets, we calculate the availability of data per parameter for all collected parameters of all datasets (e.g. temperature_air, precipitation and other datasets). If a station does not have all datasets available, we normally try to return a completely empty dataset for the missing ones; however, we don't do that if start_date and end_date are not given in the request.

As a result, those parameters are not counted as empty when data availability is calculated, simply because they are not present in the resulting dataframe. Thus the rate of available data tends to be higher than it should be, because empty datasets are not taken into account. This is not the case for parameters in the available datasets, because there the date range of that dataset is used even for completely empty parameters.
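
To illustrate the effect with plain Python (a simplified sketch, not the actual wetterdienst code; availability, mean_availability and the sample values are made up for this example): a parameter that is missing from the collected data only lowers the rate if it is explicitly counted as empty.

```python
from typing import Dict, List, Optional, Sequence


def availability(values: Sequence[Optional[float]]) -> float:
    """Share of non-missing values; an empty series counts as 0.0."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def mean_availability(
    collected: Dict[str, List[Optional[float]]],
    requested: Optional[Sequence[str]] = None,
) -> float:
    """Average availability over `requested`, or over what was collected."""
    names = list(requested) if requested is not None else list(collected)
    if not names:
        return 0.0
    return sum(availability(collected.get(name, [])) for name in names) / len(names)


# Hypothetical station: only the temperature dataset came back, with one gap.
collected = {"temperature_air_mean_200": [1.2, 1.4, None, 1.1]}
requested = ["temperature_air_mean_200", "precipitation_height"]

print(mean_availability(collected))             # only parameters present in the data
print(mean_availability(collected, requested))  # absent parameter counted as empty
```

The first call ignores precipitation_height entirely and therefore reports a higher rate than the second call, which is the overestimation described above.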

The whole topic is quite complex to explain, should we have another chat?

TheAnalystx · 2023-02-24T01:45:00Z

Dear @gutzbenj, yes, let's have another chat. Is there such a functionality on GitHub? It sounds complicated, but it would be a huge boost in accessibility, in my opinion.

gutzbenj · 2023-02-24T22:24:53Z

I found some solution, but if you like we can still have a chat and I may introduce you to the whole library. There's no such functionality but we could just do a skype/whatever call.

gutzbenj · 2023-02-26T20:47:25Z

I just pushed some changes to main, so if you now install the live version from GitHub, the code should work for you.
Just be careful: if you request too many parameters, it will probably run endlessly fetching all station data, because probably only a few stations will have the data.

You can select the criteria for missing values like skip_threshold=0.8 and skip_criteria="min" (or "mean" or "max"),
where "min" would be the lowest availability of all parameters, "mean" would be the average of availabilities of all parameters and "max" the highest availability.

E.g. if you request "precipitation_height" and "temperature_air_mean_200" and have the following availabilities

parameter	perc
precipitation_height	0.7
temperature_air_mean_200	0.9

With the above setting, the station would fail with skip_criteria="min" (because precipitation_height is below the threshold), and the search would continue with the next station. It would pass with skip_criteria="mean" and skip_criteria="max".
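
The rule can be written down as a small standalone sketch. This is not the library's implementation: the function station_passes is made up for this example, and treating a score exactly at the threshold as passing is an assumption taken from the "mean" case above (0.8 against a threshold of 0.8).

```python
import math
from statistics import mean
from typing import Dict

REDUCERS = {"min": min, "mean": mean, "max": max}


def station_passes(
    availabilities: Dict[str, float],
    skip_threshold: float = 0.8,
    skip_criteria: str = "min",
) -> bool:
    """Reduce the per-parameter availabilities and compare with the threshold."""
    score = REDUCERS[skip_criteria](list(availabilities.values()))
    # isclose guards against float rounding when the score sits on the threshold.
    return score > skip_threshold or math.isclose(score, skip_threshold)


availabilities = {"precipitation_height": 0.7, "temperature_air_mean_200": 0.9}
for criteria in ("min", "mean", "max"):
    print(criteria, station_passes(availabilities, skip_threshold=0.8, skip_criteria=criteria))
```

With the availabilities from the table, only "min" rejects the station.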
