
handle faulty DST logic with repeated indices #104
Open
wholmgren opened this issue Mar 30, 2021 · 4 comments

wholmgren (Member) commented Mar 30, 2021

I have a file in which the data appears not to use DST, but there's a complication: the transition to DST does occur at 2 am, and then an hour later the data falls back to 3 am. The effect is that the 3 am hour is repeated in the data.

I speculate that this happens with a non-trivial number of data loggers, so it would be nice if pvanalytics could flag it and fix it. index.is_monotonic and index.duplicated() seem like good places to start, but I don't have a complete suggestion for how to implement it.

See example in the details tag below, and find the file in the zip archive below that.

df.index[96598:96598+60*1+4]

DatetimeIndex(['2020-03-08 01:59:00', '2020-03-08 03:00:00',
               '2020-03-08 03:01:00', '2020-03-08 03:02:00',
               '2020-03-08 03:03:00', '2020-03-08 03:04:00',
               '2020-03-08 03:05:00', '2020-03-08 03:06:00',
               '2020-03-08 03:07:00', '2020-03-08 03:08:00',
               '2020-03-08 03:09:00', '2020-03-08 03:10:00',
               '2020-03-08 03:11:00', '2020-03-08 03:12:00',
               '2020-03-08 03:13:00', '2020-03-08 03:14:00',
               '2020-03-08 03:15:00', '2020-03-08 03:16:00',
               '2020-03-08 03:17:00', '2020-03-08 03:18:00',
               '2020-03-08 03:19:00', '2020-03-08 03:20:00',
               '2020-03-08 03:21:00', '2020-03-08 03:22:00',
               '2020-03-08 03:23:00', '2020-03-08 03:24:00',
               '2020-03-08 03:25:00', '2020-03-08 03:26:00',
               '2020-03-08 03:27:00', '2020-03-08 03:28:00',
               '2020-03-08 03:29:00', '2020-03-08 03:30:00',
               '2020-03-08 03:31:00', '2020-03-08 03:32:00',
               '2020-03-08 03:33:00', '2020-03-08 03:34:00',
               '2020-03-08 03:35:00', '2020-03-08 03:36:00',
               '2020-03-08 03:37:00', '2020-03-08 03:38:00',
               '2020-03-08 03:39:00', '2020-03-08 03:40:00',
               '2020-03-08 03:41:00', '2020-03-08 03:42:00',
               '2020-03-08 03:43:00', '2020-03-08 03:44:00',
               '2020-03-08 03:45:00', '2020-03-08 03:46:00',
               '2020-03-08 03:47:00', '2020-03-08 03:48:00',
               '2020-03-08 03:49:00', '2020-03-08 03:50:00',
               '2020-03-08 03:51:00', '2020-03-08 03:52:00',
               '2020-03-08 03:53:00', '2020-03-08 03:54:00',
               '2020-03-08 03:55:00', '2020-03-08 03:56:00',
               '2020-03-08 03:57:00', '2020-03-08 03:58:00',
               '2020-03-08 03:59:00', '2020-03-08 03:00:00',
               '2020-03-08 03:01:00', '2020-03-08 03:02:00'],
              dtype='datetime64[ns]', name='Time', freq=None)

FSEC_RTC_Weather_2020.csv.zip
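
For illustration, the same flaw can be reproduced and surfaced with those two properties on a tiny synthetic index (the construction below is just an illustration, not the file itself):

import pandas as pd

# minute data that jumps from 01:59 to 03:00 and then repeats the 03:00 hour
idx = (pd.date_range('2020-03-08 01:57', '2020-03-08 01:59', freq='1min')
       .append(pd.date_range('2020-03-08 03:00', '2020-03-08 03:59', freq='1min'))
       .append(pd.date_range('2020-03-08 03:00', '2020-03-08 03:02', freq='1min')))

idx.is_monotonic_increasing        # False: the second 03:00 breaks the ordering
idx.duplicated()                   # True at the repeated 03:00-03:02 stamps
idx[idx.duplicated(keep=False)]    # both copies of each repeated timestamp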

wholmgren (Member Author)

Here's my half-baked solution:

import numpy as np
import pandas as pd


def fix_st_to_dst(df):
    # assume the only duplicated timestamps occur at the ST to DST transition,
    # i.e. the hour after the spring-forward jump appears twice in the index
    duplicated = df.index.duplicated()
    num_duplicates = duplicated.sum()
    # position of the first repeated timestamp (the second 03:00 above)
    first_duplicate_index = np.argmax(duplicated)
    # the block to shift back is the same length and immediately precedes it
    first_shift_point = first_duplicate_index - num_duplicates
    # shift the first copy of the repeated hour back by one hour
    # (03:00-03:59 becomes 02:00-02:59), restoring standard time
    new_hour = \
        df.index[first_shift_point:first_duplicate_index] - pd.Timedelta('1h')
    new_hour_df = df.iloc[first_shift_point:first_duplicate_index].copy()
    new_hour_df.index = new_hour
    df_fixed = pd.concat([
        df.iloc[:first_shift_point],
        new_hour_df,
        df.iloc[first_duplicate_index:]
    ])
    return df_fixed
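
A quick sanity check on a synthetic index shaped like the one above (minute data with a repeated 03:00 hour; illustration only, not part of the proposal):

import numpy as np
import pandas as pd

good = pd.date_range('2020-03-08 00:00', '2020-03-08 01:59', freq='1min')
shifted = pd.date_range('2020-03-08 03:00', '2020-03-08 03:59', freq='1min')
rest = pd.date_range('2020-03-08 03:00', '2020-03-08 05:00', freq='1min')
index = good.append(shifted).append(rest)
df = pd.DataFrame({'ghi': np.arange(len(index))}, index=index)

fixed = fix_st_to_dst(df)
# the first 03:00-03:59 block is shifted back to 02:00-02:59, so the index
# becomes strictly increasing with no repeated timestamps
assert fixed.index.is_monotonic_increasing
assert not fixed.index.duplicated().any()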

cwhanse (Member) commented Mar 31, 2021

Interesting, and weird that the "spring forward" duplicated the 3-4 am hour, as if the clock was reset to ST after one hour. I wonder if these data have already been modified in an attempt to fix time stamp issues, since a similar flaw doesn't appear on 1-Nov-2020.

I could talk to people at the data source and find out why this all happened. That may help us understand how to structure a useful function for pvanalytics.

The code above is different from the pattern in quality.time: the existing functions return a Series of bool, while this function would return a DataFrame with a corrected DatetimeIndex. I'm not opposed to the code you propose; I only wanted to make this difference explicit for the discussion here.
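
For comparison, a flag-style variant following that Series-of-bool pattern might look something like the sketch below (the function name is hypothetical, not an existing pvanalytics function):

import pandas as pd

def spring_forward_repeats(series):
    # bool Series aligned with the data: True where the timestamp repeats an
    # earlier one, leaving any correction to the caller
    return pd.Series(series.index.duplicated(keep='first'), index=series.index)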

wholmgren (Member Author)

> I wonder if these data have already been modified in an attempt to fix time stamp issues, since a similar flaw doesn't appear on 1-Nov-2020.

I assumed it was happening within the datalogger, but yes, you may be right.

> The code above is different from the pattern in quality.time.

I don't have a preference; this is just what I hacked together for this arbiter insert script. I'd like to say I quickly hacked it together, but alas, it was not so.

A pvanalytics implementation following my approach might want to restrict the duplicates search to specific days.
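
For example, a rough sketch of that restriction, assuming US transition rules (second Sunday in March) and a tz-naive minute index; the function name is just illustrative:

import pandas as pd

def duplicates_on_spring_forward_day(index):
    # candidate spring-forward dates for the years covered by the index
    transition_dates = []
    for year in index.year.unique():
        march = pd.date_range(f'{year}-03-01', f'{year}-03-31')
        transition_dates.append(march[march.weekday == 6][1].date())  # 2nd Sunday
    on_transition_day = pd.Index(index.date).isin(transition_dates)
    # only duplicates that fall on a plausible transition day are flagged
    return index.duplicated() & on_transition_day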

DanRiley

I can confirm that these errors are not happening within the datalogger, at least, not in this case. Here, the logger remains in local standard time (UTC-5) and does not change. So I believe the errors in the referenced file are due to an error in the system that exported the data from the database.

However, I have seen plenty of cases where a logger or data acquisition system switches from standard time to daylight time (or the other way), so I think it's probably a good idea to have a routine adept at correcting the issue. It should be trivial if the UTC offset is included in the time stamp, but perhaps more difficult (and perilous!) if the UTC offset is not available with each time stamp.
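
For example, when the offset is present, parsing it removes the ambiguity entirely (hypothetical values):

import pandas as pd

raw = ['2020-11-01 01:30:00-04:00',   # EDT, before the fall-back
       '2020-11-01 01:30:00-05:00']   # EST, after the fall-back
idx = pd.to_datetime(raw, utc=True)   # 05:30Z and 06:30Z: no duplicates
local = idx.tz_convert('US/Eastern')  # still distinct, tz-aware local times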
