Event Based Time Series #229

Open
nsteins opened this issue May 1, 2020 · 6 comments

@nsteins
Contributor

nsteins commented May 1, 2020

Proposing a new class for Traces, EventSeries, for handling data that is a series of timestamps denoting the occurrence of discrete events. For example, this collection of 311 requests in Chicago, where each record is a request with one timestamp for when it was opened and another for when it was closed. This is a good fit for Traces because it is another example of an unevenly spaced time series, and it can use traces.TimeSeries for certain calculations.

An example of how the API might look:

df = pd.read_csv('311_Service_Requests.csv',nrows=10000)
creation = EventSeries(df['CREATED_DATE'].dropna())
completion = EventSeries(df['CLOSED_DATE'].dropna())

An EventSeries could tell you the number of events that occurred between two arbitrary timestamps:

>>> creation.events_between(pd.Timestamp('2018-01-01'),pd.Timestamp('2019-02-01'))
6681
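As a rough illustration of how such a count could work (not the proposed implementation), the sketch below keeps the timestamps sorted and uses binary search. The function name `events_between` mirrors the proposal, but the standalone signature and the inclusive bounds are assumptions.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def events_between(timestamps, start, end):
    """Count events with start <= t <= end; `timestamps` must be sorted."""
    return bisect_right(timestamps, end) - bisect_left(timestamps, start)


# Made-up event times, for illustration only.
events = sorted([
    datetime(2018, 1, 5),
    datetime(2018, 6, 1),
    datetime(2019, 1, 15),
    datetime(2019, 3, 1),
])
count = events_between(events, datetime(2018, 1, 1), datetime(2019, 2, 1))
print(count)  # 3
```

Each query is O(log n) once the timestamps are sorted, which matters when many windows are queried against the same series.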

EventSeries would also have a cumulative sum function, which returns a TimeSeries of the cumulative number of events that have occurred since the first record:

>>> ts = creation.cumsum()
>>> ts.plot()

image
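A minimal sketch of the cumulative count, assuming the result is represented as a plain list of (time, count) steps rather than a traces.TimeSeries. The helper name `cumulative_counts` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def cumulative_counts(timestamps):
    """Return [(time, events so far)] with one step per distinct timestamp."""
    per_time = Counter(timestamps)  # events sharing a timestamp are grouped
    total = 0
    steps = []
    for t in sorted(per_time):
        total += per_time[t]
        steps.append((t, total))
    return steps


# Made-up event times; two events share 2018-01-02.
steps = cumulative_counts([
    datetime(2018, 1, 2),
    datetime(2018, 1, 1),
    datetime(2018, 1, 2),
])
```

Grouping by timestamp first means simultaneous events produce a single step of the right height instead of overwriting each other.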

For events that have an "open" and a "close" timestamp, EventSeries can calculate the number of active open cases:

>>> diff = EventSeries.count_active(creation, completion)
>>> diff.plot()

image
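One way to picture the active-case count is a sweep over +1/-1 deltas: each opening adds one, each closing subtracts one, and a running total gives the number of open cases. The standalone `count_active` below is only a sketch under that assumption, with integer times standing in for timestamps.

```python
from collections import defaultdict


def count_active(opened, closed):
    """Return [(time, number of open cases)] as a step function."""
    deltas = defaultdict(int)
    for t in opened:
        deltas[t] += 1
    for t in closed:
        deltas[t] -= 1
    active = 0
    steps = []
    for t in sorted(deltas):
        active += deltas[t]
        steps.append((t, active))
    return steps


# Any sortable time type works; integers keep the example short.
steps = count_active(opened=[1, 2, 4], closed=[3, 4])
```

Here one case opens and another closes at time 4, so the count stays at 1 across that step.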

Finally, EventSeries can calculate the inter-event arrival times and create visualizations for analysis:

>>> after = creation.time_lag(how='after')
>>> creation.plot_time_lag(how='after')

image
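For reference, inter-event arrival times are just the differences between consecutive sorted timestamps. The helper below is a hypothetical sketch and does not model the `how` parameter from the proposal, whose exact semantics are not spelled out in this thread.

```python
from datetime import datetime, timedelta


def inter_event_gaps(timestamps):
    """Return the time between each event and the next one, in order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Made-up event times, for illustration only.
gaps = inter_event_gaps([
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 11, 0),
    datetime(2020, 1, 1, 9, 30),
])
```

A series of n events yields n - 1 gaps, so a single event (or an empty series) produces an empty list.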

I am already working on implementing this, but I would appreciate feedback and suggestions on the API or features. I am particularly interested in whether this can be extended to support the use case outlined in issue #227.

@johnhaire89

This looks very useful, although I wonder if EventSeries could just be a special case of TimeSeries.
Using your example, each service request might be represented as a TimeSeries with two points.

service_call_event = traces.TimeSeries(default=0)
service_call_event[pd.Timestamp('2019-07-17 11:56:40')] = 1
service_call_event[pd.Timestamp('2019-07-30 13:14:54')] = 0

Suppose you have all service calls in a list named service_call_list, where each event is a TimeSeries with two points. Your count_active function might then be the same as a merge operation:

active_events = traces.TimeSeries.merge(service_call_list, operation=sum)

All that said, I guess that this way of processing the data would be far less efficient than your method.

I have a device that flashes according to a timetable. It reports a "commencement" event when it starts flashing and a "cessation" event when it stops. I'm looking into a way to represent the state on a timeline by creating a TimeSeries for that state and adding a value of 1 for each commencement and a value of 0 for each cessation.
I'm also trying to represent the device's timetable as a time series for the desired state, with a value of 1 for when it should start flashing and 0 for when it should stop flashing. With this method I can use an xor operation to generate a plottable time series of all the times that the desired state didn't equal the actual state.

I like your time_lag function because I want to work out the total amount of time that my actual flashing state didn't match the desired state. However, now that I have a TimeSeries where y=1 whenever the actual state didn't match the desired state, maybe that calculation can be done with an existing operation as well. @devs, does Histogram.total() calculate the area under the curve?
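Independent of what the library offers, the "total mismatch time" described above can be sketched in plain Python. The example assumes each state is a step function stored as sorted (time, value) pairs with a default of 0 before the first point. The names `value_at` and `mismatch_duration` are hypothetical, and numeric times stand in for timestamps.

```python
from bisect import bisect_right


def value_at(steps, t, default=0):
    """Value of a step function given as sorted (time, value) pairs."""
    times = [time for time, _ in steps]
    i = bisect_right(times, t)
    return steps[i - 1][1] if i else default


def mismatch_duration(desired, actual, start, end):
    """Total time in [start, end) during which the two step functions differ."""
    inner = {t for t, _ in desired + actual if start < t < end}
    cuts = sorted({start, end} | inner)
    total = 0
    for left, right in zip(cuts, cuts[1:]):
        # Both functions are constant on [left, right), so test the left edge.
        if value_at(desired, left) != value_at(actual, left):
            total += right - left
    return total


desired = [(0, 1), (10, 0)]  # should flash from t=0 to t=10
actual = [(2, 1), (12, 0)]   # actually flashed from t=2 to t=12
mismatch = mismatch_duration(desired, actual, start=0, end=20)
print(mismatch)  # 4
```

This is the area under the xor of the two signals: the device started two units late and stopped two units late, giving four units of mismatch.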

@nsteins
Contributor Author

nsteins commented Sep 2, 2020

You are correct that you could represent this as a TimeSeries, and in fact, that was my first approach to modeling this kind of data. It's just slow because traces.TimeSeries.merge iterates through the entire SortedDict on every insertion.

@johnhaire89

johnhaire89 commented Sep 3, 2020

Ah. Understood.

I feel like event_series is just a list of events, rather than something that fits into the library.

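A faster way to build a timeseries could be:

ts = traces.TimeSeries(default=0)
# Mark each opening with +1 and each closing with -1.
# Note: events that share a timestamp overwrite one another.
for t in pd.to_datetime(df['CREATED_DATE'].dropna()):
    ts[t] = 1
for t in pd.to_datetime(df['CLOSED_DATE'].dropna()):
    ts[t] = -1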

A cumulative sum function could be an awesome addition to the API:

cumsum_trace = traces.TimeSeries(default=0)
cumsum = 0
# Walk the +1/-1 markers in time order and keep a running total.
for k, v in ts.items():
    cumsum += v
    cumsum_trace[k] = cumsum

As for feature requests, it could be cool if there was a function get_events(self, start_signal, end_signal) that returned a list of "events". Given (key, value) pairs in a time series, each event will have a start (key when value == start_signal) and an end (key when value == end_signal).

@nsteins
Contributor Author

nsteins commented Sep 4, 2020

I think that EventSeries fits in with Traces because it tries to follow a similar design and API to TimeSeries. There are obviously many ways to accomplish this, but I often found myself frustrated trying to accomplish this with pure pandas, and unable to do a lot of the things I wanted to with TimeSeries.

The main difference is that TimeSeries is designed around the model of an irregularly sampled continuous signal. I'm not sure what physical quantity a cumulative sum function would correspond to for a general TimeSeries.

Could you explain the get_events(self, start_signal, end_signal) request a bit more?

@johnhaire89

I think it could be nice to have a function that transforms a timeseries into a list of periods (each with a start and end time or a start time and duration) based on the values.
You can then answer questions like "provide a list of periods where a light was switched on" or, using the shopping cart example from the docs, "provide a list of periods where the user had apples in their cart".
start_signal and end_signal could be functions so that it works on non-numeric traces.
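A minimal sketch of this request, assuming the series is available as time-ordered (time, value) pairs and that the start and end signals are predicate functions. The name `get_periods` and its signature are hypothetical, and the shopping cart data is made up.

```python
def get_periods(items, is_start, is_end):
    """Return [(start, end)] periods from time-ordered (time, value) pairs."""
    periods = []
    start = None
    for t, value in items:
        if start is None and is_start(value):
            start = t  # a period opens
        elif start is not None and is_end(value):
            periods.append((start, t))  # the open period closes
            start = None
    return periods


cart = [
    (1, set()),
    (3, {"apple"}),
    (5, {"apple", "pear"}),
    (8, {"pear"}),
    (9, {"apple"}),
    (11, set()),
]
periods = get_periods(
    cart,
    is_start=lambda contents: "apple" in contents,
    is_end=lambda contents: "apple" not in contents,
)
print(periods)  # [(3, 8), (9, 11)]
```

In this sketch a period that is still open at the end of the series is dropped. Whether it should instead be reported with an open end is a design choice the thread leaves unresolved.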

@ThomDietrich

Hey @nsteins, coming here from #227. Are you working on this? The feedback so far was brief, but I think this would be a great addition to the library, since an EventSeries falls squarely within the task traces tries to solve: handling time series. Having these two main classes side by side makes EventSeries quite logical.
@stringertheory came to the same conclusion in #227.

Is there a timeline for this, or are there questions you still want to discuss? I guess those would be easiest to manage in a preliminary PR.
