Exploratory Data Analysis (Time Series) #197

Open
alanhdu opened this issue Feb 26, 2016 · 4 comments
@alanhdu
Contributor

alanhdu commented Feb 26, 2016

Use the .csv dump to get a sense of the data.

Questions:

  • Are the time series stationary?
  • What kinds of seasonal effects do we see (daily, weekly, monthly, yearly)?
@alexander-yu alexander-yu self-assigned this Feb 26, 2016
@alanhdu alanhdu changed the title Exploratory Data Analysis Exploratory Data Analysis (Time Series) Feb 26, 2016
@alexander-yu
Collaborator

From eyeballing plots of various floors/buildings (IPython notebook link with plots below), I've made a few observations:

  1. There are some pretty notable seasonal effects to the time series, which implies that our time series aren't stationary.

  2. Specifically, we can see seasonality at the daily, weekly, and annual level (I haven't seen any notable patterns on a monthly basis).

    At the daily and weekly level, the relative capacities are pretty predictable. There are many more people around 3 p.m. than in the morning or after 9 p.m., numbers tend to die down near closing hours for the buildings that do close, and there are usually more people in study spaces in the middle of the week than on weekends. Dining halls are just as predictable: there are many more people during normal eating periods, like early afternoon or around 6 p.m., than in the early morning or near closing hours.

    At the annual level, the seasonal effects correspond more closely with the academic calendar: capacities really die down during holidays and breaks, and there are clear peaks (especially in libraries) as midterms/finals approach.

  3. It also turns out that people tend not to stay in buildings much past closing time. For libraries this is expected anyway, since the libraries with actual closing hours tend to get cleared out by staff (personal experience).

However, we can still see some people in buildings (Lerner, for example) past closing time. For some of these plots, though, it's difficult to tell whether it's a handful of people loitering after hours or other devices in the building (like printers/desktops). For example, as the last plot in the IPython notebook shows, on 11/01/14 Avery 2 constantly had some number of devices counted, which I'm guessing are printers or something similar.
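
One way to separate always-on devices from people is to treat the counts seen while a building is closed as a fixed-device floor and subtract it. This is only a rough sketch: the closed window (01:00 to 07:00), the 15-minute sampling, and the synthetic numbers are all assumptions for illustration, not values from the Density data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one week of 15-minute device counts with a constant
# floor of 6 always-on devices (printers, desktops) plus daytime visitors.
idx = pd.date_range("2014-11-01", periods=7 * 96, freq="15min")
hours = idx.hour.to_numpy() + idx.minute.to_numpy() / 60
visitors = np.where(
    (hours >= 8) & (hours < 23),
    40 * np.sin(np.pi * (hours - 8) / 15),
    0.0,
)
counts = pd.Series(6 + np.round(visitors).astype(int), index=idx)

# Assumed closed window (not the real hours of any building).
closed = counts.between_time("01:00", "07:00")

# A low quantile is more robust than the minimum to a single dropped reading.
baseline = closed.quantile(0.05)

# Counts with the fixed-device floor removed, never going below zero.
adjusted = (counts - baseline).clip(lower=0)
```

If the overnight counts hover around a steady nonzero value like this, fixed devices are the more likely explanation; a floor that varies a lot from night to night would point toward people instead.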

Link to IPython Notebook: https://github.com/afy2103/Density-Data-Analysis/blob/master/density.ipynb
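
The daily and weekly patterns in item 2 can also be summarized numerically by averaging counts per hour of the day and per day of the week. A minimal sketch, assuming hourly counts in a pandas Series with a DatetimeIndex; the data here is synthetic and made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts for four weeks: an afternoon peak every day,
# scaled down on weekends. Not real Density data.
idx = pd.date_range("2016-02-01", periods=4 * 7 * 24, freq="h")
hour = idx.hour.to_numpy()
weekday = idx.dayofweek.to_numpy()                     # Monday=0 ... Sunday=6
daily_shape = 50 * np.exp(-((hour - 15) ** 2) / 18.0)  # peak near 3 p.m.
weekend_scale = np.where(weekday >= 5, 0.4, 1.0)
counts = pd.Series(daily_shape * weekend_scale, index=idx)

# Mean count for each hour of the day and each day of the week.
by_hour = counts.groupby(counts.index.hour).mean()
by_weekday = counts.groupby(counts.index.dayofweek).mean()

# Hour of the day with the highest average count.
peak_hour = by_hour.idxmax()
```

Plotting `by_hour` and `by_weekday` as bar charts gives a compact view of the same effects that show up in the raw time series plots.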

@alanhdu
Contributor Author

alanhdu commented Mar 23, 2016

@afy2103 Nice work. I've made a pull request (alexander-yu/Density-Data-Analysis#1) with some technical comments about the analysis. A couple of high-level comments:

  • Would you mind doing either some autocorrelation plots or running a fast Fourier transform so we can pinpoint the seasonality? The peaks of the autocorrelation plot (or the peaks of the FFT magnitude) should indicate what kinds of seasonality we're seeing.
  • Could you highlight the missing data problem? I remember there being almost a month of garbage data, but it doesn't seem to be mentioned in your analysis.
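
As a rough illustration of the FFT suggestion, the sketch below finds the dominant periods in a regularly sampled series using NumPy. The signal is synthetic (an assumed daily cycle plus a weaker weekly cycle), so the numbers say nothing about the real data; it only shows the mechanics.

```python
import numpy as np

# Hypothetical hourly signal, 8 weeks long: a daily cycle, a weaker weekly
# cycle, and noise. A stand-in for one building's counts, not real data.
rng = np.random.default_rng(0)
n = 8 * 7 * 24
t = np.arange(n)  # time in hours
signal = (
    100
    + 30 * np.sin(2 * np.pi * t / 24)    # daily component
    + 10 * np.sin(2 * np.pi * t / 168)   # weekly component
    + rng.normal(0, 2, n)
)

# Remove the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per hour

# Take the two largest peaks (skipping bin 0) and convert frequency to period.
top = np.argsort(spectrum[1:])[::-1][:2] + 1
periods_hours = sorted(1.0 / freqs[top])
```

Here `periods_hours` comes out at roughly 24 and 168 hours. Note that the FFT assumes evenly spaced samples with no gaps, so on real data it would need to be run on a gap-free stretch or after filling the missing readings.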

@alexander-yu
Collaborator

Got it; thanks for the edits/comments. Apologies for the code being hacked together as it was (I should probably get more familiar with the pandas documentation). Should I incorporate the edits from your version of the IPython notebook into mine without merging the commits?

I'll get some autocorrelation plots together, and also address that hole in the data in a later post.

@alexander-yu
Collaborator

Update:

  1. The autocorrelation plots (notebook included in the already linked repo) definitely show the hourly/daily behavior that we can see from eyeballing the plots, though in some cases it's not as statistically significant as we'd like, particularly at the daily level of seasonality. For example, with Butler 3, the autocorrelation plot shows local min/max points at the half-week/full-week points, which is what we would expect, but many of the points are below the threshold for statistical significance (Avery had better results there). The hourly level is better: we can see significance at the 12-hour and 24-hour intervals, which is exactly what we would expect.
  2. The weekly autocorrelation plots show some of the seasonality we'd like to see, but it's even less significant. A lot of this is probably because we simply don't have enough data (the hole in the data doesn't help, since pandas isn't able to compute the autocorrelations past that point).
  3. As for the hole in the data, it seems to be about a month where, for some reason, nothing was recorded. Also interesting: right after that hole ends, a new group of routers called Butler 301 starts recording, separately from Butler 3 (the updated IPython notebook, density.ipynb, shows this). Was this a new group of routers put in by Columbia? It looks like a separate count from the rest of Butler 3, since the capacity recorded in Butler 3 is now significantly lower than it has been historically. Does this also affect Density's current capacity estimations?
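
One possible way around the hole when computing autocorrelations is to split the series at the gap and work within the longest gap-free segment. The sketch below does that and compares a couple of lags against the usual approximate 95% band for white noise, ±1.96/√N. It assumes hourly sampling; the data, the ten-day gap, and the chosen lags are made up for illustration and this is not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts with a 24-hour cycle; a block of readings is
# removed to mimic the hole in the data. Not real Density data.
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=60 * 24, freq="h")
values = (
    50
    + 20 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)
    + rng.normal(0, 3, len(idx))
)
counts = pd.Series(values, index=idx)
counts = counts.drop(counts.index[20 * 24 : 30 * 24])  # ten missing days

# A new segment starts wherever consecutive timestamps are more than one
# sampling step apart.
step = pd.Timedelta(hours=1)
segment_id = (counts.index.to_series().diff() > step).cumsum()
longest = max((seg for _, seg in counts.groupby(segment_id)), key=len)

# Autocorrelation at a few lags, with the approximate 95% band for white
# noise, +/- 1.96 / sqrt(N), where N is the segment length.
band = 1.96 / np.sqrt(len(longest))
acf = {lag: longest.autocorr(lag=lag) for lag in (12, 24)}
significant = {lag: abs(r) > band for lag, r in acf.items()}
```

Since the band shrinks as 1/√N, short segments give a wide band, which is consistent with the weekly-level results above being hard to call significant with the amount of data available.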
