Use plan notes #18

Open
MichaelCurrin opened this issue Jun 2, 2020 · 2 comments

@MichaelCurrin (Owner)

From plans.txt

This was written before the project used git, so some of it may no longer be relevant.

@MichaelCurrin (Owner, Author)

Steps:
    Export to Google Drive or sync with Dropbox.
    Pick up the updated file.
    Clean and prepare the CSV. The activities column holds multiple values, which is better handled in Pandas than in SQL (see the sketch after this list).
    Import into the db. Load into a staging table and then update or insert based on timestamp, or simply delete everything first, since the CSV always covers the full date range.
    Produce reports.
    Email a report weekly.
    When a condition is met, send a notification: a web request to Maker so that IFTTT can push a notification to a mobile phone, or an e-mail (e.g. via Gmail). Only do this when new data is entered or for warnings.
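
As referenced above, a minimal sketch of the clean-and-load step. The file names, the `date`/`mood`/`activities` column names and the `" | "` separator are assumptions about the export layout:

```python
import sqlite3

import pandas as pd

CSV_PATH = "daylio_export.csv"  # Assumed name of the synced export file.
DB_PATH = "dayliopy.db"         # Assumed database path.

df = pd.read_csv(CSV_PATH)

# Split the multi-value activities column into one 0/1 column per activity.
# The " | " separator is an assumption about the export format.
activities = df["activities"].fillna("").str.get_dummies(sep=" | ")
clean = pd.concat([df.drop(columns=["activities"]), activities], axis=1)

# The export always covers the full date range, so replace rather than upsert.
with sqlite3.connect(DB_PATH) as conn:
    clean.to_sql("cleaned", conn, if_exists="replace", index=False)
```

Replacing the whole table on each run keeps the load idempotent and avoids the staging-table upsert entirely.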

The default mood values are "rad", "good", "med", "fugly" and "awful".
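
For reporting and modelling, those labels need a numeric score. The mapping below is an assumed convention (5 = best, 1 = worst), not something defined by the export itself:

```python
# Assumed numeric scores for the default mood labels, from best to worst.
MOOD_SCORES = {"rad": 5, "good": 4, "med": 3, "fugly": 2, "awful": 1}


def mood_score(label: str) -> int:
    """Convert a mood label to its numeric score."""
    return MOOD_SCORES[label.strip().lower()]
```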

Reporting:
    On demand local view
    Email report
    Warnings
    SQLite queries
    Command-line results
    Native Python graphs (see the sketch after this list)
    Browser graphs
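
A minimal version of the on-demand local view with native Python graphs, assuming the `cleaned` table from the earlier sketch and that the cleaning step added a numeric `mood_score` column (all assumed names):

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

with sqlite3.connect("dayliopy.db") as conn:
    df = pd.read_sql(
        "SELECT date, mood_score FROM cleaned ORDER BY date",
        conn,
        parse_dates=["date"],
    )

# Plot the daily average mood as a quick local view.
daily = df.groupby(df["date"].dt.date)["mood_score"].mean()
daily.plot(title="Average mood per day")
plt.ylabel("Mood score")
plt.show()
```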

Data structures:
    Pandas is nice for cleaning, normalisation and encoding for ML.
    SQL is nice for aggregating (see the example after this list).
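
A small illustration of the aggregation side in SQL, again assuming a `cleaned` table with `date` and `mood_score` columns and dates stored as ISO-8601 text (so that `strftime` can extract the month):

```python
import sqlite3

MONTHLY_MOOD = """
SELECT strftime('%Y-%m', date) AS month,
       ROUND(AVG(mood_score), 2) AS avg_mood,
       COUNT(*) AS entries
FROM cleaned
GROUP BY month
ORDER BY month;
"""

with sqlite3.connect("dayliopy.db") as conn:
    for month, avg_mood, entries in conn.execute(MONTHLY_MOOD):
        print(f"{month}  avg_mood={avg_mood}  entries={entries}")
```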

SQL strategy:
    Reading a CSV straight into SQLite sets all columns to TEXT affinity, so numerical data is stored as text. See https://www.sqlite.org/datatype3.html
    Alternatively, a CSV can be read into a pre-created table with the correct column types. However, the table's column names are then hardcoded, unlike the approach above. (See the sketch after this list.)
    The data is simple enough to get into SQL without needing an ORM. If an ORM were used, it might be possible to pick up column names dynamically from the CSV and create attributes on the table, though this would be inconvenient.
    Ultimately, the view queries will have to reference the hardcoded column names anyway, and each column has a particular meaning.
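
As mentioned above, a sketch of the pre-created-table approach, which avoids everything landing as TEXT. The table name, column names and types here are assumptions:

```python
import sqlite3

import pandas as pd

CREATE_ENTRIES = """
CREATE TABLE IF NOT EXISTS entries (
    date TEXT PRIMARY KEY,
    mood_label TEXT NOT NULL,
    mood_score INTEGER NOT NULL,
    note TEXT
);
"""

df = pd.read_csv("cleaned.csv")  # Assumed path to the prepared CSV.

with sqlite3.connect("dayliopy.db") as conn:
    conn.execute(CREATE_ENTRIES)
    # Append into the typed table created above, so the declared types are kept.
    df[["date", "mood_label", "mood_score", "note"]].to_sql(
        "entries", conn, if_exists="append", index=False
    )
```

The trade-off, as noted above, is that the column names end up hardcoded in both the CREATE statement and the load step.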


Time frames:
    Last day entered, taken as the max of the entered dates.
    The actual current day.
    History for the current month or week. Project to the end of the period at the current rate, and flag whether a target will be, or has already been, exceeded.
    Or for the past 4 weeks or 7 days, compared against the previous period (see the rolling-window sketch after this list).
    Warn if a target has been exceeded, or will be exceeded at the current rate.
    Set weekly and monthly targets. Use the same targets for rolling 7-day or 4-week periods. See the Wallet app.
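
The rolling-window comparison referenced above could look like this with pandas, assuming the `cleaned` table and a numeric `mood_score` column (assumed names):

```python
import sqlite3

import pandas as pd

with sqlite3.connect("dayliopy.db") as conn:
    df = pd.read_sql(
        "SELECT date, mood_score FROM cleaned",
        conn,
        parse_dates=["date"],
        index_col="date",
    )

# One value per day, then rolling 7-day and 4-week (28-day) averages.
daily = df["mood_score"].resample("D").mean()
print("7-day average: ", daily.rolling("7D").mean().iloc[-1])
print("28-day average:", daily.rolling("28D").mean().iloc[-1])

# Compare the last 7 days against the previous 7 days.
current = daily.iloc[-7:].mean()
previous = daily.iloc[-14:-7].mean()
print(f"Last 7 days: {current:.2f} (previous 7 days: {previous:.2f})")
```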

ML:
    OLS or other regression methods are ways of setting weights to predict mood based on states and actions.
    Note that sometimes mood will drive an action, rather than the other way around.
    Also, the previous day's average mood (or the last known mood) should be considered and will probably contribute strongly (see the sketch after these notes).
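
A sketch of feeding the last known mood in as a lagged feature before fitting OLS; the table and column names are assumptions carried over from the earlier sketches:

```python
import sqlite3

import pandas as pd
import statsmodels.api as sm

with sqlite3.connect("dayliopy.db") as conn:
    daily = pd.read_sql(
        "SELECT * FROM cleaned", conn, parse_dates=["date"], index_col="date"
    )

# The previous entry's mood is expected to carry a strong signal, so shift the
# target by one row to use the last known value as a predictor.
daily["prev_mood"] = daily["mood_score"].shift(1)
daily = daily.dropna(subset=["prev_mood"])

# Keep numeric columns only (labels and notes are not usable by OLS directly).
features = daily.drop(columns=["mood_score"]).select_dtypes(include="number")
model = sm.OLS(daily["mood_score"], sm.add_constant(features)).fit()
print(model.summary())
```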


This repo's [dayliopy]() application is for users of the Daylio diary app who want to go beyond the app's built-in reporting and gain a better understanding of their experiences and emotions. The aim of this project is to use analysis and reporting to improve self-awareness and provide insights which can drive action.

The features are implemented with the following processes:
- [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load)
- On-demand reporting
    1. Handle user request.
    2. Fetch data.
    3. Process and report on data.
- Automated reporting and notifications
    1. Run a weekly or daily job on a schedule.
    2. Check that the alert condition is met (or always run, if no condition is set).
    3. Send an alert (see the sketch after this list).
      - e.g. Send a request with a message to _Maker_, so that _IFTTT_ can trigger a mobile alert.
      - e.g. Send a text or HTML e-mail using a configured Gmail account.
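
A sketch of the alert step, assuming an IFTTT Webhooks (Maker) applet with an event name of `dayliopy_alert` and the key kept in an `IFTTT_WEBHOOK_KEY` environment variable (both names are assumptions):

```python
import os

import requests

IFTTT_KEY = os.environ["IFTTT_WEBHOOK_KEY"]   # Assumed environment variable.
IFTTT_EVENT = "dayliopy_alert"                # Assumed applet event name.


def send_alert(message: str) -> None:
    """Trigger an IFTTT Webhooks (Maker) event so IFTTT can push a mobile alert."""
    url = f"https://maker.ifttt.com/trigger/{IFTTT_EVENT}/with/key/{IFTTT_KEY}"
    response = requests.post(url, json={"value1": message}, timeout=10)
    response.raise_for_status()


if __name__ == "__main__":
    send_alert("Weekly mood target exceeded")
```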


### Components

There are several reporting components, with different uses and levels of complexity.

- Terminal reporting. _TO BE COMPLETED_
- Visual reporting. _TBC_
- A dashboard view over various time frames. _TBC_
- Measurement of totals or rates against goals. _TBC_
- Automated alerts based on custom conditions. _TBC_
- Forecasting. _TBC_



OLS

    # TODO: check accuracy.
    # TODO: split test and train data.
    # Consider using historical data to predict recent data.

    import pandas as pd
    import statsmodels.api as sm

    # df and mood_normalized are assumed to be prepared earlier: df holds the
    # numeric feature columns and mood_normalized holds the scaled mood target.

    # Try variations of single-feature models.
    explore = []
    for col in df.columns:
        model = sm.OLS(mood_normalized, df[col]).fit()
        explore.append([col, model.rsquared])
        predictions = model.predict(df[col])
        # print(model.summary())

    e = pd.DataFrame(explore, columns=('field', 'rsquared'))
    e.set_index('field', inplace=True)
    # Express R squared as a whole-number percentage.
    e.loc[:, 'rsquared'] = (e['rsquared'] * 100).round().astype(int)
    print(e.sort_values('rsquared', ascending=False))
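
The split-test-and-train TODO could be handled with a chronological split, fitting on older rows and scoring on the most recent ones. A sketch, reusing the `df` and `mood_normalized` objects assumed above:

```python
import numpy as np
import statsmodels.api as sm

y = np.ravel(mood_normalized)  # Ensure a 1-D target.

# Chronological 80/20 split: older rows train, the most recent rows test.
split = int(len(df) * 0.8)
X_train, X_test = df.iloc[:split], df.iloc[split:]
y_train, y_test = y[:split], y[split:]

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
predictions = np.asarray(model.predict(sm.add_constant(X_test)))

# Out-of-sample R squared, rather than the in-sample value reported above.
ss_res = ((y_test - predictions) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
print("Out-of-sample R squared:", 1 - ss_res / ss_tot)
```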

===

    # Count null and set values in the note column.
    print(df['note'].isnull().value_counts())
    # Unique moods.
    print(df['mood_label'].value_counts())
    exit()


    # Scale the timestamp and mood columns (scale() is defined in the notes below).
    timestamp_standardized, timestamp_scaler = scale(df[['timestamp']])
    df['timestamp_standardized'] = timestamp_standardized

    mood_standardized, mood_scaler = scale(df[['mood_score']])
    # Note: this next line overrides the scaled values with the raw column.
    mood_standardized = df[['mood_score']]


===

@MichaelCurrin (Owner, Author)

Scale notes




# Required for MinMaxScaler.
scipy==1.1.0

from sklearn.preprocessing import MinMaxScaler


    def scale(values):
        """Standardise numeric input data to a fit between 0.0 and 1.0.

        @param values: list of input values, as a column of CSV values.

        @return: list of transformed values.
        @return scaler: MinMaxScaler instance, fitted to the input values.
            This can be used to reverse scaled values later, using
            `scaler.inverse_transform(normalized)`.
        """
        scaler = MinMaxScaler()
        scaler = scaler.fit(values)

        return (scaler.transform(values), scaler)
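
A possible usage, showing how the returned scaler reverses the transform (the toy data and column name are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mood_score": [1, 3, 5, 2, 4]})

normalized, scaler = scale(df[["mood_score"]])
print(normalized.ravel())                    # Values scaled into the 0.0-1.0 range.
print(scaler.inverse_transform(normalized))  # Back on the original 1-5 scale.
```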


Large numeric values such as year could be scaled in order to avoid
a warning when printing a model summary.
    [2] The condition number is large, .... This might indicate that there are
    strong multicollinearity or other numerical problems.
A comment in this post helped.
https://stats.stackexchange.com/questions/243000/cause-of-a-high-condition-number-in-a-python-statsmodels-regression
    My guess is that this is not related to collinearity but to the scaling of
    the explanatory variables, i.e. an eigenvalue is large because the variable
    is large, and some variables, like constant, are small. (statsmodels
    reports the condition number of the design matrix and not of a standardized
    design matrix.)
Though, the model's coefficient for year can then only be interpreted after
reversing the scaling using the fitted scaler object.
Therefore the simplest solution is to one-hot encode year, assuming
there will only be a handful of unique year values (see the sketch below).
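
A sketch of that one-hot encoding with pandas; the toy frame and column names are just for illustration:

```python
import pandas as pd

# Toy frame standing in for the cleaned data; only the date column matters here.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2019-12-30", "2020-01-02", "2020-01-05"])}
)

# One-hot encode the year, dropping the first level so the dummy columns are
# not perfectly collinear with the model's constant term.
year_dummies = pd.get_dummies(df["date"].dt.year, prefix="year", drop_first=True)
df = pd.concat([df, year_dummies], axis=1)
print(df)
```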
