Implement a function to get data pertaining to more than 2 parameters #807

Open
mirandasaari1 opened this issue Oct 30, 2018 · 8 comments

@mirandasaari1

Extend the function _getMeasuredPPData, implemented in https://github.com/stoqs/stoqs/blob/master/stoqs/contrib/analysis/init.py, which gets the measured data when given two parameters. Extending it to get all parameters, or a given list of parameters, for a given platform will make more data and features available when exploring and modeling the output data. This can be vital to improving the performance of a machine learning algorithm.
The goal is to get this data into a pandas DataFrame, or similar, so that there is an easier base to work with when implementing further machine learning algorithms.
@MBARIMike, @bretstine, @markmocek and I will be exploring this issue further as part of Fall Capstone 2018.

@MBARIMike
Contributor

MBARIMike commented Nov 4, 2018

This will be an important addition to the STOQS code base!

I think what we'd like to enhance is the createLabels() function of the classify.py program, which calls a slightly different method in __init__.py: _getPPData(). When this method is called, only the MeasuredParameter IDs from its return value are used. If I interpret the desire of this Issue correctly, here is a list of functional requirements:

  • Produce a table of any number of MeasuredParameter data values from a Platform
  • Return MeasuredParameter IDs along with the data so that new labels can be added to the DB
  • Be able to constrain selection based on time or depth range
  • Be able to constrain selection based on the value range of a MeasuredParameter

If I recall correctly, _getPPData() is a generalized improvement over _getMeasuredPPData() in that it reuses methods already developed for the UI and allows passage of a pvDict dictionary that holds any number of MeasuredParameter value constraints to the selection.

The already developed code for the UI constructs raw SQL statements that execute self-join statements in order to retrieve multiple Parameters for plotting in the Parameter-Parameter section of the UI. This code would be difficult to extend. Perhaps we can take a fresh approach to get the data in a suitable format for exploration and modeling using Machine Learning techniques.
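As a rough illustration of the requirements listed above, here is a minimal sketch that applies time, depth, and value-range constraints to a long-format pandas table instead of raw SQL. Everything in it is an assumption made for illustration: the function name select_rows, the column names, the value_ranges dictionary (a hypothetical stand-in for pvDict), and the made-up sample rows. It is not existing STOQS code.

```python
import pandas as pd


def select_rows(df, time_range=None, depth_range=None, value_ranges=None):
    """Filter long-format rows (one row per MeasuredParameter value).

    value_ranges maps a parameter name to an inclusive (min, max) tuple.
    It is a hypothetical stand-in for pvDict, not the real structure.
    """
    mask = pd.Series(True, index=df.index)
    if time_range is not None:
        start, end = (pd.Timestamp(t) for t in time_range)
        mask &= df["time"].between(start, end)
    if depth_range is not None:
        mask &= df["depth"].between(*depth_range)
    for name, (low, high) in (value_ranges or {}).items():
        # Find the measurements where this parameter is in range, then
        # keep every parameter row belonging to those measurements.
        in_range = (df["parameter"] == name) & df["datavalue"].between(low, high)
        mask &= df["measurement_id"].isin(df.loc[in_range, "measurement_id"])
    return df[mask]


# Made-up rows for illustration only; "id" plays the role of the
# MeasuredParameter ID that would be needed to attach labels later.
rows = pd.DataFrame({
    "measurement_id": [1, 1, 2, 2, 3, 3],
    "time": pd.to_datetime(
        ["2013-09-17 18:42:20"] * 2
        + ["2013-09-17 18:42:18"] * 2
        + ["2013-09-17 18:42:16"] * 2
    ),
    "depth": [0.5, 0.5, 10.0, 10.0, 20.0, 20.0],
    "parameter": ["temperature", "salinity"] * 3,
    "datavalue": [13.99, 33.64, 12.50, 33.70, 11.00, 33.80],
    "id": [101, 102, 103, 104, 105, 106],
})

subset = select_rows(
    rows,
    depth_range=(5.0, 25.0),
    value_ranges={"temperature": (12.0, 14.0)},
)
print(subset)
```

Only measurement 2 satisfies both constraints here, so both of its rows (temperature and salinity) are returned together with their IDs.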

@MBARIMike
Contributor

Here's a start on a fresh approach, a Django query that gets the first 20 data values from dorado:

(venv-stoqs) [vagrant@localhost stoqsgit]$ stoqs/manage.py shell_plus
...

In [1]: mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(
   ...:                 measurement__instantpoint__activity__platform__name='dorado')
   ...:

In [2]: for i, mp in enumerate(mps[:20]):
   ...:     if i == 0:
   ...:         print("time, depth, latitude, longitude, parameter__name, measuredparameter__datavalue")
   ...:     print(f"{mp.measurement.instantpoint.timevalue}, {mp.measurement.depth:.2f},"
   ...:           f" {mp.measurement.geom.y:.6f}, {mp.measurement.geom.x:.6f},"
   ...:           f" {mp.parameter.name}, {mp.datavalue}")
   ...:
time, depth, latitude, longitude, parameter__name, measuredparameter__datavalue
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, sigmat, 25.1383576072121
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, spice, 0.830712889765499
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, altitude, 1395.68956636994
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, temperature, 13.9910522171992
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, salinity, 33.6403972259011
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, oxygen, 5.670288605996
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, nitrate, 0.21
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, bbp420, 0.00231458255927606
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, bbp700, 0.00228426640768986
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, fl700_uncorr, 0.000823624706576738
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, biolume, 194666664.695293
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, roll, -4.08951048388392
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, pitch, -0.105888989907026
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, yaw, 175.513420572358
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, sepCountList, None
2013-09-17 18:42:20, -0.03, 36.734970, -122.128144, mepCountList, None
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, sigmat, 25.1403727711047
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, spice, 0.829269194464183
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, altitude, 1395.49904668803
2013-09-17 18:42:18, -0.04, 36.734989, -122.128162, temperature, 13.9828055034561

Maybe there's a way to pivot an output like this to get the data in a format amenable to analysis in Pandas?
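One way this pivot might look in pandas is sketched below, using a few of the rows printed above. The column names and the choice of pivot_table with aggfunc="first" are assumptions for illustration; "first" presumes at most one value per parameter per measurement, and missing parameters simply become NaN.

```python
import pandas as pd

# A handful of the long-format rows printed above (one row per value).
long_df = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-09-17 18:42:20", "2013-09-17 18:42:20", "2013-09-17 18:42:20",
        "2013-09-17 18:42:18", "2013-09-17 18:42:18",
    ]),
    "depth": [-0.03, -0.03, -0.03, -0.04, -0.04],
    "latitude": [36.734970, 36.734970, 36.734970, 36.734989, 36.734989],
    "longitude": [-122.128144, -122.128144, -122.128144,
                  -122.128162, -122.128162],
    "parameter__name": ["sigmat", "temperature", "salinity",
                        "sigmat", "temperature"],
    "datavalue": [25.1383576072121, 13.9910522171992, 33.6403972259011,
                  25.1403727711047, 13.9828055034561],
})

# One row per measurement, one column per parameter name.
wide = long_df.pivot_table(
    index=["time", "depth", "latitude", "longitude"],
    columns="parameter__name",
    values="datavalue",
    aggfunc="first",
).reset_index()
wide.columns.name = None  # drop the leftover "parameter__name" axis label

print(wide)
```

The result has one row per time/depth/position and the columns salinity, sigmat and temperature, which is closer to the feature-matrix layout most machine learning libraries expect.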

@mirandasaari1
Author

@MBARIMike that definitely looks like the direction we were trying to go in. Maybe "extension" of an existing function was the wrong way to word things given we would be starting fresh. Thank you for making that clarification.

@MBARIMike
Contributor

Also, Pandas has a DataFrame.from_records() method that will import Django data into a data frame, e.g.:

In [1]: import pandas as pd

In [2]: mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(
   ...:                 measurement__instantpoint__activity__platform__name='dorado')
   ...:

In [3]: df = pd.DataFrame.from_records(mps.values(
   ...:     'measurement__instantpoint__timevalue', 'measurement__depth',
   ...:     'measurement__geom', 'parameter__name', 'datavalue', 'id'
   ...:     ))
   ...:

In [4]: df.head(20)
Out[4]:
       datavalue       id  measurement__depth                         measurement__geom measurement__instantpoint__timevalue parameter__name
0   2.476802e+01  5664562           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          sigmat
1   1.262683e+00  5673227           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49           spice
2   2.546787e+01  5690556           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49        altitude
3   1.582349e+01  5577911           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49     temperature
4   3.367453e+01  5629901           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49        salinity
5   6.593205e+00  5586576           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          oxygen
6   5.360300e+02  5595241           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49         nitrate
7   9.528316e-03  5603906           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          bbp420
8   6.610731e-03  5612571           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49          bbp700
9   4.761394e-04  5621236           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    fl700_uncorr
10  9.728126e+09  5638566           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49         biolume
11 -1.292509e+01  5647231           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49            roll
12 -6.497791e+00  5655896           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49           pitch
13  5.802254e+01  5664561           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49             yaw
14           NaN  5690705           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    sepCountList
15           NaN  5691417           -0.055507    [-121.934897431052, 36.90470983771924]                  2013-09-16 20:55:49    mepCountList
16  2.476093e+01  5664563           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47          sigmat
17  1.270436e+00  5673228           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47           spice
18  2.544076e+01  5690555           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47        altitude
19  1.585611e+01  5577910           -0.082238  [-121.93492018129153, 36.90469289678784]                  2013-09-16 20:55:47     temperature
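Since the 'id' column is needed to add labels back to the DB, a possible next step is to build two aligned wide tables from a frame like this: one of data values and one of MeasuredParameter IDs. The sketch below is only an illustration under assumptions: the records list mimics what mps.values(...) might yield (a few rows copied from the output above, geometry omitted), and the 15.85 threshold is arbitrary.

```python
import pandas as pd

# Dictionaries shaped like the output of mps.values(...); values are
# copied from the rounded display above, with measurement__geom left out.
records = [
    {"measurement__instantpoint__timevalue": "2013-09-16 20:55:49",
     "measurement__depth": -0.055507, "parameter__name": "sigmat",
     "datavalue": 24.76802, "id": 5664562},
    {"measurement__instantpoint__timevalue": "2013-09-16 20:55:49",
     "measurement__depth": -0.055507, "parameter__name": "temperature",
     "datavalue": 15.82349, "id": 5577911},
    {"measurement__instantpoint__timevalue": "2013-09-16 20:55:47",
     "measurement__depth": -0.082238, "parameter__name": "sigmat",
     "datavalue": 24.76093, "id": 5664563},
    {"measurement__instantpoint__timevalue": "2013-09-16 20:55:47",
     "measurement__depth": -0.082238, "parameter__name": "temperature",
     "datavalue": 15.85611, "id": 5577910},
]
df = pd.DataFrame.from_records(records)

keys = ["measurement__instantpoint__timevalue", "measurement__depth"]
indexed = df.set_index(keys + ["parameter__name"])

# Two tables with identical index and columns: values for modeling,
# IDs for writing labels back to the database afterwards.
values = indexed["datavalue"].unstack("parameter__name")
ids = indexed["id"].unstack("parameter__name")

# Example: IDs of the temperature values above an arbitrary threshold.
label_ids = ids.loc[values["temperature"] > 15.85, "temperature"].tolist()
print(label_ids)
```

Because both tables share the same index and columns, any row or cell selected during analysis of values can be looked up directly in ids.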

@mirandasaari1
Author

So instead of manipulating x and y as loadLabeledData does, we could write a new function with this code and return the pandas data frame. Would you suggest adding to classify.py to do this or creating a new file?

@MBARIMike
Contributor

MBARIMike commented Nov 5, 2018

I suggest creating a new file for now. Perhaps it could be a Jupyter Notebook that demonstrates an analysis.

@mirandasaari1
Author

So looking at classify.py, would we need to construct a process_command_line() function for this new file?

@MBARIMike
Contributor

We'd need to understand the functional requirements better; perhaps a new option (or implementation of an aspirational option already in classify.py) is an approach. I'd like to see a Jupyter Notebook demonstration - that will help us decide.

stoqs pushed a commit that referenced this issue Jan 7, 2020

2 participants