Skip to content

Queryset Basics

jimdale edited this page May 3, 2024 · 2 revisions

Queryset Basics

The viewser package has a small, concise API that aims to provide viewsers with a nice "language" for defining data which is both easy to write and easy to read.

The package exposes two top-level imports that are all you need to define, publish and retrieve data:

from viewser import Queryset, Column

Querysets are Python objects, which can be instantiated in scripts or notebook cells.

Defining a new queryset from scratch

To define a queryset, one first imports the Queryset and Column classes

from viewser import Queryset, Column

A new queryset is then created by

new_queryset = (Queryset("queryset_name", "level of analysis"))

Every queryset must have

  • a (meaningful) name.
  • a level of analysis. This defines the level of analysis that all columns in the queryset will be aggregated/disaggregated to, and therefore the level of analysis at which the final dataset will be presented.

The level of analysis must be a string of the form spatial-unit_time-unit, where the spatial-unit must be one of

  • country
  • priogrid

and the time unit must be one of

  • month
  • year

Column objects representing data are added to the queryset using the with_column method:

new_queryset = (Queryset("simple_conflict", "country_month")

           .with_column(Column("ged_sb", from_loa="country_month", from_column="ged_sb_best_sum_nokgi")
                        )
            )

Each call to with_column takes a single Column instance as its argument.

The Column instance in turn defines what the column is to be called and what data will go into it. The first argument to the Column instance is the column name. This again should be meaningful.

Note that, unlike older versions of viewser (<6.0.0), all column names in a queryset must be unique. If two or more columns are given the same name, the queryset will be rejected by the server and an error massage detailing which columns have repeated names will be returned.

The second argument to the Column instance from_loa specifies which level of analysis the requested raw data is defined at. If the user does not know this, they need to examine the database using the viewser CLI.

If the wrong loa is specified, the queryset will be rejected by the server and an error message detailing which columns have been requested from the wrong loa will be returned.

The final argument to the Column instance is the name of the raw column to be fetched from the database. If a non-existant column is requested, the queryset will be rejected by the server and an error message detailing which columns are unavailable will be returned.

Aggregation/disaggregation

The definition of a queryset must include the target level of analysis, at which the resulting data will be presented to the user.

The definition of each column in a queryset must specify the source level of analysis, at which the raw data used to define that column is stored in the database.

If these loas differ for a given column, the necessary aggregation or disaggregation is performed automatically.

If an aggregation is required, the user may choose from the following aggregation functions:

  • sum (default): sums over necessary spatial and time units
  • avg: averages over necessary spatial and time units
  • max: takes the maximum value over necessary spatial and time units
  • min: takes the minimum value over necessary spatial and time units
  • count: counts non-zero values over necessary spatial and time units

If no aggregation function is specified but aggregation is required, the default choice (sum) will be automatically (and silently) selected.

It is up to users to ensure that they select the correct aggregation functions. If something other than the default is required, an aggregation function can be specified by

new_queryset = (Queryset("simple_conflict", "country_month")

           .with_column(Column("ged_sb", from_loa="country_month", from_column="ged_sb_best_sum_nokgi")
                        .aggregate('avg')
                       )
               )

If a non-existent aggregation function is specified, the queryset will be rejected by the server and an error message detailing which columns have incorrect aggregation functions will be returned.

queryset = (Queryset("example_queryset","country_month")
      .with_column(Column("ged_raw","priogrid_month","ged_best_ns")
         .aggregate("sum")
         )
      )

Transforms

Any queryset column may specify an arbitrary number of transforms to be done to the raw data after any necessary aggregation/disaggregation has been done.

Transforms are added to columns using the transform method

new_queryset = (Queryset("simple_conflict", "country_month")

           .with_column(Column("ged_sb", from_loa="country_month", from_column="ged_sb_best_sum_nokgi")
                        .aggregate('avg')
                        .transform.ops.ln()
                        .transform.missing.replace_na()
                       )          
               )

Each transform method has a general transform type, e.g. missing followed by a specific function with brackets for arguments.

A list of available transforms can be obtained using the viewser CLI. A notebook giving examples of what each transform does can be found at https://github.com/prio-data/views3/tree/master/examples.

Note that not all transforms are available at all levels of analysis. If a transform is requested at an innapproprite loa, the queryset will be rejected by the server and an error message detailing which columns have requested incompatible transforms and loas will be returned.

Making a new queryset by merging two or more existing querysets

It is sometimes desirable to make a larger queryset by merging several existing querysets. This can be done with the from_merger method. The method requires at mininum a list of querysets to be merged and a name for the merged queryset. Optionally, a theme and description can also be passed. There is also a boolean verbose flag, described below. For example

querysets_to_merge = ['queryset1','querysets2','queryset3']
merged_queryset = Queryset.from_merger(querysets_to_merge,'my_merged_queryset',theme='my_theme',description='description')

Before merging, some checks are performed. The querysets to be merged must all have the same target LOA. If the querysets to be merged contain two or more columns with the same name, the method checks that all the definitions of that column are exactly the same (same raw data, same transforms with same parameters). If this is the case, one copy of this column is included in the merged queryset (if the verbose flag is True, the method reports that this has been done). If there are multiple definitions of the columns with the same column name, the attempt at merging is aborted.

Recreating a queryset from storage

If a queryset has already been published to the queryset store (see below), the queryset object can be regenerated by doing

queryset = Queryset.from_storage(queryset_name)