243 lines (202 loc) · 9.11 KB

Frames-notes.org

Tasks

[#C] Add a summarize class

Other data analysis toolkits have something like a summary function that computes arguably sensible properties of columns. It shows things like min, max, and mean for numeric columns, or the number of unique values for textual columns.

We may be able to implement a similar utility with a type class whose instance for each column type is a Fold.

For example, suppose we have the following Frame:

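| user id | age | gender | occupation | zip code |
|---------+-----+--------+------------+----------|
|       1 |  24 | M      | technician |    85711 |
|       2 |  53 | F      | other      |    94043 |
|       3 |  23 | M      | writer     |    32067 |
|       4 |  24 | M      | technician |    43537 |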

The summary for a single column like occupation (a categorical variable) might look like this:

| category   | count |
|------------+-------|
| technician |     2 |
| other      |     1 |
| writer     |     1 |

However, the summary for a column like age (an interval variable) could look like this:

| parameter | value |
|-----------+-------|
| min       |    23 |
| max       |    53 |
| mean      |    31 |
| stddev    | 14.67 |
| median    |    24 |

The summary of a whole frame might just go through summaries for each column:

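| entity     | attribute     | value       |
|------------+---------------+-------------|
| user_id    | class         | Num         |
| user_id    | class         | Key         |
| user_id    | min           | 1           |
| user_id    | max           | 4           |
| user_id    | mean          | 2.5         |
| user_id    | stddev        | 1.29        |
| user_id    | median        | 2.5         |
| user_id    | unique        | 4           |
| age        | class         | Interval    |
| age        | class         | Ratio       |
| age        | min           | 23          |
| age        | max           | 53          |
| age        | mean          | 31          |
| age        | stddev        | 14.67       |
| age        | median        | 24          |
| gender     | class         | Categorical |
| gender     | numCategories | 2           |
| occupation | class         | Categorical |
| occupation | numCategories | 3           |
| zip code   | class         | Categorical |
| zip code   | numCategories | 4           |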
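
As a rough sketch of the type class idea, the following defines a tiny stand-in for a Fold type (modelled on the foldl package's, but declared locally so the snippet only needs base and containers) and a hypothetical Summarize class with one instance per column type. The names Summarize, Summary, NumAcc, and Categorical are assumptions made for illustration, not existing Frames API, and stddev and median are left out to keep it short.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- A minimal left fold: a step function, an initial accumulator,
-- and a final extraction (same shape as the foldl package's Fold).
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- Run a fold over a list in a single pass.
fold :: Fold a b -> [a] -> b
fold (Fold step begin done) = done . foldl' step begin

-- Hypothetical summary output: (attribute, value) rows.
type Summary = [(String, String)]

-- Hypothetical class: each column type chooses its own summary fold.
class Summarize a where
  summarize :: Fold a Summary

-- Accumulator for numeric columns: count, min, max, sum.
data NumAcc = NumAcc !Int !Double !Double !Double

-- Numeric columns report min, max, and mean.
instance Summarize Double where
  summarize = Fold step (NumAcc 0 (1 / 0) (-1 / 0) 0) done
    where
      step (NumAcc n lo hi s) x =
        NumAcc (n + 1) (min lo x) (max hi x) (s + x)
      done (NumAcc n lo hi s)
        | n == 0 = []
        | otherwise =
            [ ("min", show lo)
            , ("max", show hi)
            , ("mean", show (s / fromIntegral n))
            ]

-- Hypothetical wrapper marking a textual column as categorical.
newtype Categorical = Categorical String
  deriving (Eq, Ord, Show)

-- Categorical columns report a count per category.
instance Summarize Categorical where
  summarize = Fold step M.empty done
    where
      step m (Categorical c) = M.insertWith (+) c (1 :: Int) m
      done m = [(c, show n) | (c, n) <- M.toList m]
```

With these definitions, `fold summarize [24, 53, 23, 24 :: Double]` gives the min, max, and mean rows for the age column above, and `fold summarize (map Categorical occupations)` gives one count row per occupation. A whole-frame summary would then apply the class method column by column.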

[#B] Support overriding specific columns for type inference

We currently support overriding either all column types or none, but something more granular would be helpful, particularly for categorical variables, where the inference is more likely to be wrong.

[#A] Support types like Either Categorical Double

The idea is that a numeric column could have several “failure modes” that we might like to enumerate.
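
One way to picture this, as a sketch only: the Left side is a small categorical type that enumerates the failure modes, and the Right side is the parsed number. The Missing type, its constructors, and the "NA" and "N/A" sentinels are all assumptions chosen for illustration; real data would dictate its own set.

```haskell
import Data.Either (partitionEithers)
import Text.Read (readMaybe)

-- Hypothetical failure modes for a numeric cell.
data Missing
  = NotAvailable      -- the cell held the sentinel "NA"
  | NotApplicable     -- the cell held the sentinel "N/A"
  | Unparsed String   -- anything else that is not a number
  deriving (Eq, Show)

-- Parse one cell, keeping track of why it failed if it did.
parseCell :: String -> Either Missing Double
parseCell "NA"  = Left NotAvailable
parseCell "N/A" = Left NotApplicable
parseCell s     = maybe (Left (Unparsed s)) Right (readMaybe s)

-- Split a raw column into its failures and its parsed values.
parseColumn :: [String] -> ([Missing], [Double])
parseColumn = partitionEithers . map parseCell
```

Keeping the failures as values means a later summary can count each failure mode separately instead of collapsing them all into a single "missing" bucket.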

[#A] Demonstrate grouping

Show how to perform a fold that performs a computation based on some grouping criteria.
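
A minimal sketch of what such a demonstration could look like, using plain lists and Data.Map rather than Frames records so that it stands alone. The function name groupMean and the tuple rows are assumptions for illustration.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Mean of a numeric field per group, computed in one pass.
-- The accumulator maps each key to a running (sum, count) pair.
groupMean :: Ord k => (r -> k) -> (r -> Double) -> [r] -> M.Map k Double
groupMean key val =
  M.map (\(s, n) -> s / fromIntegral n) . foldl' step M.empty
  where
    step m r = M.insertWith add (key r) (val r, 1 :: Int) m
    add (s1, n1) (s2, n2) = (s1 + s2, n1 + n2)

-- Example rows: (occupation, age).
users :: [(String, Double)]
users =
  [ ("technician", 24)
  , ("other", 53)
  , ("writer", 23)
  , ("technician", 24)
  ]

-- Mean age per occupation.
meanAgeByOccupation :: M.Map String Double
meanAgeByOccupation = groupMean fst snd users
```

The same shape should carry over to a Frame by replacing fst and snd with field accessors for the grouping column and the value column.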

[#A] Revisit Joins

Get a benchmark sorted out.

Debugging inference in the REPL

Some handy examples:

foldCoRec parsedTypeRep (bestRep "r" :: CoRec ColInfo CommonColumns)

> =Definitely Text

foldCoRec parsedTypeRep (bestRep "23" :: CoRec ColInfo CommonColumns)

> =Definitely Int

Stage Restrictions

One area where GHC’s stage restrictions on Template Haskell bite us is in stating how to parse a particular file. The problem is that we parse the file once at compile time to generate type declarations, and again at runtime to read the values. As a reminder, GHC’s stage restriction means that values we pass to a splice must be literals or imported from another module. To be clear, we can’t do this,

x :: Foo
x = foo 23

mySplice x "skidoo"

myData :: RuntimeFoo
myData = readStuff x "skidoo"

This means that a value representing parser options must be imported so that it can be used during both phases. At the moment, the only parser options are defining how columns are separated, and whether or not there is a header row (the absence of a header is indicated by explicitly providing column names). We can capture most of the needed functionality by passing a separator character and a list of strings to the TH splice. This is a slight wart as any further parser options would extend the type of every parsing function. Using a record for options would mean that we could add options without having to change every type signature.
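
To illustrate why a record helps, here is a small stand-alone sketch. The ParseOpts type, its field names, and the readRows function are hypothetical and are not the Frames API; the point is only that readRows keeps the same type signature no matter how many fields the record gains later.

```haskell
-- Hypothetical options record; adding a field later does not
-- change the type of any function that takes a ParseOpts.
data ParseOpts = ParseOpts
  { optSeparator :: Char
  , optHeader    :: Maybe [String]
    -- ^ Just names: the file has no header row, use these names.
    --   Nothing: take the names from the first row.
  } deriving (Eq, Show)

-- Comma-separated with a header row.
defaultOpts :: ParseOpts
defaultOpts = ParseOpts { optSeparator = ',', optHeader = Nothing }

-- Split a line on a separator character (no quoting support).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (cell, [])       -> [cell]
  (cell, _ : rest) -> cell : splitOn sep rest

-- Return the column names and the data rows for some input lines.
readRows :: ParseOpts -> [String] -> ([String], [[String]])
readRows opts ls =
  case optHeader opts of
    Just names -> (names, map split ls)
    Nothing ->
      case ls of
        []     -> ([], [])
        h : rs -> (split h, map split rs)
  where
    split = splitOn (optSeparator opts)
```

A caller overrides only what it needs with record update syntax, for example `readRows (defaultOpts { optSeparator = '|' }) inputLines`. Because of the stage restriction, a value like defaultOpts would have to live in a separate, imported module to be usable inside a splice.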

A Benefit to Duplication?

Another drawback of passing parsing options as literals is that it exacerbates another problem: repeating the name of the file to be parsed. Specifically, we need to provide the name once for the Template Haskell splice that produces all the relevant declarations, and again for the runtime code that reads the data file. A minor advantage of this duplication is that we can provide a clean model file for the type declarations and a separate, lower-quality data file that we want to analyze. This offers a way to infer tighter types than the noisy data would allow, so that malformed records can more easily be discarded when they fail to parse at the specific type.

Options

To be concrete, if we do not use a record for parser options, we could always pass the unpacked parser options wherever they are needed.

tableTypesOpt '|' ["name", "age", "occupation"] "Users" "data/users.dat"

userData :: Producer Users IO ()
userData = readTableOpt '|' ["name", "age", "occupation"] "data/users.dat"

The duplication of the column names is atrocious. We could declare all Users-related types and values, and the definition of userData at once to avoid repeating ourselves, but this seems like it might become an unwieldy splice.

The best choice is for the splice to declare a value usersParser that readTableOpt could then use. This works out quite nicely.

Prettying TH Splice Dumps

At the GHCi repl

> :set -XQuasiQuotes -XTemplateHaskell
> import Language.Haskell.TH
> putStrLn $(tableTypes "base" "base.csv" >>= stringE . show . ppr_list)
> :set -XOverloadedStrings -XQuasiQuotes -XTemplateHaskell
> import Data.Char
> import Data.List
> import Frames
> import Frames.CSV
> let stripModule = until (\w -> length w == 1 || not ("." `isInfixOf` w)) (tail . dropWhile isAlpha)
> let onWords f xs = takeWhile isSpace xs ++ unwords (map f (words xs))
> putStrLn . unlines . map (onWords stripModule) $ lines $(tableTypes' (rowGen "data/ml-100k/u.user") {rowTypeName = "User", columnNames = ["user id", "age", "gender", "occupation", "zip code"], separator = "|"} >>= stringE . show . ppr_list)

Using ghc-mod and some elisp helpers

Dumping the definitions created by the TH splices results in a pretty unreadable mess. Here’s how to use these functions to clean things up:

  1. Evaluate the three elisp definitions below
  2. Hit C-c C-e to get ghc-mod to evaluate all splices
  3. Copy the contents of the *GHC Info* buffer to somewhere like your *scratch* buffer (because *GHC Info* is read-only)
  4. Run M-x pretty-splices in that buffer

(defun replace-stringf (from to)
  (goto-char (point-min))
  (while (search-forward from nil t)
    (replace-match to nil t)))

(defun replace-regexpf (from to)
  (goto-char (point-min))
  (while (re-search-forward from nil t)
    (replace-match to nil nil)))

(defun pretty-splices ()
  (interactive)
  ;; Fix newlines
  (replace-stringf (rx (char ?\0)) "
")
  ;; Unqualify names
  (replace-stringf "GHC.Types.:" "':")
  (replace-stringf "Data.Text.Internal." "")
  (replace-stringf "Data.Text." "T.")
  (replace-stringf "GHC.Types.Int" "Int")
  (replace-stringf "GHC.Base." "")
  (replace-stringf "Frames.Col." "")
  (replace-stringf "Data.Proxy." "")
  (replace-stringf "Data.Vinyl.TypeLevel." "")
  (replace-stringf "Data.Vinyl.Core." "")
  (replace-stringf "Frames.Rec." "")
  (replace-stringf "Data.Vinyl.Lens." "")
  (replace-stringf "Frames.CSV.ParserOptions" "ParserOptions")

  ;; Erase inferrable type
  (replace-regexpf "(Frames.TypeLevel.RIndex .*?)" "")

  ;; Make `:->' infix
  (replace-regexpf (rx (sequence "(:->) \""
                                 (group (0+ (not (in "\""))))
                                 "\" "
                                 (group (0+ (not (in " "))))))
                   "\"\\1\" :-> \\2")

  ;; Make `:' infix
  (replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") '[])"))
                   "[\\1]")
  ;; Repeat so that nested conses (up to 10 deep) collapse into one list
  (dotimes (_ 10)
    (replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") ["
                                   (group (0+ (not (in "]")))) "])"))
                     "[\\1, \\2]"))

  ;; Newline before top-level type signature
  (replace-regexpf "^    [^ ]+ ::" "
\\&")
  ;; Newline before single-line type synonym definitions
  (replace-regexpf "^    type [^ ]+ = [^ ]+.*$" "
\\&"))

Removing INLINE pragmas

These may be hurting compile times while not helping runtime performance. I’ll be looking at the benchdemo executable.

| Code                     | Compile (s) | Run (s) |
|--------------------------+-------------+---------|
| vinyl-0.13.3 with INLINE |         8.8 |    0.37 |
| vinyl-0.13.3 no INLINE   |         8.0 |    0.36 |
| vinyl-0.14.0 no INLINE   |        10.8 |    0.38 |