
TM031 Automated Grading Support


Automated Grading Support

Goal: define and implement a (JavaScript) interface that can run an assignment's test suites against a set of implementations, exporting data as needed.

(Scroll down to #Motivation for the original beginning of the document.)

TODO List as of 16 April 2017

This is a list of all the things that should go into pyret-lang. Everything else can, and therefore will, exist outside of Pyret.

  • Finish check results API.
    • Add tests to the checker-api branch (see open pull request: #997).
    • I was initially waiting on Joe or Ben to comment on the proposed interface before doing so.
  • Get shared-gdrive imports working from the command line.
    • I believe most of the work is already done on the httplib branch.
  • Add command-line option to specify a local directory to serve as the source of my-gdrive imports.
    • Haven't done any work for this, but it should be a relatively straightforward addition.

After all that is done, I envision usage looking like this:

To evaluate a student implementation, run something like

$ make foo-tests-ta.jarr
$ node foo-tests-ta.jarr --my-gdrive student_alpha@brown.edu/final/ --run-full-report > student_alpha_impl.json

To evaluate a student test suite, run

$ make student_alpha@brown.edu/sweep/foo-tests.jarr
$ node student_alpha@brown.edu/sweep/foo-tests.jarr --my-gdrive foo-ta-resources/ --run-full-report > student_alpha_test.json

From there, the JSON data can be processed outside of Pyret, and contains all the data one would want in order to assign grades.
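
As a rough illustration of that post-processing step, here is a TypeScript (Node) sketch that tallies results from the per-student JSON files. The Report shape used below (a list of check blocks with pass/fail counts and an error flag) is an assumption, not the actual --run-full-report format; that format will come out of the check results API work above.

// process-reports.ts -- sketch only; the Report shape below is assumed,
// not the actual output format of --run-full-report.
import { readFileSync, readdirSync } from "fs";

// Assumed shape of one exported report.
interface CheckBlock {
  name: string;
  passed: number;
  failed: number;
  errored: boolean;  // e.g. a timeout or runtime error in this block
}
interface Report {
  student: string;
  blocks: CheckBlock[];
}

function summarize(report: Report): string {
  const passed = report.blocks.reduce((n, b) => n + b.passed, 0);
  const total = report.blocks.reduce((n, b) => n + b.passed + b.failed, 0);
  const errors = report.blocks.filter((b) => b.errored).length;
  return `${report.student}: ${passed}/${total} tests passed, ${errors} erroring block(s)`;
}

// Summarize every *_impl.json produced by the runs above.
for (const file of readdirSync(".").filter((f) => f.endsWith("_impl.json"))) {
  const report: Report = JSON.parse(readFileSync(file, "utf8"));
  console.log(summarize(report));
}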

Motivation

The pedagogy that Brown's CS019 and CS173 have adopted involves having students hand in two files: foo-code.arr and foo-tests.arr. The former is an implementation of some specified functions, and may contain implementation-dependent tests, while the latter contains implementation-independent tests. Evaluating a submission involves both checking foo-code.arr for correctness, by running the staff's test suite against it, and checking foo-tests.arr for its ability to classify incorrect implementations, by running it against one known-correct implementation ("gold") and some number of known-buggy implementations ("coals").

As a result, for each assignment there are (a) many runs of Pyret that need to happen, and (b) a lot of data to collect.

Spec

Suppose student submissions are from Captain Teach, and exporting gives you the following directory structure:

submissions/
├── student_alpha@brown.edu/
│   ├── sweep/
│   │   └── foo-tests.arr
│   └── final/
│       ├── foo-code.arr
│       └── foo-tests.arr
├── student_beta@brown.edu/
│   ├── sweep/
│   │   └── foo-tests.arr
│   └── final/
│       ├── foo-code.arr
│       └── foo-tests.arr
├── ...

Then, the input would be (a sketch of a corresponding configuration interface follows the list):

  • submissions directory: submissions/ :: DirectoryIdentifier
  • the sub-directory: "final" :: String
  • implementation name: "foo-code.arr" :: String
  • test name: "foo-tests.arr" :: String
  • the staff test suite: foo-tests-ta.arr :: FileIdentifier
  • the staff gold: foo-gold.arr :: FileIdentifier
  • the staff coals: [foo-coal-1.arr, foo-coal-2.arr] :: List<FileIdentifier>, or coals/ :: DirectoryIdentifier
  • timeout: x-minutes :: Time
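
For concreteness, here is a sketch of how these inputs might be expressed in the (JavaScript/TypeScript) interface. The type and field names are illustrative only; the optional fields reflect the "not require all these arguments" point below.

// Sketch of a grading-run configuration; names are illustrative, not fixed.
type DirectoryIdentifier = string;  // e.g. a Drive folder id or local path
type FileIdentifier = string;       // e.g. a Drive file id or local path

interface GradingConfig {
  submissionsDir: DirectoryIdentifier;                  // submissions/
  subDirectory: string;                                 // "final" or "sweep"
  implementationName: string;                           // "foo-code.arr"
  testName: string;                                     // "foo-tests.arr"
  staffTestSuite?: FileIdentifier;                      // foo-tests-ta.arr
  staffGold?: FileIdentifier;                           // foo-gold.arr
  staffCoals?: FileIdentifier[] | DirectoryIdentifier;  // coal files, or coals/
  timeoutMinutes: number;                               // x-minutes
}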

From there, it should (see the loop sketch after this list):

  • For each $student_email, run foo-tests-ta.arr where its import my-gdrive("foo-code.arr") resolves to submissions/$student_email/final/foo-code.arr
  • For each $student_email, for each $staff_impl in [foo-gold.arr, foo-coal-1.arr, foo-coal-2.arr], run submissions/$student_email/final/foo-tests.arr, where its import my-gdrive("foo-code.arr") resolves to $staff_impl
  • Any time a Pyret run takes longer than x-minutes, halt that run, report a timeout error, and move on.
  • Output organized data
  • Not require all these arguments. E.g. when grading sweeps, we can skip the first step.
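
A minimal sketch of that outer loop, assuming the relevant .jarr files have already been compiled (as in the make invocations above) and that the envisioned --my-gdrive and --run-full-report options exist. Staff implementations are staged under the name foo-code.arr in a scratch directory so the student's my-gdrive("foo-code.arr") import resolves to them; execFileSync's timeout option provides the x-minutes cutoff.

// run-grading.ts -- sketch of the grading loop; file names and the
// --my-gdrive / --run-full-report options follow the envisioned usage above.
import { execFileSync } from "child_process";
import { copyFileSync, mkdirSync, readdirSync, writeFileSync } from "fs";

const TIMEOUT_MS = 10 * 60 * 1000;  // the "x-minutes" cutoff
const staffImpls = ["foo-gold.arr", "foo-coal-1.arr", "foo-coal-2.arr"];

function runOrTimeout(jarr: string, gdriveDir: string, outFile: string): void {
  try {
    const out = execFileSync(
      "node",
      [jarr, "--my-gdrive", gdriveDir, "--run-full-report"],
      { timeout: TIMEOUT_MS }
    );
    writeFileSync(outFile, out);
  } catch (e) {
    // Timeouts and crashes are recorded as errors; grading moves on.
    writeFileSync(outFile, JSON.stringify({ error: String(e) }));
  }
}

for (const student of readdirSync("submissions")) {
  // Step 1: staff test suite against the student's implementation.
  runOrTimeout("foo-tests-ta.jarr", `submissions/${student}/final/`, `${student}_impl.json`);

  // Step 2: the student's test suite against each staff implementation,
  // staging that implementation as foo-code.arr in a scratch directory.
  for (const impl of staffImpls) {
    mkdirSync("scratch", { recursive: true });
    copyFileSync(impl, "scratch/foo-code.arr");
    runOrTimeout(
      `submissions/${student}/final/foo-tests.jarr`,
      "scratch/",
      `${student}_${impl}.json`
    );
  }
}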

Optionally, it could:

  • Output summarized grade data for each student, based on some specified grading heuristic (an illustrative heuristic is sketched after this list).
  • Enforce internal consistency: create a "submission" with foo-gold.arr and foo-tests-ta.arr, make sure that "submission" gets a 100% score.
  • Collect data about external consistency: for each $student_email, run submissions/$student_email/final/foo-tests.arr where its import my-gdrive("foo-code.arr") resolves to submissions/$student_email/final/foo-code.arr.
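
To make "some specified grading heuristic" concrete, one illustrative rule (not course policy) could be: a test suite scores zero unless it accepts the gold, and otherwise earns the fraction of coals it flags. The result shape below is assumed, matching the earlier sketch.

// Sketch of one possible heuristic for scoring a student's test suite.
// A run is considered "accepting" if no checks failed or errored.
interface SuiteRunResult {
  implementation: string;  // "foo-gold.arr", "foo-coal-1.arr", ...
  failed: number;
  errored: boolean;
}

function suiteScore(runs: SuiteRunResult[]): number {
  const accepts = (r: SuiteRunResult) => r.failed === 0 && !r.errored;
  const gold = runs.find((r) => r.implementation === "foo-gold.arr");
  const coals = runs.filter((r) => r.implementation !== "foo-gold.arr");
  if (!gold || !accepts(gold)) return 0;                   // must accept the gold
  const caught = coals.filter((r) => !accepts(r)).length;
  return coals.length === 0 ? 1 : caught / coals.length;   // fraction of coals flagged
}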

Requirements

  • Check results API
  • Ability to have import my-gdrive("foo-code.arr") resolve to a specific, chosen replacement for foo-code.arr.
  • Ability to have shared-gdrive imports resolve correctly from the command line.

Desirable

  • Awareness of and/or integration with Captain Teach, including awareness of and robustness against common hand-in issues.
  • Web interface. There's some work on the grade branch of code.pyret.org, which was able to get the job done this semester. It wasn't great, but it worked.