GCB 535 Challenge

We (Casey Greene, Ben Voight) teach GCB 535 at Penn. The class as a whole is computational biology for biologists. This portion of the class aims to give students an introduction to machine learning, as well as hands-on practice with machine learning methods.

In this game, we try to build and accurately assess a predictor. This repository hosts the challenge for individuals outside of our class. Feel free to play along with us.

Structure

We'll provide two different datasets (D1 and D2), each with 5000 examples. We've randomly partitioned each dataset into subsets of 2000, 1000, 1000, and 1000 examples, numbered S1, S2, S3, and S4 respectively. Thus D1_S1.csv is a comma-separated file of 2000 examples from the first dataset. The data have 200 features. The final column is the class label that we expect you to predict.
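As a minimal sketch of reading this layout, the snippet below splits each row into features and a label taken from the final column. The function name `read_examples` and the `has_header` flag are our own illustration, not part of example.py, and we assume the feature columns are numeric; whether the files carry a header row is not stated here, so check before relying on the default.

```python
import csv
import io


def read_examples(handle, has_header=False):
    """Parse CSV text into (features, labels); the last column is the label."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip a header row if the file has one
    features, labels = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1].strip())
    return features, labels


# Tiny in-memory stand-in for a file such as data/D1_S1.csv (made-up values).
sample = io.StringIO("0.5,1.2,0\n0.1,3.4,1\n")
X, y = read_examples(sample)
print(len(X), len(X[0]), y)
```

With a real file you would pass an open handle instead, e.g. `with open("data/D1_S1.csv", newline="") as handle: X, y = read_examples(handle)`. If the file matches the description above, that should yield 2000 rows of 200 features each.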

The initial repository contains the first sets of 2000 (S1) and 1000 (S2) examples for each dataset. The subsets (S1, S2, S3, S4) within a dataset (e.g. D1) should be comparable. We'll provide an S3 that contains an additional 1000 examples for each dataset on Wednesday, April 6th. We'll also provide an S4 at that time in a predict subfolder. This one will have the labels stripped. You may use these examples however you wish (e.g. combine and cross-validate). The final metrics that we're interested in are prediction accuracy on the final subset (S4) as well as your ability to predict your accuracy on the held-out data.
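One way to estimate your accuracy before the S4 labels are released is k-fold cross-validation over the labeled subsets. The sketch below is only an illustration under our own assumptions: it uses the standard library alone, a nearest-centroid classifier as a placeholder model, and made-up toy data in place of the real files. None of the function names come from example.py, and you would swap in whatever model you actually use.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Placeholder model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict_centroids(centroids, X):
    """Assign each row to the class with the closest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(centroids[c], row)) for row in X]


def cross_val_accuracy(X, y, k=5, seed=0):
    """Return the per-fold accuracies from k-fold cross-validation."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        predicted = predict_centroids(model, [X[i] for i in fold])
        correct = sum(p == y[i] for p, i in zip(predicted, fold))
        scores.append(correct / len(fold))
    return scores


# Made-up, well-separated toy data standing in for the combined labeled rows.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50
scores = cross_val_accuracy(X, y, k=5)
print(sum(scores) / len(scores))
```

The mean of the per-fold scores is a candidate for the accuracy you report, and the spread across folds gives a rough sense of how far off that estimate might be.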

We now provide an example (example.py) showing the format of a move in the game that we expect students to submit. Hopefully this provides a starting point for those of you attempting the challenge!

gcb535challenge
│   README.md
│   example.py
│
├───data
│   │   D1_S1.csv
│   │   D1_S2.csv
│   │   D1_S3.csv
│   │   D2_S1.csv
│   │   D2_S2.csv
│   │   D2_S3.csv
└───predict
    │   D1_S4.csv
    │   D2_S4.csv

We'll release the third set of examples (D1_S3.csv and D2_S3.csv) at the time of our class on Wednesday, April 6. At this time, we'll also release the final prediction sets with labels stripped (D1_S4.csv and D2_S4.csv). If you participate, we'd love to hear what you expect your accuracy to be (the class labels are binary). We'll release the final labels just before or just after class on April 8th at 10AM EST.

If you want to make predictions, fork this repository. Make sure your predictions are committed and pushed by April 8th at 10AM EST. Alongside your predictions, provide an estimate for the performance that you expect to see on the independent validation data.
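When you state an expected performance, it can help to remember that accuracy measured on a finite set is itself noisy. As a rough sketch, assuming each prediction is an independent trial with the same chance of being correct (a simplification), the normal approximation to the binomial gives a range within which the observed accuracy would usually fall. The estimate of 0.80 below is made up for illustration, and `accuracy_interval` is a hypothetical helper, not something the challenge requires.

```python
import math


def accuracy_interval(expected_accuracy, n_samples, z=1.96):
    """Approximate range for observed accuracy under a binomial model."""
    se = math.sqrt(expected_accuracy * (1 - expected_accuracy) / n_samples)
    low = max(0.0, expected_accuracy - z * se)
    high = min(1.0, expected_accuracy + z * se)
    return low, high


# 0.80 is a made-up estimate; 1000 is the stated size of each S4 set.
low, high = accuracy_interval(0.80, 1000)
print(f"expect roughly {low:.3f} to {high:.3f}")
```

Under these assumptions, an observed S4 accuracy a couple of points away from your estimate is still consistent with a good estimate.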

Please ask clarifying questions, and we'll try to update this README to address the questions.
