Jobs Analysis using Machine Information Extraction (JAMIE) is a tool that aims to monitor and analyse the number of academic jobs, mainly in the UK, that require software skills.
Documentation • Contribution Guidelines • Machine Learning
There is a research software jobs tracker, an instance of Jamie, which tracks software jobs in UK universities.
- **OS.** Any UNIX-based OS can be used to run Jamie. Development was done on Debian 11 (testing, bullseye); Ubuntu 20.04 should work as well.
- **Python.** Development uses Python 3.8, though later versions should work as well.
- **Database.** Jamie uses MongoDB as the backing store for jobs data. Either install MongoDB locally or connect to a MongoDB database by setting a valid MongoDB connection URI (with username and password, if required) in the `JAMIE_MONGO_URI` environment variable. The database uses the name `jobsDB`. If such a database already exists in the MongoDB, then either rename it or set the database name using `jamie config db.name <newname>` (see the example after this list).
- **Setup.** Run `jamie setup`. This (i) checks the database connection, (ii) downloads the NLTK datasets which are needed for text cleaning, and (iii) checks that a training set exists.
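As a concrete example, a minimal database configuration might look like the following; the connection URI and the alternative database name are placeholder values, so substitute your own:

```sh
# Point Jamie at a MongoDB instance (placeholder credentials and host)
export JAMIE_MONGO_URI="mongodb://user:password@localhost:27017"

# Optional: use a different database name if 'jobsDB' is already taken
# ('jobsDB_jamie' is just an example name)
jamie config db.name jobsDB_jamie

# Check the database connection, fetch NLTK data, check for a training set
jamie setup
```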
To install using pip:
```sh
git clone https://github.com/softwaresaved/jamie.git
cd jamie
python3 -m venv .venv
source .venv/bin/activate
pip install .
pip install .[dev,docs]  # For development work
```
The CLI tool `jamie` is a wrapper around the Jamie API (see the documentation).
Working with Jamie is similar to running a standard machine learning pipeline: we
first train a model and use it to predict whether or not jobs are software jobs.
The final step is generating a report.
You can take a look at the detailed workflow and the command line interface help, or read about how we built the model.
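Putting these stages together, a typical end-to-end run uses the commands described in the next section, roughly in this order:

```sh
jamie scrape        # download job postings
jamie load          # load scraped jobs into MongoDB
jamie train         # train a model on the latest training snapshot
jamie predict       # classify jobs using the latest model snapshot
jamie report        # generate a report from the latest prediction snapshot
jamie view-report   # serve the latest report on a local webserver
```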
**Concurrency.** All the steps indicated above with snapshots support
multiple snapshots, and independent snapshots can be worked on concurrently.
Scraping writes to the filesystem and can be run independently of other steps
as well. Prediction requires read access to the database, so running it
concurrently with the load step (which writes to the database) may not work,
or may result in unpredictable behaviour. This could be fixed by making
prediction work from a database snapshot (not currently supported).
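For example, since scraping only writes to the filesystem, it can safely run in the background while a model is trained on an existing snapshot; the snapshot date below is a placeholder:

```sh
# Scrape new postings while training on an existing snapshot (placeholder date)
jamie scrape &
jamie train 2021-05-01
wait   # wait for the background scrape to finish
```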
**Reproducibility.** Training the model should be reproducible, as the random number seed is set automatically where needed. Scraping is inherently non-reproducible, but loading and cleaning the data should be reproducible (not tested yet). Prediction is non-reproducible as it relies on a mutable database, but generating reports from predictions is reproducible.
Detailed usage can be found in the workflow document.
- **Configuration**: Show the configuration using `jamie config`, or set a configuration value using `jamie config <configname> <value>`.
- **Download jobs**: `jamie scrape`
- **Load jobs into MongoDB**: `jamie load`. Pass the option `--dry-run` to test.
- **Training snapshots**: A training snapshot is needed to run the machine learning pipeline. First check that the snapshots folder location (`jamie config common.snapshots`) exists, then copy an existing training set CSV file into the training snapshot location. It should be called `training_set.csv`:

  ```sh
  cd `jamie config common.snapshots`
  mkdir -p training/<date>   # date of snapshot
  cp /path/to/training_set.csv training/<date>
  ```

- **Train the model**: `jamie train [<snapshot>]`. If the snapshot is not specified, the latest snapshot is used.
- **Predict classification**: The previous command creates model snapshots in `<snapshots>/models` of the snapshots location. You can now use these snapshots to make predictions: `jamie predict [<snapshot>]`. This saves the prediction snapshot in `<snapshots>/predictions`.
- **Generate report**: Generate the report corresponding to the prediction snapshot with `jamie report`. The report is created in `<snapshots>/reports` with the same name as the corresponding prediction snapshot. To view the report, run

  ```sh
  # If snapshot not specified, see latest report
  jamie view-report [<snapshot>]
  ```
This will start a local webserver for viewing the report. The report snapshot folder is self-contained and can also be served using standard webservers.
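For example, Python's built-in webserver can serve a report snapshot directly; `<snapshot>` below stands for the prediction snapshot name:

```sh
cd `jamie config common.snapshots`/reports/<snapshot>
python3 -m http.server 8000   # then browse to http://localhost:8000
```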