Design of a transformer-based architecture for object detection conditioned on metadata:
- DEtection TRansformer (DETR)
- Vision Transformer (ViT) ???
We develop the following strategies to incorporate metadata information into image processing:
- Baseline (no metadata)
- Early concatenation
- Early summation
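The two conditioning strategies can be sketched as follows. This is a minimal numpy illustration, not the repository's actual code: the tensor shapes, the function names `early_concat`/`early_sum`, and the projection matrix are assumptions. Early concatenation broadcasts the metadata vector over the spatial grid and appends it as extra feature channels; early summation projects the metadata to the feature dimension and adds it to every spatial location.

```python
import numpy as np

# Illustrative shapes (assumptions): a batch of image feature maps
# (B, C, H, W) and a per-image metadata vector (B, M).
B, C, H, W, M = 2, 256, 16, 16, 8
features = np.random.rand(B, C, H, W).astype(np.float32)
metadata = np.random.rand(B, M).astype(np.float32)

def early_concat(features, metadata):
    """Broadcast metadata over the spatial grid and concatenate it
    to the feature channels: output shape (B, C+M, H, W)."""
    b, _, h, w = features.shape
    meta_map = np.broadcast_to(metadata[:, :, None, None],
                               (b, metadata.shape[1], h, w))
    return np.concatenate([features, meta_map], axis=1)

def early_sum(features, metadata, proj):
    """Project metadata to C channels and add it to every spatial
    location: output shape (B, C, H, W)."""
    meta_c = metadata @ proj  # (B, C); proj would be learned in practice
    return features + meta_c[:, :, None, None]

proj = np.random.rand(M, C).astype(np.float32)
print(early_concat(features, metadata).shape)    # (2, 264, 16, 16)
print(early_sum(features, metadata, proj).shape) # (2, 256, 16, 16)
```

Note that concatenation grows the channel dimension (so the downstream projection must accept C+M channels), while summation keeps it fixed at the cost of a learned metadata projection.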
To install the project, simply clone the repository and install the necessary dependencies:
git clone https://github.com/MarcoParola/conditioning-transformer.git
cd conditioning-transformer
mkdir models data
Create and activate a virtual environment, then install the dependencies:
python -m venv env
. env/bin/activate
python -m pip install -r requirements.txt
Next, create a new project on Weights & Biases. Log in and paste your API key when prompted.
wandb login
To perform a training run, set the model parameter:
python train.py model=detr
The model parameter can take one of the following values: detr, early-sum-detr, early-concat-detr, early-shift-detr.
To run inference on the test set and compute metrics, specify the model weight path via the weight parameter (I usually download the weights from wandb and copy them into the checkpoint folder).
python test.py model=detr weight=checkpoint/best.pt
Special thanks to @clive819 for making an implementation of DETR public here, and to @hustvl for the original YOLOS implementation.