
JDI-QASP-ml v.2.0

Model

MUI

Our model is a neural network built with the PyTorch framework and organized as follows (a PyTorch sketch is shown after the list):

-> Input linear layer
-> Dropout layer
-> LeakyReLu activation layer
-> Batch normalization layer
-> Hidden linear layer
-> Dropout layer
-> LeakyReLu activation layer
-> Output linear layer
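
A minimal PyTorch sketch of this stack (the feature size, hidden size, number of classes, and dropout probability below are illustrative assumptions, not the actual training configuration):

    # Minimal sketch of the architecture described above.
    # in_features, hidden_size, num_classes and dropout are illustrative assumptions.
    import torch.nn as nn


    class JDIModelSketch(nn.Module):
        def __init__(self, in_features: int, hidden_size: int = 256,
                     num_classes: int = 10, dropout: float = 0.3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_features, hidden_size),   # input linear layer
                nn.Dropout(dropout),                   # dropout layer
                nn.LeakyReLU(),                        # LeakyReLU activation layer
                nn.BatchNorm1d(hidden_size),           # batch normalization layer
                nn.Linear(hidden_size, hidden_size),   # hidden linear layer
                nn.Dropout(dropout),                   # dropout layer
                nn.LeakyReLU(),                        # LeakyReLU activation layer
                nn.Linear(hidden_size, num_classes),   # output linear layer
            )

        def forward(self, x):
            return self.net(x)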

As input for the NN we use the following calculated groups of features:

  • Attributes features (one-hot encoded info about the presence of certain attributes on the object, its parent, and its upper and lower siblings)
  • Class features (TF-IDF encoded info about the class attribute of the object, its parent, and its upper and lower siblings)
  • Type features (one-hot encoded info about the type attribute of the object, its parent, and its upper and lower siblings)
  • Role features (one-hot encoded info about the role attribute of the object, its parent, and its upper and lower siblings)
  • TAG features (one-hot encoded info about the tag of the object, its parent, and its upper and lower siblings)
  • Followers TAG features (TF-IDF encoded info about all tags of the children, or followers in general)
  • Numerical general features (general features of the object, such as the number of followers, number of children, max_depth, etc.)
  • Binary general features (general features with binary values, e.g. whether the object or its parent is hidden, displayed, a leaf, etc.)

Angular

Same model as for MUI.

HTML5

Our model is a decision tree, chosen because of the simplicity of the classic HTML5 element structure.

A picture of the tree can be found in HTML5_model/model/tree.jpeg.
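
For orientation, here is a minimal hedged sketch of how such a tree and its picture can be produced; it assumes scikit-learn, which this README does not name explicitly, and uses placeholder features, labels, and tree parameters:

    # Hedged sketch: scikit-learn is an assumption, and X/y below are random placeholders
    # standing in for the features/labels built from /data/html5_dataset.
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn import tree

    X = np.random.rand(200, 8)                      # placeholder feature matrix
    y = np.random.randint(0, 3, size=200)           # placeholder labels

    clf = tree.DecisionTreeClassifier(max_depth=8)  # illustrative parameter
    clf.fit(X, y)

    fig, ax = plt.subplots(figsize=(24, 12))
    tree.plot_tree(clf, ax=ax, filled=True)
    fig.savefig("HTML5_model/model/tree.jpeg")      # output path taken from this README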

Install environment to train / test model

  1. Clone the repository.
  2. Download and install Anaconda from https://www.anaconda.com/products/individual.
    Alter your PATH environment variable so that you can run python as well as the conda utility.
  3. Create a conda virtual environment using this command (see create-env.bat if you use Windows):
    conda env create -f environment.yml --name jdi-qasp-ml
  4. Run cmd.exe on Windows or a terminal on macOS, and from the command prompt run:
    conda activate jdi-qasp-ml

Generating the dataset for training model

MUI

The generator for MUI element library sites is placed in generators/MUIgenerator/.
To generate sites, go to the MUIgenerator directory and run:

    sh generate_data.sh

After that, in the catalog /data/mui_dataset/build you will find directories named like "site-N".

Next, go to the MUI_model directory and run:

    python build_datasets_for_mui_sites.py

After that, the directory /data/mui_dataset will contain the following structure (a loading sketch is shown after the list):

  • /annotations (not used, may be removed later)
  • /cache-labels (not used, may be removed later)
  • /df - directory with pickles of the per-site datasets
  • /html - directory with the HTML files of the sites (for info only)
  • /images - directory with images of the sites (for info only)
  • classes.txt - file with all possible labels to detect. Do not change it!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes the model takes into account (needed for feature building). Do not change it!
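
For a quick sanity check, the per-site pickles in /df can be loaded with pandas (the file name below is a hypothetical example; actual names depend on the generated sites):

    import pandas as pd

    # Load one generated site dataset; the exact file name is illustrative.
    df = pd.read_pickle("data/mui_dataset/df/site-0.pkl")
    print(df.shape)
    print(df.columns.tolist()[:10])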

Angular

The generator for Angular element library sites is placed in generators/NgMaterialGenerator/.
To generate sites, go to the NgMaterialGenerator directory and run:

    sh generate_data.sh

After that, in the catalog /data/angular_dataset/build you will find directories named like "site-N".

Next, go to the Angular_model directory and run:

    python build_datasets_for_angular_sites.py

After that, the directory /data/angular_dataset will contain the following structure:

  • /annotations (not used, may be removed later)
  • /cache-labels (not used, may be removed later)
  • /df - directory with pickles of the per-site datasets
  • /html - directory with the HTML files of the sites (for info only)
  • /images - directory with images of the sites (for info only)
  • classes.txt - file with all possible labels to detect. Do not change it!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes the model takes into account (needed for feature building). Do not change it!

HTML5

The generator for HTML5 element library sites is placed in generators/HTMLgenerator/.
To generate sites, go to the HTMLgenerator directory and run:

    python generate-html.py

After that, in the catalog /data/html5_dataset/build/ you will find a directory named "html5".

Next, go to the HTML5_model directory and run:

    python build_datasets_for_html5_sites.py

After that, the directory /data/html5_dataset will contain the following structure:

  • /annotations (empty, may be deleted later)
  • /cache-labels (empty, may be deleted later)
  • /df - directory with pickles of the per-site datasets
  • /html - directory with the HTML files of the sites (for info only)
  • /images - directory with images of the sites (for info only)
  • classes.txt - file with all possible labels to detect. Do not change it!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes the model takes into account (needed for feature building). Do not change it!

Train model

MUI

To train the model, go to the /MUI_model directory and run:

    python train.py

If you need to adjust the training parameters, change the following variables used by train.py (placed in vars/mui_train_vars.py); a sketch follows the list:

  • BATCH_SIZE (2048 by default)
  • TRAIN_LEN and TEST_LEN
  • NUM_EPOCHS (2 by default)
  • EARLY_STOPPING_THRESHOLD (2 by default)
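
For illustration, the documented defaults would look roughly like this in vars/mui_train_vars.py (the TRAIN_LEN and TEST_LEN values below are placeholders, not the real defaults):

    # Illustrative sketch of vars/mui_train_vars.py.
    BATCH_SIZE = 2048               # default per this README
    NUM_EPOCHS = 2                  # default per this README
    EARLY_STOPPING_THRESHOLD = 2    # default per this README
    TRAIN_LEN = 100_000             # placeholder value
    TEST_LEN = 20_000               # placeholder value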

At the end of the process, a table with the training metrics is saved to MUI_model/tmp/train_metrics.csv.

Angular

To train the model, go to the /Angular_model directory and run:

    python train.py

HTML5

To train the model, go to the /HTML5_model directory and run:

    python train.py

If you need to adjust the training parameters, change the following variables used by train.py (placed in vars/html5_train_vars.py):

  • TRAIN_LEN and TEST_LEN
  • the parameters of the decision tree

At the end of the process, a table with the training metrics is saved to MUI_model/tmp/train_metrics.csv.

Predicting

To get predictions, we need to run the API (main.py); it is better to do this via Docker, as discussed below. Once the API is running, we can send input JSON data to the following URL:

Validate model

MUI

To validate the models' quality we use test web pages, placed in the directory notebooks/MUI/Test-backend.

You should change only the notebooks whose names end in "new", such as "Test-backend_mui-Buttons_new.ipynb" (the others are legacy, kept for comparison).

In these notebooks we load a specific web page, build a dataset from it, and predict labels for that dataset. Some paths in the notebooks (especially the ports) may need to be corrected.

To use these notebooks, main.py needs to be running or the Docker container needs to be up.

HTML5

To validate the models' quality we use test web pages, placed in the directory notebooks/HTML5/Test-backend.

Docker

Pull the Docker image from GitHub:

RC version

macOS/Linux

    docker rm --force jdi-qasp-ml-api && docker image rm ghcr.io/jdi-testing/jdi-qasp-ml:rc --force && curl --output docker-compose.yaml --url https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/rc/docker-compose-rc.yaml && docker compose up

Windows

    docker rm --force jdi-qasp-ml-api && docker image rm ghcr.io/jdi-testing/jdi-qasp-ml:rc --force && curl.exe --output docker-compose.yaml --url https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/rc/docker-compose-rc.yaml && docker compose up

Development version

macOS/Linux

    docker rm --force jdi-qasp-ml-api && docker image rm ghcr.io/jdi-testing/jdi-qasp-ml:latest --force && curl --output docker-compose.yaml --url https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/develop/docker-compose.yaml && docker compose up

Windows

    docker rm --force jdi-qasp-ml-api && docker image rm ghcr.io/jdi-testing/jdi-qasp-ml:latest --force && curl.exe --output docker-compose.yaml --url https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/develop/docker-compose.yaml && docker compose up

Installing a version from any other repository branch:

Example with branch "branch_name":

Installing for the first time:

  1. Clone the repository to your machine:
    git clone https://github.com/jdi-testing/jdi-qasp-ml.git
  2. After the process finishes, go to the project folder:
    cd jdi-qasp-ml
  3. Check out the branch you need:
    git checkout branch_name
  4. Copy the .env.dist file to .env:
    cp .env.dist .env
  5. Adjust the variables in the .env file to your needs (refer to the Settings section).
  6. Build and start the containers:
    docker-compose -f docker-compose.dev.yaml up --build

The next time you want to run/rerun the containers, use the following commands:

  1. Stop the running containers:
    docker-compose -f docker-compose.dev.yaml down -v
  2. Update the repository with new commits:
    git pull
  3. Restart the containers:
    docker-compose -f docker-compose.dev.yaml up

Settings

  Variable name: SELENOID_PARALLEL_SESSIONS_COUNT
  Description: Total number of parallel Selenoid sessions. It is also used to determine the number of processes used to calculate the visibility of page elements. Set it to the number of parallel threads supported by your processor, or to that number minus 2 if you'd like to reduce CPU load.
  Default value: 4
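
In the .env file this setting is a single line (shown here with the default value from the table above):

    # .env
    SELENOID_PARALLEL_SESSIONS_COUNT=4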

Docker - clean up all unused containers and images:

    docker system prune --all --force

Development:

API service dependencies

New dependencies can be added with the pipenv command:

    pipenv install <package>==<version>

If there are conflicts when creating a new pipenv environment on your local machine, please add the dependencies inside the container:

    docker compose -f docker-compose.dev.yaml run --rm api pipenv install <package>==<version>

API

The available API methods can be seen in Swagger at http://localhost:5050/docs

Websocket commands

The following commands can be sent to the websocket and are processed by the back-end:

1. Schedule XPath generation for an element in some document:

Request sent:

{
    "action": "schedule_xpath_generation",
    "payload": {
        "document": '"<head jdn-hash=\\"0352637447734573274412895785\\">....',
        "id": "1122334455667788990011223344",
        "config": {
            "maximum_generation_time": 10,
            "allow_indexes_at_the_beginning": false,
            "allow_indexes_in_the_middle": false,
            "allow_indexes_at_the_end": false,
        },
    },
}

Response from websocket:

{
    "action": "tasks_scheduled",
    "payload": {"1122334455667788990011223344": "1122334455667788990011223344"},
}
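
As an illustration, this first command can be sent from Python with the websockets library. This is a minimal sketch: the endpoint path ws://localhost:5050/ws is a hypothetical placeholder, and the document/id values are placeholders based on the example above.

    # Minimal sketch of sending the schedule_xpath_generation command.
    # The endpoint path "ws://localhost:5050/ws" is a hypothetical placeholder.
    import asyncio
    import json

    import websockets  # pip install websockets


    async def schedule_xpath_generation():
        async with websockets.connect("ws://localhost:5050/ws") as ws:
            request = {
                "action": "schedule_xpath_generation",
                "payload": {
                    "document": '<head jdn-hash="0352637447734573274412895785">...',
                    "id": "1122334455667788990011223344",
                    "config": {
                        "maximum_generation_time": 10,
                        "allow_indexes_at_the_beginning": False,
                        "allow_indexes_in_the_middle": False,
                        "allow_indexes_at_the_end": False,
                    },
                },
            }
            await ws.send(json.dumps(request))
            response = json.loads(await ws.recv())
            print(response)  # expected: {"action": "tasks_scheduled", ...}


    asyncio.run(schedule_xpath_generation())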

2. Get task status:

Request sent:

{
    "action": "get_task_status",
    "payload": {"id": "1122334455667788990011223344"},
}

3. Get task statuses:

Request sent:

{
    "action": "get_task_status",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346",
        ]
    },
}

4. Revoke tasks:

Request sent:

{
    "action": "revoke_tasks",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346",
        ]
    },
}

Response from websocket:

{
    "action": "tasks_revoked",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346",
        ]
    },
}

5. Get task result:

Request sent:

{
    "action": "get_task_result",
    "payload": {"id": "1122334455667788990011223344"},
}

6. Get task results:

Request sent:

{
    "action": "get_task_results",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346",
        ]
    },
}