RL framework for IoT protocols

Based on the paper: RL-IoT: Reinforcement Learning to Interact with IoT Devices

This framework automatically learns the semantics of the protocol of a generic IoT device in the shortest possible time, using Reinforcement Learning (RL) techniques.

This RL framework implements 4 RL algorithms:

  • SARSA
  • Q-learning
  • SARSA(λ)
  • Q(λ) (Watkins's version)

RL is used to automate the interaction with the IoT devices present in the local network. For these algorithms, we assume there exists a dataset of valid protocol messages for different IoT devices. However, we have no further knowledge of the semantics of such command messages, nor of whether a particular device would accept them. This dataset is stored in a dictionary inside our framework.
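For example, for the Yeelight protocol (the first protocol targeted, see below), such a dictionary could look roughly like the following sketch; the keys and the structure are assumptions made for illustration, while the JSON payloads follow the public Yeelight LAN control protocol:

# Illustrative sketch of a command dictionary; keys and structure are assumptions,
# the payloads follow the public Yeelight LAN control protocol.
yeelight_commands = {
    "set_power_on": {"id": 1, "method": "set_power",  "params": ["on", "smooth", 500]},
    "toggle":       {"id": 2, "method": "toggle",     "params": []},
    "set_bright":   {"id": 3, "method": "set_bright", "params": [50, "smooth", 500]},
    "get_prop":     {"id": 4, "method": "get_prop",   "params": ["power", "bright"]},
}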

The framework currently provides a first component based on the Yeelight protocol.

Table of Contents

  1. Introduction to the project
    1. Motivation
    2. Definitions
    3. Current work
  2. Features
  3. How to use?
    1. Structure
    2. Output
    3. Workflow
    4. Plots (screenshots)
  4. Demo
  5. Tests
  6. Contribute
  7. Authors
  8. License
  9. Acknowledgments

Introduction to the project

Motivation

The number of IoT devices in IT systems is growing exponentially, and most of them are custom devices: they rely on proprietary protocols that are often closed or poorly documented. Here we want to interact with such devices by learning their protocols autonomously.

Definitions

  • state of an IoT device: represented by some properties specific to that device.
  • state-machine of a protocol: multiple series of states linked by one or more sequences of commands. These commands can be exchanged through that protocol to complete a predefined task.
  • task: a path inside the state-machine. A sequence of commands can change the state of the IoT device along a certain path inside the state-machine, i.e., complete a task.
  • RL algorithms iterate over 2 nested loops: an outer loop over episodes and an inner loop over time steps t (a generic sketch of this structure is shown below).
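As a rough illustration only (not the framework's actual code), the two nested loops and the tabular updates of SARSA and Q-learning can be sketched as follows; Q is the state-action value matrix, alpha the learning rate, gamma the discount factor and epsilon the exploration probability, while env is a hypothetical environment object:

import random

# Illustrative sketch: generic tabular SARSA / Q-learning loop structure.
# "env" is hypothetical: env.actions is the list of available commands,
# env.reset() returns the initial state, env.step(a) returns (next_state, reward, done).
def learn(env, algorithm="qlearning", episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {}  # Q[(state, action)] -> estimated value

    def q(s, a):
        return Q.get((s, a), 0.0)

    def choose(s):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: q(s, a))

    for episode in range(episodes):        # outer loop: episodes
        s = env.reset()
        a = choose(s)
        done = False
        while not done:                    # inner loop: time steps t
            s_next, reward, done = env.step(a)
            a_next = choose(s_next)
            if done:
                target = reward
            elif algorithm == "sarsa":     # on-policy target
                target = reward + gamma * q(s_next, a_next)
            else:                          # Q-learning: off-policy (greedy) target
                target = reward + gamma * max(q(s_next, b) for b in env.actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s, a = s_next, a_next
    return Q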

Current work

This work mimics the behaviour of an attacker that tries to explore the state-machine of the IoT device it is communicating with.

We started developing our framework by:

  • Targeting an actual IoT protocol: Yeelight protocol.
  • Implementing 4 RL algorithms: SARSA, Q-learning, SARSA(λ) and Q(λ).

Note: the Yeelight protocol imposes a maximum rate on the commands that can be sent to Yeelight devices; hence our framework can take about 50 minutes to complete one learning process of 200 episodes for a single RL algorithm.

Features

Main features include:

  • Support for 4 RL algorithms, selectable inside the config.py file (a hypothetical excerpt is sketched after this list).
  • Collection of all the data needed to generate plots comparing performance across different configurations.
  • Ability to stop the learning process and restart it from the previously computed Q matrix, by passing the date of the previous execution as id.
  • All parameters configurable inside a single file, config.py: algorithm parameters, framework parameters and debug options.
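A hypothetical excerpt of such a config.py is sketched below; the actual parameter names in the repository may differ:

# Hypothetical excerpt of config.py; the real parameter names may differ.
ALGORITHM = "qlearning"       # one of: "sarsa", "qlearning", "sarsa_lambda", "qlearning_lambda"
TOTAL_EPISODES = 200          # number of episodes of one learning process
ALPHA = 0.1                   # learning rate
GAMMA = 0.9                   # discount factor
LAMBDA = 0.8                  # eligibility-trace decay, used by SARSA(λ) and Q(λ) only
EPSILON = 0.2                 # exploration probability of the ε-greedy policy
USE_OLD_MATRIX = False        # restart from a previously computed Q matrix
DATE_TO_RETRIEVE = "2021_05_03_14_22_07_140245"  # id of the previous execution (illustrative)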

How to use?

This project has been developed with Python 3.7. To use it, first install all the necessary Python packages with:

pip install .

After installing all the needed dependencies, the project can be executed by directly running the __main__.py script. If some modules are missing, install them with the pip install <module> command.

Note: nmap must be installed on your computer for the python-nmap package to work.
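Assuming the repository root as the working directory, running the framework might look like this (illustrative; the exact invocation may vary with your setup):

python __main__.py    # starts the framework: discovery phase first, then learning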

Structure

General structure of directories:

  • learning contains the Learning module, with the RL algorithms, and a run script to follow the best policy found by the algorithms.
  • discovery contains scripts for finding IoT devices in the LAN.
  • dictionary contains the dictionaries for IoT protocols.
  • request_builder accesses the dictionaries and builds the requests to be sent to IoT devices.
  • device_communication contains the API for directly communicating with a specific IoT device.
  • state_machine contains the methods defining the state machines for the protocols and the methods for computing the reward.
  • plotter contains scripts for plotting results.
  • sample contains some toy scripts to communicate with individual devices (Yeelight and Hue devices).
  • images contains images for readme purposes.

The project can be run from the __main__.py, starting with a discovery phase for IoT devices in the local network.

Output

Throughout the entire learning process, the Learning module collects data into external files, inside the output directory.

All files belonging to one execution of the learning process are identified by the current date and the id of the current thread, in the format %Y_%m_%d_%H_%M_%S_<thread_id>.
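For reference, an id in that format could be built in Python roughly as follows (illustrative, not necessarily the framework's exact code):

from datetime import datetime
import threading

# Illustrative: building an execution id in the %Y_%m_%d_%H_%M_%S_<thread_id> format.
current_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
execution_id = "{}_{}".format(current_date, threading.get_ident())
print(execution_id)  # e.g. 2021_05_03_14_22_07_140245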

The structure of the output directory is the following:

output
|
|__ log
|   |__ log_<date1>_<thread_id>.log
|   |__ log_<date2>_<thread_id>.log
|
|__ output_csv
|   |__ output_<algorithm1>_<date1>_<thread_id>.csv
|   |__ output_<algorithm2>_<date2>_<thread_id>.csv
|   |__ partial_output_<algorithm1>_<date1>_<thread_id>.csv
|   |__ partial_output_<algorithm2>_<date2>_<thread_id>.csv
|
|__ output_Q_parameters
|   |__ output_parameters_<date1>_<thread_id>.csv
|   |__ output_parameters_<date2>_<thread_id>.csv
|   |__ output_Q_<date1>_<thread_id>.csv
|   |__ output_E_<date1>_<thread_id>.csv
|
|__ log_date.log

In more detail, inside the output directory:

  • output_Q_parameters: contains data collected before and after the learning process. Before the process starts, all values of the configurable parameters are saved into the file output_parameters_<date>_<thread_id>.csv: information about the path to learn, the optimal policy, the chosen algorithm, the number of episodes, and the values of α, γ, λ and ε. The parameters saved in this file allow repeating a past execution of the learning process with the exact same configuration. Then, at the end of each episode, the Q matrix is written and updated inside the file output_Q_<date>_<thread_id>.csv. The E matrix, if required by the chosen RL algorithm, is written into output_E_<date>_<thread_id>.csv.
  • output_csv: contains output_<algorithm>_<date>_<thread_id>.csv and partial_output_<algorithm>_<date>_<thread_id>.csv files. The former contains, for each episode, the obtained reward, the number of time steps and the cumulative reward (a small reading sketch follows this list). The latter contains the same values obtained by stopping the learning process at a certain episode and following the best policy found up to that episode. partial_output_<algorithm>_<date>_<thread_id>.csv files are present only if a proper flag is activated inside the learning_yeelight.py script, specifying the episodes at which the learning process should be stopped.
  • log: contains log data for each execution. After the learning process has started, for each step t performed by the RL agent, log_<date>_<thread_id>.log is updated with information about the current state s_t, the performed action a_t, the new state s_t+1 and the reward r_t+1.
  • log_date.log: a file saving the id of each execution. It can be used to collect the ids of all executions and use them inside the Plotter module.
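As an example of post-processing, a per-episode CSV could be read roughly as follows; the file name and the column names below are assumptions made for illustration, so check the actual header written by the Learning module:

import csv

# Illustrative sketch: reading an output_<algorithm>_<date>_<thread_id>.csv file.
# The file name and the column names ("episode", "reward", "cumulative_reward")
# are assumptions; check the real header of your own output files.
with open("output/output_csv/output_sarsa_2021_05_03_14_22_07_140245.csv") as f:
    for row in csv.DictReader(f):
        print(row["episode"], row["reward"], row["cumulative_reward"])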

Workflow

The complete workflow is modelled as depicted in the workflow figure.

An in-depth description of the figure follows:

  1. The framework starts through the __main__.py script, which first activates the Discoverer. Before starting, the config.py file provides all the information needed to configure the framework: general information such as the root directory in which to save output files, the state-machine, the goal that the RL agent should learn, and all parameter values for the chosen RL algorithm. Possible paths arbitrarily defined for the Yeelight protocol are shown inside the images directory.
  2. The Discoverer analyzes the local network and returns to the main script the Discovery Reports describing the IoT devices it found. Here you can choose whether to use the nmap version of the Discoverer or only the Yeelight-specific discoverer. The nmap version of the Discoverer supports 2 protocols: Yeelight and Shelly. Support for further protocols still needs to be added.
  3. The main script receives these reports and spawns multiple threads running the Learning module, passing to each of them the Discovery Report of 1 distinct Yeelight device found inside the LAN (a minimal sketch of this pattern follows this list).
  4. The Learning module is the RL agent, iterating over episodes.
    1. It receives multiple parameters as input from the config.py file: the chosen RL algorithm, the values of ε, α, γ and λ if needed, the total number of episodes, etc. Some flags also let the user decide whether, after the learning process, results should be plotted directly (via the Plotter module) or the RL agent should follow the best policy found (via the Run Policy Found script).
    2. During each episode, the agent asks the Request Builder for commands; the Request Builder accesses the data of the Yeelight Dictionary and returns a JSON string with the command requested by the agent. This string can be sent to the Yeelight device.
    3. The JSON string is passed to the API Yeelight script inside the Device Communication module, which sends commands to the Yeelight bulb and handles its responses.
    4. Moreover, at each time step t the Learning module retrieves the reward r_t and the current state s_t from the State Machine module. To retrieve information about the state of the Yeelight device, this module asks the Dictionary module for the command that queries all necessary information from the bulb and sends this command to the API script, which actually sends it to the bulb and returns the response to the State Machine Yeelight module.
    5. At the end of the learning process, the Learning module generates some output files, described in the Output section.
  5. The main thread waits until all threads running the Learning module end.
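The thread-per-device pattern of step 3 and the command building and sending of steps 4.2 and 4.3 can be sketched roughly as follows; all names are hypothetical, while the JSON payload format, the "\r\n" terminator and TCP port 55443 follow the public Yeelight LAN control protocol:

import json
import socket
import threading

def send_yeelight_command(ip, method, params, port=55443):
    # Yeelight LAN protocol: a JSON command terminated by "\r\n", sent over TCP port 55443.
    message = json.dumps({"id": 1, "method": method, "params": params}) + "\r\n"
    with socket.create_connection((ip, port), timeout=5) as sock:
        sock.sendall(message.encode())
        return sock.recv(4096).decode()

def run_learning(report):
    # One Learning agent per discovered device; the "report" structure is hypothetical.
    response = send_yeelight_command(report["ip"], "set_power", ["on", "smooth", 500])
    print(report["ip"], response)

# "discovery_reports" stands for the reports returned by the Discoverer (illustrative value).
discovery_reports = [{"ip": "192.168.1.100"}]
threads = [threading.Thread(target=run_learning, args=(r,)) for r in discovery_reports]
for t in threads:
    t.start()
for t in threads:   # the main thread waits until all Learning threads end
    t.join()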

The generated output files can then be used by the Run Policy Found script, which retrieves data from them and follows the best policy found, through the Q matrix. While following the policy, the script retrieves complete commands from the Dictionary module and sends them to the Yeelight device through the API Yeelight script. The output files can also be used by the Plotter module to graphically present the results obtained during the learning process.

Plots (screenshots)

Since many different plots can be generated, here is a quick overview of the graphs produced by the scripts of the Plotter module.

  • get_training_time_traffic.py and plot_training_time_traffic.py retrieve the execution time and the traffic generated by each execution of the algorithm and generate the corresponding bar graphs.

  • plot_moving_avg.py and plot_moving_avg_for_params.py show the moving average of the results, respectively for different algorithms and for different parameter values.

  • plot_cdf_reward.py plots the CDF (Cumulative Distribution Function) of the reward.

  • plot_reward_per_request.py shows the cumulative reward over the number of commands sent.

  • plot_output_data.py shows reward and time-step results for 1 single execution (it can be used to check that the framework is working correctly).

  • plot_heatmap.py generates the heatmap reflecting the Q matrix of one run of the algorithm.

  • plot_animation.py generates an animated plot in real time while the algorithm is running. Once the algorithm has started, the current date can be retrieved from the log_date.log file and copied into the plot_animation.py script. Once this script has started, it will generate a real-time plot like the one shown in the Demo section.
  • support_plotter.py contains methods for supporting the operation of the other scripts inside the Plotter module.
  • run_all_plots.py generates all plots and saves them into a Plot directory created outside the Plotter module.

Note:

  • all scripts use arrays of dates in the format %Y_%m_%d_%H_%M_%S_<thread_id> to identify executions of the RL algorithms.
  • most of the scripts save plots inside subdirectories of the Plot directory. The target directory can be manually chosen inside each script.

Demo

A short demo of the learning process, shown through the console together with an animated plot, can be seen in the demo.

Recall that this demo was made using the previously described plot_animation.py script in order to create the animated plot.

Tests

No tests are present for now.

Contribute

Pull Requests are always welcome.

Ensure the PR description clearly describes the problem and solution. It should include:

  • Name of the module modified
  • Reasons for modification

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments