PadishahIII/RFGuess

RFGuess (Random Forest Password Guessing Model)

Overview

This repository reproduces the paper Password Guessing Using Random Forest. The paper proposes a set of new methods for translating PII (Personally Identifiable Information) into structures that work well with classical machine learning models. I have implemented the main concepts of the paper and built an easy-to-use tool for training models, generating patterns, producing guesses and evaluating accuracy. This repo contributes:

  • A GUI program exclusively for the PII-based targeted password guessing scenario
  • A pre-trained model

For more on the underlying logic and training process of this project, see the accompanying article, which describes the algorithm in more detail. A transcript of it is also available.

Features

  • PII-based targeted password guessing
  • A ready-to-use pre-trained model, trained on a dataset of 110,000 records
  • Generate password patterns based on PII dataset
  • Conduct password guesses for given personal information
  • Support for training a model on your own dataset
  • Support for evaluating the accuracy of generated guesses

Prerequisites

Install & Launch

Clone this repo to your local machine and install the dependencies:

pip install -r requirements.txt

This project uses MySQL to store analysis data. The recommended way is to launch the prepared database in Docker:

docker-compose up -d

Then connect to mysql://root:root@127.0.0.1:3307/rfguess.

If you prefer to use your own database, import user.sql into it manually. This creates all the data tables.
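
As a quick sanity check before pasting a URL into the tool, the connection URL can be broken into its parts with the standard library. This is only an illustrative sketch: `parse_db_url` is a hypothetical helper, not part of RFGuess, and how the tool itself parses the URL is not specified here.

```python
from urllib.parse import urlsplit


def parse_db_url(url: str) -> dict:
    """Split a database URL into its components (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path is "/<database>"; drop the leading slash.
        "database": parts.path.lstrip("/"),
    }


info = parse_db_url("mysql://root:root@127.0.0.1:3307/rfguess")
print(info)
```

For a custom database, only the user, password, host, port and database name in the URL change.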

Launch the user interface:

python main.py

Usage

Main window

Run the executable and the main panel appears.

There are three main modules in the user interface: Guess-Generator, Pattern-Generator and Model-Trainner.

Generate pattern(Pattern-Generator)

First you need a trained model: either train one yourself in model-trainner or use the pre-trained model rfguess.clf. Then set a limit on the number of patterns to generate and start generating.

  1. Load the model (.clf)

  2. Set the output path and the limit

Generate password dictionary(Guess-Generator)

This module requires a pattern file (see the Appendix for details) and the PII data of the target user. You can load the pattern file generated by Pattern-Generator or use the default pattern file.

  1. Load the pattern file

  2. Fill in the PII data: enter the personal data of the target user, or load it from a JSON file

  3. Generate the password dictionary

Train your own model

Training a classical machine learning model is more laborious than training a deep learning one. The algorithms in this program use a MySQL database to store intermediate data structures while processing the original dataset. Fortunately, all you need is a running MySQL server and a database URL to connect to; all the data structures are set up automatically.

  1. Connect to your database and import database structure

Connect to the database URL.

Import the SQL file (user.sql). Note that this script drops and recreates the data tables the tool uses (you can inspect and change the table names in Parse/Config.py). If you started the MySQL server with docker-compose (the recommended way), user.sql was already imported at startup, so you can skip this step.

  2. Load your PII dataset (.txt)

The PII dataset should be in CSV format and follow the rules below:

  • the first line lists the field names
    • each field name must be one of ['account', 'name', 'phone', 'idcard', 'email', 'password'], case-insensitive
    • you can include any combination of the allowed fields, but name and password are mandatory
  • each subsequent line contains one PII record
  • the fields on each line are separated by commas
  • blank characters are ignored

A valid dataset looks like this:

name, email, password, phone
张三, 350777@aa.com , zhangsan, 111122222
John, 3333@bb.com, 3333, 44444
Jason Harris, aaaa@aa.com, 5555, 5555
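
The rules above can be checked before loading a file into the tool. The following is a minimal sketch, not the tool's own loader: `read_pii_dataset` is a hypothetical helper, and it assumes that "blank characters are ignored" means whitespace around each field is stripped.

```python
import csv
import io

ALLOWED_FIELDS = {"account", "name", "phone", "idcard", "email", "password"}
REQUIRED_FIELDS = {"name", "password"}


def read_pii_dataset(text: str) -> list[dict]:
    """Validate the header and return one dict per PII record."""
    rows = [
        [cell.strip() for cell in row]  # assumption: trim surrounding whitespace
        for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)  # skip empty lines
    ]
    if not rows:
        raise ValueError("dataset is empty")

    header = [name.lower() for name in rows[0]]  # field names are case-insensitive
    unknown = set(header) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    missing = REQUIRED_FIELDS - set(header)
    if missing:
        raise ValueError(f"missing mandatory fields: {sorted(missing)}")

    records = []
    for index, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            raise ValueError(f"record {index}: expected {len(header)} fields, got {len(row)}")
        records.append(dict(zip(header, row)))
    return records


sample = """name, email, password, phone
张三, 350777@aa.com , zhangsan, 111122222
John, 3333@bb.com, 3333, 44444
Jason Harris, aaaa@aa.com, 5555, 5555
"""
records = read_pii_dataset(sample)
print(records[0])
```

When reading from a file, open it with the same encoding you enter in the Charset edit box.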

You can specify the character set of the dataset in the Charset edit box.

Click the Load PII Data button and wait. The dataset is processed and then stored in the database.

  3. Analyze and process the dataset

This step analyzes the PII dataset and derives intermediate data from it.

  4. Train the model

This step trains a classifier and saves it to a .clf file.

  5. Evaluate accuracy

To evaluate the accuracy of a model, this step uses 50% of your dataset as the training set and the other 50% as the test set. It generates a password dictionary for each PII record in the test set and checks whether the correct password is in that dictionary.
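
The evaluation described above amounts to a hit-rate measurement, which can be sketched as follows. Everything here is illustrative: `evaluate_hit_rate`, `toy_train` and `toy_guess` are hypothetical names, the shuffle before splitting is an assumption, and the toy "model" is just a list of templates rather than the random forest the tool trains.

```python
import random
from typing import Callable, Iterable

Record = dict[str, str]


def evaluate_hit_rate(
    records: list[Record],
    train: Callable[[list[Record]], object],
    generate_guesses: Callable[[object, Record], Iterable[str]],
    seed: int = 0,
) -> float:
    """Train on half of the records, then report the fraction of the
    other half whose real password appears in the generated dictionary."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random 50/50 split
    half = len(shuffled) // 2
    train_set, test_set = shuffled[:half], shuffled[half:]
    if not test_set:
        return 0.0

    model = train(train_set)
    hits = 0
    for record in test_set:
        # The guesser only sees the PII, never the password itself.
        pii = {k: v for k, v in record.items() if k != "password"}
        if record["password"] in set(generate_guesses(model, pii)):
            hits += 1
    return hits / len(test_set)


# Toy stand-ins so the sketch runs end to end.
def toy_train(train_set: list[Record]) -> list[str]:
    return ["{name}", "{name}123"]


def toy_guess(templates: list[str], pii: Record) -> list[str]:
    return [template.format(**pii) for template in templates]


records = [
    {"name": "amy", "password": "amy123"},
    {"name": "bob", "password": "bob"},
    {"name": "cat", "password": "hunter2"},
    {"name": "dan", "password": "dan123"},
]
rate = evaluate_hit_rate(records, toy_train, toy_guess)
print(f"hit rate: {rate:.2f}")
```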

  6. Restore the status of the last run

Use the "Update Status" button to load the progress of the last run and check the status of each phase.

Advanced Configuration

See Config.py for more detailed configuration.

Algorithm configuration

The main algorithm uses a Markov n-gram model. Set n with the pii_order parameter:

pii_order = 6
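
To make the role of the order concrete, here is a small sketch of how an order-n model slices a sequence into (context, next symbol) training samples. It is an illustration under assumptions, not the project's code: `ngram_samples` and the begin/end markers `<B>` and `<E>` are hypothetical, and whether the tool's n-gram window counts the predicted symbol is not specified here.

```python
def ngram_samples(symbols, order, begin="<B>", end="<E>"):
    """Return (context, next_symbol) pairs where each context holds the
    `order` symbols that precede the symbol to predict."""
    padded = [begin] * order + list(symbols) + [end]
    return [
        (tuple(padded[i:i + order]), padded[i + order])
        for i in range(len(padded) - order)
    ]


samples = ngram_samples("ab", order=2)
for context, target in samples:
    print(context, "->", target)
```

A larger order gives the classifier more context per prediction but makes each individual context rarer in the training data.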

You can limit the number of guesses with two thresholds, one of which (general_generator_threshold) is shown below. Each threshold is compared against the probability of the pattern being grown: a pattern is kept only if its probability is greater than the threshold. So the larger the threshold, the fewer the guesses, and vice versa. Do not set a threshold too small (less than 1e-11), or the output will be flooded with useless patterns.

general_generator_threshold = 1.2e-8
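
The effect of such a threshold can be sketched with a toy generator that grows patterns one symbol at a time, multiplies the probabilities, and drops any branch whose probability is not greater than the threshold. This is a simplified illustration, not the project's generator: `generate_patterns`, `toy_next_probs`, the `<E>` end marker and the probabilities are all made up for the example.

```python
def generate_patterns(next_probs, threshold, max_len=16, end="<E>"):
    """Depth-first pattern growth with probability-threshold pruning."""
    results = []
    stack = [((), 1.0)]  # (pattern so far, probability so far)
    while stack:
        prefix, prob = stack.pop()
        for symbol, p in next_probs(prefix).items():
            new_prob = prob * p
            if new_prob <= threshold:
                continue  # prune: this branch is too unlikely
            if symbol == end:
                results.append((prefix, new_prob))  # complete pattern
            elif len(prefix) < max_len:
                stack.append((prefix + (symbol,), new_prob))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


def toy_next_probs(prefix):
    # Made-up distribution; a real model would predict this from the prefix.
    if not prefix:
        return {"N1": 0.6, "B1": 0.4}
    return {"N1": 0.1, "B1": 0.3, "<E>": 0.6}


low = generate_patterns(toy_next_probs, threshold=0.1)
high = generate_patterns(toy_next_probs, threshold=0.2)
print(len(low), len(high))
```

Raising the threshold in this toy setup cuts off the longer, less likely pattern, which mirrors the trade-off described above.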

Database configuration

You can configure the database table names as you like:

class TableNames:
    PII = "PII"
    pwrepresentation = "pwrepresentation"
    representation_frequency = "representation_frequency"
    pwrepresentation_frequency = "pwrepresentation_frequency"
    pwrepresentation_unique = "pwrepresentation_unique"
    pwrepresentation_general = f"{pwrepresentation}_general"
    representation_frequency_base_general = "representation_frequency_base_general"
    representation_frequency_general = f"{representation_frequency}_general"
    pwrepresentation_frequency_general = f"{pwrepresentation_frequency}_general"
    pwrepresentation_unique_general = f"{pwrepresentation_unique}_general"

Classifier configuration

Tune the random forest parameters with the following configuration:

class RFParams:
    n_estimators = 30
    criterion = 'gini'
    min_samples_leaf = 10
    max_features = 0.8
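
The attribute names in RFParams match keyword arguments of scikit-learn's RandomForestClassifier. Assuming that is the classifier in use (an assumption, not something stated above), the settings could be collected into keyword arguments as in this sketch; `rf_kwargs` is a hypothetical helper.

```python
class RFParams:
    n_estimators = 30
    criterion = "gini"
    min_samples_leaf = 10
    max_features = 0.8


def rf_kwargs(params=RFParams) -> dict:
    """Collect the public class attributes as keyword arguments."""
    return {
        name: value
        for name, value in vars(params).items()
        if not name.startswith("_")
    }


kwargs = rf_kwargs()
print(kwargs)

# With scikit-learn installed, the classifier could then be built as:
#   from sklearn.ensemble import RandomForestClassifier
#   clf = RandomForestClassifier(**kwargs)
```

Here n_estimators is the number of trees, min_samples_leaf is the minimum number of samples allowed in a leaf, and a float max_features is the fraction of features considered at each split.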

Build from source

This project is written in Python 3.11. Install the dependencies with pip:

pip install -r requirements.txt

And run the following command to launch the main window:

python main.py

License

This code is released under an MIT License. You are free to use, modify, distribute, or sell it under those terms.

Contact

Project Link: https://github.com/PadishahIII/RFGuess

Acknowledgements

Appendix

Pattern format

Tag   Description
N1    Full name
N2    Abbreviation of name
N3    Family name
N4    Given name
N5    First character of given name followed by family name
N6    First character of family name followed by given name
N7    Family name capitalized
N8    First character of family name
N9    Abbreviation of given name
B1    Birthday in YYYYMMDD
B2    Birthday in MMDDYYYY
B3    Birthday in DDMMYYYY
B4    Birthday in MMDD
B5    Birthday in YYYY
B6    Birthday in YYYYMM
B7    Birthday in MMYYYY
B8    Birthday in YYMMDD
B9    Birthday in MMDDYY
B10   Birthday in DDMMYY
A1    Account
A2    Letter segment of account
A3    Digit segment of account
E1    Email prefix
E2    Letter segment of email
E3    Digit segment of email
E4    Email site, such as qq or 163
P1    Phone number
P2    First three digits of phone number
P3    Last four digits of phone number
I1    ID card number
I2    First three digits of ID card number
I3    First six digits of ID card number
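
To show how tags turn into a concrete guess, the sketch below derives a few tag values from one person's data and concatenates them. It is an illustration only: the on-disk pattern file format is not described here, so a pattern is represented as a plain list of tags and literal strings, and `tag_values`, `expand` and the input keys (`family_name`, `given_name`, `birthday`) are hypothetical.

```python
def tag_values(pii: dict) -> dict:
    """Compute the values of a subset of the tags in the table above."""
    family, given = pii["family_name"], pii["given_name"]
    birthday = pii["birthday"]  # assumed to be given as YYYYMMDD
    phone = pii["phone"]
    return {
        "N3": family,
        "N4": given,
        "N8": family[:1],
        "B1": birthday,
        "B4": birthday[4:],   # MMDD
        "B5": birthday[:4],   # YYYY
        "B8": birthday[2:],   # YYMMDD
        "E1": pii["email"].split("@", 1)[0],
        "P1": phone,
        "P3": phone[-4:],
    }


def expand(pattern: list[str], pii: dict) -> str:
    """Replace each tag with its value; other items are kept as literals."""
    values = tag_values(pii)
    return "".join(values.get(part, part) for part in pattern)


person = {
    "family_name": "zhang",
    "given_name": "san",
    "birthday": "19900315",
    "phone": "13800001234",
    "email": "zs90@example.com",
}
print(expand(["N3", "B5"], person))
print(expand(["N8", "N4", "@", "P3"], person))
```

Because literals and tags share one list in this sketch, a literal that happens to spell a tag name would be expanded too; a real format would need to distinguish the two.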