GitHub - porteron/http-archive-parser: Various parsers used for data privacy detection and shared data between hosts. HAR file parsing tools and accompanying API.

Archive Parser

Description

The HTTP Archive Parser stands up a server with endpoints to parse HTTP Archive (HAR) files in various ways. Its initial purpose was to detect data privacy violations in a user session. Part of that system has been broken out into a more general-purpose parser that exposes Shared Strings in a user's session, which helps identify dataflow between different host domains.

It looks for matching strings in places such as cookies, headers, and query parameters.
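As a rough illustration of the Shared String idea, the sketch below collects every header, cookie, and query parameter value from a list of HAR entries and keeps the values that were sent to more than one host. It is not the parser's actual implementation: `findSharedStrings`, the trimmed-down `HarEntry` shape, and the sample data are all hypothetical, and the real parser applies additional filters from its config.

```typescript
// Minimal HAR shapes: only the fields this sketch reads.
interface NameValue {
  name: string;
  value: string;
}

interface HarEntry {
  request: {
    url: string;
    headers: NameValue[];
    cookies: NameValue[];
    queryString: NameValue[];
  };
}

// Hypothetical helper: map each value to the set of hosts it was sent to,
// then keep only the values seen on at least two different hosts.
function findSharedStrings(entries: HarEntry[]): Map<string, Set<string>> {
  const hostsByValue = new Map<string, Set<string>>();
  for (const entry of entries) {
    const host = new URL(entry.request.url).host;
    const { headers, cookies, queryString } = entry.request;
    for (const { value } of [...headers, ...cookies, ...queryString]) {
      const hosts = hostsByValue.get(value) ?? new Set<string>();
      hosts.add(host);
      hostsByValue.set(value, hosts);
    }
  }

  const shared = new Map<string, Set<string>>();
  hostsByValue.forEach((hosts, value) => {
    if (hosts.size >= 2) shared.set(value, hosts);
  });
  return shared;
}

// Made-up session: the same identifier reaches two different hosts.
const sampleEntries: HarEntry[] = [
  {
    request: {
      url: "https://shop.example.com/cart?uid=a1b2c3d4e5f6",
      headers: [],
      cookies: [{ name: "uid", value: "a1b2c3d4e5f6" }],
      queryString: [{ name: "uid", value: "a1b2c3d4e5f6" }],
    },
  },
  {
    request: {
      url: "https://ads.example.net/pixel?visitor=a1b2c3d4e5f6",
      headers: [{ name: "Accept", value: "image/png" }],
      cookies: [],
      queryString: [{ name: "visitor", value: "a1b2c3d4e5f6" }],
    },
  },
];

const sharedStrings = findSharedStrings(sampleEntries);
```

Here `sharedStrings` holds one entry, the identifier that appears under both hosts; the `Accept` header value is dropped because only one host received it.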

You can specify various reports to run on the HAR file.

Shared String
Shared String Entity List
Shared String Differential

Currently, reports are read from and stored in S3, so you will have to fill out the .env file with the proper credentials. Support for using the local filesystem is planned.

Installation

$ npm install

Running the app

# development
$ npm run start

# watch mode
$ npm run start:dev

# production mode
$ npm run start:prod

Test

# unit tests
$ npm run test

# e2e tests
$ npm run test:e2e

# test coverage
$ npm run test:cov

Parser

There are many properties you can customize for the parser. The config is located in the parser/har/parser.config.js file.

Improper modification of these values can introduce unnecessary parsing conditions, which leads to long parsing times.

The FIRST_CHAR_MIN_LEN and FIRST_CHAR_MAX_LEN values are the most sensitive. The smaller FIRST_CHAR_MIN_LEN is, the more strings the parser will consider in the file. You should probably always keep this value greater than 6 or 7. Most unique identifiers are longer than 7 characters, so go ahead and set it higher if that is what you are looking for.

Below are the supported config values

{
    LEVELS: [
        'request',
        'response'
    ],
    ENTRY_TYPES: [
        'headers',
        'cookies',
        'queryString'
    ],
    FIRST_CHAR_MIN_LEN: 7,
    FIRST_CHAR_MAX_LEN: 200,
    REPORT_KEY_NAME_MAX_LENGTH: 60,
    REPORT_URL_MAX_LENGTH: 120,
    INCLUDE_INITIATOR: true,
    INCLUDE_SERVER_IP: false,
    MATCH_COUNT_MIN: 2,
    IGNORE_LIST: [],
    INCLUDE_LIST: [],
    REPORT_PARAMS: [],
    IGNORE_SAME_REQUESTS: true,
    FILTER_SAME_HOST_URL: true,
    FILTER_TIMESTAMPS: true,
    FILTER_URL_VALUES: false,
}
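To make the sensitivity of FIRST_CHAR_MIN_LEN concrete, here is a small sketch that filters candidate strings by length. It assumes the two values act as inclusive lower and upper bounds on string length; `filterCandidates`, `LengthBounds`, and the sample values are hypothetical and not taken from the parser's code.

```typescript
// Only the two length-related settings from the config above.
interface LengthBounds {
  FIRST_CHAR_MIN_LEN: number;
  FIRST_CHAR_MAX_LEN: number;
}

// Hypothetical helper: keep values whose length falls inside the bounds
// (assumed inclusive on both ends).
function filterCandidates(candidates: string[], bounds: LengthBounds): string[] {
  return candidates.filter(
    (v) =>
      v.length >= bounds.FIRST_CHAR_MIN_LEN &&
      v.length <= bounds.FIRST_CHAR_MAX_LEN
  );
}

// Made-up values of the kind found in headers, cookies, and query strings.
const candidateValues = ["1", "gzip", "en-US", "keep-alive", "a1b2c3d4e5f6"];

// With the default minimum of 7, only the longer values survive.
const strictCandidates = filterCandidates(candidateValues, {
  FIRST_CHAR_MIN_LEN: 7,
  FIRST_CHAR_MAX_LEN: 200,
});

// Lowering the minimum to 2 lets short, low-signal values through as well.
const looseCandidates = filterCandidates(candidateValues, {
  FIRST_CHAR_MIN_LEN: 2,
  FIRST_CHAR_MAX_LEN: 200,
});
```

On a real HAR file the gap between the two result sets is much larger, since short values such as encodings and locale codes repeat across nearly every request.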

How it works

POST to {SERVER_HOST}/collection-event/parse

There are two supported ways to pass your file to the parser:
1. Send the entire raw HAR contents in the request body
2. Send the name of the HAR file stored in S3

Example Requests for Various Parsing

Header Request Format

Headers
Content-Type	application/json
mx-token	TEST-KEY-PARSER

Supported Request Body

{
  "format": "json", // OPTIONAL - also accepts "csv" - default is "json"
  "save": true, // OPTIONAL - boolean, true or false to save to bucket - default is true
  "update": false, // OPTIONAL - boolean, true or false to overwrite existing file - default is false
  "report_type": "sharedStrings", // or "differential" or "entityList"
  "files": ["<S3 HAR FILE NAME>"] // if differential pass two files
  // OR
  "raw": [{HAR1}], // if differential pass two raw HAR files as JSON objects
}
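The request body described above can be written down as a TypeScript type with a small client-side check. This is a sketch under stated assumptions: `ParseRequestBody` and `validateParseRequest` are hypothetical names, and the rules (exactly one of "files" or "raw", two inputs for a differential, at least one otherwise) are inferred from the comments in the body above rather than from the server's own validation.

```typescript
type ReportType = "sharedStrings" | "differential" | "entityList";

interface ParseRequestBody {
  report_type: ReportType;
  format?: "json" | "csv"; // default is "json"
  save?: boolean; // default is true
  update?: boolean; // default is false
  files?: string[]; // names of HAR files stored in S3
  raw?: object[]; // raw HAR contents as JSON objects
}

// Hypothetical check: returns a list of problems, empty when the body looks fine.
function validateParseRequest(body: ParseRequestBody): string[] {
  const problems: string[] = [];
  const hasFiles = body.files !== undefined;
  const hasRaw = body.raw !== undefined;

  if (hasFiles === hasRaw) {
    problems.push('provide exactly one of "files" or "raw"');
    return problems;
  }

  const inputCount = (body.files ?? body.raw ?? []).length;
  if (body.report_type === "differential") {
    if (inputCount !== 2) {
      problems.push(`"differential" expects 2 inputs, got ${inputCount}`);
    }
  } else if (inputCount < 1) {
    problems.push(`"${body.report_type}" expects at least 1 input`);
  }
  return problems;
}

// A differential request with only one file is flagged before it is sent.
const differentialProblems = validateParseRequest({
  report_type: "differential",
  files: ["<S3 HAR FILE NAME>"],
});
```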

Shared Strings Parse

Request Body
{
	"report_type": "sharedStrings",
	"format": "json",
	"files": ["<S3 HAR FILE NAME>"]
}

Entity List Parse

Request Body
{
	"report_type": "entityList",
	"format": "json", 
	"files": ["<S3 HAR FILE NAME>"]
}

HAR Differential

Request Body
{
	"report_type": "differential",
	"format": "json",
	"files": ["<S3 HAR FILE NAME>", "<S3 HAR FILE TO DIFF AGAINST>"]
}
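The requests above can be sent from any HTTP client. The sketch below builds the POST request with the documented path and headers and hands it to a fetch-style function. `buildParseRequest`, `requestReport`, and the `FetchLike` type are hypothetical helpers, and reading the response as text is an assumption, since the response format is not described here.

```typescript
interface ParseBody {
  report_type: string;
  format?: string;
  files?: string[];
  raw?: object[];
}

interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// The subset of the fetch API this sketch relies on, so a stub can be swapped in.
type FetchLike = (
  url: string,
  init: HttpRequest
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Build the URL and request options for POST {SERVER_HOST}/collection-event/parse.
function buildParseRequest(
  serverHost: string,
  token: string,
  body: ParseBody
): { url: string; init: HttpRequest } {
  return {
    url: `${serverHost.replace(/\/+$/, "")}/collection-event/parse`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "mx-token": token },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the response body as text (assumed to be the report).
async function requestReport(
  fetchFn: FetchLike,
  serverHost: string,
  token: string,
  body: ParseBody
): Promise<string> {
  const { url, init } = buildParseRequest(serverHost, token, body);
  const res = await fetchFn(url, init);
  if (!res.ok) {
    throw new Error(`parse request failed with status ${res.status}`);
  }
  return res.text();
}
```

On a runtime with a global `fetch` (for example Node 18 or later), it can be passed in directly as `fetchFn`; in tests, a stub with the same shape avoids hitting a real server.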

Development

Creating a new module

Before starting development, it is important to understand the Nest framework. A couple of basic concepts will help you understand the purpose of each file. The basic structure is that each Module has a Component, Service/Repository, Data Transfer Model, and Interface. Some of these constructs are just fancy words for a very simple purpose.

You can run the CreateFullModule.sh script to have the following files autogenerated for you:

controller
module
dto
repository
spec test file

(Note that there are occasionally syntax issues when creating a module with a module name that is more than one word long.)

Take a look at the CreateFullModule.sh code to understand what is happening.

After you run the script there are still a few things that need to be manually added.

You will need to manually enter the values into the dto file. Look at src/interfaces/entities/<your module> for all the fields that you will need to add. Use existing files for reference.

The alternative is to use the Nest CLI and do it all manually.

Resources

Swagger with NestJS:
- https://docs.nestjs.com/recipes/swagger
Great docs on creating components that hook into TypeORM:
- https://blog.theodo.com/2019/05/an-overview-of-nestjs-typeorm-release-your-first-application-in-less-than-30-minutes/

HTTP Archive Parser is built with the Nest framework in TypeScript.

Maintainers

Nicholas Porter - https://github.com/porteron
