JSON Dict Comparison Improvement #22

Open
TheDr1ver opened this issue Sep 7, 2021 · 2 comments

TheDr1ver commented Sep 7, 2021

Consider revisiting stripping dates from data. Shodan HTTP data (80_data or 443_data) appears to be the biggest offender at the moment. A changing date in the data also changes the 443_hash value.

Other targets for removal:

  • Censys - any key containing __encoding (value = DISPLAY_UTF8 or value = DISPLAY_HEX)
  • Shodan - 443_opts_heartbleed - contains a date, which will change every time
  • Shodan - _location_latitude, _location_longitude, _location_city - might change too frequently

NOTE - This scrubbing should only happen after the diff comes back with a positive result. That way we're not looking at every single character in every JSON blob that comes our way, plus it'll be easier to find "true scrubs" rather than accidentally deleting pieces of data that some plugin determines to look "date-like".
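
A minimal sketch of that ordering, assuming records are flat dicts and that scrubbing is supplied as a `scrub(key, value)` callable. The names `changed_keys`, `diff_then_scrub` and `scrub` are hypothetical and not taken from the project:

```python
_MISSING = object()  # sentinel so a missing key differs from a stored None


def changed_keys(old: dict, new: dict) -> set:
    """Keys whose values differ, or that exist in only one of the dicts."""
    return {
        key
        for key in old.keys() | new.keys()
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    }


def diff_then_scrub(old: dict, new: dict, scrub) -> set:
    """Diff first, then scrub only the differing string values and re-compare."""
    still_changed = set()
    for key in changed_keys(old, new):
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if isinstance(a, str) and isinstance(b, str):
            a, b = scrub(key, a), scrub(key, b)
        if a != b:
            still_changed.add(key)
    return still_changed
```

Fields that are identical before scrubbing never reach the regexes, so the scrubber only ever sees the handful of values that already differ.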

Diffing a subset of the content inside bodies should be implemented

If you get a diff between HTML-specific fields like *_http_response_body, then that HTML should be parsed and diffed separately if at all possible. But that may quickly get so complicated as to turn into a project of its own.
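
One low-effort way to try this with only the standard library is to flatten each body into tag and text tokens and diff the token lists. This is a rough sketch under that assumption, not a full HTML diff; `html_tokens` and `diff_html` are hypothetical names:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Flatten HTML into a list of tag and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order alone never shows up as a diff.
        rendered = " ".join(
            f'{name}="{value}"' for name, value in sorted(attrs, key=lambda kv: kv[0])
        )
        self.tokens.append(f"<{tag} {rendered}>" if rendered else f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(data.strip())


def html_tokens(html: str) -> list:
    collector = _TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def diff_html(old_html: str, new_html: str) -> list:
    """Return (operation, old_tokens, new_tokens) for every non-equal span."""
    old, new = html_tokens(old_html), html_tokens(new_html)
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

Because the diff is reported per token, a changed CSRF token or timestamp shows up as a single small span instead of marking the whole body as different.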

TheDr1ver commented Sep 23, 2021

Censys

Delete:

*__encoding_*

^^ Addressed in #42

Scrub:

*_banner
    cookies:
        Set-Cookie.*?=(.*?);
        (e.g. sessionid=<base64>; csrftoken=<base64>; expires=<date>)
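
A sketch of the cookie scrub in Python, assuming cookies show up in the banner as `Set-Cookie: name=value; attr=...` header lines. The regexes are a tightened variant of the one above, and `scrub_cookies` is a hypothetical name:

```python
import re

# Keep "Set-Cookie: <name>=" and drop everything up to the next ';' or line end.
COOKIE_VALUE = re.compile(r"(Set-Cookie:\s*[^=;\r\n]+=)[^;\r\n]*", re.IGNORECASE)
# The expires attribute carries a date, so blank it as well.
COOKIE_EXPIRES = re.compile(r"(;\s*expires=)[^;\r\n]*", re.IGNORECASE)


def scrub_cookies(banner: str) -> str:
    """Blank cookie values and expiry dates but keep cookie names and other attributes."""
    banner = COOKIE_VALUE.sub(r"\1", banner)
    return COOKIE_EXPIRES.sub(r"\1", banner)
```

Keeping the cookie name means a newly added or removed cookie still registers as a change, while a rotated session value does not.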

*_http_response_body
    <input.*?(?=token).*?value(.*?)>
    # Or could be double-rex process. 
        # One rex to find <input> with 'token' inside it:
            <input[^>]*?(?=token).*?>
        # Then another to scrub the value inside of the result
            s/value=\".*?\"/value=\"\"/g
    ^^^ note that this is overly simplified. We should have a better way of scrubbing HTML in general.
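
The two-pass idea could look roughly like this in Python. The patterns are a slightly stricter variant of the ones above (they never cross a `>`), they only handle double-quoted `value` attributes, and `scrub_token_inputs` is a hypothetical name:

```python
import re

# Pass 1: find <input ...> tags that mention "token" anywhere inside the tag.
INPUT_WITH_TOKEN = re.compile(r"<input[^>]*?token[^>]*>", re.IGNORECASE)
# Pass 2: blank the double-quoted value attribute inside each matched tag.
VALUE_ATTR = re.compile(r'value="[^"]*"', re.IGNORECASE)


def scrub_token_inputs(html: str) -> str:
    """Empty the value of every <input> whose tag text contains 'token'."""
    return INPUT_WITH_TOKEN.sub(
        lambda match: VALUE_ATTR.sub('value=""', match.group(0)), html
    )
```

Inputs that do not mention "token" are left untouched, so ordinary form defaults still take part in the diff.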

Scrub-reliant deletes:
(if any of the related fields get scrubbed in the previous function, delete these fields entirely from the result)

*_banner:
    *banner_hex
    *http_response_headers_Set_Cookie_*

*_http_response_body:
    *_http_response_body_hash
    *_http_response_body_size
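
The mapping above can be kept as a small table and applied with glob matching. A sketch, assuming flat dict records and a set of keys reported as scrubbed by the previous step; `DEPENDENT_DELETES` and `drop_dependents` are hypothetical names. As a simplification it drops the dependent fields for every port once any matching field has been scrubbed:

```python
from fnmatch import fnmatchcase

# Scrubbed-field pattern -> patterns of derived fields to drop (from the lists above).
DEPENDENT_DELETES = {
    "*_banner": ["*banner_hex", "*http_response_headers_Set_Cookie_*"],
    "*_http_response_body": ["*_http_response_body_hash", "*_http_response_body_size"],
}


def drop_dependents(record: dict, scrubbed_keys: set) -> dict:
    """Remove fields derived from any field that was scrubbed."""
    doomed = [
        pattern
        for trigger, patterns in DEPENDENT_DELETES.items()
        if any(fnmatchcase(key, trigger) for key in scrubbed_keys)
        for pattern in patterns
    ]
    return {
        key: value
        for key, value in record.items()
        if not any(fnmatchcase(key, pattern) for pattern in doomed)
    }
```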

TheDr1ver commented Sep 23, 2021

Shodan

Delete:

*_asn
*_isp
*_location_*
*_opts_*

^^ Addressed in #42
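
For reference, glob-style deletes like these can be expressed with `fnmatch` from the standard library. This is only an illustration of the patterns, not necessarily how #42 implemented it, and the key names in the usage example are made up:

```python
from fnmatch import fnmatchcase

SHODAN_DELETE_PATTERNS = ("*_asn", "*_isp", "*_location_*", "*_opts_*")


def delete_matching(record: dict, patterns=SHODAN_DELETE_PATTERNS) -> dict:
    """Return a copy of record without keys matching any glob pattern."""
    return {
        key: value
        for key, value in record.items()
        if not any(fnmatchcase(key, pattern) for pattern in patterns)
    }


# Usage with hypothetical key names:
example = {
    "shodan_asn": "AS64500",
    "shodan_location_city": "Example City",
    "shodan_443_opts_heartbleed": "2021/09/07 12:00:00 - SAFE",
    "shodan_443_data": "HTTP/1.1 200 OK",
}
cleaned = delete_matching(example)  # only "shodan_443_data" survives
```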

Scrub:

*_data
    \nDate:(.*?)\n

Scrub-reliant deletes:
(if any of the related fields get scrubbed in the previous function, delete these fields entirely from the result)

*_data:
    *_hash
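
Putting the Shodan scrub and its dependent delete together might look like this. It is a sketch that assumes the hash key shares its prefix with the data key (443_data and 443_hash, as in the first comment); `scrub_shodan` is a hypothetical name:

```python
import re

# Same pattern as above: the Date header line inside the raw HTTP data.
DATE_HEADER = re.compile(r"\nDate:(.*?)\n")


def scrub_shodan(record: dict) -> dict:
    """Blank Date headers in *_data fields and drop the matching *_hash when one was scrubbed."""
    result = dict(record)
    for key, value in record.items():
        if not (key.endswith("_data") and isinstance(value, str)):
            continue
        scrubbed = DATE_HEADER.sub("\nDate:\n", value)
        if scrubbed != value:
            result[key] = scrubbed
            # The hash was computed over the unscrubbed data, so it is no longer comparable.
            result.pop(key[: -len("_data")] + "_hash", None)
    return result
```

The hash is only removed when the data actually changed during scrubbing, so services without a Date header keep their hash for comparison.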
