Improved Bambenek parser to better handle description #1451

amojamo · 2019-09-13T09:41:25Z

Renamed value to values, as it makes more sense for a list of fields.
Switched to a regex for reading the description, for better robustness. As it stands, the Bambenek parser fails when it reads the IP list. The cause is a slight change in the Bambenek IP list: some entries now have a longer description with more commas than usual.

ghost

What was the actual problem you were facing? E.g. a line which could not be parsed?

ghost · 2019-09-16T10:48:36Z

intelmq/bots/parsers/bambenek/parser.py

@@ -32,32 +33,33 @@ def parse_line(self, line, report):
            self.tempdata.append(line)

        else:
-            value = line.split(',')
+            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \


Does not seem to work with IPv6

Could we change .* (greedy) to .*? (non-greedy) here?

Ah, I'm sorry, I didn't consider IPv6. Writing a regex for IPv6 addresses is beyond what I can do here.
Switching from greedy to non-greedy works for the description group, but not for the URL group.
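One way to sidestep an IPv6 regex, offered only as a sketch and not as what this PR implements: take the first comma-separated field and let Python's standard ipaddress module validate it. The helper name parse_ip_field and the sample lines below are made up for illustration.

```python
import ipaddress


def parse_ip_field(line):
    """Hypothetical helper: validate the first comma-separated field as an IP."""
    field = line.split(",", 1)[0]
    # ip_address() accepts both IPv4 and IPv6 and raises ValueError otherwise
    return ipaddress.ip_address(field)


print(parse_ip_field("64.183.187.20,some description").version)  # 4
print(parse_ip_field("2001:db8::1,some description").version)    # 6
```

This relies on the address being the first field and containing no comma, which holds for both address families.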

ghost · 2019-09-16T10:48:57Z

intelmq/bots/parsers/bambenek/parser.py

-            value = line.split(',')
+            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
+                (?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)
+            values = m.groups()


That line raises an exception in the tests

It could be my line split. I'm not sure how to break the regex line into two lines in order not to trigger the "Line too long" warning when it comes to code style checking.

Just split the line like this:

m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
             r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)
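For context on why the backslash version fails: inside a raw string literal, a backslash followed by a newline is not a line continuation. Both characters, plus the indentation of the next line, stay in the pattern, so re.match returns None and m.groups() raises AttributeError. A minimal sketch using the sample line reported later in this thread (the names LINE, broken and fixed are only for illustration):

```python
import re

LINE = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, "
        "this is not the C2 IP, 2019-09-17 06:06,"
        "http://osint.bambenekconsulting.com/manual/necurs.txt")

# Backslash at the end of a raw string line: the backslash, the newline and
# the leading spaces of the next line all become part of the pattern.
broken = r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
    (?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)"

# Adjacent string literals are concatenated by the compiler; nothing is added.
fixed = (r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
         r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)")

print("\n" in broken)                            # True
print(re.match(broken, LINE))                    # None
print(re.match(fixed, LINE).group("timestamp"))  # 2019-09-17 06:06
```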

ghost · 2019-09-16T10:52:54Z

intelmq/bots/parsers/bambenek/parser.py

@@ -32,32 +33,33 @@ def parse_line(self, line, report):
            self.tempdata.append(line)

        else:
-            value = line.split(',')
+            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
+                (?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)


Could we change .* (greedy) to .*? (non-greedy) here?

Can we add $ at the end to indicate that the regular expression should match the full line?
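The two suggestions interact: a non-greedy group at the very end of a pattern matches as little as possible, which is the empty string, unless something after it forces it to extend. A small sketch of that behaviour, assuming the sample line reported in this thread and a timestamp pattern simplified from the one in the diff (PREFIX, unanchored and anchored are illustrative names):

```python
import re

LINE = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, "
        "this is not the C2 IP, 2019-09-17 06:06,"
        "http://osint.bambenekconsulting.com/manual/necurs.txt")

PREFIX = (r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*?), "
          r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}),")

# Without an end anchor the lazy URL group is satisfied by zero characters.
unanchored = re.match(PREFIX + r"(?P<url>.*?)", LINE)
# With $ the lazy group has to extend to the end of the line.
anchored = re.match(PREFIX + r"(?P<url>.*?)$", LINE)

print(repr(unanchored.group("url")))  # ''
print(anchored.group("url"))
```

The description group still captures the embedded commas, because the timestamp that follows it only matches at one position in the line.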

amojamo · 2019-09-17T07:13:17Z

> What was the actual problem you were facing? E.g. a line which could not be parsed?

Bambenek has edited the IP list since then, so the line parses without error as of today. Before their last edit, the problem was on the following line:

64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt

This has since been changed to the following:

64.183.187.20,IP resolved by necurs C&C uses encoded IP - this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt

The parsing failed because the line contained more commas than anticipated, so in event.add('event_description.url', value[3]) the field value[3] contained the text "this is not the C2 IP" instead of the anticipated URL.

This kind of makes the change obsolete, but without a regular expression the parser is more fragile than it needs to be.
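The failure can be reproduced outside the bot with the old line quoted above. This is only an illustration of the split behaviour, not the parser code itself:

```python
line = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, "
        "this is not the C2 IP, 2019-09-17 06:06,"
        "http://osint.bambenekconsulting.com/manual/necurs.txt")

value = line.split(',')

# Two extra commas in the description shift every later field by two.
print(len(value))   # 6, where the parser expects 4
print(value[3])     # ' this is not the C2 IP' ends up where the URL is expected
print(value[5])     # the URL is actually the last field
```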

ghost · 2019-09-18T14:20:19Z

> > What was the actual problem you were facing? E.g. a line which could not be parsed?
>
> They (Bambenek) have since last time edited the IP list, so it parses the line without error as of today. The problem before their last edit was on the following line(s):
>
> 64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt
>
> This has since been changed to the following:
>
> 64.183.187.20,IP resolved by necurs C&C uses encoded IP - this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt
>
> The problem was that the parsing failed because there were more commas than anticipated, so event.add('event_description.url', value[3]) contained the text "this is not the C2 IP" instead of the anticipated URL.

Well, that's obvious.

> This kinda makes the change obsolete in a way, but without a regular expression the parser is more fragile than it needs to be.

If the regular expression itself is stable - yes. I opened https://github.com/amojamo/intelmq/pull/1 for some tests which you could use for development.

If the format is proper CSV (commas in a field are fine as long as they are properly escaped), we can use Python's own csv module, as with any other CSV-based feed. That's IMHO the best option.
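A rough sketch of what that condition means in practice, using Python's csv module directly. The quoted variant below is hypothetical, it is not something the feed is known to send; it only shows that the csv module keeps a description together when the commas are quoted, and does not when they are not.

```python
import csv

URL = "http://osint.bambenekconsulting.com/manual/necurs.txt"

# Line as reported in this thread: commas in the description are not quoted.
unquoted = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, "
            "this is not the C2 IP, 2019-09-17 06:06," + URL)
# Hypothetical variant with the description quoted (an assumption).
quoted = ('64.183.187.20,"IP resolved by necurs C&C, uses encoded IP, '
          'this is not the C2 IP", 2019-09-17 06:06,' + URL)

# skipinitialspace drops the space that follows a delimiter.
unquoted_row = next(csv.reader([unquoted], skipinitialspace=True))
quoted_row = next(csv.reader([quoted], skipinitialspace=True))

print(len(unquoted_row))  # 6: still split inside the description
print(len(quoted_row))    # 4: ip, description, timestamp, url
print(quoted_row[1])
```

So the csv approach only helps if the feed escapes its fields; for unescaped lines like the one reported here it behaves the same as split(',').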

ghost · 2020-05-20T16:25:37Z

Are you still working on this? The tests are still failing on values = m.groups()

Improved parser to better handle description

debd147

amojamo changed the title ~~Improved parser to better handle description~~ Improved Bambenek parser to better handle description Sep 13, 2019

Removed whitespace and broke down regex expression

e5b4372

ghost suggested changes Sep 16, 2019

TST: add bambenek tests for commas in descriptions

56021be

see #1451

ghost added the component: bots and feature request labels Sep 18, 2019

ghost added this to the 2.1.0 milestone Sep 18, 2019

Adrian Moen and others added 3 commits September 19, 2019 10:04

Corrected line-break in regex expression

f8dfa9b

Merge pull request #1 from wagner-certat/pr-1451

87a11ed

TST: add bambenek tests for commas in descriptions

Merge branch 'develop' of https://github.com/amojamo/intelmq into dev…

4065494

…elop

ghost modified the milestones: 2.1.0, 2.2.0 Oct 25, 2019

Merge branch 'develop' into pr-1451

ef4ec5c

ghost marked this pull request as draft June 16, 2020 13:47

ghost removed this from the 2.2.0 milestone Jun 17, 2020

ghost added the needs: feedback label Aug 20, 2021
