
Add an example of a custom classifier #4

Open

vatjujar opened this issue Sep 27, 2017 · 12 comments

@vatjujar

I'd like to see an example of a custom classifier that is proven to work with custom data. The reason for the request is the headache I've had trying to write my own: my efforts simply do not work. My code (and patterns) works perfectly in online Grok debuggers, but it does not work in AWS. I do not get any errors in the logs either; my data simply does not get classified, and table schemas are not created.

So, the classifier example should include a custom file to classify, maybe a log file of some sort. The file itself should include various types of information so that the example demonstrates various pattern matches. The example should then present the classifier rule, and perhaps include a custom keyword to demonstrate that usage as well. A deliberate mistake (in both the input data and the patterns) should also be demoed, along with how to debug that situation in AWS.

Thanks in advance!

@wintersky

wintersky commented Oct 11, 2017

Here is one of mine. For log lines like this:
some-log-type: source-host-name 2017-07-01 00:00:01 - {"foo":1,"bar":2}

I set up a Glue custom classifier with:

Grok pattern: %{OURLOGWITHJSON}

Custom patterns:
OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})
OURWORDWITHDASHES \b[\w-]+\b
OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}
GREEDYJSON (\{.*\})
OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$

(Note: Logstash works with GREEDYJSON ({.*}), but Glue's Grok parser rejects that; hence the escaped braces above.)

and I get rows with four fields:
ourevent: some-log-type
logsource: source-host-name
ourtimestamp: 2017-07-01 00:00:01
json: {"foo":1,"bar":2}

The Grok patterns are a bit more complicated than the minimum needed to match that sample line: in particular, the colon after "some-log-type" is optional, the ' - ' may or may not be present, and the timestamp may be in ISO8601 format.
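If you would rather set this up programmatically than through the console, a minimal boto3 sketch might look like the following (the classifier name and classification string are placeholders; the pattern strings are the ones above):

    import boto3

    glue = boto3.client("glue")  # region/credentials assumed to be configured

    glue.create_classifier(
        GrokClassifier={
            "Name": "our-log-with-json",   # placeholder name
            "Classification": "our-logs",  # becomes the table's classification
            "GrokPattern": "%{OURLOGWITHJSON}",
            # Custom patterns go in one newline-separated string
            "CustomPatterns": "\n".join([
                r"OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})",
                r"OURWORDWITHDASHES \b[\w-]+\b",
                r"OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}",
                r"GREEDYJSON (\{.*\})",
                r"OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$",
            ]),
        }
    )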

@wintersky

wintersky commented Oct 11, 2017

I updated the text above, so the backslashes are now correctly shown in the GREEDYJSON pattern. (The original text elided the backslashes in front of the braces; I had to add those in order for Glue's Grok parser to accept the pattern.)

@billmetangmo

I have the same issue as @vatjujar.

@loudmouth

We are also experiencing the same issue while trying to parse Apache-style log lines: everything works perfectly in online Grok debuggers, but manually running a crawler shows nothing. A more detailed example would be greatly appreciated!

@ramzanfarooq

I have tried many times, but it is not working: all my Grok patterns work well in the Grok debugger, but not in AWS Glue.

@bmardimani

I tried writing a pattern for a single-quoted, semi-JSON data file, and it works in the debugger but not in Glue. Any help is much appreciated!

@wintersky

As shown above, I had to include backslashes before the brace characters (see GREEDYJSON) to get the pattern to match the JSON part of my log lines. The match goes into a string field named json, which I later unbox in a Glue script like this:

...
unbox5 = Unbox.apply(frame = priorframe4, path = "json", format = "json", transformation_ctx = "unbox5")
...

The backslashes weren't necessary in the online Grok debugger or in Logstash, but were necessary in Glue's Grok patterns. Dunno if that's your issue or not, but you might try throwing around some backslashes to see if it helps!
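For context, here is a sketch of how that Unbox call might sit in a complete Glue script (the database and table names are hypothetical; priorframe4 and unbox5 match the snippet above):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import Unbox

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the table the crawler created (names are hypothetical)
    priorframe4 = glue_context.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_logs_table")

    # Turn the "json" string field into a proper nested struct
    unbox5 = Unbox.apply(frame=priorframe4, path="json", format="json",
                         transformation_ctx="unbox5")
    unbox5.printSchema()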

@vkubushyn

I just ran into this same issue. The problem was that in order to test an updated classifier, you need to create a whole new crawler. Simply updating the classifier and rerunning the crawler will NOT result in the updated classifier being used. This is not intuitive at all, and it is not documented in the relevant places.

The only place this is explicitly mentioned (that I found) is in https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html - "To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier"

This nugget of information needs to be added, in bold capital letters, to every other place custom classifiers are documented.
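For example, here is a minimal boto3 sketch of spinning up a fresh crawler bound to the updated classifier (all names, paths, and the role ARN are placeholders):

    import boto3

    glue = boto3.client("glue")

    # The classifier list is bound when the crawler is created, so make a
    # NEW crawler rather than rerunning the old one.
    glue.create_crawler(
        Name="my-logs-crawler-v2",
        Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/logs/"}]},
        Classifiers=["our-log-with-json"],  # the updated custom classifier
    )
    glue.start_crawler(Name="my-logs-crawler-v2")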

@naginenisridhar

I have a lot of text files in our S3 bucket, under different folders, containing columnar data sections. The automatic crawler does not recognize the schema in those files. How do we set up a custom crawler for text files with columnar data?

@naginenisridhar

How do we set up a crawler on S3 buckets with "ini" file formats?

@danilocgomes


Thank you @vkubushyn, you saved me some time. I faced the same issue here.

@BwL1289

BwL1289 commented Sep 30, 2019

In addition:

  1. Glue Grok classifiers and Grok debugger patterns are not exactly the same.
  2. Don't crawl specific files; crawl the directories instead.
  3. Multiline and newline-embedded records are not supported, so you need to transform the file contents via a script first (see the sketch below).
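Here is a minimal sketch of such a transform, assuming records continue on indented lines (the bucket, keys, and that indentation heuristic are all assumptions; adapt them to your log format):

    import boto3

    s3 = boto3.client("s3")

    # Bucket and keys are placeholders
    obj = s3.get_object(Bucket="my-bucket", Key="raw/app.log")
    text = obj["Body"].read().decode("utf-8")

    # Assumption: a new record starts at a non-indented line; indented lines
    # (e.g. stack traces) are continuations of the previous record.
    records, current = [], []
    for line in text.splitlines():
        if line.startswith((" ", "\t")) and current:
            current.append(line.strip())
        else:
            if current:
                records.append(" ".join(current))
            current = [line]
    if current:
        records.append(" ".join(current))

    # Write the one-record-per-line version to a separate prefix for the crawler
    s3.put_object(Bucket="my-bucket", Key="flattened/app.log",
                  Body="\n".join(records).encode("utf-8"))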
