
Word count not counting the whole data #11

Open
estebarb opened this issue Nov 28, 2021 · 3 comments

Comments

@estebarb
I'm running the word count program over an 86 GB dataset. The data is UTF-8, already sanitized with newlines and spaces. I already know the total is around 29,000M words, but the word count program's output sums to just 86M words. Also, the logs are full of "too many requests" errors.

How can I debug why the program is not reading the whole input? Is it caused by those "too many requests" errors? Is there any workaround? Thanks!

@bcongdon
Owner

This project is not actively being maintained. That being said, what types of request errors are you seeing in the logs? I'm not sure that worker retry was ever implemented, so that may be the cause of the undercounting.

@estebarb
Author

Thanks! The errors are similar to:

time="2021-11-27T23:07:16-06:00" level=warning msg="Function invocation failed. (Attempt 2 of 3)"
...
time="2021-11-27T23:07:18-06:00" level=error msg="unexpected end of JSON input"
time="2021-11-27T23:07:18-06:00" level=error msg="Error when running mapper 99: TooManyRequestsException: Rate Exceeded.\n{\n  RespMetadata: {\n    StatusCode: 429,\n    RequestID: \"3bffa6f5-1a44-4749-8408-165c4d8881da\"\n  },\n  Message_: \"Rate Exceeded.\",\n  Reason: \"ConcurrentInvocationLimitExceeded\",\n  Type: \"User\"\n}"
time="2021-11-27T23:07:19-06:00" level=warning msg="Function invocation failed. (Attempt 3 of 3)"

@bcongdon
Owner

It may be worth trying to configure "maxConcurrency" (ref) to a lower value. It looks like the default is 500, which, in retrospect, may be too high.
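For context on what a lower maxConcurrency buys you: the general mechanism is to bound the number of in-flight invocations so they stay under the account's Lambda concurrent-execution quota. A sketch of that pattern using a buffered-channel semaphore (the limit of 10 and the bookkeeping are illustrative, not the framework's implementation):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// At most maxConcurrency goroutines may hold a slot at once; the real
	// value must stay under the account's Lambda concurrency quota.
	const maxConcurrency = 10
	sem := make(chan struct{}, maxConcurrency)

	var inFlight, peak int64
	var wg sync.WaitGroup
	for task := 0; task < 100; task++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // blocks once 10 invocations are in flight
			defer func() { <-sem }() // release the slot

			// Track the peak number of simultaneous "invocations".
			n := atomic.AddInt64(&inFlight, 1)
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	fmt.Println("peak in-flight within limit:", atomic.LoadInt64(&peak) <= maxConcurrency)
}
```

With the cap below the quota, requests queue locally instead of bouncing off Lambda with ConcurrentInvocationLimitExceeded.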

As for "unexpected end of JSON input": it's hard to know for sure, but if that error is being emitted from the framework, I'm guessing it'd be from here, where the reducers decode the intermediate map output.
