Text.analyze_file([MY_PATH]).strings returns array of characters #11

Open
sciprog opened this issue Jun 18, 2013 · 0 comments
sciprog commented Jun 18, 2013

I'm testing the contents of a PDF generated by PDFKit. When I run Text.analyze_file([MY_PATH]).strings on the file, I get an array that holds each character of the PDF content at its own index. Spaces are stored as '' (empty string). I've been able to move forward by replacing all empty strings with a space character. However, I'm now up against content that contains newline characters. The newlines are not stored in the array, so the separation between the words on either side of a newline is lost. Ever see this sort of behaviour? I realize that there are a number of factors which could be screwing things up, including my own ignorance, and I'd love to find the root of the problem, but I have no time. Right now, I'd be happy with a hack to get my tests working.
Cheers!

EDIT: So I came up with a hack that'll get me through. I remove all the whitespace characters from the array (they weren't actually empty strings, as I had believed), then join the remaining characters with exactly one space between them and downcase the whole thing.

def char_array_to_normalized_string(arr)
  arr.delete_if { |s| s =~ /\s/ }.join(' ').downcase
end

After I put my test strings through the same process, by calling char_array_to_normalized_string("Test String".scan(/./)), I'm able to match them against the output of PDF Inspector. It's not pretty, but it gets me where I need to go.
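
For anyone trying the same hack, here is a rough sketch of the comparison. The two sample arrays are made up for illustration (they are not real output from Text.analyze_file): one has a whitespace entry between the words, the other has nothing at all where the newline was.

```ruby
# Same helper as above. Note that delete_if modifies the array passed in.
def char_array_to_normalized_string(arr)
  arr.delete_if { |s| s =~ /\s/ }.join(' ').downcase
end

# Assumed sample data: per-character arrays like the ones described above.
with_space  = ['T', 'e', 's', 't', ' ', 'S', 't', 'r', 'i', 'n', 'g']
across_line = ['T', 'e', 's', 't', 'S', 't', 'r', 'i', 'n', 'g'] # newline dropped

# The string the test expects, pushed through the same normalization.
expected = char_array_to_normalized_string('Test String'.scan(/./))

puts expected                                                     # t e s t s t r i n g
puts char_array_to_normalized_string(with_space) == expected      # true
puts char_array_to_normalized_string(across_line) == expected     # true
```

The trade-off is that word boundaries are thrown away on both sides, so "Test String" and "TestString" normalize to the same thing.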
Cheers!
