lc_data.clean_text function not fully abstracted #2

Open
schaunwheeler opened this issue Oct 1, 2013 · 0 comments

The regex parameter in the clean_text function is just a workaround. The desc field in the lc data contains a series of updates, each introduced by a marker that matches the pattern:

r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>'

The regex, and the parts of the function that reference it, exist to identify these markers, extract every date other than the first update date (that first "update" was actually the original posting), and delete anything that was entered after those date markers. I can think of other situations in which it would be useful to identify a body of text by regex and then delete everything after the nth occurrence of that regex.
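
To make the "nth occurrence" idea concrete, here is a rough sketch of what a generalized helper might look like. The name `truncate_at_nth_match` and its signature are hypothetical and are not part of lc_data; the sketch also assumes that the cut happens at the start of the nth marker, so the marker itself is dropped along with everything after it. The sample desc string is made up for illustration.

```python
import re

# Pattern quoted above: marks each update in the desc field.
UPDATE_MARKER = r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>'


def truncate_at_nth_match(text, pattern, n=2):
    """Hypothetical helper: cut `text` at the n-th match of `pattern` (1-based).

    Returns (kept_text, later_matches), where later_matches holds the match
    objects from the n-th one onward. If there are fewer than n matches,
    the text is returned unchanged with an empty list.
    """
    matches = list(re.finditer(pattern, text))
    if len(matches) < n:
        return text, []
    cutoff = matches[n - 1].start()
    return text[:cutoff].rstrip(), matches[n - 1:]


# Made-up example: the first marker is the original posting, the second is an update.
desc = (
    "Borrower added on 10/01/13 > Need a loan for my car. "
    "Borrower added on 10/05/13 > Also paying off a card."
)
kept, later = truncate_at_nth_match(desc, UPDATE_MARKER, n=2)
later_dates = [m.group(2) for m in later]  # dates of every update after the first
```

With n=2 this mirrors the current behavior described above (keep the original posting, collect the later dates, discard the later text); other values of n would cover the more general case.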

It would also be useful to pass a dictionary to the regex parameter, where each key names the field of the data to which the corresponding regex should be applied. If a field has no corresponding key in the dictionary, the regex portion of clean_text can be skipped entirely for that field.
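
A minimal sketch of the dictionary idea, assuming each record is a plain dict of field name to value. The function name `clean_fields`, the `regex_by_field` argument, and the sample record are all hypothetical; the real clean_text may be organized quite differently.

```python
import re


def clean_fields(record, regex_by_field, n=2):
    """Hypothetical sketch: apply a per-field regex cutoff to a record.

    For each field with a key in `regex_by_field`, the value is cut at the
    start of the n-th match of that field's regex. Fields without a key
    (or with non-string values) skip the regex step and pass through as-is.
    """
    cleaned = {}
    for field, value in record.items():
        pattern = regex_by_field.get(field)
        if pattern is None or not isinstance(value, str):
            cleaned[field] = value  # no regex for this field: skip
            continue
        matches = list(re.finditer(pattern, value))
        if len(matches) >= n:
            value = value[:matches[n - 1].start()].rstrip()
        cleaned[field] = value
    return cleaned


# Made-up record: only desc has a regex, so title is left alone.
regex_by_field = {
    "desc": r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>',
}
record = {
    "title": "Car loan",
    "desc": (
        "Borrower added on 10/01/13 > Need a loan for my car. "
        "Borrower added on 10/05/13 > Also paying off a card."
    ),
    "amount": 5000,
}
result = clean_fields(record, regex_by_field)
```

Keying the regexes by field name means an empty dictionary turns the regex step off everywhere, which is roughly the "skip" behavior described above.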

With these enhancements, clean_text could serve as a general-purpose tool for cleaning text fields. As it stands, the function can only be used for the one very specific task for which it was originally created.
