lc_data.clean_text function not fully abstracted #2

schaunwheeler · 2013-10-01T13:13:04Z

The regex parameter in the clean_text function is just a workaround. The desc field in the lc data contains various updates following the pattern:

r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>'

The regex, and the parts of the function that reference it, exist to identify these markers, extract every date other than the first update date (that first "update" was actually the original posting), and delete anything that was entered after those later date markers. I can think of other situations in which it would be useful to identify a body of text by regex and then delete everything after the nth occurrence of that regex.
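A rough sketch of what the generalized "cut at the nth occurrence" behavior might look like. The function name `truncate_at_nth_match` and the sample `desc` string are made up for illustration and are not part of lc_data; the sketch also assumes the nth marker itself should be dropped along with everything after it.

```python
import re

# Marker pattern quoted in this issue.
UPDATE_MARKER = re.compile(r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>')


def truncate_at_nth_match(text, pattern, n):
    """Return `text` up to the start of the nth (1-based) match of `pattern`.

    Hypothetical helper: if there are fewer than n matches, the text is
    returned unchanged.
    """
    for count, match in enumerate(re.finditer(pattern, text), start=1):
        if count == n:
            return text[:match.start()]
    return text


# Invented sample text shaped like the desc field described above.
desc = ("Borrower added on 10/01/13 > Need to consolidate debt. "
        "Borrower added on 10/05/13 > Updated my income.")

# Keep the original posting (first marker), drop the later update.
kept = truncate_at_nth_match(desc, UPDATE_MARKER, 2)

# Dates of every marker other than the first one.
later_dates = [m.group(2) for m in UPDATE_MARKER.finditer(desc)][1:]
```

With `n=2` this reproduces the current special case (keep the original posting, discard later updates); other values of `n` would cover the more general situations mentioned above.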

It would also be useful to feed a dictionary into the regex parameter, where each key names the field of the data that the regex should be applied to. If a field does not have a corresponding key in the dictionary, then the entire regex portion of the clean_text function can be skipped for that field.
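One possible shape for the dictionary idea, as a sketch only. It assumes a record is a plain dict of field name to value (the real data structure in lc_data may differ), and `clean_fields`, `regex_by_field`, and the sample record are hypothetical names and data.

```python
import re


def clean_fields(record, regex_by_field, n=2):
    """Truncate each text field at the nth match of the regex keyed to it.

    Hypothetical sketch: fields without an entry in `regex_by_field`
    (or with non-string values) skip the regex step entirely.
    """
    cleaned = {}
    for field, value in record.items():
        pattern = regex_by_field.get(field)
        if pattern is None or not isinstance(value, str):
            cleaned[field] = value  # no regex for this field: leave as-is
            continue
        matches = list(re.finditer(pattern, value))
        if len(matches) >= n:
            # Cut at the start of the nth marker.
            cleaned[field] = value[:matches[n - 1].start()]
        else:
            cleaned[field] = value
    return cleaned


# Invented sample record; only desc has a regex assigned.
record = {
    "desc": "Borrower added on 10/01/13 > A. 123 added on 10/05/13 > B.",
    "title": "Debt consolidation",
}
regex_by_field = {
    "desc": r'(\d+|Borrower)\s+added\s+on\s+(\d{2}/\d{2}/\d{2})\s+>',
}

result = clean_fields(record, regex_by_field)
```

Here `title` passes through untouched because it has no key in `regex_by_field`, which is the skip behavior described above.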

If the above enhancements are made, clean_text can be used as a general-purpose tool for cleaning text fields. As the function stands right now, it can only be used for the one very specific task for which it was originally created.
