Screens legal text and extracts sentences containing user input party name-predicate phrases

jblake1965/eluciDoc

What this is:

This CLI Python project, written for the Windows™ environment, extracts the sentences and clauses in a single document that contain specific user-entered terms. It was originally created as a tool to aid in the review of legal contracts, but can be used with any text. Documents can be in .docx, .pdf or .txt file formats. The general principle behind its function is subject-predicate sentence analysis: each search looks for a user-selected party in the document, followed by a user-selected phrase. It is used in conjunction with the Microsoft™ Office 365™ Word and Excel™ apps.

How it works:

The path to a .docx, .pdf or .txt file is entered (drag and drop works in the Windows terminal):

[Screenshot: file_input]

The file is then processed as UTF-8 text: MS Word smart quotes are converted to straight quotes, and non-ASCII and non-breaking spaces are removed. The term for the party being searched in the document is entered next:

[Screenshot: enter_party_name]

The party term and the processed text are then passed to textacy's Keyword in Context (KWIC) function. The result is saved as an Excel file in the same location as the searched document, using the document's name with "..._[name of the party]_search_result.xlsx" appended. The Excel file is opened automatically with a subprocess call, and the results can be converted to a table for further sorting:

[Screenshot: textacy_rendering]
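To illustrate what a keyword-in-context pass produces, here is a minimal standard-library sketch. It is not textacy's implementation; the function name `kwic_sketch` and its parameters are assumptions made for illustration. It yields the text to the left of each hit, the hit itself, and the text to the right, which is the shape of the rows that end up in the Excel file.

```python
import re


def kwic_sketch(text, keyword, ignore_case=False, window=40):
    """Yield (left context, keyword, right context) for each hit.

    Hypothetical stand-in for a KWIC function; not textacy's API.
    """
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(re.escape(keyword), text, flags):
        start, end = match.span()
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        yield left, match.group(), right


sample = "The Buyer shall pay the Seller. The Buyer's obligations survive."
rows = list(kwic_sketch(sample, "Buyer", window=20))
for left, hit, right in rows:
    print(f"{left!r} | {hit} | {right!r}")
```

The right-hand context is where candidate predicate phrases such as "shall pay" or "'s obligations" can be read off for the next step.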

Note: the subprocess call below uses the default Office install location:

 subprocess.Popen([r'C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE', result_file])

If the user has Office installed in a different location, then the code must be changed to reflect that directory.
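One way to avoid editing the path by hand is to try a short list of candidate locations and fall back to the file's default application. The sketch below is an assumption about how that could look, not the project's code: the helper names are hypothetical, and the second candidate path is only a guess at a common 32-bit install location.

```python
import os
import subprocess
from pathlib import Path

# Candidate locations. The first is the path used in the project;
# the second is an assumed alternative and may not apply to your install.
EXCEL_CANDIDATES = [
    r"C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE",
    r"C:\Program Files (x86)\Microsoft Office\root\Office16\EXCEL.EXE",
]


def first_existing(candidates):
    """Return the first candidate path that exists on disk, else None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def open_result(result_file, candidates=EXCEL_CANDIDATES):
    """Open result_file with Excel if found, else with the default app."""
    exe = first_existing(candidates)
    if exe is not None:
        subprocess.Popen([exe, result_file])
    elif hasattr(os, "startfile"):  # os.startfile exists only on Windows
        os.startfile(result_file)
    else:
        raise FileNotFoundError("No Excel executable found in candidates")
```

The same pattern would apply to the Word executable used at the end of the run.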

The document is chunked into sentences (or clauses, depending on the formatting) with the spaCy module. The user is prompted to enter predicate search phrases culled from the Excel search file; these are stored in a list.
Once the user has finished entering predicate search phrases, the script iterates through the list, looking for a match in each sentence. Sentences and clauses that contain a match and are not already in the master list are added to it. The master list is then saved as a Word file that is opened automatically at the end of the run (as with Excel, note the location of the Word executable and adjust the path if it is not in the standard install location).
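The matching loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: sentences are passed in as an already-split list (the project uses spaCy for that step), `collect_matches` is a hypothetical name, and a match is taken to mean the party name followed directly by the phrase, with optional whitespace in between.

```python
import re


def collect_matches(sentences, party, phrases, case_sensitive=True):
    """Return sentences where `party` is directly followed by any phrase."""
    flags = 0 if case_sensitive else re.IGNORECASE
    master = []
    for phrase in phrases:
        # Optional whitespace lets both "shall pay" and "'s obligations" match.
        pattern = re.compile(
            re.escape(party) + r"\s*" + re.escape(phrase), flags
        )
        for sentence in sentences:
            # Skip sentences already captured by an earlier phrase.
            if pattern.search(sentence) and sentence not in master:
                master.append(sentence)
    return master


sentences = [
    "The Buyer shall pay the Purchase Price at Closing.",
    "The Seller shall deliver the Goods.",
    "The Buyer's obligations survive termination.",
    "The Buyer shall pay all taxes.",
]
result = collect_matches(sentences, "Buyer", ["shall pay", "'s obligations", "shall"])
for sentence in result:
    print(sentence)
```

In this sketch the master list is ordered by phrase first, then by position in the document; the third phrase adds nothing because its matches were already captured by the first.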

External Dependencies and Licenses

| Name | Version | License |
|---|---|---|
| docx2python | 2.10.1 | MIT |
| openpyxl | 3.1.2 | MIT |
| pandas | 2.2.2 | BSD |
| pdfminer.six | 20231228 | MIT/X |
| python-docx | 1.1.2 | MIT |
| rich | 13.7.1 | MIT |
| spacy | 3.7.4 | MIT |
| textacy | 0.13.0 | Apache 2.0 |

Installation

It is strongly recommended that this package be installed in a virtual environment. The package is available at https://pypi.org/project/elucidoc/ and can be installed with pip install elucidoc.

THE SPACY PIPELINE en_core_web_lg MUST ALSO BE INSTALLED INTO THE VIRTUAL ENVIRONMENT FOR THE SCRIPT TO WORK.

The pipeline can be installed as follows:

python -m spacy download en_core_web_lg
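Because spaCy pipelines are installed as ordinary Python packages, their presence can be checked before the script runs. The sketch below uses only the standard library; the helper name `pipeline_installed` is hypothetical and is not part of the project.

```python
import importlib.util


def pipeline_installed(name="en_core_web_lg"):
    """Return True if the named package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


if not pipeline_installed():
    print("Missing pipeline; run: python -m spacy download en_core_web_lg")
```

Run this with the virtual environment's interpreter, since the check only reflects the environment it executes in.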

You must also verify that the Office install directory is the same as noted above. If it is not, the code must be changed to point to the directory where the Excel and Word apps are located.

Running the Script

The project is run as a script. It can be run with a .bat file that calls the virtual environment's Python interpreter and the script file, as in the example below:

@"C:\Users\..\venv\Scripts\python.exe" "C:\Users\..\venv\lib\elucidoc\eluciDoc.py"

@pause

Additionally, the location of the eluciDoc.py script can be included in the Windows PATH environment variable.

Case Sensitive Searches

The general convention in legal texts is to capitalize defined terms. For that reason, the user may want to make the search case-sensitive to target the appropriate instances of the term. For searches where broader capture matters more than the specific use of the subject term, the case-sensitive feature can be turned off. Once a selection is made, it applies to all subsequent searches until the script is restarted.

Possessive Case and Other Punctuation

textacy separates the party search term from whatever follows it, whether a word or punctuation, including the possessive ending, as shown below:

[Screenshot: textacy_rendering]

To capture a possessive form of the party being searched, 's or ' (for the plural possessive) must be the first characters of the predicate search phrase, as illustrated by the prompt below:

[Screenshot: enter_predicate_phrase]

The same principle applies to a comma, colon, semicolon or closing parenthesis immediately following the party name.

Smart Quotes

Microsoft Word's default settings use smart quotes, which are the curly quotation marks. These are problematic when searching documents converted to text (they are rendered as slanted quotes in UTF-8), so they are replaced with straight quotes via the following code:

text = re.sub(r'”', '\"', text)  # replace double smartquote close quote
text = re.sub(r'“', '\"', text)  # replace double smartquote open quote
text = re.sub(r'’', '\'', text)  # replace single smartquote close quote
text = re.sub(r'‘', '\'', text)  # replace single smartquote open quote

PDFs

Due to the nature of .pdf files and the sometimes-inconsistent results that occur when converting pdf documents to text format, additional processing is done. Some characters and extra spaces between word boundaries are removed as part of the text processing:

text = re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
text = re.sub(r'(\b)(\s{2,4})(\b)', r'\g<1> ', text)

The above solution is not a comprehensive fix for pdf issues. The accuracy of the results with searches of .pdf files may be negatively impacted by the quality or formatting of the underlying document, particularly with scanned documents.
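To make the effect of the two substitutions concrete, the sketch below wraps them in a function and runs them on small sample strings. The wrapper name `clean_pdf_text` and the sample inputs are assumptions for illustration; the patterns are the ones shown above.

```python
import re


def clean_pdf_text(text):
    """Apply the two substitutions shown above, in the same order."""
    # Drop characters outside the allowed ranges (e.g. form feeds).
    text = re.sub(
        '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+',
        '',
        text,
    )
    # Collapse runs of 2 to 4 whitespace characters between words.
    text = re.sub(r'(\b)(\s{2,4})(\b)', r'\g<1> ', text)
    return text


print(repr(clean_pdf_text("The Buyer\x0c shall  pay")))  # 'The Buyer shall pay'
print(repr(clean_pdf_text("a     b")))  # 'a     b' (five spaces are left as-is)
```

The second call shows one of the limits noted above: gaps wider than four whitespace characters fall outside the pattern and are not collapsed.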

Open Files

If a second search is run for the same party while the Excel file with the prior search results is still open, the script notifies the user and does not overwrite the existing Excel file. With Word files, the user is prompted to save the existing file under another name and close it before proceeding with a second search for the same party.
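How the script detects an open file is not shown here. One common approach on Windows, sketched below as an assumption rather than the project's method, relies on the fact that Office typically holds a lock on an open workbook, so an attempt to open it for writing raises PermissionError. The helper name `is_locked` is hypothetical.

```python
from pathlib import Path


def is_locked(path):
    """Return True if an existing file cannot be opened for writing."""
    path = Path(path)
    if not path.exists():
        return False  # nothing to overwrite, so nothing can be locked
    try:
        # Append mode checks write access without truncating the file.
        with open(path, "a"):
            return False
    except PermissionError:
        return True
```

A read-only file would also report as locked under this check, so a real implementation may want to distinguish the two cases in its message to the user.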