
Topic Modeling Supreme Court Opinions

In this project, I assembled a corpus of the U.S. Supreme Court's entire body of opinions, spanning the 1790s to the present, and performed unsupervised topic modeling of its 14th Amendment jurisprudence using latent semantic analysis, k-means clustering, and other natural language processing techniques.

This analysis serves as the foundation for further exploration of the 64,000+ document corpus to gain insight into judicial methods, to add to the body of knowledge of law and linguistics, to perform historical and predictive analytics, and to improve upon legal research algorithms.

Assembling the Corpus

I used Scrapy, BeautifulSoup, bash, SQL, and Django to scrape 64,000+ opinions from the web, along with their citation information, and merged everything into a master database.
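
A minimal Scrapy sketch of the crawl (the spider name, start URL, and CSS selectors here are illustrative assumptions, not the ones actually used):

import scrapy

class OpinionSpider(scrapy.Spider):
    name = 'opinions'
    # Hypothetical entry point; the real corpus also drew on bulk data downloads
    start_urls = ['https://www.courtlistener.com/?q=&type=o&court=scotus']

    def parse(self, response):
        # Hypothetical selectors for each opinion's title and link
        for row in response.css('article'):
            yield {
                'title': row.css('h3 a::text').get(),
                'url': response.urljoin(row.css('h3 a::attr(href)').get()),
            }
        # Follow pagination if a next-page link exists
        next_page = response.css('a[rel="next"]::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)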

Linguistic Preprocessing

Because law has a specialized vocabulary and an idiosyncratic writing style, most off-the-shelf NLP toolkits stumble on artifacts like citations. I therefore wrote a custom pipeline using regular expressions, NLTK, scikit-learn, and LexNLP to strip legal-domain stop words and citation noise.

After tokenizing, lemmatizing, and removing stop words (sketched after the fragment below), I began processing the opinions.

Fragment of a custom component for the NLP pipeline:

from lexnlp.extract.en.citations import get_citations

def stop_citation_noise(text):
    '''
    Takes in a string, returns a list of citation strings.
    Use to remove unwanted citations to legal reports,
    document headers, appendices, and artifacts of
    digital processing.
    '''
    stops = []
    for cite in get_citations(text, return_source=True, as_dict=True):
        # Rebuild the citation in "volume reporter page" form
        buildcite = f"{cite['volume']} {cite['reporter']} {cite['page']}"
        if cite['page2']:
            # Pin cites carry a second page number
            buildcite += f", {cite['page2']}"
        stops.append(buildcite)
    return stops

def replace_citations(text):
    '''Takes in a string, returns it with all detected citations removed.'''
    for stop in stop_citation_noise(text):
        text = text.replace(stop, " ")
    return text
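
Downstream of citation removal, the tokenize/lemmatize/stop-word step might look like this sketch (the legal-domain stop words shown are illustrative examples, not the full list):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# A few example legal-domain stop words layered onto NLTK's English list
LEGAL_STOPS = {'court', 'plaintiff', 'defendant', 'supra', 'infra', 'id'}
STOPS = set(stopwords.words('english')) | LEGAL_STOPS
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # Strip citation noise first, then tokenize and lemmatize
    tokens = word_tokenize(replace_citations(text).lower())
    return [LEMMATIZER.lemmatize(tok) for tok in tokens
            if tok.isalpha() and tok not in STOPS]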

Latent Semantic Analysis

I began with the 14th Amendment, extracting the relevant cases with regex (n=4579). After preprocessing, I vectorized the vocabulary using TF-IDF to obtain a frequency weight for each word in the corpus and to prepare the documents for singular value decomposition and latent semantic analysis.

Using LSA, I clustered topics and extracted the top terms for each, first on the corpus as a whole, then on time-segmented slices of the data using a rolling window function.
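
A minimal sketch of that step, assuming the preprocessed opinions live in a list named docs (the hyperparameters are illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weights each term by frequency, discounted by document frequency
vectorizer = TfidfVectorizer(max_df=0.5, min_df=5)
X = vectorizer.fit_transform(docs)

# LSA is truncated SVD applied to the TF-IDF matrix
lsa = TruncatedSVD(n_components=10, random_state=42)
X_topics = lsa.fit_transform(X)

# Print the top terms for each topic
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    top = comp.argsort()[-8:][::-1]
    print(f"Topic {i}: {', '.join(terms[t] for t in top)}")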

A sampling of the top terms per topic (figures: words6, words7, words10).

Modifying the source code of Python's WordCloud package, I wrote a custom function that scales each word's size by its frequency and colors it by topic. I then wrote a script to generate a word cloud for each rolling window.
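
A sketch of the same effect using the stock WordCloud API rather than the modified source, via a custom color_func (the frequencies and topic colors below are made-up placeholders):

from wordcloud import WordCloud

# Placeholder term frequencies and term -> topic color assignments
freqs = {'equal': 120, 'protection': 110, 'process': 95, 'due': 90}
topic_colors = {'equal': 'crimson', 'protection': 'crimson',
                'process': 'navy', 'due': 'navy'}

def color_by_topic(word, **kwargs):
    # Color each word according to the topic it belongs to
    return topic_colors.get(word, 'gray')

wc = WordCloud(width=800, height=400, background_color='white',
               color_func=color_by_topic)
wc.generate_from_frequencies(freqs)  # word size scales with frequency
wc.to_file('window_1980_2000.png')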

A sampling of the per-window word clouds (figures: 1980-2000, 1970-1990, 1950-1970, 1940-1960).

K-means Clustering

To sharpen the topic modeling, I used a k-means algorithm to cluster terms, starting with the whole corpus.

I chose the number of clusters using silhouette-score testing.

(Figure: silhouette scores across candidate cluster counts.)
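
A sketch of that selection loop, assuming X is the TF-IDF (or LSA-reduced) matrix from the previous step and an illustrative range of k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = 2, -1.0
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f'k={k}: silhouette={score:.3f}')
    if score > best_score:
        best_k, best_score = k, score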

Applying the k-means function to the data using a rolling window enabled me to model the 14th Amendment's evolving meaning over time.
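
A sketch of the rolling-window pass (the window width, step, year range, and DataFrame layout are assumptions; cluster_window is a hypothetical stand-in for the TF-IDF + k-means fit above):

def rolling_windows(df, start=1870, end=2020, width=20, step=10):
    # Yield (lo, hi, slice) for each overlapping window of opinions
    for lo in range(start, end - width + 1, step):
        yield lo, lo + width, df[(df['year'] >= lo) & (df['year'] < lo + width)]

for lo, hi, window in rolling_windows(opinions_df):
    cluster_window(window['text'])  # hypothetical: re-fit TF-IDF + k-means per window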

Whole-corpus k-means terms (figure).

Built With

  • spaCy, NLTK, LexNLP
  • Django, Scrapy, bash, RESTful API
  • scikit-learn, numpy, pandas
  • wordcloud, PIL, OpenCV, Plotly, Plotly Express

Authors

Acknowledgments

  • I am indebted to CourtListener for bulk data downloads of case law and to Oyez for citation and other information.
  • Thanks to Michael J. Carlisle (Medium, GitHub) for inspiration, collaboration, and good company.
