
Coliee24 challenge

Legal case retrieval challenge: a solution based on similarity search and learning-to-rank methods.

Report: link

Introduction

This repository contains the code for the Coliee24 challenge. We focused on Task 1, whose goal is to predict the citations of a given case law: given a query case and a candidate evidence, the task is to predict whether the evidence is relevant to the query.

Data

The dataset is provided by the organizers of the challenge and contains the training and test corpus. It is not included in this repository, but can be requested from the organizers. For more information on the challenge, visit the official website.

Approach

Preprocessing

To process the corpus, we first apply standard text preprocessing techniques: we remove special characters, recurrent tags, multiple spaces, and sentences that would bias the prediction.
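
For illustration, the cleanup could look like the following sketch (FRAGMENT_SUPPRESSED markers do appear in the COLIEE corpus; the other patterns here are examples rather than the repository's exact rules):

import re

def clean_text(text: str) -> str:
    # Drop recurrent tags such as suppressed-citation markers.
    text = re.sub(r"FRAGMENT_SUPPRESSED|<[^>]+>", " ", text)
    # Remove special characters while keeping the [n] paragraph tags.
    text = re.sub(r"[^\w\s\[\].,;:!?'\"()-]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()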

Then, since the texts are fragmented, we perform sentence segmentation with spaCy. In the original documents, every paragraph begins with a tag (a number in square brackets); these tags are preserved through the segmentation process.

Finally, each sentence is checked for French with the lingua-language-detector package and, if needed, translated to English with argostranslate.
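
A condensed sketch of the segmentation and translation steps (exact models and settings may differ from the repository; assumes the Argos fr→en package is installed):

import spacy
import argostranslate.translate
from lingua import Language, LanguageDetectorBuilder

nlp = spacy.load("en_core_web_sm")  # provides sentence segmentation
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH
).build()

def translate_document(text: str) -> str:
    sentences = []
    for sent in nlp(text).sents:
        s = sent.text
        # Translate only the sentences detected as French.
        if detector.detect_language_of(s) == Language.FRENCH:
            s = argostranslate.translate.translate(s, "fr", "en")
        sentences.append(s)
    return " ".join(sentences)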

We then concatenate the sentences to reconstruct the documents. Every document now has the following structure:

[1]
... text in English ...
[2]
... text in English ...
[...]
[N]
... text in English ...

where N is the number of paragraphs in the document.

Baselines

  • Random: Randomly predicts whether the evidence is relevant or not.
  • All-ones: Predicts that all the evidences are relevant.
  • TF-IDF: Every document is represented by a TF-IDF vector. The cosine similarity between the query and the evidence is calculated, and the top n evidences with the highest similarity are selected (a minimal sketch follows this list).
  • Okapi BM25: The ranking function is used to score the relevance of the evidence to the query. The top n evidences with the highest score are selected.
  • GPT text-embedding-3-small: The GPT model is used to generate embeddings for the query and the evidence. The cosine similarity between the embeddings is calculated. The top n evidences with the highest similarity are selected.
  • Embedding Head: The GPT model is used to generate embeddings for the query and the evidence. The embeddings are passed through a feed-forward neural network that reduces their dimensionality; the network is fine-tuned on the training data with a contrastive loss function. The cosine similarity between the resulting embeddings is calculated, and the top n evidences with the highest similarity are selected.
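
As an illustration, a minimal version of the TF-IDF baseline could look like this scikit-learn sketch (the repository's implementation may differ):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_n(query: str, evidences: list[str], n: int = 10) -> list[int]:
    # Fit the vectorizer on the query plus all candidate evidences.
    matrix = TfidfVectorizer().fit_transform([query] + evidences)
    # Cosine similarity between the query row and every evidence row.
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    # Indices of the n most similar evidences, best first.
    return np.argsort(sims)[::-1][:n].tolist()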

Method

Notation:

  • e: embedding of a document;
  • ē: mean of the embeddings of a document;
  • e*: projection of ē into the latent space of the feed-forward neural network;
  • D: set of all documents in the corpus (queries and evidences);
  • q_i: query i, belonging to D;
  • D_i: D without q_i;
  • d_ij: evidence j, belonging to D_i;
  • s_ij: similarity score between q_i and d_ij;
  • D^k_i: top k evidences with the highest similarity to q_i;
  • d^k_ij: evidence j, belonging to D^k_i;
  • E_i: evidences selected for query i, a subset of D^k_i.

Embedding Head baseline

     ┌──────────┐                     │                                           │
     │ Document │                     │                                ┌──────┐   │
     └────┬─────┘                     │                               ┌┴─────┐│   │
          │                           │       ┌─────┐                ┌┴─────┐├┘   │
┌─────────┼─────────┐                 │       │ q_i │                │ d_ij ├┘    │
│         │         │                 │       └──┬──┘                └──┬───┘     │
│         ▼         │                 │          │                      │         │
│      ┌─────┐      │                 │          ▼                      ▼         │
│      │ GPT │      │                 │   Pre-processing          Pre-processing  │
│      └──┬──┘      │                 │          │                      │         │
│         │         │                 │          │                      │         │
│         ▼         │                 │          ▼                      ▼         │
│         ┌───┐     │                 │    ┌───────────┐          ┌───────────┐   │
│        ┌┴──┐│     │  ┌───────────┐  │    │ Embedding │          │ Embedding │   │
│       ┌┴──┐├┘     │  │ Embedding │  │    │   Head    │          │   Head    │   │
│       │ e ├┘      ├──┤   Head    │  │    │ (recall)  │          │ (recall)  │   │
│       └─┬─┘       │  │   (m)     │  │    └─────┬─────┘          └─────┬─────┘   │
│         │         │  └───────────┘  │          │                      │         │
│         │mean     │                 │          └──────┐         ┌─────┘         │
│         ▼         │                 │                 ▼         ▼               │
│       ┌───┐       │                 │            ┌───────────────────┐          │
│       │ ē │       │                 │            │ Cosine similarity │          │
│       └─┬─┘       │                 │            └─────────┬─────────┘          │
│         │         │                 │                      │                    │
│         ▼         │                 │                      ▼                    │
│ ┌───────────────┐ │                 │                     s_ij                  │
│ │ Feed-forward  │ │                 │                      │                    │
│ │ NN trained on │ │                 │                      │                    │
│ │  metric `m`   │ │                 │                      ▼                    │
│ └───────┬───────┘ │                 │                  ┌───────┐                │
│         │         │                 │                  │ Top k │                │
└─────────┼─────────┘                 │                  └───┬───┘                │
          │                           │                      │                    │
          ▼                           │                      ▼                    │
        ┌───┐                         │                  ┌───────┐                │
        │ e*│                         │                  │ D^k_i │                │
        └───┘                         │                  └───────┘                │
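
A minimal PyTorch sketch of such an embedding head (the 1536-dimensional input matches text-embedding-3-small; the layer sizes, latent dimension, and exact contrastive formulation are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    # Projects the mean GPT embedding ē into a smaller latent space (e*).
    def __init__(self, in_dim: int = 1536, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, e_bar: torch.Tensor) -> torch.Tensor:
        return self.net(e_bar)

def contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                     relevant: torch.Tensor, margin: float = 0.5):
    # Pull relevant (query, evidence) pairs together in cosine similarity,
    # push irrelevant pairs below the margin.
    sim = F.cosine_similarity(q, d)
    return torch.where(relevant.bool(), 1 - sim, F.relu(sim - margin)).mean()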

Our method

                                      ┌────────┐                              
                                     ┌┴───────┐│                              
         ┌─────┐                    ┌┴───────┐├┘                              
         │ q_i │                    │ d^k_ij ├┘                               
         └──┬──┘                    └───┬────┘                                
            │                           │                                     
            ▼                           ▼                                     
     Pre-processing                Pre-processing                             
            │                           │                                     
            │                           │                                     
            ├───────────────────────────┇──────────────────┬───────────┐      
            │                           │                  │           │      
      ┌─────┴───────┐             ┌─────┴───────┬──────────┇────┬──────┇───┐  
      │             │             │             │          │    │      │   │  
      ▼             ▼             │             │          │    │      │   │  
┌───────────┐    ┌──────┐   ┌─────┴─────┐    ┌──┴───┐      │    │      │   │  
│ Embedding │    │ GPT  │   │ Embedding │    │ GPT  │      │    │      │   │  
│   Head    │    │ with │   │   Head    │    │ with │      │    │      │   │  
│   (F1)    │    │ mean │   │   (F1)    │    │ mean │      │    │      │   │  
└─────┬─────┘    └──┬───┘   └─────┬─────┘    └───┬──┘      │    │      │   │  
    ┌─┴───────────┐ └──────────┬──┇──────────┐   │         │    │      │   │  
    │   ┌─────────┇───┬────────┇──┘          │   │         │    │      │   │  
    │   │         │   │        │   ┌─────────┇───┤         │    │      │   │  
    │   │         │   │        │   │         │   │         │    │      │   │  
    ▼   ▼         ▼   ▼        ▼   ▼         ▼   ▼         │    │      │   │  
┌────────────┐ ┌─────────┐ ┌────────────┐ ┌─────────┐      ▼    ▼      ▼   ▼  
│   Cosine   │ │   Dot   │ │   Cosine   │ │   Dot   │    ┌────────┐   ┌──────┐
│ Similarity │ │ Product │ │ Similarity │ │ Product │    │ TF-IDF │   │ BM25 │
└─────┬──────┘ └────┬────┘ └────────┬───┘ └────┬────┘    └───┬────┘   └──┬───┘
      │             │               │          │             │           │    
      │             └────────────┐  │  ┌───────┘             │           │    
      │                          │  │  │  ┌──────────────────┘           │    
      └───────────────────────┐  │  │  │  │  ┌───────────────────────────┘    
                              ▼  ▼  ▼  ▼  ▼  ▼                                
                           ┌────────────────────┐                             
                           │   CatBoostRanker   │                             
                           └─────────┬──────────┘                             
                                     │                                        
                                     ▼                                        
                             ┌────────────────┐                               
                             │ Date Filtering │                               
                             └───────┬────────┘                               
                                     │                                        
                                     ▼                                        
                           ┌───────────────────┐                              
                           │ Dynamic Threshold │                              
                           └─────────┬─────────┘                              
                                     │                                        
                                     ▼                                        
                                  ┌─────┐                                     
                                  │ E_i │                                     
                                  └─────┘                                     
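
The ranking stage of the pipeline could be implemented along the following lines. This is only a sketch: the feature assembly, the date-filtering rule, and the dynamic-threshold rule are plausible readings of the diagram, not the repository's exact logic.

import numpy as np
from catboost import CatBoostRanker

def rank_and_select(X_train, y_train, groups_train,
                    X_query, evidence_dates, query_date,
                    ratio: float = 0.9) -> np.ndarray:
    # Each row of X holds the six similarity features shown above;
    # group_id ties the rows of one query together for the ranking loss.
    model = CatBoostRanker(loss_function="YetiRank", verbose=False)
    model.fit(X_train, y_train, group_id=groups_train)

    scores = model.predict(X_query)
    # Date filtering: discard evidences that postdate the query case.
    scores = np.where(evidence_dates <= query_date, scores, -np.inf)
    # Dynamic threshold: keep evidences scoring close to the best one.
    return np.flatnonzero(scores >= ratio * scores.max())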

Test set results

The following table reports the test-set results of all the models, each optimized for the F1 score of the fine-grained predictions. The last row shows the results obtained at the end of the full pipeline.

Method           Recall   Precision   F1 score
Random           0.0134   0.0021      0.0036
All Ones         0.0314   0.0049      0.0085
TF-IDF           0.3681   0.1437      0.2068
BM25             0.2887   0.2255      0.2532
GPT only         0.2350   0.1835      0.2061
Embedding Head   0.1933   0.2131      0.2028
CatBoost         0.2708   0.2424      0.2558

Since the employed ensemble model is explainable, CatBoost provides the feature importances shown in the following table. This information is useful for understanding how the model's predictions respond to changes in the predictor values.

Feature                                         Importance
Embedding Head (F1 model & Cosine similarity)   71.0026
Embedding Head (F1 model & Dot Product)         22.0739
GPT Only (Cosine similarity)                     3.3902
GPT Only (Dot Product)                           2.8006
TF-IDF                                           0.5462
BM25                                             0.1865
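
For reference, CatBoost can produce such a table directly from a fitted model:

from catboost import CatBoostRanker

def show_importances(model: CatBoostRanker) -> None:
    # prettified=True returns a DataFrame of feature names and importances.
    print(model.get_feature_importance(prettified=True))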

GPT text-embedding-3-small embedding space

(Figure: the GPT text-embedding-3-small embedding space.) The points represent the embeddings of the documents projected into the space learned by UMAP.

As for the coloring of the points,

the essence of the approach is that we can use PCA, which preserves global structure, to reduce the data to three dimensions. If we scale the results to fit in a 3D cube we can convert the 3D PCA coordinates of each point into an RGB description of a color. By then coloring the points in the UMAP embedding with the colors induced by the PCA it is possible to get a sense of how some of the more large scale global structure has been represented in the embedding.

(The quote is taken from the UMAP documentation)
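
A sketch of that coloring procedure, assuming the document embeddings sit in a NumPy array X (umap-learn, scikit-learn, and matplotlib based):

import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def plot_umap_pca_colors(X: np.ndarray) -> None:
    # 2D UMAP layout of the embeddings.
    coords = umap.UMAP(random_state=42).fit_transform(X)
    # Scale the 3D PCA coordinates into the unit cube and read them as RGB.
    rgb = MinMaxScaler().fit_transform(PCA(n_components=3).fit_transform(X))
    plt.scatter(coords[:, 0], coords[:, 1], c=rgb, s=5)
    plt.show()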

Embedding Head trained on recall embedding space

(Figure: the same UMAP visualization for the embedding space of the Embedding Head trained on recall.)
