Skip to content

bartczernicki/MachineIntelligence-TextAnalytics-TPLDataFlows

Repository files navigation

MachineIntelligence-TextAnalytics-TPLDataFlows

Machine Intelligence Text Analytics Enrichment implemented using Task Parallel Library Data Flow Pipelines:

  • Document Enrichment Pipeline - Builds the entire Vector Database using OpenAI embeddings in SQL using 50 selected books
  • Q&A Over Vector Database Pipeline - Searches the SQL Vector Database with provided question phrase using Semantic Kernel
  • Total Text (OpenAI) Tokens Processed:...............8,267,408
  • Total Text (Characters) Length Processed:..........33,702,085
  • Total cost for processing and building Vector Database using OpenAI Embeddings (Feb 2024 prices):
    • text-embedding-ada-002 with 1536 dimensions: ~$0.84 (~84 cents; this depends on how the chunking of text is configured)
    • text-embedding-3-small with 512 dimensions: ~$0.17 (~17 cents; this depends on how the chunking of text is configured)

TPL Pipeline

Features:

  • The console app uses 50 selected books from the Project Gutenberg site from various authors: Oscar Wilde, Bram Stoker, Edgar Allen Poe, Alexandre Dumas and performs enrichment using multiple AI enrichment steps
  • Downloads book text, processes text analytics & embeddings, creates a vector database in SQL, demonstrates vector search and answers a sample question using semantic meaning from OpenAI embeddings
  • Stores all enrichment output for each book in a seperate JSON file
  • Rather than processing text analytics enrichment in single synchronous steps, it uses an data flow model to create efficient pipelines that can saturate multiple logical CPU cores
  • Illustrates that SQL Server or Azure SQL can be used as a valid Vector Store, can perform vector search and provide Q&A over the database
  • Demonstrates how to create a Machine Intelligence & Text Analytics Pipeline can be combbined using TPL DataFlows
  • The console application is cross-platform .NET 8.x. It will run on macOS, Linux, Windows 10/11 x64, Windows 11 ARM

Requirements:

  • Visual Studio 2022, .NET 8.x
  • SQL Server Connection to either a local SQL Server 2022 (free Devolpment SKU or higher) or Azure SQL Database
  • ******Note: SQL Server 2022 / Azure SQL Database features are used for JSON processing and ordered Columnstore Indexes
  • OpenAI for both embeddings and completions

Training Job

Getting Started - Step 1) Configuration of SQL Connection and OpenAI API Keys (example of secrets.json shown below)

  • Ensure to add .NET Secrets or JSON configuration (you will need to add the JSON code if using a file)
  • Right-click on the C# Project and select "Manage User Secrets"
  • Add the SQL Connection (SQLConnection) and OpenAI (APIKey) (if using Azure OpenAPI, use AzureOpenAPI section)
{
  "SQL": {
    "SqlConnection": "Server=[NAME OF SERVER],1433;Initial Catalog=MachineIntelligenceDb;Persist Security Info=False;User ID=[USERID];Password=[PASSWORD];MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=5000;"
  },
  "OpenAI": {
    "APIKey": "[YOUR OPENAPI KEY]"
  },
  "AzureOpenAI": {
    "APIKey": "[YOUR AZURE OPENAPI KEY]"
  }
}

Getting Started - Step 2) Processing (after adding proper SQL and OpenAI/Azure OpenAI connections):

  • Select option 1 to process the entire Data Enrichment Pipeline (build the embeddings Vector Database in SQL)
  • Select option 2 to only process the Q&A pipeline using Semantic Kernel over the Vector Database (Note: Option #1 must have been run beforehand)
  • Select option 3 to only process the Q&A pipeline with reasoning using Semantic Kernel over the Vector Database (Note: Option #1 must have been run beforehand). This option is similar to option #2 except it provides details on how the AI agent achieved the results.

Getting Started - Console App

Learn more about the concepts used in this repository: