Skip to content

VoicenterTeam/hebrew-itn

Repository files navigation

Hebrew Inverse Text Normalization (ITN)

This Node.js module performs Inverse Text Normalization on Hebrew text, converting spelled-out numbers to their digit representation.

Hebrew-ITN: Hebrew Inverse Text Normalization

License Version

Hebrew-ITN is a high-performance library for Hebrew Inverse Text Normalization that converts spelled-out Hebrew numbers to their digit form. It features both single-threaded and multi-threaded processing capabilities for optimal performance across various workloads.

🌟 Features

  • Text Normalization: Convert spelled-out Hebrew numbers to digits (e.g., "חמש מאות" → "500")
  • High Performance: Process thousands of transcript sentences with a persistent worker pool
  • Multi-core Processing: Utilize all available CPU cores for parallel processing
  • Memory Efficient: Reuse worker threads to minimize resource consumption
  • Battle-tested: Includes comprehensive test cases and benchmarks
  • Resilient: Automatic recovery from worker failures
  • Flexible API: Simple interfaces for both single and batch processing

📋 Table of Contents

📦 Installation

npm install hebrew-itn

🚀 Quick Start

Basic Usage

const { normalizeText } = require('hebrew-itn');

// Normalize a single Hebrew text string
const original = "יש חמש מאות אלף הורים וחמשת אלפיים שלוש מאות ושניים ילדים";
const normalized = normalizeText(original);

console.log(normalized);
// Output: "יש 500,000 הורים ו5,302 ילדים"

Parallel Processing for Large Datasets

const { normalizeTranscriptParallel, shutdownWorkerPool } = require('hebrew-itn');

async function processTranscript() {
  const transcriptData = [
    { speaker: "Speaker1", text: "יש חמש מאות אלף שקלים בקופה", sentence_id: 1 },
    { speaker: "Speaker2", text: "צריך להגיע לרציף שלוש", sentence_id: 2 },
    // ... more sentences
  ];

  try {
    // Process all sentences in parallel using multiple CPU cores
    const normalizedTranscript = await normalizeTranscriptParallel(transcriptData);
    console.log(normalizedTranscript);
    
    // Always shut down the worker pool when done
    await shutdownWorkerPool();
  } catch (error) {
    console.error('Error:', error);
    await shutdownWorkerPool();
  }
}

processTranscript(); 📚 API Reference Main Functions normalizeText(text) Normalizes a single Hebrew text string, converting spelled-out numbers to digits.

const { normalizeText } = require('hebrew-itn');
const result = normalizeText("חמש מאות שקל"); // "500 שקל"
normalizeNumber(numberText)
Normalizes only the number part without surrounding context.

const { normalizeNumber } = require('hebrew-itn');
const result = normalizeNumber("חמש מאות"); // "500"
normalizeTranscriptParallel(transcriptSentences, options)
Processes an array of transcript sentences in parallel using a worker pool.

const { normalizeTranscriptParallel } = require('hebrew-itn');

const result = await normalizeTranscriptParallel([
  { speaker: "Speaker1", text: "יש חמש מאות שקל", sentence_id: 1 },
  { speaker: "Speaker2", text: "ברציף שמונה", sentence_id: 2 }
]);

Options:

numWorkers: Number of worker threads (default: CPU count) batchSize: Number of sentences per batch (default: auto-calculated) shutdownWorkerPool() Gracefully terminates the worker pool when done with parallel processing.

const { shutdownWorkerPool } = require('hebrew-itn');
await shutdownWorkerPool();

⚡ Performance Optimization

Hebrew-ITN uses several optimization techniques for handling large volumes of text:

Persistent Worker Pool: Creates workers once and reuses them across requests Adaptive Batching: Automatically determines optimal batch size based on input volume Load Balancing: Distributes work evenly across available CPU cores Small Batch Optimization: Processes small batches synchronously to avoid thread overhead Automatic Recovery: Replaces failed workers to maintain processing capacity 📝 Examples The project includes several examples to help you get started:

Basic Example // examples/basic-example.js

const { normalizeText } = require('../src/index');

const text = "יש חמש מאות אלף שקלים בקופה";
console.log(normalizeText(text)); // "יש 500,000 שקלים בקופה"
Parallel Processing Example
// examples/parallel-example.js
const { normalizeTranscriptParallel, shutdownWorkerPool } = require('../src/index');

async function example() {
  const data = [/* transcript data */];
  const result = await normalizeTranscriptParallel(data);
  console.log(result);
  await shutdownWorkerPool();
}

📊 Benchmarks The project includes benchmark scripts to measure performance:

node benchmark/pool-benchmark.js Typical performance gains with the worker pool on a quad-core CPU:

Dataset Size Sequential (ms) Parallel (ms) Speedup 100 52.34 24.56 2.13x 1000 524.67 132.78 3.95x 5000 2623.45 642.31 4.08x

📂 Project Structure

hebrew-itn/
├── src/
│   ├── index.js            # Main entry point and API
│   ├── hebrewITN.js        # Core normalization logic
│   ├── numberPatterns.js   # Hebrew number patterns
│   ├── testCases.js        # Known test cases
│   ├── workerPool.js       # Worker pool implementation
│   └── normalization-worker.js # Worker thread code
├── data/
│   └── sample-data.json    # Sample data for testing
├── examples/
│   ├── basic-example.js    # Simple usage examples
│   └── parallel-example.js # Parallel processing examples
├── benchmark/
│   └── pool-benchmark.js   # Performance benchmarking
├── test/
│   ├── mockData.js         # Test mocks
│   └── index.test.js       # Unit tests
└── README.md               # This file

🛠️ Development Setup

git clone https://github.com/yourusername/hebrew-itn.git
cd hebrew-itn
npm install

Running Tests npm test Building npm run build 🤝 Contributing Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository Create your feature branch (git checkout -b feature/amazing-feature) Commit your changes (git commit -m 'Add some amazing feature') Push to the branch (git push origin feature/amazing-feature) Open a Pull Request 📄 License This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements Special thanks to contributors and maintainers Inspired by similar projects in other languages Let's also create a simple basic example for the README:

About

Hebrew Inverse Text Normalization - Converts spelled-out Hebrew numbers to digits

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages