This Node.js module performs Inverse Text Normalization on Hebrew text, converting spelled-out numbers to their digit representation.
Hebrew-ITN is a high-performance library for Hebrew Inverse Text Normalization that converts spelled-out Hebrew numbers to their digit form. It features both single-threaded and multi-threaded processing capabilities for optimal performance across various workloads.
- Text Normalization: Convert spelled-out Hebrew numbers to digits (e.g., "חמש מאות" → "500")
- High Performance: Process thousands of transcript sentences with a persistent worker pool
- Multi-core Processing: Utilize all available CPU cores for parallel processing
- Memory Efficient: Reuse worker threads to minimize resource consumption
- Battle-tested: Includes comprehensive test cases and benchmarks
- Resilient: Automatic recovery from worker failures
- Flexible API: Simple interfaces for both single and batch processing
- Installation
- Quick Start
- API Reference
- Performance Optimization
- Examples
- Benchmarks
- Project Structure
- Development
- Contributing
- License
npm install hebrew-itn
🚀 Quick Start
Basic Usage
const { normalizeText } = require('hebrew-itn');
// Normalize a single Hebrew text string
const original = "יש חמש מאות אלף הורים וחמשת אלפיים שלוש מאות ושניים ילדים";
const normalized = normalizeText(original);
console.log(normalized);
// Output: "יש 500,000 הורים ו5,302 ילדים"
Parallel Processing for Large Datasets
const { normalizeTranscriptParallel, shutdownWorkerPool } = require('hebrew-itn');
async function processTranscript() {
const transcriptData = [
{ speaker: "Speaker1", text: "יש חמש מאות אלף שקלים בקופה", sentence_id: 1 },
{ speaker: "Speaker2", text: "צריך להגיע לרציף שלוש", sentence_id: 2 },
// ... more sentences
];
try {
// Process all sentences in parallel using multiple CPU cores
const normalizedTranscript = await normalizeTranscriptParallel(transcriptData);
console.log(normalizedTranscript);
// Always shut down the worker pool when done
await shutdownWorkerPool();
} catch (error) {
console.error('Error:', error);
await shutdownWorkerPool();
}
}
processTranscript(); 📚 API Reference Main Functions normalizeText(text) Normalizes a single Hebrew text string, converting spelled-out numbers to digits.
const { normalizeText } = require('hebrew-itn');
const result = normalizeText("חמש מאות שקל"); // "500 שקל"
normalizeNumber(numberText)
Normalizes only the number part without surrounding context.
const { normalizeNumber } = require('hebrew-itn');
const result = normalizeNumber("חמש מאות"); // "500"
normalizeTranscriptParallel(transcriptSentences, options)
Processes an array of transcript sentences in parallel using a worker pool.
const { normalizeTranscriptParallel } = require('hebrew-itn');
const result = await normalizeTranscriptParallel([
{ speaker: "Speaker1", text: "יש חמש מאות שקל", sentence_id: 1 },
{ speaker: "Speaker2", text: "ברציף שמונה", sentence_id: 2 }
]);
Options:
numWorkers: Number of worker threads (default: CPU count) batchSize: Number of sentences per batch (default: auto-calculated) shutdownWorkerPool() Gracefully terminates the worker pool when done with parallel processing.
const { shutdownWorkerPool } = require('hebrew-itn');
await shutdownWorkerPool();
⚡ Performance Optimization
Hebrew-ITN uses several optimization techniques for handling large volumes of text:
Persistent Worker Pool: Creates workers once and reuses them across requests Adaptive Batching: Automatically determines optimal batch size based on input volume Load Balancing: Distributes work evenly across available CPU cores Small Batch Optimization: Processes small batches synchronously to avoid thread overhead Automatic Recovery: Replaces failed workers to maintain processing capacity 📝 Examples The project includes several examples to help you get started:
Basic Example // examples/basic-example.js
const { normalizeText } = require('../src/index');
const text = "יש חמש מאות אלף שקלים בקופה";
console.log(normalizeText(text)); // "יש 500,000 שקלים בקופה"
Parallel Processing Example
// examples/parallel-example.js
const { normalizeTranscriptParallel, shutdownWorkerPool } = require('../src/index');
async function example() {
const data = [/* transcript data */];
const result = await normalizeTranscriptParallel(data);
console.log(result);
await shutdownWorkerPool();
}
📊 Benchmarks The project includes benchmark scripts to measure performance:
node benchmark/pool-benchmark.js Typical performance gains with the worker pool on a quad-core CPU:
Dataset Size Sequential (ms) Parallel (ms) Speedup 100 52.34 24.56 2.13x 1000 524.67 132.78 3.95x 5000 2623.45 642.31 4.08x
📂 Project Structure
hebrew-itn/
├── src/
│ ├── index.js # Main entry point and API
│ ├── hebrewITN.js # Core normalization logic
│ ├── numberPatterns.js # Hebrew number patterns
│ ├── testCases.js # Known test cases
│ ├── workerPool.js # Worker pool implementation
│ └── normalization-worker.js # Worker thread code
├── data/
│ └── sample-data.json # Sample data for testing
├── examples/
│ ├── basic-example.js # Simple usage examples
│ └── parallel-example.js # Parallel processing examples
├── benchmark/
│ └── pool-benchmark.js # Performance benchmarking
├── test/
│ ├── mockData.js # Test mocks
│ └── index.test.js # Unit tests
└── README.md # This file
🛠️ Development Setup
git clone https://github.com/yourusername/hebrew-itn.git
cd hebrew-itn
npm install
Running Tests npm test Building npm run build 🤝 Contributing Contributions are welcome! Please feel free to submit a Pull Request.
Fork the repository Create your feature branch (git checkout -b feature/amazing-feature) Commit your changes (git commit -m 'Add some amazing feature') Push to the branch (git push origin feature/amazing-feature) Open a Pull Request 📄 License This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgements Special thanks to contributors and maintainers Inspired by similar projects in other languages Let's also create a simple basic example for the README: