Unstructured Text Parser [PHP]

About Unstructured Text Parser

This is a small PHP library that helps extract text from documents that are not structured in a processing-friendly format. For example, to parse text out of form-generated emails, you create a template that matches the expected incoming mail format and marks the variable text elements. The class then extracts those variables from the body text of the incoming mails.

Useful when you want to parse data out of:

  • Emails generated from web forms
  • Documents with definable templates / expressions

Installation

PHP Unstructured Text Parser is available on Packagist (using semantic versioning), and installation via Composer is recommended. Add the following line to your composer.json file:

"aymanrb/php-unstructured-text-parser": "~2.0"

or run

composer require aymanrb/php-unstructured-text-parser

Usage Example

<?php
include_once __DIR__ . '/../vendor/autoload.php';

$parser = new aymanrb\UnstructuredTextParser\TextParser('/path/to/templatesDirectory');

$textToParse = 'Text to be parsed, fetched from a file, mail, web service, or even assigned directly to a string variable like this';

//performs brute-force parsing against all available templates, returns the first successful match
$parseResults = $parser->parseText($textToParse);
print_r($parseResults->getParsedRawData());

//slower, performs a similarity check on available templates to select the best-matching template before parsing
print_r(
    $parser
        ->parseText($textToParse, true)
        ->getParsedRawData()
);
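
To make the similarity-check idea concrete, here is a minimal sketch of how a best-matching template could be selected with PHP's built-in similar_text(). The function name pickMostSimilarTemplate and the template names are hypothetical and not part of the library's API; the library's actual scoring may differ.

```php
<?php
// Hypothetical helper (not the library's API): returns the key of the
// template whose text is most similar to $text, or null if none are given.
function pickMostSimilarTemplate(array $templates, string $text): ?string
{
    $bestName = null;
    $bestScore = -1.0;
    foreach ($templates as $name => $template) {
        // similar_text() writes the similarity percentage into its third argument.
        similar_text($template, $text, $percent);
        if ($percent > $bestScore) {
            $bestScore = $percent;
            $bestName = $name;
        }
    }
    return $bestName;
}

// Illustrative templates only; real templates live as .txt files in the templates directory.
$templates = [
    'contact_form' => "Name: {%senderName%}\nE-Mail: {%senderEmail%}",
    'order_notice' => "Order number: {%orderId%}\nTotal: {%total%}",
];
$chosen = pickMostSimilarTemplate($templates, "Name: Pet Cat\nE-Mail: email@example.com");
echo $chosen, PHP_EOL;
```

Scoring every template costs extra time up front, which is consistent with the "slower" note in the usage example, but it avoids attempting a full parse against templates that are unlikely to match.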

Parsing Procedure

1- Grab a single copy of the text you want to parse.

2- Replace each varying piece of text in it with a named variable: use {%VariableName%} to match everything in that part of the text, or {%VariableName:Pattern%} to match a specific set of characters or a more precise pattern.

3- Add the template file to the templates directory (the one defined in your parsing code) with a .txt extension, e.g. fileName.txt

4- Pass the text you wish to parse to the parseText() method of the class and let it do the work for you.
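
To illustrate how the two placeholder forms from step 2 can drive matching, here is a minimal sketch that turns a template into a regular expression with one named group per variable. The function templateToRegex is hypothetical and is not the library's implementation; it assumes variable names are word characters and that patterns contain neither "%" nor "/".

```php
<?php
// Hypothetical helper (not the library's API): converts a template string
// into a PCRE pattern with one named group per {%name%} placeholder.
function templateToRegex(string $template): string
{
    // Split on placeholders; the capture group keeps them in the result.
    $parts = preg_split('/(\{%[^%]+%\})/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $part) {
        if (preg_match('/^\{%(\w+)(?::(.+))?%\}$/', $part, $m)) {
            // {%name:Pattern%} uses the given pattern; {%name%} lazily matches anything.
            $pattern = $m[2] ?? '.*?';
            $regex .= '(?<' . $m[1] . '>' . $pattern . ')';
        } else {
            // Literal template text must match verbatim, so escape it.
            $regex .= preg_quote($part, '/');
        }
    }
    // Anchor the pattern; the "s" flag lets "." span line breaks.
    return '/^' . $regex . '$/s';
}

$template = "ID & Source: {%id:[0-9]+%} {%source%}\nName: {%senderName%}";
$text     = "ID & Source: 12234432 Website Form\nName: Pet Cat";

preg_match(templateToRegex($template), $text, $matches);
// Keep only the named groups, dropping the numeric indexes preg_match adds.
$parsed = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
print_r($parsed);
```

In this sketch everything outside the placeholders is escaped and matched literally, which is why the template should be an exact copy of the expected text with only the varying parts replaced.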

Template Example

If the text document you want to parse looks like this:

Hello,
If you wish to parse message coming from a website that states info like:
ID & Source: 12234432 Website Form  
Name: Pet Cat
E-Mail: email@example.com
Comment: Some text goes here

Thank You,
Best Regards
Admin

Your template file (example_template.txt) could look something like this:

Hello,
If you wish to parse message coming from a website that states info like:
ID & Source: {%id:[0-9]+%} {%source%}
Name: {%senderName%}
E-Mail: {%senderEmail%}
Comment: {%comment%}

Thank You,
Best Regards
Admin

The output of a successful parsing job would be:

Array(
    'id' => '12234432',
    'source' => 'Website Form',
    'senderName' => 'Pet Cat',
    'senderEmail' => 'email@example.com',
    'comment' => 'Some text goes here'
)

Upgrading from v1.x to v2.x

Version 2.0 is essentially a refactored version of 1.x and provides the same functionality. The one difference is in the returned results: parseText() now returns a parsed data object instead of an array. To get the results as an array, as in v1.x, call getParsedRawData() on the returned object.

<?php
//parseText() returned an array in 1.x
$extractedArray = $parser->parseText($textToParse);

//In 2.x, call getParsedRawData() on the result if you want an array
$extractedArray = $parser->parseText($textToParse)->getParsedRawData();
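
If a codebase has to work with either version during the upgrade, a small wrapper can hide the difference. This is a sketch only: parseToArray is a hypothetical helper name, and it relies solely on the behaviour described above (an array in 1.x, an object exposing getParsedRawData() in 2.x). It does not cover failed parses, which this document does not describe.

```php
<?php
// Hypothetical helper (not part of the library): accepts any object exposing
// parseText() and returns the parsed data as an array for both 1.x and 2.x.
function parseToArray(object $parser, string $text): array
{
    $result = $parser->parseText($text);
    // 1.x: the result is already an array. 2.x: unwrap the parsed data object.
    return is_array($result) ? $result : $result->getParsedRawData();
}

// Usage with a real parser instance would look like:
// $extractedArray = parseToArray($parser, $textToParse);
```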
