zanybaka/Html.Template.Finder

Html.Template.Finder library (C#, .NET Standard)

It lets you parse websites or any other XML-based content using a predefined template.

Basics

The HtmlXPathTemplateFinder finder is based on XPath selectors and uses the HtmlAgilityPack library under the hood.

You need to provide three things:

  • HTML content (string)
  • a template reader (IHtmlTemplateReader<HtmlXPathTemplate>)
  • an entity type (any class or struct with a few string properties)
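
As an illustration of the third item, an entity type could look like the sketch below. The type name ListingItem and its property names are assumptions made for this example, and how the library maps template variables onto properties is not stated here, so treat the naming as a guess.

```csharp
// Hypothetical entity type for one parsed row.
// The type and property names are illustrative assumptions, not library types.
public class ListingItem
{
    public string Img { get; set; } = "";
    public string Url { get; set; } = "";
    public string Title { get; set; } = "";
    public string Price { get; set; } = "";
    public string Date { get; set; } = "";
}
```

All properties are plain strings, matching the requirement above; any conversion to numbers or dates would happen after parsing.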

Template format

The default reader, HtmlXPathTemplateReader, supports custom XPath-based templates such as:

//*[@class='row']
    .//*[@row-type='photo']
        .//img[@src=$img]
    .//*[@row-type='title']
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
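
To make the nesting in the template above concrete, here is a minimal sketch of how an indented template could be turned into a tree of selectors. It is not the library's implementation: TemplateNode and IndentedTemplateSketch are hypothetical names, and the rule "deeper indentation means child selector" is an assumption inferred from the example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical tree node: one selector line plus its nested selectors.
public class TemplateNode
{
    public string Selector { get; }
    public List<TemplateNode> Children { get; } = new List<TemplateNode>();
    public TemplateNode(string selector) { Selector = selector; }
}

public static class IndentedTemplateSketch
{
    // Builds a tree from indentation: a line becomes a child of the
    // nearest preceding line that is indented less than it.
    public static TemplateNode Parse(string template)
    {
        TemplateNode? root = null;
        var stack = new Stack<(int Indent, TemplateNode Node)>();

        foreach (var raw in template.Split('\n'))
        {
            var line = raw.TrimEnd('\r', ' ', '\t');
            if (line.Trim().Length == 0) continue; // skip blank lines

            int indent = line.Length - line.TrimStart().Length;
            var node = new TemplateNode(line.Trim());

            // Drop siblings and deeper levels until the parent is on top.
            while (stack.Count > 0 && stack.Peek().Indent >= indent) stack.Pop();

            if (stack.Count == 0)
            {
                if (root != null)
                    throw new FormatException("Template must have a single root selector.");
                root = node;
            }
            else
            {
                stack.Peek().Node.Children.Add(node);
            }
            stack.Push((indent, node));
        }

        return root ?? throw new FormatException("Template is empty.");
    }
}
```

For the template above, the root node would hold //*[@class='row'] with four children, and the photo node would hold the single img selector as its child.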

The XPath format itself can't be changed while you are using HtmlXPathTemplateFinder.

However, you can change everything else by implementing your own template reader based on IHtmlTemplateReader<out TTemplate>.

For example, the template could be stored as JSON:

{
  "RootNodeXPath": "//*[@class='row']",
  "Patterns": [
    {
      "XPathSelector": ".//*[@row-type='photo']",
      "Children": [ { "XPathSelector": ".//img[@src=$img]" } ]
    },
    ...
  ]
}
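
A custom reader for such a JSON format would first need to deserialize it. The sketch below shows only that step, using System.Text.Json. JsonTemplate, JsonPattern and JsonTemplateParser are hypothetical names that mirror the property names in the JSON above; they are not library types, and wiring the result into IHtmlTemplateReader<out TTemplate> is left out because the interface members are not shown here.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical DTOs mirroring the JSON shape; not part of the library.
public class JsonPattern
{
    public string XPathSelector { get; set; } = "";
    public List<JsonPattern> Children { get; set; } = new List<JsonPattern>();
}

public class JsonTemplate
{
    public string RootNodeXPath { get; set; } = "";
    public List<JsonPattern> Patterns { get; set; } = new List<JsonPattern>();
}

public static class JsonTemplateParser
{
    // Deserializes the template text into the DTO tree.
    // Throws JsonException on malformed JSON or a literal "null" document.
    public static JsonTemplate Parse(string json)
    {
        return JsonSerializer.Deserialize<JsonTemplate>(json)
               ?? throw new JsonException("Template JSON was null.");
    }
}
```

Property matching in System.Text.Json is case-sensitive by default, which is why the DTO property names repeat the JSON keys exactly. A pattern without a "Children" key keeps its empty default list.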

There are two variable types in the template:

  • attribute variable
  • innerText variable

You can set multiple attribute variables in a single XPath selector:

.//a[@href=$url and @title=$title]

An innerText variable captures all the text inside the specified tag and can be combined with attribute variables in a single XPath selector:

    ...
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
    ...

Keep in mind:

  • all variables are removed from the selectors once the template is read
  • the template format is parsed with regular expressions
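
The following sketch illustrates what regex-based variable extraction and removal could look like for a single selector line. It is an assumption about the general approach, not the library's code: TemplateVariableSketch and both patterns are hypothetical, and the real reader may strip variables differently.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateVariableSketch
{
    // Matches an attribute variable such as "@href=$url".
    private static readonly Regex AttributeVariable =
        new Regex(@"@(?<attr>[\w:\-]+)\s*=\s*\$(?<name>\w+)");

    // Matches a trailing innerText variable such as "/$title".
    private static readonly Regex InnerTextVariable =
        new Regex(@"/\$(?<name>\w+)\s*$");

    // Returns the selector with variables stripped, a map of
    // variable name -> attribute name, and the innerText variable (or null).
    public static (string Selector,
                   Dictionary<string, string> AttributeVariables,
                   string? InnerTextVariable) Parse(string line)
    {
        var selector = line.Trim();

        // Cut off the trailing "/$name" step, if present.
        string? innerText = null;
        var m = InnerTextVariable.Match(selector);
        if (m.Success)
        {
            innerText = m.Groups["name"].Value;
            selector = selector.Substring(0, m.Index);
        }

        // Record each "@attr=$name" pair, then reduce it to "@attr".
        var attrs = new Dictionary<string, string>();
        foreach (Match a in AttributeVariable.Matches(selector))
            attrs[a.Groups["name"].Value] = a.Groups["attr"].Value;
        selector = AttributeVariable.Replace(selector, "@${attr}");

        return (selector, attrs, innerText);
    }
}
```

Under these assumptions, .//a[@href=$url]/$title would reduce to the plain selector .//a[@href], with url mapped to the href attribute and title recorded as the innerText variable. Selectors with literal values, such as .//*[@row-type='price'], pass through unchanged.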


Code examples

AvitoHtmlXPathTemplateFinderFixture.cs
