ExCrawlzy

Another crawling library, but with more than just crawling.

You can crawl sites and transform their content into JSON/maps using CSS selectors, with no other libraries needed and utilities included. It is more than a simple integration: you can turn a site into JSON with fields such as lists or nested sub-JSON structures.

Installation

The package can be installed by adding ex_crawlzy to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_crawlzy, "~> 0.1.1"}
  ]
end

Usage

Just use the function ExCrawlzy.crawl/1 to crawl a site and ExCrawlzy.parse/2 to parse the content into a map.

Basic usage

site = "https://example.site"

fields = %{
  # shortcut for using a function from ExCrawlzy.Utils
  body: {"div#the_body", :text}
  # module/function way:
  # body: {"div#the_body", {ExCrawlzy.Utils, :text}}
  # anonymous function way:
  # body: {"div#the_body", fn content ->
  #   ExCrawlzy.Utils.text(content)
  # end}
}

{:ok, content} = ExCrawlzy.crawl(site)
{:ok, %{body: body}} = ExCrawlzy.parse(fields, content)
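
ExCrawlzy.parse/2 works on the raw HTML string returned by crawl/1, so you can also parse markup you already have. A minimal sketch with hypothetical markup:

html = """
<html>
  <body>
    <div id="the_body">hello world</div>
  </body>
</html>
"""

fields = %{body: {"div#the_body", :text}}

{:ok, %{body: body}} = ExCrawlzy.parse(fields, html)
# body should be "hello world" (exact whitespace handling
# depends on the :text processor)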

Using Client

You can create a module pre-configured with keys, selectors, and processing functions, and then just call crawl/1 on that same module.

defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json
  
  add_field(:title, "head title", :text)
  add_field(:body, "div#the_body", :text)
  add_field(:inner_field, "div#the_body div#inner_field", :text)
  add_field(:inner_second_field, "div#inner_second_field", :text_alt)
  add_field(:number, "div#the_number", :text)
  add_field(:exist, "div#the_body div#exist", :exist)
  add_field(:not_exist, "div#the_body div#not_exist", :exist)
  add_field(:link, "a.link_class", :link)
  add_field(:img, "img.img_class", :img)

  # custom processor: an atom that is not an ExCrawlzy.Utils shortcut
  # resolves to the local function with the same name
  def text_alt(sub_doc) do
    ExCrawlzy.Utils.text(sub_doc)
  end
end

site = "https://example.site"

{:ok, data} = ExampleCrawler.crawl(site)
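
For reference, the result is a map keyed by the fields declared above. A sketch of the expected shape with hypothetical values, assuming :exist yields booleans and :link/:img yield the href/src attributes:

{
  :ok,
  %{
    title: "Example title",
    body: "the body text",
    inner_field: "inner text",
    inner_second_field: "second inner text",
    number: "42",
    exist: true,
    not_exist: false,
    link: "https://example.site/some/path",
    img: "https://example.site/image.png"
  }
} == ExampleCrawler.crawl(site)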

List of elements

You can create a client that parses multiple elements from HTML using CSS selectors.

Using list_selector/1 you define the selector that every element matches; the fields are then defined the same way as in the ExCrawlzy.Client.Json client, and they become the inner elements of each match.

defmodule ExampleCrawlerList do
  use ExCrawlzy.Client.JsonList

  list_size(2)
  list_selector("div.possible_value")
  add_field(:field_1, "div.field_1", :text)
  add_field(:field_2, "div.field_2", :text)
end

site = "https://example_list.site"

{:ok, data} = ExampleCrawlerList.crawl(site)
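
The result is a list with one map per matched element. A sketch with hypothetical values, assuming list_size(2) constrains the result to two elements:

{
  :ok,
  [
    %{field_1: "first value", field_2: "second value"},
    %{field_1: "third value", field_2: "fourth value"}
  ]
} == ExampleCrawlerList.crawl(site)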

A good example

defmodule GithubProfilePinnedRepos do
  use ExCrawlzy.Client.JsonList

  list_selector("div.pinned-item-list-item")
  add_field(:name, "a.mr-1 span.repo", :text)
  add_field(:link, "a.mr-1", :link)
  add_field(:access, "span.Label", :text)
  add_field(:description, "p.pinned-item-desc", :text)
  add_field(:language, "span.d-inline-block span[itemprop=\"programmingLanguage\"]", :text)

  # overrides the :link shortcut: href values on the page are relative
  # paths, so prepend the host to build absolute URLs
  def link(doc) do
    path = ExCrawlzy.Utils.props("href", doc)
    "https://github.com#{path}"
  end
end

site = "https://github.com/nicolkill"

{
  :ok, 
  [
    %{
      access: "Public",
      description: "An API Prototype Platform",
      link: "https://github.com/nicolkill/dbb",
      name: "dbb",
      language: "Elixir"
    },
    %{
      access: "Public",
      description: "JSON Schema verifier in Elixir",
      link: "https://github.com/nicolkill/map_schema_validator",
      name: "map_schema_validator",
      language: "Elixir"
    },
    %{
      access: "Public",
      description: "",
      link: "https://github.com/nicolkill/ex_crawlzy",
      name: "ex_crawlzy",
      language: "Elixir"
    }
  ]
} == ExampleCrawlerList.crawl(site)

Add clients

You can define your own browser clients for the requests: just use the function add_browser_client/1 with your headers in the shape [{"header_name", "header value"}].

Adding your own browser clients replaces the predefined ones.

site = "https://example.site"

fields = %{
  body: {"div#the_body", :text}
}

clients = [
  [
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ]
]

{:ok, content} = ExCrawlzy.crawl(site, clients)
{:ok, %{body: body}} = ExCrawlzy.parse(fields, content)
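
Since clients is a list of header sets, you can pass more than one; a sketch with a second hypothetical client (how the library chooses among them per request is internal to ExCrawlzy):

clients = [
  [
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ],
  [
    # a second hypothetical header set
    {"user-agent", "Another User Agent"}
  ]
]

{:ok, content} = ExCrawlzy.crawl(site, clients)

The same headers can also be baked into a client module, as in the following example.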

defmodule ExampleCrawlerList do
  use ExCrawlzy.Client.JsonList

  add_browser_client([
    {"referer", "https://your_site.com"},
    {"user-agent", "Custom User Agent"}
  ])
  list_size(2)
  list_selector("div.possible_value")
  add_field(:field_1, "div.field_1", :text)
  add_field(:field_2, "div.field_2", :text)
end

site = "https://example_list.site"

{:ok, data} = ExampleCrawlerList.crawl(site)

Testing your crawler

Because ExCrawlzy uses Tesla at its core, you can test it exactly like a Tesla client (see the Tesla testing guide). Here is an example testing the ExampleCrawlerList module from above.

The mocked response must be the HTML of the site. To get it, go to the site, right-click and select View Source (or use the shortcut Ctrl + U, CMD + U on Mac), then save the source, for example in the priv folder of your project (the location is not mandatory). Then you can use this fragment:

# config/test.exs
config :tesla, ExampleCrawlerList, adapter: Tesla.Mock

defmodule ExampleCrawlerListTest do
  use ExUnit.Case

  import Tesla.Mock

  setup do
    {:ok, content} =
      :your_app
      |> :code.priv_dir()
      |> then(&"#{&1}/test.html")
      |> File.read()

    mock(fn
      %{method: :get, url: "https://example_list.site"} ->
        %Tesla.Env{status: 200, body: content}
    end)

    :ok
  end

  test "list things" do
    site = "https://example_list.site"
    assert {:ok, data} = ExampleCrawlerList.crawl(site)
  end
end
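
With the mock in place you can also assert on the parsed fields inside the same test module; a sketch whose exact values depend on what your saved test.html contains:

test "parses the inner fields" do
  site = "https://example_list.site"

  assert {:ok, [first | _rest]} = ExampleCrawlerList.crawl(site)
  # hypothetical shape check; real values depend on priv/test.html
  assert %{field_1: _, field_2: _} = first
end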
