spekulatius/phpscraper-keyword-scraping-example

Keyword Scraping Example using PHPScraper

PHPScraper is a PHP scraping library that aims to make web scraping easier by reducing the amount of code you need to write.

This example uses the library to scrape keywords from the Wikipedia article "Online Advertising". The expected output can be found below.

Internally, PHPScraper uses the library RAKE PHP Plus for keyword extraction. RAKE stands for "Rapid Automatic Keyword Extraction".
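
To give a rough idea of what a RAKE-style extraction does, here is a heavily simplified sketch in plain PHP. It is not the code of RAKE PHP Plus or PHPScraper: the function name `rakeSketch`, the tiny stop-word list, and the ASCII-only tokenizer are assumptions made for illustration. The sketch splits the text into candidate phrases at stop words and punctuation, then scores each phrase as the sum of its word scores (word degree divided by word frequency).

```php
<?php
declare(strict_types=1);

/**
 * Simplified RAKE-style scoring (illustrative only, ASCII text only).
 *
 * @param string   $text      Input text.
 * @param string[] $stopWords Hypothetical stop-word list; real lists are much longer.
 * @return array Phrase => score, sorted by descending score.
 */
function rakeSketch(string $text, array $stopWords): array
{
    // Split into lower-case word tokens and single punctuation tokens.
    preg_match_all('/[a-z0-9]+|[^a-z0-9\s]/', strtolower($text), $matches);

    // Candidate phrases are runs of words not broken by a stop word or punctuation.
    $phrases = [];
    $current = [];
    foreach ($matches[0] as $token) {
        $isWord = preg_match('/^[a-z0-9]+$/', $token) === 1;
        if ($isWord && !in_array($token, $stopWords, true)) {
            $current[] = $token;
            continue;
        }
        if ($current !== []) {
            $phrases[] = $current;
            $current = [];
        }
    }
    if ($current !== []) {
        $phrases[] = $current;
    }

    // Frequency: how often a word occurs. Degree: total length of the phrases it occurs in.
    $freq = [];
    $degree = [];
    foreach ($phrases as $words) {
        foreach ($words as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
            $degree[$word] = ($degree[$word] ?? 0) + count($words);
        }
    }

    // Phrase score: sum of degree/frequency over the words of the phrase.
    $scores = [];
    foreach ($phrases as $words) {
        $score = 0.0;
        foreach ($words as $word) {
            $score += $degree[$word] / $freq[$word];
        }
        $scores[implode(' ', $words)] = $score;
    }
    arsort($scores);

    return $scores;
}

$scores = rakeSketch(
    'Online advertising is a form of marketing. Display advertising uses web banners.',
    ['is', 'a', 'of', 'uses']
);
print_r($scores);
```

In this small sample the two-word phrases ("online advertising", "display advertising", "web banners") each score 4.0, while the single words "form" and "marketing" score 1.0. Longer phrases made of co-occurring words rank higher, which is the general behavior of this scoring scheme.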

There are further examples: one shows how to analyze the keyword length distribution of a web page, another is a performance test comparing PHPScraper and BeautifulSoup.

You might need to merge your keywords after scraping.

Installation

This example was built and run with PHP 7.2.24 on an Ubuntu-based Linux distribution.

To run this example, clone the repository, change into its directory, and install the dependencies:

git clone git@github.com:spekulatius/phpscraper-keyword-scraping-example.git
cd phpscraper-keyword-scraping-example
composer install

If you would like to make changes, you will need to fork the repository first.

Execution

$ php keyword-extractor.php

Result

The scraped Wikipedia page yields around 1989 keywords/phrases. Below are some selected keyword extractions.

Selected keywords with years:

  • truste announces 2011 behavioral advertising survey results (65.0)
  • july 2014 facebook reported advertising revenue (56.1)
  • cisco 2013 annual security report (18.3)
  • january 1994 mark eberra started (13.0)
  • august 2017 wikipedia articles (8.8)
  • august 2017 category (5.8)
  • october 2013 category (5.0)
  • august 2014 yahoo' (4.1)
  • june 2014 quarter (3.5)

Selected keywords with "content":

  • call 'content marketing' (77.2)
  • content management system (53.6)
  • automated ad content optimisation (49.6)
  • 10 content marketing 2 (44.0)
  • content marketing (44.0)
  • publisher content server sends (35.9)
  • web page content (33.3)
  • online content (29.5)
  • ad content delivered (27.1)
  • /wp-content/uploads/2015/11/iab_display_mobile_creative_guidelines_html5_2015 (25.8)
  • ad content (25.4)
  • website content' (22.9)
  • access requested content (19.3)
  • content page [ (16.5)
  • editorial content (15.8)
  • dividing content (15.8)
  • content filters (15.8)
  • sexual content (15.8)
  • primary content (15.8)
  • publishing content (15.3)
  • presenting content (15.3)
  • content (13.8)
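
Selections like the two above can be made from a keyword-to-score map with simple pattern filters. The following is a minimal sketch, not the repository's actual code: `filterKeywords` is a hypothetical helper, and the sample array is copied from the lists above instead of being scraped live.

```php
<?php
declare(strict_types=1);

// Sample data copied from the lists above (assumption: phrase => score).
$keywords = [
    'cisco 2013 annual security report' => 18.3,
    'content management system' => 53.6,
    'august 2017 category' => 5.8,
    'content marketing' => 44.0,
    'online content' => 29.5,
];

/**
 * Keep only the entries whose phrase (the array key) matches the pattern.
 * Hypothetical helper for illustration.
 */
function filterKeywords(array $keywords, string $pattern): array
{
    return array_filter(
        $keywords,
        function ($phrase) use ($pattern) {
            // Cast because PHP turns purely numeric string keys into integers.
            return preg_match($pattern, (string) $phrase) === 1;
        },
        ARRAY_FILTER_USE_KEY
    );
}

// Phrases containing a four-digit year starting with 19 or 20.
$withYears = filterKeywords($keywords, '/\b(19|20)\d{2}\b/');

// Phrases containing the word "content".
$withContent = filterKeywords($keywords, '/\bcontent\b/');

print_r($withYears);
print_r($withContent);
```

`array_filter` keeps the original keys and order, so the scores stay attached to their phrases.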

Long Tail Keywords:

  • spanish euskara online publizitate (41,795.3)
  • platform customer relationship management (7,010.4)
  • flower delivery flower delivery (2,521.1)
  • search engine optimization search (1,645.3)
  • blocking search engine marketing (1,319.9)
  • adblock adblock advertising advertising (1,051.5)
  • factor annoyance factor horizontal (887.6)
  • web banners web banner (873.3)
  • engine optimisation search engine (809.6)
  • market segmentation strategy marketing (799.8)
  • search analytics search analytics (597.8)
  • management logistics management facebook (568.1)
  • enlarge display advertising display (567.3)
  • firms oracle oracle corporation (556.7)
  • digital distribution digital distribution (373.1)
  • underwriting spot underwriting spot (371.4)
  • interactive advertising bureau interactive (358.8)
  • mix promotional mix promotional (326.5)
  • marketing market research market (276.8)
  • marketing marketing marketing marketing (271.1)
  • product demonstration product demonstration (265.8)
  • placement product placement propaganda (198.8)
  • marketing activation brand licensing (186.5)
  • advertising mobile advertising mobile (169.8)
  • red bull red bull (169.1)
  • honor system honor system (167.9)
  • sears global network navigator (145.6)
  • arpanet arpanet nsfnet nsfnet (138.7)
  • banner blindness banner blindness (127.9)
  • marketing effectiveness ethics marketing (108.0)
  • revenue sharing revenue sharing (100.7)
  • modern search engines rank (100.6)
  • bull media house streaming (92.5)
  • pricing retail retail service (91.4)
  • live support software online (91.0)
  • malvertising malvertising cisco cisco (90.7)
  • advertising bureau predicts continued (88.7)
  • rich media rich media (86.3)
  • banner advertising display advertising (85.9)
  • advertising methods digital marketing (83.1)
  • federal trade commission federal (76.5)
  • explorer continues growth past (67.5)
  • online service prodigy displayed (65.3)
  • announces 2011 behavioral advertising (65.0)
  • corporate identity corporate identity (62.3)
  • search engines originally sold (60.0)
  • advertising age advertising age (58.7)
  • 2014 facebook reported advertising (56.1)
  • web bugs web bugs (50.3)
  • crime complaint center received (45.2)
  • states advertising industry organizations (43.7)
  • unit guidelines proposes standardized (43.2)
  • personal selling personal selling (41.0)
  • trade commission frequently supports (38.7)
  • ndl national diet library (33.8)
  • wikipedia current events find (32.3)
  • display advertising process overview (32.0)
  • revenue sharing revenue sharing (31.1)
  • file printable version printable (28.3)
  • owners sought additional revenue (28.3)
  • fixed cost compensation means (25.8)
  • news feed ads generate (24.9)
  • upload file upload files (16.7)
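
The long tail entries above appear to be four-word phrases. One way to pick out such phrases is to filter by word count. This is a sketch under that assumption, not the method the example script necessarily uses: `filterByWordCount` is a hypothetical helper, and the sample data is copied from the lists in this README.

```php
<?php
declare(strict_types=1);

/**
 * Keep only the entries whose phrase (the array key) has at least $minWords words.
 * Hypothetical helper for illustration.
 */
function filterByWordCount(array $keywords, int $minWords): array
{
    return array_filter(
        $keywords,
        function ($phrase) use ($minWords) {
            // Split on whitespace and ignore empty pieces.
            $words = preg_split('/\s+/', trim((string) $phrase), -1, PREG_SPLIT_NO_EMPTY);

            return count($words) >= $minWords;
        },
        ARRAY_FILTER_USE_KEY
    );
}

// Sample data copied from the lists in this README (phrase => score).
$sample = [
    'red bull red bull' => 169.1,
    'modern search engines rank' => 100.6,
    'content marketing' => 44.0,
    'content' => 13.8,
];

$longTail = filterByWordCount($sample, 4);
print_r($longTail);
```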

Please note: These results might have changed by now.