
youhusky/Search_Ads_Web_Service


Search Ads Web Service

Online search advertisement platform & Realtime Campaign Monitoring

Project Description

  • Designed and developed a web crawler that collected 500,000 product records from Amazon (Java, Jsoup, proxies)
  • Developed the search ads workflow: query understanding, ad selection from an inverted index (Memcached), ad ranking, ad filtering, ad pricing and ad allocation
  • Designed and implemented a feature engineering pipeline that generates features for query understanding and click prediction with Spark MapReduce

Crawler

Used Jsoup to crawl product information from Amazon.

  • Finished
    • extract price, product detail URL, product image URL and category from each web page
    • convert each product to an ad
    • store ads to a file, one ad per JSON object
    • support paging
    • log all exceptions
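The storage step above can be sketched as follows. This is a minimal Python illustration, not the project's Java/Jsoup code; the `Ad` class, its field names and the example URLs are assumptions chosen to mirror the extracted fields listed above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Ad:
    # Hypothetical field names; the README only names what is extracted.
    price: float
    detail_url: str
    image_url: str
    category: str


def ads_to_json_lines(ads):
    """Serialize each ad as one JSON object per line."""
    return "\n".join(json.dumps(asdict(ad), sort_keys=True) for ad in ads)


sample_ads = [
    Ad(price=19.99,
       detail_url="https://example.com/product/1",
       image_url="https://example.com/image/1.jpg",
       category="Electronics"),
]
print(ads_to_json_lines(sample_ads))
```

Writing one JSON object per line keeps the file appendable while crawling and lets a later stage read it back one ad at a time.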

Avoid Bot Detection

  • Proxy IPs and rotating browsers
  • Distributed crawler

Online Search Ads Platform

Search advertising places online advertisements on the pages that show users the results of their search engine queries. This search ads server takes thousands of products as ad candidates and selects, filters, ranks, allocates and prices the ads when a search query comes in. The selection and ranking of search ads are based on the quality of the ads and the bid prices offered by advertisers.


Query Understanding

  • clean the query text with Lucene
  • train a word2vec model on the ad keywords corpus and use the synonyms it finds to rewrite the query
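The two steps above can be illustrated roughly as follows. This is a sketch only: the regular-expression tokenizer stands in for Lucene's analyzers, and the small `SYNONYMS` table is a hypothetical stand-in for nearest neighbours looked up in a trained word2vec model. The stop-word list is likewise an assumption.

```python
import re

# Illustrative stop words; the project relies on Lucene for cleaning.
STOP_WORDS = {"a", "an", "the", "for", "of", "with", "and"}

# Hypothetical synonym table standing in for word2vec nearest neighbours.
SYNONYMS = {"laptop": ["notebook"], "tv": ["television"]}


def clean_query(query):
    """Lowercase, tokenize on letters and digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def rewrite_query(tokens):
    """Return the original tokens plus variants with one token swapped for a synonym."""
    rewrites = [list(tokens)]
    for i, token in enumerate(tokens):
        for synonym in SYNONYMS.get(token, []):
            rewrites.append(tokens[:i] + [synonym] + tokens[i + 1:])
    return rewrites


tokens = clean_query("The Laptop for Gaming")
print(rewrite_query(tokens))
```

Each rewritten query can then be matched against ad keywords in the same way as the original query.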

Query Relevancy Matching

Ad candidates are first evaluated and filtered by relevance score, which measures how relevant the query is to the keywords of an ad. Here, relevance score = number of ad keywords matched by the query / total number of keywords in the ad. For quick retrieval of ad information, an inverted index of ad keywords is built and stored in cache.

The data layer for supporting online system:

  • Forward index for Ad detail information (MySQL)
  • Inverted index for Ad keywords (Memcached)
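The relevance formula and the inverted index lookup can be sketched as below. The in-memory dictionaries stand in for Memcached and MySQL, the ad ids and keywords are made up, and each ad's keyword list is assumed to contain no duplicates.

```python
from collections import defaultdict


def build_inverted_index(ad_keywords):
    """Map each keyword to the set of ad ids whose keyword list contains it."""
    index = defaultdict(set)
    for ad_id, keywords in ad_keywords.items():
        for keyword in keywords:
            index[keyword].add(ad_id)
    return index


def relevance_scores(query_tokens, index, ad_keywords):
    """Score = matched keywords / total keywords, for every ad hit by the query."""
    matches = defaultdict(int)
    for token in set(query_tokens):
        for ad_id in index.get(token, ()):
            matches[ad_id] += 1
    return {ad_id: hits / len(ad_keywords[ad_id]) for ad_id, hits in matches.items()}


# Hypothetical ads: id -> keyword list.
ad_keywords = {
    1: ["gaming", "laptop", "15", "inch"],
    2: ["laptop", "sleeve", "case", "bag"],
    3: ["running", "shoes"],
}
index = build_inverted_index(ad_keywords)
print(relevance_scores(["gaming", "laptop"], index, ad_keywords))
```

Only ads that share at least one keyword with the query are touched, which is why the lookup stays fast as the ad inventory grows.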

P-Click Prediction

The probability of user click (p-click) plays an important role in ads ranking.

Spark ML processes simulated user click log data and generates the prediction model.

  • Click log

log: Device IP, Device id, Session id, Query, AdId, CampaignId, Ad_category_Query_category(0/1), clicked(0/1)

  • Feature space

pClick features are extracted from the search log and stored in a key-value store.

  • Model

Logistic Regression

Gradient Boosting Tree
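As a rough illustration of how the click log above might feed the feature space, the sketch below parses log lines in the listed column order and counts rows and clicks per ad. It is plain Python rather than Spark; the snake_case field names, the sample rows and the choice of per-ad counts as a feature are all assumptions, since the README does not spell out the actual features.

```python
import csv
from collections import defaultdict

# Column order follows the log format above; the snake_case names are hypothetical.
FIELDS = ["device_ip", "device_id", "session_id", "query",
          "ad_id", "campaign_id", "category_match", "clicked"]


def parse_click_log(lines):
    """Yield one dict per comma-separated log line, with the 0/1 flags as ints."""
    for row in csv.reader(lines):
        record = dict(zip(FIELDS, (value.strip() for value in row)))
        record["category_match"] = int(record["category_match"])
        record["clicked"] = int(record["clicked"])
        yield record


def ad_click_counts(records):
    """Count log rows and clicks per ad id (an assumed example feature)."""
    stats = defaultdict(lambda: {"rows": 0, "clicks": 0})
    for record in records:
        entry = stats[record["ad_id"]]
        entry["rows"] += 1
        entry["clicks"] += record["clicked"]
    return dict(stats)


sample_log = [
    "10.0.0.1,dev1,s1,gaming laptop,1001,7,1,1",
    "10.0.0.2,dev2,s2,gaming laptop,1001,7,1,0",
    "10.0.0.3,dev3,s3,laptop sleeve,1002,8,0,0",
]
print(ad_click_counts(parse_click_log(sample_log)))
```

Counts like these could be written to the key-value store and looked up at serving time as inputs to the prediction model.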

Online Ads Ranking and Pricing

Quality Score = 0.25 * Relevance Score + 0.75 * pClick

Rank Score = Quality Score * Bid

Price(Cost Per Click) = next rank score / current quality score + 0.01
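The three formulas above can be worked through in a short sketch. The ad records and their values are made up, and the `reserve_price` charged to the last-ranked ad (which has no next rank score) is an assumption; the README does not say how that case is handled.

```python
def rank_and_price(ads, reserve_price=0.01):
    """Rank ads by quality * bid and price each one from the next ad's rank score."""
    scored = []
    for ad in ads:
        quality = 0.25 * ad["relevance"] + 0.75 * ad["pclick"]
        scored.append({**ad, "quality": quality, "rank_score": quality * ad["bid"]})

    scored.sort(key=lambda ad: ad["rank_score"], reverse=True)

    for position, ad in enumerate(scored):
        has_next = position + 1 < len(scored)
        if has_next and ad["quality"] > 0:
            next_rank_score = scored[position + 1]["rank_score"]
            ad["price"] = next_rank_score / ad["quality"] + 0.01
        else:
            # Assumption: fall back to a reserve price when no lower-ranked ad exists.
            ad["price"] = reserve_price
    return scored


# Hypothetical candidates.
candidates = [
    {"ad_id": "B", "relevance": 0.4, "pclick": 0.2, "bid": 3.0},
    {"ad_id": "A", "relevance": 0.8, "pclick": 0.4, "bid": 2.0},
]
for ad in rank_and_price(candidates):
    print(ad["ad_id"], round(ad["rank_score"], 2), round(ad["price"], 2))
```

Here ad A has quality 0.5 and rank score 1.0, ad B has quality 0.25 and rank score 0.75, so A ranks first and pays about 0.75 / 0.5 + 0.01 = 1.51 per click even though it bid 2.0.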

System

When a search query arrives, the system matches the rewritten query against ad keywords using the inverted index to get the relevance score, and predicts the click probability with the regression model trained on 50 GB of historical click data. Ad quality is determined by both the relevance score and the click probability. The ads engine calculates the quality score and combines it with the ad's bid price for final ranking and pricing.


Real Time Campaign Monitor

The real-time campaign monitor collects the ad-related events generated by the online ads server and visualizes campaign trends.

Join Events Streams

The real-time campaign monitoring system is a streaming pipeline that collects and processes the ad events generated by the online search ads engine. Chance events, impression events and click events are published to a message queue, processed as a stream and stored in a database. The front-end dashboard visualizes each campaign's budget status and its impression, click and pricing trends over time.
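The per-campaign aggregation that such a pipeline performs can be sketched as below. A plain Python list stands in for the message queue, and the event shape (`type`, `campaign_id`, `price`) is hypothetical; the README does not document the actual event schema or the streaming framework's API.

```python
from collections import defaultdict

EVENT_TYPES = ("chance", "impression", "click")


def aggregate_events(events):
    """Fold a stream of ad events into per-campaign counts and click spend."""
    stats = defaultdict(
        lambda: {"chance": 0, "impression": 0, "click": 0, "spend": 0.0}
    )
    for event in events:
        if event["type"] not in EVENT_TYPES:
            continue  # ignore unknown event types
        entry = stats[event["campaign_id"]]
        entry[event["type"]] += 1
        if event["type"] == "click":
            # Assumption: click events carry the price charged for the click.
            entry["spend"] += event.get("price", 0.0)
    return dict(stats)


# Hypothetical events in arrival order.
stream = [
    {"type": "chance", "campaign_id": 7},
    {"type": "impression", "campaign_id": 7},
    {"type": "click", "campaign_id": 7, "price": 1.5},
    {"type": "impression", "campaign_id": 8},
]
print(aggregate_events(stream))
```

A dashboard could derive budget status by subtracting each campaign's accumulated spend from its budget.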

Streaming Pipeline

[Figure: streaming pipeline]

Dashboard Visualization

[Figure: dashboard visualization]
