ECE 252 Lab 4: A simplified multi-threaded web crawler that starts from a seed URL and finds up to 50 valid PNGs

Max00358/Multi_Threaded_Image_Web_Crawler

Multi_Threaded_Image_Web_Crawler

Objective

findpng2 is a small, simplified web crawler. It searches the web starting from a seed URL and collects valid PNG URLs, stopping at the number the user specifies in the input or at 50 otherwise, and logs them to a log.txt file.

The crawler visits the page at the given URL and extracts two pieces of information. The first is the set of URLs that link to valid PNG images. The crawler adds each PNG URL it finds to a search result table. This table must contain only unique URLs, so a URL that is already in the table is not added again.
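As an illustration of the result table described above, here is a minimal Python sketch. It is not the repository's implementation. It assumes a URL counts as a valid PNG when the downloaded bytes begin with the standard 8-byte PNG signature, and the names `is_png` and `record_png` are hypothetical helpers introduced only for this example.

```python
# The 8 bytes that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data: bytes) -> bool:
    """Return True if the payload begins with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


def record_png(table: set, url: str, data: bytes) -> bool:
    """Add url to the result table only if it is a new, valid PNG URL.

    Returns True when the URL was added, False when it was rejected
    (not a PNG) or was already present in the table.
    """
    if not is_png(data) or url in table:
        return False
    table.add(url)
    return True
```

Checking the content rather than the file extension matters because a URL ending in `.png` is not guaranteed to serve PNG data, and a PNG can be served from a URL with no extension at all.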

The second is a set of new URLs to crawl further. The crawler adds these to a pool of URLs known as the URLs frontier. Since visiting web pages has a cost, the crawler should not visit the same page twice, so it needs a mechanism to remember which URLs it has already visited. As the crawler works through the URLs frontier, the process of finding target PNG URLs and new URLs to explore repeats until it finds no more new PNG URLs or reaches the maximum number of PNG URLs specified by the user.
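The frontier and visited-set logic can be sketched in a few lines of Python. This is a single-threaded sketch under stated assumptions, not the repository's code: `fetch` is a hypothetical callback that returns `(is_png, links)` for a URL, and `FAKE_WEB` is an invented in-memory stand-in for real pages so the example runs without network access.

```python
from collections import deque


def crawl(seed, fetch, max_pngs=50):
    """Crawl breadth-first from seed until the frontier is empty
    or max_pngs unique PNG URLs have been found.

    Returns (pngs, order): the PNG URLs found and every visited URL,
    both in the order they were encountered.
    """
    frontier = deque([seed])  # URLs waiting to be visited
    visited = set()           # URLs already visited
    order = []                # visited URLs, in visit order
    pngs = []                 # unique PNG URLs

    while frontier and len(pngs) < max_pngs:
        url = frontier.popleft()
        if url in visited:
            continue          # never visit the same URL twice
        visited.add(url)
        order.append(url)
        is_png, links = fetch(url)
        if is_png:
            pngs.append(url)  # unique, since each URL is visited once
        for link in links:
            if link not in visited:
                frontier.append(link)
    return pngs, order


# Hypothetical in-memory "web": url -> (is_png, outgoing links).
FAKE_WEB = {
    "http://example.com/": (False, ["http://example.com/a", "http://example.com/img1"]),
    "http://example.com/a": (False, ["http://example.com/", "http://example.com/img1",
                                     "http://example.com/img2"]),
    "http://example.com/img1": (True, []),
    "http://example.com/img2": (True, []),
}


def fake_fetch(url):
    """Look up a URL in FAKE_WEB; unknown URLs yield no PNG and no links."""
    return FAKE_WEB.get(url, (False, []))
```

A multi-threaded version would share `frontier`, `visited`, and `pngs` between workers, so each of those structures would need to be protected by a lock or replaced with a thread-safe equivalent.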

Sample Test Run

-t: Create t threads that crawl the web simultaneously
-m: Find up to m unique PNG URLs on the web
-v: Name of the log file; the crawler logs every URL it visits, one URL per line
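For illustration, the three options above could be parsed as in the following Python sketch. The option letters come from this README. Everything else is an assumption made for the example: the positional `seed_url` argument, the default of one thread, and the log file being optional. The default of 50 for `-m` mirrors the 50-PNG limit stated in the Objective.

```python
import argparse


def parse_args(argv):
    """Parse crawler options from a list of argument strings."""
    parser = argparse.ArgumentParser(prog="findpng2")
    # Assumption: a single thread when -t is not given.
    parser.add_argument("-t", type=int, default=1,
                        help="number of threads crawling simultaneously")
    # 50 matches the limit stated in the Objective.
    parser.add_argument("-m", type=int, default=50,
                        help="maximum number of unique PNG URLs to find")
    # Assumption: no visited-URL log is written when -v is omitted.
    parser.add_argument("-v", metavar="LOGFILE", default=None,
                        help="log file for visited URLs, one per line")
    # Assumption: the seed URL is passed as a positional argument.
    parser.add_argument("seed_url", help="URL to start crawling from")
    return parser.parse_args(argv)
```

Under these assumptions, `parse_args(["-t", "10", "-m", "20", "-v", "log.txt", "http://ece252-1.uwaterloo.ca/lab4"])` yields `t=10`, `m=20`, `v="log.txt"`, and the seed URL.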

The log file may look like the following:
http://ece252-1.uwaterloo.ca/lab4
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/#top
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/imgs/cpp.jpg
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqPlVGPQPMpFG8Vjvrodu_pZqXMC4jGRiLcYwY6MkhhFG1m5a3x5ZYDOiLGFmz8FTbq3sAva7QKiXY5YNIxrBdxg==
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/imgs/para.jpg
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/imgs/crawler_arch.jpg
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/imgs/Disguise.png
http://ece252-1.uwaterloo.ca:2531/image?img=1&part=0
http://ece252-3.uwaterloo.ca:2540/image?q=gAAAAABdHkoqOKR-cFRCkiCUBEMAAAWfDvBFlRisL9ysLWHYHbcQbn1b28PV_uHBZ0gJf5bvzrnf1HNXxB6KRlAVETwTIqBH2Q==
http://ece252-1.uwaterloo.ca/~yqhuang/lab4/sub1/index.html
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHkoqvYBSFk4VKGcRmB2TKYcl7hgNpTehsR0y7cd1ysTzvlgowzkgpWgwp42BumobA22jAEl0LJj_TcwR-X_F79vyuA==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoq1GsX73XwnpU1qaO1laOtmzl_e3npHIX83pNGFgfOLJG2vwICDD5Gnb7iwdzV_ZKnckzZQfhsmEEeWbvO3-4Kug==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqXCN_I78zg0zDjOvOhA19H7tB5L2pSkuG0NmXg4E1LCe1Ew8utkFxulC2ssJUGOYgguCCXcDlOraPK-tYzsMl-w==
http://ece252-3.uwaterloo.ca:2540/image?q=gAAAAABdHkoq5CAThsWSs_snhLDy2vDhV97ea1remux8ERu-H5FIIqoG5by17Zt7ba3hdadbOLXoHKJEYVHWdlvkpZg6VxeyaQ==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqKjtpjpiM4WKtVvo33HojmbxU3Q_VNRl2bS5uYIta1fUv23182WN7nHtTa-qg_M33g_5CvFk-5i95lVgkaDOCYA==
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHkoqymQU_88e8PRLFAHlee_iqRtsEkeVQsUV1YzN2c7mQXV1Sg7hOuNp212fkLFH71-EtoQMS_7bivH4V_Q5rx2Cfg==
http://ece252-3.uwaterloo.ca:2540/image?q=gAAAAABdHkoqvsHZzgOGhVZBuBxA8IkeZS6GevG48xd-O9--4La4GPLs4qMW8QWW63iNSc4bypjbMJK1tZZGvHELGeVTmaoyZw==
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHlHZgxhjnyyylvmstitqthsafcipehodfzmjbugcviblrqxfzjgucrcllazmyhznzdkcmymdknquttkinbgbtshojf==
http://ece252-3.uwaterloo.ca:2540/image?q=gAAAAABdHkoqdRxJhdCk49ofJ1TCxsbV6DF4DIHKtdWsF-5fJM0BHF6b6oM5Y99T7T78SQQcii0YvDMqzGN4L8a4eFsQotIHLA==
http://ece252-3.uwaterloo.ca:2540/image?q=gAAAAABdHkoqu2HgO66gfwYQjCHJninMtqjg83iaDpT6U54p8F9XGbboHeGjhalG9fE7CMNx7vkgZZMzN-9c6EqVE5qXlhdqnw==
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHkoqMW24vZRpDznRyQZuRsR5vjPThROSj7k52y1weIHxIKOvt9xOxAbpZNudfTkojMyk2ii5Ap5o7e4cRdq_5C0aKw==
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHkoqLkzOa89N8w0I4kdmfUtjcN1D6AZYEOJCzr6qpnWae4LfzfIhCRH_MOeQj8x8_ugiRcF79z5D723Ssnm7U-84og==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqBsj81dnixZFLtjKlW6lHD5o0y0IWPRDcP0L_uEzcHVKXT6L-32Cl4u-th38LksMevU6T3NRufm6yVtFON0tXHg==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqYBTxXgfsZAuOV3Nkm0obQ58hGNAnrTrEDwPseWO4OKq2JXTnwAhg1lufmc4OhP7Vtf4hMEgzNdxNdeAgjsscVg==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoq-JvNAjwE7fa5r26FVJgLMEqvkrj6AlO2BKI5zQtKRThKL79uHBoP-uRzt9WzYyPeec6zE6bmdkzy5UXt5Br2Iw==
http://ece252-2.uwaterloo.ca:2540/image?q=gAAAAABdHkoq292-4eAqc4xZ-nwNibr1F5ZZCJhojZGV7hQMtNzryLwR8grsztNSurbqnucszcyMbtSE8BaFg049sI6WZ4aDfg==
http://ece252-1.uwaterloo.ca:2540/image?q=gAAAAABdHkoqtuljtcVzSazkwT4SVEw_rlEYwknlCq6mKLqndncMei2Om8ccO2ieak5QQCVNJKuwG5FYmX8sgbhKht2xt4ZNlw==
