How to web crawl on Google

java proxy web-crawler jsoup search-engine

153 views

1 reply

1526 reputation

My requirement is to produce a report on a given keyword by searching for that keyword online.

My plan is that my web crawler will:

  1. Search the keyword on Google, Bing, or Yahoo
  2. Open the pages/links returned by Google, Bing, or Yahoo
  3. Make the report using those pages.

I want to build a rule-obeying web crawler, but when I look at the robots.txt of these websites I see that the search engines have disallowed crawlers from accessing their search pages, for example:

google.com/robots.txt

User-agent: *
Disallow: /search

I know that if I try to search for keywords on the search engines, my IP might be blocked.

My new plan is that my web crawler will:

  1. Search the keyword on Google, Bing, or Yahoo (at most 2-3 times per day, spread out over different times of the day)
  2. Open the pages/links returned by Google, Bing, or Yahoo, with a 2-3 minute delay between each page/link (see the sketch after this list)
  3. Make the report using those pages.
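Roughly, the crawling loop I have in mind looks like this. This is just a sketch: the search URL and the CSS selector for result links are placeholders, since every engine marks up its results differently.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class KeywordCrawler {

        public static void main(String[] args) throws Exception {
            String keyword = "example keyword";
            // Placeholder search URL; the real /search pages are disallowed by robots.txt,
            // so in practice this would have to be an endpoint I am actually allowed to use.
            String searchUrl = "https://www.example-search.com/search?q="
                    + java.net.URLEncoder.encode(keyword, "UTF-8");

            Document results = Jsoup.connect(searchUrl)
                    .userAgent("MyReportCrawler/1.0")
                    .timeout(10_000)
                    .get();

            // Placeholder selector: each engine marks up result links differently.
            for (Element link : results.select("a.result-link")) {
                String pageUrl = link.absUrl("href");

                Document page = Jsoup.connect(pageUrl)
                        .userAgent("MyReportCrawler/1.0")
                        .timeout(10_000)
                        .get();

                System.out.println(page.title());
                // TODO: feed the page content into the report.

                // 2-3 minute delay between pages, as planned.
                Thread.sleep(2 * 60 * 1000L);
            }
        }
    }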

Questions

  1. Even after taking this much care, will Google block my IP? Is it safe to web crawl like that?
  2. Also, what are good techniques for using proxies to hide/change my actual IP address?

PS: I am using Java and Jsoup for web crawling.
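For question 2, the only technique I know of so far is Jsoup's per-request proxy support (available in Jsoup 1.9+). The proxy host and port below are placeholders; I don't know whether rotating proxies like this is acceptable or effective.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ProxyFetch {

        public static void main(String[] args) throws Exception {
            // Placeholder proxy host/port; in practice these would come from a proxy pool.
            String proxyHost = "127.0.0.1";
            int proxyPort = 8080;

            Document doc = Jsoup.connect("https://example.com/")
                    .proxy(proxyHost, proxyPort)   // route this request through the proxy
                    .userAgent("MyReportCrawler/1.0")
                    .timeout(10_000)
                    .get();

            System.out.println(doc.title());
        }
    }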

Author: Junaid, posted: September 15, 2017

1 answer


0

0 reputation

Try Selenium to do your job. It is built for browser automation, so I don't think your IP will be blocked by any of the service providers.
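Something like this, just as a sketch: it assumes chromedriver is installed and on your PATH, and the search URL and result-link selector are placeholders you would adapt to the engine you target.

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class SeleniumSearch {

        public static void main(String[] args) {
            // Assumes chromedriver is installed and on the PATH.
            WebDriver driver = new ChromeDriver();
            try {
                // Placeholder search URL.
                driver.get("https://www.example-search.com/search?q=my+keyword");

                // Placeholder selector; each engine marks up result links differently.
                for (WebElement link : driver.findElements(By.cssSelector("a.result-link"))) {
                    System.out.println(link.getAttribute("href"));
                }
            } finally {
                driver.quit();
            }
        }
    }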

Author: user8432378, posted: September 15, 2017