[SEO crawler] Using Scrapy to capture Baidu search results


The more knowledge you grasp, the more opportunities you can seize. Knowledge is extracted from information, and information comes from analyzing data. Step by step, “data” is transformed into “information”, “information” into “knowledge”, and “knowledge” into “wisdom”.

Learning from the Internet is unlike school, where teachers hand students ready-made knowledge. The Internet develops rapidly and unpredictably, so most of its information remains undiscovered; few people will tell you which information matters or what knowledge it contains. A great deal of information still lies buried in raw data, waiting to be analyzed.

Therefore, in the Internet industry, data acquisition and analysis are general-purpose skills like communication and time management: they transfer across roles and are not tied to any one job function.

Data can be obtained by exchange, purchase, APIs, and so on. But if nobody else has the data you need, you have to collect it yourself, then analyze it for information and distill that into knowledge.

For example, to analyze what the pages on Baidu’s first result page have in common, I listed several factors that might affect ranking: page size, download speed, number of links on the page, word count of the body text, directory depth of the URL, number of times the query appears in the body, number of terms after query segmentation, number of times the query appears in the title, and so on. I then ran 5,000 long-tail keywords through Baidu search, downloaded every page that appeared in the first five result pages, computed the dozen-plus indicators above for each page, and compared the indicators across result pages (50,000 samples per result page) to see which showed a clear pattern.
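To make the indicator step concrete, here is a rough sketch of computing a few of the listed factors from a downloaded page. The function name is mine, and the regex-based tag stripping and link counting are simplifications; a real pipeline would use a proper HTML parser.

```python
import re

def page_indicators(html, query):
    """Compute a few of the ranking factors listed above from raw HTML.
    Regex extraction is illustrative only; use an HTML parser in practice."""
    text = re.sub(r'<[^>]+>', ' ', html)  # crude tag stripping
    words = text.split()
    return {
        'page_size': len(html),                                 # size of the page
        'link_count': len(re.findall(r'<a[\s>]', html, re.I)),  # links on the page
        'word_count': len(words),                               # words in the body text
        'query_in_text': text.lower().count(query.lower()),     # query occurrences
    }
```

Running this over every captured page and grouping by result page (1 through 5) gives the per-indicator averages discussed below.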

That was the data collection. Analyzing the data revealed:

Pages ranked on the first result page contain 500 words of body text on average, and the average decreases steadily from the second through the fifth result page.

Pages ranked on the first result page contain 130 links on average, and the average increases steadily from the second through the fifth result page.

The other indicators show no significant variation across result pages.

That is the information. Abstracting it yields knowledge:

The body word count and the number of links on a page affect the ranking of long-tail keywords.

For pages targeting long-tail keywords, keeping the body text above 500 words and the number of links on the page below 130 improves the probability that the page will appear on Baidu’s first result page.
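The rule reduces to a trivial check. The function name and the pass/fail framing are mine; the two thresholds come from the analysis above.

```python
def likely_first_page(word_count, link_count):
    """True if a page meets both thresholds observed for first-page results:
    at least 500 words of body text, at most 130 links on the page."""
    return word_count >= 500 and link_count <= 130
```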

Of course, real ranking factors are far more complex than this; beyond these two, multiple conditions must be satisfied simultaneously for a page to reach the first result page.

We should also pay attention to the reliability and fairness of the data we obtain: reliability is whether the data can support a correct conclusion; fairness is whether the data was collected without bias.

Take the example above: if the 5,000 long-tail keywords were replaced with 5,000 hot keywords, the results would be neither reliable nor fair. Baidu is a commercial search engine, and its results for long-tail keywords are relatively less commercialized.

To drive traffic, a lot of data has to be collected yourself, and that means crawlers. I spent a few days trying Python’s Scrapy, and it feels good: a high-performance, easy-to-use, robust, stable, highly customizable, distributed crawler framework.

As a mature crawler framework, it is much faster than writing a crawler by hand, and compared with Locomotive (a desktop scraping tool), it can do things Locomotive cannot — such as the example above.

Here are a few notes on using Scrapy to capture Baidu search results.

Project composition:

scrapy.cfg: project configuration file
items.py: defines the data fields to be scraped
pipelines.py: processes and stores the scraped data
settings.py: crawler configuration file, with ready-made options where various anti-ban policies can be added
middlewares.py: middleware
dmoz_spider.py: the spider program

Referring to the official documentation and some googling, I wrote a program to capture Baidu rankings; once launched and tuned, it was faster than Locomotive.

Since Baidu blocks crawlers fairly aggressively, some anti-blocking strategies are needed. They are implemented as follows:

1. Rotating egress IPs

I used the proxy provided by scrapinghub (Crawlera). Since its IPs are overseas, visiting Baidu is slower than from within China, but the proxy is very stable, easy to configure, free, and seems to have no usage limit.


In settings.py:

```python
# Crawlera account and password
CRAWLERA_ENABLED = True
CRAWLERA_USER = 'account'
CRAWLERA_PASS = 'password'

# Downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 600,
}
```

2. Rotating user agents


In settings.py:

```python
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    # ...
]

# Downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgent': 1,
}
```

And add to middlewares.py:

```python
import random

class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones"""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # print "*" + random.choice(self.agents)
        request.headers.setdefault('User-Agent', random.choice(self.agents))
```

3. Rotating cookies, and fully simulating browser request headers


```python
import random

def getCookie():
    cookie_list = [
        'cookie1',  # cookies captured from different browsers keep being added
        'cookie2',
        # ...
    ]
    return random.choice(cookie_list)

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.baidu.com',
    'RA-Sid': '7739A016-20140918-030243-3adabf-48f828',
    'RA-Ver': '3.0.7',
    'Upgrade-Insecure-Requests': '1',
    'Cookie': '%s' % getCookie(),
}
```

settings.py also gets a few additional configuration items:

```python
# Download delay, i.e. the wait between downloading two pages
DOWNLOAD_DELAY = 0.5
# Global concurrency maximum
CONCURRENT_REQUESTS = 100
# Per-domain concurrency maximum
CONCURRENT_REQUESTS_PER_DOMAIN = 100
# The AutoThrottle extension; the default is False
AUTOTHROTTLE_ENABLED = False
# Download timeout
DOWNLOAD_TIMEOUT = 10
# Reduce the log level; comment out to log crawling details
LOG_LEVEL = 'INFO'
```

Spider program: dmoz_spider.py
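The spider code itself is not reproduced in the post. As a rough sketch of the core logic such a spider needs — building the SERP URLs for each keyword and result page, and pulling result links out of the response — assuming Baidu’s `/s?wd=…&pn=…` URL scheme; the helper names are mine:

```python
import re
from urllib.parse import quote

def serp_urls(keyword, pages=5):
    """Build Baidu SERP URLs for the first `pages` result pages.
    `pn` is the zero-based result offset (10 organic results per page)."""
    return ['https://www.baidu.com/s?wd=%s&pn=%d' % (quote(keyword), p * 10)
            for p in range(pages)]

def extract_result_urls(html):
    """Crudely pull outbound links from a SERP page. Real markup needs a
    proper CSS/XPath selector inside the spider's parse() callback."""
    return re.findall(r'href="(https?://[^"]+)"', html)

# Inside a Scrapy spider these would be used roughly like:
#   def start_requests(self):
#       for kw in self.keywords:
#           for url in serp_urls(kw):
#               yield scrapy.Request(url, callback=self.parse)
```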


It grabs about 300 pages per minute, with plenty of room to improve; using domestic proxies alone would make it much faster. The anti-ban measures above work for any website; to crawl another site, only the corresponding details need to change.

The scraped data is written to MySQL:
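The pipeline code is not reproduced in the post either. A minimal sketch of a Scrapy item pipeline that writes one row per scraped item — the table and column names are hypothetical, and `pymysql` stands in for whatever MySQL driver the original used:

```python
class MySQLPipeline(object):
    """Write each scraped item as one row in MySQL (hypothetical schema)."""

    INSERT_SQL = ("INSERT INTO baidu_serp (keyword, rank_pos, url, title) "
                  "VALUES (%s, %s, %s, %s)")

    def open_spider(self, spider):
        import pymysql  # assumed driver; imported lazily
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='', db='seo', charset='utf8mb4')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(self.INSERT_SQL,
                         (item['keyword'], item['rank_pos'],
                          item['url'], item['title']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

# Enable it in settings.py:
# ITEM_PIPELINES = {'tutorial.pipelines.MySQLPipeline': 300}
```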
