This article is about 2,200 words. The suggested reading time is 4 minutes, but comprehension should take priority over speed.
Before describing the technical architecture of search engines, a general overview of what a search engine is will make the technical issues easier to understand.
A search engine is a system that uses specific computer programs to collect information from the Internet according to certain strategies. After organizing and processing the information, it provides retrieval services and displays the relevant information to users.
From this definition, it is easy to extract four points:
What is a search engine? It is a software system.
Where does a search engine look for information? On the Internet.
What does a search engine do on the Internet? It gathers information.
Why does it gather information? To provide services to users.
In short, search engines are software systems that collect information from the Internet and provide services to users.
Search engines fall into several categories. By the kind of information they search, they can be divided into:
1) Full-text search engine
As the name implies, it searches all the content of a web page: text, pictures, videos, links, and so on. Examples are Baidu and Google.
2) Vertical search engine
It collects and processes web pages in one specific vertical domain. For example, Ctrip in China collects and processes only ticket and travel information, and Pinterest abroad mainly collects and processes pictures.
3) Meta search engine
“Meta” can be understood as data about data, such as the word count of this article or the size of a piece of information.
Abstractly speaking, a meta search engine is a search engine that collects and processes the output of other search engines.
Specifically, a meta search engine integrates the results of several search engines and presents them to users together.
One example is MetaCrawler (it cannot be visited from China).
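To make the idea of integrating results from several engines concrete, here is a minimal sketch. It assumes each engine has already returned a ranked list of URLs; the function name merge_results and the example URLs are hypothetical, and the round-robin merge is only one simple strategy, not a description of how any real meta search engine works.

```python
from itertools import zip_longest

def merge_results(result_lists):
    """Interleave ranked result lists from several engines, dropping duplicate URLs."""
    merged = []
    seen = set()
    # zip_longest walks rank 1 of every engine, then rank 2, and so on,
    # padding the shorter lists with None.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical ranked results from two engines for the same query.
engine_a = ["a.example/apple", "b.example/fruit", "c.example/pie"]
engine_b = ["b.example/fruit", "d.example/phone"]
print(merge_results([engine_a, engine_b]))
```

A page returned by both engines appears only once, at the best rank at which either engine placed it.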
As one of the most technically demanding applications on the Internet, search engines serve billions of users every day. Beyond knowing how to type “Apple” into the Baidu search box and click on a page that Baidu returns, users may know very little about search engines. Yet before the user sees a result, the search engine relies on a complex architecture and complex algorithms to collect and process massive amounts of data and to return the most relevant information it can.
Next, we look at the search engine architecture at a very high level.
A search engine is composed of many technical modules, each responsible for a different stage of processing different data. Together they form a complete technical architecture, and the architecture diagram covers most of the general work of a search engine.
For ease of understanding, I divide these technical modules into two stages:
The first stage is the two columns on the left of the diagram. It happens before the user enters a query, that is, it is the work the search engine does silently in the background.
The second stage is the two columns on the right. It happens in the few milliseconds between the user entering a query and the search results being returned.
1) The first stage: what the search engine does silently in the background.
First, the search engine uses crawlers to fetch pages from the Internet and download them to local machines. You can think of this as saving each page locally as if it were a Word document.
This step is like a supermarket's purchasers buying in a great many goods.
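The crawling step can be sketched as a breadth-first walk over links. To keep the example runnable offline, a small dictionary named FAKE_WEB stands in for the Internet; that dictionary, the crawl function, and its fetch parameter are all assumptions made for illustration, and a real crawler would also need politeness rules, retries, and HTML parsing.

```python
from collections import deque

# Hypothetical stand-in for the Internet: URL -> (page text, outgoing links).
FAKE_WEB = {
    "site/home": ("welcome to the fruit shop", ["site/apple", "site/pear"]),
    "site/apple": ("apple is a sweet fruit", ["site/home"]),
    "site/pear": ("pear is a juicy fruit", ["site/apple", "site/missing"]),
}

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: text} for every page fetched."""
    downloaded = {}
    queue = deque([seed])
    queued = {seed}  # URLs already scheduled, so no page is fetched twice
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # dead link: nothing to download
            continue
        text, links = page
        downloaded[url] = text
        for link in links:
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return downloaded

pages = crawl("site/home", FAKE_WEB.get)
print(sorted(pages))
```

Passing fetch as a parameter means the same loop could be pointed at a real HTTP client later without changing the traversal logic.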
Second, dedicated programs deduplicate the downloaded data. A great many documents are downloaded, and many of them are completely identical, so the duplicates must be removed to ensure that each remaining document has unique content.
This step is like supermarket clerks labeling each item with a unique price tag.
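One simple way to picture deduplication is to hash each document's content and keep only the first document for each hash. The sketch below assumes the pages are already held in a dictionary of URL to text; the deduplicate function and the sample pages are hypothetical. It catches only copies that are identical after whitespace and case normalization; near-duplicates require other techniques such as SimHash.

```python
import hashlib

def deduplicate(pages):
    """Keep the first URL seen for each distinct page content."""
    seen_hashes = set()
    unique = {}
    for url, text in pages.items():
        # Normalize whitespace and case so trivially different copies match.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique[url] = text
    return unique

# Hypothetical downloaded pages; the second is a copy of the first.
pages = {
    "a.example/1": "Apple is a fruit",
    "b.example/copy": "apple  is a   FRUIT",
    "c.example/2": "Pear is a fruit",
}
unique = deduplicate(pages)
print(list(unique))
```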
Third, dedicated programs parse the deduplicated documents, that is, they extract the content and the links of each document. Algorithms process the text to build an inverted index table, which maps each word to the documents that contain it. Other algorithms process the links to build the link relationship between pages.
This step is like supermarket clerks remembering the main contents of each price label, such as whether the item is food or clothing.
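The two structures produced by parsing can be sketched in a few lines. This assumes the text and the outgoing links of each document have already been extracted; the function names and sample data are hypothetical, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def build_inlinks(links):
    """Invert {page: [outgoing links]} into {page: [pages that link to it]}."""
    inlinks = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)

# Hypothetical parsed documents and their outgoing links.
docs = {1: "apple is a fruit", 2: "apple makes phones", 3: "pear is a fruit"}
links = {1: [3], 2: [1], 3: [1]}

index = build_inverted_index(docs)
inlinks = build_inlinks(links)
print(index["apple"], index["fruit"])
print(inlinks)
```

The index is called "inverted" because it goes from word to documents, the reverse of the document-to-words view that the crawler produced.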
Finally, the goal of all these operations is to produce a good inverted index table and link relationship, and to handle anti-cheating, such as removing illegal content and deleting low-quality web pages. This is similar to inspecting supermarket goods before they go on sale.
This step is like supermarket clerks remembering which goods are recommended together, such as whether apples are placed with pears or with lipstick.
That ends the first stage. Of course, the real crawling and processing is much more complex. Next comes the second stage: the few milliseconds between the user entering a query in the search box and the search results being displayed.
2) The second stage: what happens after the user enters a query.
First, the user enters “Apple” in the search box.
This step is like a customer walking into a supermarket and telling a salesperson that he or she wants to buy an “apple” (assuming the customer does not have to walk through the supermarket to find it personally).
Second, the search engine quickly checks its cache system for “Apple”. The cache system can be understood as a separate, easy-to-access place that holds the results of the queries users search for most often.
This step is like the salesperson looking for “apples” on the nearest best-seller shelf. If an “apple” is there, the salesperson hands it directly to the customer. If not, the salesperson hurries into the supermarket to look for it, which is the next step.
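The check-the-cache-first flow can be sketched with a small least-recently-used cache. The QueryCache class, the search function, and the slow_lookup parameter are hypothetical names for illustration; the article does not say which eviction policy a real search engine uses, so least-recently-used is an assumption.

```python
from collections import OrderedDict

class QueryCache:
    """A tiny least-recently-used cache from query string to results."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, query):
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return self._entries[query]

    def put(self, query, results):
        self._entries[query] = results
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

def search(query, cache, slow_lookup):
    """Serve from the cache when possible; otherwise do the full lookup."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"
    results = slow_lookup(query)
    cache.put(query, results)
    return results, "index"

cache = QueryCache(capacity=2)
fake_lookup = lambda q: [q + "-page"]  # stand-in for the real index lookup
print(search("apple", cache, fake_lookup))  # first time: goes to the index
print(search("apple", cache, fake_lookup))  # second time: served from cache
```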
Third, if the query is not in the cache system, the search engine looks up the query words in the content and links processed in the first stage, and finds the information the user may want.
This step is like the salesperson quickly going through the price labels of every item and finding all the items related to “apple”.
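Looking a query up in the inverted index can be sketched as intersecting the document lists of its words. The lookup function and the sample index below are hypothetical, and requiring every word to match is an assumption; real engines apply more flexible matching.

```python
def lookup(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # One set of document ids per query word; unknown words give an empty set.
    postings = [set(index.get(word, [])) for word in words]
    return sorted(set.intersection(*postings))

# Hypothetical inverted index built in the first stage.
index = {"apple": [1, 2], "fruit": [1, 3], "phones": [2]}
print(lookup(index, "apple"))
print(lookup(index, "Apple fruit"))
```

Because the index already maps words to documents, the lookup never has to scan the documents themselves, which is what makes millisecond responses plausible.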
Finally, the search engine finds hundreds of millions of potentially relevant pages within a few milliseconds. Using a relevance algorithm, it places the pages the user most likely wants at the top and displays the rest in descending order of relevance.
This step is like the salesperson holding hundreds of millions of labels and deciding what to offer first. Is it the “apple” you can eat, an Apple phone, or an apple-shaped pillow? The salesperson recommends the most relevant items based on experience.
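As a toy illustration of sorting by relevance, the sketch below scores each candidate document by how large a share of its words match the query. This scoring rule, the rank function, and the sample documents are assumptions chosen for simplicity; the article does not specify the relevance algorithm, and real engines combine many more signals.

```python
def rank(docs, candidate_ids, query):
    """Sort candidate documents by the share of their words that match the query."""
    words = query.lower().split()

    def score(doc_id):
        tokens = docs[doc_id].lower().split()
        if not tokens:
            return 0.0
        matches = sum(tokens.count(word) for word in words)
        return matches / len(tokens)

    # Highest score first; ties are broken by document id for a stable order.
    return sorted(candidate_ids, key=lambda doc_id: (-score(doc_id), doc_id))

# Hypothetical candidate documents returned by the index lookup.
docs = {
    1: "apple apple pie recipe",
    2: "apple phone review and price",
    3: "apple",
}
print(rank(docs, [1, 2, 3], "apple"))
```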
The above shows the general technical architecture of a search engine, supplemented by the supermarket example. I hope it helps you understand what a search engine does out of the user's sight, so that we as users can appreciate the greatness of Internet technology, and those of us who work in the Internet industry can admire the search giant Google all the more.
In addition, product managers should not complain that engineers cannot build a Baidu-level search feature. If an engineer could build one, he would not be working at this company.
After this, I will publish more articles in the search engine series, explaining what a search engine does at each stage and each step, why it does it, and how.