TF-IDF algorithm of SEO algorithm


1. The concept of the TF-IDF algorithm

TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and text mining. It is a statistical method for assessing how important a word is to a document within a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency of its appearance across the corpus as a whole. Search engines often use variations of TF-IDF weighting to score the relevance of documents to a user query. In addition to TF-IDF, web search engines also use link-based ranking methods to determine the order in which documents appear in search results.

The algorithm is widely used in data mining, text processing, and information retrieval, for example to extract the keywords of an article.

The main idea of TF-IDF is this: if a word or phrase appears frequently in one article but rarely in other articles, it has a good ability to distinguish between categories and is well suited for classification. TF-IDF is simply TF * IDF, where TF (term frequency) is the frequency with which a term appears in the article, and IDF (inverse document frequency) grows as the number of documents containing the word shrinks: the fewer documents contain a word, the greater that word's discriminating power and the larger its IDF. To find the keywords of an article, compute TF-IDF for all the nouns that appear in it; the larger a noun's TF-IDF, the better it discriminates this article from others. The few words with the highest TF-IDF values can be used as the article's keywords.
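As a toy sketch of this keyword-selection idea (the mini-corpus and tokenization below are illustrative, not from the original article), score every word of one article and pick the highest:

```python
import math

# Illustrative mini-corpus: each article is a list of tokens.
docs = [
    "the cat sat on the mat the cat purred".split(),
    "the dog chased the ball in the park".split(),
    "stocks fell as the market reacted to the news".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)             # term frequency, length-normalized
    df = sum(1 for d in docs if term in d)      # documents containing the term
    idf = math.log(len(docs) / df)              # inverse document frequency
    return tf * idf

# Score every distinct word of the first article; the top scorer is a keyword.
scores = {w: tf_idf(w, docs[0], docs) for w in set(docs[0])}
best = max(scores, key=scores.get)
print(best)                      # → cat  (frequent here, absent elsewhere)
print(round(scores["the"], 3))   # → 0.0  ("the" appears everywhere, so IDF = 0)
```

Note how "the", despite being the most frequent token in the article, scores zero: it appears in every document, so its IDF vanishes.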

2. Principle of TF-IDF algorithm

In a given document, term frequency (term frequency, TF) refers to the number of times a given word appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, which also distinguishes it from IDF) to prevent a bias toward long documents: the same word may have a higher raw count in a long document than in a short one, regardless of whether the word is actually more important there.

Inverse document frequency (inverse document frequency, IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

A high term frequency within a particular document, combined with a low document frequency for the word across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.

To restate: TF-IDF = TF * IDF. TF (term frequency) is the frequency with which term t appears in document d, that is, the number of times the given term occurs in the document. The main idea of IDF (inverse document frequency) is that if fewer documents contain term t, i.e. the document count n is smaller, then IDF is larger (see the subsequent formula), which means t has a good ability to distinguish categories. However, suppose the documents of one class C that contain term t number m, while the documents of all other classes that contain t number k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF formula yields a small value, suggesting that t is not very discriminative. In fact, if a term appears frequently in the documents of one class, it represents the characteristics of that class well and should be given a higher weight and selected as a feature word to distinguish that class from other documents. This is a deficiency of IDF.
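A small numeric sketch of this deficiency (the counts are illustrative): a term concentrated in one class receives the same IDF as a term scattered across all classes, because IDF only sees the total count n.

```python
import math

total_docs = 1000   # |D|: documents in the corpus (illustrative numbers)
m = 90              # documents of class C containing term t
k = 10              # documents of other classes containing t
n = m + k           # all documents containing t

# IDF sees only n; it cannot tell that t is concentrated in class C.
idf_concentrated = math.log(total_docs / n)

# A term spread evenly over 100 documents of mixed classes gets the same IDF.
idf_scattered = math.log(total_docs / 100)

print(idf_concentrated == idf_scattered)   # → True
```

Even though t is a strong indicator of class C here (90 of its 100 occurrences are in C), IDF gives it no extra weight.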

In a given document, term frequency (TF) is the frequency of a given term in that document: the raw count is normalized by the document's total word count to prevent a bias toward long documents (the same word may have a higher raw count in a long document than in a short one, regardless of its importance). For a term t_i in a particular document d_j, its importance can be expressed as:

tf_{i,j} = n_{i,j} / \sum_k n_{k,j}

where n_{i,j} is the number of times term t_i appears in document d_j, and the denominator is the sum of the occurrence counts of all terms in d_j.

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

idf_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}

where |D| is the total number of documents in the corpus and |\{ j : t_i \in d_j \}| is the number of documents containing term t_i. If the term does not appear in the corpus at all, this divisor would be zero, so in practice 1 + |\{ j : t_i \in d_j \}| is usually used. Then:

tfidf_{i,j} = tf_{i,j} \times idf_i

In short, a high term frequency within a particular document and a low document frequency across the collection together produce a high TF-IDF weight, which is why TF-IDF filters out common words and retains important ones.
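The formulas above can be sketched in plain Python (the corpus is illustrative; the +1 in the IDF denominator is the zero-divisor guard just described):

```python
import math

# Illustrative corpus: each document is a list of terms.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
    "an example about an example term".split(),
]

def tf(term, doc):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return doc.count(term) / len(doc)

def idf(term, docs):
    # idf_i = log(|D| / (1 + |{j : t_i in d_j}|)); the +1 guards
    # against a zero divisor when the term appears in no document.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

def tfidf(term, doc, docs):
    # tfidf_{i,j} = tf_{i,j} * idf_i
    return tf(term, doc) * idf(term, docs)

print(round(tfidf("example", docs[3], docs), 3))   # → 0.231
```

Here "example" has tf = 2/6 in the last document and appears in only one of four documents, so it receives a high weight there.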

3. Implementing the TF-IDF algorithm with Scikit-Learn

Because TF-IDF is so common in text data mining, Python's machine-learning packages provide a built-in implementation. In Scikit-Learn the main class is TfidfVectorizer(); let us look at a simple example.

The final result is a 4 x 9 matrix. Each row represents a document, and each column the score of one vocabulary word for that document; if a word does not appear in a document, the corresponding entry is 0. The 9 comes from the nine distinct words in the corpus vocabulary. For example, the word "and" does not appear in document 1, so the entry in the first row, first column of the matrix is 0. The word "first" appears only in document 1, so it carries a higher weight in the first row; "document" and "this" appear in three documents each, so they have lower weights; and "the" appears in all four documents, so it has the lowest weight.

One final note: the function TfidfVectorizer() takes many parameters, and we have used only the defaults here, so the output may differ from what the basic algorithm described earlier would produce (although the relative magnitudes do not change). Interested readers can refer to [4] for more details of the TF-IDF implementation in Scikit-Learn.

4. Summary of TF-IDF algorithm
