Magi is a machine-learning-based information extraction and retrieval system developed by Peak Labs. It can extract knowledge from natural language text in any field into structured data, continuously aggregate it and correct errors through lifelong learning, and provide human users and other artificial intelligence with a knowledge system that is parsable, retrievable, and traceable.
If you’re here from magi.com, congratulations on finding half of Magi! The site, which looks a lot like a search engine, is the public version of Magi. Unlike a search engine, however, Magi not only indexes huge amounts of text from the Internet, but also tries to understand and learn the knowledge and data contained in that text.
Try searching for something you care about at magi.com, or ask a question directly, and Magi will try to provide you with highly aggregated, structured knowledge results.
We have also built a web search engine for Magi from scratch, so magi.com provides ordinary web-wide search results as well. Even if there happen to be no structured results for your query, your visit will not be in vain.
It is worth mentioning that this learning process runs around the clock without human intervention, and knowledge from breaking news events generally takes only about five minutes to be learned. As the number of cross-verifiable sources grows, the credibility of previously learned knowledge is reassessed, so that errors in the results are corrected automatically.
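The idea that credibility rises as more independent sources confirm a fact can be illustrated with a small sketch. This is not Magi's actual scoring method, which the text does not describe; the function name `aggregate_credibility`, the noisy-OR formula, and the per-source reliability values are all assumptions made for illustration.

```python
from math import prod


def aggregate_credibility(source_reliabilities):
    """Noisy-OR combination (an assumed stand-in, not Magi's formula).

    Treats each source as independently correct with the given
    probability and returns the probability that at least one is right.
    """
    if not source_reliabilities:
        return 0.0
    return 1.0 - prod(1.0 - r for r in source_reliabilities)


# Hypothetical reliabilities for sources that report the same fact.
one_source = aggregate_credibility([0.6])
three_sources = aggregate_credibility([0.6, 0.5, 0.7])

print(f"one source:    {one_source:.2f}")
print(f"three sources: {three_sources:.2f}")
```

Under this assumption, a fact reported by a single moderately reliable source scores lower than the same fact confirmed by three, so re-running the aggregation whenever a new source appears is one simple way to reassess earlier results.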
At present, only a very small fraction of the knowledge on the Internet has been manually organized into machine-parsable formats, such as encyclopedia infoboxes and vertical-domain databases. This information is only a drop in the ocean: in coverage, update frequency, and reliability, it cannot meet the growing demand for automation and intelligence.
The fundamental contradiction is this: reading natural language is not difficult for humans, but human energy is limited. People cannot keep up with the rate at which valuable information is produced, nor can they guarantee consistency and objectivity. Machines are tireless and far faster, but they struggle to make use of vast, messy free text, so its inestimable value remains buried between the lines.
Imagine a database that is continuously and automatically updated, and that contains data structures, extracted from text across the Internet, that programs and algorithms can easily process. Then perhaps:
All kinds of voice assistants will no longer say to you, “I’m sorry, I don’t know”;
Business intelligence can draw on a wide range of background knowledge to make better judgments;
The efficiency of data collection and verification of financial information services will be significantly improved.
The public site magi.com gives human users a new way to interact with Internet data, while the technology platform behind Magi carries the other, equally important half of its significance: enabling machines to understand and make full use of the endless knowledge on the Internet.
In current work in related fields, machine question answering is, after all, still a service for humans: a text answer given in response to a text question cannot be used directly by downstream tasks. The question-answering model itself also cannot scale in either capacity or update efficiency. Worse, the knowledge in the model lives in a “black box” of floating-point numbers, and in our view, presenting users directly with information that cannot be interpreted or traced is not the most responsible approach. Approaches based on document retrieval, for their part, cannot meet the requirement for structure. The efficiency constraints of real-time online services make it difficult for them to evaluate all documents and reach a globally optimal answer, and they place high demands on the quality of the user's query.
To sum up, we think that knowledge extraction is far more important than simply answering questions, and that actively discovering potential knowledge and continuously extracting and correcting it is significantly stronger than passively matching results to input questions. Understanding language is already very difficult for machines, and Magi has chosen to take on one of the most complex targets, open-domain Internet text, and to confront the core contradiction between scale and accuracy in knowledge engineering.
To make better use of information, Magi must extract knowledge as thoroughly as possible from every passage of text, whatever its quality or topic. This rules out existing technical solutions: it is no longer a clean sequence tagging problem, interleaved and overlapping relations make the search space grow explosively, and the unrestricted domain means that no ready-made training data exists.
It took us many years to design and develop the whole technology stack from scratch: a distributed search engine with an original succinct index structure, a neural extraction system using a specially designed Attention network, a streaming crawler that does not rely on a headless browser, a natural language processing pipeline that supports mixed processing of more than 170 languages, and more. Along the way, we quietly accumulated unique training and pre-training data.
By introducing query-independent quality factors from traditional search, Magi gives more weight to high-quality, reliable sources. Its extraction model, based on multi-level transfer learning, completely abandons steps that limit generalization, such as hand-written rules, role labeling, and dependency parsing, and can be applied directly to text in other languages in a zero-resource setting with satisfactory results. As data accumulates and sources become more diverse, the system keeps learning and adjusting, automatically eliminating the noise and erroneous results it has learned.
Together, these efforts make up the Magi presented here. As a unique and forward-looking project, Magi will regularly publish some of its data and related research results on platforms such as Zenodo and arXiv.
Magi is still far from mature, but its characteristics give it enormous potential and room to grow.
By starting with the most intractable material, open-domain information from the Internet, Magi demonstrates its potential as the “one system to rule them all.” Faced with text in every field, Magi's technical approach has moved from tackling each domain one by one to a unified solution, which is the difference between the limited and the unlimited.
As its data grows in volume and credibility, Magi can serve as the “ImageNet of knowledge” and empower various industries. Information extraction tasks in each specialized field can reach better results by fine-tuning the Magi model on a small amount of data.
Perhaps in the near future, with the progress of the industry as a whole, the structured network built by Magi will become the cornerstone of explainable artificial intelligence.
Peak Labs recently released a public version of its artificial intelligence system, Magi, at magi.com. Through this search engine, users can enter keywords to obtain both web search results and the structured knowledge that Magi has learned on its own from Internet text. Each structured result is accompanied by its source links and a credibility score.
This is different from the traditional search engines we are used to. A traditional search engine returns a list of links, and to answer your question you still have to click through to the pages and dig out the useful information yourself.
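To make the shape of such a result concrete, here is a minimal sketch of one structured result carrying its sources and a credibility score. The class and field names (`StructuredResult`, `Source`, `credibility`), the example URLs, and the score value are hypothetical; the source does not document Magi's real data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """A page from which a result was learned (hypothetical shape)."""
    url: str
    title: str


@dataclass
class StructuredResult:
    """One subject-predicate-object fact with provenance (assumed layout)."""
    subject: str
    predicate: str
    obj: str
    credibility: float  # assumed to lie in [0, 1]
    sources: List[Source] = field(default_factory=list)

    def summary(self) -> str:
        # Render the fact together with how many sources support it.
        return (f"{self.subject} | {self.predicate} | {self.obj} "
                f"(credibility {self.credibility:.2f}, "
                f"{len(self.sources)} source(s))")


# Illustrative values only; the URLs are placeholders.
result = StructuredResult(
    subject="Magi",
    predicate="developed by",
    obj="Peak Labs",
    credibility=0.9,
    sources=[
        Source("https://example.com/a", "Example article A"),
        Source("https://example.com/b", "Example article B"),
    ],
)
print(result.summary())
```

Keeping the sources attached to each fact, rather than only a final answer string, is what lets a reader or a downstream program trace where a claim came from.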
After the engine was released, so many users flocked to it that its servers went down. Magi's author responded on Weibo: “Suddenly a lot of people are paying attention to us, and we are really grateful. In fact, search engines are really not our main business, we did not do any promotion, and we were certainly not prepared for this terrifying amount of traffic. A single Magi search requires far more computation than an ordinary web search. Please go easy on us, and we apologize again!”
Source links are provided to the right of each answer, and by hovering the mouse over them you can see which specific source the answer was learned from.
Magi focuses on the essence of user search behavior and goes a step beyond traditional search engines: it “helps you think.” When a user wants to learn about something, a traditional search engine returns links ranked by result weight (PageRank), leaving the user to summarize the content and judge its credibility. Magi takes one more step: it not only indexes a large amount of text from the Internet, but also tries to understand and learn the knowledge and data contained in that text.
Ji Yichao says Magi is like a civilian version of IBM Watson or a non-academic version of Wolfram Alpha. Wolfram Alpha is an engine that can understand your questions, and its goal is to “compute everything.” According to its inventor, Stephen Wolfram, it is a computational knowledge engine, not a search engine like Baidu or Google. Put simply, it is a very advanced combination of graphing calculator, reference library, and search engine.
In addition to returning calculation results directly, Wolfram Alpha can also handle factual questions posed in natural language, such as:
If you enter “China GDP”, what appears is not a long list of web pages but intuitive data and charts, including China's latest GDP figures, a chart of China's GDP growth from 1970 to the present, and China's inflation and unemployment rates.
If you enter “How many people in China”, you can see data on the current total population, population density, average annual population growth rate, life expectancy and average age in China.
Ji Yichao, the founder of Magi and of the Chinese team Peak Labs, is also a well-known developer. In 2011, while studying at the high school affiliated with Peking University, he single-handedly developed the Mammoth browser for iOS. In 2012, Ji Yichao started his own company and continued to work on browser and input method projects. Peak Labs now concentrates on the Magi project, the technology behind it, and the development of related commercial products.
“What we are really commercializing is the technology behind Magi: open information extraction based on transfer learning.” The advantage of Magi's transfer-learning NLU approach is that an AI engine trained on general data can be adapted to specialized vertical domains. Magi is first pre-trained on Internet knowledge and its own data. A task in a specialized vertical domain then needs only a very small amount of manually labeled data to achieve results comparable to training on large-scale data.
I. Utilization and versatility
Magi no longer relies on preset rules or domains. It learns and understands text on the Internet without a specific question in mind, and it tries to find all the information (exhaustive) rather than picking only the single best candidate (most promising). Through a series of pre-training tasks, Magi de-emphasizes concepts tied to specific entities or domains and instead learns what information people are likely to care about in a piece of content. A dedicated feature representation, network model, set of training tasks, and system platform were designed for Magi (discussed below), and a great deal of effort went into gradually building proprietary training and pre-training data. Through lifelong learning with continuous aggregation and error correction, Magi provides a parsable, retrievable, and traceable knowledge system for human users and other artificial intelligence.
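The contrast between exhaustive extraction and picking only the most promising candidate can be shown with a toy example. The candidate triples, their scores, the threshold, and the function names below are all invented for illustration; they are not output from Magi or a description of its model.

```python
# Each candidate is ((subject, predicate, object), score).
# Scores are made-up values standing in for a model's confidence.
candidates = [
    (("Ji Yichao", "founder of", "Magi"), 0.9),
    (("Magi", "developed by", "Peak Labs"), 0.8),
    (("Magi", "is a", "browser"), 0.2),
]


def most_promising(cands):
    """Keep only the single highest-scoring candidate."""
    return [max(cands, key=lambda c: c[1])]


def exhaustive(cands, threshold=0.5):
    """Keep every candidate whose score clears the (assumed) threshold."""
    return [c for c in cands if c[1] >= threshold]


print("most promising:", [t for t, _ in most_promising(candidates)])
print("exhaustive:    ", [t for t, _ in exhaustive(candidates)])
```

In this sketch the single-best strategy discards a second valid fact from the same passage, while the exhaustive strategy keeps both and still filters out the low-scoring candidate.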
II. Coverage and timeliness
Magi works with our own web search engine to evaluate source quality. There is no whitelist of sources or domains. Instead, source quality is assessed by combining three quantitative criteria, Clarity, Credibility, and Catholicity (universality), which Magi uses to weigh the competing demands of scale and accuracy in knowledge engineering. Timeliness also matters: it is reflected in the timeline tracking of existing knowledge mentioned above, so batch updates are no longer triggered periodically, and the whole system keeps learning, aggregating, updating, and correcting errors online.
III. Plasticity and internationalization
There are no upstream NER or dependency parsing steps, which reduces the loss of information from the source text. A dedicated Attention network structure and several matching pre-training tasks were designed for Magi's extraction model. The technology stack is fully language-independent and supports low-resource and cross-lingual transfer.
The Magi website and Ji Yichao also acknowledge that shortcomings remain, such as ambiguity, engineering issues, and the trade-off between scale and accuracy. As for slow searches, Ji Yichao said on Weibo that a single Magi search requires far more computation than an ordinary web search. Magi's search results are not yet good enough, but that does not prevent it from pointing to a future direction for search engines: giving users trustworthy knowledge that has been learned and understood. Especially in the AI era, search results should be closer to what users actually need.
Today's mainstream search engines rely on machine crawling and web search based on hyperlink analysis. They combine crawlers with ranking algorithms and use keywords as the core of automatic retrieval, which lets them acquire massive amounts of information automatically and rank it by importance. As the gateway to information, search directly determines the quality of the information we obtain, and it also made the fortunes of the early Internet companies.
But the excessive commercialization of search engines has now provoked an aversion among users. Magi's advantage is that it removes commercial elements and filters out ads, making search results cleaner and more valuable and saving users time.
Ji Yichao said on Weibo: “Right now Magi is full of an engineer's simple heart. It does not want to annoy you with advertising, and it has no interest in your privacy.”