NLP in Practice: A Web App Demo for Text Summarization in Chinese, English, and Japanese with the Google Pegasus Project

Cloud AI Evangelist
4 min read · Dec 24, 2020

# StrategySum Text Summarization System
An online text summarization inference service based on the Google Pegasus project model [1–2] has been deployed on Google Cloud using Tesla V4 GPUs. Feel free to try it out!
The online demo supports English, Chinese, and Japanese. Users only need to paste in their text; the system automatically detects the language and generates the summary (a minimal sketch of how such detection can work appears after the references below).
Web demo:
http://35.196.244.158
If you have any questions or need a model deployment plan, you can contact me at steven.zhang@spacexdeepsace.com
[1] https://github.com/google-research/pegasus
[2] https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html
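As a quick illustration of the language auto-detection step mentioned above, here is a minimal sketch using the open-source langdetect package. This is an assumption for demonstration only, not the demo's actual implementation:

```python
# Minimal sketch (assumption, not the demo's actual code): detect the input
# language so the request can be routed to the matching summarization model.
from langdetect import detect

def route_language(text: str) -> str:
    code = detect(text)  # returns codes like 'en', 'ja', 'zh-cn', 'zh-tw'
    if code.startswith("zh"):
        return "chinese"
    if code == "ja":
        return "japanese"
    return "english"  # anything else is treated as English here

print(route_language("嫦娥五号是中国探月工程三期的月球探测器。"))  # -> chinese
```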
Today, online users suffer from information overload: the vast amount of rapidly growing text on the web makes it difficult to read all the material one may need.
Automatic text summarization is a natural solution to this problem and one of the more advanced natural language applications. There are two types of summarization tasks: extractive and abstractive. Extractive summarization creates a summary by selecting a subset of the input text, aiming to maximize coverage of important content while minimizing redundancy. Abstractive summarization, in contrast, builds an abstract representation of the input and uses natural language generation techniques to produce the summary. Because an abstractive summary may contain expressions that never appear in the original text, it is more challenging to generate than an extractive one, but it comes closer to how a human would write a summary.

StrategySum is the cross-lingual abstractive summarization solution we propose. Its advantage is that it uses an end-to-end deep learning model based on Pegasus [1] to learn cross-lingual summary representations, thus supporting multilingual summary generation, and it combines strategies such as textual inference [2] to keep the generated summaries faithful to the original text.
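To make the abstractive side concrete, here is a minimal sketch of running inference with a publicly released Pegasus checkpoint through the Hugging Face transformers library. Note that google/pegasus-xsum is an English-only public example checkpoint, not the multilingual StrategySum model:

```python
# Minimal sketch: abstractive summarization with a public Pegasus checkpoint.
# 'google/pegasus-xsum' is an English-only example, not StrategySum itself.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = (
    "Chang'e-5 was launched on November 24, 2020, from the Wenchang Space "
    "Launch Complex in Hainan, and returned lunar samples to Earth on "
    "December 17, landing in Inner Mongolia."
)

# Tokenize, generate with beam search, and decode the summary.
batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch, num_beams=4, max_length=60)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```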
- In multilingual summarization, the training corpus is relatively abundant for some languages and very sparse for others. When training data is sparse, the summarization task for the language of interest becomes very difficult: there are not enough summary examples for the model to learn from, which makes it prone to severe overfitting. We propose using a language-type embedding as a hint that equips a single model to handle multiple languages. This lets the model take advantage of the larger training corpora of high-resource languages and learn cross-lingual summary representations that can be extracted from the text and decoded into the target language (see the first sketch after this list).
- The cross-lingual language model is trained with a Transformer backbone [3] and the GSG (Gap Sentences Generation) objective proposed by Pegasus as the pre-training scheme, while introducing the language embeddings described above.
- Abstractive summaries generated by generic summarization models often fabricate content that is untrue or contradicts the original text. To address this, both Google and Microsoft have proposed their own insights and evaluation criteria for summary faithfulness. If a generated summary is not faithful to the original text, the triadic (triple-based) approach proposed in [5] is used for correction; if it still fails, the abstractive summary is replaced by the extractive one to keep the summary maximally faithful to the source.
- The space complexity of the Transformer model is quadratic in sequence length, so the standard Transformer architecture struggles with very long texts. We propose a partition-extraction method: an extractive summary first reduces the length of the text and filters out noisy, unimportant content, and the abstractive summary is then generated from this shortened input. This keeps the system compatible with long texts while keeping the generated summaries grounded in the source (see the second sketch after this list).
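To illustrate the language-type embedding idea from the first bullet, here is a minimal PyTorch sketch. The dimensions and the additive combination with token embeddings are assumptions for demonstration; this is not the published StrategySum architecture:

```python
# Minimal sketch (assumption): adding a language-type embedding to token
# embeddings so one encoder-decoder can serve several languages.
import torch
import torch.nn as nn

class MultilingualEmbedding(nn.Module):
    def __init__(self, vocab_size=96000, d_model=1024, num_languages=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_languages, d_model)  # en/zh/ja

    def forward(self, token_ids, lang_id):
        # Broadcast one language vector across every position in the batch,
        # giving the model a constant hint about which language it is reading.
        tokens = self.token_emb(token_ids)          # (batch, seq, d_model)
        lang = self.lang_emb(lang_id).unsqueeze(1)  # (batch, 1, d_model)
        return tokens + lang

emb = MultilingualEmbedding()
ids = torch.randint(0, 96000, (2, 16))  # two dummy token sequences
lang = torch.tensor([0, 2])             # e.g. 0=English, 2=Japanese
print(emb(ids, lang).shape)             # torch.Size([2, 16, 1024])
```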
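And here is a runnable sketch of the overall control flow from the last two bullets: shorten long inputs extractively, generate an abstractive summary, and fall back when the faithfulness check fails. Every helper below is a deliberately naive stand-in so the sketch runs; in the real system these would be the neural extractive model, the Pegasus-based generator, and the textual-inference checker:

```python
# Hypothetical sketch of the partition-extract-then-abstract pipeline with a
# faithfulness fallback; all helpers are naive stand-ins, not the real models.

MAX_WORDS = 512  # assumed input limit of the abstractive model

def extractive_summary(doc: str, max_words: int) -> str:
    # Naive stand-in: keep leading sentences until the word budget is spent.
    out, count = [], 0
    for sent in doc.split(". "):
        if count + len(sent.split()) > max_words:
            break
        out.append(sent)
        count += len(sent.split())
    return ". ".join(out)

def abstractive_summary(doc: str) -> str:
    # Stand-in for the Pegasus-based generator.
    return doc.split(". ")[0]

def is_faithful(summary: str, doc: str) -> bool:
    # Stand-in for the textual-inference faithfulness check.
    return all(word in doc for word in summary.split())

def summarize(doc: str) -> str:
    if len(doc.split()) > MAX_WORDS:     # long text: shorten extractively first
        doc = extractive_summary(doc, MAX_WORDS)
    summary = abstractive_summary(doc)   # abstractive pass
    if not is_faithful(summary, doc):
        # The real system first attempts triple-based correction [5] here;
        # this sketch falls straight back to the extractive summary.
        summary = extractive_summary(doc, 64)
    return summary

print(summarize("Chang'e-5 was launched on November 24, 2020. "
                "It returned lunar samples to Earth on December 17."))
```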
## Summarization task test
> Chang’e-5 is the lunar probe of the third phase of China’s lunar exploration project. It carried out the first extraterrestrial sample-return mission of the People’s Republic of China, completing the final “return” step of the project’s “orbit, land, and return” plan. Chang’e-5 was launched at 4:30 a.m. on November 24, 2020, from the Wenchang Space Launch Complex in Hainan; it automatically sampled on the lunar surface, took off from the Moon, and returned to Earth, landing in Inner Mongolia on December 17.
>
> … (the full input runs to about 6,000 words)
__Abstractive summary:__
Chang’e-5 was launched on November 24, 2020, from the Wenchang Space Launch Complex in Hainan. It is part of the third phase of China’s lunar exploration project. It has achieved four “firsts” since China started its space activities: the first automatic sampling on the lunar surface; the first takeoff from the lunar surface. The mission was completed with an unmanned lunar surface sample return.
