Leveraging Natural Supervision for Language Representation Learning and Generation: Conclusion

1 Jun 2024


(1) Mingda Chen.


This thesis has made the following contributions:

• We improved self-supervised training objectives for large-scale pretrained language models in Chapter 3. In Section 3.1, we replaced the next sentence prediction loss with a novel sentence ordering prediction loss in language model pretraining and showed that the change led to a series of state-of-the-art pretrained encoders. In Section 3.2, in contrast to previous work, which finetuned pretrained decoders on human-annotated datasets, we showed that self-supervised tasks with proper designs could also lead to similar gains in the in-context few-shot learning setting, promoting models’ ability in cross-task generalization.

• We converted a range of naturally occurring data structures on Wikipedia into supervision for several NLP tasks in Chapter 4. In Section 4.1, we leveraged hyperlinks as supervision for pretraining entity representations, leading to models that can encode arbitrary entities. In Section 4.2, we used article structures, such as section and document titles, to train sentence representations. Evaluation results on discourse-related tasks showed that such training helped model performance. In Section 4.3, we extracted training data from article category graphs and demonstrated that the extracted data improved model performance on textual entailment tasks. These results revealed the advantages of structure-aware model pretraining.

• We defined novel tasks that disentangled semantics and syntax and tackled the tasks by designing training objectives and neural architectures in Chapter 5. In Section 5.1, we built the first neural models to disentangle semantics and syntax in sentence representations. The models exploit the fact that, for a paraphrase pair, the semantics is shared but the syntax varies. In addition to semantic evaluation metrics, we proposed evaluation metrics for syntactic representations, finding that the best performance on both kinds of metrics is achieved when there is maximal disentanglement between the two latent representations. In Section 5.2, we adapted this framework for controlled paraphrasing, where we sought to control the output text with a syntactic, sentential exemplar. To formally define this controlled generation task, we annotated evaluation sets and proposed evaluation metrics. In later work, we extended this framework and task setting to machine translation (Chen et al., 2020b), suggesting that the idea could generalize to arbitrary data with a paired structure.

• In Chapter 6, we built challenging datasets from fan-contributed websites. We also proposed evaluation metrics and possible solutions and conducted thorough experiments to characterize the new challenges. In Section 6.1, we generated arbitrary Wikipedia section text from various tabular data by casting the task as long-form data-to-text generation and creating a large-scale dataset. The task is challenging because models need to generate a coherent passage connecting all the entities in the tabular data, and the passage also needs to fit the background knowledge in the tabular data. In Section 6.2, we summarized lengthy transcripts for TV shows. The task has several challenges: for example, plot information is not stated explicitly but only implied in the dialogue, and models need to draw information from a wide range of the input transcript. As characters are fundamental to TV show plots, we also proposed two character-centric evaluation metrics. In Section 6.3, we generated long-form stories from character descriptions and summaries. The task poses several challenges for story generation models, including lengthy inputs and outputs and consistency in character modeling.
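As a concrete illustration of the sentence ordering prediction objective summarized for Section 3.1 above, the following is a minimal sketch of how training pairs for such a loss might be constructed. It is not the implementation used in the thesis: the function name `make_sop_examples`, the 50/50 swap probability, and the simplification of using single sentences rather than longer text segments are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def make_sop_examples(
    sentences: List[str], rng: random.Random
) -> List[Tuple[str, str, int]]:
    """Build (first, second, label) triples from consecutive sentences.

    Hypothetical helper for illustration only.
    label 1: the pair is in its original document order.
    label 0: the same two sentences with their order swapped.
    """
    examples = []
    # Pair each sentence with the one that follows it in the document.
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:  # assumed 50/50 split between the two classes
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


if __name__ == "__main__":
    doc = [
        "The model is pretrained on unlabeled text.",
        "It is then finetuned on a downstream task.",
        "Finally, it is evaluated on held-out data.",
    ]
    for first, second, label in make_sop_examples(doc, random.Random(0)):
        print(label, "|", first, "|", second)
```

Because both classes are built from the same two adjacent sentences, a classifier trained on these pairs cannot rely on topic differences and has to attend to ordering, which is the usual motivation for preferring this objective over next sentence prediction.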

Below we discuss several possible future directions.

• Disentangling Latent Factors. Chapter 5 introduced neural models for improving interpretability and controllability using implicit yet natural supervision from paraphrases and bilingual text. Future work could generalize this idea to any resources that are formed by data pairs, such as dialogues or summarization. Such resources could be used to disentangle the factors that are shared between the two sides of a pair from those that are not, such as intentions and personalized styles in dialogues, sentence-level fluency and document-level discourse in sentence modeling, or important events and irrelevant details in summarization.

Another possibility is to disentangle task supervision when a task can be decomposed into two sub-tasks. For example, cross-lingual summarization can be viewed as a combination of summarization and translation. Disentangling task supervision could help us improve models’ ability in cross-task generalization and tease out valuable intermediate supervision that is usually unavailable.

In general, disentangling latent factors is an appealing research direction because, although large pretrained models have yielded superhuman performance, researchers still lack an understanding of the behaviors of these models. Better interpretability could also help us improve the robustness and worst-case behavior of these models so that they can be better applied in user-facing applications. Outside this thesis, I have also completed work benefiting from interpretable latent variables, leading to efficient neural models (Chen and Gimpel, 2018) and effective semi-supervised learning (Chen et al., 2018a).

• Natural Supervision for Text Generation. Chapter 4 presented approaches that leverage various kinds of natural supervision for representation learning. For future work, it would be interesting to see whether the same ideas could be applied to text generation. In particular, future work may consider using hyperlinks to improve the entity tracking performance of text generation systems and using article structures to enhance the discourse coherence of the generated text.

• Unified Models for Various Language Supervision. While Chapter 4 described modeling choices that take different kinds of language knowledge (e.g., entities and discourse) into account, it is still unclear what the best design is for a unified model that can incorporate all of these learning signals. A unified model may show superior performance, since humans rely on multiple language properties simultaneously to solve tasks. In addition, future work may consider combining discourse, linking, and paraphrase objectives with BERT-like models, as well as other types of natural supervision, such as naturally occurring bold/italics/underlining annotations in web text and long-distance discourse cues like two paragraphs in two chapters, among others.

• Learning Commonsense Knowledge from Natural Supervision. Future work could also consider learning commonsense knowledge from naturally occurring data, for example, by learning domain-specific commonsense from dialogues in certain subreddits[1] (e.g., technical or social) or by distilling commonsense knowledge from existing pretrained models. Commonsense understanding is ubiquitous in language. Because humans often assume that such knowledge is well-known to anything they interact with, it is seldom explicitly described. For the same reason, humans tend to believe that any intelligent system should understand such knowledge. These two properties make commonsense knowledge both challenging to acquire and imperative to have. In practice, when models are deployed in real-life applications, such knowledge can also make them more reliable through improved language understanding.

• Text Generation with Rich Descriptions. Future work could explore text generation with rich, detailed descriptions of the world in which the task is situated. This direction is related to the work in Chapter 6, as the descriptions can be either tabular data about certain background knowledge (Section 6.1) or lengthy documents about fictional characters (Section 6.3). These detailed descriptions explicitly state knowledge that the generated text should follow, so they act like “controlled environments” that simulate the real world, offering opportunities to improve evaluation for text generation and enhance the faithfulness of neural models.

This paper is available on arXiv under a CC 4.0 license.

[1] Subreddits are communities on the social media website Reddit, each dedicated to a particular topic that people write about.