Leveraging Natural Supervision for Language Representation Learning and Generation: Introduction

1 Jun 2024


(1) Mingda Chen.


Written language is ubiquitous. Humans use writing to communicate ideas and store knowledge in their daily lives. These activities naturally produce traces of human intelligence, resulting in abundant, freely available textual data such as Wikipedia,[1] Reddit,[2] and Fandom.[3] These data often contain sophisticated knowledge expressed with complex language structures. Encyclopedias, for example, usually have dedicated structures that connect pieces of information scattered across different places for the convenience of readers (e.g., hyperlinks that point mentions of the same person or event in different documents to the same page for disambiguation). Aside from explicit structures, corpora also have rich implicit structures. For example, the two sides of a data pair in bilingual text share the same semantic meaning but differ in syntactic form. This implicit difference within each pair allows us to disentangle the semantic and syntactic information in the data.

Despite these rich structures, recent advances in NLP have been driven by deep neural models trained on massive amounts of plain text, a process that often strips the knowledge and structure from the input. This thesis researches approaches to better derive supervision from various naturally occurring textual resources. In particular, we (1) improve ways of transforming plain text into training signals; (2) propose approaches to exploit the rich structures in Wikipedia and paraphrases; and (3) create evaluation benchmarks from fan-contributed websites to reflect real-world challenges. Below we briefly introduce these three areas and summarize our contributions.

This paper is available on arXiv under a CC 4.0 license.

[1] https://www.wikipedia.org/, an online collaborative encyclopedia.

[2] https://www.reddit.com/, an online forum for discussion and web content rating.

[3] https://www.fandom.com/, a fan-contributed encyclopedia of films and other media.