Leveraging Natural Supervision for Language Representation Learning and Generation: Abstract

1 Jun 2024


(1) Mingda Chen.


Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on massive amounts of plain text. While these models are powerful, how best to derive supervision from textual resources is still an open question. For example, language model pretraining often neglects the rich, freely available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally occurring supervision.

We first investigate self-supervised training losses that enhance the performance of pretrained language models on various NLP tasks. Specifically, for general-purpose language representation learning, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve, and we show that the change leads to a series of state-of-the-art pretrained encoders. For in-context learning, in contrast to previous work that finetuned pretrained decoders on human-annotated datasets, we design an intermediate finetuning step that uses self-supervised training to improve models' ability to generalize across tasks.

We then describe methods that leverage the structures in Wikipedia and paraphrases. In particular, we use hyperlinks as supervision for pretraining entity representations, leading to models that can encode arbitrary entities. We use article structures, such as section and document titles, to train sentence representations, and evaluation results on discourse-related tasks show that such training improves model performance. We extract training data from article category graphs and demonstrate that the extracted data improves model performance on textual entailment tasks. We also leverage the paired structure of paraphrase data by building the first neural models to disentangle semantics and syntax in sentence representations. In addition to semantic evaluation metrics, we propose metrics for syntactic representations, finding that the best performance on both kinds of metrics is achieved when there is maximal disentanglement between the two latent representations. We extend the framework to a novel generation task that controls the syntax of output text with a sentential exemplar. To formally define this controlled generation task, we annotate evaluation sets and propose evaluation metrics.

Lastly, we discuss our work on tailoring textual resources to establish challenging evaluation tasks. We introduce three datasets by defining novel tasks using various fan-contributed websites. The first dataset targets the generation of arbitrary Wikipedia section text from various tabular data; we cast the task as long-form data-to-text generation and create a large-scale dataset. The task is challenging because models need to generate a coherent passage connecting all the entities in the tabular data, and the story also needs to fit the background knowledge in the tabular data. The second dataset targets the summarization of lengthy transcripts for TV shows. The task poses several challenges: for example, plot information is not stated explicitly but only implied in the dialogue, and models need to draw information from a wide range of the input transcript. As characters are fundamental to TV show plots, we also propose two character-centric evaluation metrics. The third dataset targets the generation of long-form stories from character descriptions and summaries. The task poses several challenges for story generation models, including lengthy inputs and outputs and consistency in character modeling.

This paper is available on arXiv under a CC 4.0 license.