Leveraging Natural Supervision for Language Representation Learning and Generation: Contributions

1 Jun 2024


(1) Mingda Chen.

1.2 Contributions

In summary, this thesis makes the following contributions:

• By carefully designing self-supervised objectives, we improve the quality of pretrained language models and their ability to generalize across tasks. In Section 3.1, we replace the next sentence prediction loss with a novel sentence ordering prediction loss in language model pretraining and show that the change leads to a series of state-of-the-art pretrained encoders. In Section 3.2, in contrast to previous work, which finetuned pretrained decoders on human-annotated datasets, we show that properly designed self-supervised tasks can lead to similar gains in the in-context few-shot learning setting, improving models’ cross-task generalization.

• We design model architectures and training objectives to exploit the rich structures in Wikipedia articles. In Section 4.1, we leverage hyperlinks as supervision for pretraining entity representations, leading to models that can encode arbitrary entities. In Section 4.2, we use article structures, such as section and document titles, to train sentence representations. Evaluation results on discourse-related tasks show that such training helps model performance. In Section 4.3, we extract training data from article category graphs and demonstrate that the extracted data improves model performance on textual entailment tasks. These results reveal the advantages of structure-aware model pretraining.

• We leverage the pair data structure in paraphrases and bilingual text to disentangle semantics and syntax in sentence representations, which allows us to learn interpretable and controllable neural models. In Section 5.1, we build the first neural models to disentangle semantics and syntax in sentence representations. The models use the fact that for a paraphrase pair, the semantics is shared, but the syntax varies. In addition to semantic evaluation metrics, we propose evaluation metrics for syntactic representations, finding that the best performance for both metrics is achieved when there is maximal disentanglement between the two latent representations. In Section 5.2, we adapt this framework for controlled paraphrasing, where we seek to control the output text with a syntactic, sentential exemplar. To formally define this controlled generation task, we annotate evaluation sets and propose evaluation metrics. In a later work, we extend this framework and task setting to machine translation (Chen et al., 2020b), suggesting that this idea could generalize to arbitrary data with the pair data structure.

• We demonstrate that we can create challenging benchmark datasets for various long-form text generation tasks by tailoring fan-contributed textual resources. We do so by defining new NLP tasks and studying these new tasks through extensive experiments. In Section 6.1, we generate arbitrary Wikipedia section text from various tabular data by casting the task as long-form data-to-text generation and creating a large-scale dataset. The task is challenging as models need to generate a coherent passage connecting all the entities in the tabular data, and the passage also needs to fit the background knowledge in the tabular data. In Section 6.2, we summarize lengthy transcripts for TV shows. The task has several challenges: e.g., plot information is not stated explicitly but only implied in the dialogue, and models need to draw information from a wide range of the input transcripts. As characters are fundamental to TV show plots, we also propose two character-centric evaluation metrics. In Section 6.3, we generate long-form stories from character descriptions and summaries. The task poses several challenges for story generation models, including lengthy inputs and outputs and consistency in character modeling.
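
As an illustration of the sentence ordering prediction objective mentioned in the first contribution, the following minimal sketch shows one way such training pairs could be built. It assumes that a positive example is two consecutive sentences in their original order and a negative example is the same two sentences swapped. The function name `make_sop_examples`, the 50/50 sampling, and the toy document are hypothetical choices for illustration, not details taken from the thesis.

```python
import random
from typing import List, Tuple


def make_sop_examples(
    sentences: List[str], rng: random.Random
) -> List[Tuple[str, str, int]]:
    """Build (segment_a, segment_b, label) triples from consecutive sentences.

    label 1: the two segments appear in their original order.
    label 0: the same two segments, swapped.
    (Illustrative assumption; the thesis may construct examples differently.)
    """
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))  # keep original order
        else:
            examples.append((second, first, 0))  # swap to form a negative
    return examples


# Toy document used only for demonstration.
doc = [
    "The model is pretrained on unlabeled text.",
    "It is then finetuned on a downstream task.",
    "Finally, it is evaluated on held-out data.",
]
examples = make_sop_examples(doc, random.Random(0))
for seg_a, seg_b, label in examples:
    print(label, "|", seg_a, "|", seg_b)
```

Because negatives reuse the same two sentences as positives, a classifier trained on these pairs cannot rely on topic differences between segments and has to attend to their order.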

This paper is available on arXiv under a CC 4.0 license.