We propose the Detailed Outline Control (π³πΎπ²) framework for improving long-range plot coherence when automatically generating several-thousand-word-long stories. π³πΎπ² consists of two complementary components: a detailed outliner and a detailed controller. The detailed outliner creates a more detailed, hierarchically structured outline, shifting creative burden from the main drafting procedure to the planning stage. The detailed controller ensures the more detailed outline is still respected during generation by controlling story passages to align with outline details. In human evaluations of automatically generated stories, π³πΎπ² substantially outperforms a strong Re3 baseline (Yang et al., 2022) on plot coherence (22.5% absolute gain), outline relevance (28.2%), and interestingness (20.7%). Humans also judged π³πΎπ² to be much more controllable in an interactive generation setting.
Detailed planning instead of just a rough three-sentence outline (beginning, middle, end).
The detailed outliner recursively expands outline items in breadth-first order. To create each new entry, it proposes candidate events, selects the best via filtering and reranking, and then detects the setting and relevant characters.
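As a rough illustration of this propose-filter-rerank loop, here is a minimal Python sketch; every helper (`propose_candidates`, `passes_filters`, `rerank_score`, `detect_entities`) is a hypothetical stub standing in for the prompted LLM calls and trained rerankers of the actual pipeline.

```python
# Illustrative sketch only: stub functions replace the prompted LLM calls
# and trained rerankers used in the actual DOC outliner.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class OutlineItem:
    event: str
    setting: str = ""
    characters: list = field(default_factory=list)
    children: list = field(default_factory=list)

def propose_candidates(node, n):     # stand-in for a prompted LLM call
    return [f"{node.event} / sub-event {i}" for i in range(n)]

def passes_filters(parent, event):   # stand-in coherence/relevance filter
    return True

def rerank_score(event):             # stand-in reranker score
    return len(event)

def detect_entities(event):          # stand-in setting/character detection
    return "unknown setting", []

def expand_outline(root, max_depth, n_candidates=8, children_per_node=3):
    """Expand outline items level by level (breadth-first)."""
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        # Propose candidate events, drop bad ones, keep the top-ranked few.
        candidates = [c for c in propose_candidates(node, n_candidates)
                      if passes_filters(node, c)]
        for event in sorted(candidates, key=rerank_score,
                            reverse=True)[:children_per_node]:
            child = OutlineItem(event=event)
            # Attach grounding metadata used later by the drafting stage.
            child.setting, child.characters = detect_entities(event)
            node.children.append(child)
            queue.append((child, depth + 1))
    return root
```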
Token-by-token control throughout generation instead of relying only on an initial prompt or post-hoc rejection sampling.
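One common way to implement this kind of token-level control is to re-weight the LM's top next-token candidates by a relevance classifier's scores. The sketch below shows that pattern with stand-in callables (`lm_logits`, `controller_logprob`); it is not the paper's exact controller.

```python
# Sketch of classifier-guided decoding: re-weight the LM's top-k next-token
# candidates by how relevant each continuation is to the current outline item.
# `lm_logits` and `controller_logprob` are stand-ins, not DOC's trained models.
import torch

def controlled_decode(prefix_ids, outline_item, lm_logits, controller_logprob,
                      steps=50, top_k=20):
    ids = list(prefix_ids)
    for _ in range(steps):
        logits = lm_logits(ids)          # [vocab]-sized base LM scores
        topk = torch.topk(logits, top_k)
        adjusted = topk.values.clone()
        for j, tok in enumerate(topk.indices.tolist()):
            # Boost tokens whose continuation the controller judges relevant.
            adjusted[j] += controller_logprob(ids + [tok], outline_item)
        probs = torch.softmax(adjusted, dim=-1)
        ids.append(topk.indices[torch.multinomial(probs, 1).item()].item())
    return ids

# Toy usage with random stand-ins:
toy_lm = lambda ids: torch.randn(100)    # fake 100-token vocabulary
toy_ctrl = lambda ids, item: 0.0         # fake relevance classifier
sample = controlled_decode([1, 2, 3], "the heist goes wrong",
                           toy_lm, toy_ctrl, steps=5)
```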
Pairwise comparisons between systems on 1000- to 1500-word passages.
π³πΎπ² stories are rated substantially more plot-coherent, outline-relevant, and interesting than those from RE3 and ROLLING-OPT.
The human provides an initial premise, from which both RE3 and π³πΎπ² generate an outline. The human then edits the generated outline, and the final outline is used by both systems to generate the full stories.
Humans judged faithfulness to authorial intent, control over generation, system intuitiveness, and story quality. π³πΎπ² is preferred by a wide margin on all metrics.
@inproceedings{yang-etal-2023-DOC,
title = "DOC: Improving Long Story Coherence With Detailed Outline Control",
author = "Yang, Kevin and
Klein, Dan and
Peng, Nanyun and
Tian, Yuandong",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.190",
doi = "10.18653/v1/2023.acl-long.190",
pages = "3378--3465",
}
While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address such issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets πΏππ-πΌπΏππ and πΏππ-π³πΎπ² for personalized story evaluation, by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model πΏππππ΄ to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, πΏππππ΄ predicts either a detailed review or fine-grained comparison in several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that πΏππππ΄ outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy.
We create two new datasets πΏππ-πΌπΏππ and πΏππ-π³πΎπ² for personalized story evaluation, by re-purposing existing datasets with proper anonymization and implicit personalized labels.
We propose πΏππππ΄, which takes a reviewer's prior reviews and reasons about that reviewer's specific preferences. Based on this, it can give a personalized review and score for an individual story, or conduct a personalized fine-grained comparison between two stories.
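Concretely, the "few exemplary reviews from one reviewer, then judge a new story" setup can be thought of as prompt assembly like the sketch below; the instruction wording and fields are illustrative, not PERSE's actual prompt or training format.

```python
# Sketch of assembling a few-shot, reviewer-conditioned evaluation prompt.
# Wording and field layout are illustrative placeholders.
def build_personalized_prompt(exemplar_reviews, new_story,
                              aspects=("interestingness", "surprise")):
    parts = ["You are judging stories for one specific reviewer.",
             "Infer their preferences from these past reviews:", ""]
    for i, (story, review) in enumerate(exemplar_reviews, 1):
        parts += [f"### Example {i}", f"Story: {story}",
                  f"Review: {review}", ""]
    parts += ["### New story", new_story, "",
              "Write the review this reviewer would give, then rate "
              + ", ".join(aspects) + " from 1 to 5."]
    return "\n".join(parts)

prompt = build_personalized_prompt(
    [("A quiet heist story...", "Too slow; I wanted more tension.")],
    "A detective wakes with no memory of the case...")
```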
Highest correlation with human ratings on individual story scoring. Best accuracy for pairwise comparison on five aspects of story quality.
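For reference, the two reported metrics can be computed as in this toy sketch (all numbers are made up):

```python
# Kendall correlation for individual story scores; accuracy for pairwise picks.
from scipy.stats import kendalltau

human_scores = [5, 3, 4, 2, 1]
model_scores = [4, 3, 5, 1, 2]
tau, _ = kendalltau(human_scores, model_scores)  # rank agreement in [-1, 1]

predicted = ["A", "B", "A", "A"]  # which of each story pair the model prefers
gold      = ["A", "B", "B", "A"]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(f"Kendall tau = {tau:.2f}, pairwise accuracy = {accuracy:.2f}")
```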
@article{wang2023learning,
title={Learning Personalized Story Evaluation},
author={Wang, Danqing and Yang, Kevin and Zhu, Hanlin and Yang, Xiaomeng and Cohen, Andrew and Li, Lei and Tian, Yuandong},
journal={arXiv preprint arXiv:2310.03304},
year={2023}
}
Story plots, while short, carry most of the essential information of a full story that may contain tens of thousands of words. We study the problem of automatic generation of story plots, which includes story premise, character descriptions, plot outlines, etc. To generate a single engaging plot, existing plot generators (e.g., π³πΎπ² (Yang et al., 2022a)) require hundreds to thousands of calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot, which is costly and takes at least several minutes. Moreover, the hard-wired nature of the method makes the pipeline non-differentiable, blocking fast specialization and personalization of the plot generator. In this paper, we propose three models, πΎππππΏπππ, π΄πΈπ΄πΏπππ and ππ»πΏπππ, to address these challenges. πΎππππΏπππ replaces expensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls via careful prompt designs, which leads to inexpensive generation of high-quality training datasets of story plots. We then train an end-to-end story plot generator, π΄πΈπ΄πΏπππ, by supervised fine-tuning (SFT) using approximately 13000 story plots generated by πΎππππΏπππ. π΄πΈπ΄πΏπππ generates story plots of comparable quality to πΎππππΏπππ, and is > 10Γ faster (1k tokens in only 30 seconds on average). Finally, we obtain ππ»πΏπππ that is further fine-tuned with RLHF on several different reward models for different aspects of story quality, which yields 60.0% winning rate against π΄πΈπ΄πΏπππ along the aspect of suspense and surprise.
We first rebuilt our previous π³πΎπ² pipeline with LLaMA-2, removing the rate limits and potential silent model updates of the OpenAI APIs. The rebuilt pipeline is named πΎππππΏπππ. With a large batch of high-quality story plots generated by πΎππππΏπππ, we fine-tune LLaMA-2-7B-chat to obtain an end-to-end story plot generator, π΄πΈπ΄πΏπππ.
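A minimal sketch of the SFT step using standard Hugging Face APIs follows; the data file, field names, and hyperparameters are placeholders, not the paper's actual settings.

```python
# Supervised fine-tuning sketch: assumes a JSONL file of generated plots with
# illustrative "premise" and "plot" fields; hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tok.pad_token = tok.eos_token  # LLaMA-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def format_example(ex):
    # One training string per plot; the collator below sets the LM labels.
    return tok(f"Premise: {ex['premise']}\nPlot: {ex['plot']}{tok.eos_token}",
               truncation=True, max_length=2048)

ds = load_dataset("json", data_files="openplot_plots.jsonl")["train"]
ds = ds.map(format_example, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="e2eplot-sft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=2, learning_rate=2e-5, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```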
ππ»πΏπππ: We train a reward model for each aspect of story quality and fine-tune with RLHF, yielding a 60.0% winning rate against π΄πΈπ΄πΏπππ on suspense and surprise!
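Since a separate reward model is trained per aspect, the RLHF reward can be a weighted combination of their scores. The toy sketch below shows that aggregation with stand-in scorers; the aspect names, weights, and scorers are illustrative only.

```python
# Toy sketch: combine per-aspect reward models into one scalar RLHF reward.
def combined_reward(plot_text, reward_models, weights):
    """reward_models maps aspect -> callable(text) -> float score."""
    return sum(weights[a] * rm(plot_text) for a, rm in reward_models.items())

reward_models = {
    "suspense": lambda text: float(len(text) % 7),    # stand-in scorer
    "surprise": lambda text: float(text.count("!")),  # stand-in scorer
}
weights = {"suspense": 0.5, "surprise": 0.5}
r = combined_reward("The twist! No one saw it coming.", reward_models, weights)
```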
@article{zhu2023end,
title={End-to-end Story Plot Generator},
author={Zhu, Hanlin and Cohen, Andrew and Wang, Danqing and Yang, Kevin and Yang, Xiaomeng and Jiao, Jiantao and Tian, Yuandong},
journal={arXiv preprint arXiv:2310.08796},
year={2023}
}