Long-form Story Generation and Evaluation

Kevin Yang, Hanlin Zhu, Justin Wong, Danqing Wang, Andrew Cohen, Xiaomeng Yang, Jiantao Jiao, Nanyun Peng, Dan Klein, Lei Li, Yuandong Tian

DOC: Improving Long Story Coherence with Detailed Outline Control

UC Berkeley, UCLA, Meta AI
ACL 2023

We propose the Detailed Outline Control (𝙳𝙾𝙲) framework for improving long-range plot coherence when automatically generating several-thousand-word-long stories. 𝙳𝙾𝙲 consists of two complementary components: a detailed outliner and a detailed controller. The detailed outliner creates a more detailed, hierarchically structured outline, shifting creative burden from the main drafting procedure to the planning stage. The detailed controller ensures the more detailed outline is still respected during generation by controlling story passages to align with outline details. In human evaluations of automatically generated stories, 𝙳𝙾𝙲 substantially outperforms a strong Re3 baseline (Yang et al., 2022) on plot coherence (22.5% absolute gain), outline relevance (28.2%), and interestingness (20.7%). Humans also judged 𝙳𝙾𝙲 to be much more controllable in an interactive generation setting.

More plot-coherent, outline-relevant, and interesting long-form stories!

With a better Outliner

Detailed planning instead of just a three-sentence draft (beginning, middle, end)

[Figure: the detailed outliner]

The detailed outliner recursively expands outline items in breadth-first order. To create each new entry, it proposes candidate events, selects the best via filtering and reranking, and then detects the setting and relevant characters.
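To make this concrete, here is a minimal Python sketch of breadth-first outline expansion; propose_candidates, rerank, detect_setting, and detect_characters are hypothetical stand-ins for 𝙳𝙾𝙲's prompted candidate generation, learned filtering/reranking, and detection components, not the paper's implementation.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class OutlineItem:
    text: str
    depth: int
    setting: str = ""
    characters: list = field(default_factory=list)
    children: list = field(default_factory=list)

def propose_candidates(item):
    # Stand-in: DOC prompts an LM for several candidate child events.
    return [f"{item.text} / sub-event {i}" for i in range(4)]

def rerank(candidates, parent):
    # Stand-in: DOC filters and reranks candidates (e.g., for relevance
    # and coherence); here we simply keep the first one.
    return candidates[0]

def detect_setting(text):
    return ""   # stand-in for DOC's setting detection

def detect_characters(text):
    return []   # stand-in for DOC's character detection

def expand_outline(root, max_depth=3, branching=3):
    """Expand the outline breadth-first, as the detailed outliner does."""
    queue = deque([root])
    while queue:
        item = queue.popleft()
        if item.depth >= max_depth:
            continue
        for _ in range(branching):
            best = rerank(propose_candidates(item), parent=item)
            child = OutlineItem(text=best, depth=item.depth + 1,
                                setting=detect_setting(best),
                                characters=detect_characters(best))
            item.children.append(child)
            queue.append(child)
    return root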

and better Controller

Token-by-token control throughout generation, instead of relying only on an initial prompt or post-hoc rejection sampling; a decoding-loop sketch follows the list below.

[Figure: the detailed controller]
  • Drafting With Detailed Control: Event + Setting + Character
  • Control Strength: initialized to 0 for each outline item and incremented with each subsequent drafting step
  • Future Context in Generation: include the next outline item as future context in the prompt
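As a rough illustration, the sketch below combines these three mechanisms in one decoding loop, assuming a HuggingFace-style causal LM; controller_bias is a hypothetical stand-in for 𝙳𝙾𝙲's learned controller, and the prompt wording is illustrative rather than the paper's.

import torch

def controller_bias(ids, outline_item, vocab_size):
    # Stand-in for DOC's learned controller, which scores how well each
    # candidate next token keeps the passage aligned with the outline item.
    return torch.zeros(vocab_size)

def draft_step(model, tokenizer, context, outline_item, next_item,
               control_strength, max_new_tokens=64):
    # Future context: the prompt includes the NEXT outline item as well.
    prompt = (f"{context}\nCurrent event: {outline_item}\n"
              f"Upcoming event: {next_item}\nContinue the story:")
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits[0, -1]
        # Token-by-token control: add the controller's bias, scaled by the
        # current control strength, before sampling the next token.
        logits = logits + control_strength * controller_bias(
            ids, outline_item, logits.shape[-1])
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Control strength starts at 0 for each outline item and grows with each
# subsequent drafting step, tightening adherence over time:
# for step in range(num_drafting_steps):
#     story = draft_step(model, tokenizer, story, item, next_item,
#                        control_strength=step * strength_increment)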

πŸ“ Get better Stories

Pairwise comparisons between systems on 1000- to 1500-word passages; a minimal tallying sketch follows the metric list below.

  1. Coherent: Percentage of passages judged plot-coherent by human annotators.
  2. Relevant: Percentage judged faithful to the corresponding outline item.
  3. Interesting: Percentage judged interesting.
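A minimal tallying sketch, assuming one boolean judgment per passage per metric (the annotation format here is hypothetical):

def pct_judged(labels):
    # labels: one boolean per passage from human annotators,
    # e.g. True if the passage was judged plot-coherent.
    return 100.0 * sum(labels) / len(labels)

print(pct_judged([True, True, False, True]))  # 75.0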
[Figure: pairwise comparison results against RE3 and ROLLING-OPT]

𝙳𝙾𝙲 stories are rated substantially more plot-coherent, outline-relevant, and interesting than RE3 and ROLLING-OPT stories.

πŸ“ and better interactive experience

The human provides an initial premise, from which RE3 and 𝙳𝙾𝙲 each generate an outline. The human is then asked to edit the generated outline, and the final outline is used by each system to generate the full story.
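A minimal sketch of this protocol; generate_outline and generate_story are hypothetical wrappers around each system.

def human_edit(outline):
    # Stand-in for the human editing step (a UI in the actual study).
    print("Current outline:\n" + outline)
    edited = input("Enter an edited outline (blank to keep it): ")
    return edited or outline

def interactive_session(system, premise):
    outline = system.generate_outline(premise)  # RE3 or DOC drafts an outline
    outline = human_edit(outline)               # the human revises it
    return system.generate_story(premise, outline)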

  1. Intent: Which passage better followed their original intent.
  2. Control: Which workflow they felt gave them more control.
  3. Intuition: Which system was more helpful or intuitive.
  4. Quality: Which system they would choose to write another story, if prioritizing quality.
[Figure: interactive evaluation results]

Humans judged faithfulness to authorial intent, control over generation, system intuitiveness, and story quality. 𝙳𝙾𝙲 is preferred by a wide margin on all metrics.

BibTeX

@inproceedings{yang-etal-2023-DOC,
    title = "DOC: Improving Long Story Coherence With Detailed Outline Control",
    author = "Yang, Kevin  and
      Klein, Dan  and
      Peng, Nanyun  and
      Tian, Yuandong",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.190",
    doi = "10.18653/v1/2023.acl-long.190",
    pages = "3378--3465",
}
  

Learning Personalized Story Evaluation

UC Santa Barbara, Meta AI, UC Berkeley, CMU

While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address such issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets π™ΏπšŽπš›-π™Όπ™Ώπš‚πšƒ and π™ΏπšŽπš›-𝙳𝙾𝙲 for personalized story evaluation, by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model π™ΏπšŽπš›πš‚π™΄ to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, π™ΏπšŽπš›πš‚π™΄ predicts either a detailed review or fine-grained comparison in several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that π™ΏπšŽπš›πš‚π™΄ outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy.

Create Personalized Story Evaluation Dataset

Two challenges for current personalized story evaluation
  • Contamination: Is the LLM-based evaluator really a good judge, or is it just memorizing what it has seen?
  • How to collect personalization labels: How can we capture personal preferences? When explicit labels are hard to obtain, can we instead use implicit labels that reveal a reviewer's preferences?
[Figure: personalized story evaluation dataset construction]

We create two new datasets π™ΏπšŽπš›-π™Όπ™Ώπš‚πšƒ and π™ΏπšŽπš›-𝙳𝙾𝙲 for personalized story evaluation, by re-purposing existing datasets with proper anonymization and implicit personalized labels.
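A minimal sketch (the field names are assumptions, not the datasets' schema) of how a reviewer's history can be repurposed: a few of their reviews become in-context exemplars, a held-out review supplies the implicit personalized label, and names are anonymized.

import re
from collections import defaultdict

def anonymize(text, names):
    # Replace author/reviewer names with neutral placeholders.
    for i, name in enumerate(names):
        text = re.sub(rf"\b{re.escape(name)}\b", f"PERSON_{i}", text)
    return text

def build_examples(reviews, k_context=3):
    # reviews: dicts like {"reviewer": ..., "story": ..., "review": ..., "score": ...}
    by_reviewer = defaultdict(list)
    for r in reviews:
        by_reviewer[r["reviewer"]].append(r)
    examples = []
    for items in by_reviewer.values():
        if len(items) <= k_context:
            continue
        context, target = items[:k_context], items[k_context]
        examples.append({
            "context": context,              # exemplar reviews for this reviewer
            "target_story": target["story"],
            "label": target["score"],        # implicit personalized label
        })
    return examples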


Personalized Story Evaluation Model (π™ΏπšŽπš›πš‚π™΄)

Can a general-purpose LLM-based evaluator be personalized, and how?
[Figure: π™ΏπšŽπš›πš‚π™΄ overview]

We propose π™ΏπšŽπš›πš‚π™΄, which takes reviewer's prior reviewers and reason the specific preference. Based on this, it can give a personalized review and score for an individual story, or conduct personalized fine-grained comparsion between two stories.

π™ΏπšŽπš›πš‚π™΄ significantly outperforms all baselines, including GPT-4
[Figure: correlation and pairwise-accuracy results]

π™ΏπšŽπš›πš‚π™΄ achieves the highest correlation with human ratings on individual story scoring, and the best accuracy for pairwise comparisons across five aspects of story quality.
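The headline metric is Kendall correlation between predicted and human story ratings; a quick check with SciPy (toy scores, not the paper's data):

from scipy.stats import kendalltau

human_scores = [5, 3, 4, 1, 2]
model_scores = [4, 3, 5, 2, 1]
tau, p = kendalltau(human_scores, model_scores)
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")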

BibTeX

@article{wang2023learning,
    title={Learning Personalized Story Evaluation},
    author={Wang, Danqing and Yang, Kevin and Zhu, Hanlin and Yang, Xiaomeng and Cohen, Andrew and Li, Lei and Tian, Yuandong},
    journal={arXiv preprint arXiv:2310.03304},
    year={2023}
}
  

End-to-end Story Plot Generator

UC Berkeley, Meta AI, UC Santa Barbara
Equal contribution

Story plots, while short, carry most of the essential information of a full story that may contain tens of thousands of words. We study the problem of automatic generation of story plots, which includes story premise, character descriptions, plot outlines, etc. To generate a single engaging plot, existing plot generators (e.g., 𝙳𝙾𝙲 (Yang et al., 2022a)) require hundreds to thousands of calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot, which is costly and takes at least several minutes. Moreover, the hard-wired nature of the method makes the pipeline non-differentiable, blocking fast specialization and personalization of the plot generator. In this paper, we propose three models, π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, π™΄πŸΈπ™΄π™Ώπš•πš˜πš and πšπ™»π™Ώπš•πš˜πš, to address these challenges. π™Ύπš™πšŽπš—π™Ώπš•πš˜πš replaces expensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls via careful prompt designs, which leads to inexpensive generation of high-quality training datasets of story plots. We then train an end-to-end story plot generator, π™΄πŸΈπ™΄π™Ώπš•πš˜πš, by supervised fine-tuning (SFT) using approximately 13000 story plots generated by π™Ύπš™πšŽπš—π™Ώπš•πš˜πš. π™΄πŸΈπ™΄π™Ώπš•πš˜πš generates story plots of comparable quality to π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, and is > 10× faster (1k tokens in only 30 seconds on average). Finally, we obtain πšπ™»π™Ώπš•πš˜πš that is further fine-tuned with RLHF on several different reward models for different aspects of story quality, which yields 60.0% winning rate against π™΄πŸΈπ™΄π™Ώπš•πš˜πš along the aspect of suspense and surprise.

Tired of waiting so long for just one story?

π™΄πŸΈπ™΄π™Ώπš•πš˜πš: Generate a high-quality story plot with one click in less than 30 seconds!

We first rebuilt our previous 𝙳𝙾𝙲 pipeline with LLaMA-2, removing the dependence on OpenAI APIs with their rate limits and potential model updates. The rebuilt pipeline is named π™Ύπš™πšŽπš—π™Ώπš•πš˜πš. Using a large batch of high-quality story plots generated by π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, we fine-tune LLaMA-2-7B-chat to obtain an end-to-end story plot generator, π™΄πŸΈπ™΄π™Ώπš•πš˜πš.
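A minimal SFT sketch under assumed data format and hyperparameters (not the paper's exact training setup), using plain PyTorch and HuggingFace transformers:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical format for the ~13k OpenPlot-generated training plots.
plots_dataset = [{"premise": "A lighthouse keeper finds a strange map.",
                  "plot": "..."}]

def collate(batch):
    texts = [f"Premise: {ex['premise']}\nPlot: {ex['plot']}" for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()           # causal-LM objective
    enc["labels"][enc["attention_mask"] == 0] = -100   # ignore padding in loss
    return enc

loader = DataLoader(plots_dataset, batch_size=4, shuffle=True, collate_fn=collate)
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()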

[Figure: π™Ύπš™πšŽπš—π™Ώπš•πš˜πš and π™΄πŸΈπ™΄π™Ώπš•πš˜πš pipeline]

Easy to align with human preferences and to specialize

πšπ™»π™Ώπš•πš˜πš: We train reward models for each aspect and do RLHF, which yields a 60.0% winning rate against π™΄πŸΈπ™΄π™Ώπš•πš˜πš along the aspect of suspense and surprise !

[Figure: πšπ™»π™Ώπš•πš˜πš results against π™΄πŸΈπ™΄π™Ώπš•πš˜πš]

BibTeX

@article{zhu2023end,
  title={End-to-end Story Plot Generator},
  author={Zhu, Hanlin and Cohen, Andrew and Wang, Danqing and Yang, Kevin and Yang, Xiaomeng and Jiao, Jiantao and Tian, Yuandong},
  journal={arXiv preprint arXiv:2310.08796},
  year={2023}
}