Long-form Story Generation and Evaluation

Kevin Yang, Hanlin Zhu, Justin Wong, Danqing Wang, Andrew Cohen, Xiaomeng Yang, Jiantao Jiao, Nanyun Peng, Dan Klein, Lei Li, Yuandong Tian

DOC: Improving Long Story Coherence with Detailed Outline Control

UC Berkeley, UCLA, Meta AI
ACL 2023

We propose the Detailed Outline Control (𝙳𝙾𝙲) framework for improving long-range plot coherence when automatically generating several-thousand-word-long stories. 𝙳𝙾𝙲 consists of two complementary components: a detailed outliner and a detailed controller. The detailed outliner creates a more detailed, hierarchically structured outline, shifting creative burden from the main drafting procedure to the planning stage. The detailed controller ensures the more detailed outline is still respected during generation by controlling story passages to align with outline details. In human evaluations of automatically generated stories, 𝙳𝙾𝙲 substantially outperforms a strong Re3 baseline (Yang et al., 2022) on plot coherence (22.5% absolute gain), outline relevance (28.2%), and interestingness (20.7%). Humans also judged 𝙳𝙾𝙲 to be much more controllable in an interactive generation setting.

More plot-coherent, outline-relevant, and interesting long-form stories!

With a better Outliner

Detailed planning instead of just a three-sentence draft (beginning, middle, end)

[Figure: the detailed outliner]

The detailed outliner recursively expands outline items in breadth-first order. To create each new entry, it proposes candidate events, selects the best via filtering and reranking, and then detects the setting and relevant characters.
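To make this concrete, here is a minimal Python sketch of breadth-first outline expansion; propose_candidates, rerank, detect_setting, and detect_characters are hypothetical stand-ins for 𝙳𝙾𝙲's prompted candidate generation, learned filtering/reranking, and detection components, not the paper's implementation.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class OutlineItem:
    text: str
    depth: int
    setting: str = ""
    characters: list = field(default_factory=list)
    children: list = field(default_factory=list)

def propose_candidates(item):
    # Stand-in: DOC prompts an LM for several candidate child events.
    return [f"{item.text} / sub-event {i}" for i in range(4)]

def rerank(candidates, parent):
    # Stand-in: DOC filters and reranks candidates (e.g., for relevance
    # and coherence); here we simply keep the first one.
    return candidates[0]

def detect_setting(text):
    return ""   # stand-in for DOC's setting detection

def detect_characters(text):
    return []   # stand-in for DOC's character detection

def expand_outline(root, max_depth=3, branching=3):
    """Expand the outline breadth-first, as the detailed outliner does."""
    queue = deque([root])
    while queue:
        item = queue.popleft()
        if item.depth >= max_depth:
            continue
        for _ in range(branching):
            best = rerank(propose_candidates(item), parent=item)
            child = OutlineItem(text=best, depth=item.depth + 1,
                                setting=detect_setting(best),
                                characters=detect_characters(best))
            item.children.append(child)
            queue.append(child)
    return root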

and better Controller

Token-by-token control throughout generation, instead of relying only on an initial prompt or post-hoc rejection sampling; a decoding-loop sketch follows the list below.

[Figure: the detailed controller]
  • Drafting With Detailed Control: Event + Setting + Character
  • Control Strength: initialized to 0 for each outline item and incremented with each subsequent drafting step
  • Future Context in Generation: include the next outline item as future context in the prompt
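As a rough illustration, the sketch below combines these three mechanisms in one decoding loop, assuming a HuggingFace-style causal LM; controller_bias is a hypothetical stand-in for 𝙳𝙾𝙲's learned controller, and the prompt wording is illustrative rather than the paper's.

import torch

def controller_bias(ids, outline_item, vocab_size):
    # Stand-in for DOC's learned controller, which scores how well each
    # candidate next token keeps the passage aligned with the outline item.
    return torch.zeros(vocab_size)

def draft_step(model, tokenizer, context, outline_item, next_item,
               control_strength, max_new_tokens=64):
    # Future context: the prompt includes the NEXT outline item as well.
    prompt = (f"{context}\nCurrent event: {outline_item}\n"
              f"Upcoming event: {next_item}\nContinue the story:")
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits[0, -1]
        # Token-by-token control: add the controller's bias, scaled by the
        # current control strength, before sampling the next token.
        logits = logits + control_strength * controller_bias(
            ids, outline_item, logits.shape[-1])
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Control strength starts at 0 for each outline item and grows with each
# subsequent drafting step, tightening adherence over time:
# for step in range(num_drafting_steps):
#     story = draft_step(model, tokenizer, story, item, next_item,
#                        control_strength=step * strength_increment)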

πŸ“ Get better Stories

Pairwise comparisons between systems on 1000- to 1500-word passages; a minimal tallying sketch follows the metric list below.

  1. Coherent: Percentage of passages judged plot-coherent by human annotators.
  2. Relevant: Percentage judged faithful to the corresponding outline item.
  3. Interesting: Percentage judged interesting.
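A minimal tallying sketch, assuming one boolean judgment per passage per metric (the annotation format here is hypothetical):

def pct_judged(labels):
    # labels: one boolean per passage from human annotators,
    # e.g. True if the passage was judged plot-coherent.
    return 100.0 * sum(labels) / len(labels)

print(pct_judged([True, True, False, True]))  # 75.0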
[Figure: pairwise comparison results against RE3 and ROLLING-OPT]

𝙳𝙾𝙲 stories are rated substantially more plot-coherent, outline-relevant, and interesting than RE3 and ROLLING-OPT stories.

πŸ“ and better interactive experience

The human provides an initial premise, from which RE3 and 𝙳𝙾𝙲 each generate an outline. The human is then asked to edit the generated outline, and the final outline is used by each system to generate the full story.
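A minimal sketch of this protocol; generate_outline and generate_story are hypothetical wrappers around each system.

def human_edit(outline):
    # Stand-in for the human editing step (a UI in the actual study).
    print("Current outline:\n" + outline)
    edited = input("Enter an edited outline (blank to keep it): ")
    return edited or outline

def interactive_session(system, premise):
    outline = system.generate_outline(premise)  # RE3 or DOC drafts an outline
    outline = human_edit(outline)               # the human revises it
    return system.generate_story(premise, outline)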

  1. Intent: Which passage better followed their original intent.
  2. Control: Which workflow they felt gave them more control.
  3. Intuition: Which system was more helpful or intuitive.
  4. Quality: Which system they would choose to write another story, if prioritizing quality.
[Figure: interactive evaluation results]

Humans judged faithfulness to authorial intent, control over generation, system intuitiveness, and story quality. 𝙳𝙾𝙲 is preferred by a wide margin on all metrics.

BibTeX

@inproceedings{yang-etal-2023-DOC,
    title = "DOC: Improving Long Story Coherence With Detailed Outline Control",
    author = "Yang, Kevin  and
      Klein, Dan  and
      Peng, Nanyun  and
      Tian, Yuandong",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.190",
    doi = "10.18653/v1/2023.acl-long.190",
    pages = "3378--3465",
}
  

Learning Personalized Story Evaluation

UC Santa Barbara, Meta AI, UC Berkeley, CMU

While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address such issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets π™ΏπšŽπš›-π™Όπ™Ώπš‚πšƒ and π™ΏπšŽπš›-𝙳𝙾𝙲 for personalized story evaluation, by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model π™ΏπšŽπš›πš‚π™΄ to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, π™ΏπšŽπš›πš‚π™΄ predicts either a detailed review or fine-grained comparison in several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that π™ΏπšŽπš›πš‚π™΄ outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy.

Create Personalized Story Evaluation Dataset

Two challenges for current personalized story evaluation
  • Contamination: Is the LLM-based evaluator really a good judge, or is it just memorizing what it has seen?
  • How to collect personalization labels: How can we capture personal preferences? When explicit labels are hard to obtain, can we instead use implicit labels that reveal a reviewer's preferences?
[Figure: personalized story evaluation dataset construction]

We create two new datasets π™ΏπšŽπš›-π™Όπ™Ώπš‚πšƒ and π™ΏπšŽπš›-𝙳𝙾𝙲 for personalized story evaluation, by re-purposing existing datasets with proper anonymization and implicit personalized labels.
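A minimal sketch (the field names are assumptions, not the datasets' schema) of how a reviewer's history can be repurposed: a few of their reviews become in-context exemplars, a held-out review supplies the implicit personalized label, and names are anonymized.

import re
from collections import defaultdict

def anonymize(text, names):
    # Replace author/reviewer names with neutral placeholders.
    for i, name in enumerate(names):
        text = re.sub(rf"\b{re.escape(name)}\b", f"PERSON_{i}", text)
    return text

def build_examples(reviews, k_context=3):
    # reviews: dicts like {"reviewer": ..., "story": ..., "review": ..., "score": ...}
    by_reviewer = defaultdict(list)
    for r in reviews:
        by_reviewer[r["reviewer"]].append(r)
    examples = []
    for items in by_reviewer.values():
        if len(items) <= k_context:
            continue
        context, target = items[:k_context], items[k_context]
        examples.append({
            "context": context,              # exemplar reviews for this reviewer
            "target_story": target["story"],
            "label": target["score"],        # implicit personalized label
        })
    return examples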


Personalized Story Evaluation Model (π™ΏπšŽπš›πš‚π™΄)

Can a general-purpose LLM-based evaluator be personalized, and how?
[Figure: π™ΏπšŽπš›πš‚π™΄ overview]

We propose π™ΏπšŽπš›πš‚π™΄, which takes reviewer's prior reviewers and reason the specific preference. Based on this, it can give a personalized review and score for an individual story, or conduct personalized fine-grained comparsion between two stories.

π™ΏπšŽπš›πš‚π™΄ significantly outperforms all baselines, including GPT-4
[Figure: correlation and pairwise-accuracy results]

π™ΏπšŽπš›πš‚π™΄ achieves the highest correlation with human ratings on individual story scoring, and the best accuracy for pairwise comparisons across five aspects of story quality.
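The headline metric is Kendall correlation between predicted and human story ratings; a quick check with SciPy (toy scores, not the paper's data):

from scipy.stats import kendalltau

human_scores = [5, 3, 4, 1, 2]
model_scores = [4, 3, 5, 2, 1]
tau, p = kendalltau(human_scores, model_scores)
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")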

BibTeX

@article{wang2023learning,
    title={Learning Personalized Story Evaluation},
    author={Wang, Danqing and Yang, Kevin and Zhu, Hanlin and Yang, Xiaomeng and Cohen, Andrew and Li, Lei and Tian, Yuandong},
    journal={arXiv preprint arXiv:2310.03304},
    year={2023}
}
  

End-to-end Story Plot Generator

UC Berkeley, Meta AI, UC Santa Barbara
Equal contribution

Story plots, while short, carry most of the essential information of a full story that may contain tens of thousands of words. We study the problem of automatic generation of story plots, which includes story premise, character descriptions, plot outlines, etc. To generate a single engaging plot, existing plot generators (e.g., 𝙳𝙾𝙲 (Yang et al., 2022a)) require hundreds to thousands of calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot, which is costly and takes at least several minutes. Moreover, the hard-wired nature of the method makes the pipeline non-differentiable, blocking fast specialization and personalization of the plot generator. In this paper, we propose three models, π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, π™΄πŸΈπ™΄π™Ώπš•πš˜πš and πšπ™»π™Ώπš•πš˜πš, to address these challenges. π™Ύπš™πšŽπš—π™Ώπš•πš˜πš replaces expensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls via careful prompt designs, which leads to inexpensive generation of high-quality training datasets of story plots. We then train an end-to-end story plot generator, π™΄πŸΈπ™΄π™Ώπš•πš˜πš, by supervised fine-tuning (SFT) using approximately 13000 story plots generated by π™Ύπš™πšŽπš—π™Ώπš•πš˜πš. π™΄πŸΈπ™΄π™Ώπš•πš˜πš generates story plots of comparable quality to π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, and is > 10× faster (1k tokens in only 30 seconds on average). Finally, we obtain πšπ™»π™Ώπš•πš˜πš that is further fine-tuned with RLHF on several different reward models for different aspects of story quality, which yields 60.0% winning rate against π™΄πŸΈπ™΄π™Ώπš•πš˜πš along the aspect of suspense and surprise.

Tired of waiting so long for just one story?

π™΄πŸΈπ™΄π™Ώπš•πš˜πš: Generate a high-quality story plot with one click in less than 30 seconds!

We first rebuilt our previous 𝙳𝙾𝙲 pipeline with LLaMA-2, removing the dependence on OpenAI APIs with their rate limits and potential model updates. The rebuilt pipeline is named π™Ύπš™πšŽπš—π™Ώπš•πš˜πš. Using a large batch of high-quality story plots generated by π™Ύπš™πšŽπš—π™Ώπš•πš˜πš, we fine-tune LLaMA-2-7B-chat to obtain an end-to-end story plot generator, π™΄πŸΈπ™΄π™Ώπš•πš˜πš.
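A minimal SFT sketch under assumed data format and hyperparameters (not the paper's exact training setup), using plain PyTorch and HuggingFace transformers:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical format for the ~13k OpenPlot-generated training plots.
plots_dataset = [{"premise": "A lighthouse keeper finds a strange map.",
                  "plot": "..."}]

def collate(batch):
    texts = [f"Premise: {ex['premise']}\nPlot: {ex['plot']}" for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()           # causal-LM objective
    enc["labels"][enc["attention_mask"] == 0] = -100   # ignore padding in loss
    return enc

loader = DataLoader(plots_dataset, batch_size=4, shuffle=True, collate_fn=collate)
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()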

[Figure: π™Ύπš™πšŽπš—π™Ώπš•πš˜πš and π™΄πŸΈπ™΄π™Ώπš•πš˜πš pipeline]

Easy to align with human preferences and to specialize

πšπ™»π™Ώπš•πš˜πš: We train reward models for each aspect and do RLHF, which yields a 60.0% winning rate against π™΄πŸΈπ™΄π™Ώπš•πš˜πš along the aspect of suspense and surprise !

[Figure: πšπ™»π™Ώπš•πš˜πš results against π™΄πŸΈπ™΄π™Ώπš•πš˜πš]

BibTeX

@article{zhu2023end,
  title={End-to-end Story Plot Generator},
  author={Zhu, Hanlin and Cohen, Andrew and Wang, Danqing and Yang, Kevin and Yang, Xiaomeng and Jiao, Jiantao and Tian, Yuandong},
  journal={arXiv preprint arXiv:2310.08796},
  year={2023}
}