Columbia University

Technology Ventures

Automated natural language processing tool uses machine learning to identify quoted speech in literary narratives

Technology #proxy66

Quoted Speech Attribution Corpus
Kathleen McKeown
Managed By
Richard Nguyen

Narrative structure (i.e., storytelling) is prevalent in textual information channels such as news, personal blogs, and literature. This technology applies machine learning to literary analysis by automatically identifying the speakers of quoted speech in natural-language stories. The method was developed using a corpus of over 3,000 instances of quoted speech from six works of 19th and 20th century literature. The text is first preprocessed to find quotes and candidate characters. Each quote is then classified into a syntactic category, and features specific to that category are extracted from the text for training. The result is an algorithm that attributes instances of quoted speech to their respective speakers in narrative discourse.
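
The preprocessing step described above (finding quotes and candidate characters) can be illustrated with a minimal sketch. This is not the inventors' implementation: the regular expression, the function names `find_quotes` and `find_candidates`, and the assumption that a list of character names is supplied by the caller are all hypothetical choices made for illustration.

```python
import re

# Assumption: quotes are delimited by straight or curly double quotes.
QUOTE_RE = re.compile(r'["“]([^"”]+)["”]')


def find_quotes(text):
    """Return (start, end, quote_text) for each quoted span in the text."""
    return [(m.start(), m.end(), m.group(1)) for m in QUOTE_RE.finditer(text)]


def find_candidates(text, known_names):
    """Return (position, name) for each mention of a supplied character name.

    Assumption: the caller provides the names; a real system would need
    to detect characters (and resolve aliases) on its own.
    """
    mentions = []
    for name in known_names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((m.start(), name))
    return sorted(mentions)


if __name__ == "__main__":
    sample = '"Come in," said Emma. Harriet smiled.'
    print(find_quotes(sample))
    print(find_candidates(sample, ["Emma", "Harriet"]))
```

The output of these two functions (quote spans and positioned character mentions) is the kind of input a later classification step would consume.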

Algorithm achieves rapid attribution at 83% accuracy by dividing quotes into syntactic classes to leverage common discourse patterns

This method brings machine learning to bear on a core task of literary analysis: reading a text and identifying the speaker of each quotation automatically. To leverage dialogue chains and the frequent use of expression verbs, a pattern matching algorithm assigns each quote to one of five syntactic categories. One-third of the selected corpus was used to develop the algorithm, while the remainder was used for training and testing. Results showed that the method attributed a quote to its correct speaker 83% of the time without any a priori information, exceeding the “nearest character” baseline. Ongoing work is aimed at social network extraction and at methods for extracting segments of indirect (unquoted) speech and their speakers.
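
To make the two ideas in this paragraph concrete, the sketch below shows pattern matching on the text that follows a quote, plus a simple "nearest character" baseline. It is a sketch under stated assumptions, not the published algorithm: the category labels, the short list of expression verbs, and the function names are hypothetical, and the five categories used in the actual work are not reproduced here.

```python
import re

# Assumption: a tiny illustrative list of expression verbs.
EXPRESSION_VERBS = ("said", "replied", "asked", "cried")
VERB_ALT = "|".join(EXPRESSION_VERBS)


def categorize(after_quote, names):
    """Label a quote by the pattern in the text immediately after it.

    The labels are illustrative placeholders, not the source's categories.
    """
    if not names:
        return "other"
    name_alt = "|".join(re.escape(n) for n in names)
    if re.match(rf"\s*({VERB_ALT})\s+({name_alt})\b", after_quote):
        return "quote-verb-person"
    if re.match(rf"\s*({name_alt})\s+({VERB_ALT})\b", after_quote):
        return "quote-person-verb"
    return "other"


def nearest_character(quote_start, quote_end, mentions):
    """Baseline: pick the character mentioned closest to the quote.

    `mentions` is a list of (position, name); mentions that fall inside
    the quote itself are ignored.
    """
    outside = [m for m in mentions if not quote_start <= m[0] < quote_end]
    if not outside:
        return None

    def distance(pos):
        return quote_start - pos if pos < quote_start else pos - quote_end

    return min(outside, key=lambda m: distance(m[0]))[1]


if __name__ == "__main__":
    names = ["Emma", "Harriet"]
    print(categorize(" said Emma.", names))
    print(nearest_character(0, 10, [(16, "Emma"), (22, "Harriet")]))
```

A category label of this kind could then select which features are extracted for a quote, while the baseline gives a reference point against which a learned model's accuracy can be compared.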

Lead Inventor:

Kathleen McKeown, Ph.D.

Applications:


  • Sentiment analysis (i.e. opinion mining)
  • Political discourse analysis (debates, speeches, hearings, etc.)
  • Automated literary analysis
  • Social network extraction
  • Automatic text summarization
  • Speech recognition and segmentation

Advantages:


  • Addresses the application of natural language processing to literature
  • Leverages two aspects of the semantics of quoted speech (dialogue chains and the frequent use of expression verbs) by using a pattern matching algorithm to assign each quote to a syntactic category
  • Draws upon a corpus of more than 3,000 quotations
  • Achieves 83% accuracy in assigning speakers to quotes

Patent information:

Patent Pending

Tech Ventures Reference: IR Proxy66

Related Publications:

Elson and McKeown, Automatic Attribution of Quoted Speech in Literary Narrative, AAAI 2010