Question-Answering for Segment Retrieval on Podcast Transcripts
Abstract
Podcasting has rapidly ascended as one of the primary forms of spoken-word media in
the 21st century. The Spotify Podcast Dataset contains transcripts of over 100,000
podcast episodes, making it one of the largest repositories of spoken-word data. The
segment retrieval task aims to find the most relevant segments to a given query from
the set of episode transcripts. This thesis presents a two-stage approach to segment
retrieval using an end-to-end question-answering (QA) deep learning architecture with
an additional step to expand answers to segments. Standard BM25 retrieval on an index
of predetermined segments from each episode serves as a baseline retrieval system.
Experiments for both approaches involved producing and evaluating a ranked list of 20
relevant segments for 50 test topics. Comparison between the two retrieval methods
shows that the QA retriever trails the baseline in nDCG@10 by 0.128, precision@10 by
0.184, and average segment relevance score by 0.461. QA retrieval outperforms the
baseline by 0.024 in recall@10 and, when discounting irrelevant segments, trails it by
only 0.102 in average segment relevance score. The results suggest
that the QA retrieval approach in this thesis can adequately identify and rank relevant
segments within a relevant input text. However, for some queries, it may struggle
to find enough relevant candidate documents during the first stage of retrieval. QA
retrieval shows promise in handling informational queries for the user goal of answering
a question. Future work includes improving processes such as candidate document
retrieval, answer span expansion, and data annotation.
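
As an illustration of the evaluation measures named above, the following is a minimal sketch of how precision@10, recall@10, nDCG@10, and an average segment relevance score could be computed for one topic's ranked list. It is not the thesis's evaluation code. The function names, the integer relevance grades (0 meaning irrelevant), the linear-gain form of DCG, and the reading of "discounting irrelevant segments" as averaging over non-zero grades only are all assumptions made for this example.

```python
import math

# Assumption: each retrieved segment carries an integer relevance grade,
# with 0 meaning "irrelevant". The grading scale is not given in the abstract.

def precision_at_k(grades, k=10):
    """Fraction of the top-k positions holding a relevant segment (grade > 0)."""
    return sum(1 for g in grades[:k] if g > 0) / k

def recall_at_k(grades, total_relevant, k=10):
    """Share of all known relevant segments that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(1 for g in grades[:k] if g > 0) / total_relevant

def dcg_at_k(grades, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))

def ndcg_at_k(grades, judged_grades, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering of all judged grades."""
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def mean_relevance(grades, skip_irrelevant=False):
    """Average grade of the returned segments; optionally ignore grade-0 segments."""
    kept = [g for g in grades if g > 0] if skip_irrelevant else list(grades)
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical ranked list of three segments for one topic (invented grades).
ranked = [3, 0, 2]
p3 = precision_at_k(ranked, k=3)
r3 = recall_at_k(ranked, total_relevant=4, k=3)
n3 = ndcg_at_k(ranked, judged_grades=ranked, k=3)
avg_all = mean_relevance(ranked)
avg_rel = mean_relevance(ranked, skip_irrelevant=True)
```

Under these assumptions, per-topic values would be averaged over the 50 test topics to obtain system-level scores comparable between the QA retriever and the BM25 baseline.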