Outlet Title
Proceedings of the 58th Hawaii International Conference on System Sciences
Document Type
Conference Proceeding
Publication Date
2025
Abstract
Topic modeling is a crucial unsupervised machine learning technique for identifying themes within unstructured text. This study compares traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA), with advanced embedding-based models, specifically BERTopic with OpenAI embeddings (BERTopic-OpenAI). The analysis uses two distinct datasets: user reviews of the mental health app Replika and the 20 Newsgroups dataset. On the Replika dataset, both methods identified common themes, but BERTopic-OpenAI uncovered additional nuanced topics, demonstrating its stronger semantic capabilities. Quantitative evaluation on the 20 Newsgroups dataset further highlighted BERTopic-OpenAI's advantage: it achieved higher topic coherence and diversity than the best-performing LDA model. These results suggest that embedding-based models produce more coherent, interpretable, and diverse topics, making them valuable tools for extracting meaningful insights from large, variable-length text corpora. Future research should focus on refining these advanced techniques to improve their applicability and effectiveness in dynamic and varied textual environments.
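The abstract's quantitative comparison rests on two standard metrics: topic coherence and topic diversity. The paper does not specify its exact formulas here, so the following is a minimal illustrative sketch, assuming the common definitions: diversity as the fraction of unique words among all topics' top words, and a UMass-style coherence based on document co-occurrence counts. All function names and toy data are hypothetical, for illustration only.

```python
from itertools import combinations
from math import log

def topic_diversity(topics):
    """Fraction of unique words across all topics' top-k word lists.

    1.0 means no topic shares a top word with another (maximally diverse).
    """
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

def umass_coherence(topic, docs):
    """UMass-style coherence for one topic's top words.

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, where D counts
    documents containing the given word(s); averaged over scored pairs.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score, pairs = 0.0, 0
    for w_j, w_i in combinations(topic, 2):  # earlier word conditions the later one
        denom = doc_freq(w_j)
        if denom:
            score += log((doc_freq(w_i, w_j) + 1) / denom)
            pairs += 1
    return score / pairs if pairs else 0.0

# Toy example (hypothetical data, not from the paper's datasets):
topics = [["cat", "dog"], ["cat", "fish"]]
docs = [["cat", "dog"], ["cat", "fish"]]
print(topic_diversity(topics))                # 0.75 (3 unique words out of 4)
print(umass_coherence(["cat", "dog"], docs))  # 0.0 for this toy corpus
```

In practice, the paper's pipeline would compute such scores over the topics produced by each model on 20 Newsgroups; higher coherence and diversity values indicate more interpretable, less redundant topic sets.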
Recommended Citation
Wahbeh, Abdullah; Al-Ramahi, Mohammad; El-Gayar, Omar F.; Elnoshokaty, Ahmed; and Nasralah, Tareq, "Evaluating Topic Models with OpenAI Embeddings: A Comparative Analysis on Variable-Length Texts Using Two Datasets" (2025). Research & Publications. 438.
https://scholar.dsu.edu/bispapers/438