Authors & Affiliations
Panos Bozelos, Tim P. Vogels
Abstract
The accelerating output of scientific research makes it harder for readers to discover content and for platforms to organize it. Traditional systems often fall short in dynamically accommodating such rapid expansion, hence the need for agile and adaptive solutions. We address these challenges by applying advanced Machine Learning (ML) and Natural Language Processing (NLP) to effectively manage and organize a growing collection of scientific content.

Specifically, we focus on World-Wide.org, a platform hosting >3,000 scientific seminars. We use a fine-tuned DeBERTa language model for zero-shot classification to categorize seminars into >50 scientific fields, allowing for broader yet more fine-grained browsing. Furthermore, we employ state-of-the-art keyword extraction techniques, particularly EmbedRank, in conjunction with BioSentVec, a model trained on the entire PubMed corpus. The aggregation of extracted keywords subsequently drives the creation of field-specific topic pages, which are continuously updated through a mix of sparse (lexical matching) and dense (vector embeddings) information retrieval. Our method assists in early detection of emerging research areas within our database. To provide deeper editorial-like analysis, we leverage the generative inference capabilities of Large Language Models to establish conceptual links between seminar abstracts, highlighting the similarities, differences, and novel aspects. Finally, our implementation of Locality-Sensitive Hashing (LSH) automates seminar versioning by mapping similar textual items to the same “buckets” with high probability, enabling us to identify updates or near-duplicate talks.

Through these easily transferable methods, we have been able to automate content management, significantly improving the discoverability and accessibility of scientific discourse on World-Wide.org.
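To make the bucketing idea behind LSH concrete, here is a minimal sketch of one common variant: MinHash signatures over word shingles, split into bands. The abstract does not say which LSH scheme or parameters the platform uses, so the choice of MinHash, the function names (`shingles`, `minhash_signature`, `candidate_pairs`), and the values of `k`, `num_hashes`, and `bands` are all illustrative assumptions, not the authors' implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def candidate_pairs(docs, num_hashes=64, bands=16):
    """Return pairs of document ids that share at least one LSH bucket.

    Each signature is cut into `bands` bands; two documents land in the same
    bucket when they agree on every row of some band (assumed parameters).
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash_signature(shingles(text), num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical seminar abstracts, invented for illustration only.
    talks = {
        "talk_v1": "inhibitory plasticity balances excitation in recurrent cortical networks",
        "talk_v2": "inhibitory plasticity balances excitation in recurrent cortical networks",
        "other": "deep reinforcement learning agents solve sparse reward navigation tasks",
    }
    print(candidate_pairs(talks))
```

Identical texts always share every bucket, and texts with no shingles in common almost never share one. Texts that overlap only partly collide with a probability that rises steeply with their Jaccard similarity, and the number of bands controls where that rise happens. Candidate pairs would normally be confirmed with an exact similarity check before two talks are treated as versions of the same seminar.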