AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level

Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Shaked Greenfeld, Reut Tsarfaty

Research output: Chapter in book / report / conference proceeding › Conference contribution › Peer-reviewed

Abstract

Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts. Second, most benchmarks available to evaluate progress in Hebrew NLP require morphological boundaries which are not available in the output of PLMs. In this work we remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel neural architecture that recovers the morphological segments encoded in contextualized embedding vectors. Based on this new morphological component we offer an evaluation suite consisting of multiple tasks and benchmarks that cover sentence-level, word-level and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew state-of-the-art models. We make our AlephBERT model, the morphological extraction component, and the Hebrew evaluation suite publicly available, for future investigations and evaluations of Hebrew PLMs.

Original language: English
Title of host publication: ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Publisher: Association for Computational Linguistics (ACL)
Pages: 46-56
Number of pages: 11
ISBN (electronic): 9781955917216
Publication status: Published - 2022
Externally published: Yes
Event: 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland
Duration: 22 May 2022 to 27 May 2022

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume: 1
ISSN (print): 0736-587X

Conference

Conference: 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
Country/Territory: Ireland
City: Dublin
Period: 22/05/22 to 27/05/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.
