Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology?

Stav Klein, Reut Tsarfaty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces (WPs). In Morphologically-Rich Languages (MRLs), which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused or intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes that make morphological knowledge explicit might boost performance for MRLs.
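To make the notion of "linear tokenization into word-pieces" concrete, the following is a minimal sketch of greedy longest-match-first segmentation, the strategy commonly associated with BERT-style word-piece tokenizers. It is an illustration only, not the authors' code: the function name `wordpiece_tokenize`, the toy vocabulary `TOY_VOCAB`, and the English example words are all assumptions introduced here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word (illustrative sketch).

    Non-initial pieces carry the '##' continuation prefix. If some position
    cannot be matched against the vocabulary, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"liv", "##ing", "life", "walk", "##ed", "##s"}

print(wordpiece_tokenize("living", TOY_VOCAB))  # ['liv', '##ing']
print(wordpiece_tokenize("walked", TOY_VOCAB))  # ['walk', '##ed']
```

Every piece produced this way is a contiguous substring of the word, read left to right. That is the limitation the abstract points to: when morphemes are fused or interleaved rather than concatenated, no sequence of contiguous pieces can isolate them.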

Original language: English
Title of host publication: SIGMORPHON 2020 - 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 6
ISBN (Electronic): 9781952148194
Digital Object Identifiers (DOIs)
Publication status: Published - 2020
Externally published: Yes
Event: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States
Duration: 10 July 2020 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X


Conference: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Country/Territory: United States
City: Virtual, Online
Period: 10 July 2020 → …

Bibliographical note

Publisher Copyright:
© 2020 Association for Computational Linguistics.
