Abstract
This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces. In Morphologically-Rich Languages (MRLs), which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused or intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of word-pieces (WPs), we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes that make morphological knowledge explicit might boost performance for MRLs.
Original language | English |
---|---|
Title of host publication | SIGMORPHON 2020 - 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 204-209 |
Number of pages | 6 |
ISBN (Electronic) | 9781952148194 |
Digital Object Identifiers (DOIs) | |
Publication status | Published - 2020 |
Externally published | Yes |
Event | 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020, as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States. Duration: 10 Jul 2020 → … |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
ISSN (Print) | 0736-587X |
Conference
Conference | 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020, as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 10/07/20 → … |
Bibliographical note
Publisher Copyright: © 2020 Association for Computational Linguistics.