Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology?

Stav Klein, Reut Tsarfaty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces. In Morphologically-Rich Languages (MRLs), which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused or intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of word-pieces (WPs), we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes that make morphological knowledge explicit might boost performance for MRLs.
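To make the notion of linear segmentation concrete, the following is a minimal sketch of greedy longest-match-first word-piece tokenization, the general strategy commonly associated with BERT-style tokenizers. It is not the authors' code or experimental setup. The function name `wordpiece_tokenize` and the tiny `TOY_VOCAB` are hypothetical, and the English pair "sing"/"sang" is only an illustrative stand-in for non-concatenative morphology, not an example taken from the paper's Hebrew data.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"liv", "life", "##life", "##ing", "s", "##ang"}

print(wordpiece_tokenize("living", TOY_VOCAB))
print(wordpiece_tokenize("sing", TOY_VOCAB))
print(wordpiece_tokenize("sang", TOY_VOCAB))
```

Under this toy vocabulary, "living" is cut into a stem-like piece and a suffix-like piece, which suits concatenative morphology. In "sang", however, the past tense is signalled by a vowel change inside the second piece, so no left-to-right cut isolates it as a separate unit. This is the kind of limitation the abstract raises for fused and non-concatenative morphology.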

Original language: English
Title of host publication: SIGMORPHON 2020 - 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 204-209
Number of pages: 6
ISBN (electronic): 9781952148194
Publication status: Published - 2020
Externally published: Yes
Event: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States
Duration: 10 July 2020 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (print): 0736-587X

Conference

Conference: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Country/Territory: United States
City: Virtual, Online
Period: 10/07/20 → …

Bibliographical note

Publisher Copyright:
© 2020 Association for Computational Linguistics.
