Accurate unlexicalized parsing for Modern Hebrew

Reut Tsarfaty, Khalil Sima'An

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Many state-of-the-art statistical parsers for English can be viewed as Probabilistic Context-Free Grammars (PCFGs) acquired from treebanks consisting of phrase-structure trees enriched with a variety of contextual, derivational (e.g., markovization) and lexical information. In this paper we empirically investigate the applicability and adequacy of the unlexicalized variety of such parsing models to Modern Hebrew (MH), a Semitic language that differs in structure and characteristics from English. We show that, contrary to experience with parsing the WSJ, the markovized, head-driven unlexicalized variety does not necessarily outperform plain PCFGs for Semitic languages. We demonstrate that enriching unlexicalized PCFGs with morphologically marked agreement features percolated up the parse tree (e.g., definiteness) outperforms plain PCFGs as well as a simple head-driven variation on the MH treebank. We further show that an (unlexicalized) head-driven variety enriched with the same features achieves even better performance. We conclude that morphologically rich languages introduce an additional dimension of parametrization that is orthogonal to the horizontal/vertical dimensions proposed before [1], and that its contribution is essential and complementary.
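
As a rough illustration of what percolating a morphologically marked feature up the parse tree can mean for an unlexicalized treebank PCFG, the following minimal Python sketch estimates rule probabilities by relative frequency, with and without the feature. The toy trees, the transliterated words, the `+D` definiteness marker, and the percolation rule (a nominal node inherits `+D` from any marked child) are all assumptions made for illustration; they are not the paper's grammar or the MH treebank annotation scheme.

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); a leaf is a plain string.
# Hypothetical toy data: "+D" marks definiteness on the part-of-speech tag.
TOY_TREEBANK = [
    ("S", ("NP", ("NN+D", "hayeled")), ("VP", ("VB", "axal"))),
    ("S", ("NP", ("NN", "yeled")), ("VP", ("VB", "axal"))),
]

def percolate(tree, feature="+D"):
    """Copy `feature` from a marked child up to a nominal parent (toy rule)."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [percolate(c, feature) for c in children]
    marked = any(not isinstance(c, str) and c[0].endswith(feature) for c in children)
    # Assumption: only labels starting with "N" count as nominal.
    if marked and label.startswith("N") and not label.endswith(feature):
        label += feature
    return (label, *children)

def rules(tree):
    """Yield one (lhs, rhs) pair per internal node of the tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from rules(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for t in trees for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}

plain = estimate_pcfg(TOY_TREEBANK)
enriched = estimate_pcfg([percolate(t) for t in TOY_TREEBANK])
```

In the plain grammar, `NP` expands to a definite or an indefinite noun with equal probability. After percolation the category is split into `NP+D` and `NP`, so the choice is made higher in the tree, at the `S` level. This is one way to picture a feature dimension that is separate from horizontal and vertical markovization.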

Original language: English
Title of host publication: Text, Speech and Dialogue - 10th International Conference, TSD 2007, Proceedings
Publisher: Springer Verlag
Pages: 39-47
Number of pages: 9
ISBN (Print): 9783540746270
Publication status: Published - 2007
Externally published: Yes
Event: 10th International Conference on Text, Speech and Dialogue, TSD 2007 - Pilsen, Czech Republic
Duration: 3 September 2007 to 7 September 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4629 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th International Conference on Text, Speech and Dialogue, TSD 2007
Country/Territory: Czech Republic
City: Pilsen
Duration: 3/09/07 to 7/09/07
