Uncovering the unarchived web

Thaer Samar, Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Arjen De Vries

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation; this information could be used to reconstruct impressions of pages that existed on the live web at the crawl date but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection, but are known to have existed due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.

Original language: English
Title of host publication: SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery
Pages: 1199-1202
Number of pages: 4
ISBN (Print): 9781450322591
DOIs:
Publication status: Published - 2014
Externally published: Yes
Event: 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014 - Gold Coast, QLD, Australia
Duration: 6 Jul 2014 to 11 Jul 2014

Publication series

Name: SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014
Country/Territory: Australia
City: Gold Coast, QLD
Period: 6/07/14 to 11/07/14
