TY - JOUR
T1 - Lost but not forgotten
T2 - finding pages on the unarchived web
AU - Huurdeman, Hugo C.
AU - Kamps, Jaap
AU - Samar, Thaer
AU - de Vries, Arjen P.
AU - Ben-David, Anat
AU - Rogers, Richard A.
N1 - Publisher Copyright:
© 2015, The Author(s).
PY - 2015/9/17
Y1 - 2015/9/17
AB - Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
KW - Anchor text
KW - Information retrieval
KW - Link evidence
KW - Web archives
KW - Web archiving
KW - Web crawlers
UR - http://www.scopus.com/inward/record.url?scp=84939272509&partnerID=8YFLogxK
U2 - 10.1007/s00799-015-0153-3
DO - 10.1007/s00799-015-0153-3
M3 - Article
AN - SCOPUS:84939272509
SN - 1432-5012
VL - 16
SP - 247
EP - 265
JO - International Journal on Digital Libraries
JF - International Journal on Digital Libraries
IS - 3-4
ER -