Indexing Weblogs One Post at a Time

Natalie S. Glance

To analyze weblogs, we must first identify the unit of a weblog that corresponds to a document. We argue in this paper that, for weblogs, the correct unit is the weblog post. A weblog post is a structured document with the following fields: date, timestamp, title, content, permalink, and author. We present our approach for segmenting weblogs into posts, which consists of three steps: (1) automatic feed discovery; (2) feed-guided segmentation, using the weblog feed and HTML; and (3) model-based weblog segmentation.
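
As an illustration only, the post structure and the first step (automatic feed discovery) might be sketched as follows. The abstract does not describe the implementation, so everything here is an assumption: the `WeblogPost` class, the `discover_feeds` helper, and the choice to look for `<link rel="alternate">` elements with an RSS or Atom media type, which is a common autodiscovery convention and not necessarily the method used in the paper.

```python
import datetime as dt
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin

# Media types commonly used to advertise a feed (an assumption of this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


@dataclass
class WeblogPost:
    """Hypothetical container for the six fields named in the abstract."""
    date: dt.date
    timestamp: Optional[dt.time]
    title: str
    content: str
    permalink: str
    author: Optional[str]


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative feed links against the weblog's own URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html: str, base_url: str) -> List[str]:
    """Return the feed URLs advertised in a weblog's HTML, in page order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

For a page at `http://example.org/blog/` whose head contains `<link rel="alternate" type="application/rss+xml" href="/index.xml">`, `discover_feeds` returns `["http://example.org/index.xml"]`. Each entry of a discovered feed could then populate a `WeblogPost`, which is where feed-guided segmentation would begin.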

