NewsFinder Procedure

Automating Collection, Classification, Ranking and Publishing of News Stories for AI in the News



This page documents the NewsFinder software. If you are looking for recent news published by NewsFinder, visit the AI in the News page.

Brief Description of the NewsFinder Procedure

Stories are selected and published to AI in the News by the NewsFinder program.

See Information about NewsFinder for a general description of our goals and mission statement.

The NewsFinder prototype was written by Tom Charytoniuk and turned into an operational service by Liang Dong. Joshua Eckroth has subsequently improved the service. NewsFinder is written in PHP and Python; data are stored in a MySQL database. The well-known NLTK library is used for word stemming, stopword filtering, and n-gram text processing. The libsvm2 library is used to train and apply support vector machines. jusText handles automatic text extraction from web pages. Finally, the Cheetah templating system allows us to keep HTML and wiki text separate from program logic.

The source code for NewsFinder can be obtained from GitHub at this address: github.com/AAAI/AINews. The code is available under the following terms:

Copyright (c) 2011 by the Association for the Advancement of Artificial Intelligence. This program and parts of it may be used and distributed without charge for non-commercial purposes as long as this notice is included.

Each Sunday, one of the editors reviews the pending stories from the week's news by logging into http://aaai.org/AITopics/AINewsAdmin and then either vetting each story for inclusion or discarding it. (Past stories in the "All" category can also be reviewed to provide more data for the learning programs.) Review outcomes are labeled as follows: TP = correctly included for publication; TN = correctly excluded; FP = inappropriately included; FN = inappropriately excluded. (The FN category still needs some refinement, because some publishable stories are excluded only because an already-published story covers the same event.)
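
To make the four labels concrete, here is a minimal sketch that tallies them from pairs of decisions and derives accuracy as (TP + TN) / total. The function names and the boolean-pair input format are assumptions made for illustration; this page does not describe how NewsFinder actually stores the editors' judgments.

```python
def tally_reviews(decisions):
    """Count TP/TN/FP/FN from (included, should_include) boolean pairs.

    `included` is the program's decision and `should_include` is the
    editor's judgment (hypothetical input format).
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for included, should_include in decisions:
        if included and should_include:
            counts["TP"] += 1      # correctly included
        elif not included and not should_include:
            counts["TN"] += 1      # correctly excluded
        elif included:
            counts["FP"] += 1      # inappropriately included
        else:
            counts["FN"] += 1      # inappropriately excluded
    return counts


def accuracy(counts):
    """Fraction of decisions the editor agreed with."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0
```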

Each Monday morning, NewsFinder crawls the web and selects relevant news articles about artificial intelligence. The process can be described in the following three stages: discover, filter, and publish.

I. Discover

  • Various online news websites are searched for interesting news covering developments in artificial intelligence. If a website provides an RSS feed, that resource is utilized; otherwise web pages are parsed so as to extract a story's main content and publication date. See list of news sources for sources used and other suggestions.
  • Google News is searched as a source of "news we otherwise missed." Google News itself is an aggregator; however, NewsFinder improves upon Google News by filtering for relevance to the AI community, categorizing each news article in one or more of the 19 topics found in AITopics, and filtering out duplicate stories.
  • User-submitted news is also collected.
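
As an illustration of the RSS path of the Discover stage, the following sketch pulls each story's title, link, and publication date out of an RSS 2.0 document using only the Python standard library. It is not taken from the NewsFinder source; the function name `parse_rss` and the returned dictionary keys are assumptions, and fetching the feed over the network is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss(xml_text):
    """Return one dict per <item> in an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        stories.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return stories
```

Sites without a feed need the content-extraction path instead, which this sketch does not cover.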

II. Filter

  • NewsFinder removes any articles whose text does not contain at least one (stemmed) term in the whitelist.
  • Next, any articles that are classified as "Not related" are removed. A support vector machine (SVM) has been trained to determine if an article is related to artificial intelligence. The SVM is trained on positive and negative examples; each example is represented as a high-dimensional vector of "tf-idf" values (term frequency / inverse document frequency), one tf-idf value per word in the article. In tests, this method has produced 98% accuracy in determining whether or not an article is reporting about some aspect of artificial intelligence.
  • Then the software removes any articles that have no associated topics (from the 19 topics that make up the AITopics website). Each topic is represented by its own SVM, and an article may be associated with more than one topic (in machine learning parlance, this is not "multiclass" classification but rather several independent binary classification problems). In tests, this method has produced about 80% to 99% accuracy depending on topic (Applications, being such a broad topic, is not correctly identified as often as more specific topics like Philosophy).
  • Lastly, any articles that appear to be duplicates are removed. Every article is compared with every other article from the past two weeks (we assume that after two weeks, readers don't mind seeing another article on a similar subject). Articles are compared via the cosine similarity metric between their tf-idf vectors. If an article A is similar to an article B, then the following logic decides which of A or B to remove:
    • If either A or B has already been published (in the prior week's news), then remove the other.
    • If A comes from a more relevant source than B, then remove B; if B comes from a more relevant source than A, then remove A.
    • Finally, if A is associated with more categories, remove B; otherwise, remove A.
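
The duplicate check above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the NewsFinder implementation: the tf-idf weighting shown (length-normalized term frequency times log inverse document frequency) is one common variant, the dictionary keys `published`, `source_relevance`, and `topics` are hypothetical, and the similarity threshold above which two articles count as duplicates is not given on this page, so none is hard-coded here. The case where both articles are already published is also unspecified; the sketch simply falls through to the next rule.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def article_to_remove(a, b):
    """Given two articles already judged similar, pick the one to drop."""
    # Rule 1: keep whichever article has already been published.
    if a["published"] != b["published"]:
        return b if a["published"] else a
    # Rule 2: keep the article from the more relevant source.
    if a["source_relevance"] != b["source_relevance"]:
        return b if a["source_relevance"] > b["source_relevance"] else a
    # Rule 3: keep A only if it has strictly more categories.
    return b if len(a["topics"]) > len(b["topics"]) else a
```

Comparing every article against every other article is quadratic in the number of articles, which stays manageable because only the past two weeks are considered.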

III. Publish

  • The articles that have survived through the various filters described above are sorted by the following criteria: first sort by the article's source relevance (e.g. New York Times is more relevant than CNet); then, if the source relevances are the same for a pair of articles, sort by number of categories (more categories first).
  • Final published articles are chosen by the following process until 12 articles have been selected: add the first unselected article in the sorted collection if none of the article's associated topics are already "full" (a full topic has six articles); if some topic is full, skip this article altogether and move on to the next unselected article. This restriction encourages breadth in the set of published articles.
  • The chosen articles are summarized into four representative sentences and rendered as an email to the AI-Alert mailing list, as the AINews wiki page, and as RSS feeds (aggregate and for each topic).
  • All articles, chosen or not, are published on the AllNews wiki page. Every article is also published on its own wiki page, accessible from AINews (if it was chosen) or from AllNews, where one may read why the article was or was not chosen (including categorization scores, duplication checking results, etc.).
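
The sorting and selection rules of the Publish stage can be sketched as below. The limits of 12 articles and six articles per topic come from the description above; the function name and the dictionary keys `source_relevance` and `topics` are assumptions made for illustration, and a larger `source_relevance` is assumed to mean a more relevant source.

```python
from collections import Counter

MAX_ARTICLES = 12  # total articles published per run (from the text)
TOPIC_CAP = 6      # a topic is "full" once it holds this many articles


def choose_articles(articles, max_articles=MAX_ARTICLES, topic_cap=TOPIC_CAP):
    """Greedy selection: best-ranked first, skipping articles that touch a full topic."""
    # Sort by source relevance, then by number of topics (more first).
    ranked = sorted(
        articles,
        key=lambda a: (-a["source_relevance"], -len(a["topics"])),
    )
    per_topic = Counter()
    chosen = []
    for article in ranked:
        if len(chosen) == max_articles:
            break
        # Skip the article if any of its topics is already full.
        if any(per_topic[topic] >= topic_cap for topic in article["topics"]):
            continue
        chosen.append(article)
        per_topic.update(article["topics"])
    return chosen
```

Because an article is skipped as soon as any one of its topics is full, a highly ranked article can lose its place to a lower-ranked one on a less crowded topic, which is how the restriction encourages breadth.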
Page last modified on March 18, 2012, at 07:10 PM