Evaluating Regularizers for Estimating Distributions of Amino Acids

Kevin Karplus

This paper makes a quantitative comparison of different methods, called regularizers, for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. The regularizers considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), and Dirichlet mixture regularizers. Each regularizer is evaluated on how well it estimates the distributions of the columns of a multiple alignment, specifically by the expected encoding cost per amino acid using the regularizer and all possible samples from each column. In general, pseudocounts give the lowest encoding costs for samples of size zero, substitution matrices give the lowest encoding costs for samples of size one, and Dirichlet mixtures give the lowest for larger samples. One of the substitution matrix variants, which adds pseudocounts and scaled counts, does almost as well as the best known Dirichlet mixtures, but at a lower computation cost.
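
To make the two simplest regularizers and the evaluation measure concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes the usual textbook definitions: a zero-offset adds one constant to every observed count, pseudocounts add a separate constant per amino acid, and the encoding cost is the cross-entropy in bits between a column's true distribution and the estimate. The function names, the offset value 0.05, and the example sample are hypothetical choices made only for illustration.

```python
import math

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def zero_offset(counts, z):
    """Assumed zero-offset regularizer: add the same constant z to every
    count, then normalize to a probability distribution."""
    total = sum(counts.get(a, 0) for a in AMINO_ACIDS) + z * len(AMINO_ACIDS)
    return {a: (counts.get(a, 0) + z) / total for a in AMINO_ACIDS}


def pseudocount(counts, pseudo):
    """Assumed pseudocount regularizer: add a per-amino-acid constant
    pseudo[a] to each count, then normalize."""
    total = sum(counts.get(a, 0) + pseudo[a] for a in AMINO_ACIDS)
    return {a: (counts.get(a, 0) + pseudo[a]) / total for a in AMINO_ACIDS}


def encoding_cost(true_dist, est_dist):
    """Cross-entropy in bits per amino acid: the expected code length when
    symbols drawn from true_dist are encoded using est_dist."""
    return -sum(p * math.log2(est_dist[a]) for a, p in true_dist.items() if p > 0)


# Hypothetical example: a sample of size one containing a single leucine.
sample = {"L": 1}
est = zero_offset(sample, z=0.05)

# With an empty sample and uniform pseudocounts, the estimate is uniform.
uniform = pseudocount({}, {a: 1.0 for a in AMINO_ACIDS})
```

With an empty sample and uniform pseudocounts, every amino acid gets probability 1/20, so the encoding cost against any true distribution is log2(20) bits. Substitution matrices and Dirichlet mixtures need fitted parameters and are left out of this sketch.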

This page is copyrighted by AAAI. All rights reserved.