Proceedings:
Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (ISMB), 1995
Abstract:
This paper makes a quantitative comparison of different methods, called regularizers, for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. The regularizers considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), and Dirichlet mixture regularizers. Each regularizer is evaluated by how well it estimates the distributions of the columns of a multiple alignment: specifically, by the expected encoding cost per amino acid using the regularizer and all possible samples from each column. In general, pseudocounts give the lowest encoding costs for samples of size zero, substitution matrices give the lowest encoding costs for samples of size one, and Dirichlet mixtures give the lowest encoding costs for larger samples. One of the substitution matrix variants, which adds pseudocounts and scaled counts, performs almost as well as the best-known Dirichlet mixtures, but at a lower computational cost.
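To make the pseudocount and encoding-cost ideas concrete, here is a minimal sketch in Python (NumPy). It is not the paper's code: the uniform background frequencies, the pseudocount strength `alpha`, and the toy column distribution are illustrative assumptions, not values from the paper.

```python
import numpy as np

N_AMINO_ACIDS = 20

def pseudocount_estimate(counts, background, alpha=1.0):
    """Estimate a distribution from a small sample of amino-acid counts
    by adding pseudocounts proportional to background frequencies.
    `alpha` (the total pseudocount mass) is an assumed tuning parameter."""
    counts = np.asarray(counts, dtype=float)
    pseudo = alpha * np.asarray(background, dtype=float)
    return (counts + pseudo) / (counts.sum() + pseudo.sum())

def encoding_cost(true_dist, estimate):
    """Encoding cost in bits per amino acid: the cross-entropy of the
    estimated distribution against the true column distribution."""
    true_dist = np.asarray(true_dist, dtype=float)
    return -(true_dist * np.log2(estimate)).sum()

# Toy usage: draw a 'true' column distribution, take a sample of size one
# from it, and measure the cost of encoding the column with the estimate.
rng = np.random.default_rng(0)
background = np.full(N_AMINO_ACIDS, 1.0 / N_AMINO_ACIDS)  # uniform, assumed
true_col = rng.dirichlet(np.full(N_AMINO_ACIDS, 0.5))     # a peaked column
sample = rng.multinomial(1, true_col)                     # sample of size one
p_hat = pseudocount_estimate(sample, background, alpha=1.0)
print(f"encoding cost: {encoding_cost(true_col, p_hat):.3f} bits/residue")
```

Averaging this cost over all possible samples from each alignment column, as the abstract describes, is what allows the different regularizers to be ranked at each sample size.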