A text can be represented by a vector of binary weights indicating the presence or absence of terms in a document. The correlation between terms is then quantified by the frequency of co-occurrence, i.e., the inner product of the index term vectors, but this has the limitation of not capturing the full joint relationship among terms. Since each element of the binary document vector is either 0 or 1, these values are treated as quantitative data, and methods based on multidimensional contingency tables and related statistics are used.
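As a minimal illustration (the toy corpus and term indices below are hypothetical), binary document vectors can be built directly, and the co-occurrence of two terms is the inner product of their index vectors:

```python
import numpy as np

# Hypothetical toy corpus: 4 documents over a vocabulary of 5 terms.
# Each row is a binary weight vector (1 = term present, 0 = absent).
docs = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
])

# The co-occurrence of two terms is the inner product of their index
# (column) vectors: the number of documents containing both terms.
cooccurrence = docs.T @ docs          # 5 x 5 matrix of pairwise counts
print(cooccurrence[0, 1])             # documents containing both term 0 and term 1

# Such pairwise counts only capture two-way association; they do not
# describe the joint relationship among three or more terms.
```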
Maximum entropy is one method applicable to such problems, and it can represent a variety of models depending on which interaction terms are included. The maximum entropy model is the basic principle behind the maximum entropy classifier: it builds a model from all known factors while assuming nothing about unknown ones. Its key characteristic is that it does not require the features to be independent of one another, so features useful for the final classification can be added freely, regardless of their mutual influence.
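A common parametric form of the maximum entropy classifier is multinomial logistic regression over binary features. The sketch below is only an illustration under that assumption, using scikit-learn and a hypothetical toy dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary feature vectors (term presence) and class labels.
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])
y = np.array([0, 0, 1, 1])

# Logistic regression is the standard parametric form of the maximum
# entropy classifier: P(c | x) is proportional to exp(sum_i w_ci * f_i(x)).
clf = LogisticRegression()
clf.fit(X, y)

# Overlapping or correlated binary features can be added without an
# explicit independence assumption among them.
print(clf.predict_proba([[1, 0, 0, 1]]))
```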
The maximum entropy model with constraints on binary features is widely used in natural language processing, speaker identification, information retrieval, text filtering, machine translation, etc.
In word sense disambiguation using maximum entropy, given the context, i.e., the neighboring words, the probability of each outcome (each sense of the target word) is estimated, and the distribution of neighboring words for each sense is used to compute this probability. A log-linear or exponential model is used to combine the features. However, when the dependence structure between variables is modeled with maximum entropy, excluding the cases where the joint probability is zero, which would reduce the computational effort in training and prediction, still remains a challenge.
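As a rough sketch of the log-linear formulation (the senses, context features, and weight values below are hypothetical), the probability of each sense given the neighboring-word features is a normalized exponential of weighted feature sums:

```python
import numpy as np

# Hypothetical setup: two senses of a target word and three binary
# context features f_i (presence of a particular neighboring word).
senses = ["sense_A", "sense_B"]
context = np.array([1, 0, 1])          # observed neighboring-word features

# Hypothetical learned weights lambda[s][i] for each (sense, feature) pair.
weights = np.array([
    [1.2, -0.3, 0.8],                  # sense_A
    [-0.5, 0.9, 0.1],                  # sense_B
])

# Log-linear / exponential form: P(s | context) ∝ exp(sum_i lambda_si * f_i).
scores = weights @ context
probs = np.exp(scores) / np.exp(scores).sum()

for s, p in zip(senses, probs):
    print(s, round(float(p), 3))
```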
Pak Il Chol, a researcher at the Faculty of Information Science and Technology, has proposed a method for estimating the joint distribution under marginal distribution constraints based on the maximum entropy principle for document data represented by binary weight vectors, together with its application to document classification and ranking. The proposed method determines the target joint distribution using the marginal distributions obtained from the given data as constraints. He has also proposed a method to search for the maximum dimension of the marginal distributions, excluding cases where the marginal probability is zero.
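One standard way to fit a joint distribution to given marginal constraints under the maximum entropy principle is iterative proportional fitting; the sketch below is only an illustration of that general idea with hypothetical marginals, not necessarily the author's exact procedure:

```python
import numpy as np

# Hypothetical example: estimate a joint distribution over two binary terms
# subject to given one-dimensional marginal constraints, starting from the
# uniform (maximum entropy) distribution.
target_p_x = np.array([0.7, 0.3])      # marginal P(x=0), P(x=1)
target_p_y = np.array([0.4, 0.6])      # marginal P(y=0), P(y=1)

joint = np.full((2, 2), 0.25)          # uniform starting point

for _ in range(100):
    # Rescale rows to match the x-marginal.
    joint *= (target_p_x / joint.sum(axis=1))[:, None]
    # Rescale columns to match the y-marginal.
    joint *= target_p_y / joint.sum(axis=0)

# Cells tied to a zero marginal would have to be excluded beforehand,
# as zero marginals make the rescaling ill-defined.
print(joint)                           # marginals now match the constraints
```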
The simulation results show that the proposed classification method based on marginal distribution estimation underperforms the naive Bayesian approach, which assumes conditional independence, when the dataset is small, but outperforms it as the dataset grows larger. For ranking, he found an improvement of about 4% in MAP over the binary independence retrieval model.
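For reference, MAP (mean average precision) averages, over queries, the precision computed at each rank where a relevant document is retrieved; a minimal sketch with hypothetical relevance judgments:

```python
# Hypothetical relevance judgments for two ranked result lists
# (1 = relevant document, 0 = non-relevant).
rankings = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

def average_precision(ranking):
    # Precision at each rank where a relevant document appears.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# MAP is the mean of the per-query average precisions.
map_score = sum(average_precision(r) for r in rankings) / len(rankings)
print(round(map_score, 3))
```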
© 2021 Kim Chaek University of Technology