
Article Information

  • Title: Improving classification with Cost-sensitive metaclassifier.
  • Authors: Muntean, Maria; Valean, Honoriu; Ileana, Ioan
  • Journal: Annals of DAAAM & Proceedings
  • Print ISSN: 1726-9679
  • Year: 2010
  • Issue: January
  • Language: English
  • Publisher: DAAAM International Vienna
  • Keywords: Algorithms; Classification; Classification (Library Science); Machine learning

Improving classification with Cost-sensitive metaclassifier.


Muntean, Maria; Valean, Honoriu; Ileana, Ioan


1. INTRODUCTION

Most real-world data are unbalanced in the proportion of samples available for each class, which can cause problems such as overfitting or poor relevance of the learned model. The Support Vector Machine (SVM), proposed by Vapnik and his colleagues in the 1990s [Vapnik, 2000], is a machine learning method based on Statistical Learning Theory, and it is widely used for regression, pattern recognition, and probability density estimation due to its simple structure and excellent learning performance. Joachims validated its outstanding performance in text categorization in 1998. SVM can also overcome the overfitting and underfitting problems [Hong et al., 2009], and it has been used for imbalanced data classification [Li et al., 2009].

An SVM is an algorithm that uses a nonlinear mapping to transform the original training data into a higher-dimensional space. Within this new space, it searches for the optimal linear separating hyperplane. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors and margins.
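
As a quick illustration of this idea (a minimal sketch on synthetic data, not the authors' experimental setup), the following Python snippet fits a soft-margin SVM with an RBF kernel, which performs such a nonlinear mapping implicitly:

# Illustrative sketch only: soft-margin SVM with an implicit nonlinear mapping.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A synthetic, imbalanced two-class problem (roughly 80% / 20%).
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=1)

clf = SVC(kernel="rbf", C=1.0)   # C controls the softness of the margin
clf.fit(X, y)
print("Support vectors per class:", clf.n_support_)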

2. COST-SENSITIVE CLASSIFICATION

In practical applications, misclassifying samples of different classes often causes different degrees of harm. The solution proposed in the literature is the Cost-sensitive SVM approach [He et al., 2009], a method for unbalanced classification.

Fundamental to the Cost-sensitive learning methodology is the concept of the cost matrix. This approach takes the misclassification cost into account and aims to minimize it. Instead of creating balanced data distributions through different sampling strategies, Cost-sensitive learning targets the imbalanced learning problem by using cost matrices that describe the cost of misclassifying any particular sample. A very useful tool, the confusion matrix for two classes, is shown in Table 1.

The true positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when the outcome is incorrectly predicted as 1 (or positive) when it is actually 0 (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

In addition, the accuracy measure may be defined. It represents the ratio between the correctly classified instances and the total number of classified instances, both correct and incorrect. This measure is defined as:

$Acc = \frac{TP + TN}{TP + TN + FP + FN}$ (1)
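
The quantities in Table 1 and Eq. (1) can be computed directly, and a cost matrix attaches a price to each cell; the snippet below uses a hypothetical cost matrix (an illustration, not the values used in the paper) in which a false negative costs five times as much as a false positive:

import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

TP = int(np.sum((y_pred == 1) & (y_true == 1)))   # correctly predicted 1
TN = int(np.sum((y_pred == 0) & (y_true == 0)))   # correctly predicted 0
FP = int(np.sum((y_pred == 1) & (y_true == 0)))   # predicted 1, actually 0
FN = int(np.sum((y_pred == 0) & (y_true == 1)))   # predicted 0, actually 1

acc = (TP + TN) / (TP + TN + FP + FN)             # Eq. (1)

# Hypothetical cost matrix, cost[actual][predicted]: correct answers cost 0,
# a false negative (actual 1, predicted 0) costs 5x a false positive.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
total_cost = FP * cost[0, 1] + FN * cost[1, 0]
print(f"Acc = {acc:.2f}, total misclassification cost = {total_cost}")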

3. DESCRIPTION OF THE MULTIPLYING ALGORITHM USED

This paper introduces an algorithm named Enhancer, aimed at increasing the TP of the underrepresented classes of a dataset using Cost-sensitive classification and SVM.

Experimentally, we found that the features that help raise the TP of a class are the cost matrix and the number of instances the class has. The latter can be modified by multiplying the number of instances of that class that the dataset initially has.
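
A minimal sketch of this multiplication step is shown below; it assumes (the paper does not specify this detail) that fractional factors are realized by sampling random extra copies:

import numpy as np

def multiply_class(X, y, cls, factor, seed=1):
    # Append factor * (class size) extra copies of class `cls` to the data.
    rng = np.random.default_rng(seed)
    idx = np.where(y == cls)[0]
    extra = rng.choice(idx, size=int(round(factor * len(idx))), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])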

The Enhancer algorithm is detailed in the following pseudocode:

1. Read and validate input;

2. For all the classes that are not well represented:
BEGIN
  Evaluate the class with no instances added
  Evaluate the class at the maximum multiplication rate
  Evaluate the class at half the maximum rate
  REPEAT
    Flag = False
    Evaluate the intervals (beginning, middle) and (middle, end)
    IF the end condition is met THEN
      Flag = True
    IF the first interval gives better results THEN
      keep it, ELSE keep the other interval
    Evaluate the class after multiplying its instances by the middle factor
  UNTIL Flag = True
END

3. Multiply all the classes by the best factor obtained;

4. Evaluate the dataset.

The Enhancer algorithm described in the pseudocode uses a divide-and-conquer ("Divide et Impera") technique that searches the space from zero multiplication to the maximum multiplication for the optimal multiplier for the class. The algorithm stops its search under either of two circumstances:

* The granularity becomes too fine, i.e., the difference between the beginning and the end of an interval is very small (under a set epsilon). This constraint keeps the algorithm from wandering among solutions that differ from one another by a very small amount (< $10^{-2}$).

* The modulus of the difference between the $\Delta TP + \Delta Acc$ of the first and the second interval is no longer bigger than a known value, i.e., the two halves are statistically indistinguishable. This value is considered to be the deviation of the accuracy added to the deviation of the TP of that class:

$\mu = \sigma_{Acc} + \sigma_{TP}$ (2)

With 10-fold cross-validation, the dataset was randomized and stratified using an integer seed that took values in the range 1-10. The algorithm evaluated the dataset 10 times, each time with a different test set.
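
Putting together the pseudocode, the two stopping rules, and this cross-validation protocol, one plausible reading of the Enhancer search is the interval-halving sketch below; the parameter names, the scoring function (accuracy plus the TP rate of the class), and the exact form of the stopping tests are illustrative assumptions, not the authors' code:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def evaluate(X, y, cls, factor, seed=1):
    # Score one multiplication factor with stratified 10-fold CV.
    # Returns (mean Acc + mean TP rate of cls, sigma_Acc + sigma_TP), cf. Eq. (2).
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accs, tprs = [], []
    for tr, te in skf.split(X, y):
        idx = np.where(y[tr] == cls)[0]
        extra = rng.choice(idx, size=int(round(factor * len(idx))), replace=True)
        Xtr = np.vstack([X[tr], X[tr][extra]])
        ytr = np.concatenate([y[tr], y[tr][extra]])
        pred = SVC(kernel="rbf").fit(Xtr, ytr).predict(X[te])
        accs.append(np.mean(pred == y[te]))
        tprs.append(np.mean(pred[y[te] == cls] == cls))
    return np.mean(accs) + np.mean(tprs), np.std(accs) + np.std(tprs)

def enhancer_search(X, y, cls, max_factor=10.0, eps=1e-2, seed=1):
    # Divide-and-conquer search over the multiplication factor in [0, max].
    lo, hi, cache = 0.0, max_factor, {}
    def score(f):
        if f not in cache:
            cache[f] = evaluate(X, y, cls, f, seed)
        return cache[f]
    while hi - lo > eps:                          # stop 1: interval under epsilon
        mid = (lo + hi) / 2.0
        q_lo, _ = score(lo)
        q_mid, mu = score(mid)                    # mu = sigma_Acc + sigma_TP
        q_hi, _ = score(hi)
        left, right = q_mid - q_lo, q_hi - q_mid  # Delta TP + Delta Acc per interval
        if abs(left - right) < mu:                # stop 2: halves indistinguishable
            break
        if left > right:                          # keep the better-improving half
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0                        # best multiplier found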

4. EXPERIMENTAL RESULTS

For the evaluation, we used the Pima dataset, obtained from the online UCI Machine Learning Repository. The class distribution for this dataset is illustrated below (Fig. 2):

In order to improve the classification of the weakly represented classes, which occur in very small numbers with respect to the other classes, two approaches were tested:

* Cost-sensitive classification

* Multiplication of the instances of weakly represented classes

A. Cost-sensitive classification

By modifying the cost matrix, we obtained quite a large variation in the TP of Class 0 and were able to change the accuracy level from 41% to 89% (Fig. 3).

[FIGURE 3 OMITTED]
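
The cost-sensitive runs rely on a cost matrix wrapped around the base SVM; as a rough analogue (an assumption about tooling, not the authors' exact setup), per-class weights in scikit-learn's SVC rescale the misclassification penalty in a similar way, and sweeping the weight of Class 0 exposes the same TP/accuracy trade-off:

# Rough analogue of cost-sensitive classification via per-class weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Placeholder data standing in for Pima (the paper loads it from UCI).
X, y = make_classification(n_samples=768, weights=[0.65, 0.35], random_state=1)

for w in (1, 2, 5, 10):
    # Weighting Class 0 more is akin to raising its misclassification cost.
    clf = SVC(kernel="rbf", class_weight={0: w, 1: 1})
    pred = cross_val_predict(clf, X, y, cv=10)
    print(f"weight(Class 0) = {w:2d}: "
          f"TP(Class 0) = {np.mean(pred[y == 0] == 0):.2f}, "
          f"Acc = {np.mean(pred == y):.2f}")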

The TP of Class 0 and the TP of Class 1 evolve almost complementarily (when one rises, the other falls, and vice versa). The TP of Class 1 also appeared to be bounded at about 96%.

[FIGURE 4 OMITTED]

B. Multiplying underrepresented classes

After applying the class multiplications, the TP of Class 1 went through a zone of instability until the multiplying factor reached 1.0, at which point its ascent stabilized. The problem in this case is that the accuracy dropped or remained constant while the TP of the class reached its maximum value (Fig. 5).

[FIGURE 5 OMITTED]

5. CONCLUSIONS

We observed that the Enhancer classifier performed better than the Cost-sensitive classifier (Fig. 6) and that, with the new algorithm, the TP of certain classes of interest increased significantly while the overall accuracy stayed in the desired range.

[FIGURE 6 OMITTED]

We have also discovered that, by oversampling the instances of the class of interest, we help the SVM algorithm overcome the soft margin. As an effect, it classifies future instances of this class of interest better.

This solution is especially important when it matters far more to classify the instances of one class correctly, and no harm is done if, in the process, some instances of the other classes are classified as belonging to it.

6. REFERENCES

He, H. & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 9, September 2009, ISSN 1041-4347

Hong, M.; Yanchun, G.; Yujie, W. & Xiaoying, L. (2009). Study on classification method based on Support Vector Machine, 2009 First International Workshop on Education Technology and Computer Science, pp. 369-373, March 7-8, 2009, ISBN 9781424435814, Wuhan, China

Li, Y.; Danrui, X. & Zhe, D. (2009). A new method of Support Vector Machine for class imbalance problem, 2009 International Joint Conference on Computational Sciences and Optimization, pp. 904-907, April 24-26, 2009, ISBN 9780769536057, Hainan Island, China

Vapnik, V. N. (2000). The nature of statistical learning theory, New York: Springer-Verlag, 2000, ISBN 9780387987804

*** (2010). http://archive.ics.uci.edu/ml/, University of California Irvine, UCI Machine Learning Repository, Accessed on: 2010-06-15
Tab. 1. Confusion matrix for a two-class problem

                             Predicted Class
                             Class = 1    Class = 0
Actual Class    Class = 1    TP           FN
                Class = 0    FP           TN

Fig. 2. Pima dataset class distribution (pie chart): Class 0 - 500 instances, Class 1 - 268 instances.