IJCRR - Vol 02 Issue 10, October, 2010

Pages: 09-15

Date of Publication: 30-Nov--0001



Author: Sunanda Das, Asit Kumar Das

Category: Technology

Abstract: Objective: Our approach reduces a large data set to a smaller one that retains the same features as the full data set, so that the computational cost is lower.
In this paper we present a new approach that first minimizes the data size and then clusters the reduced data. The volume of data to be clustered is increasingly large, and extracting useful information from such data collections is an important issue. A promising technique is rough set theory, a mathematical approach to data analysis that groups objects of interest into similarity classes whose members are indiscernible with respect to some features.
Result and conclusion:
Rough set theory offers two fundamental concepts: reduct and core.
Some basic ideas of rough set theory are presented first, followed by some experimental results.

Keywords: Rough set theory, Data mining, correlation

Full Text:


Data mining is an emerging area of computational intelligence that offers new theories, techniques, and tools for processing large volumes of data. It has gained considerable attention among practitioners and researchers, as evidenced by the number of publications, conferences, and application reports. The growing volume of data available in digital form has accelerated this interest. Data mining relates to other areas, including machine learning, cluster analysis, regression analysis, and neural networks [1-5]. Rough set theory [6] is a relatively new mathematical technique, developed by Pawlak in the 1980s, for quantitatively describing uncertainty, imprecision and vagueness. Classical set theory deals with crisp sets, and rough set theory may be considered an extension of it. The rough set approach has many advantages. An important step in the knowledge discovery process is reducing the dimensionality of the data. Real database systems contain many attributes and records, but in some circumstances only some of the attributes are indispensable. If the dispensable attributes can be eliminated, the complexity of analyzing the data can be greatly reduced. Our algorithm, which is based on rough set theory, consists of two parts: the first performs attribute reduction and the second performs rule extraction.
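The notion of a dispensable attribute can be made concrete with a small sketch. The Python code below is an illustration only, not the authors' implementation: the four-object table and the attribute names a, b, c are hypothetical, and the test used is the standard rough set one, in which an attribute is dispensable when removing it leaves the indiscernibility classes unchanged. The core is then the set of attributes that are not dispensable.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        # Rows with identical values on every attribute in attrs are indiscernible.
        classes[tuple(row[a] for a in attrs)].append(i)
    return sorted(classes.values())

def is_dispensable(rows, attrs, attr):
    """True if dropping attr does not change the indiscernibility classes."""
    rest = [a for a in attrs if a != attr]
    return partition(rows, rest) == partition(rows, attrs)

# Hypothetical toy table (not taken from the paper): 4 objects, 3 attributes.
rows = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 0},
]
attrs = ["a", "b", "c"]

dispensable = [x for x in attrs if is_dispensable(rows, attrs, x)]
core = [x for x in attrs if x not in dispensable]
print("dispensable:", dispensable)  # ['a', 'c']
print("core:", core)                # ['b']
```

In this toy table a and c carry the same information, so either one can be dropped on its own, but not both together; {a, b} and {b, c} are the two reducts, and b, which appears in both, forms the core.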

Dataset: In this experiment we use the wine dataset, which has 13 attributes, each with 178 data values (one per record). Using the 10-fold method, we divide this dataset into 10 pairs of training and test datasets.
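A 10-fold split of 178 records can be sketched as follows. The paper does not say how the folds were formed, so the shuffling, the fixed seed and the function name k_fold_indices are assumptions made for illustration; only the record count (178) and the number of folds (10) come from the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train, test) pairs of record indices.

    Assumption: records are shuffled once with a fixed seed and dealt
    round-robin into k folds; the paper does not describe its procedure.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        # Every fold except the i-th one goes into the training set.
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        splits.append((train, test))
    return splits

splits = k_fold_indices(178, k=10)
for n, (train, test) in enumerate(splits, start=1):
    print(f"fold {n}: {len(train)} training records, {len(test)} test records")
```

Because 178 is not a multiple of 10, eight of the test sets hold 18 records and two hold 17.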



Using the correlation function in MATLAB, we obtain 10 tables, one for each of the 10 training datasets, in which each entry gives the correlation between a pair of attributes.
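For readers without MATLAB, the same kind of table can be sketched in plain Python. This is not the authors' code: it assumes the correlation used is the Pearson coefficient, and the three short columns are hypothetical stand-ins for attribute columns of a training dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length columns.

    A constant column has zero variance and would cause a division by
    zero; this sketch does not handle that case.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Square matrix whose (i, j) entry is the correlation of columns i and j."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical toy columns (not from the wine dataset).
cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # rises with the first column
    [4.0, 3.0, 2.0, 1.0],  # falls as the first column rises
]
R = correlation_matrix(cols)
for row in R:
    print(["%.2f" % v for v in row])
```

Applied to a training dataset with 13 attribute columns, the same function would return a 13 x 13 matrix.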


After completing these steps, we obtain 10 correlation matrices. From the correlation matrix of the first training dataset we obtain the graph shown in Figure 1. A correlation coefficient is a normalized measure of the strength of the linear relationship between two variables and ranges between -1 and 1: -1 means that one column of data has a perfect negative linear relationship with another column, 0 means that there is no linear relationship between the columns, and 1 means that there is a perfect positive linear relationship between them. Figure 1 is a grouped bar graph. The bars in the first group correspond to the first row of the matrix, those in the second group to the second row, and so on. Each group contains 13 bars, each of which represents the correlation between one pair of attributes (row and column). Positive and negative correlation values are drawn on opposite sides of the origin, so each group of bars shows the 13 attributes with their positive or negative values. Figure 1 is thus a graphical representation of the correlation matrix of the first training dataset of the wine dataset.

After obtaining the correlation values between the 13 attributes of the wine dataset, we derive 8 functional dependencies using the threshold value θ = 0.5. Using the closure property, we then get 7 different values to predict the classification of these attributes, i.e. {A,J}, {F,G,L,M,I,K}, {B,C,D,E,H}. Using the information gain formula, we find that the sets {F,G,L,M,I,K} and {B,C,D,E,H} cannot be clustered further. Using the cardinality formula, the cardinality values of the sets {A,J}, {F,G,L,M,I,K} and {B,C,D,E,H} are then computed as 32.9%, 68.4% and 42.7% respectively. The highest of these, 68.4%, belongs to the set {F,G,L,M,I,K}. This means that the reduct set {F,G,L,M,I,K} alone can represent the characteristics of all 13 attributes of this wine dataset.
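The thresholding step can be illustrated with a simplified sketch. It does not reproduce the paper's functional dependencies, closure property, information gain or cardinality formulas, which are not spelled out in the text. Instead it substitutes a plain connected-components grouping: two attributes are linked when the absolute value of their correlation is at least θ, and linked attributes end up in the same set. The use of the absolute value, the `>=` comparison, the function name and the 4 x 4 matrix are all assumptions.

```python
def group_by_threshold(names, R, theta=0.5):
    """Group attributes whose pairwise |correlation| reaches theta.

    Simplified stand-in for the paper's procedure: attributes are merged
    with a union-find structure whenever |R[i][j]| >= theta.
    """
    n = len(names)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(R[i][j]) >= theta:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(names[i])
    return sorted(groups.values())

# Hypothetical correlation matrix for four attributes (not the paper's data).
names = ["A", "B", "C", "D"]
R = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, -0.7],
    [0.0, 0.1, -0.7, 1.0],
]
groups = group_by_threshold(names, R, theta=0.5)
print(groups)  # [['A', 'B'], ['C', 'D']]
```

With θ = 0.5, A and B are linked by their correlation of 0.8, and C and D by their correlation of -0.7, so the four attributes fall into two sets.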


In this paper, the basic concepts of data mining and rough set theory were discussed. The patterns formed by rules extracted with rough set theory differ from other patterns. The method was illustrated with a numerical example. It shows that, instead of handling a large volume of data, we can work with a small data set that carries the same meaningful information and characteristics as the whole data. The method enhances the utility of the extracted knowledge and reduces time complexity. It could be further improved by bringing the cardinality value close to 100%.


