What is data mining?



Data mining (the excavation of knowledge from data)
Example of the methodology of data mining

Data mining (the excavation of knowledge from data) is the process of discovering and studying new knowledge based on the patterns learned and accumulated in a large-scale database that serves as the knowledge base. A data mining system is a knowledge acquisition system that excavates such knowledge from the database and stores the acquired information inside and outside the computer as a knowledge base. It also aims to generate new knowledge with a minimum of human intervention. Data mining is also called KDD (an abbreviation of Knowledge Discovery in Databases).



[Figure 2. Problems arising from a large-scale database: massive data raises the problem of processing efficiency, and complicated data the problem of human recognition; data mining addresses both.]



   Two important problems must be solved in order to handle a large-scale database. The first is the quantity of data and the second is the qualitative issue of the records; both are explained along Figure 2. The former problem arises as the volume of data accumulated in a firm becomes enormous. For example, in the enterprise called “Pharma” described in Cases 1 and 2, the stored sales records alone exceed 60 gigabytes within a single year, so technical methods that improve processing efficiency must be developed. It is also necessary to develop expressive structures that human beings can recognize, because perceiving such enormous data is beyond a person's ability.

    The latter problem arises from the complicated relations that exist among the data and attributes within such a huge data set. The patterns or rules that may hold among attributes and data form a practically infinite number of combinations, because their relations are so complicated. Since it is impossible to inspect all of them, approximation methods that improve processing efficiency are necessary. It is also difficult for a human being to find meaning in the discovered patterns when they are so complex. Consequently, the development of expressive structures for the patterns, which make it possible to recognize them and create meaning from them, is required.

    Data mining takes as its subject the problems that occur in handling a large-scale database. In short, it is required to solve problems such as processing efficiency and the limits of human recognition, which arise as the scale and complexity of the data grow. The technical approach of data mining aims to develop practical methods that achieve both efficiency and efficacy in analyses that use a large-scale database.


Example of the methodology of data mining


 

 1. Knowledge Presentation

    Data mining is the process of transforming voluminous, complicated data into patterns or rules that are comprehensible to a human being. The massive, complex data therefore needs to be transformed into an expression that a person can understand. Knowledge presentation is the framework that converts a large amount of intricate data into particular data or procedures that human beings can grasp, according to a given intention. Data mining is thus the process of transforming enormous data into a certain knowledge presentation in which a human being finds meaning, and developing such knowledge presentations according to a purpose is one of the important roles of data mining.

For example, there is the knowledge presentation called a decision tree. This is a framework for classifying a large amount of data in which the results of judgments about conditions are displayed as branches and the whole data set is rendered as a tree structure (see Figure 3). Fukuda, Morimoto, Morishita and Tokuyama (1996) also expressed the distribution of data by using color and succeeded in visualizing voluminous data. Such knowledge presentations make it possible to express human knowledge freely (Anzai, 1989), and they play an important role in the discovery and development of knowledge.
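As a brief illustration of how a tree-structured presentation turns raw records into rules a person can read, the following Python sketch fits a small decision tree to hypothetical purchase data and prints its branches as readable conditions; the feature names, the data values, and the use of scikit-learn are assumptions made for illustration, not part of the cases discussed in the text.

```python
# Minimal decision-tree sketch on hypothetical purchase records
# (assumes scikit-learn is available; data values are invented).
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [age, store visits per month]; label: bought the product (1) or not (0).
X = [[25, 1], [34, 4], [41, 2], [29, 6], [52, 3], [47, 8]]
y = [0, 1, 0, 1, 0, 1]

# Each branch of the fitted tree is a judgment on a condition; the whole
# structure is the tree-shaped knowledge presentation described above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Render the branches as readable rules.
print(export_text(tree, feature_names=["age", "visits_per_month"]))
```

Printing the tree in this textual form is one way such a presentation lets a person see at a glance which conditions the classification actually depends on.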


2. Evaluation Standard

    In order to discover innovative patterns or interesting rules effectively from a large amount of data, a standard for appreciating these findings must be established. A vast number of patterns and rules exist in a database, and it is very difficult for a person to evaluate them from every angle one by one. By giving a simple criterion for estimation, such judgment and evaluation can be carried out by a computer, and the efficient discovery of beneficial patterns or rules becomes possible when a human being judges or evaluates only the favorable results acquired from the computer-driven evaluation (Matheus, Chan, and Piatetsky-Shapiro, 1993).

    Take Confidence and Support from the association rule (Agrawal, Imielinski and Swami, 1993) as an example of such an evaluation standard. An association rule indicates that a certain phenomenon B occurs when a certain condition A holds. Many patterns and rules of this form exist, such as a high probability that clients who purchase a cheeseburger also buy an apple pie, or that many women in their 30s buy 100 or more diapers. Confidence and Support are the evaluation standards used to estimate how likely such patterns and rules are to appear. Confidence indicates the proportion of the data satisfying condition A in which phenomenon B also occurs; by looking at Confidence, the probability that the pattern holds is evaluated. However, even when a pattern appears to be certain, its significance cannot be determined right away, because the data satisfying condition A might be only a tiny part of the whole. Support indicates the proportion of all the data that satisfies condition A and exhibits phenomenon B, which makes it possible to evaluate how important the discovered pattern is within the whole.
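The two measures can be stated concretely with a short sketch. The following Python example computes Support and Confidence for a hypothetical rule "buys a cheeseburger => buys an apple pie" over a small invented list of transactions; both the rule and the data are illustrative assumptions, not figures from the text.

```python
# Support and Confidence for an association rule "A => B"
# over a small, invented set of transactions.
transactions = [
    {"cheeseburger", "apple pie", "cola"},
    {"cheeseburger", "apple pie"},
    {"cheeseburger", "salad"},
    {"salad", "cola"},
    {"cheeseburger", "apple pie", "salad"},
]

def support_and_confidence(data, a, b):
    n = len(data)
    n_a = sum(1 for t in data if a in t)              # rows satisfying condition A
    n_ab = sum(1 for t in data if a in t and b in t)  # rows satisfying both A and B
    support = n_ab / n       # share of the whole data in which the rule holds
    confidence = n_ab / n_a  # share of the A-rows in which B also occurs
    return support, confidence

print(support_and_confidence(transactions, "cheeseburger", "apple pie"))
# -> (0.6, 0.75): the rule covers 60% of all transactions, and 75% of
#    cheeseburger buyers also bought an apple pie.
```

A rule with high Confidence but very low Support is exactly the case warned about above: it looks certain, yet it applies to too small a part of the data to matter.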

 
3. Algorithm

    Various presentations are conceivable for the knowledge to be discovered, and a suitable evaluation standard must be used to find characteristic patterns in such varied knowledge presentations. In addition, algorithms must be developed to process these problems efficiently, and this becomes especially important when handling voluminous data. In the example of Figure 3, if there are 2 attributes that exert an influence on age and each attribute is split into 2 classes, the tree has only 2^2 = 4 leaves, a comparatively small number. However, when there are 10 such attributes, the tree has 2^10 = 1,024 leaves, and the number of leaves grows exponentially as the attributes increase. It is therefore conceivable that the computational efficiency of processing voluminous data becomes a significant issue in real business.
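The scale of this growth can be seen with a very small calculation. The numbers below simply evaluate 2^k for a few attribute counts; they illustrate the exponential growth noted above and are not measurements from any particular system.

```python
# Number of leaves in a fully grown tree with k binary attributes: 2**k.
for k in (2, 10, 20, 30):
    print(f"{k:2d} binary attributes -> {2 ** k:,} leaves")
# 2 -> 4, 10 -> 1,024, 20 -> 1,048,576, 30 -> 1,073,741,824
```

Examining every leaf therefore becomes infeasible long before the attribute counts found in real business data are reached, which is why efficient algorithms are indispensable.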