Machine Learning & Data Mining FAQ

This section includes a short FAQ about KDD (Knowledge Discovery in Databases) and the relationship between ML (Machine Learning) and DM (Data Mining).

Frequently asked questions on Knowledge Discovery in Databases. I have received a number of requests for the FAQ of this mailing list. Below, I list the most frequent questions and my initial answers. Any comments, corrections, etc. are welcome. With your help, I hope to compile the initial draft of the FAQ list to be posted to other mailing lists.

-- Gregory Piatetsky-Shapiro

********** Questions *****************

Definitions:

1.0 What is Knowledge Discovery in Databases (KDD), Data Mining, etc?

1.1 What is the difference between KDD and Machine Learning?

************ Initial Answers *****************************

1.0 What is Knowledge Discovery in Databases (KDD), Data Mining, etc?

The notion of Knowledge Discovery in Databases (KDD) has been given various names, including data mining, knowledge extraction, data pattern processing, data archaeology, information harvesting, siftware, and even (when done poorly) data dredging. Whatever the name, the essence of KDD is the *nontrivial extraction of implicit, previously unknown, and potentially useful information from data* (Frawley et al. 1992). KDD encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies (see Matheus et al. 1993).

-- Gregory Piatetsky-Shapiro

1.1 What is the difference between Data Mining and Machine Learning?

Knowledge Discovery in Databases (Data Mining) and the part of Machine Learning dealing with learning from examples overlap in the algorithms used and in problems addressed.

The main differences are:

1) Knowledge Discovery (KDD) is concerned with finding *understandable* knowledge, while ML is concerned with improving the performance of an agent. So training a neural network to balance a pole is part of ML, but not of KDD. However, there are efforts to extract knowledge from neural networks, and these are very relevant to KDD.

2) KDD is concerned with very large, real-world databases, while ML has *typically* (but not always) looked at smaller data sets. So efficiency questions are much more important for KDD.

3) ML is a broader field which includes not only learning from examples, but also reinforcement learning, learning with a teacher, etc.

So one can say that KDD is that part of ML which is concerned with finding *understandable* knowledge in large sets of real-world examples.
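As a rough illustration of points 1) and 2), the sketch below makes a single pass over a stream of records and reports only simple IF-THEN rules that a person can read. It is a minimal example under stated assumptions, not a method taken from this FAQ: the function names, the thresholds `min_support` and `min_confidence`, and the toy records are all hypothetical.

```python
from collections import Counter

def mine_rules(rows, target, min_support=2, min_confidence=0.8):
    """Scan rows once and return rules of the form (attr, value, label, support, confidence)."""
    cond_counts = Counter()   # (attr, value) -> number of rows with that value
    pair_counts = Counter()   # (attr, value, label) -> rows with that value and that label
    for row in rows:          # rows can be any iterable, so the data need not fit in memory
        label = row[target]
        for attr, value in row.items():
            if attr == target:
                continue
            cond_counts[(attr, value)] += 1
            pair_counts[(attr, value, label)] += 1
    rules = []
    for (attr, value, label), support in pair_counts.items():
        confidence = support / cond_counts[(attr, value)]
        if support >= min_support and confidence >= min_confidence:
            rules.append((attr, value, label, support, confidence))
    # Most reliable rules first; ties broken by support, then by name for stable output.
    rules.sort(key=lambda r: (-r[4], -r[3], r[0], str(r[1])))
    return rules

def format_rule(rule, target):
    """Render a rule as a sentence a domain expert can check."""
    attr, value, label, support, confidence = rule
    return (f"IF {attr} = {value} THEN {target} = {label} "
            f"(support={support}, confidence={confidence:.2f})")

# Hypothetical toy records standing in for a much larger database table.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]

for rule in mine_rules(ROWS, "play"):
    print(format_rule(rule, "play"))
```

The memory used depends on the number of distinct attribute values rather than on the number of rows, which is the kind of efficiency concern raised in point 2).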

*********************

From: Xindong Wu <xindong@reef.cs.jcu.edu.au>

Date: Sat, 2 Apr 1994 10:42:58 +1000 (EST)

Here are some additions from my own research.

1.0 What is Knowledge Discovery in Databases (KDD), Data Mining, etc?

KDD is a research frontier (Wu 94) for both database technology and machine learning techniques, and has seen sustained research over recent years. It acts as a link between the two fields, thus offering a dual benefit. Firstly, since database technology has already found wide application in many fields, machine learning research stands to gain from this greater exposure and established technological foundation. Secondly, as databases grow in both number and size, the prospect of mining them for new, useful knowledge becomes yet more enticing. Machine learning techniques can augment the ability of existing DBMSs to represent, acquire, and process collections of expertise such as those that form part of the semantics of many advanced applications (e.g., CAD/CAM).

1.1 What is the difference between Data Mining and Machine Learning?

When we integrate machine learning techniques into database systems to implement KDD, we face many problems, such as the need for (1) more efficient learning algorithms, because realistic databases are normally very large and noisy, and (2) more expressive representations for both data (e.g. tuples in relational databases, which represent instances of a problem domain) and knowledge (e.g. rules in a rule-based system, which can be used to solve users' problems in the domain, and the semantic information contained in the relational schemata). Practical KDD systems are expected to include three interconnected phases (Wu 92): (1) translating standard database information into a form suitable for use by learning facilities; (2) using machine learning techniques to produce knowledge bases from databases; and (3) interpreting the knowledge produced to solve users' problems and/or reduce data spaces.
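The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learner is a simple one-attribute rule learner used as a stand-in, not the HCV algorithm cited below, and the function names, schema, and tuples are hypothetical.

```python
from collections import Counter, defaultdict

def translate(schema, tuples, target):
    """Phase 1: turn positional database tuples into (features, label) examples."""
    examples = []
    for tup in tuples:
        record = dict(zip(schema, tup))
        label = record.pop(target)
        examples.append((record, label))
    return examples

def learn(examples):
    """Phase 2: build a tiny knowledge base from the single most predictive attribute."""
    best = None
    for attr in sorted(examples[0][0]):
        table = defaultdict(Counter)          # attribute value -> label counts
        for features, label in examples:
            table[features[attr]][label] += 1
        rules = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        errors = sum(label != rules[features[attr]] for features, label in examples)
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    return {"attribute": best[1], "rules": best[2], "default": default}

def interpret(kb, features):
    """Phase 3: apply the knowledge base to a new case, falling back to the default label."""
    value = features.get(kb["attribute"])
    return kb["rules"].get(value, kb["default"])

# Hypothetical relation: a column list (schema) plus tuples, as a DBMS might return them.
SCHEMA = ("material", "thickness", "process")
TUPLES = [
    ("steel", "thick", "mill"),
    ("steel", "thin", "mill"),
    ("plastic", "thick", "mould"),
    ("plastic", "thin", "mould"),
    ("steel", "thin", "mill"),
]

kb = learn(translate(SCHEMA, TUPLES, target="process"))
print(kb["attribute"], kb["rules"])
print(interpret(kb, {"material": "plastic", "thickness": "thin"}))
```

Keeping the phases as separate functions mirrors the description above: the learner never sees raw tuples, and the interpretation step depends only on the knowledge base it is handed.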

References:

X. Wu, HCV User's Manual (Release 1.0 June 1992), DAI Technical Paper No. 9, 30 pp., Department of Artificial Intelligence, University of Edinburgh, 1992.

X. Wu, The HCV Induction Algorithm, Proceedings of the 21st ACM Computer Science Conference, S.C. Kwasny and J.F. Buck (Eds.), ACM Press, U.S.A., 1993, pp. 168-175.

Dr Xindong Wu
Department of Computer Science
James Cook University
Townsville, Australia Qld 4811
Telephone: +61 (0)77 81-4617
Fax: +61 (0)77 81-4029
Email: xindong@cs.jcu.edu.au