Date of Award

Fall 8-1-2003

Document Type

Thesis

Degree Name

Master of Science in Information Systems (MSIS)

First Advisor

Stephen Krebsbach

Second Advisor

Terry Dennis

Third Advisor

Zehai Zhou

Abstract

The amount of data being collected for both business and scientific purposes is ever increasing. The vastness of the data collected makes it humanly impossible to sift through the data and discover the hidden patterns. This calls for the deployment of tools that mine knowledge from the data. Over the years, research from such diverse fields as artificial intelligence, database theory, and statistics has contributed to the development of several approaches and algorithms for data mining. A proliferation of tools, both commercial and open-source, that aid end users in their data mining activities has become available. Business users with little or no prior experience in data mining are now able to apply these tools to help them in their decision-making process. Given the multitude of tools available for data mining, a naive user who seeks to employ one or more of them faces several questions and issues concerning the choice of such tools. This project investigates some of the issues involved not only in obtaining and using the tools, but also in preprocessing the data on which these tools are run and in analyzing the results they produce. This is done by looking at two open-source tools, ROSETTA and Apriori. The data used for analysis is also available in the public domain at the UCI repository of data useful for analyzing machine learning algorithms and tools. The use of these tools may require substantial preprocessing of the user's data. It is also essential for a naive user to understand the principles on which these tools are based and the contexts in which they are applicable. It is also important to understand. This project investigates these issues through the use of two well-known approaches to knowledge discovery in databases (KDD): Rough Sets and Association Rule Mining. It demonstrates the use of two open-source tools that are based on these two approaches by applying the tools to two data sets that are also available in the public domain. It is shown that these open-source tools may not be appropriate for the novice user when applied to the KDD process.

Comments

dsu-th-268
