Date of Award
Fall 8-1-2003
Document Type
Thesis
Degree Name
Master of Science in Information Systems (MSIS)
First Advisor
Stephen Krebsbach
Second Advisor
Terry Dennis
Third Advisor
Zehai Zhou
Abstract
The amount of data collected for business and scientific purposes is ever increasing. The sheer volume makes it humanly impossible to sift through the data and discover its hidden patterns, which calls for tools that mine knowledge from the data. Over the years, research from fields as diverse as artificial intelligence, database theory, and statistics has contributed to the development of several approaches and algorithms for data mining. A proliferation of tools, both commercial and open source, has become available to aid end users in their data mining activities. Business users with little or no prior experience in data mining are now able to apply these tools in their decision-making process. Given the multitude of tools available, a naive user who seeks to employ one or more of them faces several questions and issues concerning the choice of tool. This project investigates some of the issues involved not only in obtaining and using the tools, but also in preprocessing the data on which they are run and in analyzing the results they produce. This is done by looking at two open-source tools, ROSETTA and Apriori. The data used for analysis is also in the public domain, available from the UCI repository of data sets for evaluating machine learning algorithms and tools. The use of these tools may require substantial preprocessing of the user's data. It is also essential for a naive user to understand the principles on which these tools are based and the contexts in which they are applicable. This project investigates these issues through the use of two well-known approaches to knowledge discovery in databases: Rough Sets and Association Rule Mining.
It demonstrates the use of two open-source tools that are based on these two approaches by applying the tools to two data sets that are also available in the public domain. It is shown that these open-source tools may not be appropriate for the novice user when applied to the knowledge discovery in databases (KDD) process.
Recommended Citation
Mudigonda, Srikanth, "Issues Related to the Use of Open Source Data Mining Tools" (2003). Masters Theses & Doctoral Dissertations. 269.
https://scholar.dsu.edu/theses/269
Comments
dsu-th-268