# GhostMiner - Main Functions

### GhostMiner Developer

GhostMiner Developer is the tool for data-model designers and developers who use databases to train, test, and select useful models. Using this program requires a **good knowledge of statistical analysis** and some familiarity with methods of computational intelligence. Developer supports each step of the data mining process, which is represented by a single project. Each project is based on the analysis of data unique to a specific problem. The analysis starts from raw data and moves on to encompass data preprocessing, feature selection, model learning, model analysis, and model testing.

The project's output is a model of the knowledge inherent in the analyzed data. The model can then be used as a support tool in decision-making processes.

#### Data preprocessing

Since data mining consists of the analysis of data, it is natural to begin the mining process by looking at the data itself. In GhostMiner, information about the data is provided in two ways.

First, purely statistical information about the data is given, such as:

- the number of vectors
- the number of classes
- the number of features
- the number of vectors per class
- the minimum, average, and maximum values of each feature
- the variance of each feature
- the number of missing values for each feature

The data may be viewed in its original or pre-processed (standardized or normalized) form, with the vectors ordered according to selected feature values or filtered to show only vectors belonging to selected classes.

Second, a quick evaluation of the data using charts and diagrams is provided; GhostMiner allows the data to be viewed from several angles.

In the N-dots representation, values for each feature and each class are presented, allowing the viewer to judge the usefulness of each feature for separating the data vectors into classes. The data samples are projected onto one dimension, corresponding to one of the features.

Projections of pairs of features that look promising in the N-dots charts are worth examining, since they may reveal interesting clustering properties. The 2D plots allow observation of such two-dimensional projections.

##### Feature selection

The feature selection models facilitate manual and/or automatic selection from all the features available in the dataset. They are especially important when the number of features is huge and only some of them have an actual impact on the constructed models. The selection commands are simple and easy to understand. The following methods are implemented:

- Manual feature selection
- Correlation-coefficient-based feature selection
- Feature selection wrappers, in which any of the classification models listed below can be “wrapped”
- Feature selection committee

Automatic feature selection methods provide a ranking of the features according to their importance.
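GhostMiner's internal implementation is not exposed, but the idea behind correlation-based feature ranking can be sketched in plain Python. The function names and toy data below are illustrative, not part of GhostMiner:

```python
# Illustrative sketch (not GhostMiner's code): rank features by the absolute
# Pearson correlation between each feature and the class label.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5)

def rank_features(data, labels):
    """Return feature indices ordered by decreasing |correlation| with the label."""
    n_features = len(data[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in data]
        scores.append((abs(pearson(column, labels)), j))
    return [j for _, j in sorted(scores, reverse=True)]

# Toy dataset: feature 0 tracks the label, feature 1 is noise.
data = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
print(rank_features(data, labels))  # [0, 1] -- feature 0 ranked first
```

Methods such as the wrappers or the feature selection committee use different importance criteria, but they produce the same kind of ranked feature list.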

##### Model learning

There is no single algorithm that will achieve the best results on all data. For that reason GhostMiner provides several different types of data mining algorithms:

- For classification:
  - IncNet Neural Network
  - FSM Neurofuzzy System
  - SSV Decision Tree
  - Support Vector Machine (SVM)
  - k-Nearest Neighbors algorithm
- For clustering:
  - Dendrograms method
  - Support vector clustering

Unfortunately, there are no simple rules to tell which algorithm should be employed for which particular problem to obtain the best results. In order to build the most accurate model, one should apply all available algorithms to the data at hand. By comparing the results obtained with the different algorithms, one can decide which is best. Using the K-classifier approach or building a committee of models may further increase the accuracy of the results.

If the goal is not only to create the most accurate predictive model but also to gain some understanding of the data, models should be based either on the decision tree or on the neurofuzzy system. The SSV decision tree allows the user to follow the branches of the tree to the node a given data vector has been assigned to; it also provides crisp logical rules equivalent to the tree. The FSM neurofuzzy system used with rectangular transfer functions also provides logical rules; used with Gaussian or triangular transfer functions, FSM creates a fuzzy description of the data.

The k-NN algorithm also provides some understanding of the data, showing the most similar cases in the reference database. Feature selection, or weighting and optimization of the distance function, allows the user to pinpoint those aspects of the data that decide to which class a given vector should be assigned.
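As a rough sketch of the idea (not GhostMiner's implementation), a k-NN classifier can both assign a class and report the most similar reference cases:

```python
# Illustrative sketch (not GhostMiner's code): k-NN classification that also
# reports the most similar reference cases, as described above.
import math
from collections import Counter

def knn_classify(query, reference, labels, k=3):
    """Return the majority label among the k nearest neighbors, plus those neighbors."""
    ranked = sorted(
        (math.dist(query, vec), label) for vec, label in zip(reference, labels)
    )
    neighbors = ranked[:k]  # (distance, class) pairs, nearest first
    majority = Counter(label for _, label in neighbors).most_common(1)[0][0]
    return majority, neighbors

# Toy reference database with two well-separated classes.
reference = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
label, neighbors = knn_classify([0.2, 0.2], reference, labels, k=3)
print(label)  # a
```

Weighting features or changing the distance function simply alters how `math.dist` is computed here, which changes which cases count as "most similar".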

##### Model analysis

Sometimes, even with the best single model, some questionable cases are difficult to classify. To improve the accuracy of results and support detailed analysis of such difficult cases, GhostMiner provides several enhancements:

- Committees of models
- K classifiers
- Model Transform&Classify

**Committees**

It is well known that a combination, or committee, of models may achieve better results. An additional advantage of a committee is that it is more stable than a single model. This manifests itself in better repeatability of results, i.e. a smaller variance of the accuracy.

In GhostMiner a committee is any combination of one or more of the single models described above. The final decision on where to assign a given record is taken by majority voting, but it may be instructive to look at how the different models have voted.
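The voting scheme can be sketched in a few lines of Python (an illustration of majority voting, not GhostMiner's code; the toy threshold "models" are hypothetical):

```python
# Illustrative sketch (not GhostMiner's code): a committee combining single
# models by majority vote, with the individual votes kept inspectable.
from collections import Counter

def committee_predict(models, vector):
    """Each model is a callable mapping a vector to a class label."""
    votes = [model(vector) for model in models]
    decision = Counter(votes).most_common(1)[0][0]
    return decision, votes

# Three toy "models" that disagree near their thresholds.
models = [
    lambda v: int(v[0] > 0.5),
    lambda v: int(v[0] > 0.3),
    lambda v: int(v[0] > 0.8),
]
decision, votes = committee_predict(models, [0.6])
print(decision, votes)  # 1 [1, 1, 0]
```

Returning the individual `votes` alongside the decision mirrors the point made above: it is often instructive to see which models disagreed.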

**K classifiers**

In multiclass cases dividing all the data into classes may be hard. Sometimes the description of the data is simpler if a single class is discriminated against all the other classes. K-classifiers achieve this in a simple way, where K is the number of classes: for each class a separate model of the same type is created, each trying to separate only one class from all the others. All classifiers have to be of the same type, for example SSV decision trees. For datasets with many classes, an advantage of this approach is the creation of simple models for classes that are easy to distinguish and more sophisticated models for classes with complicated decision borders.
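The one-class-versus-the-rest scheme can be sketched as follows. This is an illustration of the general idea, not GhostMiner's code; the nearest-centroid "learner" is a hypothetical stand-in for any binary model type:

```python
# Illustrative sketch (not GhostMiner's code) of the K-classifier scheme:
# one binary model of the same type per class, each separating its class
# from all the others; the strongest response wins.

def train_one_vs_rest(train_binary, data, labels):
    """train_binary(data, binary_labels) -> scoring function; one model per class."""
    models = {}
    for c in sorted(set(labels)):
        binary = [1 if y == c else 0 for y in labels]
        models[c] = train_binary(data, binary)
    return models

def predict_one_vs_rest(models, vector):
    # Each submodel scores "my class vs the rest"; the highest score wins.
    return max(models, key=lambda c: models[c](vector))

# Toy binary learner: score = negative squared distance to the positive centroid.
def centroid_learner(data, binary_labels):
    pos = [v for v, y in zip(data, binary_labels) if y == 1]
    centroid = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda v: -sum((a - b) ** 2 for a, b in zip(v, centroid))

data = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]]
labels = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_rest(centroid_learner, data, labels)
print(predict_one_vs_rest(models, [1.05]))  # b
```

In GhostMiner the binary submodels would be, for example, K separate SSV decision trees rather than this toy learner.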

**Model Transform & Classify**

The Transform & Classify model is a combination of a data transformation model and a classification model. It encapsulates the two models in a single classifier. The task is to classify data after some transformation (e.g. feature selection). Thanks to the encapsulation, the transformation and classification are performed internally by the model without any additional user effort. The Transform & Classify model is a classifier, so it can be used like any other classifier, i.e. it can be validated with crossvalidation or X-test, tested on external data, used in GhostMiner Analyzer, etc. To classify a data vector, a Transform & Classify model first transforms the vector (according to the rules learned by the transformation model) and then applies the classification submodel to the transformed data.
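The encapsulation can be sketched as a minimal wrapper class (illustrative only; the class name, toy transformation, and toy classifier below are hypothetical, not GhostMiner's API):

```python
# Illustrative sketch (not GhostMiner's code): a Transform & Classify wrapper
# encapsulating a learned transformation and a classifier in one model.

class TransformAndClassify:
    def __init__(self, transform, classifier):
        self.transform = transform      # e.g. a learned feature selection
        self.classifier = classifier    # any classification submodel

    def predict(self, vector):
        # The vector is transformed first, then the submodel classifies it.
        return self.classifier(self.transform(vector))

# Toy example: the "transformation" keeps only feature 0,
# the "classifier" thresholds it.
model = TransformAndClassify(
    transform=lambda v: [v[0]],
    classifier=lambda v: "high" if v[0] > 0.5 else "low",
)
print(model.predict([0.7, 9.9]))  # high
```

Because the wrapper exposes the same `predict` interface as a plain classifier, it can be dropped into crossvalidation or testing loops unchanged, which is exactly the point of the encapsulation described above.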

##### Model testing

To assess the success of data mining models we need to determine some criteria. One such criterion is the accuracy of the model. This quantity refers to the degree of fit between the model and the data: it measures how error-free the model's predictions are. A model can be tested on hold-out data or test data. A result on test data allows the estimation of the accuracy that one may expect for a given problem.

GhostMiner provides the following tools for evaluating models:

- Crossvalidation
- X-test
- Confusion matrix

**Crossvalidation** (CV) is a method of estimating the accuracy of a classification or regression model when no test data is provided. The data is divided into several parts; each part in turn is used to test a model while the remaining parts are employed for model training. Training models in crossvalidation mode gives a better picture of generalization, since performance is measured on data the model has not been directly trained on.
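The partitioning scheme can be sketched in plain Python (an illustration of k-fold crossvalidation, not GhostMiner's code; the majority-class learner is a hypothetical toy model):

```python
# Illustrative sketch (not GhostMiner's code) of k-fold crossvalidation:
# each part is used once for testing while the remaining parts train the model.
from collections import Counter

def cross_validate(train, data, labels, k=5):
    """train(data, labels) must return a predict function; returns mean accuracy."""
    fold = len(data) // k
    accuracies = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        predict = train(data[:lo] + data[hi:], labels[:lo] + labels[hi:])
        correct = sum(predict(x) == y for x, y in zip(data[lo:hi], labels[lo:hi]))
        accuracies.append(correct / fold)
    return sum(accuracies) / k

# Toy learner: always predict the most frequent training class.
def majority_learner(xs, ys):
    most_frequent = Counter(ys).most_common(1)[0][0]
    return lambda x: most_frequent

data = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
print(cross_validate(majority_learner, data, labels, k=5))  # 0.8
```

Repeating this whole procedure several times over reshuffled data and collecting the mean and variance of the result is essentially what the X-test described below does.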

**X-test** is an extended crossvalidation that reveals not only the expected accuracy but also the variance of models trained on the given data. In this test the crossvalidation is repeated several times, for example 5 or 10, and the average result and the variance are calculated. This option is rather expensive, so it should be used only on small data sets. For larger data sets, results on a test set are sufficient to estimate the expected accuracy.

The **confusion matrix** shows the numbers of actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong. For classification tasks, for example, the matrix is computed to determine which kinds of misclassification occur in the test data. The sensitivity and specificity of the results are easily calculated from the confusion matrix entries.
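The computation can be sketched as follows (illustrative only, not GhostMiner's code; the toy actual/predicted labels are made up for the example):

```python
# Illustrative sketch (not GhostMiner's code): a confusion matrix of actual vs
# predicted classes, with sensitivity and specificity for a binary task.

def confusion_matrix(actual, predicted, classes):
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
m = confusion_matrix(actual, predicted, classes=[0, 1])

tp, fn = m[1][1], m[1][0]  # true positives, false negatives
tn, fp = m[0][0], m[0][1]  # true negatives, false positives
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.75 0.75
```

Reading the off-diagonal entries of `m` shows exactly which classes are being confused with which, which is the diagnostic value described above.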

### GhostMiner Analyzer

GhostMiner Analyzer is aimed at end users who are not necessarily experts in computational techniques. The idea is to provide a simple tool for diagnosis, decision support or data classification that a medical doctor, a manager or a chemist would be able to use with ease.

Once models of the data have been created using GhostMiner Developer tools they are stored in project files. Analyzer reads the project file and allows for a detailed evaluation of the results, including estimation of probabilities for different decisions, and visualization of new data in relation to the reference cases.

##### Analysis of a new case

First, a project should be loaded into the Analyzer using the Open icon or by selecting Open from the Project menu. This creates a dataset window, displaying information about the dataset used for training, and the project tree, showing the data models that have been created. The training dataset is used for reference purposes here. Entering the values of features in the Analyzed data window and pressing the Analyze button creates another window, containing class probabilities and some additional information, depending on the model and the parameters of the model chosen to create the project.

If **FSM** (Feature Space Mapping) was used, the result window shows class probabilities and a window with rules. With Gaussian or triangular transfer functions selected as parameters of the FSM model, the rules are fuzzy rather than crisp, and consequently not so easy to interpret. For rectangular functions, crisp logic rules are provided and the rule used for the actual classification is displayed in a larger font.

For **kNN** (k Nearest Neighbors), in addition to class probabilities, all k neighbors, their classes, and their distances are displayed, and the actual vectors are shown in a small dataset window. This tool may be used to find a specific number of vectors in the neighborhood of a given one: it is enough to create a project selecting the kNN model with fixed k and then use it in the Analyzer.

**IncNet** (Incremental Neural Network) always has one submodel per class. Apart from the final probability, it may be useful to look at the vote of each submodel. If the vector is quite novel, every submodel will assign it to a class other than its own, yet after renormalization the final voting will still assign the vector to one of the classes, although the confidence in such a classification should be low.

The **SSV** (Separability Split Value) decision tree selects only one class, but some estimation of the probability that no error has been made can be obtained by looking at the tree itself. The branch that was used for classification is highlighted, and information about the number of vectors from the different classes is given in its leaf. Logical rules are also given, with the relevant rule highlighted.

In the case of a **committee**, all the models used to form it are shown, together with the probabilities calculated for the final decision. Full information about each individual model is shown after clicking on the model in the project tree. It may be instructive to inspect this information and note that some models may have voted differently; look at the rules used and at the most similar cases.

The new case is displayed in a different color than all the other cases in the 2D projections of the data and in the N-dots 3D analysis of the feature/class distribution, accessible from the Dataset window.

It can also be visualized using interactive multidimensional scaling (IMDS) in relation to all other cases in the reference dataset.