CUNY Ph.D. Program in Computer Science
Technical Reports

TR-2003011
Data Modelling and Description: A Guide to Using the SYLModel
Author(s):  Jayson E. Rome, Alexei D. Miasnikov, Robert M. Haralick
Received Date:  August 20, 2003

Abstract

This document is an introduction to data modelling and description using the SYLModel library. A Generalized Linear Model (GLM) consists of a systematic component that describes the explanatory variables used as predictors, a probability distribution that specifies the random component of the response variable, and a link that describes the functional relationship between the systematic component and the expected value of the random component. Graphical models are a convenient formalism for modelling complex conditional independence relationships between variables.
We are interested in modelling a variety of observed data, including numeric, discrete, and symbolic data. SYLModel is a collection of C++ classes that implement the functionality of GLMs. It includes classes to perform standard linear regression with identity link, Poisson regression with log link, and logistic regression with logit link. Additional tools include classes for constructing and analyzing Gaussian graphical models, contingency table modelling, and data clustering. Basic principles of GLMs are discussed to illustrate the use of various SYLModel components.
A decision tree classifier makes a class assignment through a hierarchical procedure in which each node represents a decision rule, and child and leaf nodes represent refinements of the decision.
We present techniques for constructing such trees using various decision rules under the supervised learning paradigm. Clustering is the unsupervised partitioning of a dataset into clusters in which elements within a cluster are in some sense more similar to each other than to elements not in the same cluster. Projection pursuit is a technique for finding interesting projections of a dataset. We present two projection pursuit clustering algorithms and a method to evaluate the performance of an arbitrary clustering algorithm, given a set of ground truth data with known cluster assignments.
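
As a rough illustration of the three link functions named above (identity, log, and logit), the following minimal C++ sketch maps a linear predictor to an expected response. It assumes a single explanatory variable, and every identifier in it (Link, apply_link, apply_inverse_link, linear_predictor, expected_response) is hypothetical and chosen for illustration; none is taken from the SYLModel library, whose actual class interfaces are not described in this abstract.

```cpp
#include <cassert>
#include <cmath>

// Illustrative names only; none of these are SYLModel identifiers.
enum class Link { Identity, Log, Logit };

// g(mu): maps the expected response mu to the linear predictor eta.
double apply_link(Link link, double mu) {
    switch (link) {
        case Link::Identity:
            return mu;
        case Link::Log:
            assert(mu > 0.0);               // log link needs a positive mean
            return std::log(mu);
        case Link::Logit:
            assert(mu > 0.0 && mu < 1.0);   // logit link needs a probability
            return std::log(mu / (1.0 - mu));
    }
    return mu;  // not reached for valid enum values
}

// g^{-1}(eta): maps the linear predictor eta back to the expected response.
double apply_inverse_link(Link link, double eta) {
    switch (link) {
        case Link::Identity:
            return eta;
        case Link::Log:
            return std::exp(eta);
        case Link::Logit:
            return 1.0 / (1.0 + std::exp(-eta));
    }
    return eta;  // not reached for valid enum values
}

// Systematic component for one explanatory variable: eta = intercept + slope * x.
double linear_predictor(double intercept, double slope, double x) {
    return intercept + slope * x;
}

// Expected value of the response under the chosen link.
double expected_response(Link link, double intercept, double slope, double x) {
    return apply_inverse_link(link, linear_predictor(intercept, slope, x));
}
```

For example, with intercept -1 and slope 0.5, the linear predictor at x = 2 is 0, so the expected response is 0 under the identity link, 1 under the log link, and 0.5 under the logit link. Fitting the coefficients themselves is outside the scope of this sketch.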