VIETNAM NATIONAL UNIVERSITY, HANOI

COLLEGE OF TECHNOLOGY

***

CHU THAI HOA

METHODOLOGY OF

RELATIONAL DATA MINING FOR

STOCK MARKET PREDICTION

Major: Information Technology

Code: 1.01.10

MASTER'S THESIS

Instructor: Prof. Dr. HO TU BAO

Hanoi, June 2007

ABSTRACT

This thesis presents the methodology of relational data mining for stock market
prediction. It first clarifies each issue behind the keywords (methodology,
relational, data mining, stock market, and prediction) and then arrives at the
methodology of relational data mining, with an emphasis on Machine Methods for
Discovering Regularities (MMDR) for stock market prediction.

Stock market prediction has been widely studied as a time-series prediction
problem. Deriving relationships that allow one to predict future values of a time
series is challenging. One approach to prediction is to spot patterns in the past,
when we already know what followed them, and to test them on more recent data. If a
pattern is followed by the same outcome frequently enough, we can gain confidence
that it is a genuine relationship.

The purpose of relational data mining (RDM) is to overcome the limitations of
attribute-based learning methods (commonly used in finance) in representing
background knowledge and complex relations. RDM approaches look for patterns
that involve multiple tables (relations) from a relational database. This approach will
play a key role in future advances in data mining methodology and practice.

The MMDR method is one of the few Hybrid Probabilistic Relational Data Mining
methods developed and applied to stock market data. The method has an advantage
in handling numerical data. It expresses patterns in First-order Logic (FOL) and
assigns probabilities to rules generated by composing patterns. This is made
clear through an application of MMDR, with a computational experiment on price
index data of the Standard and Poor's 500.

The thesis consists of 3 chapters concentrating on relational data mining
methodology for stock market prediction.

Methodology of Relational Data mining for Stock Market Prediction

ACKNOWLEDGEMENTS

This thesis would not have been completed without the help and support of
many people. I would like to take this opportunity to express my gratitude to the
many people who helped me during the development leading to the thesis.

In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his
courage in accepting me as a Master's student, and for his enthusiasm, his knowledge
and his encouragement throughout the work. I would never have been able to finish
this thesis without his encouragement as well as his strict requirements for the
quality of the research.

I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN
Trong Dung, to whom I am also grateful for comments on the thesis. In the early
days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van
Thanh stimulated my interest in data mining for financial forecasting. I am thankful
for that and for the many discussions I had with them.

I am indebted to CFO LE The Anh and CFO NGUYEN Minh Quang for their
patience with my questions on financial and stock market forecasting. I am also
grateful to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh and
Mrs. LE Thi Hoang My for words of encouragement during the months of work on the
thesis and for their style-improving suggestions. My thanks also go to everyone who
has provided support or advice to me on data mining, the stock market, forecasting
and so on in one way or another.

My family has created good conditions for me to complete the thesis. I
dedicate the thesis to my father, my mother and my younger brother, whose love and
support are always with me.

Hanoi, June 2007,

CHU Thai Hoa.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
Problem definition
Motivations of the Thesis
Objectives of the Thesis
Method of the Thesis study
Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DM
I.1. Introduction to stock market prediction
I.1.1. Basic concepts of forecast
I.1.2. Prediction tasks in stock market
I.1.3. Stock market time series properties
I.1.4. Stock market prediction with the efficient market theory
I.1.5. Questions in stock market prediction
I.1.6. Challenges and Possibilities on Developing a Stock Market Prediction System
I.2. Data mining methodology for stock market prediction
I.2.1. Prediction in data mining
I.2.2. Parameters
I.2.3. Approaches to stock market prediction
I.2.4. Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
II.1. Introduction
II.2. Basic problems
II.2.1. First-order logic and rules
II.2.2. Representative measurement theory
II.2.3. Breadth-first search
II.2.4. Occam's razor principle
II.3. Theory of RDM
II.3.1. Data types in RDM
II.3.2. Relational representation of examples
II.3.3. Background knowledge and problems of search for regularities
II.4. An algorithm for RDM: MMDR
II.4.1. Motivations of choice for MMDR
II.4.2. Some concepts
II.4.3. Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
III.1. MMDR model for prediction
III.2. Experiment preparation
III.2.1. Data description and representation
III.2.2. Demo program
III.3. Application of MMDR model
III.3.1. Step 1: Generating logical rules
III.3.2. Step 2: Learning logical rules
III.3.3. Step 3: Creating intervals
III.4. Results and evaluations
III.4.1. Stability of discovered rules on test data
III.4.2. Evaluations of forecast performance
CONCLUSIONS
Contributions of the thesis
Limitations of the thesis
Future work
Summary
APPENDICES
Source code
REFERENCES
In English
In Vietnamese
Website

LIST OF TABLES AND FIGURES

Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1. Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rules consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set of 292 regularities with conditional probability not less than 0.65)

LIST OF ABBREVIATIONS

AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data mining
RDM : Relational Data mining
RMT : Representative measurement theory

INTRODUCTION

Problem definition

There are four major technological reasons stimulating data mining

development, applications and public interest: the emergence of very large databases;

advances in computer technology; fast access to vast amounts of data; and the ability

to apply computationally intensive statistical methodology to these data.

Data mining is the process of discovering hidden patterns in data. Due to the

large size of databases, importance of information stored, and valuable information

obtained, finding hidden patterns in data has become increasingly significant. The

stock market provides an area in which large volumes of data are created and stored

on a daily basis.

Financial forecasting has been widely studied as a case of the time-series prediction
problem. Time series such as those of the stock market are often seen as non-stationary,
which presents challenges in predicting future values. The efficient market theory
states that it is practically impossible to predict financial markets long-term.
However, there is good evidence that short-term trends do exist and programs can be
written to find them. The data miners' challenge is to find the trends quickly while
they are valid, as well as to recognize the time when the trends are no longer
effective. Data mining methods provide the framework for stock market prediction
to discover hidden trends and patterns.

Well-known and commonly used data mining methods in the stock market are
attribute-based learning methods, but they have some serious drawbacks: a limited
ability to represent background knowledge and a lack of complex relations. The
purpose of RDM is to overcome these limitations. RDM is a learning method that is
better suited for stock market mining, with a better ability to explain discovered
rules than other symbolic approaches.

However, current relational methods are relatively inefficient and have rather
limited facilities for handling numerical data. RDM as a hybrid learning method
combines the strengths of FOL and probabilistic inference to meet these challenges.
MMDR, one of the few Hybrid Probabilistic Relational Data Mining methods, handles
numerical data efficiently and has been developed and applied to stock market data.

It is believed that the time is right for RDM methods and that MMDR in particular
has advantages in discovering regularities in stock market time series.

Motivations of the Thesis

In past years, Vietnam's stock market was still at an early stage of
development and thus did not attract much attention from investors and researchers.
For interested learners in particular, mastering professional methods of stock market
analysis and forecasting requires time and a wide background knowledge covering
all the fields involved. Moreover, according to the efficient market theory, it is
practically impossible to infer a fixed long-term global forecasting model from
historical stock market information. Therefore, few Vietnamese have been
interested in, or have performed research on, stock market prediction.

The last two years have witnessed the surprising development of the Vietnamese
stock market, with a host of notable events. In particular, since Vietnam became a
World Trade Organization (WTO) member, the Vietnamese economy has had many
opportunities to develop, leading to the development of many companies and
markets, including the financial and stock markets. It is said that Vietnam's stock
market will grow rapidly in the coming years and will rank second in the region, just
after China, in terms of growth rate.

Under the rapid development of Vietnam's financial market, professional

activities such as analysis and prediction of financial market should be paid more

attention. In particular, these activities play a significant role in the task of macro

economic forecast at the National Center for Socio-economic Information and

Forecast (under the Ministry of Planning and Investment), which helps make sound

policies related to socio-economic management and regulation at macro level. Data

mining provides some methods and techniques that are able to help approach stock

market prediction quite effectively.

In fact, there have already been some studies and successful applications of data
mining techniques to stock market forecasting. However, acquiring the knowledge and
application techniques of each approach is quite challenging and time-consuming. I
read a number of papers and paid particular attention to research on relational data
mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii
Vityaev. They reported that, "Mining stock market data presents special challenges.
For one, the rewards for finding successful patterns are potentially enormous, but so
are the difficulties and sources of confusions. The efficient market theory states that
it is practically impossible to predict financial markets long-term. However, there is
good evidence that short-term trends do exist and programs can be written to find
them. The data miners' challenge is to find the trends quickly while they are valid, to
deal effectively with time series and calendar effects, as well as to recognize the
time when the trends are no longer effective".

The RDM learning method is able to learn more expressive rules, make better
use of underlying domain knowledge and explain discovered rules better than other
symbolic approaches. It is thus better suited for stock market mining. This approach
will play a key role in future advances in data mining methodology and practice.

The earlier algorithms for RDM suffer from relative computational inefficiency
and have rather limited tools for processing numerical data. This problem especially
needs to be considered in stock market analysis, where data are commonly
numerical time series. Therefore, RDM as a hybrid learning method that
combines the strengths of FOL and probabilistic inference has been developed to meet
these challenges. MMDR, one of the few Hybrid Probabilistic Relational Data Mining
methods, handles numerical data efficiently and has been developed and applied to
stock market forecasting.

The common question "Can stock market prediction be profitable?" is often
asked of any research on methods of stock market prediction. In fact, few
people do research on RDM for stock market forecasting, because it requires
interested learners to have a wide background knowledge to understand all the fields
covered. Much less has been reported publicly on the success of data mining in real
trading by financial institutions. If real success is reported, then competitors can
apply the same methods and the leverage will disappear, because in essence all
fundamental data mining methods are not proprietary. I used to concentrate my
study in an attempt to end up with a Master's degree and as a millionaire (kidding), but
this is too high a risk to take.

Based on practical suggestions and requirements, as well as my
personal interest, I decided to do research on stock market forecasting.
Through school lessons and additional self-study, I explored several data
mining techniques to seek a solution to the task. All of the above motivates the aim of
the thesis: to carry out research and experiments on the methodology of RDM for stock
market prediction.

Objectives of the Thesis

- Systematic organization of RDM methodology for stock market prediction

Most of the existing studies on RDM for stock market prediction are reported in
a brief, overview form, which causes difficulties for many readers. The thesis is
primarily based on the book "Data Mining in Finance: Advances in Relational and
Hybrid Methods" and some papers by the two researchers Dr. Kovalerchuk and Dr.
Vityaev. However, after gaining a thorough grasp of the RDM methodology, I
systematically organize the methodology, especially the MMDR algorithm, from my own
point of view and supplement the thesis with further knowledge of data mining and
stock market forecasting. Hopefully, this will help newcomers approach the problem
more easily.

- Experimental application of the MMDR method to stock market prediction

The centre of the thesis is the issue of discovering regularities in stock price series,
addressed and illustrated through MMDR. The thesis also carries out an RDM
application to stock market prediction through an experiment with a small
self-developed program on a set of Standard and Poor's data. The experiment helps
the reader understand, and gain more trust in, the feasibility and efficiency of the
RDM methodology and the MMDR algorithm presented in the thesis.

Method of the Thesis study

The study behind the thesis has been mostly goal driven. As problems appeared

on the way to realizing stock market prediction, they were tackled by various means

as listed below:

• Investigation of some existing machine learning and data mining methods
through related documents such as doctoral theses, Master's theses, online
papers, books, etc.
• Reading of financial and stock market literature for properties, forecast
techniques and hints of regularities in stock market data that can be exploited.
• Learning about some existing stock market prediction software for a deeper
understanding of the regularities discovered.
• Some theoretical considerations on the mechanisms behind the generation of
stock data, and on general predictability demands and limits.
• Practical insights into the realm of trading in the stock market.
• Contacts with experts on data mining and data mining software development,
with stock market investors and chief financial officers.
• Courses on economic forecasting and the stock market, mostly organized by the
National Center for Socio-economic Information and Forecast.
• Collection of related documents and systemization of the Master's thesis.
• Programming in PHP and carrying out experiments to illustrate and to prove the
main idea and algorithm presented in the Thesis.

Structure of the Thesis

The thesis is structured in the following way. The first part introduces the
problem definition, method of study, objectives and structure of the Thesis.

Chapter 1 provides an overview of stock market prediction in data mining

through two following parts. "Introduction to stock market prediction" includes

basic concepts of stock market forecast, data mining with the Efficient Market

Theory, stock market time series properties, and drawbacks and possibilities on

developing a stock market prediction, etc. The last part "Data mining methodology

for stock market prediction" presents some major types of data mining prediction,

approaches to stock market prediction and comparisons on representation languages

and data mining methods used in stock market.

Chapter 2 talks about some basic problems, theory of RDM and an algorithm

MMDR. In comparison with other data mining methods, the RDM approach is

considered from the point of view of its Data Types, Representation Languages

(to manipulate and interpret data) and Class of hypothesis (to be tested on data). One

of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which is

equipped with probabilistic mechanism that is necessary for time series with high

level of noise, is mainly introduced.

In Chapter 3, an MMDR application to stock market price prediction illustrates
the methodology through three steps: rule generating, rule learning and
interval creating. This chapter also presents some statistical results and evaluations
for the experiment conducted to demonstrate the application.

Finally, the contributions, limitations and future work of my research are given in
the concluding part of the thesis. In the appendix, the thesis also provides some
table structures and the source code, developed by myself, that are used for the experiment.

CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING

I.1. Introduction to stock market prediction

I.1.1. Basic concepts of forecast

This section briefly introduces the basic concepts of forecasting. An introductory
discussion of the topic can be found in [46] - Michael Leonard, Large-Scale
Automatic Forecasting: Millions of Forecasts, International Symposium of
Forecasting, 2002.

Forecasts are time series predictions made for future periods in time. They are

random variables and therefore have an associated probability distribution. The

mean or median of each forecast is called the prediction. The variance of each

forecast is called the prediction error variance and the square root of the variance is

called the prediction standard error. The variance is computed from the forecast

model parameter estimates and the model residual variance.

The forecast for the next future period is called the one-step ahead forecast. The

forecast for h periods in the future is called the h-step ahead forecast. The forecast

horizon or forecast lead is the number of periods into the future for which

predictions are made (one-step, two-step,..., h-step). The larger the forecast horizon,

the larger the prediction error variance at the end of the horizon.

The confidence limits are based on the prediction standard errors and a chosen

confidence limit size. A confidence limit size of 0.05 results in 95% confidence

limits. The confidence limits are often computed assuming a normal distribution, but

others could be used. As with the prediction standard errors, the width of the

confidence limits increases with the forecast horizon.
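
The definitions above can be illustrated with a minimal sketch. It assumes a driftless random-walk model with a known one-step residual standard deviation `sigma`; this model and the function name `random_walk_forecasts` are assumptions made only for illustration, not something prescribed by [46]. Under that assumption the h-step prediction is the last observed value and the prediction error variance is h times sigma squared, so the confidence limits widen with the forecast horizon.

```python
from statistics import NormalDist

def random_walk_forecasts(last_value, sigma, horizon, alpha=0.05):
    """Return (step, prediction, standard_error, lower, upper) for steps 1..horizon.

    Assumes a driftless random walk: the prediction stays at last_value and
    the prediction error variance grows linearly with the step.
    """
    # alpha is the confidence limit size; 0.05 gives 95% limits (z is about 1.96).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rows = []
    for h in range(1, horizon + 1):
        std_error = sigma * h ** 0.5  # square root of the prediction error variance
        rows.append((h, last_value, std_error,
                     last_value - z * std_error,
                     last_value + z * std_error))
    return rows

forecasts = random_walk_forecasts(last_value=1400.0, sigma=10.0, horizon=3)
for h, pred, se, lower, upper in forecasts:
    print(f"h={h}  prediction={pred:.1f}  std.err={se:.2f}  95% limits=({lower:.1f}, {upper:.1f})")
```

The numbers 1400.0 and 10.0 are made up; with a fitted model, the standard errors would come from the parameter estimates and the residual variance instead.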

The prediction error is the difference between the predicted value and the actual

value when the actual value is known. For transformed models, it is important to

understand the difference between the model errors (or residuals) and the prediction

errors. The residuals measure the departure from the model in the transformed

metric. The prediction errors measure the departure from the original series.

Taken together, the predictions, prediction standard errors, and confidence

limits at each period in the forecast horizon are the forecasts. Although many people

use the word "forecast" to imply only prediction, a forecast is not one number for

each future time period.

Using a transformed forecasting model requires the following steps:

• The time series data are transformed.

• The transformed time series data are fit using the forecasting model.

• The forecasts are computed using the parameter estimates and the

transformed time series data.

• The forecasts (predictions, prediction standard errors, and confidence limits)

are inverse transformed.

The naive inverse transformation results in median forecasts. Obtaining mean

forecasts requires that the prediction and the prediction error variance both be

adjusted based on the transformation. Additionally, the model residuals will be

different from the prediction errors due to this inverse transformation. If no

transformation is used, the model residual and the prediction error will be the same,

and likewise the mean and median forecast will be the same (assuming a symmetric

disturbance distribution).
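
As a hedged example of the median/mean distinction, suppose the transformation is the natural logarithm and the forecast errors are normal on the log scale (both are assumptions chosen for illustration). The forecast on the original scale is then lognormal: exponentiating the log-scale prediction gives its median, and adding half the log-scale prediction error variance before exponentiating gives its mean. The function name `inverse_log_forecast` is hypothetical.

```python
import math

def inverse_log_forecast(log_prediction, log_error_variance):
    """Back-transform a forecast made on the log scale.

    Assumes the forecast errors are normal on the log scale, so the forecast
    on the original scale is lognormal.
    """
    median = math.exp(log_prediction)                           # naive inverse
    mean = math.exp(log_prediction + log_error_variance / 2.0)  # variance-adjusted
    return median, mean

median, mean = inverse_log_forecast(log_prediction=math.log(1400.0),
                                    log_error_variance=0.04)
print(f"median forecast: {median:.2f}, mean forecast: {mean:.2f}")
```

With zero error variance the two values coincide, which matches the statement that without a transformation effect the mean and median forecasts are the same.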

The statistics of fit evaluate how well a forecasting model performs by

comparing the actual data to the predictions. For a given forecast model that has

been fitted to the time series data, the model should be checked or evaluated to see

how well it fits or forecasts the data. The statistics of fit can be computed from the

model residuals or the prediction errors.

When a particular statistic of fit is used for forecast model selection, it is

referred to as the model selection criterion. When using model selection criteria to

rank forecasting models, it is important to compare the errors on the same metric,

that is, you should not compare transformed model residuals with non-transformed

model residuals. You should first inverse transform the forecasts from the

transformed model prior to computing the prediction errors and then compute the

model selection criterion based on the prediction errors.
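
A small sketch of this advice follows, using root mean squared error as the statistic of fit. The choice of RMSE, the two hypothetical models "A" and "B", and all numbers are assumptions for illustration only. Model B is imagined to have been fitted on the log scale, so its predictions are inverse transformed before any error is computed, and both models are then scored on the original scale.

```python
import math

def rmse(actual, predicted):
    """Root mean squared prediction error, both series on the same scale."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100.0, 102.0, 101.0, 105.0]

# Hypothetical predictions: model A was fitted on the original scale,
# model B was fitted on the log scale and reports log-scale predictions.
pred_a = [101.0, 101.0, 102.0, 104.0]
log_pred_b = [math.log(v) for v in (100.5, 101.5, 101.5, 104.0)]

# Inverse transform model B first, then score both on the original scale.
pred_b = [math.exp(v) for v in log_pred_b]

scores = {"A": rmse(actual, pred_a), "B": rmse(actual, pred_b)}
best = min(scores, key=scores.get)
print(scores, "selected:", best)
```

Scoring model B directly on its log-scale residuals would produce a much smaller number simply because of the change of units, which is exactly the comparison the text warns against.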

I.1.2. Prediction tasks in stock market

Boris Kovalerchuk, Evgenii Vityaev, Data Mining for Financial Applications, in:
O. Maimon, L. Rokach (Eds.), The Data Mining and Knowledge Discovery
Handbook, Springer, 2005, pp. 1203-1224.

Stock market prediction includes uncovering market trends, planning

investment strategies, identifying the best time to purchase the stocks and what

stocks to purchase. Prediction tasks in stock market typically are posed in one of two

forms:

• Straight prediction of the stock market numeric characteristic, e.g., stock

return or exchange rate

• The prediction whether the stock market characteristic will increase or

decrease.

In the first case, it is necessary to take into account the trading cost and the
significance of the trading return. In the second case, it is necessary to forecast
whether the stock market characteristic will increase or decrease by no less than
some threshold. Thus, the difference between data mining methods for the first and
second cases can be less obvious than it seems, because the second case
may require some kind of numeric forecast.
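
The second task form with a threshold can be sketched as follows. The function name `direction_labels`, the third label "flat", and the 0.5% threshold are illustrative assumptions; in practice the threshold would be chosen from trading costs and the return considered significant.

```python
def direction_labels(prices, threshold=0.005):
    """Label each step 'up', 'down' or 'flat' from the relative price change.

    threshold is a hypothetical minimum relative move (0.5% here) below which
    a change is treated as too small to matter, e.g. after trading costs.
    """
    labels = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev
        if change >= threshold:
            labels.append("up")
        elif change <= -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels

print(direction_labels([100.0, 101.0, 100.9, 99.0]))
```

Note that producing these labels already requires a numeric quantity (the relative change), which is why the two task forms are closer than they first appear.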

Financial institutions produce huge datasets that build a foundation for approaching

these enormously complex and dynamic problems with data mining tools. Potential

significant benefits of solving these problems have motivated extensive research for years.

I.1.3. Stock market time series properties

One may wonder if there are universal characteristics of the many series coming

from markets different in size, location, sophistication, etc. The surprising fact is that

there are. Moreover, interacting systems in other fields, such as statistical mechanics,

suggest that the properties of stock market time series loosely depend on the market

microstructure and are common to a range of interacting systems. Such observations

have stimulated new models of markets based on analogies with particle systems and

brought in new analysis techniques opening the era of econophysics. A more

detailed discussion of stock market time series properties can be found in [66] -
Stefan Zemke, On Developing a Financial Prediction System: Pitfalls and
Possibilities, First International Workshop on Data Mining Lessons Learned at
ICML'02, 2002. This section briefly introduces stock market time series
properties, including:

- Distribution

Distribution of stock market series tends to be non-normal, sharp peaked and

heavy-tailed, these properties being more pronounced for intraday values. Such

observations were pioneered interestingly around the time the EMH was formulated.

Extreme values appear more frequently in a stock market series as compared to a

normally distributed series of the same variance. This is important to the practitioner

since often the values cannot be disregarded as erroneous outliers but must be

actively anticipated, because of their magnitude which can influence trading

performance.
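
One simple way to quantify "sharp peaked and heavy-tailed" is the excess kurtosis, which is zero for a normal distribution and positive when extreme values are more frequent. The sketch below uses the plain moment-based estimate; the return values are made up for illustration and are not taken from any market data.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: 0 for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0

# Made-up daily returns: mostly small moves plus two extreme days.
returns = [0.001, -0.002, 0.0015, -0.001, 0.0005, -0.0005, 0.002, -0.0015, 0.06, -0.07]
print(f"excess kurtosis: {excess_kurtosis(returns):.2f}")
```

A clearly positive value on real return data would be consistent with the heavy tails described above, although a sample this small is only a toy.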

- Scaling property

Scaling property of a time series indicates that the series is self-similar at

different time scales. This is common in stock market time series: given a plot of

returns with unlabeled axes, it is next to impossible to say if it represents hourly,

daily or monthly changes, since all the plots look similar, with differences appearing

at minute resolution. Thus prediction methods developed for one resolution could, in

principle, be applied to others.

- Data frequency

Data frequency refers to how often series values are collected: hourly, daily,

weekly etc. Usually, if a stock market series provides values on daily, or longer,

basis, it is low frequency data, otherwise - when many intraday quotes are included

- it is high frequency. Tick-by-tick data includes all individual transactions, and as

such, the event-driven time between data points varies, creating a challenge even for

such a simple calculation as correlation.
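
One common way around the irregular spacing, though by no means the only one, is to sample each tick series on a shared regular time grid using the last price observed at or before each grid point. The sketch below illustrates that idea; the function name `last_tick_sample` and the tick data are hypothetical.

```python
from bisect import bisect_right

def last_tick_sample(tick_times, tick_prices, grid_times):
    """Sample irregular ticks on a regular grid using the last price at or before each grid time.

    tick_times must be sorted ascending; grid points before the first tick get None.
    """
    sampled = []
    for t in grid_times:
        i = bisect_right(tick_times, t)  # number of ticks with time <= t
        sampled.append(tick_prices[i - 1] if i > 0 else None)
    return sampled

# Hypothetical ticks: seconds since the open, unevenly spaced.
tick_times = [1, 4, 5, 12, 30]
tick_prices = [100.0, 100.2, 100.1, 100.4, 100.3]
grid = [0, 10, 20, 30]
print(last_tick_sample(tick_times, tick_prices, grid))
```

Once two series are sampled on the same grid, an ordinary correlation can be computed on the aligned values, at the cost of discarding the information carried by the tick timing itself.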

I.1.4. Stock market prediction with the efficient market theory

The Efficient Market Theory/Hypothesis (EMH) initially gained wide acceptance
in the financial community. It asserts, in weak form, that the current price of an asset
already reflects all information obtainable from past prices and assumes that news is
promptly incorporated into prices. Since news is assumed unpredictable, so are
prices. In other words, according to the EMH, the evolution of the prices for each
economic variable is a random walk. The variations in prices are completely
independent from one time step to the next in the long run. The EMH states that it is
practically impossible to infer a fixed long-term global forecasting model from
historical stock market information. This idea is based on the observation that if the
market presents some kind of regularity then someone will take advantage of it and
the regularity disappears.
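
The random-walk picture can be made concrete with a small simulation. This is a sketch under the assumption of independent Gaussian price changes, which is an idealization and not a claim about real markets. Under that assumption the changes show almost no lag-1 autocorrelation, so yesterday's change says nothing about today's, even though the price levels themselves are highly autocorrelated.

```python
import random

def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den

rng = random.Random(0)  # fixed seed so the run is repeatable
steps = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # independent price changes

prices = [100.0]
for s in steps:
    prices.append(prices[-1] + s)  # random walk: previous price plus noise

print(f"lag-1 autocorrelation of changes: {lag1_autocorrelation(steps):+.3f}")
print(f"lag-1 autocorrelation of prices:  {lag1_autocorrelation(prices):+.3f}")
```

Running the same statistic on real return series is one simple first check of how far a market departs from this idealized behaviour.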

However, real markets do not obey all the consequences of the hypothesis. For
example, a price random walk implies a normal distribution, which is not the observed
case, and there is a delay while the price stabilizes at a new level after news. These
observations, among others, lead to a
more modern view: "Overall, the best evidence points to the following conclusion.

The market isn't efficient with respect to any of the so-called levels of efficiency.

The value investing phenomenon is inconsistent with semi-strong form efficiency,

and the January effect is inconsistent even with weak form efficiency. Overall, the

evidence indicates that a great deal of information available at all levels is, at any

given time, reflected in stock prices. The market may not be easily beaten, but it

appears to be beatable, at least if you are willing to work at it."

The market efficiency theory does not exclude that hidden short-term local

conditional regularities may exist. These regularities cannot work "forever"; they

should be corrected frequently. It has been shown that the stock market data are not

random and that the efficient market hypothesis is merely a subset of a larger

chaotic market hypothesis. This hypothesis does not exclude successful short-term

forecasting models for the prediction of chaotic time series.

Data mining does not try to accept or reject the efficient market theory. Data

mining creates tools which can be useful for discovering subtle short-term

conditional patterns and trends in a wide range of stock market data. This means that
retraining should be a permanent part of data mining in the stock market, and any claim
that a silver-bullet trading method has been found should be treated like a claim that a
perpetual motion machine has been discovered.
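
The idea of permanent retraining is often organized as a walk-forward scheme: train on a recent window, test on the period that follows, then slide forward and retrain. The sketch below only generates the index ranges for such a scheme; the function name `walk_forward_splits` and the window sizes are illustrative assumptions, and the text does not prescribe this particular procedure.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) for successive retraining rounds.

    Each round trains on train_size consecutive points and tests on the
    next test_size points, then slides forward by test_size.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

for train, test in walk_forward_splits(n_samples=10, train_size=4, test_size=2):
    print(f"train {train.start}..{train.stop - 1}  ->  test {test.start}..{test.stop - 1}")
```

Because every test window lies strictly after its training window, rules discovered in one round are always checked on later data, and a rule that stops working is naturally dropped at the next retraining.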

I.1.5. Questions in stock market prediction

Following are some questions of scientific and practical interest concerning
stock market prediction:

• Prediction possibility: Is statistically significant prediction of stock market

data possible? Is profitable prediction of such data possible? What involves

answer to the former question, adjusted by constraints imposed by the real

markets?

• Methods: If prediction is possible, what methods are best at performing it?

What methods are best-suited for what data characteristics - could it be said

in advance?

• Meta-methods: What are the ways to improve the methods? Can

metaheuristics successful in other domains, such as ensembles or pruning,

improve stock market prediction?

• Data: Can the amount and type of data needed for prediction be characterized?

• Data preprocessing: Can data transformations that facilitate prediction be

identified? In particular, what transformation formulae enhance input data?

• Evaluation: What are the features of a sound evaluation procedure, respecting

the properties of stock market data and the expectations of stock market

prediction? What are the common evaluation drawbacks?

• Predictor development: Are there any common features of successful

prediction systems? If so, what are they, and how could they be advanced?

Can common reasons of failure of stock market prediction be identified?

Are they intrinsic and non-reparable, or is there a way to amend them?

• Transfer to other domains: Can the methods developed for stock market

prediction benefit other domains?


• Predictability estimation: Can stock market data be reasonably quickly estimated

to be predictable or not, without the investment to build a custom system? What

are the methods, what do they actually say, and what are their limits?

• Consequences of predictability: What are the theoretical and practical

consequences of demonstrated predictability of stock market data, or of the

impossibility of it? How does a successful prediction method translate into

economic models? What could be the social consequences of stock market

prediction?

1.1.6. Challenges and Possibilities on Developing a Stock Market Prediction System

Developing a successful stock market prediction system presents many challenges. Some are encountered over and over again, and though an individual solution might be system-specific, general principles still apply. Using them as a guideline might save time and effort and boost results, thereby promoting the project's success.

The idea of stock market prediction (and the resulting riches) is appealing, initiating

countless attempts. In this competitive environment, if one wants above-average

results, one needs above-average insight and sophistication. Reported successful

systems are hybrid and custom-made, whereas straightforward approaches, e.g. a

neural network plugged into relatively unprocessed data, usually fail. The

individuality of a hybrid system offers chances and dangers. One can bring together

the best of many approaches; however the interaction complexity hinders judging

where the performance dis/advantage is coming from.

Stock market prediction has been widely studied as a case of the time-series prediction problem. The difficulty of this problem is due to the following factors: low signal-to-noise ratio, non-Gaussian noise distribution, nonstationarity, and nonlinearity. Deriving relationships that allow one to predict future values of time series is a challenging task when the underlying system is highly non-linear. Usually, the history of the time series is provided and the goal is to extract a dynamic system from that data. The dynamic system models the relationship between a window of past values and a value T time steps ahead. Discovering such a model is difficult in practice, since the processes are typically corrupted by noise and can only be partially modeled due to missing information and the overall complexity of the problem. In addition, stock market time series are inherently non-stationary, so adaptive forecasting techniques are required.
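The relationship between a window of past values and a value T time steps ahead can be made concrete with a short Python sketch. It is only an illustration: the function name `make_instances`, its parameters, and the sample prices are assumptions introduced here, not part of the thesis.

```python
def make_instances(series, window, horizon):
    """Pair each window of past values with the value `horizon` steps after it.

    `window` is the number of past values and `horizon` plays the role of T.
    Both names are hypothetical, chosen for this sketch.
    """
    instances = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        past = series[start:start + window]
        target = series[start + window + horizon - 1]
        instances.append((past, target))
    return instances


# Made-up closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 11.5, 12.5, 13.0]
pairs = make_instances(prices, window=3, horizon=1)
```

Each pair can then be given to any learner that expects a fixed-length input vector and a single target value.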

- Data Preprocessing


Before data is fed into an algorithm, it must be collected, inspected, cleaned and

selected. Since even the best predictor will fail on bad data, data quality and

preparation are crucial. Also, since a predictor can exploit only certain data features, it

is important to detect which data preprocessing/presentation works best.

• Visual inspection is invaluable. At first, one can look for: a trend (to be removed

if needed), the histogram (to be redistributed if needed), missing values and outliers, and any

regularities.

• Missing values can be dealt with by data mining methods.

• Series-to-instances conversion is required by most learning algorithms, which

expect a fixed-length vector as input.

• Indicators are series derived from others, enhancing some features of

interest, such as trend reversal.

• Feature selection can make learning feasible since, because of the curse of

dimensionality, long instances demand (exponentially) more data.
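A few of the preprocessing steps listed above can be sketched in Python. The three helpers below are assumptions chosen for illustration (a simple moving average as an indicator, simple returns as one way to work with changes rather than raw price levels, and forward filling as one way to treat missing values). The thesis does not prescribe these particular choices.

```python
def simple_moving_average(series, n):
    """Indicator: mean of each run of n consecutive values."""
    return [sum(series[i - n + 1:i + 1]) / n for i in range(n - 1, len(series))]


def simple_returns(series):
    """Relative change between consecutive values."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


def fill_forward(series):
    """Replace None with the most recent known value (leading None values stay)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled
```

For example, `simple_moving_average([1, 2, 3, 4, 5], 3)` averages the runs (1, 2, 3), (2, 3, 4) and (3, 4, 5).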

- Prediction Algorithms

Common learning algorithms and their features important to stock market

prediction:

• Linear methods are widely used in stock market prediction.

• Neural networks seem to be the method of choice for stock market prediction.

• C4.5 and ILP generate decision trees or if-then rules, which are human-understandable if

small.

• Nearest Neighbor does not create a general model, but to predict, it looks

back for the most similar case(s). Irrelevant/noisy features disrupt the

similarity measure, so pre-processing is worthwhile.

• A Bayesian classifier/predictor first learns the probabilities with which evidence supports

outcomes, and then uses them to predict the outcome of new evidence.

• Support Vector Machines (SVM) are a relatively new and powerful learner,

having attractive characteristics for time series prediction.

- System Evaluation

Proper evaluation is critical to the development of a prediction system. First, it has to

measure exactly the effect of interest, as opposed to mere prediction accuracy. Second, it

has to be sensitive enough to distinguish often minor gains. Third, it has to

convince that the gains are not merely a coincidence.

• Evaluation bias, resulting from the evaluation scheme and the time series data,

needs to be recognized.


• Evaluation data should include different regimes, markets, even data errors,

and be plentiful. Dividing test data into segments helps to spot performance

irregularities (for different regimes).

• Sanity checks involve common sense. Prediction errors along the series

should not reveal any structure, unless the predictor missed something.
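Two of the evaluation points above lend themselves to a small Python sketch: dividing the data into successive segments so that each test segment follows its training segment in time, and checking whether prediction errors still show structure. Both helpers are hypothetical illustrations. The lag-1 autocorrelation is only one possible sanity check among many.

```python
def walk_forward_splits(n, train_size, test_size):
    """Yield (train_indices, test_indices); each test block follows its training block."""
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


def lag1_autocorrelation(errors):
    """Correlation of each error with the next one; values far from 0 hint at structure."""
    mean = sum(errors) / len(errors)
    centered = [e - mean for e in errors]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    return num / denom
```

Evaluating each test segment separately makes it easier to spot regimes in which the predictor behaves differently.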

1.2. Data mining methodology for stock market prediction

1.2.1. Prediction in data mining

a. Introduction

The goal of data mining is to produce new knowledge that the user can act

upon. It does this by building a model of the real world based on data collected from

a variety of sources. The result of the model building is a description of patterns and

relationships in the data that can be confidently used for prediction.

Prediction is one of the most important problems in data mining. It involves

using some variables or fields in the data set to predict unknown or future values of

other variables of interest. The goal of prediction is to forecast or deduce the value

of an attribute based on values of other attributes.

b. Major types of prediction

- From the viewpoint of model construction and use

Prediction can be viewed as the construction and use of a model to assess the

class of an unlabeled sample, or to assess the value or value ranges of an attribute

that a given sample is likely to have. In this view, classification and regression are

the two major types of prediction problems:

• Classification: used to predict discrete or nominal values. It predicts into what

category or class a case falls. In other words, classification problems aim to

identify the characteristics that indicate the group to which each case

belongs. Data mining creates classification models by examining already

classified data (cases) and inductively finding a predictive pattern.

• Regression: used to predict continuous or ordered values. It predicts what

number value a variable will have. In other words, regression uses existing

values to forecast what other values will be. The prediction of continuous

values can be modeled by statistical techniques of regression.

- From the viewpoint of what is predicted

This view is commonly accepted in data mining. It refers to predicting

class labels as classification and to predicting continuous values as

prediction:


• Classification: used to extract models describing important data classes.

Classification predicts categorical class labels. It classifies data (constructs a

model) based on the training set and the values (class labels) in a classifying

attribute and uses it in classifying new data.

• Prediction: used to predict future data trends, i.e., predict unknown or

missing values. It models continuous-valued functions. Any of the methods

and techniques used for classification may also be used for prediction.

1.2.2. Parameters

There are several parameters to characterize data mining methodologies for

stock market forecasting:

1.2.2.1. Data types

There are two major groups of data types:

• Attribute data type: each object is represented by attributes, that is, each object x

is given by a set of values $A_1(x), A_2(x), \ldots, A_n(x)$.

• Relational data type: objects are represented by their relations with other

objects, for instance x > y, y < z, x > z. In this example we may not know that

x = 3, y = 1 and z = 2. Thus the attributes of the objects are not known, but their

relations are known. Objects may have different attributes (e.g., x = 5, y = 2,

and z = 4), but still have the same relations.
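The point that different attribute values can share the same relations can be checked with a minimal Python sketch. It uses the two sets of values quoted in the example. The helper `order_relations` and the fact format `("gt", a, b)` are assumptions made for illustration.

```python
from itertools import permutations


def order_relations(values):
    """Return the facts ('gt', a, b) for every pair of named objects with a > b."""
    return {("gt", a, b) for a, b in permutations(values, 2) if values[a] > values[b]}


first = {"x": 3, "y": 1, "z": 2}
second = {"x": 5, "y": 2, "z": 4}

# The attribute values differ, yet both describe the same relational structure.
same_relations = order_relations(first) == order_relations(second)
```

A relational learner works with such facts directly, so the two objects descriptions are indistinguishable to it even though every attribute value differs.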

1.2.2.2. Data set and techniques

Fundamental and technical analyses are two widely used techniques in stock

market forecasting.

- Fundamental analysis

Fundamental analysis tries to determine all the econometric variables that may

influence the dynamics of a given stock price or exchange rate. Often it is hard to

establish which of these variables are relevant and how to evaluate their effect.

- Technical analysis

Technical analysis assumes that when the sampling rate of a given economic

variable is high, all the information necessary to predict the future values is

contained in the time series itself. There are several difficulties in technical analysis

for accurate prediction: successive ticks correspond to bids from different sources,

the correlation between price variations may be low, time series are not stationary,

good statistical indicators may not be known, different realizations of the random

process may not be available, and the number of training examples may not be

enough to accurately infer rules. Therefore, technical analysis is suited to short-term

predictions of stock market time series when there are no great changes in the economic

environment between successive ticks. In practice, technical analysis has been more

successful in identifying market trends, which is much easier than forecasting

future stock prices. Currently, different data mining techniques try to incorporate

some of the most common technical analysis strategies in pre-processing of data and

in the construction of appropriate attributes.

Two major options exist: use the time series itself or use all variables that may

influence the evolution of the time series. Data mining methods do not restrict

themselves to a particular option. They follow a fundamental analysis approach

incorporating all available attributes and their values, but they also do not exclude a

technical analysis approach based only on a time series such as stock price and

parameters derived from it. The most popular time series are the index value at open, index

value at close, highest index value, lowest index value, trading volume, and

lagged returns from the time series of interest. Fundamental factors include the price

of gold, retail sales index, industrial production indices, and foreign currency

exchange rates. Technical factors include variables that are derived from time series

such as moving averages.

1.2.2.3. Mathematical algorithm (method, model)

A variety of statistical, neural network and logical methods have been developed.

For example, there are many neural network models, based on different

mathematical algorithms, theories and methodologies. Combinations of different

models may provide better performance than individual models. Many

data mining methods assume a functional form of the relationship being modeled.

1.2.2.4. Form of relationships between objects

The next characteristic of a specific data mining methodology is the form of the

relationship between objects. Many data mining methods assume a functional form

of the relationship being modeled. For instance, the linear discriminant analysis

assumes linearity of the border that discriminates between two classes in the space

of attributes. Often it is hard to justify such functional form in advance. RDM

methodology in the stock market does not assume a functional form for the relationship.

In addition, RDM algorithms do not assume the existence of derivatives. They can

automatically learn symbolic relations on the numerical data of stock market time series.

1.2.3. Approaches to stock market prediction

a. Physics approach and data mining approach

The impact of market players on market regularities stimulated a surge of

attempts to use ideas of statistical physics in finance. If an observer is a large marketplace player, then such an observer can potentially change the regularities of the marketplace dynamically. Attempts to forecast in such a dynamic environment with thousands of active agents lead to much more complex models than those that traditional data mining models are designed for. This is one of the major reasons that such interactions are modeled using ideas from statistical physics rather than from statistical data mining. The physics approach in finance is also known as "econophysics" and "physics of finance". The major difference from the data mining approach is that, in essence, the data mining approach is not about developing specific methods for financial tasks, whereas the physics approach is.

b. Deterministic dynamic system approach

Stock market data are often represented as a time series of a variety of attributes

such as stock prices and indexes. Time series prediction has been one of the ultimate

challenges in mathematical modeling for many years. Currently data mining

methods try to enhance this study with new approaches. The dynamic system approach

has been developed and applied successfully to many difficult problems in physics.

Recently, several studies have applied this technique to the stock

market. Usually, the history of the time series is provided and the goal is to extract

from that data a dynamic system. The dynamic system models the relationship

between a window of past values and a value T time steps ahead. The

major steps of this approach are as follows:

• Step 1: Development of state space for the dynamic system, i.e. selecting

and/or inventing attributes characterizing the system behavior.

• Step 2: Discovering the laws that govern the phenomenon, i.e. discovering

relations between attributes of current and previous states (state vectors) in

the form of differential equations.

• Step 3: Solving differential equations for identifying the transition function

(rules).

• Step 4: Use of the transition function as a predictor of the next state of the

dynamic system, e.g., next day stock value.
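The four steps can be illustrated with a deliberately simplified Python sketch. It is not the method described in the text: instead of discovering and solving differential equations (Steps 2 and 3), it simply memorizes which value followed each past state and looks up the most similar one. That substitution, the function names, and the toy series are all assumptions made for brevity.

```python
def delay_vectors(series, dim):
    """Step 1: states are vectors of `dim` consecutive (delayed) values."""
    return [tuple(series[i:i + dim]) for i in range(len(series) - dim + 1)]


def fit_transition(series, dim):
    """Steps 2-3 (simplified): remember which value followed each past state."""
    states = delay_vectors(series, dim)
    return [(states[i], series[i + dim]) for i in range(len(series) - dim)]


def predict_next(memory, state):
    """Step 4: predict the value that followed the most similar past state."""
    def distance(pair):
        past_state, _ = pair
        return sum((a - b) ** 2 for a, b in zip(past_state, state))
    return min(memory, key=distance)[1]


# Toy periodic series, for illustration only.
series = [1, 2, 3, 1, 2, 3, 1, 2]
memory = fit_transition(series, dim=2)
forecast = predict_next(memory, (1, 2))
```

On a noisy, non-stationary series such a lookup would need frequent refitting, which is in line with the assumptions listed next.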

Inferring a set of rules for a dynamic system assumes that:

• There is enough information in the available data to characterize the

dynamics of the system with high accuracy.

• All of the variables that influence the time series are available, or they vary

slowly enough that the system can be modeled adaptively.

• The system has reached some kind of stationary evolution.

• The system is a deterministic system.

• The evolution of the system can be described by means of a surface in the

space of delayed values.


There are several applications of these methods to stock time series. However,

the literature contains claims both for and against the existence of a chaotic deterministic

system underlying the stock market. Recent research has focused on methods to

distinguish stochastic noise from deterministic chaotic dynamics and, more generally,

on constructing systems combining deterministic and probabilistic techniques.

1.2.4. Data mining methods in stock market

Almost every computational method has been explored and used for financial

modeling. New developments augment traditional technical analysis of stock market

curves that has been used extensively by financial institutions. Such stock charting

helps to identify buy/sell signals (timing "flags") using graphical patterns. Data

mining, as a process of discovering useful patterns and correlations, has its own place in

stock market modeling.

Similarly to other computational methods, almost every data mining method

and technique has been used in financial modeling. An incomplete list includes a

variety of linear and non-linear models, multi-layer neural networks, k-means and

hierarchical clustering, k-nearest neighbors, decision tree analysis, regression

(logistic regression, general multiple regression), ARIMA, principal component

analysis, and Bayesian learning. Less traditional methods used include rough sets,

RDM methods (deterministic inductive logic programming) and newer probabilistic

methods, support vector machines, independent component analysis, Markov models

and hidden Markov models.

1.2.4.1. Representation languages

a. Propositional Logic language

A proposition is a statement that can be true or false. Propositional logic uses

true statements to form or prove other true statements. In other words, propositional

logics are concerned with propositional (or sentential) operators which may be

applied to one or more propositions giving new propositions.

Propositional logic has very limited expressive power. It is not adequate for

formalizing valid arguments that rely on the internal structure of the propositions

involved.

b. First-order logic language

First-order logic (FOL) is a system of deduction extending propositional logic

by the ability to express relations between individuals. FOL languages support

variables, relations, and complex expressions.


The FOL language differs from a propositional logic language mainly by the

presence of variables. Therefore, a language of monadic functions and predicates is

a FOL language, but a very restricted language.

c. Attribute-value languages

Attribute-value language is a propositional language in which propositions are

attribute-value pairs that can be considered as predicates. In other words, in an

attribute-value language, objects are described by tuples of attribute-value pairs,

where each attribute represents some characteristic of the object.

Attribute-value languages are languages of monadic functions (functions of one

variable) and monadic predicates (Boolean functions with only one argument). This

language was not designed to represent relations that involve two, three or more

objects.

d. Comparison of these languages

Many well-known rule learners are propositional but propositional

representations offer no general way to describe the essential relations among the

values of the attributes. In contrast with propositional rules, first-order rules have an

advantage in discovering relational assertions because they capture relations

directly. Several types of hypotheses/rules presented in FOL are simple relational

assertions with variables. Relational assertions can be conveniently expressed using

first-order representations, while they are very difficult to describe using

propositional representations.

Also, first-order rules allow one to naturally express other, more general hypotheses, not only relations between pairs of attributes. These more general rules can be used both for classification problems and for interval forecasts of a continuous variable. Moreover, these rules are able to capture Markov chain types of models used for stock market time series forecasting. The existence of algorithms designed to learn sets of first-order rules that contain variables is significant because first-order rules are much more expressive than propositional rules.
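As a hedged illustration of what a rule with variables looks like, the Python sketch below evaluates one hypothetical first-order rule over every time point t of a series: if the value rose at t-1 and at t, then it rises at t+1. The rule, the predicate `up`, and the frequency estimate are assumptions for illustration only. This is not the MMDR algorithm.

```python
def up(series, t):
    """Relation between two time points: the value rose from t-1 to t."""
    return series[t] > series[t - 1]


def rule_probability(series):
    """Estimate P(up(t+1) | up(t-1) and up(t)) over all t where the condition holds."""
    fired = hits = 0
    for t in range(2, len(series) - 1):
        if up(series, t - 1) and up(series, t):
            fired += 1
            hits += up(series, t + 1)  # True counts as 1
    return hits / fired if fired else None


# Toy series, for illustration only.
estimate = rule_probability([1, 2, 3, 4, 3, 4, 5, 6])
```

The variable t ranges over all time points, so one rule covers every position in the series. A propositional rule would need a separate attribute for each position it mentions.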

1.2.4.2. AVL-based methods

The common data mining methodology that assumes the attribute data type is known

as an attribute-based or attribute-value methodology. It covers a wide range of

statistical and connectionist (neural network) methods. There are two types of

attribute-value methods: the first one is based on numerical expressions, and the

second one is based on logical expressions and operations.

Historically, methods based on AVLs such as neural networks, the nearest

neighbors method, and decision trees dominate in financial applications of data


The thesis consists of 3 chapters concentrating on relational data mining

methodology for stock market prediction.


ACKNOWLEDGEMENTS

This thesis would not have been completed without the help and support of

many people. I would like to take this opportunity to express my gratitude to the

many people who helped me during the development leading to the thesis.

In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his

courage in accepting me as a Master's student, for his enthusiasm, his knowledge

and his encouragement throughout the work. I would never have been able to finish

this thesis without his encouragement as well as his strict requirements for the quality of

the research.

I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN

Trong Dung, to whom I am also grateful for comments on the thesis. In the early

days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van

Thanh stimulated my interest in data mining in financial forecast. I am thankful for

that and for the many discussions I had with them.

I am indebted to CFO. LE The Anh, CFO. NGUYEN Minh Quang for their

patience with my questions on financial and stock market forecast. I am also grateful

to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh, Mrs.

LE Thi Hoang My for words of encouragement during months of the thesis efforts

and for their style-improving suggestions. My thanks also go to everyone who has

provided support or advice to me on data mining, stock market, forecast and so on in

one way or another.

My family has been creating good conditions for me to complete the thesis. I

dedicate the thesis to my father, my mother and my younger brother, whose love and

support are always for me.

Hanoi, June 2007,

CHU Thai Hoa.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
Problem definition
Motivations of the Thesis
Objectives of the Thesis
Method of the Thesis study
Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DM
1.1. Introduction to stock market prediction
1.1.1. Basic concepts of forecast
1.1.2. Prediction tasks in stock market
1.1.3. Stock market time series properties
1.1.4. Stock market prediction with the efficient market theory
1.1.5. Questions in stock market prediction
1.1.6. Challenges and Possibilities on Developing a Stock Market Prediction System
1.2. Data mining methodology for stock market prediction
1.2.1. Prediction in data mining
1.2.2. Parameters
1.2.3. Approaches to stock market prediction
1.2.4. Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
II.1. Introduction
II.2. Basic problems
II.2.1. First-order logic and rules
II.2.2. Representative measurement theory
II.2.3. Breadth-first search
II.2.4. Occam's razor principle
II.3. Theory of RDM
II.3.1. Data types in RDM
II.3.2. Relational representation of examples
II.3.3. Background knowledge and problems of search for regularities
II.4. An algorithm for RDM: MMDR
II.4.1. Motivations of choice for MMDR
II.4.2. Some concepts
II.4.3. Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
III.1. MMDR model for prediction
III.2. Experiment preparation
III.2.1. Data description and representation
III.2.2. Demo program
III.3. Application of MMDR model
III.3.1. Step 1: Generating logical rules
III.3.2. Step 2: Learning logical rules
III.3.3. Step 3: Creating intervals
III.4. Results and evaluations
III.4.1. Stability of discovered rules on test data
III.4.2. Evaluations of forecast performance
CONCLUSIONS
Contributions of the thesis
Limitations of the thesis
Future work
Summary
APPENDICES
Source code
REFERENCES
In English
In Vietnamese
Website

LIST OF TABLES AND FIGURES

Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1. Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rule consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set of 292 regularities with conditional probability not less than 0.65)

LIST OF ABBREVIATIONS

AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data mining
RDM : Relational Data mining
RMT : Representative measurement theory

INTRODUCTION

Problem definition

There are four major technological reasons stimulating data mining

development, applications and public interest: the emergence of very large databases;

advances in computer technology; fast access to vast amounts of data; and the ability

to apply computationally intensive statistical methodology to these data.

Data mining is the process of discovering hidden patterns in data. Due to the

large size of databases, importance of information stored, and valuable information

obtained, finding hidden patterns in data has become increasingly significant. The

stock market provides an area in which large volumes of data are created and stored

on a daily basis.

Financial forecasting has been widely studied as a case of the time-series prediction problem. Time series such as those of the stock market are often seen as non-stationary, which presents challenges in predicting future values. The efficient market theory states that it is practically impossible to predict financial markets in the long term. However, there is good evidence that short-term trends do exist and that programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, as well as to recognize the time when the trends are no longer effective. Data mining methods provide the framework for stock market predictions to discover hidden trends and patterns.

Well-known and commonly used data mining methods in the stock market are

attribute-based learning methods, but they have some serious drawbacks: a limited

ability to represent background knowledge and a lack of complex relations. The

purpose of RDM is to overcome these limitations. RDM is a learning method that is

better suited for stock market mining with a better ability to explain discovered rules

than other symbolic approaches.

However, current relational methods are relatively inefficient and have rather

limited facilities for handling numerical data. RDM as a hybrid learning method

combines the strengths of FOL and probabilistic inference to meet these challenges.

MMDR, one of the few hybrid probabilistic relational data mining methods, handles

numerical data efficiently and has been developed and applied to stock market data.

It is believed that now is the time for RDM methods; in particular, applying MMDR to

stock market prediction has advantages in discovering regularities in stock market

time series.


Motivations of the Thesis

In past years, Vietnam's stock market was still at an early stage of development and thus did not attract attention from investors and researchers. In particular, for interested learners, mastering professional methods of stock market analysis and forecasting requires time and a wide background knowledge to study all the fields covered. Moreover, according to the efficient market theory, it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. Therefore, few Vietnamese have been interested in and performed research on stock market prediction.

The last two years have witnessed the surprising development of the Vietnamese

stock market with a host of notable events. In particular, since Vietnam became a

World Trade Organization (WTO) member, the Vietnamese economy has had many

opportunities to develop, leading to the growth of many companies and

markets, including the financial and stock markets. It is said that Vietnam's stock

market will grow rapidly in the coming years, and that it will rank second in the region, just

after China, in terms of growth rate.

Given the rapid development of Vietnam's financial market, professional

activities such as analysis and prediction of the financial market deserve more

attention. In particular, these activities play a significant role in the task of

macroeconomic forecasting at the National Center for Socio-economic Information and

Forecast (under the Ministry of Planning and Investment), which helps make sound

policies related to socio-economic management and regulation at the macro level. Data

mining provides methods and techniques that can help approach stock

market prediction quite effectively.

In fact, there have already been some studies and successful applications of data

mining techniques to stock market forecasting. However, acquiring the knowledge and

application techniques of each approach is quite challenging and time-consuming. I

read a number of papers and paid particular attention to research on relational data

mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii

Vityaev. They reported that, "Mining stock market data presents special challenges.

For one, the rewards for finding successful patterns are potentially enormous, but so

are the difficulties and sources of confusions. The efficient market theory states that

it is practically impossible to predict financial markets long-term. However, there is

good evidence that short-term trends do exist and programs can be written to find

them. The data miners' challenge is to find the trends quickly while they are valid, to

deal effectively with time series and calendar effects, as well as to recognize the

time when the trends are no longer effective".

The RDM learning method is able to learn more expressive rules, make better

use of underlying domain knowledge, and explain discovered rules better than other

symbolic approaches. It is thus better suited for stock market mining. This approach

will play a key role in future advances in data mining methodology and practice.

The earlier algorithms for RDM suffer from relative computational inefficiency

and have rather limited tools for processing numerical data. This problem

especially needs to be considered in stock market analysis, where data commonly

are numerical time series. Therefore, RDM as a hybrid learning method that

combines the strengths of FOL and probabilistic inference has been developed to meet these

challenges. MMDR, one of the few Hybrid Probabilistic Relational Data Mining methods

and one that handles numerical data efficiently, has been developed and applied to stock

market forecasting.

The common question "Can stock market prediction be profitable?" is often

asked of any research on methods of stock market prediction. In fact, few

people do research on RDM for stock market forecasting, because it requires

interested learners to have broad background knowledge to understand all the fields

involved. Much less has been reported publicly on the success of data mining in real

trading by financial institutions. If real success is reported, then competitors can

apply the same methods and the leverage will disappear, because in essence

fundamental data mining methods are not proprietary. I once concentrated my

study in an attempt to end up with a Master's Degree and as a millionaire (kidding), but

this is too high a risk to take.

Based on practical suggestions and requirements, as well as my

personal interest, I decided to do research on stock market forecasting.

Through school lessons and additional self-study, I explored several data

mining techniques to seek a solution to the task. The above motivates the aim of

the thesis: to carry out research and experiments on the methodology of RDM for stock

market prediction.

Objectives of the Thesis

- Systematic organization of the RDM methodology for stock market prediction

Most of the existing studies on RDM for stock market prediction are reported only

as brief overviews, which causes difficulties for many readers. The thesis is

primarily based on the book "Data Mining in Finance: Advances in Relational and

Hybrid Methods" and some papers by the two researchers Dr. Kovalerchuk & Dr.

Vityaev. However, after gaining a thorough grasp of the RDM methodology, I

systematically organize the methodology, especially the MMDR algorithm, from my own

point of view and supplement the thesis with further material on data mining and stock

market forecasting. Hopefully, it will help newcomers

approach the problem more easily.

- Experimental application of the MMDR method to stock market prediction

The centre of the thesis is the issue of discovering regularities in stock price series,

addressed and illustrated through MMDR. The thesis also carries out an RDM

application to stock market prediction through an experiment with a small

self-developed program on a set of Standard and Poor's data. The experiment helps the reader

understand and trust the feasibility and efficiency of the RDM methodology and

MMDR algorithm presented in the thesis.

Method of the Thesis study

The study behind the thesis has been mostly goal driven. As problems appeared

on the way to realizing stock market prediction, they were tackled by various means

as listed below:

• Investigation of some existing machine learning and data mining methods

through related documents such as doctoral theses, Master's theses, online

papers, books, etc.

• Reading of financial and stock market literature for properties, forecasting

techniques and hints of exploitable regularities in stock market data.

• Learning about some existing stock market prediction software for a deeper

understanding of the regularities discovered.

• Some theoretical considerations on mechanisms behind the generation of

stock data, and on general predictability demands and limits.

• Practical insights into the realm of trading in stock market.

• Contacts with experts on data mining and data mining software development,

with stock market investors and chief financial officers.

• Courses on economic forecasting and the stock market, mostly organized by the

National Center for Socio-economic Information and Forecast.

• Collection of related documents and systematization of the Master's thesis.

• Programming in PHP and carrying out experiments to illustrate and support the

main idea and algorithm presented in the Thesis.

Structure of the Thesis

The thesis is structured in the following way. The first part introduces the

problem definition, method of study, objectives and structure of the Thesis.

Chapter 1 provides an overview of stock market prediction in data mining

in two parts. "Introduction to stock market prediction" covers

basic concepts of forecasting, stock market prediction with the Efficient Market

Theory, stock market time series properties, and drawbacks and possibilities in

developing a stock market prediction system. The second part, "Data mining methodology

for stock market prediction", presents the major types of data mining prediction,

approaches to stock market prediction, and comparisons of representation languages

and data mining methods used in the stock market.

Chapter 2 discusses some basic problems, the theory of RDM and the MMDR

algorithm. In comparison with other data mining methods, the RDM approach is

considered from the point of view of its Data Types, Representation Languages

(to manipulate and interpret data) and Class of hypotheses (to be tested on data). MMDR,

one of the few Hybrid Probabilistic Relational Data Mining methods, is the main focus; it is

equipped with a probabilistic mechanism that is necessary for time series with a high

level of noise.

In Chapter 5, an MMDR application to stock market price prediction illustrates

the methodology through three steps: rule generating, rule learning and

interval creating. This chapter also presents some statistical results and evaluations

for the experiment conducted to demonstrate the application.

Finally, contributions, limitations and future work of my research are given in

the concluding part of the thesis. In the appendix, the thesis also provides some

table structures and source code that I developed for the experiment.

CHAPTER 1: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING

1.1. Introduction to stock market prediction

1.1.1. Basic concepts of forecast

This section briefly reviews the basic concepts of forecasting. An introductory

discussion of the topic can be found in [46] - Michael Leonard, Large-Scale

Automatic Forecasting: Millions of Forecasts, International Symposium of

Forecasting, 2002.

Forecasts are time series predictions made for future periods in time. They are

random variables and therefore have an associated probability distribution. The

mean or median of each forecast is called the prediction. The variance of each

forecast is called the prediction error variance and the square root of the variance is

called the prediction standard error. The variance is computed from the forecast

model parameter estimates and the model residual variance.

The forecast for the next future period is called the one-step ahead forecast. The

forecast for h periods in the future is called the h-step ahead forecast. The forecast

horizon or forecast lead is the number of periods into the future for which

predictions are made (one-step, two-step,..., h-step). The larger the forecast horizon,

the larger the prediction error variance at the end of the horizon.

The confidence limits are based on the prediction standard errors and a chosen

confidence limit size. A confidence limit size of 0.05 results in 95% confidence

limits. The confidence limits are often computed assuming a normal distribution, but

others could be used. As with the prediction standard errors, the width of the

confidence limits increases with the forecast horizon.
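As a minimal sketch of the definitions above, the following Python fragment builds h-step ahead forecasts (prediction, prediction standard error and confidence limits) for each period in the forecast horizon. It assumes a random-walk model with normal errors, for which the prediction error variance is h times the one-step residual variance; this model choice, the function name `random_walk_forecasts` and the sample numbers are illustrative assumptions and are not taken from the cited source.

```python
from statistics import NormalDist


def random_walk_forecasts(last_value, residual_std, horizon, alpha=0.05):
    """Return (h, prediction, std_error, lower, upper) for h = 1..horizon.

    Assumption: random walk with normal errors, so the prediction stays at
    the last observed value and the error variance grows as h * sigma^2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile
    forecasts = []
    for h in range(1, horizon + 1):
        std_error = residual_std * h ** 0.5
        lower = last_value - z * std_error
        upper = last_value + z * std_error
        forecasts.append((h, last_value, std_error, lower, upper))
    return forecasts


forecasts = random_walk_forecasts(last_value=100.0, residual_std=2.0, horizon=3)
for h, prediction, std_error, lower, upper in forecasts:
    print(f"h={h} prediction={prediction:.2f} se={std_error:.2f} "
          f"limits=({lower:.2f}, {upper:.2f})")
```

With a confidence limit size of 0.05 the limits are 95% limits, and their width grows with the horizon, as described above.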

The prediction error is the difference between the predicted value and the actual

value when the actual value is known. For transformed models, it is important to

understand the difference between the model errors (or residuals) and the prediction

errors. The residuals measure the departure from the model in the transformed

metric. The prediction errors measure the departure from the original series.

Taken together, the predictions, prediction standard errors, and confidence

limits at each period in the forecast horizon are the forecasts. Although many people

use the word "forecast" to imply only prediction, a forecast is not one number for

each future time period.

Using a transformed forecasting model requires the following steps:

• The time series data are transformed.

• The transformed time series data are fit using the forecasting model.

• The forecasts are computed using the parameter estimates and the

transformed time series data.

• The forecasts (predictions, prediction standard errors, and confidence limits)

are inverse transformed.

The naive inverse transformation results in median forecasts. Obtaining mean

forecasts requires that the prediction and the prediction error variance both be

adjusted based on the transformation. Additionally, the model residuals will be

different from the prediction errors due to this inverse transformation. If no

transformation is used, the model residual and the prediction error will be the same,

and likewise the mean and median forecast will be the same (assuming a symmetric

disturbance distribution).
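The difference between median and mean forecasts can be illustrated for one common case. The sketch below assumes a log transformation with normally distributed errors on the log scale, so that the back-transformed forecast is log-normal; the function name and the numbers are hypothetical and chosen only for illustration.

```python
import math


def inverse_log_forecast(log_prediction, log_error_variance):
    """Back-transform a forecast made on the log scale.

    Assumption: errors are normal on the log scale, so the forecast on the
    original scale follows a log-normal distribution.
    """
    # Naive inverse transformation: gives the median of the forecast.
    median = math.exp(log_prediction)
    # Adjusted inverse transformation: gives the mean of the forecast.
    mean = math.exp(log_prediction + log_error_variance / 2.0)
    return median, mean


median, mean = inverse_log_forecast(log_prediction=math.log(100.0),
                                    log_error_variance=0.04)
print(f"median forecast = {median:.2f}, mean forecast = {mean:.2f}")
```

Under these assumptions the mean forecast is always at least as large as the median forecast, and the gap widens as the prediction error variance grows.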

The statistics of fit evaluate how well a forecasting model performs by

comparing the actual data to the predictions. For a given forecast model that has

been fitted to the time series data, the model should be checked or evaluated to see

how well it fits or forecasts the data. The statistics of fit can be computed from the

model residuals or the prediction errors.

When a particular statistic of fit is used for forecast model selection, it is

referred to as the model selection criterion. When using model selection criteria to

rank forecasting models, it is important to compare the errors on the same metric,

that is, you should not compare transformed model residuals with non-transformed

model residuals. You should first inverse transform the forecasts from the

transformed model prior to computing the prediction errors and then compute the

model selection criterion based on the prediction errors.
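A small sketch of this rule follows. It uses root mean squared error as the model selection criterion, which is one possible statistic of fit among many, and compares a model fitted on the original scale with a model fitted on the log scale. All data values and model names are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared prediction error on the scale of the inputs."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [100.0, 102.0, 101.0, 105.0]

# Hypothetical predictions of a model fitted on the original scale.
linear_predictions = [99.0, 101.0, 102.0, 104.0]

# Hypothetical predictions of a model fitted on log-transformed data.
log_predictions = [math.log(v) for v in (100.5, 101.0, 101.5, 106.0)]

# Inverse transform first, then compute prediction errors on the same metric.
log_predictions_original_scale = [math.exp(v) for v in log_predictions]

criteria = {
    "linear": rmse(actual, linear_predictions),
    "log": rmse(actual, log_predictions_original_scale),
}
best_model = min(criteria, key=criteria.get)
print(criteria, best_model)
```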

1.1.2. Prediction tasks in stock market

Boris Kovalerchuk, Evgenii Vityaev, Data Mining For Financial Applications, In:

O. Maimon, L. Rokach (Eds.): The Data Mining and Knowledge Discovery

Handbook, Springer 2005, pp. 1203-1224

Stock market prediction includes uncovering market trends, planning

investment strategies, identifying the best time to purchase the stocks and what

stocks to purchase. Prediction tasks in stock market typically are posed in one of two

forms:

• Straight prediction of the stock market numeric characteristic, e.g., stock

return or exchange rate

• The prediction whether the stock market characteristic will increase or

decrease.

In the first case, it is necessary to take into account the

trading cost and the significance of the trading return. In the second case, it is

necessary to forecast whether the stock market characteristic will increase or

decrease by no less than some threshold. Thus, the difference between data mining

methods for the first and second cases can be less obvious than it seems, because the second case

may require some kind of numeric forecast.
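The second form of the task can be sketched as a labeling step. The fragment below marks each one-step move as "up" or "down" only when its relative size is at least a chosen threshold, and as "flat" otherwise. The function name `label_moves`, the "flat" label and the 1% threshold are assumptions made for illustration; in practice the threshold might be chosen to cover the trading cost.

```python
def label_moves(prices, threshold):
    """Label each one-step relative return as 'up', 'down' or 'flat'.

    A move counts as 'up' or 'down' only if its relative size is at least
    `threshold` (an assumed parameter, e.g. large enough to cover costs).
    """
    labels = []
    for previous, current in zip(prices, prices[1:]):
        relative_return = (current - previous) / previous
        if relative_return >= threshold:
            labels.append("up")
        elif relative_return <= -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


labels = label_moves([100.0, 102.0, 101.9, 99.0], threshold=0.01)
print(labels)
```

Because the labels depend on the size of the return, producing them still requires a numeric quantity, which is why the two forms of the task are closer than they first appear.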

Financial institutions produce huge datasets that build a foundation for approaching

these enormously complex and dynamic problems with data mining tools. Potential

significant benefits of solving these problems have motivated extensive research for years.

1.1.3. Stock market time series properties

One may wonder if there are universal characteristics of the many series coming

from markets different in size, location, sophistication, etc. The surprising fact is that

there are. Moreover, interacting systems in other fields, such as statistical mechanics,

suggest that the properties of stock market time series loosely depend on the market

microstructure and are common to a range of interacting systems. Such observations

have stimulated new models of markets based on analogies with particle systems and

brought in new analysis techniques, opening the era of econophysics. A more

detailed discussion of stock market time series properties can be found in [66] -

Stefan Zemke, On Developing a Financial Prediction System: Pitfalls and

Possibilities, First International Workshop on Data Mining Lessons Learned at

ICML'02, 2002. This section gives a brief account of stock market time series

properties, including:

- Distribution

The distribution of stock market series tends to be non-normal, sharp-peaked and

heavy-tailed, these properties being more pronounced for intraday values. Interestingly, such

observations were first made around the time the EMH was formulated.

Extreme values appear more frequently in a stock market series than in a

normally distributed series of the same variance. This is important to the practitioner

since often the values cannot be disregarded as erroneous outliers but must be

actively anticipated, because of their magnitude which can influence trading

performance.
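One simple way to check for a sharp-peaked, heavy-tailed distribution is the excess kurtosis of the returns, which is zero for a normal distribution and positive for heavier tails. The sketch below uses the plain moment-based (population) estimator; the return values are hypothetical and only illustrate the effect of a few extreme days.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: 0 for a normal distribution,
    positive for sharp-peaked, heavy-tailed data."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return fourth_moment / second_moment ** 2 - 3.0


# Hypothetical daily returns: mostly small moves plus two extreme days.
returns = [0.001, -0.002, 0.0015, -0.001, 0.0005,
           -0.0005, 0.06, -0.05, 0.002, -0.0015]
print(excess_kurtosis(returns))
```

A clearly positive value on real return data would be consistent with the heavy tails described above, although a single statistic is not a formal test of normality.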

- Scaling property

The scaling property of a time series indicates that the series is self-similar at

different time scales. This is common in stock market time series: given a plot of

returns with unlabeled axes, it is next to impossible to say whether it represents hourly,

daily or monthly changes, since all the plots look similar, with differences appearing

at minute resolution. Thus prediction methods developed for one resolution could, in

principle, be applied to others.

- Data frequency

Data frequency refers to how often series values are collected: hourly, daily,

weekly etc. Usually, if a stock market series provides values on a daily or longer

basis, it is low frequency data; otherwise, when many intraday quotes are included,

it is high frequency. Tick-by-tick data includes all individual transactions, and as

such, the event-driven time between data points varies, creating a challenge even for

such a simple calculation as correlation.

1.1.4. Stock market prediction with the efficient market theory

The Efficient Market Theory/Hypothesis (EMH) initially gained wide acceptance

in the financial community. It asserts, in weak form, that the current price of an asset

already reflects all information obtainable from past prices and assumes that news is

promptly incorporated into prices. Since news is assumed unpredictable, so are

prices. In other words, according to the EMH, the evolution of the prices for each

economic variable is a random walk. The variations in prices are completely

independent from one time step to the next in the long run. EMH states that it is

practically impossible to infer a fixed long-term global forecasting model from

historical stock market information. This idea is based on the observation that if the

market presents some kind of regularity then someone will take advantage of it and

the regularity disappears.
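If price variations were independent from one time step to the next, the sample autocorrelation of successive price changes would be close to zero. The following sketch computes the lag-1 autocorrelation of price changes as a rough, informal check of this consequence. The price series is invented so that the changes alternate in sign, and the function name is an assumption; this is not a formal test of market efficiency.

```python
def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance_sum = sum((v - mean) ** 2 for v in values)
    covariance_sum = sum((values[i] - mean) * (values[i + lag] - mean)
                         for i in range(n - lag))
    return covariance_sum / variance_sum


# Hypothetical prices whose changes alternate in sign.
prices = [100.0, 101.0, 100.5, 101.5, 101.0, 102.0, 101.5, 102.5]
changes = [later - earlier for earlier, later in zip(prices, prices[1:])]

print(lag_autocorrelation(changes))
```

For this artificial series the autocorrelation is strongly negative, which is the kind of regularity a random walk would not produce.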

However, real markets do not obey all the consequences of the hypothesis: e.g., a

price random walk implies a normal distribution, which is not the observed case, and there is a

delay while the price stabilizes at a new level after news. These, among others, lead to a

more modern view: "Overall, the best evidence points to the following conclusion.

The market isn't efficient with respect to any of the so-called levels of efficiency.

The value investing phenomenon is inconsistent with semi-strong form efficiency,

and the January effect is inconsistent even with weak form efficiency. Overall, the

evidence indicates that a great deal of information available at all levels is, at any

given time, reflected in stock prices. The market may not be easily beaten, but it

appears to be beatable, at least if you are willing to work at it."

The market efficiency theory does not exclude that hidden short-term local

conditional regularities may exist. These regularities cannot work "forever"; they

should be corrected frequently. It has been shown that stock market data are not

random and that the efficient market hypothesis is merely a subset of a larger

chaotic market hypothesis. This hypothesis does not exclude successful short-term

forecasting models for prediction of chaotic time series.

Data mining does not try to accept or reject the efficient market theory. Data

mining creates tools which can be useful for discovering subtle short-term

conditional patterns and trends in a wide range of stock market data. This means that

retraining should be a permanent part of data mining in the stock market, and any claim

that a silver-bullet trading method has been found should be treated similarly to claims that a

perpetual motion machine has been discovered.

1.1.5. Questions in stock market prediction

Following are some questions of scientific and practical interest concerning

stock market prediction:

• Prediction possibility: Is statistically significant prediction of stock market

data possible? Is profitable prediction of such data possible, that is, what does the

answer to the former question become once adjusted for the constraints imposed by real

markets?

• Methods: If prediction is possible, what methods are best at performing it?

What methods are best-suited for what data characteristics - could it be said

in advance?

• Meta-methods: What are the ways to improve the methods? Can

metaheuristics successful in other domains, such as ensembles or pruning,

improve stock market prediction?

• Data: Can the amount and type of data needed for prediction be characterized?

• Data preprocessing: Can data transformations that facilitate prediction be

identified? In particular, what transformation formulae enhance input data?

• Evaluation: What are the features of sound evaluation procedure, respecting

the properties of stock market data and the expectations of stock market

prediction? What are the common evaluation drawbacks?

• Predictor development: Are there any common features of successful

prediction systems? If so, what are they, and how could they be advanced?

Can common reasons of failure of stock market prediction be identified?

Are they intrinsic and irreparable, or is there a way to amend them?

• Transfer to other domains: Can the methods developed for stock market

prediction benefit other domains?

• Predictability estimation: Can stock market data be reasonably quickly estimated

to be predictable or not, without the investment to build a custom system? What

are the methods, what do they actually say, what are their limits?

• Consequences of predictability: What are the theoretical and practical

consequences of demonstrated predictability of stock market data, or the

impossibility of it? How does a successful prediction method translate into

economic models? What could be the social consequences of stock market

prediction?

1.1.6. Challenges and Possibilities on Developing a Stock Market Prediction System

A successful stock market prediction system presents many challenges. Some

are encountered over and over again, and though an individual solution might be

system-specific, general principles still apply. Using them as a guideline might save time

and effort and boost results, thereby promoting the project's success.

The idea of stock market prediction (and resulting riches) is appealing, initiating

countless attempts. In this competitive environment, if one wants above-average

results, one needs above-average insight and sophistication. Reported successful

systems are hybrid and custom made, whereas straightforward approaches, e.g. a

neural network plugged into relatively unprocessed data, usually fail. The

individuality of a hybrid system offers chances and dangers. One can bring together

the best of many approaches; however, the interaction complexity hinders judging

where the performance advantage or disadvantage is coming from.

Stock market prediction has been widely studied as a case of the time-series

prediction problem. The difficulty of this problem is due to the following factors:

low signal-to-noise ratio, non-Gaussian noise distribution, nonstationarity, and

nonlinearity. Deriving relationships that allow one to predict future values of time

series is a challenging task when the underlying system is highly non-linear.

Usually, the history of the time series is provided and the goal is to extract from that

data a dynamic system. The dynamic system models the relationship between a

window of past values and a value T time steps ahead. Discovering such a model is

difficult in practice since the processes are typically corrupted by noise and can only

be partially modeled due to missing information and the overall complexity of the

problem. In addition, stock market time series are inherently non-stationary so

adaptive forecasting techniques are required.
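The relationship between a window of past values and a value T time steps ahead can be made concrete as a data preparation step. The sketch below converts a series into (inputs, target) pairs; the function name `series_to_instances` and the parameter names `window` and `horizon` (playing the role of T) are assumptions chosen for the example.

```python
def series_to_instances(series, window, horizon):
    """Convert a series into (inputs, target) pairs.

    Each input is `window` consecutive past values; the target is the value
    `horizon` steps after the last value of the window.
    """
    instances = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        instances.append((inputs, target))
    return instances


instances = series_to_instances([1, 2, 3, 4, 5, 6], window=3, horizon=2)
for inputs, target in instances:
    print(inputs, "->", target)
```

Since the series is non-stationary, such instances would typically be regenerated and the model refitted as new data arrive, in line with the adaptive techniques mentioned above.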

- Data Preprocessing

Before data is fed into an algorithm, it must be collected, inspected, cleaned and

selected. Since even the best predictor will fail on bad data, data quality and

preparation are crucial. Also, since a predictor can exploit only certain data features, it

is important to detect which data preprocessing/presentation works best.

• Visual inspection is invaluable. At first, one can look for: trend (to remove if

needed), histogram (to redistribute), missing values and outliers, and any

regularities.

• Missing values are dealt with by data mining methods.

• Series to instances conversion is required by most learning algorithms,

which expect a fixed-length vector as input.

• Indicators are series derived from others, enhancing some features of

interest, such as trend reversal.

• Feature selection can make learning feasible, since, because of the curse of

dimensionality, long instances demand (exponentially) more data.

- Prediction Algorithms

Common learning algorithms and their features important to stock market

prediction:

• Linear methods are widely used in stock market prediction.

• Neural Network seems the method of choice for stock market prediction.

• C4.5, ILP - generate decision trees/if-then rules - human understandable, if

small.

• Nearest Neighbor does not create a general model, but to predict, it looks

back for the most similar case(s). Irrelevant/noisy features disrupt the

similarity measure, so pre-processing is worthwhile.

• Bayesian classifier/predictor first learns the probabilities with which evidence supports

outcomes; these are then used to predict the outcome for new evidence.

• Support Vector Machines (SVM) are a relatively new and powerful learner,

having attractive characteristics for time series prediction.

- System Evaluation

Proper evaluation is critical to prediction system development. First, it has to

measure exactly the effect of interest, as opposed to mere prediction accuracy. Second, it

has to be sensitive enough to distinguish often minor gains. Third, it has to

convince that the gains are not merely a coincidence.

• Evaluation bias, resulting from the evaluation scheme and time series data,

needs to be recognized.

• Evaluation data should include different regimes, markets, even data errors,

and be plentiful. Dividing test data into segments helps to spot performance

irregularities (for different regimes).

• Sanity checks involve common sense. Prediction errors along the series

should not reveal any structure, unless the predictor missed something.
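Dividing test data into segments, as suggested in the evaluation points above, can be sketched as follows. The fragment computes the mean absolute prediction error for each contiguous segment of the test period, so that a segment with unusually poor performance stands out. The function name, the choice of mean absolute error and the data are assumptions for illustration.

```python
def segment_mean_abs_errors(actual, predicted, segments):
    """Mean absolute prediction error for each contiguous segment.

    The test period is split into `segments` chunks of equal size; the last
    chunk absorbs any remainder.
    """
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    size = len(errors) // segments
    results = []
    for k in range(segments):
        end = (k + 1) * size if k < segments - 1 else len(errors)
        chunk = errors[k * size:end]
        results.append(sum(chunk) / len(chunk))
    return results


# Hypothetical test data in which the predictor degrades over time.
actual = [10, 11, 12, 13, 14, 15]
predicted = [10, 12, 12, 15, 12, 18]
per_segment = segment_mean_abs_errors(actual, predicted, segments=3)
print(per_segment)
```

A steady increase of the error across segments, as in this made-up example, would suggest that the discovered regularity is no longer valid in the later regime.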

1.2. Data mining methodology for stock market prediction

1.2.1. Prediction in data mining

a. Introduction

The goal of data mining is to produce new knowledge that the user can act

upon. It does this by building a model of the real world based on data collected from

a variety of sources. The result of the model building is a description of patterns and

relationships in the data that can be confidently used for prediction.

Prediction is one of the most important problems in data mining. It involves

using some variables or fields in the data set to predict unknown or future values of

other variables of interest. The goal of prediction is to forecast or deduce the value

of an attribute based on values of other attributes.

b. Major types of prediction

- In the view of construction and use of a model

Prediction can be viewed as the construction and use of a model to assess the

class of an unlabeled sample, or to assess the value or value ranges of an attribute

that a given sample is likely to have. In this view, classification and regression are

the two major types of prediction problems:

• Classification: used to predict discrete or nominal values. It predicts into what

category or class a case falls. In other words, classification problems aim to

identify the characteristics that indicate the group to which each case

belongs. Data mining creates classification models by examining already

classified data (cases) and inductively finding a predictive pattern.

• Regression: used to predict continuous or ordered values. It predicts what

number value a variable will have. In other words, regression uses existing

values to forecast what other values will be. The prediction of continuous

values can be modeled by statistical techniques of regression.

- In the view of the kind of value predicted

This view is commonly accepted in data mining. Predicting class labels is

referred to as classification, and predicting continuous values is referred to as

prediction:

• Classification: used to extract models describing important data classes.

Classification predicts categorical class labels. It classifies data (constructs a

model) based on the training set and the values (class labels) in a classifying

attribute and uses it in classifying new data.

• Prediction: used to predict future data trends, i.e., predict unknown or

missing values. It models continuous-valued functions. Any of the methods

and techniques used for classification may also be used for prediction.

1.2.2. Parameters

There are several parameters to characterize data mining methodologies for

stock market forecasting:

1.2.2.1. Data types

Two major groups of data types:

• Attribute data type: each object is represented by attributes, that is, each object x

is given by a set of values $A_1(x), A_2(x), \ldots, A_n(x)$.

• Relational data type: objects are represented by their relations with other

objects. For instance, x > y, y < z, x > z may be known without knowing that

x=3, y=1 and z=2. Thus attributes of objects are not known, but their

relations are known. Objects may have different attributes (e.g., x=5, y=2,

and z=4), but still have the same relations.
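The point that different attribute values can share the same relations can be checked directly. The sketch below derives pairwise order relations from two sets of attribute values, the two examples given above, and compares them. The function name `order_relations` and the string encoding of relations are assumptions made for the illustration.

```python
from itertools import combinations


def order_relations(objects):
    """Return the set of pairwise order relations among named values."""
    relations = set()
    for (name_a, a), (name_b, b) in combinations(sorted(objects.items()), 2):
        if a > b:
            symbol = ">"
        elif a < b:
            symbol = "<"
        else:
            symbol = "="
        relations.add(f"{name_a} {symbol} {name_b}")
    return relations


first = order_relations({"x": 3, "y": 1, "z": 2})
second = order_relations({"x": 5, "y": 2, "z": 4})
print(sorted(first))
print(first == second)
```

A relational method works with the set of relations only, so the two objects are indistinguishable to it even though their attribute values differ.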

1.2.2.2. Data set and techniques

Fundamental and technical analyses are two widely used techniques in stock

market forecast.

- Fundamental analysis

Fundamental analysis tries to determine all the econometric variables that may

influence the dynamics of a given stock price or exchange rate. Often it is hard to

establish which of these variables are relevant and how to evaluate their effect.

- Technical analysis

Technical analysis assumes that when the sampling rate of a given economic

variable is high, all the information necessary to predict the future values is

contained in the time series itself. There are several difficulties in technical analysis

for accurate prediction: successive ticks correspond to bids from different sources,

the correlation between price variations may be low, time series are not stationary,

good statistical indicators may not be known, different realizations of the random

process may not be available, and the number of training examples may not be

enough to accurately infer rules. Therefore, the technical analysis can fit short-term

predictions for stock market time series without great changes in the economic

environment between successive ticks. In practice, technical analysis has been more

successful at identifying market trends, which is much easier than forecasting

future stock prices. Currently different data mining techniques try to incorporate

some of the most common technical analysis strategies in pre-processing of data and

in the construction of appropriate attributes.

Two major options exist: use the time series itself or use all variables that may

influence the evolution of the time series. Data mining methods do not restrict

themselves to a particular option. They follow a fundamental analysis approach

incorporating all available attributes and their values, but they also do not exclude a

technical analysis approach based only on a time series such as stock price and

parameters derived from it. The most popular time series are index value at open, index

value at close, highest index value, lowest index value, trading volume, and

lagged returns from the time series of interest. Fundamental factors include the price

of gold, retail sales index, industrial production indices, and foreign currency

exchange rates. Technical factors include variables that are derived from time series

such as moving averages.
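
To make the notion of technical factors concrete, here is a small Python sketch that derives two such attributes, a simple moving average and lagged returns, from a single series. The closing values are invented for illustration, and the function names are assumptions rather than anything defined in this thesis.

```python
def moving_average(series, window):
    """Simple moving average: each entry averages the last `window` values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def lagged_returns(series, lag=1):
    """Relative change between each value and the value `lag` steps earlier."""
    return [
        (series[i] - series[i - lag]) / series[i - lag]
        for i in range(lag, len(series))
    ]


# Hypothetical closing values of an index, for illustration only.
close = [100.0, 102.0, 101.0, 105.0, 110.0]

print(moving_average(close, window=3))
print(lagged_returns(close, lag=1))
```

Both derived series are shorter than the input, because the first `window - 1` (or `lag`) positions have no complete history to compute from.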

1.2.2.3. Mathematical algorithm (method, model)

A variety of statistical, neural network and logical methods has been developed.

For example, there are many neural network models, based on different

mathematical algorithms, theories and methodologies. Combinations of different

models may provide better performance than any individual model. Many

data mining methods assume a functional form of the relationship being modeled.

1.2.2.4. Form of relationships between objects

The next characteristic of a specific data mining methodology is the form of the

relationship between objects. Many data mining methods assume a functional form

of the relationship being modeled. For instance, linear discriminant analysis

assumes linearity of the border that discriminates between two classes in the space

of attributes. Often it is hard to justify such a functional form in advance. RDM

methodology in the stock market does not assume a functional form for the relationship.

In addition, RDM algorithms do not assume the existence of derivatives. They can

automatically learn symbolic relations on numerical data of stock market time series.

1.2.3. Approaches to stock market prediction

a. Physics approach and data mining approach

The impact of market players on market regularities stimulated a surge of

attempts to use ideas of statistical physics in finance. If an observer is a large

marketplace player, then such an observer can potentially change regularities of the

marketplace dynamically. Attempts to forecast in such a dynamic environment with

thousands of active agents lead to much more complex models than those traditional data

mining methods were designed for. This is one of the major reasons that such interactions

are modeled using ideas from statistical physics rather than from statistical data

mining. The physics approach in finance is also known as "econophysics" and

"physics of finance". The major difference from the data mining approach is

that the data mining approach is, in essence, not about developing

specific methods for financial tasks, whereas the physics approach is.

b. Deterministic dynamic system approach

Stock market data are often represented as a time series of a variety of attributes

such as stock prices and indexes. Time series prediction has been one of the ultimate

challenges in mathematical modeling for many years. Currently data mining

methods try to enhance this study with new approaches. The dynamic system approach

has been developed and applied successfully to many difficult problems in physics.

Recently several studies have applied this technique to the stock

market. Usually, the history of the time series is provided and the goal is to extract

a dynamic system from that data. The dynamic system models the relationship

between a window of past values and a value T time steps ahead. The

major steps of this approach are as follows:

• Step 1: Development of state space for the dynamic system, i.e. selecting

and/or inventing attributes characterizing the system behavior.

• Step 2: Discovering the laws that govern the phenomenon, i.e. discovering

relations between attributes of current and previous states (state vectors) in

the form of differential equations.

• Step 3: Solving differential equations for identifying the transition function

(rules).

• Step 4: Use of the transition function as a predictor of the next state of the

dynamic system, e.g., next day stock value.
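
The relationship between a window of past values and a value T steps ahead can be sketched in Python as follows. This is only an illustration under stated assumptions: it does not discover or solve differential equations as Steps 2 and 3 require, but substitutes a nearest-neighbor lookup in the space of delayed values as a stand-in transition function. The names `delay_pairs` and `nearest_neighbor_forecast` and the toy series are hypothetical.

```python
def delay_pairs(series, window, horizon):
    """Pair each window of past values with the value `horizon` steps after it."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        past = tuple(series[start : start + window])
        target = series[start + window + horizon - 1]
        pairs.append((past, target))
    return pairs


def nearest_neighbor_forecast(pairs, query):
    """Stand-in transition function: reuse the target of the closest past window."""
    def squared_distance(past):
        return sum((p - q) ** 2 for p, q in zip(past, query))

    _, target = min(pairs, key=lambda pair: squared_distance(pair[0]))
    return target


# Toy series, for illustration only.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pairs = delay_pairs(series, window=3, horizon=1)

print(pairs)
print(nearest_neighbor_forecast(pairs, (2.0, 3.0, 4.1)))
```

The state space here is simply the window of delayed values (Step 1), and the lookup plays the role of the predictor in Step 4.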

Inferring a set of rules for a dynamic system assumes that:

• There is enough information in the available data to sufficiently characterize the

dynamics of the system with high accuracy

• All of the variables that influence the time series are available, or they vary

slowly enough that the system can be modeled adaptively

• The system has reached some kind of stationary evolution

• The system is a deterministic system

• The evolution of the system can be described by means of a surface in the

space of delayed values.

There are several applications of these methods to stock time series. However,

the literature contains claims both for and against the existence of a chaotic deterministic

system underlying the stock market. Recent research has focused on methods to

distinguish stochastic noise from deterministic chaotic dynamics and, more generally,

on constructing systems combining deterministic and probabilistic techniques.

1.2.4. Data mining methods in stock market

Almost every computational method has been explored and used for financial

modeling. New developments augment traditional technical analysis of stock market

curves that has been used extensively by financial institutions. Such stock charting

helps to identify buy/sell signals (timing "flags") using graphical patterns. Data

mining, as a process of discovering useful patterns and correlations, has its own place in

stock market modeling.

Similarly to other computational methods, almost every data mining method

and technique has been used in financial modeling. An incomplete list includes a

variety of linear and non-linear models, multi-layer neural networks, k-means and

hierarchical clustering, k-nearest neighbors, decision tree analysis, regression

(logistic regression, general multiple regression), ARIMA, principal component

analysis, and Bayesian learning. Less traditional methods used include rough sets,

RDM methods (deterministic inductive logic programming) and newer probabilistic

methods, support vector machines, independent component analysis, Markov models

and hidden Markov models.

1.2.4.1. Representation languages

a. Propositional logic language

A proposition is a statement that can be true or false. Propositional logic uses

true statements to form or prove other true statements. In other words, propositional

logics are concerned with propositional (or sentential) operators which may be

applied to one or more propositions giving new propositions.

Propositional logic has very limited expressive power. It is not adequate for

formalizing valid arguments that rely on the internal structure of the propositions

involved.

b. First-order logic language

First-order logic (FOL) is a system of deduction extending propositional logic

by the ability to express relations between individuals. FOL languages support

variables, relations, and complex expressions.

The FOL language differs from a propositional logic language mainly by the

presence of variables. Therefore, a language of monadic functions and predicates is

a FOL language, but a very restricted language.

c. Attribute-value languages

An attribute-value language is a propositional language in which propositions are

attribute-value pairs that can be considered as predicates. In other words, in an

attribute-value language, objects are described by tuples of attribute-value pairs,

where each attribute represents some characteristic of the object.

Attribute-value languages are languages of monadic functions (functions of one

variable) and monadic predicates (Boolean functions with only one argument). This

language was not designed to represent relations that involve two, three or more

objects.
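
A minimal Python sketch of an attribute-value description follows. Each object is one tuple of attribute-value pairs, and every predicate takes exactly one object as its argument. The attribute names, the threshold, and the rule itself are hypothetical and serve only to illustrate the monadic restriction.

```python
# One object = one tuple of attribute-value pairs (hypothetical attributes).
day = {"close": 105.0, "volume": 1.2e6, "weekday": "Mon"}
other_day = {"close": 104.0, "volume": 0.8e6, "weekday": "Mon"}


# Monadic predicates: Boolean functions with only one argument.
def high_volume(obj):
    return obj["volume"] > 1.0e6


def is_monday(obj):
    return obj["weekday"] == "Mon"


# A propositional rule combines monadic predicates applied to the same object.
def rule(obj):
    return high_volume(obj) and is_monday(obj)


print(rule(day))
print(rule(other_day))
```

A statement such as "the close of one day is higher than the close of another" cannot be written as a predicate of this kind, because it needs two objects as arguments.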

d. Comparison of these languages

Many well-known rule learners are propositional but propositional

representations offer no general way to describe the essential relations among the

values of the attributes. In contrast with propositional rules, first-order rules have an

advantage in discovering relational assertions because they capture relations

directly. Several types of hypotheses/rules presented in FOL are simple relational

assertions with variables. Relational assertions can be conveniently expressed using

first-order representations, while they are very difficult to describe using

propositional representations.

Also, first-order rules allow one to express naturally other, more general

hypotheses, not only relations between pairs of attributes. These more general

rules can be used both for classification problems and for interval forecasts of a continuous

variable. Moreover, these rules are able to capture Markov chain type models used

for stock market time series forecasting. Algorithms designed to learn sets of first-order

rules that contain variables are significant because first-order rules are much

more expressive than propositional rules.
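
As an illustration of a relational rule with variables, the Python sketch below evaluates one hypothetical first-order rule over all groups of four consecutive days and estimates how often its conclusion holds when its condition does. This is not the MMDR procedure; the rule, the function name `rule_statistics`, and the closing values are assumptions chosen to show how a relation between several objects can be tested and given an empirical frequency.

```python
def rule_statistics(close):
    """Count how often a relational rule fires and how often it is confirmed.

    Rule, with a, b, c, d the closing values on four consecutive days:
        IF a < b AND b < c THEN c < d
    """
    fired = 0      # condition holds
    confirmed = 0  # condition and conclusion both hold
    for a, b, c, d in zip(close, close[1:], close[2:], close[3:]):
        if a < b and b < c:
            fired += 1
            if c < d:
                confirmed += 1
    return fired, confirmed


# Hypothetical closing values, for illustration only.
close = [10, 11, 12, 13, 12, 13, 14, 15, 14]

fired, confirmed = rule_statistics(close)
probability = confirmed / fired if fired else None

print(fired, confirmed, probability)
```

The rule refers only to order relations between days, so it applies unchanged to any series, whatever the absolute price level.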

1.2.4.2. AVL-based methods

The common data mining methodology that assumes the attribute data type is known

as an attribute-based or attribute-value methodology. It covers a wide range of

statistical and connectionist (neural network) methods. There are two types of

attribute-value methods: the first one is based on numerical expressions, and the

second one is based on logical expressions and operations.

Historically, methods based on AVLs such as neural networks, the nearest

neighbors method, and decision trees dominate in financial applications of data
