198 High-Function Business Intelligence in e-business

The data is copied into Lotus 1-2-3 and charted as shown in Figure 4-39.

Figure 4-39 September stock prices

4.3.8 Project growth rates of Web hits for capacity planning purposes

An online retailer of books and music recently created a new Web site devoted to

discount home electronics, and wants to assess the rate of increase of the Web

site hit rate during the first few months of operation. Typically, the daily hit rate for

a popular new news site grows rapidly for a while, and so it is natural to try and fit

a power (non-linear) curve to the data.

The objective is to project the growth rate in order to perform capacity planning.

Data

The major attributes in this scenario are the number of hits, and the day of the

hits on the Web site.

Chapter 4. Statistics, analytic, OLAP functions in business scenarios 199

BI functions showcased

REGR_COUNT, REGR_SLOPE, REGR_ICPT

Steps

Consider a non-linear equation of the form:

This is equivalent to:

log y = a log x + log b

It can be represented as shown in Example 4-33.

Example 4-33 Representing a non-linear equation

SELECT

REGR_SLOPE(LOG(y), LOG(x)) AS a,

EXP(REGR_ICPT(LOG(y), LOG(x))) AS b

.......

The aforementioned curve fitting is explored in the query shown in Example 4-34.

Example 4-34 Computer slope and intercept

SELECT

REGR_COUNT(hits,days) AS num_days,

REGR_SLOPE(LOG(hits),LOG(days)) AS a,

EXP(REGR_ICPT(LOG(hits),LOG(days))) AS b

FROM traffic_data

The foregoing query lists the number of non-null pairs of hits and day in the table,

and computes the values of ‘a’ and ‘b’ in the foregoing equation.

The results of the foregoing query is shown as follows, and indicates that there

were 100 non-null data points in the table, and that the estimated values for ‘a’

and ‘b’ are 1.9874 and 21.4302 respectively.

num_days a b

------------------- ----------------------- -----------------------

100 1.9874 21.4302

R

2

provides the quality of the curve fitting, and is incorrectly

2

computed using

the SQL shown in Example 4-33.

2

This is the incorrect method because we are computing this function on the transformed data using

logarithmic function, rather than on the untransformed date.

ybx

=

a

200 High-Function Business Intelligence in e-business

Example 4-35 Compute

R

2

SELECT

REGR_COUNT(hits,days) AS num_days,

DECIMAL(REGR_SLOPE(log(hits), log(days)),10,4) AS a,

DECIMAL(EXP(REGR_ICPT(log(hits), log(days))),10,4) AS b,

DECIMAL(REGR_R2(log(hits), log(days)),10,4) AS r2

FROM traffic_data

The results of the foregoing query, using the built-in R2 function, are shown as

follows.

num_days a b r2

-------------- ------------- --------------- ----------------

100 1.9874 21.4302 0.9912

However, in order to correctly compute R

2

for the non-linear fit of the original

untransformed data, the SQL shown in Example 4-36 should be used.

Example 4-36 Correct

R

2

computation on original untransformed data

WITH coeffs(a,b) AS

(

SELECT

REGR_SLOPE(LOG(hits),LOG(days)) AS a,

EXP(REGR_ICPT(LOG(hits),LOG(days))) AS b

FROM traffic_data

),

residuals(days,hits,error) AS

(

SELECT

t.days,t.hits,t.hits - c.b*power(t.days,c.a)

FROM traffic_data t, coeffs c

)

SELECT 1e0-(SUM(error*error)/REGR_SYY(hits,days)) AS rr2

FROM residuals

The result of this query is as follows:

-------------------rr2

+9.55408646608249E-001

Note that the correct value R

2

is 0.955 which is lower than 0.991. This is typical.

Computing of R

2

for the transformed data usually results in an overestimate of

the goodness of fit.

The curve fitting data and R

2

value is shown in Figure 4-40 as charted by

Kaleidograph.

Get *DB2 UDB's High-Function Business Intelligence in e-business* now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.