Understand Common Data Sources

Before you get started analyzing, determine where the data will be coming from.

In the early days of web site measurement, there was only a single source of data, web server logfiles [Hack #22] . Generated automatically by web servers like Apache and Microsoft Internet Information Server, these flat text files were simply a report of which IP addresses were requesting which objects. At one point, a smart and enterprising soul realized that these files could be parsed and that the results could tell you roughly what people were doing on your web site. Some say, “In the beginning there were logfiles and they were good.…” Unfortunately, web server logfiles weren’t good enough.

Problems with Web Server Logfiles

People began to see problems creeping into their log-based analysis—missing information and requests that could in no way be coming from a human being—and gradually programmers realized that something better would be needed. Problems arose from the emergence of forward caching devices, the addition of page caching in the browsers, and the explosion of nonhuman user agents attempting to catalog the rapidly expanding Internet. As more and more people came online, the caching devices were needed to improve the overall browsing experience, and while they helped the average surfer connecting with a 28k modem, they dramatically impacted the accuracy of log-parsing applications.

Enter the Packet Sniffers

Packet sniffing, also referred to as the network data collection model [Hack #8] , basically puts a listening device on a major node in the web delivery architecture, as shown in Figure 1-6. The listening device would then passively log requests for resources, essentially sitting in front of your web server farm. While sniffing provided some advantages—centralized data collection, more details about failed or cancelled requests, and improved accuracy in server overload conditions—few applications ever supported the model and it never really took off.

Figure 1-6. Network data collectors in an archetypical web server farm

JavaScript Page Tags Are Born

About the same time the packet sniffers were starting to struggle, a handful of companies realized that by embedding a small JavaScript file into web pages, you could collect a great deal of useful information about the page being viewed and the visitor doing the viewing. The JavaScript would then dynamically generate an image request to an external server, appending all the collected information into the query string, and the client-side page tag was born [Hack #28] .

Once initial security and privacy concerns were worked out, JavaScript page tags quickly took off. Companies eagerly bought into the use of tags to render traffic data in real time, and others were delighted by the relative simplicity of the hosted application model.

Eventually, though, companies began to discover the limits in their utility as well. A major limitation of page tags is that if they’re not on a page, no data is collected for that page, making deployment mistakes quite costly from a data collection perspective. Also, some companies balk at the seemingly ever-expanding size of the tags, occasionally exceeding 20 KB!

Evaluating Data Sources

Given that each data source has its benefits and limitations, you might be asking yourself “How should I decide which source is best for my business?” For the most part, there are a handful of needs and limitations. In general, a “yes” answer to any of the criteria presented in Table 1-2 will yield a “best” data source to use.

Table 1-2. Criteria to help you determine which data source to use

Need or limitation	Logfile	Packet sniffer	Page tag
Need to access to historical data, generated prior to application implementation	BEST	NO	NO
Do not have the ability to modify each page on the web site, such as by adding a JavaScript page tag	BEST	OK	NO
Concerned about accuracy being affected by caching devices and software agents	OK	OK	BEST
Have no desire to maintain additional hardware inside the network	OK	NO	BEST
Need access to information in real time	OK	OK	BEST
Need information about which software agents (robots and spiders) are indexing the site	BEST	BEST	NO
Concerned about measurement in server overload conditions	OK	BEST	NO
Concerned about ownership of information	BEST	BEST	NO^[1]
Need to be able to process and track a variety of different types of data (commerce servers, content servers, proxy servers, and media servers) or determine successful delivery of downloadable files (PDF, EXE, DOC, XLS, etc.)	BEST	OK	NO
Concerned about implementation times and ease of implementation	BEST	OK	NO
Need ability to log data directly to a database	BEST	OK	NO
Concerned about high up-front costs	OK	OK	BEST
^[1]While most hosted application vendors will insist that you “own” the data, this is nearly never true since the data physically resides in a different location. Exceptions to this are applications like Visual Sciences and WebTrends SmartSource, which allow page tags to be run from your own infrastructure (and thus the data is wholly owned by you).

While there are no absolutes, it is important decide which source you prefer to use before you start contacting software or service providers. Also, there are often differences in pricing models associated with each data source, with page tags most commonly associated with hosted applications, which are typically paid for on an ongoing basis, and web server logfiles and packet sniffers with software-based solutions, which are typically paid for up front at the beginning of the relationship.

Further complicating your decision making process is the recent emergence of vendors offering hybrid solutions—usually combining page tags and logfiles to create a more holistic view of your data. Vendors like Visual Sciences, Urchin, and Sane Solutions are now offering the ability to blend data sources in new and interesting ways. While hybrid vendors may not be the best solution for you, if you have experience with both logs and page tags and are looking for an alternative, these vendors are worth a look.

Get Web Site Measurement Hacks now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Web Site Measurement Hacks by Eric T. Peterson

Understand Common Data Sources

Problems with Web Server Logfiles

Enter the Packet Sniffers

JavaScript Page Tags Are Born

Evaluating Data Sources

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly