Documentation

This section, which is still under heavy development, should give you all the basic information about the phpWebStats package. Since this is only a beta version, you may not find everything you need here; in that case, send me your questions by e-mail. My public e-mail address is dracula007@atlas.cz. Please use a clear subject ("phpWebStats question" or similar), because I receive a lot of spam at this address.

Contents

  1. Requirements & recommendations
  2. Install
    1. Installation of the database part
    2. Installation of the sources
  3. Features
    1. Basic features
    2. Advanced features
  4. How does it work
  5. Function reference
    1. bro.php
    2. stats.php
    3. analyze.php
  6. Visualization
  7. Possible problems

Requirements & recommendations

The phpWebStats package requires only two things installed on the server:

PHP 3.x/4.x
I recommend version 4.x because of its session support.
SQL database
Currently only MySQL is supported. I will probably add PostgreSQL support in the future.

With these two packages installed you'll be able to log the traffic and display the information as simple text, HTML tables, etc. If you want to display images (e.g. charts), you'll have to check that the GD library is installed and that PHP is compiled with GD support. This library can cause some problems, because different versions support different image formats (see the GD README).

I recommend compiling it with PNG support, because nobody owns rights to this format (whereas the LZW compression used by GIF is patented by Unisys). PNG is also technically better than GIF (more colors, better compression, etc.) and most applications support it. If your PHP isn't compiled with PNG support and you can't or don't want to recompile it, you'll have to edit the PHP scripts that generate images.

Install

Installing and using this package is pretty simple. These are the assumptions:

  • You have access to hosting (free or commercial) that supports PHP and MySQL.
  • You have the following information about the database:
    • database name
    • hostname
    • username
    • password

It doesn't matter whether the database used for phpWebStats is separate from your other databases. In fact, you can use the same database for your own data and for phpWebStats, as long as no table names collide. The phpWebStats tables use the "pws_" prefix (pws_arch, pws_browser, ...), so a collision is improbable.

If you don't have the database information presented above, contact the administrator or create the database yourself (if you can).

Installation of the database part

The database installation consists of two steps: creating the structure (i.e. tables) and importing the data (signatures, etc.). Each step means running one SQL script, structure.sql and import.sql respectively, both located in the database/ directory. All you have to do is run these scripts from the command line

# mysql -u username -p database_name < structure.sql
# mysql -u username -p database_name < import.sql

or run them using phpMyAdmin or any other tool. Installation of the database part is finished.

Installation of the sources

There are three main source files in the source/ directory:

  • stats.php - used for detecting and logging the information into the database
  • analyze.php - used for analysis of the data (retrieving the data from the database)
  • bro.php - database abstraction library, used by both PHP files listed above

Copy these files into your include directory and use them like any other library. See the Function reference section for details.

The last step is to open the bro.php file and set your actual database information at the top of the file.

Features

Basic features

All the functions of this package are based on the HTTP_USER_AGENT header and on regular expressions stored in a MySQL database. As stated on the first page, though, it should be easy to use a different SQL database, as long as it supports simple regular expressions.

  • client detection
  • operating system detection
  • architecture detection

Advanced features

These features are not based on regular expressions.

  • search engines detection
  • mail collectors detection
  • country detection (based on the client IP)
  • language detection (based on the HTTP_ACCEPT_LANGUAGE)

Whether the client is a search engine and/or a mail collector is actually determined during browser detection (see Basic features). This allows you to filter out robots (including downloaders, various validators, etc.), so you get only "human traffic," which I suppose is what you're interested in.
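
The language detection listed above can be illustrated with a minimal sketch. It assumes the preferred language is the entry with the highest q weight in the Accept-Language value; whether phpWebStats picks the language this way is not stated here, and pws_preferred_language() is a hypothetical helper, not a function of the package.

```php
<?php
// Sketch only: pick the preferred language from an Accept-Language value.
// pws_preferred_language() is a hypothetical helper, not part of phpWebStats.
function pws_preferred_language($header)
{
    $best = '';
    $best_q = 0.0; // entries with q=0 mean "not acceptable" and are never chosen
    foreach (explode(',', strtolower($header)) as $item) {
        $parts = explode(';', trim($item));
        $lang = trim($parts[0]);
        if ($lang === '' || $lang === '*') {
            continue;
        }
        // A missing q parameter means a weight of 1.0.
        $q = 1.0;
        for ($i = 1; $i < count($parts); $i++) {
            $param = trim($parts[$i]);
            if (strpos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }
        // Strictly greater, so the first of several equal weights wins.
        if ($q > $best_q) {
            $best = $lang;
            $best_q = $q;
        }
    }
    return $best; // empty string if nothing usable was sent
}

echo pws_preferred_language('cs,en-US;q=0.7,en;q=0.3'), "\n"; // prints: cs
```

In a page, the argument would be $_SERVER["HTTP_ACCEPT_LANGUAGE"] when that header is present, and an empty string otherwise.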

How does it work

This section gives an overview of how phpWebStats detects, logs and retrieves information. Don't expect a deep analysis of the source code; this section explains "what is done" rather than "how it is done." If you want to know exactly how a certain piece of information is gathered, see the source code.

What information is available

The first thing to consider is which sources of information are available.

optional information

Most of the information we can get from the client is optional. That means the client can give us this information, but we can't force it to do so. The client can also modify most of it, and so can proxy servers, firewalls and so on, so we shouldn't rely on it for security purposes. But the overwhelming majority of users will send this information, and send it correctly, so for our purpose (traffic analysis) it is sufficient.

By optional information we mean in particular the following HTTP headers:

  • User-Agent
  • Accept-Language
  • Referer

Someone could ask: why not use client-side JavaScript to get more information from the client (screen size, number of colours, etc.)? Well, I've decided to develop a system that doesn't depend on any client-side technology. Maybe some future version will include an option to use JavaScript (or some other client-side technology), but it is not my priority at this time. If you think I should change my opinion, write me an e-mail.

information we'll always have

For every connection there is some information we always get, in particular the IP address. The client can't change it, but one IP address can be shared by hundreds of clients behind a proxy, so we shouldn't trust this information absolutely. But again, for our purpose it is sufficient.

Processing of information

User-Agent

This is probably the most important header, used to detect the following:

  • browser name and (major/minor) version
  • OS name and version
  • arch name

As a first step, this header is preprocessed: it is converted to lowercase, delimiters (except ".") are replaced by a space (" "), repeated spaces are collapsed into a single space ("  " to " "), and the result is trimmed (whitespace is removed from both ends). Here are several examples, each original header followed by its preprocessed form:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
mozilla 4.0 compatible msie 6.0 windows nt 5.0
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
googlebot 2.1 +http: www.googlebot.com bot.html
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.0.1) Gecko/20020826
mozilla 5.0 windows u windows nt 5.0 en us rv:1.0.1 gecko 20020826
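
The preprocessing described above can be sketched as follows. This is only an illustration: pws_preprocess() is a hypothetical name, and the delimiter set ("/", "(", ")", ";" and "-") is an assumption inferred from the three examples; the actual list used in stats.php may differ.

```php
<?php
// Sketch of the preprocessing step; pws_preprocess() is a hypothetical name.
// Assumption: the delimiters are "/", "(", ")", ";" and "-", inferred from
// the examples above. The real list in stats.php may differ.
function pws_preprocess($user_agent)
{
    // 1. Lowercase the whole header.
    $ua = strtolower($user_agent);
    // 2. Replace every delimiter with a space ("." and ":" are kept).
    $ua = str_replace(array('/', '(', ')', ';', '-'), ' ', $ua);
    // 3. Collapse runs of spaces into a single space.
    $ua = preg_replace('/ {2,}/', ' ', $ua);
    // 4. Remove whitespace from both ends.
    return trim($ua);
}

echo pws_preprocess('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)'), "\n";
// prints: mozilla 4.0 compatible msie 6.0 windows nt 5.0
```

With this delimiter set the sketch reproduces all three examples above.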

browser name detection

When the preprocessing of the User-Agent header is done, the detect_browser function runs a regular expression match against the pws_browser.browser_regexp column. The matching rows are ordered by pws_browser.browser_priority, and the one with the highest priority is used.
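
To make the selection rule concrete, here is a small in-memory sketch. In the package the match is done by the database; this sketch uses PHP's preg_match() on an array instead, so pattern syntax may differ in details from MySQL's. The sample rows, the browser_name column and the convention that a larger number means a higher priority are all assumptions made for illustration.

```php
<?php
// Sketch only: pws_pick_browser() is a hypothetical helper working on an
// array of rows instead of the pws_browser table.
function pws_pick_browser($rows, $user_agent)
{
    $best = null;
    foreach ($rows as $row) {
        // Skip rows whose regular expression does not match the header.
        if (!preg_match('~' . $row['browser_regexp'] . '~', $user_agent)) {
            continue;
        }
        // Assumption: a larger browser_priority value means higher priority.
        if ($best === null || $row['browser_priority'] > $best['browser_priority']) {
            $best = $row;
        }
    }
    return $best; // null when no signature matches
}

// Made-up signatures, not the ones shipped in import.sql.
$rows = array(
    array('browser_name' => 'Mozilla', 'browser_regexp' => '^mozilla', 'browser_priority' => 1),
    array('browser_name' => 'MS IE', 'browser_regexp' => 'msie [0-9]', 'browser_priority' => 10),
);

$hit = pws_pick_browser($rows, 'mozilla 4.0 compatible msie 6.0 windows nt 5.0');
echo $hit['browser_name'], "\n"; // prints: MS IE
```

Both sample rows match this header, which is why the priority is needed: many browsers announce themselves as "mozilla" first.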

browser version detection

When detecting the browser name, more than the name is retrieved from the database. The pws_browser.browser_version_regexp column determines how the version is detected for a particular browser. It contains several regular expressions separated by "|", ordered by priority.
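
A minimal sketch of this idea follows. It assumes that each expression captures the major version in its first group and the minor version in its second group (if any); the stored expressions may be organized differently. pws_sketch_detect_version() is a hypothetical stand-in, not the package's detect_version().

```php
<?php
// Sketch only: try each "|"-separated expression in order and stop at the
// first one that matches. Assumption: group 1 is the major version and
// group 2 (optional) is the minor version.
function pws_sketch_detect_version($regexp, $user_agent)
{
    $version = array('major' => '', 'minor' => '');
    foreach (explode('|', $regexp) as $candidate) {
        if ($candidate === '') {
            continue; // ignore empty alternatives
        }
        if (preg_match('~' . $candidate . '~', $user_agent, $m)) {
            $version['major'] = isset($m[1]) ? $m[1] : '';
            $version['minor'] = isset($m[2]) ? $m[2] : '';
            break;
        }
    }
    return $version; // both values stay empty strings if nothing matched
}

$v = pws_sketch_detect_version(
    'msie ([0-9]+)\.([0-9]+)|msie ([0-9]+)',
    'mozilla 4.0 compatible msie 6.0 windows nt 5.0'
);
echo $v['major'], '.', $v['minor'], "\n"; // prints: 6.0
```

Splitting on "|" means the individual expressions cannot use "|" for alternation themselves, which is a limitation of this sketch.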

Function reference

Here you'll find basic information about the functions in the three source files. The function reference is divided by source file.

bro.php

This is a database abstraction library, providing the following functions:

  • db_connect()
  • db_disconnect()
  • db_query()
  • db_num_rows()
  • db_fetch_row()
db_connect()
receives:
nothing
returns:
connection to the database
description:
Connects to the database server, selects the correct database, and returns the connection.
db_disconnect($connection)
receives:
$connection - connection to the database
returns:
nothing
description:
Receives a connection to the database ($connection), closes this connection and returns nothing.
db_query($query,$connection)
receives:
$query - sql query
$connection - connection to the database
returns:
result set from the query
description:
Executes the query over the database and returns the result (if any).
db_num_rows($result,$connection)
receives:
$result - result set from a SQL query execution
$connection - connection to the database
returns:
number of rows in a result set
description:
Counts the rows in the result set (usually by calling some built-in function) and returns this number.
db_fetch_row($result,$connection)
receives:
$result - result set from a SQL query execution
$connection - connection to the database
returns:
associative array
description:
Fetches one row from the result set as an associative array and returns this array.

stats.php

The functions located in this file are used for basic detection of information and for logging it into the database. The functions are:

  • detect_version()
  • detect_browser()
  • detect_os()
  • detect_arch()
  • detect_country()
  • detect_lang()
  • log_access()
detect_version($regexp,$user_agent)
receives:
$regexp - regular expression used to search the version
$user_agent - preprocessed $_SERVER["HTTP_USER_AGENT"] string
returns:
associative array
"major" => major version
"minor" => minor version
description:
This function tries to detect the major/minor version of the browser, using a regular expression search in the $user_agent parameter. If the version can't be detected, both values in the associative array are empty strings.
detect_browser($connection,$user_agent)
receives:
$connection - connection to the database
$user_agent - preprocessed $_SERVER["HTTP_USER_AGENT"] string
returns:
associative array
"key" => key from the pws_browser table
"name" => name of the browser
"major" => major version
"minor" => minor version
description:
This function tries to detect the browser in $user_agent string by a regexp search against the pws_browser table (browser_regexp column). If it's impossible to detect the browser, then all items in the returned array are empty strings (""), except "key", which is 0.
detect_os($connection,$user_agent)
receives:
$connection - connection to the database
$user_agent - preprocessed $_SERVER["HTTP_USER_AGENT"] string
returns:
associative array
"key" => key from the pws_os table
"name" => name of the OS
"version" => version of the OS
description:
This function detects the OS of the browser by a regexp search against the pws_os table (os_regexp column). If it's impossible to detect the OS, then all values in the returned array are empty strings (""), except "key", which is 0.
detect_arch($connection,$user_agent)
receives:
$connection - connection to the database
$user_agent - preprocessed $_SERVER["HTTP_USER_AGENT"] string
returns:
associative array
"key" => key from the pws_arch table
"name" => name of the architecture
description:
This function detects the architecture of the client by using a regexp search against the pws_arch table (arch_regexp column).
detect_country($connection,$ip)
receives:
$connection - connection to the database
$ip - client's IP
returns:
associative array
"key" => key from the pws_country table
"name" => name of the country
description:
This function detects the country of the client using a reverse lookup on the client's IP address. Several problems can occur during this step; see Possible problems. If the country can't be detected, then "key" is 0 and "name" is an empty string ("").
detect_lang($connection)
receives:
$connection - connection to the database
returns:
associative array
"key" => key from the pws_lang table
"name" => name of the language
description:
This function detects the language set as the preferred language in the browser. If the language can't be detected, then "key" is 0 and "name" is an empty string ("").
log_access($connection,$page_id,$visitor_id)
receives:
$connection - connection to the database
$page_id - page ID
$visitor_id - visitor ID
returns:
nothing
description:
This function does all the detection and logging. It calls all the previous functions from stats.php and stores the results in the pws_stats table.

The $page_id parameter should identify the page from which the function is called (for example, for the index page it could be $page_id = "Index", for photo galleries it could be $page_id = "Galleries", etc.).

The $visitor_id parameter should identify the visitor, so that you can count both unique visitors and total pageviews. The simplest way to do this is to set $visitor_id to the session ID, but then you won't be able to distinguish new visitors from visitors who are coming back. It is better to use a session for every visit, and cookies to store information about the previous visit.
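
One possible way to obtain such a $visitor_id is sketched below. It covers only the cookie half of the advice above (recognizing a returning visitor) and leaves session handling out. Both pws_resolve_visitor() and the cookie name "pws_visitor" are hypothetical, and random_bytes() requires PHP 7 or newer.

```php
<?php
// Sketch only: reuse the ID stored in a cookie, or create a new one.
// pws_resolve_visitor() and the cookie name are assumptions for illustration.
function pws_resolve_visitor($cookies)
{
    $name = 'pws_visitor';
    // Accept the cookie only if it looks like an ID generated below.
    if (isset($cookies[$name]) && is_string($cookies[$name])
        && preg_match('/^[0-9a-f]{32}$/', $cookies[$name])) {
        return array('visitor_id' => $cookies[$name], 'returning' => true);
    }
    // New visitor: 16 random bytes, written as 32 hex characters.
    return array('visitor_id' => bin2hex(random_bytes(16)), 'returning' => false);
}

// Possible use in a page, before any output is sent:
// $visitor = pws_resolve_visitor($_COOKIE);
// setcookie('pws_visitor', $visitor['visitor_id'], time() + 365 * 24 * 3600, '/');
// log_access($connection, 'Index', $visitor['visitor_id']);

$first = pws_resolve_visitor(array());
$again = pws_resolve_visitor(array('pws_visitor' => $first['visitor_id']));
var_dump($again['returning']); // bool(true)
```

Keep in mind that cookies are optional information too: a client can refuse or delete them, in which case every visit looks like a new visitor.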

analyze.php

These are the basic functions used for analysis of the data stored in the database. I'll add more functions in future versions. If you can think of a feature I should add here, write to me; I'll either add it immediately or put your request on the wishlist. Available functions are:

  • get_unique_users()
  • get_pageviews()
  • get_number_days()
  • get_month_stats()
  • get_hour_stats()
  • get_countries_stats()
  • get_pages_stats()
  • get_language_stats()
  • get_browser_stats()
  • get_os_stats()
  • get_week_stats()
get_unique_users($connection)
receives:
$connection - connection to the database
returns:
number of unique users
description:
This function retrieves the number of unique users from the database.
get_pageviews($connection)
receives:
$connection - connection to the database
returns:
number of pageviews
description:
This function retrieves number of pageviews from the database.
get_number_days($year,$month)
receives:
$year - year
$month - month
returns:
number of days
description:
Returns the number of days in the selected month of the given year.
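
As an illustration, such a function can be written with PHP's built-in date() and mktime() functions. This is a sketch of one possible approach, not necessarily how get_number_days() is implemented; pws_days_in_month() is a hypothetical name.

```php
<?php
// Sketch only: the "t" format character of date() gives the number of days
// in the month of the given timestamp.
function pws_days_in_month($year, $month)
{
    // Timestamp of the first day of the month at midnight.
    $first_day = mktime(0, 0, 0, $month, 1, $year);
    return (int) date('t', $first_day);
}

echo pws_days_in_month(2024, 2), "\n"; // prints: 29
```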
get_month_stats($connection,$year,$month)
receives:
$connection - connection to the database
$year - year
$month - month
returns:
associative array
day of month => ("users" => number of unique users, "views" => number of pageviews)
description:
This function retrieves month statistics, that means number of unique visitors and pageviews by day.
get_hour_stats($connection,$from,$to)
receives:
$connection - connection to the database
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
returns:
associative array
hour => ("users" => number of visitors, "views" => number of pageviews)
description:
get_countries_stats($connection,$from,$to,$sort)
receives:
$connection - connection to the database
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
$sort - determines the sorting rule (default 0 means sorting by country name)
returns:
associative array
index => ("name" => country name, "users" => number of users, "views" => number of views)
description:
get_pages_stats($connection,$from,$to,$sort)
receives:
$connection - connection to the database
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
$sort - determines the sorting rule (default 0 means sorting by country name)
returns:
associative array
id => page ID
count => number of pageviews
unique => number of unique users
description:
get_language_stats($connection,$from,$to,$sort)
receives:
$connection - connection to the database
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
$sort - determines the sorting rule (default 0 means sorting by country name)
returns:
nothing
description:
get_browser_stats($connection,$detail = false,$from,$to,$sort)
receives:
$connection - connection to the database
$detail - boolean, determines the detail of output
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
$sort - determines the sorting rule (default 0 means sorting by country name)
returns:
array of associative arrays
"name" => browser name
"count" => pageviews
"unique" => unique visitors
description:
This function retrieves probably the most interesting information from the database: statistics of the browsers used by visitors. There are two possible levels of detail, rough ($detail == false) and fine ($detail == true). At the rough level, browsers are distinguished by name only (MS IE, Netscape, etc.). At the fine level, browsers are distinguished by name and major version (MS IE 5, MS IE 5.5, MS IE 6, Netscape 4, Netscape 6, etc.).
get_os_stats($connection,$from,$to,$sort)
receives:
$connection - connection to the database
$from - unix timestamp, lower bound of a time interval (default -1 means no lower bound)
$to - unix timestamp, upper bound of a time interval (default -1 means no upper bound)
$sort - determines the sorting rule (default 0 means sorting by country name)
returns:
array of associative arrays
"name" => name of OS
"count" => pageviews
"unique" => unique visitors
description:
This function retrieves another interesting piece of information from the database: statistics of OS usage.
get_week_stats($connection)
receives:
$connection - connection to the database
returns:
array of associative arrays
"name" => name of OS
"count" => pageviews
"unique" => unique visitors
description:
This function retrieves another interesting information from the database - statistics of OS usage.
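
Several of the functions above take $from and $to parameters where -1 means "no bound". The sketch below shows how such a convention could be turned into an SQL condition. The helper pws_time_filter(), the column name "stat_time" and the choice of inclusive bounds are assumptions made for illustration; only the pws_stats table name comes from this documentation.

```php
<?php
// Sketch only: build a WHERE clause from optional unix timestamps.
// -1 means "no bound", as in the parameter descriptions above.
function pws_time_filter($from = -1, $to = -1, $column = 'stat_time')
{
    $conditions = array();
    if ($from != -1) {
        $conditions[] = $column . ' >= ' . (int) $from; // cast guards the query
    }
    if ($to != -1) {
        $conditions[] = $column . ' <= ' . (int) $to;
    }
    return $conditions ? ' WHERE ' . implode(' AND ', $conditions) : '';
}

echo 'SELECT COUNT(*) FROM pws_stats' . pws_time_filter(1000, -1), "\n";
// prints: SELECT COUNT(*) FROM pws_stats WHERE stat_time >= 1000
```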

Possible problems

Browser/OS/architecture detection

The client can send bad, incomplete or fake information in the HTTP_USER_AGENT header, so you won't be able to detect it correctly. It's impossible to decide which of the information sent to you is correct and which is not. And of course it's also possible that the browser signature isn't in the database yet.

If this happens, the detection will give you incorrect or incomplete information, and there's no way around it.

Country detection

The country detection is based on the client's IP address: it tries to retrieve the domain name for that IP (reverse lookup). If successful, the last part of the name (the part after the last dot) is tested to see whether it is a country code (ISO 3166). The following domains are not recognized as country codes:

  • .aero
  • .arpa
  • .biz
  • .com
  • .coop
  • .edu
  • .info
  • .int
  • .museum
  • .net
  • .org

So there's one possible problem: the lookup can be unsuccessful. It may be impossible to get the domain name, or the domain name may not contain any information about the country.
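
The test on the last part of the domain name can be sketched as follows. This is a simplification: instead of looking the code up in the pws_country table, it only checks that the last label consists of two letters, which is an assumption of this sketch. pws_country_code_from_host() is a hypothetical helper.

```php
<?php
// Sketch only: extract a country code candidate from a host name.
function pws_country_code_from_host($host)
{
    // Top-level domains from the list above that carry no country information.
    $generic = array('aero', 'arpa', 'biz', 'com', 'coop', 'edu',
                     'info', 'int', 'museum', 'net', 'org');
    if (!is_string($host)) {
        return ''; // the lookup failed
    }
    $host = strtolower(rtrim($host, '.'));
    $pos = strrpos($host, '.');
    if ($pos === false) {
        return '';
    }
    $tld = substr($host, $pos + 1);
    if (in_array($tld, $generic, true)) {
        return '';
    }
    // Assumption: accept any two-letter label instead of checking ISO 3166.
    return preg_match('/^[a-z]{2}$/', $tld) ? $tld : '';
}

// In a page the host name would come from a reverse lookup:
// $host = gethostbyaddr($_SERVER['REMOTE_ADDR']);
echo pws_country_code_from_host('proxy.example.cz'), "\n"; // prints: cz
```

When the reverse lookup finds no name, gethostbyaddr() returns the IP address unchanged; its last part is a number, so the sketch returns an empty string in that case as well.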

It is possible to use the NetGeo service, which will give you the correct answer for the overwhelming majority of requests, but there is one big restriction: at most 30 lookups every 30 seconds are allowed (the server doesn't use a rolling average), so this service is not suitable for sites with more than 30 new visitors per 30 seconds. (There's a second problem too: a lookup can take up to 30 seconds to time out, but this can be solved.) This is the reason why I didn't include NetGeo support in the current release. In the future it will be possible to use NetGeo, but not for on-the-fly detection.