Extract, Transfer and Load Procedure

The ETL loads and process the data of the EUROPA WebNest, and produces the data for the Europa Analytics.

ETL Overview
The ETL (Extract, Transfer and Load) runs on the Sun Solaris 64-bit server 'SASCOMP2'.

Currently it runs daily and takes between 10 and 17 hrs for the whole SAS Web analytics datamart.

The following items are included:


 * Select and download Webserver logs (20 mins)
 * Extract the webserver logs (1h30)
 * Transform and analyse sessions (2h30)
 * Load the DG COMM datamart (25 mins)
 * Load the Europa Overall webmart (10 hrs)
 * Load 135 webmarts (between 3 and 1h45)
 * Backup 135 webmarts and Europa Overall (between 3 mins and 3 hrs)
 * Backup DG_COMM datamart (3hrs)
 * Analyse Spiders (45 mins)

Select and download webserver logs
Every run new S-NET webserver logs are selected and downloaded from a dedicated S-NET weblog-server. The SWA system keeps track of each webserver log processed and helps in selecting the next logs to be processed.

Daily the maximum of 95 logs can be processed.

When there is no delay between 65 and 77 webserver logs are selected.

In case there is a delay the maximum of 95 logs can be processed. For performance reasons the number is 95. Trying to process an higher number of logs always resulted in slower ETL.

Renaming logfiles
To optimise the ETL, the logfiles are ran in 5 parallel processes from the biggest until the smallest. To ease the selection/(re-ordering) of the logfiles by size, the files are renamed, so the alphabetic order corresponds to the size order too.

The webserver logs are downloaded zipped from the weblogserver and stored in a dedicated directory uncompressed.

The files are renamed according to their zipped sizes, meaning that the biggest log is renamed to 0001_[old name] and the smallest log is renamed to 00nn_[old name]. Log files smaller then 600bytes only contains the header and are suppressed.

Analyse Spiders
Analysis of spiders, index engines or crawlers is based on clickstream analysis within SAS Web analytics (SWA) tool. SWA proposes the list and provide all clicks done by those Ip-addresses. Following elements are checked:


 * number of clicks above 5000 with avg of 2secs per click.
 * number pf clicks above 15000 with avg of 1 sec per click.
 * number of OS and browsers used in sessions
 * more then 13 languages used in one session

IP-addresses of spiders are stored and used to filter out while extracting.

IP-addresses not returning are kept for 500 days. They are deleted and not used anymore for filtering in ETL.