##############################################################################
#
#	NOTE: The single most pervasive concept in the Intranet Logger system
#	is the notion of main-auxiliary table 'families'. Probably the most
#	confusing notion is that any main table field that stores log data
#	can either hold the log data--if it is numeric or date/time--or can
#	hold the index number of the auxiliary table record that holds the 
#	log data. I chose this for flexibilty. In time, a 'most effective 
#	structure' may make itself apparent and it may be more to the point
#	to impose a consistent structure on which main table fields, if any,
#	store log data or auxiliary table index numbers.
#
#	In any event, the sooner this idea of table 'families' becomes clear,
#	the sooner the whole system should seem more reasonable.
#
#	END NOTE.
#
#	This is a description of the relationship of the log files to the 
#	tables in the database. 
#
#	The design for all log types is basically the same. Logs are made up
#	abundant repetitive data and relatively little new data. 
#
#	For each log type, there is a main table contains only date/time or 
#	numeric data. Some of the numeric data is contained in the logs 
#	(pids, etc.) but most of it are index numbers associated with strings
#	in the auxiliary tables.
#
#	So, effectively, you have great 'compression' of the data from the logs
#	to the database by a factor of approximately 0.01. 
#
##############################################################################

This is a map used to contruct the scripts that load the database with
log data. The data is listed in two contexts:

1. As it is returned by perl when 'split'.
2. As it resides in the database tables.

There are seven lines in between each pair of '======' lines.
1. The file name the formatted sample came from.
2. The first line from that file formatted for database insertion.
3. The 'X' marks a value that is indexed in the main table and is stored in 
    an auxillary table. The values not marked with an 'X' are stored in the 
    main table. Thus, all fields in the main table are either numeric or 
    date/time fields.
4. The database field names (as they appear in the table definitions)
    associated with the fields in the formatted log file lines.
5. The index numbers of the fields when the line (see #2 above) is split
    using the perl function 'split'. When a quote/comma delimited line is 
    split, perl returns an array with two more elements than the actual 
    number. So a record with 9 fields will 'split' into an array with eleven 
    elements. All of this means the first meaningful element of such an 
    array is numbered 1 instead of 0, as one might normally expect.
6. The field names again. The fields in '<>' are values stored in the main 
    table. The other numbers are the indeces of the elements of the 'split' 
    array that are stored in the auxillary tables.
7. The index numbers of the arrays that are used when inserting the record
    into the main table. Numbers in '<>' are the index numbers of the record 
    fields in the split line (see #5, #2 above). The other numbers are the 
    index numbers of the array that is returned when the auxillary 
    tables are queried for the index numbers that correspond to the record 
    field values.
	
    When the decision is made to insert a new record in the main table, the 
    values for fields not numbered in '<>' are retrieved from the
    auxiliary tables. So row 5 is the index number list for the data fields
    of the log entry record. Then there is the index list of the array 
    returned when the auxiliary tables are queried for the index numbers of
    the data that matches the fields in log entry record.
    
    Row 7 is this second array interpolated with row 5. I can only hope 
    this is clear. I'm sorry if it's not.

    Recall that the idea here is that it is cheaper to endlessly repeat 
    numbers in a table rather than text data.  Records are first inserted 
    into the auxillary tables, then the associated index number is inserted 
    in the main table. This is why the scripts update the auxiliary tables 
    first.

Example: http_access:

There are actally *nine* fields in each record. These are delimited by 
quotes and commas. When the quoted line is split using these delimiters,
Perl returns an array with *eleven* elements. So the first field that 
has a value of interest is 1. This is why the index numbers in line 5 
start at 1.

The values not bracketted '<>' are inserted into their respective 
auxillary tables so that they can be associated with an index number.

The indeces of the values are retrieved. Perl DBI returns query data in
an array. The numbers not in '<>' in line 7 are the index numbers of
this array. If you scan the index numbers not bracketed by '<>', they
are in series.

So, line 5 is the array index numbers for the actual data in the fields
of the record and line 7 is the index numbers for the associated index
number of the field value in the array of query data returned (previous
paragraph) from the auxillary tables.

For example, we are going to insert a new record for httpd_access.
We query the auxiliary tables and get back an array with these elements:

Element: |   0   |   1   |   2   |   3   |   4   |   5   |   6   |
Value  : |   2   |  19   |  21   |   7   |   9   |  115  |  36   |

The value of element '0' is the row number in table 'log_host' that has 
    the data that matches the data in the first field of the log record;
    in this case, row 2 has the data value 'server', which matches the 
    first field of the log record.

The value of element '1' is the row number in table 'remote_host' that has 
    the data that matches the data in the second field of the log record;
    in this case, row 19 has the data value '198.142.82.165', which matches 
    the second field of the log record.
    
And so on...

Now, when we get to the data_time field, the main record keeps this data
and so the data is inserted from the array returned when perl 'split' the 
log entry record. In this case, that array index is 5. The 'request' field
data is maintained in the 'request' auxiliary table. The row number that 
has the data that matches field 6 in the log record is the value of 
element 4 in the array returned from the query just above.

This is why the numbers go out of sync in rows 5 and 7.

If 50 records are inserted in the database that all have '198.142.82.165'
as the remote host, the number of records in main 'httpd_access' will gain
50 records but the table 'remote_host' will not be altered at all. 

Further, the 50 records in the main table will have only the number '19'
(1 - 2 bytes?) as the identifying information for the remote host. Small.
Very small.

All of the perl scripts named 'load_*.pl' have queries built using these
guidelines.

==================================================================================================================================================
1. http_access.log.2001_04_07
2. 'server','198.142.82.165',      '-'     ,    '-'    ,'2001-02-27 12:32:48','GET...HTTP/1.0',     '404'     ,     '300'    ,   '-'  ,   '-'    |
3.|    X   |        X       |       X      |     X     |                     |        X       |               |              |    X   |    X     |
4.|  host  |  remote_host   |remote_logname|remote_user|      date_time      |     request    |  last_status  |  bytes_sent  |referrer|user_agent|
5.|    1   |        2       |       3      |     4     |          5          |        6       |       7       |       8      |    9   |    10    |    
6.|log_host|  remote_host   |remote_logname|remote_user|    < date_time >    |     request    |< last_status >|< bytes_sent >|referrer|user_agent|
7.|    0   |        1       |       2      |     3     |         <5>         |        4       |      <7>      |      <8>     |    5   |     6    |
==================================================================================================================================================
1. login_last.log.2001_04_03
2. 'server','root','ftp','slowpoke.linux.f',   'Thu'   ,'2001-02-22 18:40:00',     '-'    ,  '19:01:00'  ,    '-'    , '0:21:00' |
3.|   X    |   X  |  X  |        X         |     X     |                     |      X     |              |           |           |
4.|  host  | user | port|      origin      |day_of_week|       date_time     |login_status|  check_time  |  days_on  | time_on   |
5.|    1   |  2   |  3  |        4         |     5     |           6         |      7     |       8      |     9     |    10     |
6.|  host  | user |port |      origin      |day_of_week|     < date_time >   |login_status|< check_time >|< days_on >|< time_on >|
7.|    0   |  1   |  2  |        3         |     4     |          <6>        |      5     |      <8>     |    <9>    |   <10>    |
==================================================================================================================================
1. login_lastlog.log.2001_04_03

NOTE: An entry here will either have a string in 'login_status' or a
value in 'date_time', but not both.

2. 'server1','kfanyo','pts/7','server2.linux.ft','2001-02-23 15:17:54'
3.|   X   |  X   |  X   |   X    |      X       |              |
4.| host  | user | port | origin | login_status |   date_time  |
5.|   1   |  2   |  3   |   4    |      5       |       6      |
6.|  host | user | port | origin | login_status | < date_time >|
7.|   0   |  1   |  2   |   3    |      4       |      <6>     |  
=================================================================================================================================
1. login_who.is.2001_04_03
2. 'server1','kfanyo','tty1',   '-'  ,'2001-02-19 01:00:00','04 00:00:00','0 00:00:1','0 00:00:01','-bash','2001-02-24 00:54:39'|
3.|    X    |    X   |  X   |   X    |                     |             |           |            |   X   |                     |
4.|   host  |  user  | port | origin |      date_time      |    idle     |   jcpu    |   pcpu     |  proc |    log_file_date    |
5.|    1    |    2   |  3   |   4    |          5          |      6      |     7     |     8      |   9   |         10          |
6.|   host  |  user  | port | origin |    < date_time >    |  < idle >   | < jcpu >  | < pcpu >   |  proc |  < log_file_date >  |
7.|    0    |    1   |  2   |   3    |         <5>         |     <6>     |    <7>    |    <8>     |   4   |        <10>         |
====================================================================================================================================================================
1. samba_log.nmb.2001_04_05
2. 'server1',  'log' ,'nmb' ,'2001-02-24 11:11:45',   '1'  ,      '-'     ,   '-'   , '-'  ,'nmbd',  'nmbd.c' ,'main', '757' ,'Copyright Andrew Tridgell 1994-1998'
3.|    X    |    X   |  X   |                     |        |              |    X    |  X   |      |     X     |   X  |       |                  X                  |
4.|   host  |facility|daemon|      date_time      |  flag  |  daemon_pid  |effective|actual|module|source_file| func |  pid  |               message               |
5.|    1    |    2   |  3   |          4          |    5   |       6      |     7   |  8   |   9  |    10     |  11  |  12   |                 13                  |
6.|   host  |facility|daemon|    < date_time >    |< flag >|< daemon_pid >|effective|actual|module|source_file| func |< pid >|               message               |
7.|    0    |    1   |  2   |         <4>         |   <5>  |      <6>     |     3   |  4   |   5  |    6      |  7   | <12>  |                  8                  |
====================================================================================================================================================================
1. syslog_auth.info.2001_04_03
2.  'auth' , 'info' ,'2001-02-24 12:06:08','srvr1', 'su' ,'25026','+ pts/4 kfanyo-root'
3.|    X   |    X   |                     |   X   |  X           |          X          |
4.|facility|severity|       date_time     | host  | user |  pid  |        message      |
5.|   1    |   2    |           3         |   4   |  5   |   6   |          7          |
6.|facility|severity|     < date_time >   | host  | user |< pid >|        message      |
7.|   0    |   1    |          <3>        |   2   |  3   |  <6>  |          4          |
====================================================================================================================================================================

There are two basic varieties of syslog records. Those with a process ID (pid)
and those without. Those without must have their respective inserts reflect this.
I don't know whether it is more to the point to have null values in these fields
or to use a zero value. My gut feeling is to go with a zero and risk the confusion.
====================================================================================================================================================================
1. syslog_kern.crit.2001_04_03
2.  'kern' , 'crit' ,'2001-03-21 04:43:15','freebie', 'kernel',   '0'   ,'EXT2-fs error (device ide0(3,1)): ext2_readdir: directory #126274 contains a hole at offset 0'|
3.|    X   |   X    |                     |    X    |    X    |	        |                      X                                                                            |
4.|facility|severity|      date_time      |   host  |  user   |         |                   message                                                                         |
5.|   1    |   2    |          3          |    4    |   5     |    6    |                      7                                                                            |
6.|facility|severity|    < date_time >    |   host  |  user   |< (pid) >|                   message                                                                         |
7.|   0    |   1    |         <3>         |    2    |   3     |   <6>   |                      4                                                                            |
====================================================================================================================================================================
1. syslog_tcpd.ftp.2001_04_03
2. 'tcpd' , 'ftp' ,'2001-02-23 21:55:43','srvr1','wu.ftpd','213.37.108.34',  'root'   ,'root@213.37.108.34','213.37.108.34','24.0.92.211','wu.ftpd@24.0.92.211','24.0.92.211'
3.|   X   |   X   |                     |  X    |    X    |       X       |     X     |          X         |       X       |      X      |           X         |     X       |
4.|wrapper|service|       date_time     | host  | daemon  |  remote_name  |remote_user|     remote_info    |remote_address | local_name  |       local_info    |local_address|
5.|   1   |   2   |           3         |   4   |    5    |       6       |     7     |          8         |       9       |      10     |           11        |     12      |
6.|wrapper|service|     < date_time >   | host  | daemon  |  remote_name  |remote_user|     remote_info    |remote_address | local_name  |       local_info    |local_address|
7.|   0   |   1   |          <3>        |   2   |    3    |       4       |     5     |          6         |       7       |      8      |            9        |     10      |
====================================================================================================================================================================

Another view of this idea: 

The following is a brief view of the 'main' tables in the database. The field 
numbers bracketed by '[]' designate auto-increment fields, the the field 
numbers bracketed by '<>' designate fields in the main table which store log
data, and the other fields store the index numbers in the auxiliary tables of 
records that have the log data.

#=======================================
#	httpd_access
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	remote_host
#	5	remote_logname
#	6	remote_user
#	<7>	date_time
#	8	request
#	<9>	last_status
#	<10>	bytes_sent
#	11	referrer
#	12	user_agent
#=======================================
#	samba_/[sn]m[bd]/
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	facility
#	5	daemon
#	<6>	date_time
#	<7>	flag
#	<8>	daemon_pid
#	9	effective
#	10	actual
#	11	module
#	12	source_file
#	13	func
#	<14>	pid
#	15	message
#=======================================
#	syslog_tcpd
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	wrapper
#	4	service
#	<5>	date_time
#	6	log_host
#	7	daemon
#	<8>	pid
#	9	remote_name
#	10	remote_user
#	11	remote_info
#	12	remote_address
#	13	local_name
#	14	local_info
#	15	local_address
#=======================================
#	syslog_<other>
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	facility
#	5	severity
#	<6>	date_time
#	7	user
#	<8>	pid
#	9	message
#=======================================
#	login_last
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	user
#	5	port
#	6	origin
#	7	day_of_week
#	<8>	date_time
#	9	login_status
#	<10>	check_time
#	<11>	days_on
#	<12>	time_on
#=======================================
#	login_lastlog
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	user
#	5	port
#	6	origin
#	7	login_status
#	<8>	date_time
#=======================================
#	login_who
#---------------------------------------
#	[1]	id
#	[2]	time_in
#	3	log_host
#	4	user
#	<5>	port
#	6	origin
#	<7>	date_time
#	8	idle
#	<9>	jcpu
#	<10>	pcpu
#	11	proc
#	<12>	log_file_date
#=======================================