############################################################################## # # NOTE: The single most pervasive concept in the Intranet Logger system # is the notion of main-auxiliary table 'families'. Probably the most # confusing notion is that any main table field that stores log data # can either hold the log data--if it is numeric or date/time--or can # hold the index number of the auxiliary table record that holds the # log data. I chose this for flexibilty. In time, a 'most effective # structure' may make itself apparent and it may be more to the point # to impose a consistent structure on which main table fields, if any, # store log data or auxiliary table index numbers. # # In any event, the sooner this idea of table 'families' becomes clear, # the sooner the whole system should seem more reasonable. # # END NOTE. # # This is a description of the relationship of the log files to the # tables in the database. # # The design for all log types is basically the same. Logs are made up # abundant repetitive data and relatively little new data. # # For each log type, there is a main table contains only date/time or # numeric data. Some of the numeric data is contained in the logs # (pids, etc.) but most of it are index numbers associated with strings # in the auxiliary tables. # # So, effectively, you have great 'compression' of the data from the logs # to the database by a factor of approximately 0.01. # ############################################################################## This is a map used to contruct the scripts that load the database with log data. The data is listed in two contexts: 1. As it is returned by perl when 'split'. 2. As it resides in the database tables. There are seven lines in between each pair of '======' lines. 1. The file name the formatted sample came from. 2. The first line from that file formatted for database insertion. 3. The 'X' marks a value that is indexed in the main table and is stored in an auxillary table. The values not marked with an 'X' are stored in the main table. Thus, all fields in the main table are either numeric or date/time fields. 4. The database field names (as they appear in the table definitions) associated with the fields in the formatted log file lines. 5. The index numbers of the fields when the line (see #2 above) is split using the perl function 'split'. When a quote/comma delimited line is split, perl returns an array with two more elements than the actual number. So a record with 9 fields will 'split' into an array with eleven elements. All of this means the first meaningful element of such an array is numbered 1 instead of 0, as one might normally expect. 6. The field names again. The fields in '<>' are values stored in the main table. The other numbers are the indeces of the elements of the 'split' array that are stored in the auxillary tables. 7. The index numbers of the arrays that are used when inserting the record into the main table. Numbers in '<>' are the index numbers of the record fields in the split line (see #5, #2 above). The other numbers are the index numbers of the array that is returned when the auxillary tables are queried for the index numbers that correspond to the record field values. When the decision is made to insert a new record in the main table, the values for fields not numbered in '<>' are retrieved from the auxiliary tables. So row 5 is the index number list for the data fields of the log entry record. Then there is the index list of the array returned when the auxiliary tables are queried for the index numbers of the data that matches the fields in log entry record. Row 7 is this second array interpolated with row 5. I can only hope this is clear. I'm sorry if it's not. Recall that the idea here is that it is cheaper to endlessly repeat numbers in a table rather than text data. Records are first inserted into the auxillary tables, then the associated index number is inserted in the main table. This is why the scripts update the auxiliary tables first. Example: http_access: There are actally *nine* fields in each record. These are delimited by quotes and commas. When the quoted line is split using these delimiters, Perl returns an array with *eleven* elements. So the first field that has a value of interest is 1. This is why the index numbers in line 5 start at 1. The values not bracketted '<>' are inserted into their respective auxillary tables so that they can be associated with an index number. The indeces of the values are retrieved. Perl DBI returns query data in an array. The numbers not in '<>' in line 7 are the index numbers of this array. If you scan the index numbers not bracketed by '<>', they are in series. So, line 5 is the array index numbers for the actual data in the fields of the record and line 7 is the index numbers for the associated index number of the field value in the array of query data returned (previous paragraph) from the auxillary tables. For example, we are going to insert a new record for httpd_access. We query the auxiliary tables and get back an array with these elements: Element: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Value : | 2 | 19 | 21 | 7 | 9 | 115 | 36 | The value of element '0' is the row number in table 'log_host' that has the data that matches the data in the first field of the log record; in this case, row 2 has the data value 'server', which matches the first field of the log record. The value of element '1' is the row number in table 'remote_host' that has the data that matches the data in the second field of the log record; in this case, row 19 has the data value '198.142.82.165', which matches the second field of the log record. And so on... Now, when we get to the data_time field, the main record keeps this data and so the data is inserted from the array returned when perl 'split' the log entry record. In this case, that array index is 5. The 'request' field data is maintained in the 'request' auxiliary table. The row number that has the data that matches field 6 in the log record is the value of element 4 in the array returned from the query just above. This is why the numbers go out of sync in rows 5 and 7. If 50 records are inserted in the database that all have '198.142.82.165' as the remote host, the number of records in main 'httpd_access' will gain 50 records but the table 'remote_host' will not be altered at all. Further, the 50 records in the main table will have only the number '19' (1 - 2 bytes?) as the identifying information for the remote host. Small. Very small. All of the perl scripts named 'load_*.pl' have queries built using these guidelines. ================================================================================================================================================== 1. http_access.log.2001_04_07 2. 'server','198.142.82.165', '-' , '-' ,'2001-02-27 12:32:48','GET...HTTP/1.0', '404' , '300' , '-' , '-' | 3.| X | X | X | X | | X | | | X | X | 4.| host | remote_host |remote_logname|remote_user| date_time | request | last_status | bytes_sent |referrer|user_agent| 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 6.|log_host| remote_host |remote_logname|remote_user| < date_time > | request |< last_status >|< bytes_sent >|referrer|user_agent| 7.| 0 | 1 | 2 | 3 | <5> | 4 | <7> | <8> | 5 | 6 | ================================================================================================================================================== 1. login_last.log.2001_04_03 2. 'server','root','ftp','slowpoke.linux.f', 'Thu' ,'2001-02-22 18:40:00', '-' , '19:01:00' , '-' , '0:21:00' | 3.| X | X | X | X | X | | X | | | | 4.| host | user | port| origin |day_of_week| date_time |login_status| check_time | days_on | time_on | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 6.| host | user |port | origin |day_of_week| < date_time > |login_status|< check_time >|< days_on >|< time_on >| 7.| 0 | 1 | 2 | 3 | 4 | <6> | 5 | <8> | <9> | <10> | ================================================================================================================================== 1. login_lastlog.log.2001_04_03 NOTE: An entry here will either have a string in 'login_status' or a value in 'date_time', but not both. 2. 'server1','kfanyo','pts/7','server2.linux.ft','2001-02-23 15:17:54' 3.| X | X | X | X | X | | 4.| host | user | port | origin | login_status | date_time | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 6.| host | user | port | origin | login_status | < date_time >| 7.| 0 | 1 | 2 | 3 | 4 | <6> | ================================================================================================================================= 1. login_who.is.2001_04_03 2. 'server1','kfanyo','tty1', '-' ,'2001-02-19 01:00:00','04 00:00:00','0 00:00:1','0 00:00:01','-bash','2001-02-24 00:54:39'| 3.| X | X | X | X | | | | | X | | 4.| host | user | port | origin | date_time | idle | jcpu | pcpu | proc | log_file_date | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 6.| host | user | port | origin | < date_time > | < idle > | < jcpu > | < pcpu > | proc | < log_file_date > | 7.| 0 | 1 | 2 | 3 | <5> | <6> | <7> | <8> | 4 | <10> | ==================================================================================================================================================================== 1. samba_log.nmb.2001_04_05 2. 'server1', 'log' ,'nmb' ,'2001-02-24 11:11:45', '1' , '-' , '-' , '-' ,'nmbd', 'nmbd.c' ,'main', '757' ,'Copyright Andrew Tridgell 1994-1998' 3.| X | X | X | | | | X | X | | X | X | | X | 4.| host |facility|daemon| date_time | flag | daemon_pid |effective|actual|module|source_file| func | pid | message | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 6.| host |facility|daemon| < date_time > |< flag >|< daemon_pid >|effective|actual|module|source_file| func |< pid >| message | 7.| 0 | 1 | 2 | <4> | <5> | <6> | 3 | 4 | 5 | 6 | 7 | <12> | 8 | ==================================================================================================================================================================== 1. syslog_auth.info.2001_04_03 2. 'auth' , 'info' ,'2001-02-24 12:06:08','srvr1', 'su' ,'25026','+ pts/4 kfanyo-root' 3.| X | X | | X | X | X | 4.|facility|severity| date_time | host | user | pid | message | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 6.|facility|severity| < date_time > | host | user |< pid >| message | 7.| 0 | 1 | <3> | 2 | 3 | <6> | 4 | ==================================================================================================================================================================== There are two basic varieties of syslog records. Those with a process ID (pid) and those without. Those without must have their respective inserts reflect this. I don't know whether it is more to the point to have null values in these fields or to use a zero value. My gut feeling is to go with a zero and risk the confusion. ==================================================================================================================================================================== 1. syslog_kern.crit.2001_04_03 2. 'kern' , 'crit' ,'2001-03-21 04:43:15','freebie', 'kernel', '0' ,'EXT2-fs error (device ide0(3,1)): ext2_readdir: directory #126274 contains a hole at offset 0'| 3.| X | X | | X | X | | X | 4.|facility|severity| date_time | host | user | | message | 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 6.|facility|severity| < date_time > | host | user |< (pid) >| message | 7.| 0 | 1 | <3> | 2 | 3 | <6> | 4 | ==================================================================================================================================================================== 1. syslog_tcpd.ftp.2001_04_03 2. 'tcpd' , 'ftp' ,'2001-02-23 21:55:43','srvr1','wu.ftpd','213.37.108.34', 'root' ,'root@213.37.108.34','213.37.108.34','24.0.92.211','wu.ftpd@24.0.92.211','24.0.92.211' 3.| X | X | | X | X | X | X | X | X | X | X | X | 4.|wrapper|service| date_time | host | daemon | remote_name |remote_user| remote_info |remote_address | local_name | local_info |local_address| 5.| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 6.|wrapper|service| < date_time > | host | daemon | remote_name |remote_user| remote_info |remote_address | local_name | local_info |local_address| 7.| 0 | 1 | <3> | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ==================================================================================================================================================================== Another view of this idea: The following is a brief view of the 'main' tables in the database. The field numbers bracketed by '[]' designate auto-increment fields, the the field numbers bracketed by '<>' designate fields in the main table which store log data, and the other fields store the index numbers in the auxiliary tables of records that have the log data. #======================================= # httpd_access #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 remote_host # 5 remote_logname # 6 remote_user # <7> date_time # 8 request # <9> last_status # <10> bytes_sent # 11 referrer # 12 user_agent #======================================= # samba_/[sn]m[bd]/ #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 facility # 5 daemon # <6> date_time # <7> flag # <8> daemon_pid # 9 effective # 10 actual # 11 module # 12 source_file # 13 func # <14> pid # 15 message #======================================= # syslog_tcpd #--------------------------------------- # [1] id # [2] time_in # 3 wrapper # 4 service # <5> date_time # 6 log_host # 7 daemon # <8> pid # 9 remote_name # 10 remote_user # 11 remote_info # 12 remote_address # 13 local_name # 14 local_info # 15 local_address #======================================= # syslog_ #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 facility # 5 severity # <6> date_time # 7 user # <8> pid # 9 message #======================================= # login_last #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 user # 5 port # 6 origin # 7 day_of_week # <8> date_time # 9 login_status # <10> check_time # <11> days_on # <12> time_on #======================================= # login_lastlog #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 user # 5 port # 6 origin # 7 login_status # <8> date_time #======================================= # login_who #--------------------------------------- # [1] id # [2] time_in # 3 log_host # 4 user # <5> port # 6 origin # <7> date_time # 8 idle # <9> jcpu # <10> pcpu # 11 proc # <12> log_file_date #=======================================