This is the Small Time Intranet Logger project.

INTRODUCTION:

Intranet Logger is a suite of programs designed to centralize the parsing
and presentation of system logs generated by computers in an intranet. The 
log data is pushed to the logging server by each client machine. The logging
server, in turn, maintains the information in an RDBMS and then responds to 
queries via an http daemon interfaced to the RDBMS.

The log data is pushed to the logging server using shell scripts and NFS. At 
the server, the log data is parsed, formatted for database loading, and 
loaded into the database. In its present state, this is done using Perl. The 
RDBMS is MySQL, the httpd is Apache, and the two are interfaced via PHP.

With the exception of Apache and its own license, the system is a GNU-OTS, 
i.e., it uses GNU programs off the shelf throughout, according to GNU's 
documentation: (G)NU (O)ff (T)he (S)helf. It's tempting to just say "GOTS", 
isn't it? But Apache is not (to my knowledge) licensed under the GNU license, 
though it is free. Then it's a (F)ree OTS system. Free to me, free to you. I 
say "FOTS" because I haven't tried to write or rewrite daemons, define a new 
network protocol, define new log format standards, or do anything else like 
that. As it turned out, the available software was more than enough 
foundation. Besides, one of the central themes in the design and coding is 
ease of adaptability to existing software and environments.

The only part I would even come close to calling 'innovative' is the design 
of the database. It is built of table 'families' consisting of a 'main' table 
and an 'auxiliary' table for each field in the main table that is not already 
numeric or date/time in format. Unique text data strings are stored in the 
auxiliary tables, and the index numbers of those strings are stored in the 
main table. More on this later. Yes, it does set the stage for some 
convoluted, funky-looking queries, at least until you get the hang of how the 
queries need to be constructed. Then they are just cumbersome. The result, 
however, is a drastic reduction in the size of the database on disk and a 
means to all but eliminate endlessly repetitive text strings in the stored 
data. There is also the speed payoff that comes from all the data in the 
large tables being date/time or numeric.
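
To make the 'family' idea concrete, here is a minimal sketch of what one such
family, and one query against it, might look like. The table and column names
are invented for this example (the real schema belongs to the 'dbms'
sub-project), and the Perl/DBI wrapping is only illustrative:

    # Hypothetical sketch of one table 'family'; names invented for
    # illustration only.
    use strict;
    use DBI;

    my $dbh = DBI->connect("DBI:mysql:database=logger", "user", "pass",
                           { RaiseError => 1 });

    # The 'main' table holds one record per log entry; every column is
    # numeric or date/time.
    $dbh->do(q{
        CREATE TABLE syslog_main (
            entry_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            logged_at DATETIME NOT NULL,
            host_id   INT UNSIGNED NOT NULL,
            msg_id    INT UNSIGNED NOT NULL
        )
    });

    # One 'auxiliary' table per text field; each unique string is stored
    # exactly once and referenced by its index number.
    $dbh->do(q{
        CREATE TABLE syslog_host (
            host_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            host    VARCHAR(255) NOT NULL,
            UNIQUE (host)
        )
    });

    # Queries join the main table back to its auxiliary tables, which is
    # where the 'cumbersome' feel comes from.
    my $rows = $dbh->selectall_arrayref(q{
        SELECT m.logged_at, h.host
        FROM   syslog_main m, syslog_host h
        WHERE  m.host_id = h.host_id
        AND    h.host    = 'somehost'
    });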

Actually, I get the feeling that the word 'innovative' is a stretch. For all 
I really know this may be a method in use since the early 1940's. Maybe the 
50's.

The people at MySQL say that we should assume data will occupy five times 
(500%) as much disk space after it is inserted into a database. My results so 
far show the database on disk to be only 50%-100% larger than the raw data, 
even with every field of every table of the database indexed. I think this is 
good.

It all started from a desire to learn more about Perl. Then came an increased 
interest in security. So, looking for some security-focused data on which to 
do some practical extracting and reporting, I thought of log data. When I
found Devshed and its 'The Soothingly Seamless Setup of Apache, SSL, MySQL, 
and PHP', the course of Intranet Logger was pretty much set.

If you find it only fractionally as cool as I do, I will be very pleased.

ASSUMPTIONS:

1. The log data is in text.
2. Record fields in the log data are separated by whitespace.
3. TBD

FEATURES:

1. Presently handles all logs generated by TCP Wrappers and Apache, the 
output of 'last', 'lastlog', and 'w', as well as all logs generated by, or 
through, the syslog daemon.

2. It should handle any log generated by virtually any operating system as
long as:
    A. The log data is in ASCII text.
    B. The message data is at the end of the record, regardless of how many 
	fields may come prior to the message and regardless of how long the 
	message may be.
    C. Lines after the first in multiline log entries begin with whitespace.
    D. The clients deliver log data to the server, i.e., data is pushed to 
	the logging server, not pulled from the clients.
    E. The format of the log entries is consistent enough, in terms of which
	data is in which field, to make processing cost effective.

(Actually, I expect conditions 'A' and 'C' to be temporary in the long run, 
though they are necessary at this point.)
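
As a rough Perl sketch of how conditions 'B' and 'C' play out in practice
(the number and the names of the leading fields below are assumed purely for
illustration and are not a statement about any particular log format):

    # Hypothetical sketch: fold continuation lines into their parent entry,
    # then split the leading fields from the trailing message text.
    use strict;

    my @entries;
    while (my $line = <STDIN>) {
        chomp $line;
        if ($line =~ /^\s/ && @entries) {
            # Condition C: lines after the first in a multiline entry begin
            # with whitespace, so append them to the previous entry.
            $entries[-1] .= ' ' . $line;
        } else {
            push @entries, $line;
        }
    }

    for my $entry (@entries) {
        # Condition B: the message is everything from some field to the end
        # of the record, however long it is.  Four leading fields are
        # assumed here only as an example.
        my ($mon, $day, $time, $host, $message) = split /\s+/, $entry, 5;
        # ... $message now holds the whole remainder of the record ...
    }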

3. It minimizes disk space by maintaining 'auxiliary' tables in the database. 
With these 'auxiliary' tables, the database stores only one copy of the text 
data of any given field in any given record. The strings are referenced by 
index numbers, and it is these numbers that are stored in the 'main' tables. 
So the 'main' tables (the tables that have a record for every unique log 
entry) store only numeric/date/time data and so are relatively small and 
fast. Another way to look at it: the system I have been using recently 
inserted its 1 millionth record into its database. The database occupied 
approximately 180MB on disk. That comes out to approximately 180 bytes per 
record. If the raw data is 1/5 of its size in the database, we're looking at 
36 bytes per log entry. Wow, if I do say so myself.
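
The part of the loading that keeps only one copy of each string boils down to
a lookup-or-insert against the auxiliary table. A minimal sketch, again with
invented table and column names, and assuming a connected DBI handle like the
one in the earlier sketch:

    # Hypothetical sketch: return the index number for a text string,
    # inserting it into the auxiliary table only if it is not already there.
    sub string_id {
        my ($dbh, $aux_table, $column, $text) = @_;
        my ($id) = $dbh->selectrow_array(
            "SELECT ${column}_id FROM $aux_table WHERE $column = ?",
            undef, $text);
        return $id if defined $id;
        $dbh->do("INSERT INTO $aux_table ($column) VALUES (?)", undef, $text);
        return $dbh->{mysql_insertid};   # id of the row just inserted
    }

    # e.g. the number that goes into the 'main' table for a hostname:
    # my $host_id = string_id($dbh, 'syslog_host', 'host', 'somehost');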

4. Does not depend on constant network connectivity to work. In other words, 
any client machine, or the server itself, can go down for an almost 
indefinite length of time and the logs will simply accumulate on their 
respective hosts and/or the logging server until the connection is 
re-established. The only constraint here is available disk space.

5. The code is written to be as 'self-documenting' as possible, and I opted 
for making the code understandable first and efficient second. In other 
words, I have tried to make the system accessible to the widest possible 
audience. There is an index of functions, settings and other points of 
interest in the files, so that a cut and paste into the search feature of 
your editor will take you to the code you want without scrolling. Also, when 
debugging, there are many checkpoints which place text strings in the 
debugging output that you can search for to reach the code generating the 
output. I got really tired of scrolling through the code. Lastly, the 
majority of the documentation is in the code files. Indeed, a good deal of 
the text in this documentation is just a cut and paste or a rewrite of the 
comments in the code.
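
The checkpoint idea is nothing fancier than printing a fixed tag you can
paste straight into a search. The sub name and tag below are made up, not the
ones actually used in the project's files:

    # Hypothetical sketch of a debugging 'checkpoint': the tag printed to
    # the debug output is the same string you search for in the source.
    use strict;

    my $DEBUG = 1;

    sub checkpoint {
        my ($tag, @info) = @_;
        print STDERR "CHECKPOINT <$tag>: @info\n" if $DEBUG;
    }

    checkpoint('parse-start', 'opening the incoming log files');
    # searching the source for 'parse-start' lands you right here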

6. The whole of the work done until now has been done with an eye toward 
collaboration, i.e., this has been an open source project since its conception.

7. Compartmentalized development. The project is divided into 6 more or less
distinct sub-projects:
    a. acquire - Collecting logs on the client and transferring them to the 
	server.
    b. parse - Filtering the log entries for corruption and duplication, 
	formatting the entries for loading into the database, and loading them.
    c. dbms - Everything about the database: Design, structure, implementation.
    d. analysis - How the information in the log entries is used.
    e. httpd - Everything about the http interface to the database and to the 
	world.
    f. security - Everything about security.
Of course, there are some dependencies and some overlap. But these are the
sub-project classifications I use in my local CVS repository and they have
worked well enough that I have seen no reason to modify them.

8. As of the initial release, sub-projects a, b, and c are functional enough 
that I am using the system now. Sub-projects d, e, and f are still waiting.

9. Initial development was with Slackware 7.1. Development to this point was
finished with Slackware 8.0. If you are using a 'straight' installation and 
setup of Slackware, this should work fairly easily. I don't know about other
OS's. This is one of the reasons I have looked forward to the initial 
release of the project.

ATTENTION:
1. This project is totally beta. Use at your own risk. Whatever you do, do it
with copies of the logs and test databases first.

If you set the 'rootdir' variable in 'prepLogs4DBMS_Vars.pm' and 
'load_tables.pl' and maintain the directory structure within the new root, the 
system should work anywhere. That is, it's pretty easy to set up a testbed.
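
The exact form of the 'rootdir' setting inside those two files is described
by the comments in the files themselves; purely as a sketch of the idea, the
testbed override amounts to nothing more than something like:

    # Hypothetical sketch only; the real variable layout in
    # prepLogs4DBMS_Vars.pm and load_tables.pl may differ.  The point is to
    # aim everything at a copy of the directory tree, not at the live one.
    our $rootdir = '/home/you/logger-testbed';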