pmacct (Promiscuous mode IP Accounting package)
pmacct is Copyright (C) 2004 by Paolo Lucente

(poorman's) TABLE OF CONTENTS: 
I.	Introduction
II.	Primitives
III.	Processes Vs. threads
IV.	Communications between core process and plugins
V.	Memory table plugin
VI.	SQL issues and *SQL plugins
VII.	Recovery modes


I. Introduction
Compared to the old 'INTERNALS' textfile, this one starts with a big step forward: a rough
table of contents. It is still neither fancy nor formatted, and I'm aware the package still
lacks a man page. The goal of this file is a careful description of the directions taken
while writing the code, exposing the work done to constructive criticism and letting the
ideas emerge from the crypticism of the code.


II. Primitives
They are a way to express what aggregation method has to be applied over incoming data.
While looking forward over time we see a more generalized way to bind aggregation methods
to pieces of data, currently primitives give a sufficient flexibility.
The concept of primitive itself carries the image of simple entities that can be stacked
together to form complex expressions using boolean operators.
Going practical, primitives are expressions like "src_port", "dst_host", "proto", etc.;
currently the unique boolean operator supported to glue expressions is "and". So, you can
aggregate traffic translating "who connects where, using which service" speech language
expression to one understandable by pmacct: "src_host,dst_host,dst_port,proto". Commas are
simple separators between each primitive because of the unique logical connective.
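As an illustrative sketch, such an expression would typically appear as the value of the
'aggregate' configuration key (key names as documented in the package's CONFIG-KEYS file;
the interface name and plugin choice below are just example values):

```
! hypothetical pmacctd configuration fragment
interface: eth0
plugins: memory
aggregate: src_host,dst_host,dst_port,proto
```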


III. Processes Vs. threads 
pmacctd, the pmacct daemon, relies strongly on a multi-process organization rather than on
threads. By threads we mean what is commonly referred to as threads of execution, which
share their entire address space inside a single process.
Processes are used to encapsulate each plugin instance and, of course, the core process.
The core process collects packets via the pcap library API, processes them and sends
aggregated data to the plugins. Plugins take the aggregated data (struct pkt_data) and
handle it in some meaningful way.
A picture follows:
					   |===> [ pmacctd/plugin ]
libpcap			           pipe	   |
===========> [ pmacctd/core ]==============|===> [ pmacctd/plugin ]

Except for specific cases (e.g. big memory structures that would lead the pages'
copy-on-write to perform horrendously), I don't like the idea of threads on UNIXes and
Linux. They are suitable for, and were born in, environments with expensive process spawning
and weak IPC facilities. Moreover, the task of managing critical regions in a shared address
space is sometimes quite difficult and a fertile source of bugs, simply because threads
easily know too much about each other's internal state. This frequently adds, on top of the
basic issues described in every Operating Systems textbook, a whole new range of
timing-dependent bugs that are excruciatingly difficult even to reproduce. These
considerations leave aside portability troubles and differences of behaviour across
platforms.


IV. Communications between core process and plugins
A single running pmacctd core process is able to feed data to multiple plugins, even of the
same type. Plugins of the same type are distinguished by their name; names are unique.
Currently (and there are plans to change this) data flows inside a pipe created by a
socketpair() call. Each plugin has its own communication pipe with the core process. Pipes
are further encapsulated in channels: a channel is made of an aggregation method, a pipe, a
buffer and, optionally, a filter. A loop cycles through all channels sequentially, sending
data to the plugins.
A pipe is effectively a peer-to-peer FIFO queue, where the core process pushes data and the
plugin pulls it. The default size of this queue depends on the operating system; the
'plugin_pipe_size' configuration key allows this size to be tuned manually. An eye has to be
kept on the maximum sizes imposed by the system; on Linux, for example, the maximum values
are in /proc/sys/net/core/[w|r]mem_max. Adjusting the depth of these queues is vital when
facing large volumes of traffic, because the amount of data pushed into the pipe is
directly proportional to the number of packets seen by the machine.
Data (struct pkt_data) can be buffered before entering the pipe. This aims to reduce
pressure on the kernel: the pipe, realized with a socketpair(), is an IPC structure handled
by the operating system, and the pressure on the kernel grows with the number of writes and
reads (which are both system calls). The size of the buffers is adjustable with the
'plugin_buffer_size' configuration key. This size intuitively has to be smaller than the
pipe size; choosing too small a ratio between the two sizes could lead to the pipe filling
up.
A rough but indicative estimate is the following:

average_traffic  = packets per second on your network segment
sizeof(pkt_data) = ~20 bytes

pipe size > average_traffic * sizeof(pkt_data)

For example, at 5,000 packets per second the pipe should hold more than about 100 KB for
each second elapsing between reads.

                     pipe
[ pmacctd/core ] =================================> [ pmacctd/plugin ]
                 |                                |
                 |   enqueued buffers     free    |
                 |==|==|==|==|==|==|==|===========|
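The transport described above can be sketched as follows: a socketpair() acts as the
per-plugin pipe, and several pkt_data records are packed into one buffer so that a single
write() system call carries many of them. The struct layout and field names here are
illustrative stand-ins, not pmacct's actual definitions:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct pkt_data {            /* stand-in for pmacct's struct pkt_data */
    unsigned int src_host, dst_host;
    unsigned short dst_port;
    unsigned char proto;
    unsigned long packets, bytes;
};

/* Send 'n' records through one end of the pair in a single write()
   (the buffered batch), then read them back from the other end.
   Returns 0 on success, -1 on any failure. */
int channel_roundtrip(struct pkt_data *in, struct pkt_data *out, int n)
{
    int sp[2];
    ssize_t len = n * sizeof(struct pkt_data);

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) < 0) return -1;
    if (write(sp[1], in, len) != len) return -1;   /* one syscall, n records */
    if (read(sp[0], out, len) != len) return -1;
    close(sp[0]); close(sp[1]);
    return 0;
}
```

In pmacctd the two ends live in different processes after fork(); the single-process
round trip above only illustrates the batching idea.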


V. Memory table plugin
The memory table is used to store aggregated data assembled by the core process in a memory
structure organized as a hash table. The table is divided into a number of buckets where
data is stored (struct acc). Data is direct-mapped to a bucket by means of a trivial modulo
function. Collisions in each bucket are solved by building collision chains.
An auxiliary structure, an LRU cache (Least Recently Used), is provided to speed up searches
and updates of the main table. It saves the last updated or searched element in the table.
When a new search or update is required, the LRU cache is checked first; if there is no
match, the collision chain gets traversed.
It's advisable to use a prime number of buckets ('imt_buckets' configuration key), because
it boosts the dispersion achieved by the modulo function. Chains are organized as linked
lists of elements, and so they should be kept short because of the linear search over them;
having a flat table (that is, a high number of buckets) helps in keeping chains short.
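The lookup path above can be sketched like this: a modulo function maps a key to one of a
prime number of buckets, collisions are chained, and a one-entry cache of the last hit is
checked first. The names and the single-field key are illustrative, not pmacct's:

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_BUCKETS 211            /* prime, as advised for dispersion */

struct acc {                       /* stand-in for pmacct's struct acc */
    unsigned int key;
    unsigned long bytes;
    struct acc *next;              /* collision chain */
};

static struct acc *table[NUM_BUCKETS];
static struct acc *lru_cache;      /* last searched/updated element */

struct acc *table_find(unsigned int key)
{
    struct acc *elem;

    if (lru_cache && lru_cache->key == key) return lru_cache;  /* cache hit */
    for (elem = table[key % NUM_BUCKETS]; elem; elem = elem->next)
        if (elem->key == key) { lru_cache = elem; return elem; }
    return NULL;
}

void table_insert(unsigned int key, unsigned long bytes)
{
    struct acc *elem = table_find(key);

    if (elem) { elem->bytes += bytes; return; }       /* update in place */
    elem = malloc(sizeof(*elem));
    elem->key = key; elem->bytes = bytes;
    elem->next = table[key % NUM_BUCKETS];            /* chain head insert */
    table[key % NUM_BUCKETS] = elem;
    lru_cache = elem;
}
```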
Memory is allocated in large chunks, called memory pools, to avoid as much as possible the
bad effects and thrashing that come from dispersion across memory pages. The drawbacks of
dense use of malloc() are described in every Operating Systems textbook. These memory
allocations are tracked via a linked list of chunk descriptors (struct memory_pool_desc)
for later jobs, such as freeing unused memory chunks, operations over the entire table, etc.
If the user chooses to have a predefined table, the descriptors are all allocated at the
beginning of execution; otherwise, if the user chooses to let the memory table grow
indefinitely (e.g. via the command-line option '-m 0' or the 'imt_mem_pools_number'
configuration key), new nodes are allocated and added to the list during execution. In
doing these tasks, pmacctd does not rely on the realloc() function, but only on malloc().
The table grows and shrinks with the help of the tracking structures described above. This
is because my assumptions about realloc() are the following:
(a) it tries to reallocate in the original memory block, and (b) if (a) fails, it allocates
another memory block and copies the contents of the original block to the new location. In
this scheme (a) can be done in constant time; in (b) only the allocation of the new memory
block and the deallocation of the original one are done in constant time, while the copy of
the previous memory area, for large in-memory tables, could perform horrendously.
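The pool scheme above can be sketched as follows: memory is grabbed in large malloc()ed
chunks, each tracked by a descriptor on a linked list, and elements are carved out of the
newest chunk with no realloc() involved. Descriptor fields and the pool size are
illustrative, not pmacct's:

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_SIZE 8192

struct memory_pool_desc {          /* stand-in for pmacct's descriptor */
    char *base;                    /* start of the chunk */
    size_t used;                   /* bytes handed out so far */
    struct memory_pool_desc *next;
};

static struct memory_pool_desc *pools;

/* Carve 'len' bytes out of the newest pool, malloc()ing a new chunk
   (and a new descriptor node) when the current one is exhausted. */
void *pool_alloc(size_t len)
{
    if (!pools || pools->used + len > POOL_SIZE) {
        struct memory_pool_desc *d = malloc(sizeof(*d));
        d->base = malloc(POOL_SIZE);
        d->used = 0;
        d->next = pools;           /* new node added to the list */
        pools = d;
    }
    pools->used += len;
    return pools->base + pools->used - len;
}
```

Freeing an unused chunk then means unlinking one descriptor and releasing its base pointer,
rather than walking individual allocations.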
 

VI. SQL issues and *SQL plugins
Currently (as of 0.6.2) two SQL plugins are available, for data insertion into a MySQL or a
PostgreSQL DB. Storing data in a persistent backend leaves room for advanced operations, and
so these plugins are intended to offer a wider range of features (e.g. fallback mechanisms
and backup storage methods if the DB fails, etc.) not available in the memory plugin.
Let's first give a whole picture of how the SQL plugins work. As data fed by the core
process via the communication pipe gets unpacked, it is not directly inserted into the DB
but into a 3-way associative cache; data-to-bucket mapping is computed via a modulo function
over a crc32 sum of the datum. If the bucket already contains valid data, its neighbour
buckets are checked; if both also contain valid data, the last bucket explored has its data
replaced: the old datum is placed in a collision queue, the new one in the cache. Data from
the cache is pushed to the DB at regular intervals, and to speed up this operation a queries
queue is continuously updated as buckets become busy. When the current interval expires, a
new process is fork()ed to walk both queues, from which the queries to be sent to the DB are
built. In the meantime, data in the cache is not erased but simply marked as invalid. The
number of cache buckets is tunable via the 'sql_cache_entries' configuration key; here,
too, a prime number is strongly advisable to ensure better data dispersion through the
cache.
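The 3-way placement described above can be sketched like this. For brevity a trivial
"key modulo entries" mapping stands in for the crc32 sum, and the collision queue is reduced
to a counter; everything here is illustrative, not pmacct's code:

```c
#include <assert.h>
#include <string.h>

#define CACHE_ENTRIES 211          /* prime, as advised above */

struct cache_entry {
    unsigned int key;
    unsigned long bytes;
    int valid;
};

static struct cache_entry sql_cache[CACHE_ENTRIES];
static unsigned int collisions;    /* would be a queue in pmacct */

/* Place 'key' in its bucket or one of the two neighbours; on a hit,
   aggregate in place; if all three slots hold valid foreign data,
   evict the last one probed.  Returns the index used. */
int cache_insert(unsigned int key, unsigned long bytes)
{
    int base = key % CACHE_ENTRIES, i, idx = 0;

    for (i = 0; i < 3; i++) {
        idx = (base + i) % CACHE_ENTRIES;
        if (sql_cache[idx].valid && sql_cache[idx].key == key) {
            sql_cache[idx].bytes += bytes;        /* aggregate in cache */
            return idx;
        }
        if (!sql_cache[idx].valid) break;         /* free slot found */
    }
    if (sql_cache[idx].valid) collisions++;       /* evict old datum */
    sql_cache[idx].key = key;
    sql_cache[idx].bytes = bytes;
    sql_cache[idx].valid = 1;
    return idx;
}
```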
Three notes about the process described above: (a) a short while ago the concept of lazy
data refresh deadlines was introduced. Timeframes are checked not with the help of signals
but when new data arrives; if traffic is low, data is not kept stale in the cache forever,
since either new data or a poll() timeout makes the wheel rotate. (b) The SQL plugin's main
loop is kept sufficiently fast by being freed of any interaction with the DB: it only gets
data, computes the modulo and handles both the cache and the queues. (c) Keeping data in a
cache is the way to find your data again; this task becomes much slower when a datum is
placed in an unsorted FIFO queue. Replacing data is based on the assumption of exploiting a
kind of temporal locality in internet data flows.
A picture follows:
				    |====> [ cache ] ===|
pipe				    |			|====> [ collision queue ] ===|   DB
======> [ pmacctd/SQL plugin ] =====|====> [ cache ] ===|			      |======>
			|	    |			|====> [ queries queue ] =====|
			|	    |====> [ cache ] ===|
			|
			|=======> [ fallback mechanisms ]

Now, let's look at how data is structured on the DB side. Data is simply organized in flat
tuples, without external references. Not being fully convinced by better-normalized
solutions aimed at satisfying an abstract concept of flexibility, we (and here comes into
play the load of mail exchanged with Wim Kerkhoff) found that simple means fast. And letting
the wheel rotate fast is a key achievement, given that pmacctd needs not only to insert new
data but also to update existing records, putting the DB under heavy pressure in busy
network environments.
Now a pair of concluding *practical* notes: (a) the default SQL table and its primary key
are suitable for most normal usages; however, unused fields will be filled with zeroes. I
made this choice a long time ago to allow people to compile the sources and quickly get
involved with increasing counters without caring too much about SQL details (the assumption
being that whoever is involved in network management doesn't necessarily have to be involved
in SQL matters). So everyone with a busy network segment under his feet has to tune the
wheel himself to avoid performance constraints: the configuration key 'sql_optimize_clauses'
evaluates which primitives have been selected and avoids long 'WHERE' clauses in 'INSERT'
and 'UPDATE' queries. This requires the creation of an auxiliary index, or an update of the
primary key, to work smoothly. A custom table might be created, trading flexibility for
wasted disk space. (b) When using timestamps to break data up into different timeframes, the
validity of data is tied not only to the data itself but also to time; as stated before,
data gets pushed to the DB at regular intervals, tunable via the 'sql_refresh_time' key.
Connecting these two values (refresh time and timeframe length) by a multiplicative factor
helps avoid transient cache-aliasing phenomena and fully exploit the cache's benefits. If
all data goes stale in the middle of a data refresh interval, the result is frequent
first-hit failures of the cache's modulo function and a quick growth in collision queue
size.


VII. Recovery modes
The concept of recovery modes is available only in the SQL plugins and is aimed at avoiding
data loss by taking a corrective action if the DB suffers an outage or simply becomes
unresponsive. Two modes are supported: writing data to a structured logfile, for later
processing by a player program, or writing data to a backup DB. While the second way is
straightforward, a few words about the logfile: things have been kept simple, so much of the
care and responsibility for keeping data meaningful lies on your shoulders. A logfile is
made of a header (struct logfile_header) containing the DB configuration parameters,
followed by the data dumped by the plugin. When appending data to a logfile that already
exists, its header is not checked against the actual parameters; only the magic number in
the header is verified to assume it is safe to append data. If multiple SQL plugins are
running, each one should have its own logfile.
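The append check described above amounts to something like the following sketch; the field
names and the magic value are hypothetical, not pmacct's actual struct logfile_header:

```c
#include <assert.h>
#include <string.h>

#define LOGFILE_MAGIC 0x5041434bU  /* hypothetical magic value */

struct logfile_header {            /* stand-in for pmacct's header */
    unsigned int magic;
    char sql_db[64];               /* DB configuration parameters ... */
    char sql_table[64];
};

/* Decide whether appending is safe: as described above, only the
   magic number is checked; the stored DB parameters are trusted. */
int logfile_append_ok(const struct logfile_header *hdr)
{
    return hdr->magic == LOGFILE_MAGIC;
}
```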
The health of the SQL server is checked every time data is purged to it. If the DB becomes
unresponsive, a recovery flag is raised. This flag remains valid, without further checks,
for the entire purging event. A set of player tools is available, pmmyplay and pmpgplay;
they currently don't contain any advanced auto-processing features. Both extract the needed
information (where to connect, which username to use, etc.) from the logfile's header. While
playing the entire logfile, or even just part of it, a further method of detecting DB
failures exists. A final statistics screen summarizes what has been written successfully,
to help reprocess the logfile at a later stage.

