    The following garbage collection scheme is loosely based on a generational mark-sweep garbage collector
with compaction. The motivation for compacting garbage collection is that compacting an individual generation
does not require reserving a semi-space equal in size to the entire heap in case a full collection occurs,
which greatly reduces memory usage at little if any cost in performance. Because the collector is optionally
non-relocating (it can sweep without compacting), it can also be made partially incremental except when an
actual compaction needs to occur.

    The mark-sweep heap is divided into generations. The size of these generations provides an upper
bound on the number of forwarding blocks needed to compact a generation.

A card-marking system is used to keep track of any intergenerational roots in the system. Whenever a store is
done to an object that might be in a non-young generation, the card corresponding to the object is marked.
During the marking phase, the GC scans the mark vector looking for marked cards. Any objects in the marked cards
are scanned for references to young objects. If any are found, the marking phase pushes those objects onto the mark
stack for tracing. Similarly, during the compaction phase, the card marks may be used to later relocate any pointers
to young objects. The marks are stored in a separate linear structure for the entire heap to make marking the cards cheap.
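As a concrete illustration of the write barrier this implies, here is a minimal C sketch of marking the card for a stored-into object. The names (`heap_base`, `card_table`, `barrier_store`) and the 128-byte card size are assumptions for illustration, not the actual Slate runtime layout.

```c
#include <stdint.h>
#include <stddef.h>

#define CARD_SHIFT 7                      /* 128-byte cards */
#define HEAP_SIZE  (1 << 20)
#define NUM_CARDS  (HEAP_SIZE >> CARD_SHIFT)

static uint8_t  card_table[NUM_CARDS];    /* one mark byte per card */
static uint8_t  heap[HEAP_SIZE];
static uint8_t *heap_base = heap;

/* Write barrier: perform the store, then dirty the card that the
   stored-into *object* (not the stored value) lives on. */
static void barrier_store(void **slot, void *value, void *object)
{
    *slot = value;
    size_t card = (size_t)((uint8_t *)object - heap_base) >> CARD_SHIFT;
    card_table[card] = 1;
}
```

The marking phase would then walk `card_table` linearly, which is why keeping the marks in one separate structure makes both the barrier and the scan cheap.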

*** Options for card-marking include:
1) coarse, where each card represents a fixed region (of anywhere from 128-512 bytes) of heap space
    In this scheme the marks are stored in a bit/byte vector. However, care is necessary to deal with any objects storing
raw data within the cards, as raw words can masquerade as pointers or even object headers, making the headers
of all objects within the card difficult to find. So either 1-2 bytes of data would also need to be stored for every
card giving the offset of the first valid tagged pointer within the card, or all objects holding raw data
(byte arrays, float arrays, maps, etc.) could simply be stored in separate generations that don't need to be marked or
scanned for intergenerational references, as they can't hold any pointers to other objects anyway.

    A bit vector can be used where native bit diddling instructions are provided (such as on x86). Otherwise,
byte vectors are cheaper to access on most other architectures, but consume slightly more space.

2) fine-grained, using 1 bit for every word in the heap
    On writing to a slot, the bit corresponding to the exact word that was modified is marked. This uses more space, but it
avoids the need to scan entire objects, and also avoids the need to treat raw objects specially, since the exact pointers
that need to be examined are marked. This also allows for marking objects independently of the object header in one
single data structure, rather than necessitating a second one. In this scheme, 1 card of 32 words
(128 bytes on a 32-bit machine) corresponds to 1 word of mark bits, or about 3% of total heap space, which may be worth
the possible benefits.

    However, one drawback is decreased throughput in scanning for dirty cards because of the less compact
representation. One way to deal with this may be to align all groups of data slots in objects to 2- or 4-word boundaries,
thus increasing scanning throughput by 2-4x, but this may not be necessary at all (throughput worries may be a red herring),
or could end up being as complex as the smarter coarse-grained card scheme.
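The fine-grained barrier described above can be sketched as follows, with one mark bit per heap word. The names (`mark_bits`, `barrier_store_word`) and sizes are illustrative assumptions, not an actual implementation.

```c
#include <stdint.h>
#include <stddef.h>

#define HEAP_WORDS (1 << 18)

static uint32_t  mark_bits[HEAP_WORDS / 32];   /* 1 bit per heap word */
static uintptr_t heap_words[HEAP_WORDS];       /* the heap, as words */

/* Set the mark bit for heap word `i`: bit (i % 32) of mark_bits[i / 32]. */
static void mark_word(size_t i)
{
    mark_bits[i >> 5] |= (uint32_t)1 << (i & 31);
}

static int word_is_marked(size_t i)
{
    return (mark_bits[i >> 5] >> (i & 31)) & 1;
}

/* Write barrier for a store into heap word `i`: record the exact slot,
   so the scanner later examines only this word, not a whole card. */
static void barrier_store_word(size_t i, uintptr_t value)
{
    heap_words[i] = value;
    mark_word(i);
}
```

On x86 the `mark_word` shift-and-or compiles to a bit-test-and-set; on architectures without such instructions this is where the up to 2x barrier cost would appear.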

3) remembered sets, as in Squeak's GC
    Just mark in the object's header whether it is a root. Modifying an object checks whether the root bit is set in the object.
If it is not, and the object is old, the object is added to the remembered set and its root bit is set. Cheap trick:
all young objects can just have their root bit set, but not actually be added to the remembered set, so that this check
can be inlined in very few instructions at the site of the store. One drawback is that large remembered sets that overflow
the set size will trigger full GCs. This is possibly more precise than cards, but less precise than word-marking.
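A minimal sketch of this Squeak-style barrier, including the cheap trick of pre-setting the root bit on young objects. The header layout (a `flags` word), set size, and all names are assumptions for illustration.

```c
#include <stddef.h>

#define ROOT_BIT   0x1u
#define YOUNG_BIT  0x2u
#define REMSET_MAX 1024

typedef struct Object {
    unsigned flags;
    struct Object *slot;            /* a single pointer field, for brevity */
} Object;

static Object *remembered_set[REMSET_MAX];
static size_t  remembered_count;
static int     remset_overflowed;   /* an overflow would trigger a full GC */

/* The inlinable fast path: only old objects without the root bit take
   the slow path of being appended to the remembered set. */
static void barrier_store(Object *obj, Object *value)
{
    obj->slot = value;
    if (!(obj->flags & ROOT_BIT)) {
        obj->flags |= ROOT_BIT;
        if (remembered_count < REMSET_MAX)
            remembered_set[remembered_count++] = obj;
        else
            remset_overflowed = 1;
    }
}

/* Young objects are born with ROOT_BIT already set, so the check above
   falls through in a couple of instructions without touching the set. */
static void init_young(Object *obj) { obj->flags = ROOT_BIT | YOUNG_BIT; obj->slot = NULL; }
static void init_old(Object *obj)   { obj->flags = 0; obj->slot = NULL; }
```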

*** Options for the tracing phase include:
1) put the mark bit in the object header
    This aggravates virtual memory, because marking dirties all pages in an entire generation; that is not a big deal if
compaction is always done after every mark-sweep anyway.

2) put the mark bits in a separate bit vector, storing 1 bit for every word in the heap
    Doesn't have the same problems with virtual memory dirtying since all the mark bits are stored in one place. Allows 
for the fine-grained word-marking scheme to be done with the same exact bit vector used for tracing the heap. Really only
beneficial if used in conjunction with fine-grained word marking.

3) Use a marking stack. This is orthogonal to the above schemes. If the stack overflows, just note that it has done
so and continue marking, dropping any objects that won't fit on the marking stack. When the stack is emptied again,
sweep the entire generation looking for any objects that have been marked but that have unmarked children - push those
onto the stack and let the GC rip again! This is WAAAY simpler than pointer reversal, and WAAAY faster. It only requires
a fixed-size marking stack, which can be reused for forwarding tables in subsequent phases as well.
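The overflow-and-rescan trick can be sketched like this, with a deliberately tiny stack so the overflow path is actually exercised. Object layout (two child pointers) and all names are illustrative assumptions.

```c
#include <stddef.h>

#define MARK_STACK_MAX 3                  /* deliberately tiny for the sketch */

typedef struct Obj { int marked; struct Obj *a, *b; } Obj;

static Obj   *mark_stack[MARK_STACK_MAX];
static size_t mark_sp;
static int    stack_overflowed;

/* Mark an object and push it for tracing; on overflow it stays marked
   but is dropped - the rescan below will find its unmarked children. */
static void push_gray(Obj *o)
{
    if (o == NULL || o->marked) return;
    o->marked = 1;
    if (mark_sp < MARK_STACK_MAX)
        mark_stack[mark_sp++] = o;
    else
        stack_overflowed = 1;
}

static void drain_mark_stack(void)
{
    while (mark_sp > 0) {
        Obj *o = mark_stack[--mark_sp];
        push_gray(o->a);                  /* trace this object's references */
        push_gray(o->b);
    }
}

/* After an overflow, sweep the generation for marked objects with
   unmarked children, push those, and drain again until quiescent. */
static void mark_heap(Obj **gen, size_t n, Obj *root)
{
    push_gray(root);
    drain_mark_stack();
    while (stack_overflowed) {
        stack_overflowed = 0;
        for (size_t i = 0; i < n; i++) {
            if (!gen[i]->marked) continue;
            if (gen[i]->a && !gen[i]->a->marked) push_gray(gen[i]->a);
            if (gen[i]->b && !gen[i]->b->marked) push_gray(gen[i]->b);
        }
        drain_mark_stack();
    }
}
```

Because the marked set only grows, the rescan loop always terminates, which is the property that makes this so much simpler than pointer reversal.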

*** Options for the compaction phase include:
1) Do as the Squeakers do and use any unused space for the forwarding table during compaction. The drawback is
very poor compaction efficiency when the heap is full (very little room for forwarding entries), unless a configurable
section of the heap is reserved for forwarding purposes at all times, which adds tuning complexity but on the other hand
allows the amount of memory wasted to be tuned. Another drawback is that it requires assumptions about where free
regions of memory are in the generation, requiring more frequent compaction. This is more or less the same as the next
option, unless generations are ditched.

2) Compact one generation at a time, reserving extra space separate from the generation. For best performance this
space need only be as big as the largest generation, and it can be smaller if compaction is done in units smaller than
a generation. This requires no tuning if generations are all fixed size, as an empty generation can be reserved at all
times for relocation purposes, with the generation being compacted serving as the actual forwarding table - like a
copying collector, but, thanks to the added marking phase, without the drawback of reserving a semi-space large enough
to copy the whole heap.
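The first pass of such a compaction - computing each survivor's destination and recording it in an out-of-line forwarding table - might look like this. A second pass (not shown) would copy objects and rewrite pointers by table lookup. The object representation and names are illustrative assumptions.

```c
#include <stddef.h>

#define DEAD ((size_t)-1)

typedef struct {
    int    live;   /* survived the marking phase? */
    size_t size;   /* object size in bytes */
} ObjInfo;

/* Pass 1: assign live objects consecutive offsets in the reserved
   space, recording each destination in the forwarding table `fwd`. */
static void compute_forwarding(const ObjInfo *objs, size_t n, size_t *fwd)
{
    size_t next = 0;                   /* next free offset in the target space */
    for (size_t i = 0; i < n; i++) {
        if (objs[i].live) {
            fwd[i] = next;             /* object i slides down to this offset */
            next  += objs[i].size;
        } else {
            fwd[i] = DEAD;             /* dead object: no forwarding entry */
        }
    }
}
```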

3) Use a break table. This algorithm operates completely in place and requires no extra space. The only drawback is that
relocation can be very slow if there are a lot of disparate unused areas of the heap. Using generations on top of this
could largely cover up noticeable speed problems, but would probably end up no better than the second option. This is
most useful in generations where the death rate is either really high - increasing the chance that large, unfragmented
swaths of unused space are available - or really low - increasing the chance that there are very few distinct areas of
unused space to relocate around. It probably is not useful at all except in really, really memory-challenged embedded
environments where every drop of space counts.
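For reference, relocating one pointer through a break table amounts to a binary search for the live block containing it, then applying that block's downward shift - this lookup per pointer is exactly why many disparate free areas make relocation slow. The entry layout and names here are illustrative, not taken verbatim from Clarke's paper.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t block_start;   /* old address of a contiguous live block */
    uintptr_t shift;         /* how far that block slides down */
} BreakEntry;

/* Relocate one pointer: find the last table entry whose block starts
   at or below `p`, then subtract that block's shift. */
static uintptr_t relocate(uintptr_t p, const BreakEntry *table, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                          /* binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].block_start <= p) lo = mid + 1;
        else                             hi = mid;
    }
    if (lo == 0) return p;                     /* below the first live block */
    return p - table[lo - 1].shift;
}
```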

4) Squashing. This scheme is orthogonal and works with both 2 and 3. The idea is simply that, when compacting, all
references from higher-addressed objects to lower-addressed objects are relocated at the same time. Any references
from lower-addressed objects to higher-addressed objects are marked in the card table, as if they were roots.
This way, the second compaction pass only needs to visit part of the young generation to relocate it, rather than all
of it, and this can be folded into the relocation of intergenerational references. In the best case, this can compact
a generation in a single sweeping pass. See Clarke's paper for reference.

*** Suggested approaches.

1) Definitely try out squashing with any of the generational schemes. Seems like a big win for generational collectors.

2) 1 byte mark per 128 byte card, 1 byte valid tagged pointer offset, mark bit stored in object, reserved space for forwarding tables
    Maintain a table of marks, 1 byte for each card, as above. Then use another table, with 1 byte per card, that gives
the offset to the first valid tagged pointer in the card. This second table would mostly be zeroed and never touched,
except when allocating an object containing raw data across a card boundary. This is probably the simplest and most
general way to handle things on diverse hardware.
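A sketch of how the two tables would cooperate when scanning one dirty card: the offset byte lets the scanner skip raw data spilling over the card boundary and start at the first valid tagged slot. All names, sizes, and the fixed 8-byte word are illustrative assumptions; a real scanner would inspect each slot for young-generation references where this sketch merely counts them.

```c
#include <stdint.h>
#include <stddef.h>

#define CARD_SIZE 128
#define WORD_SIZE 8
#define NUM_CARDS 64

static uint8_t card_marks[NUM_CARDS];     /* 1 dirty byte per card */
static uint8_t card_offsets[NUM_CARDS];   /* first valid tagged slot; mostly 0 */

/* Returns how many word slots in card `c` must be examined for
   intergenerational references, and cleans the card. */
static size_t scan_card(size_t c)
{
    if (!card_marks[c])
        return 0;                         /* clean card: skip entirely */
    card_marks[c] = 0;
    return (CARD_SIZE - card_offsets[c]) / WORD_SIZE;
}
```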

3) word-marking scheme, mark bit stored in word-marking vector, reserved space for forwarding tables
    Maintain a bit vector of marks, 1 bit for each word. Mark objects in the word-marking vector. This could work
exceptionally well on hardware with bit diddling instructions like x86. It is extremely precise: scanning the dirty bits
has about 1/4 the throughput, but only 1 word of a card needs checking for an intergenerational reference instead of all
32. With large distributions of intergenerational references, this will require much less work scanning individual cards
for references. It also requires much less dirtying of virtual memory pages (which is very good in the hypothetical
SlateOS sense). The write barrier is just as fast on x86 as in the first scheme, but possibly up to 2x as expensive on
hardware without bit diddling instructions. On the right hardware (x86), this scheme could be awesome for its simplicity
and performance, but on other platforms the write barrier speed may become prohibitive. However, the speed gained in card
scanning may offset the write barrier slowdown on those platforms.

4) remembered sets, mark bit in objects, reserved space for forwarding tables, i.e. leave in the Squeak GC
    Remembered sets are fastest for scanning dirty objects, but may require tuning; otherwise the scheme devolves into
full GC once the heap gets large, and never recovers unless the whole heap is rebuilt or a huge amount of memory is freed.
This is not going to be good for longer-lived systems, and will probably exhibit worse performance on large or old heaps
because of the need for many full GCs.

5) Have a copying nursery! Rather orthogonal. 
    This seems pretty obvious, regardless of what GC scheme is used. Since the nurseries are fixed size, there is
negligible memory waste. This allows young-object allocation to remain fast and cheap, i.e. a pointer bump. This only
needs to be done for the first generation, and it has benefits for concurrency, since allocation need not be as tightly
synchronized.

    Another way to implement this is to allocate a fixed-size chunk of memory from a mark-sweep generation, or to allocate
an entire empty mark-sweep generation and use it for allocation purposes. When this fills up, simply dump it back into the
mark-sweep heap and let the GC do its job as normal.
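The pointer-bump allocation this enables can be sketched in a few lines. The nursery size, alignment, and names are illustrative assumptions; a NULL return is the signal to dump the nursery into the mark-sweep heap.

```c
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (64 * 1024)

typedef struct {
    uint8_t *next;   /* bump pointer */
    uint8_t *limit;  /* end of the nursery */
    uint8_t  space[NURSERY_SIZE];
} Nursery;

static void nursery_init(Nursery *n)
{
    n->next  = n->space;
    n->limit = n->space + NURSERY_SIZE;
}

/* Allocate `size` bytes, rounded up to 8-byte alignment; NULL means the
   nursery is full and should be dumped back into the mark-sweep heap. */
static void *nursery_alloc(Nursery *n, size_t size)
{
    size = (size + 7) & ~(size_t)7;
    if (n->next + size > n->limit)
        return NULL;
    void *p = n->next;
    n->next += size;
    return p;
}
```

Giving each thread its own nursery is what makes this concurrency-friendly: the bump needs no synchronization at all on the fast path.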

Refs:
"Compacting Garbage Collection Can Be Fast and Simple." Charles Clarke and David Mason. 
    non-journal version at http://www.sarg.ryerson.ca/sarg/papers/199602-spe/
