Next:
Up: Overview
Previous: Notes on Strcmp
String Internals
The internal implementation of text objects in Rex86 is described here.
All pieces of text appearing in Rex86 source code are immediately added to a flat hash table of strings upon compilation. User inputted text is also inserted during run-time. These strings are maintained for the entire program run, though in later revisions, user inputted strings might instead be garbage collected to increase performance of large systems. Strings in this hash table consist of a character array and also a positive integer length, as well as a weak reference to the primary descriptor. Each character is a 7-bit ASCII value, zero extended to 16-bits.
Built on top of the string table are two types of data structures. There are front-lists and back-lists. Each are linked lists of keys into the string table. Front-lists represent strings that have been prepended together, and back-lists are strings that have been appended together. Both prepending to a front-list and appending to a back-list are simple conses.
The basic text data type is called a descriptor. A global structure called the bank stores descriptors in a binary search tree. New descriptors are added to this tree by having the tree randomly generate a non-zero 16-bit key, and using this key to maintain the binary ordering property. This random key is then properly inserted into the bank, attached to the descriptor. The key is then called a 'text ID', and corresponds to the value represented by the descriptor.
A descriptor is a struct consisting of the following data. First, a pointer to a front-list. Second, a pointer to a back list. Third, a non-negative integer offset into the string at the front of its front-list, always less than the length of that string. This makes a text piece a logical concatenation of a front-list and a back-list, with an optional number of characters at the front being ignored. An empty text object is always represented by the 'text ID' zero.
Descriptors should be (if not in this version then in a later revision) garbage collected by a separately running thread. This thread may collect all integer values from all registers, and from the stack, and assume they are 'text ID's. Then then bank is traversed, and if any descriptor is never referenced, it can be deallocated.
Since a MOV instruction can copy 'text ID's into multiple locations, every mutator operation on text must at some point allocate a new descriptor, in order to avoid aliases changing data inadvertently.
The first such operation is FETCH. This operation works by removing the front character from a piece of text and storing it's 'character id' into the destination register. This is implemented by allocating and returning a new descriptor with the same front-list, same back-list, and an offset one higher than before. In the case that the previous offset is exactly one less than the string's length, then instead set the offset to zero and the front-list pointer to the cdr of the old front-list. In the case that the cdr of the old front-list is zero, set the back-list to zero and build a new front-list by reversing the old back-list. If the old back-list was also zero, do not allocate a new descriptor at all, simply use the 'text ID' zero. This algorithm is written in pseudo-code at the end of this section.
Since this process will most likely produce a lot of garbage, there's a certain optimization that may be made to the system for the sake of efficiency. FETCH may set an internal system flag, call it UNIQUE, which is cleared by MOV, PUSH, or any other instruction that can copy data. If FETCH is ever called with UNIQUE set, it is safe to assume that the last descriptor made was not aliased, and if that descriptor is also an operand to this FETCH operation, it may be reused without reallocation.
The other mutator operator is APPEND, and luckily this is much easier to implement than FETCH. All we need to do is allocate a new descriptor, with the same offset and front-list as the old descriptor had. The new back-list is simply the old-backlist cons'ed with whatever text we're APPEND'ing. We may also include the above optimization if we wish to reduce the amount of garbage generated.
The advantage of using this front/back data structure should be clear by this point. Both FETCH and APPEND operations are constant time and non-destructive, excepting the case when a FETCH is performed on a piece of text which has a back-list but no front-list. However, I believe this cost of this case can be amortized to constant time of the entire operation.
The final detail to touch upon is the basic unit of strings; the CHARACTER type. This is simply an 7-bit integer, zero extended to 16-bits, representing a character in the ASCII set. This also makes these strings into proper UTF-16, though Unicode is not supported in Rex86. The reason we use 16-bit characters is so that the INTS data type can share the exact same code (and data structures) as TEXT. In fact, the two types are indistinguishable in all respects. This allows us to write procedure such as "length" without worrying about whether we're iterating over a list of integers or a list of characters.FETCH Algorithm
def FETCH
if (offset self) == (length (car (front-list self))) - 1 then
new-front-list <- (cdr (front-list self))
new-back-list <- (back-list self)
if (null? new-front-list) then
if (null? (back-list self)) then
return null
else
new-front-list <- (reverse (back-list self))
new-back-list <- null
end if
end if
with new descriptor
front-list <- new-front-list
back-list <- new-back-list
offset <- 0
end with
else
with new descriptor
front-list <- (front-list self)
back-list <- (back-list self)
offset <- offset + 1
end with
end if
key <- (insert bank descriptor)
return key
end def
Next:
Up: Overview
Previous: Notes on Strcmp
by dlong@progmatism.com. Plz don't copy kthx.