Array Indexer (beta)
A lape "object" to create an index on an array and provide basic searching.
What is an index?
If I sent you down to my basement to find some duct tape you will take a long time to come back. Not me... I know exactly where it is. It would take me no time at all.
In my mind I have a good idea where the items in my basement are. And that, my friends, is an index. A little bit of information that points you in the right direction. That way you don't have to search the whole basement to know there's no duct tape down there.
When is it worthwhile?
You probably wouldn't want to inventory the basement unless you know we were playing this game all day. Our computer index is much the same.
Consider these things:
- Searches are much faster using the index
- But it takes a fair amount of time to create
- And we have to recreate it when the data changes
An index is worthwhile if the data is stable long enough to have to search it enough times to make up for the indexing time. That's a mouthful.
How fast is fast?
Here are some numbers. The test program filled an array of records with random data then performed searches for random data against it. Traditional array searching was done, in one case just returning the first match, in the other returning all matches (i.e. searching the whole array). Index searches were performed, also with both cases of finding one or finding all.
First off, I'll admit it. Those red boxes are because I didn't get proper numbers - I think the indexer hit a bug or limit. No one's using ten million row arrays anyway
Of note here is how expensive the indexing is. The "Indexing cost" column is how many full array searches you would have to do before you made up for the indexing time. 276-731 times depending on the array size.
But look at the gains, too! Even on a small array it's 3.75 times faster to find things. On the big array it gets as high as 1830 times faster.
Also note how fast lape chunks though those arrays! I expected far worse.
Pointers
Yeah, sorry, we can't pull off this voodoo without some pointers.
If you don't know what a pointer is, every variable you create is sitting in memory somewhere and the memory is numbered to keep track of it. These numbers are called addresses. If I know the address of something I can look there and check it out.
There is a 'pointer' type. Pointers hold addresses like this. It doesn't hold the data at that address, just the address itself.
Let's look at this in lape:
Simba Code:var
i: integer;
p: pointer;
begin
i := 22;
p := @i;
writeln('i is a variable = ', i);
writeln('@i is the address of i = ', @i);
writeln('p is a pointer to something = ', p);
writeln('p^ is the thing it points to = ', p^)
end;Progress Report:i is a variable = 22 @i is the address of i = "i"::0x5F58320 (22) p is a pointer to something = "i"::0x5F58320 Error: Access violation at line 10
Dang it, I always do that! At least simba didn't crash. It failed because 'p' is just a plain old pointer. It has no idea what it's pointing to. We can use what is called 'casting' to be more explicit:
Simba Code:var
i: integer;
p: pointer;
begin
i := 22;
p := @i;
writeln('i is a variable = ', i);
writeln('@i is the address of i = ', @i);
writeln('p is a pointer to something = ', p);
writeln('p^ is the thing it points to = ', integer(p^))
end;Progress Report:i is a variable = 22 @i is the address of i = "i"::0x5F58320 (22) p is a pointer to something = "i"::0x5F58320 p^ is the thing it points to = 22
That's better. One last thing and we'll get to the real code. Pointers don't have to be generic like above. We can make a pointer to a specific datatype and avoid the casting in the last example:
Simba Code:var
i: integer;
p: ^integer;
begin
i := 22;
p := @i;
writeln('i is a variable = ', i);
writeln('@i is the address of i = ', @i);
writeln('p is a pointer to to an integer = ', p);
writeln('p^ is the thing it points to = ', p^)
end;
Usage
Function list:
Simba Code:procedure TIndex.init(IarrayPtr, IfieldP: pointer; IarraySize, IentrySize, IfieldSize: integer);
function TIndex.contains(item: Pointer): Boolean;
function TIndex.min(): Pointer;
function TIndex.max(): Pointer;
function TIndex.occurrencesOf(item: Pointer): integer;
function TIndex.findAll(item: Pointer): TpointerArray;
function TIndex.find(item: Pointer): Pointer;
Example usage:
As you can see, the initializer looks a little messy but it's not so bad. Even a novice should be able to mimic the example.Simba Code:{$include indexer.simba}
type
TGrade = record
name: string;
id: integer;
bestScore: extended;
grade: char;
timeSpent: int64;
end;
var
studentRecords: array of TGrade;
finder: TGrade;
idIndex: TIndex;
studentPointer: ^TGrade;
begin
{pretend we have records}
idIndex.init(@studentRecords[0],
@(studentRecords[0].id),
length(studentRecords),
sizeof(TTest),
sizeof(studentRecords[0].id));
finder.id = 1357; {let's find this guy}
studentPointer := idIndex.find(@finder);
writeln('The student with id 1357 is ', studentPointer^.name,
' and his grade is ', studentPointer^.grade);
end.
It needs to know:
- The address of the first record in the array
- The address of the first field (in the first record)... the one you want to index.
- How big is the array (and the record and the field).
When you search, you fill in the search field on a record of that same type and give its address to the search function.
The find functions return either a pointer to a single record or an array of pointers. You can turn those into usable variables with the '^' notation.
CODE
At last, the code.
Simba Code:// 7/2015 -- bonsai
//
{$f-}
type
TPointerArray = array of Pointer;
PTreeNode =^ TTreeNode;
TTreeNode = record
data: Pointer;
left: PTreeNode;
right: PTreeNode;
end;
TIndex = record
head: PTreeNode;
offset: integer;
fieldSize: integer;
end;
procedure TPointerArray.append(p: Pointer);
begin
setLength(self, length(self) + 1);
self[high(self)] := p;
end;
function TIndex._pcompare(a, b: Pointer): integer;
var
i: integer;
p1, p2: pointer;
begin
p1 := a + self.offset;
p2 := b + self.offset;
case self.fieldSize of
1:
begin
if byte(p1^) = byte(p2^) then
exit(0)
else if byte(p1^) < byte(p2^) then
exit( - 1)
else
exit(1);
end;
2:
begin
if word(p1^) = word(p2^) then
exit(0)
else if word(p1^) < word(p2^) then
exit( - 1)
else
exit(1);
end;
4:
begin
if uint32(p1^) = uint32(p2^) then
exit(0)
else if uint32(p1^) < uint32(p2^) then
exit( - 1)
else
exit(1);
end;
8:
begin
if uint64(p1^) = uint64(p2^) then
exit(0)
else if uint64(p1^) < uint64(p2^) then
exit( - 1)
else
exit(1);
end;
10:
begin
if extended(p1^) = extended(p2^) then
exit(0)
else if extended(p1^) < extended(p2^) then
exit( - 1)
else
exit(1);
end;
end;
end;
function TIndex._newNode(item: Pointer): PTreeNode;
var
p: PTreeNode;
begin
new(p);
p^.data := item;
p^.left := p^.right := nil;
result := p;
end;
procedure TIndex._clearFrom(var node: PTreeNode);
begin
if (node = nil) then
exit;
_clearFrom(node^.left);
_clearFrom(node^.right);
dispose(node);
node := nil;
end;
procedure TIndex._clear();
begin
self._clearFrom(self.head);
end;
function TIndex._findAllFrom(node: PTreeNode; item: Pointer): TpointerArray;
var
tmp: TPointerArray;
begin
setLength(result, 0);
while (node <> nil) do
begin
case self._pcompare(node^.data, item) of
-1: node := node^.right;
1: node := node^.left;
0:
begin
result.append(node^.data);
node := node^.left;
end;
end;
end;
end;
function TIndex._minFrom(node: PTreeNode): Pointer;
var
p: PTreeNode;
begin
p := node;
while (p^.left <> nil) do
p := p^.left;
result := p^.data;
end;
function TIndex._maxFrom(node: PTreeNode): Pointer;
var
p: PTreeNode;
begin
p := node;
while (p^.right <> nil) do
p := p^.right;
result := p^.data;
end;
function TIndex._insertFrom(var node: PTreeNode; item: Pointer): PTreeNode;
begin
if (node = nil) then
begin
result := _newNode(item);
exit
end;
if (self._pcompare(item, node^.data) <= 0) then
node^.left := self._insertFrom(node^.left, item)
else
node^.right := self._insertFrom(node^.right, item);
result := node;
end;
procedure TIndex._insert(item: Pointer);
begin
self.head := self._insertFrom(self.head, item);
end;
function TIndex._occurrencesOfFrom(node: PTreeNode; item: Pointer): integer;
begin
if (node = nil) then
exit(0);
case self._pcompare(node^.data, item) of
- 1: result := self._occurrencesOfFrom(node^.right, item);
0: result := 1 + self._occurrencesOfFrom(node^.left, item);
1: result := self._occurrencesOfFrom(node^.left, item);
end;
end;
function TIndex.contains(item: Pointer): Boolean;
begin
result := (self.find(item) <> nil);
end;
function TIndex.min(): Pointer;
begin
result := self._minFrom(self.head);
end;
function TIndex.max(): Pointer;
begin
result := self._maxFrom(self.head);
end;
function TIndex.occurrencesOf(item: Pointer): integer;
begin
result := self._occurrencesOfFrom(self.head, item);
end;
function TIndex.findAll(item: Pointer): TpointerArray;
begin
result := self._findAllFrom(self.head, item);
end;
function TIndex.find(item: Pointer): Pointer;
var
p: PTreeNode;
tst: integer;
begin
result := nil;
p := self.head;
while (p <> nil) do
begin
tst := self._pcompare(p^.data, item);
if (tst = 0) then
begin
result := p^.data;
break;
end
else
begin
if (tst > 0) then
p := p^.left
else
p := p^.right;
end;
end;
end;
procedure TIndex.debugFrom(node: PTreeNode;
printData: procedure(p: pointer) = nil);
begin
if (node = nil) then exit;
writeln('.. Node: ', node);
if (printData <> nil) then
printData(node^.data);
self.debugFrom(node^.left, printData);
self.debugFrom(node^.right, printData);
end;
procedure TIndex.debug(printData: procedure(p: pointer) = nil);
begin
writeln('--------------------------------------------');
writeln('Tree head=', head);
self.debugFrom(self.head, printData);
end;
procedure TIndex.init(IarrayPtr, IfieldP: pointer; IarraySize, IentrySize, IfieldSize: integer);
var
i: integer;
topBound: Pointer;
begin
if (IfieldP < IarrayPtr) or
((ifieldSize <> 1) and (ifieldSize <> 2) and (ifieldSize <> 4) and (ifieldSize <> 8) and (ifieldSize <> 10)) then
begin
writeln('error');
exit;
end;
self.offset := uint32(IfieldP) - uint32(IarrayPtr);
self.fieldSize := IfieldSize;
if (self.head <> nil) then
begin
self._clear();
end
self.head := nil;
topBound := IArrayPtr + (IarraySize * IentrySize);
repeat
self._insert(IarrayPtr);
IarrayPtr += IentrySize;
until IarrayPtr >= topBound;
end;
indexer.simba
indexerMain.simba
indexerPerf.simba