Stage 10: B+ Tree Search on Relations (18 hours)
- Understand the fundamentals of the B+ tree data structure and how it can be used for the purposes of indexing.
- Understand how indexes are stored in the XFS disk
- Implement the B+ tree search operations to select records in NITCbase
Introduction
A relation in a production database can contain millions or records and might span over a large number of disk blocks. An index would help us to traverse these disk blocks much quicker than a linear search over every record. If we find ourselves frequently doing search operations on the values of one specific attribute of the relation, then creating an index on that attribute will immensely speed up the search.
For example, consider the relation Student
having 5 attributes (Roll No
, Name
, Marks
, Grade
, Attendance
). As an administrator, we might have to frequently get the subset of students having Marks
greater than some amount M
. Instead of going through each and every record and checking if it satisfies our condition, the index allows us to easily reach the first record with Marks
> M
. We know that in a B+ tree, every subsequent leaf node will also satisfy this condition (Why?). You can see how this would save us a lot of time.
In NITCbase, indexes are B+ trees with internal nodes of size 100, and leaf nodes of size 63. Each of these nodes will be stored in a separate disk block. A fully filled internal node would consist of a 100 attribute values from various records and 101 pointers to their respective children. A pointer here refers to the block number of the corresponding internal or leaf index block. A fully filled leaf node would consist of 63 attribute values from various records. For each of these attribute values, the block number and slot number of the record with this value is also stored.
The attribute catalog stores whether a particular attribute of a relation has an index. If it does, the RootBlock
field of the attribute catalog will store the block number of the root block of the index. If not, RootBlock
will contain the value -1
.
To proceed further, we will need some prerequisite reading.
A production database might automatically create and dispose of indexes as required by the program without user intervention. In NITCbase, the user is expected to decide when the index is to be created and dropped (using the CREATE INDEX and DROP INDEX commands of the Schema Layer). Note that we will not be implementing these commands in the present stage and will instead be using them through the XFS Interface.
Q. Assume that we have an empty database with no relations. We start it and create a table LibraryBooks(name STR, id NUM, shelf NUM, borrower STR)
. We then insert 1000 records into the relation LibraryBooks
in descending order of their id
. It is given that the records have id
from 1000 to 1.
- If we were to do a search for a book with
id
> 500, which book would we get? What's the corresponding record-id=(block, slot)? - We then create an index on
id
for LibraryBooks
. How many index blocks would be created? - How many entries does the root block have? What is the rightmost value in the root node of the B+ tree?
- If we were to again do a search for a book with
id
> 500, which book would we get? What is the index of the entry corresponding to the found record in the leaf node of the B+ tree?
(click to view answer)
LibraryBooks(name STR, id NUM, shelf NUM, borrower STR)
. We then insert 1000 records into the relation LibraryBooks
in descending order of their id
. It is given that the records have id
from 1000 to 1.id
> 500, which book would we get? What's the corresponding record-id=(block, slot)?id
for LibraryBooks
. How many index blocks would be created?id
> 500, which book would we get? What is the index of the entry corresponding to the found record in the leaf node of the B+ tree?Answer
- We will get the record with
id
=1000 because that's the first record that will satisfy the condition when a linear search is done. Since this record will be the first record in the first block of the relation, we get that the rec-id is{6, 1}
. The 7th block is the first block that is available to be used by a relation as the first 6 blocks are reserved. - There will be 31 leaf index blocks and 1 internal index block.
- The root block will have 30 entries. The rightmost value in the node will be 968.
- We would get the book with
id
501 because the records will be sorted in ascending order in the leaf node. B+ search will return the first node in the leaf that satisfies the condition.
Implementation
In this stage, you will implement the B+ search operations on a relation in the BPlusTree::bPlusSearch()
function. We will also modify the functions we designed earlier to do an indexed search if an index is available.
Similar to the linear search operation that you implemented in the Block Access Layer, the BPlusTree::bPlusSearch()
function, when called for the first time, will return the first record that satisfies the given condition. Every subsequent call to the function will return a proceeding record that satisfies the condition until there are no more records to be returned.
Recall that in BlockAccess::linearSearch()
, the position of the previously returned record was stored in the RelCacheTable in the searchIndex
field of the entry corresponding to the relation. This field was used to keep track of the position while searching through the records. To restart a search from the beginning, we would have to reset the search index using the RelCacheTable::resetSearchIndex()
function.
The BPlusTree::bPlusSearch()
function too makes use of a searchIndex
field to keep track of it's previous search position in the leaf of the B+ tree. For a B+ search on the B+ tree of an attribute, the search index is stored in the linked list entry corresponding to the attribute in the AttrCacheTable entry of the relation. The search index can be reset using the AttrCacheTable::resetSearchIndex()
function.
Unlike the search index in the relation cache, this search index does not store the rec-id of the record (from the linear search). Here, the search index stores the last entry in the leaf of the B+ tree that matched our search (that is, the block number and the entry within the leaf index block).
To proceed further, you will need to read the documentation explaining how indexing is implemented in NITCbase. While the earlier documentation explained the algorithms of the B+ tree data structure, the following document explains the specific implementation details of this data structure in NITCbase.
To add the B+ tree search functionality, we will need to implement the following:
- The Buffer Layer methods to read from index blocks.
- Here, we introduce three new classes IndBuffer, IndInternal and IndLeaf.
IndBuffer
is an abstract class which defines virtual methods to access and update entries in index blocks. Note thatIndBuffer
cannot be instantiated owing to it being abstract.IndInternal
andIndLeaf
are children ofIndBuffer
used for buffered access to leaf blocks and internal index blocks respectively of the B+ tree.
- The Cache Layer methods to read and update the search index in the attribute cache
- The B+ Tree Layer method to search through a B+ tree present in the disk
- Modifications to the Block Access Layer to call the B+ search if an index is present
- Modifications to the Algebra Layer to reset the search index in the attribute cache before beginning B+ search
A sequence diagram documenting the call sequence for a call to the BlockAccess::search()
function is shown below.
NOTE: The functions are denoted with circles as follows.
🔵 -> methods that are already in their final state
🟢 -> methods that will attain their final state in this stage
🟤 -> methods that we built earlier and require more work later, but will leave as is in this stage
A class diagram showing all the relevant methods is given below. Note that the Buffer Layer classes corresponding to record blocks have been omitted for the sake of brevity.
Cache Layer
Buffer Layer
When an index is created on an attribute of a relation, the attribute catalog entry of the attribute is updated to store the block number of the root block of the B+ tree. This may be a leaf index block or an internal index block. (It will be a leaf if the total number of records is less than 63. Otherwise, there will be multiple leaf blocks and consequently an internal index block.)
Recall that the search()
function in the Block Access Layer is used to either do a B+ search or a linear search depending on the presence of an index. Our earlier implementation did not account for indexes and directly called the BlockAccess:linearSearch()
function. We now modify that function to check the attribute catalog entry and call the BPlusTree::bPlusSearch()
function if there is an index (we will implement this function later in this stage).
BlockAccess/BlockAccess.cpp
Implement the BlockAccess::search()
function by looking at the design docs.
In the Cache Layer, we add the methods to work with the search index in the attribute cache similar to what we had implemented for the search index in the relation cache. We have methods AttrCacheTable::getSearchIndex()
, AttrCacheTable::setSearchIndex()
and AttrCacheTable::resetSearchIndex()
. Note that each of these methods are overloaded to identify an attribute with either the attribute name or attribute offset. Thus, we have a total of 6 functions to implement in the AttrCacheTable class.
Cache/AttrCacheTable.cpp
Implement the following functions looking at their respective design docs
AttrCacheTable::getSearchIndex(relId, attrName, searchIndex*)
AttrCacheTable::getSearchIndex(relId, attrOffset, searchIndex*)
AttrCacheTable::setSearchIndex(relId, attrName, searchIndex*)
AttrCacheTable::setSearchIndex(relId, attrOffset, searchIndex*)
AttrCacheTable::resetSearchIndex(relId, attrName, searchIndex*)
AttrCacheTable::resetSearchIndex(relId, attrOffset, searchIndex*)
We had earlier modified BlockAccess::search()
such that it can do either a linear search or a B+ search. Recall that in our current implementation, we always reset the search index in the relation cache before the first call to linearSearch()
or search()
to start searching from the beginning. It is now needed to modify the design to reset the search index in the attribute cache before the first call to bPlusSearch()
or search()
.
Thus far in our implementation, BlockAccess::search()
has only been used in the select()
function in the Algebra Layer. So, we modify our earlier design to call AttrCacheTable::resetSearchIndex()
before the first call to search()
.
Algebra/Algebra.cpp
Implement the Algebra::select()
function by looking at the design docs.
In the Buffer Layer, we implement the methods to read from index blocks using classes IndInternal
and IndLeaf
as mentioned earlier. Similar to how we've used RecBuffer
to read from record blocks thus far, the IndBuffer
class defines virtual methods getEntry()
and setEntry()
to access the entry at a particular index in a B+ tree node. Note that every node of the B+ tree occupies an entire block of the XFS disk.
The IndInternal
and IndLeaf
classes also define a constructor 1 (which is used to read from an already allocated index block) and a constructor 2 (which is used to allocate a new index block) similar to what we had implemented for RecBuffer
. These constructors call their parent class IndBuffer
's constructor 1 and constructor 2 respectively which in turn calls the corresponding BlockBuffer
constructors that we implemented earlier.
Similar to the struct Attribute that we used to hold the values in a record, we have helper structs to store index entries as well. An entry from a leaf index block is represented using struct Index and an entry from an internal index block is represented using struct InternalEntry.
Since we do not need to modify indexes at this stage, we will only be implementing the getEntry()
method. We will also define an empty setEntry()
method to avoid compilation issues.
Buffer/BlockBuffer.cpp
Implement the following constructors looking at the respective design docs
IndBuffer::Constructor1
IndBuffer::Constructor2
IndInternal::Constructor1
IndInternal::Constructor2
IndLeaf::Constructor1
IndLeaf::Constructor2
Implement the following functions looking at the respective design docs
Add the following empty definitions to avoid compilation issues. We will implement these functions in later stages.
int IndInternal::setEntry(void *ptr, int indexNum) {
return 0;
}
int IndLeaf::setEntry(void *ptr, int indexNum) {
return 0;
}
An additional function we will need in the Buffer Layer is the StaticBuffer::getStaticBlockType()
function which takes a block number and returns the type of the block (REC
/IND_INTERNAL
/IND_LEAF
/UNUSED_BLK
/BMAP
). We will use this function when we implement bPlusSearch()
later in this stage.
Buffer/StaticBuffer.cpp
Implement the StaticBuffer::getStaticBlockType()
function by looking at the design docs.
Lastly, we implement the B+ Tree Layer function bPlusSearch()
which comprises all the logic for an indexed search operation. Refer to the algorithm for search if you have not done so already.
BPlusTree/BPlusTree.cpp
Implement the BPlusTree::bPlusSearch()
by looking at the design docs.
Exercises
Q1. Create a relation Numbers(key NUM)
which will be used to store a list of numbers. Insert all the values from the file numbers.csv into the relation. Run the following command in your NITCbase to select all numbers which are greater than 165,000 into a new relation.
SELECT * FROM Numbers INTO BigNumbers WHERE key > 165000;
Create an index for this relation using the CREATE INDEX command in the XFS Interface. Now, with an index on Numbers.key
, re-run the above command and note the difference in time between the select queries with and without the index.
NOTE:
- Don't forget to exit NITCbase before running commands in the XFS Interface (refer to the WARNING in the documentation of runtime disk).
- The file numbers.csv contains values for 166,464 records. As such, operations on the relation might take some time to complete.
Q2. Modify your implementation to print the number of comparisons to do both of the above SELECT queries.