Stage 10: B+ Tree Search on Relations (18 hours)

Learning Objectives

Understand the fundamentals of the B+ tree data structure and how it can be used for the purposes of indexing.
Understand how indexes are stored in the XFS disk
Implement the B+ tree search operations to select records in NITCbase

PREREQUISITE READING

B+ Trees

Introduction

A relation in a production database can contain millions or records and might span over a large number of disk blocks. An index would help us to traverse these disk blocks much quicker than a linear search over every record. If we find ourselves frequently doing search operations on the values of one specific attribute of the relation, then creating an index on that attribute will immensely speed up the search.

For example, consider the relation Student having 5 attributes (Roll No, Name, Marks, Grade, Attendance). As an administrator, we might have to frequently get the subset of students having Marks greater than some amount M. Instead of going through each and every record and checking if it satisfies our condition, the index allows us to easily reach the first record with Marks > M. We know that in a B+ tree, every subsequent leaf node will also satisfy this condition (Why?). You can see how this would save us a lot of time.

In NITCbase, indexes are B+ trees with internal nodes of size 100, and leaf nodes of size 63. Each of these nodes will be stored in a separate disk block. A fully filled internal node would consist of a 100 attribute values from various records and 101 pointers to their respective children. A pointer here refers to the block number of the corresponding internal or leaf index block. A fully filled leaf node would consist of 63 attribute values from various records. For each of these attribute values, the block number and slot number of the record with this value is also stored.

The attribute catalog stores whether a particular attribute of a relation has an index. If it does, the RootBlock field of the attribute catalog will store the block number of the root block of the index. If not, RootBlock will contain the value -1.

To proceed further, we will need some prerequisite reading.

PREREQUISITE READING

A production database might automatically create and dispose of indexes as required by the program without user intervention. In NITCbase, the user is expected to decide when the index is to be created and dropped (using the CREATE INDEX and DROP INDEX commands of the Schema Layer). Note that we will not be implementing these commands in the present stage and will instead be using them through the XFS Interface.

Q. Assume that we have an empty database with no relations. We start it and create a table LibraryBooks(name STR, id NUM, shelf NUM, borrower STR). We then insert 1000 records into the relation LibraryBooks in descending order of their id. It is given that the records have id from 1000 to 1.

If we were to do a search for a book with id > 500, which book would we get? What's the corresponding record-id=(block, slot)?
We then create an index on id for LibraryBooks. How many index blocks would be created?
How many entries does the root block have? What is the rightmost value in the root node of the B+ tree?
If we were to again do a search for a book with id > 500, which book would we get? What is the index of the entry corresponding to the found record in the leaf node of the B+ tree?

(click to view answer)

Answer

We will get the record with id=1000 because that's the first record that will satisfy the condition when a linear search is done. Since this record will be the first record in the first block of the relation, we get that the rec-id is {6, 1}. The 7th block is the first block that is available to be used by a relation as the first 6 blocks are reserved.
There will be 31 leaf index blocks and 1 internal index block.
The root block will have 30 entries. The rightmost value in the node will be 968.
We would get the book with id 501 because the records will be sorted in ascending order in the leaf node. B+ search will return the first node in the leaf that satisfies the condition.

Implementation

In this stage, you will implement the B+ search operations on a relation in the BPlusTree::bPlusSearch() function. We will also modify the functions we designed earlier to do an indexed search if an index is available.

Similar to the linear search operation that you implemented in the Block Access Layer, the BPlusTree::bPlusSearch() function, when called for the first time, will return the first record that satisfies the given condition. Every subsequent call to the function will return a proceeding record that satisfies the condition until there are no more records to be returned.

Recall that in BlockAccess::linearSearch(), the position of the previously returned record was stored in the RelCacheTable in the searchIndex field of the entry corresponding to the relation. This field was used to keep track of the position while searching through the records. To restart a search from the beginning, we would have to reset the search index using the RelCacheTable::resetSearchIndex() function.

The BPlusTree::bPlusSearch() function too makes use of a searchIndex field to keep track of it's previous search position in the leaf of the B+ tree. For a B+ search on the B+ tree of an attribute, the search index is stored in the linked list entry corresponding to the attribute in the AttrCacheTable entry of the relation. The search index can be reset using the AttrCacheTable::resetSearchIndex() function.

Unlike the search index in the relation cache, this search index does not store the rec-id of the record (from the linear search). Here, the search index stores the last entry in the leaf of the B+ tree that matched our search (that is, the block number and the entry within the leaf index block).

To proceed further, you will need to read the documentation explaining how indexing is implemented in NITCbase. While the earlier documentation explained the algorithms of the B+ tree data structure, the following document explains the specific implementation details of this data structure in NITCbase.

PREREQUISITE READING

Indexing in NITCbase

To add the B+ tree search functionality, we will need to implement the following:

The Buffer Layer methods to read from index blocks.
- Here, we introduce three new classes IndBuffer, IndInternal and IndLeaf.
- IndBuffer is an abstract class which defines virtual methods to access and update entries in index blocks. Note that IndBuffer cannot be instantiated owing to it being abstract.
- IndInternal and IndLeaf are children of IndBuffer used for buffered access to leaf blocks and internal index blocks respectively of the B+ tree.
The Cache Layer methods to read and update the search index in the attribute cache
The B+ Tree Layer method to search through a B+ tree present in the disk
Modifications to the Block Access Layer to call the B+ search if an index is present
Modifications to the Algebra Layer to reset the search index in the attribute cache before beginning B+ search

A sequence diagram documenting the call sequence for a call to the BlockAccess::search() function is shown below.

NOTE: The functions are denoted with circles as follows.
🔵 -> methods that are already in their final state
🟢 -> methods that will attain their final state in this stage
🟤 -> methods that we built earlier and require more work later, but will leave as is in this stage

A class diagram showing all the relevant methods is given below. Note that the Buffer Layer classes corresponding to record blocks have been omitted for the sake of brevity.

Cache Layer

Buffer Layer

When an index is created on an attribute of a relation, the attribute catalog entry of the attribute is updated to store the block number of the root block of the B+ tree. This may be a leaf index block or an internal index block. (It will be a leaf if the total number of records is less than 63. Otherwise, there will be multiple leaf blocks and consequently an internal index block.)

Recall that the search() function in the Block Access Layer is used to either do a B+ search or a linear search depending on the presence of an index. Our earlier implementation did not account for indexes and directly called the BlockAccess:linearSearch() function. We now modify that function to check the attribute catalog entry and call the BPlusTree::bPlusSearch() function if there is an index (we will implement this function later in this stage).

BlockAccess/BlockAccess.cpp

Implement the BlockAccess::search() function by looking at the design docs.

In the Cache Layer, we add the methods to work with the search index in the attribute cache similar to what we had implemented for the search index in the relation cache. We have methods AttrCacheTable::getSearchIndex(), AttrCacheTable::setSearchIndex() and AttrCacheTable::resetSearchIndex(). Note that each of these methods are overloaded to identify an attribute with either the attribute name or attribute offset. Thus, we have a total of 6 functions to implement in the AttrCacheTable class.

Cache/AttrCacheTable.cpp

Implement the following functions looking at their respective design docs

We had earlier modified BlockAccess::search() such that it can do either a linear search or a B+ search. Recall that in our current implementation, we always reset the search index in the relation cache before the first call to linearSearch() or search() to start searching from the beginning. It is now needed to modify the design to reset the search index in the attribute cache before the first call to bPlusSearch() or search().

Thus far in our implementation, BlockAccess::search() has only been used in the select() function in the Algebra Layer. So, we modify our earlier design to call AttrCacheTable::resetSearchIndex() before the first call to search().

Algebra/Algebra.cpp

Implement the Algebra::select() function by looking at the design docs.

In the Buffer Layer, we implement the methods to read from index blocks using classes IndInternal and IndLeaf as mentioned earlier. Similar to how we've used RecBuffer to read from record blocks thus far, the IndBuffer class defines virtual methods getEntry() and setEntry() to access the entry at a particular index in a B+ tree node. Note that every node of the B+ tree occupies an entire block of the XFS disk.

The IndInternal and IndLeaf classes also define a constructor 1 (which is used to read from an already allocated index block) and a constructor 2 (which is used to allocate a new index block) similar to what we had implemented for RecBuffer. These constructors call their parent class IndBuffer's constructor 1 and constructor 2 respectively which in turn calls the corresponding BlockBuffer constructors that we implemented earlier.

Similar to the struct Attribute that we used to hold the values in a record, we have helper structs to store index entries as well. An entry from a leaf index block is represented using struct Index and an entry from an internal index block is represented using struct InternalEntry.

Since we do not need to modify indexes at this stage, we will only be implementing the getEntry() method. We will also define an empty setEntry() method to avoid compilation issues.

Buffer/BlockBuffer.cpp

Implement the following constructors looking at the respective design docs

Implement the following functions looking at the respective design docs

Add the following empty definitions to avoid compilation issues. We will implement these functions in later stages.

int IndInternal::setEntry(void *ptr, int indexNum) {
  return 0;
}

int IndLeaf::setEntry(void *ptr, int indexNum) {
  return 0;
}

An additional function we will need in the Buffer Layer is the StaticBuffer::getStaticBlockType() function which takes a block number and returns the type of the block (REC/IND_INTERNAL/IND_LEAF/UNUSED_BLK/BMAP). We will use this function when we implement bPlusSearch() later in this stage.

Buffer/StaticBuffer.cpp

Implement the StaticBuffer::getStaticBlockType() function by looking at the design docs.

Lastly, we implement the B+ Tree Layer function bPlusSearch() which comprises all the logic for an indexed search operation. Refer to the algorithm for search if you have not done so already.

BPlusTree/BPlusTree.cpp

Implement the BPlusTree::bPlusSearch() by looking at the design docs.

Exercises

Q1. Create a relation Numbers(key NUM) which will be used to store a list of numbers. Insert all the values from the file numbers.csv into the relation. Run the following command in your NITCbase to select all numbers which are greater than 165,000 into a new relation.

SELECT * FROM Numbers INTO BigNumbers WHERE key > 165000;

Create an index for this relation using the CREATE INDEX command in the XFS Interface. Now, with an index on Numbers.key, re-run the above command and note the difference in time between the select queries with and without the index.

NOTE:
Don't forget to exit NITCbase before running commands in the XFS Interface (refer to the WARNING in the documentation of runtime disk).
The file numbers.csv contains values for 166,464 records. As such, operations on the relation might take some time to complete.

Q2. Modify your implementation to print the number of comparisons to do both of the above SELECT queries.

Stage 10: B+ Tree Search on Relations (18 hours)

Introduction​

Implementation​

Exercises​

Introduction

Implementation

Exercises