Stage 12: Join on Relations (10 hours)
- Implement the equi-join operation between relations in NITCbase
Introduction
The join operation is used to combine the records of two relations with respect to a condition on two columns from the respective relations. NITCbase allows us to combine two relations into a new relation with the =
condition. This is called an equijoin.
Equi-join is an operation involving two relations - say Relation_1 and Relation_2, such that both of them have a common attribute. The name of the common attribute could be different in the two relations, but the attribute type must match. When equijoin is performed, a Cartesian product of the tuples of the two relations is taken, and all the tuples in the Cartesian product where the common attribute has the same value are selected and written into a new relation. Since the value of the common attribute in all the selected tuples will be the same, only one copy of the common attribute is added to the new relation. Equi-join requires that the names of all the attributes of Relation_1 and Relation_2 are distinct, except possibly for the common attribute.
For example, consider the relations Shops(id NUM, shopName STR, contact STR)
and Products(id NUM, productName STR, shopId NUM)
. The Shops
relation stores details about all the shops. The Products
relation is used to store a list of all the products and the shop to find them. Note that the attribute shopId
in Products
is common with the attribute id
of Shops
. Here, we see that we have a common attribute between the two relations, but identified by distinct names in the two relations.
Shops
id | shopName | contact |
---|---|---|
1 | WezCafe | 1234509876 |
2 | BakeHouse | 3434238983 |
3 | BurgerLounge | 9892389331 |
4 | Monarch | 6739383883 |
Products
id | productName | shopId |
---|---|---|
13 | Burger | 3 |
24 | Cake | 4 |
32 | Steak | 4 |
46 | Pizza | 1 |
Suppose a customer stops by and asks for a list of products and the numbers to contact to get them. Clearly, we have all the data for this in the database. Essentially, what we need is a single relation which has data from both the above relations. So, we do a join operation. Let's call the target relation ProductShops
.
ProductShops
id | productName | shopId | shopName | contact |
---|---|---|---|---|
13 | Burger | 3 | BurgerLounge | 9892389331 |
24 | Cake | 4 | Monarch | 6739383883 |
32 | Steak | 4 | Monarch | 6739383883 |
46 | Pizza | 1 | WezCafe | 1234509876 |
Now, we see that our target relation has everything that the customer asked for.
A join operation results in the creation of a target relation which will consist of all the attributes from both source relations aside from the join attribute of the second relation. That is, the total number of attributes in the target relation will be numAttrs(Relation_1) + numAttrs(Relation_2) - 1
.
NITCbase also allows you to do a combination of join and project operations together in a single command to create a new target relation with the specified attributes from both relations. The associated commands are specified below.
Frontend User Interface Command | Operation |
---|---|
SELECT * FROM Rel1 JOIN Rel2 INTO TargetRel WHERE Rel1.Attr1 = Rel2.Attr2 | join |
SELECT Attr1,Attr2 FROM Rel1 JOIN Rel2 INTO TargetRel WHERE Rel1.Attr1 = Rel2.Attr2 | join + project |
Q. Consider we have two relations Events(id NUM, title STR, location STR)
and Locations(name STR, capacity NUM)
. Here, Events.location
and Locations.name
are common between the two relations. We run the following commands in NITCbase.
OPEN TABLE Events;
SELECT * FROM Events INTO Lectures WHERE location=ELHC;
OPEN TABLE Locations;
OPEN TABLE Lectures;
SELECT title, location, capacity FROM Lectures JOIN Locations INTO LectureCapacities WHERE Lectures.location = Locations.name;
- What are the attribute cache entries for the relation
LectureCapacities
? - Suppose we add a relation
Participants
with attributes (regNo
: NUM, eventTitle
: STR). Write commands to filter the regNo
of all the participants who are attending events happening in the location Auditorium
.
(click to view answer)
Events(id NUM, title STR, location STR)
and Locations(name STR, capacity NUM)
. Here, Events.location
and Locations.name
are common between the two relations. We run the following commands in NITCbase.OPEN TABLE Events;
SELECT * FROM Events INTO Lectures WHERE location=ELHC;
OPEN TABLE Locations;
OPEN TABLE Lectures;
SELECT title, location, capacity FROM Lectures JOIN Locations INTO LectureCapacities WHERE Lectures.location = Locations.name;
LectureCapacities
?Participants
with attributes (regNo
: NUM, eventTitle
: STR). Write commands to filter the regNo
of all the participants who are attending events happening in the location Auditorium
.Answer
RelName | AttributeName | AttributeType | PrimaryFlag | RootBlock | Offset |
---|---|---|---|---|---|
LectureCapacities | title | STR | - | -1 | 0 |
LectureCapacities | location | STR | - | -1 | 1 |
LectureCapacities | capacity | NUM | - | -1 | 2 |
OPEN TABLE Events;
OPEN TABLE Participants;
SELECT regNo,location FROM Participants JOIN Events INTO ParticipantLocations WHERE Participants.eventTitle = Events.title;
OPEN TABLE ParticipantLocations;
SELECT regNo FROM ParticipantLocations INTO AuditoriumParticipants WHERE location=Auditorium;
To do a join operation, we fetch every record from the first relation one by one. For every record, we do a search operation on the second relation to fetch the records that have the specified attribute value equal to the value in the record from the first relation. For every record of the first relation, there will be a set of search calls to the second relation to fetch all records that match on the common attribute.
Suppose that we are performing a join operation on a certain attribute between Relation_1 and Relation_2. Let Relation_1 have tuples and Relation_2 have tuples. If we were to do a linear search on the second relation to find the matching records, the join operation would have an overall complexity of . However, if an index was present on the second relation, then we would be able to do a B+ search to find the matching records. This reduces the complexity of the join operation to .
Because of this, the NITCbase design mandates the following. If the second relation in a join operation does not have an index on the join attribute, one will be created for it before the join operation proceeds.
Readers interested in proceding with the implementation may skip the following note about computational complexity and proceed with the next section.
Let Relation_1 have tuples and Relation_2 have tuples. For each between 1 and , let the -th tuple in Relation_1 match with a total of Ni number of tuples of Relation_2 on the value of the common attribute. We have N1 + N2 + … + Nm = n.
Suppose we do not create an index for the attribute in both Relation_1 and Relation_2, to match the two relations, then for each tuple in Relation_1, we have to perform a linear search across the entire second relation, which takes time. (the relation cache search index field has a crucial role in limiting linear search complexity to here). Since there are m tuples in Relation_1, conducting linear search on Relation_2 over all of them will involve a total complexity of .
Now, suppose we create an index for the shared attribute in Relation_2. Index creation requires complexity. This is because index creation in NITCbase involves n insertions into a B+ tree and each B+ tree insert operation has complexity.
The presence of the index makes the equi-join far more efficient. For the i-th tuple in Relation_1, we will hit the first matching tuple in Relation_2 within time with an indexed search. Now, each of the remaining Ni - 1 matching tuples of Relation_2 will be returned from the B+ search with constant time (again, the attribute cache search index is crucial here!). Thus, the total complexity in fetching the matching records for the i-th tuple of Relation_1 is . Since, this process has to be repeated for the m tuples in Relation_1, the complexity will be
The worst case complexity for the join operation would involve the cost of creating an index as well as the cost to search through it. This would be for the case when an index does not exist on the attribute for Relation_2. Thus, the worst case cost for an equi-join operation in NITCbase adds up to
This is a significant improvement over linear search and would reduce the time required to complete the operation by a significant amount, especially as we approach upwards of a million records.
To illustrate the saving, suppose Relation_1 has records and Relation_2 has records. For simplicity, let us take log to base 10.
Implementation
A sequence diagram documenting the call sequence involved in a call to the SELECT * FROM JOIN command is shown below. The calls to the Schema Layer, Cache Layer and Buffer Layer are omitted for the sake of clarity.
NOTE: The functions are denoted with circles as follows.
🔵 -> methods that are already in their final state
🟢 -> methods that will attain their final state in this stage
The only class that you will be modifying in this stage is the class corresponding to the Algebra Layer. The class diagram is shown below.
In the Algebra Layer, we add the join()
function which receives two relations and the attributes on which an equi-join is to be performed. This function results in the creation of a target relation which is the join of both the source relations, as we mentioned earlier.
The target relation produced from a join operation would contain all the attributes from both source relation (aside from the join attribute of the second relation). So, NITCbase requires that there be no attribute names that are common between the two relations except for the join attribute.
Algebra/Algebra.cpp
Implement the Algebra::join()
function by looking at the design docs.
In the Frontend Programming Interface, we update the handlers of the functions to call the respective Algebra Layer methods.
The SELECT * FROM JOIN
command maps to the function Frontend::select_from_join_where()
. This function only involves a call to the Algebra::join()
function that we implemented earlier.
The SELECT AttrList FROM JOIN
command maps to the function Frontend::select_attrlist_from_join_where()
. This function involves a join operation as well as a projection operation. Thus, both the corresponding Algebra Layer methods are called.
Frontend/Frontend.cpp
Implement the following functions looking at their respective design docs
The only thing left to do is for you to evaluate your implementation with some exercises.
But, before that we need to cover the FUNCTION command. This is a test command that is provided to you implement any other feature you want to implement. The Frontend User Interface will pass along any command beginning with FUNCTION
to the Frontend::custom_function()
method. The implementation of this method is left entirely to your imagination.
Now, moving on to the exercises.
Exercises
Q1. In a previous stage, you created relations Events(id NUM, title STR, location STR)
, Locations(name STR, capacity NUM)
and Participants(regNo NUM, event STR)
. Populate these relations with the data from the files events.csv, locations.csv and participants.csv respectively. Write commands to obtain the following relations.
EventLocations(name STR)
: Stores the name of all the locations that have an event happening in it. (duplicate entries are allowed)AudiPeople(regNo NUM, eventName STR)
: Stores the registration number and event of all the participants taking part in an event that is happening at the location 'Audi'MiniEventPeople(regNo STR, eventName NUM)
: Stores the registration number and event of all the participants taking part in an event that is happening at a location with capacity between 100 and 200.
Q2. This exercise will test the error conditions of the join functionality. Run the following in your NITCbase and ensure that you get the corresponding output.
CREATE TABLE EventRating(id NUM, title STR, rating NUM); # Relation EventRating created successfully
SELECT * FROM Events JOIN EventRating INTO LocRating WHERE Events.id = EventRating.id;
# Error: Duplicate attributes found
SELECT * FROM Events JOIN EventRating INTO LocRating WHERE Events.id = EventRating.name;
# Error: Mismatch in attribute type
SELECT * FROM Events JOIN EventRating INTO LocRating WHERE Events.id = EventRating.title;
# Error: Attribute does not exist
Q3. Create a relation Organizers(name STR, eventId NUM)
which will store the names of the organizers and the id of the event they are responsible for. Populate the relation with the values in the file organizers.csv. Write commands to obtain the required data for the following situations.
- Every participant needs to be informed the location to report to as well as the organizer to contact for the event they registered for.
- The organizer
Thomas
wants a list of all the participants of the events he is organizing. Since he is organising multiple events, he wants the regNo and the event name for all the participants he is responsible for.
And with that, you have completed the implementation of NITCbase! You've implemented a small database management system with algebraic operations, schema operations, disk buffer, caching, indexing and a whole lot more. Hope this experience was fruitful in helping you understand the working of a relational DBMS❤️.