Database System Concepts: Relational Model

These are my notes from Chapter 2 of Database System Concepts, which is about database systems.

The first chapter of this book gave a high-level overview of database systems. The rest of the book is divided into parts, where part 1 is about relational databases. It begins by defining what the relational model is, and how relational algebra allows us to interact with a database. These topics are the subjects of this post.

The relational model is one of the most prevalent data models in use today, in large part because of its simplicity. Most relational databases use the SQL language as both their DDL (data-definition language) and DML (data-modification language). The foundation for the SQL language is relational algebra, so this is why Database System Concepts introduces relational algebra first. While relational algebra is noticeably less user-friendly than SQL, having a good understanding of it makes SQL much easier to learn and understand. This is the value of spending some time with relational algebra prior to learning about SQL.

Structure of relational databases

A relational database is structured as a set of tables. Every table in the database has a unique name. Each row in a table represents a relationship among a set of values, and each column represents one of these values. Let’s use an example from the book:

The table above is called the accounts table. Each row in the table represents an account. Each account is represented as having a number, being part of a branch, and having a balance. So each account is comprised of attributes (account_number, branch_name and balance).

In mathematics there is a concept of a relation which, for our purposes, can be viewed as being identical to a table. So the accounts table above can also be called the accounts relation. Additionally, each row in the table is called a tuple in the relation:

  • table = relation
  • tuple = row

The order of tuples within a relation does not matter, meaning that you can reorder the accounts tuples in whatever way you want and they’ll all be equivalent to each other. 

Note: The book uses the relation/tuple terminology in this chapter and in subsequent ones, which is why I am introducing it here. I will also try to use it when the book does for the sake of consistency.

Each attribute in a relation has a certain domain, which is the set of all permissible values that it can have. For example, the balance attribute in the accounts relation would probably be restricted to numeric values only, so the domain for balance would be all numeric values. This would mean that trying to store a non-numeric value such as “abcd” to a balance would not work, because it is not in the domain of that attribute.

There is one value that is included in all domains, and that is the null value. This value indicates that the attribute to which null is assigned is either unknown or does not exist. For example, if the balance for the account_number[‘A-101’] is unknown, it would be set to null. This value can be tricky to work with, so it’s desirable to avoid having to deal with it when possible.

Relational database schema

As mentioned in the first post in this series, there is a difference between a database schema and a database instance. These concepts have their parallels in most programming languages. Suppose in an object-oriented programming language you have a Book class, which has properties Title, Author, and NumberOfPages:

public class Book
{
    public string Title { get; set; }
    public string Author { get; set; }
    public int NumberOfPages { get; set; }
}

This set of properties would represent the “schema”, or the structure, of the class. If you created a variable for a book, then re-assigned that variable to another book, then you would have changed the variable from one “instance” to another. The same principles apply with databases, where the schema represents the structure of the data and the instance refers to what the data actually is at a given point in time.

The accounts schema above is represented with the following syntax:

Account_schema = (account_number, branch_name, balance)

Designing your relational database schema is very important because it has a strong influence on how easy it is for applications to interact with the data. For example, a poorly chosen schema would lead to many null values being saved across tuples in the database, or duplicate entries would be necessary to represent certain data. Finally, certain database schemas can make it impossible to represent certain types of data at all.

For the reasons given above, as well as many others, you need to give a lot of thought and care into defining your database schema. This topic is actually covered in detail in part 2 of Database System Concepts, so I will not cover it here (not to mention that I can’t quite cover it anyway since I haven’t read part 2 yet ;)).

Keys

There needs to be a way to distinguish between different tuples in a given relation. This is done using keys. A superkey is one or more attributes that together will uniquely identify a tuple. Let’s look at the accounts relation once more:

It is quite clear just by looking at this table that the account_number attribute is our superkey for this relation. This is because each account has a unique number. Additionally, notice that branch_name and balance couldn’t be our superkey because it’s possible that multiple accounts would be part of the same branch or would have the same value.

You could actually argue that the combination of account_number and branch_name could also be a superkey for this relation, and that is true. However, since account_number alone is sufficient to uniquely identify a tuple, it wouldn’t make sense to select this combination as our superkey. This is the notion of candidate keys, which are potential superkeys that don’t have any unnecessary attributes. In this case, the account_number is a candidate key, but the combination of account_number and branch_name is not a candidate key because branch_name is unnecessary.

The term primary key is used to denote the key that is chose to uniquely identify each tuple in a relation. Our primary key of choice for the accounts relation is the account_number. It is convention when you’re defining relations to put the primary key attribute(s) first. If you look at the accounts relation, you’ll see that account_number is shown first.

The primary key for each relation needs to be chosen carefully. First of all, you want the primary key to be something that will never be the same between multiple tuples. That’s why you wouldn’t want to choose something like branch_name for the accounts relation; multiple accounts will exist in the same branch. Second of all, you want to primary key to be something that will never change. Having to change a primary key is very difficult and can cause anomalies in your data or queries, so it is risky to have to change it. A common choice for a primary key is the Guid – which is a globally unique identifier.

The relational algebra query language

Relational algebra is one of the foundations of the SQL language. The authors of Database System Concepts divide the relational algebra operations into three categories: 

  1. basic/fundamental operations 
  2. additional operations that can be expressed in terms of the basic operations 
  3. extended operations

Operations from all three of these categories end up being valuable when working with SQL, so it is worth understanding each of them. I’m just listing what they are below for reference, there are many examples online of how they’re used.

Basic relational algebra operations

There are six basic relational algebra operations:

  1. select
  2. project
  3. union
  4. set-difference
  5. cartesian-product
  6. rename

The select operation will select tuples within a relation that satisfy a specific condition or set of conditions. Typically these conditions are either comparisons (i.e. greater than or equal to) or booleans. Following is the set of comparators available in the select clause:

= equal to
not equal to
< less than
> greater than
less than or equal to
greater than or equal to
¬ logical NOT
logical AND
logical OR

The select operation is denoted by the Greek letter sigma ( σ ).

The project operation is used to specify exactly which attributes we want to show up in the output of a relational algebra operation. For example, if we queried the accounts relation but only cared about the branch_name attribute in the output, we would use the project operation (denoted by Π) to specify this.

It is important to recognize that the output of any relational operation is also a relation. This allows you to build relational algebra expressions by putting together multiple operations. For example, we can combine select and project operations to first select certain tuples from a relation that meet a certain condition, and then we can use the project operation to remove all attributes that we don’t care about. Most of the time to get what we want, we will need to perform this type of composition of operations.

The union operation (designated by ∪) is an “either/or” binary type of operation, which looks at two relations and selects tuples that are in either of the two relations, or that are in both of the two relations. This operation requires that the two relations being compared are compatible. To be compatible, the two relations must (1) have the same number of attributes and (2) the domains of the ith attribute of the first relation and the ith attribute of the second relation must be the same for all i.

The set-difference operation will find tuples that are in one relation but not in another. For two relations r and s, the operation r-s will find all the tuples that are in r but not in s. As with union, the two relations in a set-difference operation must be compatible.

The Cartesian-product operation will multiply two relations together. Each row in the first relation gets multiplied by each row in the second relation, and the output relation appends all the attributes of the two relations together. For two relations r and s, the Cartesian-product looks like r X s. This operation ends up being one of the foundations for the join operation which we will end up seeing often in SQL.

Finally, we have the rename operation. This is straightforward, it simply takes a relation and either changes its name or creates it if it doesn’t exist. It looks like rx(E) where r is the relation, x is the new name and E is a relational algebra expression. The reason this operation is useful is because there are situations in which an output might not have a name. Additionally, when you perform the Cartesian-product operation on a relation with itself, you need to rename one of these relations because otherwise each attribute name would be duplicated, which cannot happen and would be an error.

Additional relational algebra operations

In addition to the basic relational algebra operations, there are a set of operations which are used frequently enough to justify having their own names. The only reason that these ones aren’t considered basic is simply because they can be expressed in terms of the basic operations.

Following are the additional relational algebra operations which expand upon the  basic ones:

  1. set intersection
  2. natural join
  3. division
  4. assignment

The set-intersection operation is a binary operation that is used to find tuples that are part of both relations being compared. It is expressed using the symbol. For example, in the banking application, there is a relation for people who borrow from the bank called borrower and a relation for people who deposit called depositor. If we wanted to find all customers who were both borrowers and depositors we would use the set-intersection operation. As with unions, this operation needs to be performed on compatible types.

The natural join operation is one of the most important of all the operations to understand because it’s used very frequently. Basically, the natural join will take two relations and perform the Cartesian product on them. This will give us a relation which has all the attributes from both relations. Then, this result is filtered down into only the tuples which have equal values for the same attributes. For example, if we did a natural join on the borrower and depositor relations, we would first get the Cartesian product of these two relations, and then filter the results down into only those tuples which had borrower.customer_name = depositor.customer_name.

The division operation is good for queries that have the phrase “for all.” When you divide one relation by another, the only resulting tuples are those that match all of the attributes in the divisor relation.

Finally, we have the assignment operation, which is very similar to the notion of variable assignments in programming languages. With the assignment operation, you assign a relational algebra expression to a variable, and you can then use that variable later on. This is useful for breaking down operations into smaller pieces, and to re-use operations which are recurring in your queries. The assignment operation is denoted by the symbol, where the right hand side is the relational algebra expression and the left hand side is the variable you’re assigning it to.

Extended relational algebra operations

The extended relational algebra operations expand upon the basics to allow for things such as arithmetic operations and outer joins. Following is the list of the extended operations discussed in the book:

  1. generalized projection
  2. aggregate functions
  3. outer joins

The generalized projection operation allows you to use arithmetic operations during a projection. This allows you to perform operations such as addition, subtraction, multiplication and division to compute the output of an attribute.

Aggregate functions project the output relation as a single tuple. These include sum, average, min, max, and count which perform the aggregate function you would expect. If you want to perform these aggregates on certain sets within a relation, you can partition the relation into groups using the group function. This will output a relation that has n tuples, where n is the number of groups that result from the group function.

Finally, the outer join is a different type of join where you can take unmatched tuples from either or both of the relations and include them in the result. The ⟕ symbol is used to represent the left outer join, which will take any unmatched tuples from the left relation and include them in the results. The attributes from the right relation will be padded with null values. The right outer join is the same idea but with the right side relation, and is designated with the ⟖ symbol. Finally there is a full outer join, designated by the ⟗ character, which will take the unmatched tuples from both relations and include them in the output, padding the unmatched attributes with null values.

Null values

Since null is a valid element in relational operations, and since it represents an “unknown”, we need to understand how to work with it.

• Arithmetic operations (+, -, \*, /): if an arithmetic operation has null in it, the result is always null.

• Comparison Operations (<, <=, >, >=, !=): these evaluate to a special value called unknown since there is no way to tell for sure whether the result is true or false.

• Logical (Boolean) operations: (true and unknown) = unknown; (false and unknown) = false; (unknown and unknown) = unknown; (true or unknown) = true; (false or unknown) = unknown; (unknown or unknown) = unknown; (not unknown) = unknown.

• Select operation: if the condition returns unknown or false it is not added to the result.

• Join operation: if the 2 tuples both have a null value in a common attribute they will not show up in the result set because they will evaluate to unknown.

• Projection: null is treated the same as any other value, so if 2 tuples have the same attributes and have null for an attribute, one of them will get removed from the result.

• Union, intersection, difference: works the same as with projection, so tuples are considered the same if they have null for the same attributes.

• Aggregates: nulls get removed at the beginning before performing the aggregate (note: this means that aggregates don’t return null just because there are null values present). With grouping, nulls are treated the same as any other value.

Leave a comment

Your email address will not be published. Required fields are marked *