These are my notes from Chapter 2 of Database System Concepts, which is about database systems.

The first chapter of this book gave a high-level overview of database systems. The rest of the book is divided into parts, where part 1 is about relational databases. It begins by defining what the **relational model** is, and how **relational algebra** allows us to interact with a database. These topics are the subjects of this post.

The relational model is one of the most prevalent data models in use today, in large part because of its simplicity. Most relational databases use the SQL language as both their DDL (data-definition language) and DML (data-modification language). The foundation for the SQL language is relational algebra, so this is why *Database System Concepts* introduces relational algebra first. While relational algebra is noticeably less user-friendly than SQL, having a good understanding of it makes SQL much easier to learn and understand. This is the value of spending some time with relational algebra prior to learning about SQL.

### Structure of relational databases

A relational database is structured as a set of **tables**. Every table in the database has a unique name. Each row in a table represents a *relationship* among a set of values, and each column represents one of these values. Let’s use an example from the book:

The table above is called the *accounts* table. Each row in the table represents an account. Each account is represented as having a number, being part of a branch, and having a balance. So each account is comprised of **attributes** (*account_number*, *branch_name* and *balance*).

In mathematics there is a concept of a **relation** which, for our purposes, can be viewed as being identical to a table. So the *accounts* table above can also be called the *accounts* relation. Additionally, each row in the table is called a **tuple** in the relation:

- table = relation
- tuple = row

The order of tuples within a relation does not matter, meaning that you can reorder the *accounts* tuples in whatever way you want and they’ll all be equivalent to each other.

*Note: The book uses the relation/tuple terminology in this chapter and in subsequent ones, which is why I am introducing it here. I will also try to use it when the book does for the sake of consistency.*

Each attribute in a relation has a certain **domain**, which is the set of all permissible values that it can have. For example, the *balance* attribute in the *accounts* relation would probably be restricted to numeric values only, so the domain for *balance* would be all numeric values. This would mean that trying to store a non-numeric value such as “abcd” to a *balance* would not work, because it is not in the domain of that attribute.

There is one value that is included in *all* domains, and that is the **null** value. This value indicates that the attribute to which null is assigned is either unknown or does not exist. For example, if the balance for the account_number[‘A-101’] is unknown, it would be set to null. This value can be tricky to work with, so it’s desirable to avoid having to deal with it when possible.

### Relational database schema

As mentioned in the first post in this series, there is a difference between a **database schema** and a **database instance**. These concepts have their parallels in most programming languages. Suppose in an object-oriented programming language you have a *Book* class, which has properties *Title*, *Author*, and *NumberOfPages*:

```
public class Book
{
public string Title { get; set; }
public string Author { get; set; }
public int NumberOfPages { get; set; }
}
```

This set of properties would represent the “schema”, or the structure, of the class. If you created a variable for a book, then re-assigned that variable to another book, then you would have changed the variable from one “instance” to another. The same principles apply with databases, where the schema represents the structure of the data and the instance refers to *what* the data actually is at a given point in time.

The *accounts* schema above is represented with the following syntax:

*Account_schema = (account_number, branch_name, balance)*

Designing your relational database schema is very important because it has a strong influence on how easy it is for applications to interact with the data. For example, a poorly chosen schema would lead to many *null* values being saved across tuples in the database, or duplicate entries would be necessary to represent certain data. Finally, certain database schemas can make it impossible to represent certain types of data at all.

For the reasons given above, as well as many others, you need to give a lot of thought and care into defining your database schema. This topic is actually covered in detail in part 2 of *Database System Concepts*, so I will not cover it here (not to mention that I *can’t* quite cover it anyway since I haven’t read part 2 yet ;)).

### Keys

There needs to be a way to distinguish between different tuples in a given relation. This is done using keys. A **superkey** is one or more attributes that together will uniquely identify a tuple. Let’s look at the *accounts* relation once more:

It is quite clear just by looking at this table that the *account_number* attribute is our superkey for this relation. This is because each account has a unique number. Additionally, notice that *branch_name* and *balance* couldn’t be our superkey because it’s possible that multiple accounts would be part of the same branch or would have the same value.

You could actually argue that the combination of *account_number* and *branch_name* could also be a superkey for this relation, and that is true. However, since *account_number* alone is sufficient to uniquely identify a tuple, it wouldn’t make sense to select this combination as our superkey. This is the notion of **candidate keys**, which are potential superkeys that don’t have any unnecessary attributes. In this case, the *account_number* is a candidate key, but the combination of *account_number* and *branch_name* is not a candidate key because *branch_name* is unnecessary.

The term **primary key** is used to denote the key that is chose to uniquely identify each tuple in a relation. Our primary key of choice for the *accounts* relation is the *account_number*. It is convention when you’re defining relations to put the primary key attribute(s) first. If you look at the *accounts* relation, you’ll see that *account_number* is shown first.

The primary key for each relation needs to be chosen carefully. First of all, you want the primary key to be something that will never be the same between multiple tuples. That’s why you wouldn’t want to choose something like *branch_name* for the *accounts* relation; multiple accounts will exist in the same branch. Second of all, you want to primary key to be something that will never change. Having to change a primary key is very difficult and can cause anomalies in your data or queries, so it is risky to have to change it. A common choice for a primary key is the Guid – which is a globally unique identifier.

### The relational algebra query language

Relational algebra is one of the foundations of the SQL language. The authors of *Database System Concepts* divide the relational algebra operations into three categories:

- basic/fundamental operations
- additional operations that can be expressed in terms of the basic operations
- extended operations

Operations from all three of these categories end up being valuable when working with SQL, so it is worth understanding each of them. I’m just listing what they are below for reference, there are many examples online of how they’re used.

### Basic relational algebra operations

There are six basic relational algebra operations:

- select
- project
- union
- set-difference
- cartesian-product
- rename

The `select `

operation will select tuples within a relation that satisfy a specific condition or set of conditions. Typically these conditions are either comparisons (i.e. greater than or equal to) or booleans. Following is the set of comparators available in the `select`

clause:

= | equal to |

≠ | not equal to |

< | less than |

> | greater than |

≤ | less than or equal to |

≥ | greater than or equal to |

¬ | logical NOT |

∧ | logical AND |

∨ | logical OR |

The select operation is denoted by the Greek letter sigma ( σ ).

The project operation is used to specify exactly which attributes we want to show up in the output of a relational algebra operation. For example, if we queried the *accounts* relation but only cared about the *branch_name* attribute in the output, we would use the project operation (denoted by Π) to specify this.

It is important to recognize that the output of any relational operation is also a relation. This allows you to build relational algebra expressions by putting together multiple operations. For example, we can combine select and project operations to first select certain tuples from a relation that meet a certain condition, and then we can use the project operation to remove all attributes that we don’t care about. Most of the time to get what we want, we will need to perform this type of composition of operations.

The `union`

operation (designated by ∪) is an “either/or” binary type of operation, which looks at two relations and selects tuples that are in *either* of the two relations, *or* that are in *both* of the two relations. This operation requires that the two relations being compared are *compatible*. To be compatible, the two relations must (1) have the same number of attributes and (2) the domains of the *i*th attribute of the first relation and the *i*th attribute of the second relation must be the same for all *i*.

The `set-difference`

operation will find tuples that are in one relation but not in another. For two relations *r* and *s*, the operation `r-s`

will find all the tuples that are in *r* but not in *s*. As with `union`

, the two relations in a `set-difference`

operation must be compatible.

The `Cartesian-product`

operation will multiply two relations together. Each row in the first relation gets multiplied by each row in the second relation, and the output relation appends all the attributes of the two relations together. For two relations *r* and *s*, the Cartesian-product looks like `r X s`

. This operation ends up being one of the foundations for the `join`

operation which we will end up seeing often in SQL.

Finally, we have the `rename`

operation. This is straightforward, it simply takes a relation and either changes its name or creates it if it doesn’t exist. It looks like *rx*(E) where *r* is the relation, *x* is the new name and E is a relational algebra expression. The reason this operation is useful is because there are situations in which an output might not have a name. Additionally, when you perform the `Cartesian-product`

operation on a relation with *itself*, you need to rename one of these relations because otherwise each attribute name would be duplicated, which cannot happen and would be an error.

### Additional relational algebra operations

In addition to the basic relational algebra operations, there are a set of operations which are used frequently enough to justify having their own names. The only reason that these ones aren’t considered basic is simply because they can be expressed in terms of the basic operations.

Following are the additional relational algebra operations which expand upon the basic ones:

- set intersection
- natural join
- division
- assignment

The `set-intersection`

operation is a binary operation that is used to find tuples that are part of *both* relations being compared. It is expressed using the `∩`

symbol. For example, in the banking application, there is a relation for people who borrow from the bank called *borrower* and a relation for people who deposit called *depositor*. If we wanted to find all customers who were *both* borrowers and depositors we would use the `set-intersection`

operation. As with unions, this operation needs to be performed on compatible types.

The `natural join`

operation is one of the most important of all the operations to understand because it’s used very frequently. Basically, the natural join will take two relations and perform the Cartesian product on them. This will give us a relation which has all the attributes from both relations. Then, this result is filtered down into only the tuples which have equal values for the same attributes. For example, if we did a natural join on the *borrower* and *depositor* relations, we would first get the Cartesian product of these two relations, and then filter the results down into only those tuples which had *borrower.customer_name* = *depositor.customer_name*.

The `division`

operation is good for queries that have the phrase “for all.” When you divide one relation by another, the only resulting tuples are those that match *all* of the attributes in the divisor relation.

Finally, we have the `assignment`

operation, which is very similar to the notion of variable assignments in programming languages. With the `assignment`

operation, you assign a relational algebra expression to a variable, and you can then use that variable later on. This is useful for breaking down operations into smaller pieces, and to re-use operations which are recurring in your queries. The `assignment`

operation is denoted by the `←`

symbol, where the right hand side is the relational algebra expression and the left hand side is the variable you’re assigning it to.

### Extended relational algebra operations

The extended relational algebra operations expand upon the basics to allow for things such as arithmetic operations and outer joins. Following is the list of the extended operations discussed in the book:

- generalized projection
- aggregate functions
- outer joins

The `generalized projection`

operation allows you to use arithmetic operations during a projection. This allows you to perform operations such as addition, subtraction, multiplication and division to compute the output of an attribute.

`Aggregate functions`

project the output relation as a single tuple. These include *sum*, *average*, *min*, *max*, and *count* which perform the aggregate function you would expect. If you want to perform these aggregates on certain *sets* within a relation, you can partition the relation into groups using the `group`

function. This will output a relation that has *n* tuples, where *n* is the number of groups that result from the `group`

function.

Finally, the outer join is a different type of join where you can take unmatched tuples from either or both of the relations and include them in the result. The ⟕ symbol is used to represent the left outer join, which will take any unmatched tuples from the left relation and include them in the results. The attributes from the right relation will be padded with null values. The right outer join is the same idea but with the right side relation, and is designated with the ⟖ symbol. Finally there is a full outer join, designated by the ⟗ character, which will take the unmatched tuples from both relations and include them in the output, padding the unmatched attributes with null values.

### Null values

Since null is a valid element in relational operations, and since it represents an “unknown”, we need to understand how to work with it.

• Arithmetic operations (+, -, \*, /): if an arithmetic operation has null in it, the result is always null.

• Comparison Operations (<, <=, >, >=, !=): these evaluate to a special value called *unknown* since there is no way to tell for sure whether the result is true or false.

• Logical (Boolean) operations: (true and unknown) = unknown; (false and unknown) = false; (unknown and unknown) = unknown; (true or unknown) = true; (false or unknown) = unknown; (unknown or unknown) = unknown; (not unknown) = unknown.

• Select operation: if the condition returns unknown or false it is not added to the result.

• Join operation: if the 2 tuples both have a null value in a common attribute they will not show up in the result set because they will evaluate to unknown.

• Projection: null is treated the same as any other value, so if 2 tuples have the same attributes and have null for an attribute, one of them will get removed from the result.

• Union, intersection, difference: works the same as with projection, so tuples are considered the same if they have null for the same attributes.

• Aggregates: nulls get removed at the beginning before performing the aggregate (note: this means that aggregates don’t return null just because there are null values present). With grouping, nulls are treated the same as any other value.