Database System Concepts: Intro

These are my notes from the introductory chapter to Database System Concepts, which is about database systems.

Introduction

In computer science and the software development field in general, you’ll hear discussions about databases almost every single day. This is because databases are extremely important and almost every organization and enterprise uses them.

At a high level, most people know what a database is: it’s a collection of organized data. The typical person interacts with many databases every day, although it’s usually indirectly. For example, if you’re a student and you log into your account for school, your data comes from one of your school’s databases. If you’re logging into your bank account to view your balance, that information is also coming from a database. The fact of the matter is that almost all technologies that we use on a day-to-day basis have a database as part of their overall design.

If you’re a software developer who spends time writing server-side code, you’re almost certainly going to be writing code that interacts with a database. This is where it becomes important to start gaining a deeper understanding of these systems. 

Even if you won’t ever be making database design decisions, at some point you’re going to probably have to write SQL queries against your database. You might know exactly what data you’re looking for, but if you don’t understand how SQL works and what they’re doing with your data, you’ll probably struggle to extract the data that you want. Moreover, even if you do successfully get this data, there’s a high likelihood that there’s a more efficient and cost-effective way of writing your queries.

Having an understanding of databases also becomes very valuable when you start running into challenging problems with your application. For example, suppose you start to see anomolies in your data during concurrent operations, or during database failures. In situations like this, it is very valuable to understand exactly how your database deals with these types of situations. The way you design your application is dependent on the structure and properties of the database, whether you’re aware of it or not. So if you’re ignorant of what these properties are or what they mean, you are basically writing code with a blindfold over your eyes, making design decisions that have negative implications that you’re not even aware of.

These are just a couple of motivations that can be used to argue the importance of understanding database concepts. Personally, I have come across these situations in my own work and was definitely limited by my lack of understanding of databases. In these situations, I was completely reliant upon my teammates to understand the root cause of the problems we were seeing and how to resolve them. This is when I knew I needed to take some time to gain a better understanding of databases. The way I look at it is: there’s got to be at least one person who understands database concepts, why not try to become that person myself?

This particular post corresponds to Chapter 1 of Database System Concepts by Abraham Silberschatz. In this chapter he gives a high-level overview of some of the fundamental concepts of database systems which will be dealt with in more detail in subsequent chapters.

Database management systems (DBMS)

A database management system (DBMS) is the database and a set of programs that interact with the data in the database. The main point of a DBMS is to store and retrieve information conveniently and efficiently. For many enterprises, the DBMS is also responsible for things such as authorization and integrity management, transaction management and concurrency management. These types of responsibilites are defined a bit later in this post.

As mentioned in the introduction to this post, there are all kinds of users who will interact with the same database. Each of these users has a different level of understanding of databases. For example, many end-users who view their data in a browser have little to no understanding of databases at all. The data that they see comes from the database but is presented to them through an application that some software developers wrote. 

The software developers who wrote this application are interacting with the same database that the end-users are, but they need to have a greater understanding of it because they’re writing the application that interacts with it. 

Finally, we’ve got the people who actually design the database. These people have the greatest understanding of how the database works because they put it together.

The point is that there are different levels of abstraction of a database. By providing an abstract view of a database, a DBMS allows for users with varying levels of knowledge to successfully interact with it. At the lowest level, most databases use very complex data structures which require specialized knowledge to understand. This is to allow for the greatest level of efficiency of storage and performance. Application developers would have a hard time writing their application if they needed to understand these low-level constructs in order to interact with the database.

There are 3 basic levels of abstraction in a database:

  1. Physical Level: The lowest level of abstraction, which describes how the data are actually stored. These are the complex data structures.

2. Logical Level: This is the level where we describe what data are stored in the database, and what relationships exist among them.

3. View Level: The highest level of abstraction, which describes only part of the database. Many users won’t need all the information in the database, so the view level provides only the subset that they need. A single database can have multiple views which show different subsets of data.

Suppose you’re dealing with a customer data type, where the customer has a customerid, customername, customerstreet and customercity. At the physical level, a customer record would be described as a block of consecutive storage locations (i.e. bytes of data). The compiler abstracts this from the user. At the logical level, the record might be described as a type definition, i.e. a class in C# code. It might look like the following:

public class Customer
{
    public Guid Id {get; set; }
    public string Name { get; set; }
    public string Street { get; set; }
    public string City { get; set; }
}

Finally, at the view level, an application might provide customer data to the user in a browser when they’re trying to view their own profile.

Instances and schemas

Databases are always changing because data is always being added, changed or deleted. A database called Enterprise Database will have one set of data today, and a week from today it will almost certainly have a different set of data. It’s still the same database, it’s just that the data are different.

When we’re talking about a database at a point in time, we refer to it as an instance of that database. Basically, the Enterprise Database has an instance today and a different instance a week from today.

Even though Enterprise Database has different instances over the period of a week, the actual design of the database probably hasn’t changed. When talking about the overall design of the database – the layout of the data and the relationships between them – we use the word schema. In other words, our Enterprise Database has the same schema over the course of the week because nobody changed the layout of the data, but it has different instances because the data itself has changed.

Coming back to our concept of abstraction levels, the database schema is also defined at each of these levels of abstraction. A database has a physical schema which is the data structures that are used to store the data. It also has a logical schema which is the way developers choose to store the data based on the business domain that they’re working with. And then there are many views for this data.

If the physical schema of the database changes (i.e. the data structures are changed), it may or may not be required that the logical and view levels also change. Ideally your database will have physical data independence because this allows the physical schema to change without you having to change the logical way in which you store your data.

Data models

All databases have a data model. The data model of a database describes the data, data relationships, semantics and consistency constraints. There are 4 major categories of data models:

  1. Relational Data Model: a collection of tables is used to represent both the data as well as the relationships betweeen the data. Each table contains records of a specific type, and each type contains a fixed number of attributes. The columns in the table correspond to the attributes of that type.
  2. Entity-Relationship (ER) Data Model: the ER data model is based on a perception of the real world. The data is stored as a set of entities and relationships between the entities. An entity is something that has a real-world equivalent, i.e. a customer in a banking system.
  3. Object-Based Data Model: the object-based data model extends upon the ER model by including notions of encapsulation, methods/functions that act on the data, and object identity.
  4. Semistructured Data Model: in this model, individual data items of the same type may have different sets of attributes. This is different from the other models, where all records of a given type have the same attributes. A popular example of semistructured data is XML.

The most common data model is the relational model. This is the model that relational databases use, which will be discussed in the next post.

Database languages

To interact with a DBMS, we need to use languages that the system can understand. There are 2 general types of languages that we use when working with a DBMS:

  1. Data-Definition Langauge (DDL): this type of language is used to specify the database schema and any additional properties of the data. Data within a database usually includes a set of constraints. These constraints are tested every time the database is updated. For example, a banking system might have a constraint on the balances of an account, where the balance cannot fall below $100. It’s the DDL that allows you to specify this type of constraint.

    While constraints can be anything arbitrary, many arbitrary constraints can be costly to test – either performance- or compute-wise. Instead, most database systems concentrate on integrity constraints which don’t have much overhead to test. These include domain constraints (i.e. an attribute must be a member of the set of all valid members for that attribute), referential integrity (i.e. ensuring that a foreign key in one table is actually present in another table), assertions (i.e. conditions that the database must satisfy at all times), and authorization.

    The details of the schema and the consistency constraints are stored in a special table called the data dictionary. This is the so-called metadata of the database, which is the “data about the data.”
  2. Data-Manipulation Language (DML): this type of language allows for retrieval, insertion, deletion and modification of data in the database. There are 2 types of DMLs: procedural and nonprocedural. Procedural DMLs require the user to specify both what data they need and how to get it, whereas nonprocedural DMLs only require the user to specify what data they need. Clearly this means that nonprocedural DMLs are easier to learn, but it also means that the burden is on the DBMS to determine the best way of retreiving the requested data.

    While the DDL and DML each have their distinct responsibilities, in practice they are combined as part of a single language, as is the case with the SQL language.

    When we’re talking about making requests to a database to get data, we use the word query. Of course, when you query the database, your query needs to be translated into something the system can understand. This is done by the query processor, which translates your queries into sequences of actions at the physical level of the database system.

Database design

When it comes to database design, the main component is coming up with the schema. This of course depends specifically upon the business domain that the database is intended to be used for. A database for a banking application would be very different from one for a library, for example.

Proper database design is important because the design has direct impact on your application’s performance, complexity, security and failure tolerance, among other things. One of the major goals of database design is normalization, where the goal is to generate a set of relation schemas that allow you to sore information without unnecessary redundancy while still allowing you to retrieve information easily. Without proper design you run into issues such as data repetition which leads more easily to an inconsistent database, or the inability to represent certain information or relationships at all.

Data storage and querying

There are 2 major functional components of a database:

  1. The Storage Manager: the storage manager is the interface between the low-level data that’s actually stored on the physical machines and the applications that submit queries to the system. It’s responsible for translating the DML queries into commands that the machine can understand and work with. This means that the storage manager is what’s responsible for handling the storage, retrieval, and updates of data in the database.

    The storage manager is comprised of its own set of components. There is an authorization and integrity manager which are responsible for ensuring that all interactions with the database do not violate its constraints, and that the user who’s trying to perform an action is actually allowed to do so. There’s a transaction manager which is used to ensure that the database remains in a consistent state even if there are system failures or concurrent operations. A file manager manages the allocation of space on disk and the data structures that represent the stored information. Finally, there’s a buffer manager which fetches data from disk and brings it into main memory, and also decides what data should be cached in main memory. This is critical because it allows the database to handle data sizes that are larger than the size of main memory.

    The storage manager also creates indices which provide fast access to certain items. Think of this like a book’s index, which you can use to get to what page you’re looking for much more quickly than just flipping through the book. It’s the same concept.
  2. The Query Processor: the query processor determines what the most efficient way of translating a query into a set of low-level instructions is, i.e. it performs query optimization. It is also responsible for then executing these low-level instructions.

Transaction management

When you’re interacting with a database at the logical level, there is a concept of making a transaction with the database. A transaction is a collection of operations against the database that performs a single logical function. The example that Silberschatz gives is a transfer of money from one bank account to another. Suppose you are transferring $100 from your account to mine. There are two operations that need to happen here: your account needs to be credited by $100, and my account needs to be debited by $100. This is only a single transaction, but it is not complete until both of these operations are complete.

Clearly, there’d be a big problem for at least one of the parties involved if only one of these operations occurred and the other one failed. Either both operations of the transaction must complete, or neither of them should take place. This is the notion of an atomic transaction. A transaction that has the property of atomicity is one that guarantees that either all operations complete, or none of them complete (i.e. it’s an all-or-nothing guarantee).

Not only is it deseriable for transactions to be atomic, they also must maintain the consistency of the database. In the case of the money transfer example, the database would be consistent if the sum of the two accounts is the same after the completed transaction as it is before, and it would be inconsistent if this wasn’t true. This ensures that the accounts were modified by the proper amount.

A third important property of transactions is their durability which means that even if the system fails, the state of the database must be preserved. In the example given, this means that the updated state of the accounts should be the state that you see when the system is restored from a failure (assuming that the transaction completed successfully).

You will hear a common acronym being used when talking about database transaction properties: it’s called ACID. A transaction that is ACID compliant will guarantee the validity of the database even if there are errors, power failures, etc. The acronym ACID stands for atomicity, consistency, isolation and durability.

Conclusion

This post gives a high-level overview of some of the most important concepts in database systems. The book Database System Concepts will go into each of these sections in much greater detail in subsequent chapters.

One of the main takeaways of this post is to start getting familiar with the nomenclature used when talking about databases. I’ve tried to put the most important words and phrases in bold in the same way that the author does in the book.

The next post will start delving into the relational data model since this is one of the most commonly used models in enterprise databases.

Leave a comment

Your email address will not be published. Required fields are marked *