CS 348: Introduction to Database Management

David Toman

Estimated study time: 1 hr 17 min

Overview of Data Management

What Is Data?

The ANSI definition of data is broad: a representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or automatic means. More pragmatically, data refers to any representation — characters, numeric quantities, binary strings — to which meaning can be assigned, and on which we perform operations to extract information about entities in the world.

The central concern of database management is persistent data: information that outlives the programs that create and consume it. Volatile data (values held in registers, RAM, or local variables during a computation) is of no interest to us here.

A Brief History of Data Management

Understanding how data management evolved motivates why relational databases were designed the way they are.

In the ancient history of computing, data was stored on magnetic tape. Each program owned its own private dataset. If two programs needed to share information, the data had to be duplicated — stored in both datasets separately. This meant rampant redundancy and no coordination between programs that happened to use the same real-world facts.

File processing systems improved things somewhat. Data moved from tapes to disk files, and a file system interface mediated access. Multiple programs could share a single file, and various access methods (sequential, indexed, random) became available. Still, data management logic — knowing how to find a record, how to update consistently, how to handle concurrent access — had to be reimplemented in every application. Programs remained tightly coupled to the physical layout of data.

The database approach emerged as the solution: separate the data management responsibilities from application logic and hand them to a single centralized component, the Database Management System (DBMS). Programs interact with the DBMS through a well-defined interface; the DBMS handles the rest.

Historical Milestones

  • 1970: Edgar Codd proposes the relational data model, providing a firm mathematical foundation and enabling declarative queries.
  • 1973: Charles Bachman wins the ACM Turing Award for his work on navigational databases (“The Programmer as Navigator”).
  • 1976: Peter Chen proposes the Entity-Relationship (E-R) model for conceptual database design.
  • Late 1970s: IBM’s System R and UC Berkeley’s Ingres demonstrate that relational DBMSs are commercially feasible.
  • 1981: Edgar Codd wins the ACM Turing Award.
  • 1980s: Commercial relational technology flourishes (IBM DB2, Oracle, Informix, Sybase). SQL standardization begins through ANSI and ISO. Object-oriented DBMSs emerge and then largely fade.
  • 1990s–present: SQL expands continuously. New application domains appear: OLAP, data warehousing, XML, embedded systems, multimedia, the Internet, data streams. In 1998, Jim Gray wins the Turing Award for his work on transaction concepts. In 2014, Michael Stonebraker receives the same honor.

What Is a Database?

A database is a large and persistent collection of factual data and metadata organized to facilitate efficient retrieval and revision. Factual data is the information itself (John’s age is 42). Metadata describes the structure — what kinds of entities exist, what properties they have, what their types are (there is a concept of “employee” with attributes “name” and “age”).

A data model determines the nature of that metadata and how retrieval and revision are expressed. Examples of databases in the everyday sense include a filing cabinet, a library catalog, or an inventory control system. A DBMS is a program (or suite of programs) that implements a data model and manages the database on behalf of applications.

What a DBMS Provides

A DBMS abstracts away common data-management tasks and provides a uniform, well-defined interface. Its core responsibilities are:

  1. Support for an underlying data model: all data is stored and manipulated in a well-defined, consistent way.
  2. Access control: different pieces of data can be accessed or modified only by authorized users or programs.
  3. Concurrency control: multiple applications can access and modify data simultaneously without corrupting it.
  4. Database recovery: if a system crashes, data that was successfully committed is not lost.
  5. Database maintenance: the metadata itself (the schema) can be revised as requirements evolve.

Schema and Instance

Two fundamental concepts underpin the relational model (and all data models):

A database schema is a collection of metadata conforming to an underlying data model. It defines the structure of the data: what relations exist, what attributes each has, what types those attributes take, and what integrity constraints must hold.

A database instance is a collection of factual data as defined by a given schema. Crucially, one schema can have many possible instances — it is the schema that is fixed, while the data within it changes over time as the real world changes.

Example — A simple relational schema for a company:

EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
WORKS(ENO, PNO, RESP, DUR)

An instance of this schema would be a set of rows in each of these three tables.
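The schema/instance distinction can be made concrete; a minimal sketch in SQLite (column types and the sample rows are invented for illustration):

```python
import sqlite3

# The EMP/PROJ/WORKS schema: metadata, fixed once.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp   (eno INTEGER PRIMARY KEY, ename TEXT, title TEXT);
    CREATE TABLE proj  (pno INTEGER PRIMARY KEY, pname TEXT, budget INTEGER);
    CREATE TABLE works (eno INTEGER, pno INTEGER, resp TEXT, dur INTEGER);
""")

# One possible instance of that schema; many other instances conform to it.
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [(1, "Smith", "Engineer"), (2, "Jones", "Analyst")])
con.executemany("INSERT INTO works VALUES (?, ?, ?, ?)",
                [(1, 101, "Design", 12)])

# The answer depends on the instance, not the schema.
rows = con.execute("SELECT ename, title FROM emp ORDER BY eno").fetchall()
print(rows)
```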

Three-Level Schema Architecture

A DBMS typically organizes its metadata into three levels:

  1. External schema (view layer): what each application or user sees. Different users may have different external views of the same database — a payroll application sees only salary data, while an inventory application sees only warehouse data.
  2. Conceptual schema: the logical structure of all data in the database, shared across all applications.
  3. Physical schema (internal schema): how data is actually stored on disk — which files, which devices, which storage algorithms.
[Figure: the three-level schema architecture — application views in the external layer; the shared logical model (tables, columns, types, integrity constraints) in the conceptual layer; files, indexes, and disk layout in the physical layer. Logical independence separates the top two levels; physical independence separates the bottom two.]

This separation is the basis for data independence.

Data Independence

Data independence is perhaps the most important reason to use a DBMS. The idea is that applications access data through an abstract data model, not directly through physical storage structures.

Physical data independence means that applications are immune to changes in how data is physically stored. If an index is added or a file is reorganized, no application code needs to change.

Logical data independence (modularity) means that changes to the conceptual schema — adding a new table, adding a new column — do not require changes to applications that don’t use that table or column.

Transactions

When multiple applications access the same data concurrently, problems arise. The classic example is two ATM withdrawals from the same account happening simultaneously: both read the balance, both check that it is sufficient, both deduct funds, both write back — and one deduction may silently disappear.

The solution is the transaction: an application-specified, atomic, and durable unit of work. Transactions in a DBMS are guaranteed to have the ACID properties:

  • Atomic: either the entire transaction occurs, or none of it does.
  • Consistent: each transaction, when it commits, leaves the database in a consistent state that satisfies all integrity constraints.
  • Isolated: concurrent transactions do not interfere with each other; each sees a view of the database as if it were running alone.
  • Durable: once a transaction commits, its changes are permanent, surviving even system failures.
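The lost-update scenario above can be replayed in a few lines of plain Python, with the interleaving written out by hand (no DBMS involved; the amounts are invented):

```python
# Two "ATM withdrawals" interleaved without any concurrency control.
balance = 100

# Both transactions read the balance before either writes.
t1_read = balance
t2_read = balance

# Each checks funds and deducts from its own stale copy.
t1_new = t1_read - 30
t2_new = t2_read - 50

# Both write back; the second write silently overwrites the first.
balance = t1_new
balance = t2_new

print(balance)  # 50, not the correct 20: one withdrawal disappeared
```

Isolation (the I in ACID) is precisely the guarantee that rules this interleaving out.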

Interfacing with the DBMS

Two kinds of languages form the programmer’s interface to a DBMS:

  • Data Definition Language (DDL): for specifying schemas. There may be different DDLs for the external, conceptual, and physical schemas.
  • Data Manipulation Language (DML): for expressing retrieval and revision requests. DMLs can be navigational (procedural — the programmer specifies how to navigate to the data) or non-navigational (declarative — the programmer specifies what data is wanted, and the DBMS determines how to find it).

The relational model uses a declarative DML, which is one of its great advantages.

Types of Database Users

  • End users: access the database indirectly through forms, web applications, or query-generating tools; may also write ad-hoc queries.
  • Application developers: design and implement programs that use the database.
  • Database administrators (DBAs): manage the conceptual schema, assist with view integration across applications, monitor and tune performance, define physical storage structures, load and reformat data, enforce security and reliability policies.

The Relational Model

Querying as Set Comprehension

Before introducing the relational model formally, it helps to understand the mathematical framework underlying it. Queries in the relational model are ultimately set comprehensions — expressions of the form

\[\{(x_1, \ldots, x_k) \mid \langle\text{condition}\rangle\}\]

which denote the set of all \(k\)-tuples satisfying the condition. This is not just a metaphor; the formal query language of the relational model, the relational calculus, is precisely this set-comprehension notation over first-order logic.

Example: Find all pairs of natural numbers that add to 5. If PLUS is a three-column relation containing triples \((x, y, z)\) with \(x + y = z\), then the query is:

\[\{(x, y) \mid \text{PLUS}(x, y, 5)\}\]

Answer: \(\{(0,5),(1,4),(2,3),(3,2),(4,1),(5,0)\}\).
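Seen as a set comprehension, the query is a single line of Python; here PLUS is materialized over a small finite fragment (an assumption needed to keep the relation finite):

```python
# PLUS as a finite ternary relation: triples (x, y, z) with x + y = z.
PLUS = {(x, y, x + y) for x in range(10) for y in range(10)}

# The query {(x, y) | PLUS(x, y, 5)} as a literal set comprehension.
answer = {(x, y) for (x, y, z) in PLUS if z == 5}
print(sorted(answer))  # [(0, 5), (1, 4), (2, 3), (3, 2), (4, 1), (5, 0)]
```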

The answers to queries depend on the current instance of the relation, not just its structure. This is what makes databases dynamic.

Formal Definition of the Relational Model

The components of a relational database are:

  • Universe: a set of values \(D\) with equality (\(=\)).
  • Relation: a predicate name \(R\) with an arity \(k\) (the number of columns); an instance of \(R\) is a subset \(R \subseteq D^k\).
  • Database: a signature \(\rho = (R_1, \ldots, R_n)\) (a finite set of predicates) and an instance \(\text{DB} = (D, {=}, R_1, \ldots, R_n)\) (a relation for each predicate).

The running example throughout this course is a bibliography database:

AUTHOR(aid, name)
WROTE(author, publication)
PUBLICATION(pubid, title)
BOOK(pubid, publisher, year)
JOURNAL(pubid, volume, no, year)
PROCEEDINGS(pubid, year)
ARTICLE(pubid, crossref, startpage, endpage)

A sample instance might contain:

AUTHOR = { (1, John), (2, Sue) }
WROTE  = { (1, 1), (1, 4), (2, 3) }
PUBLICATION = { (1, Mathematical Logic), (2, Principles of DB Syst.),
                (3, Trans. Databases),   (4, Query Languages) }
BOOK = { (1, AMS, 1990) }
JOURNAL = { (3, 35, 1, 1990) }

The Closed-World Assumption

In a relational database, a tuple \((a_1, \ldots, a_k) \in R\) means the corresponding fact is true. Tuples not present in a relation are considered false. This is the closed-world assumption: if something is not recorded, it didn’t happen.

So from the sample instance: “John is an author with id 1” is true (\((1, \text{John}) \in \text{AUTHOR}\)). “John wrote Trans. Databases” is false (\((1, 3) \notin \text{WROTE}\)).
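Under the closed-world assumption, truth is exactly set membership; a sketch over the sample instance:

```python
# The sample instance as Python sets of tuples.
AUTHOR = {(1, "John"), (2, "Sue")}
WROTE  = {(1, 1), (1, 4), (2, 3)}

john_is_author = (1, "John") in AUTHOR   # recorded, hence true
john_wrote_3   = (1, 3) in WROTE         # not recorded, hence false
print(john_is_author, john_wrote_3)
```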

Relational Calculus

The query language of the relational model is based on first-order logic. Conditions (formulas) over a schema \(\rho = (R_1, \ldots, R_n)\) are defined inductively:

\[\phi ::= R_i(x_{i_1}, \ldots, x_{i_k}) \mid x_i = x_j \mid \phi \land \phi \mid \exists x_i.\phi \mid \phi \lor \phi \mid \neg\phi\]

A valuation \(\theta\) maps variable names to values in the universe. The semantics of formulas is defined with respect to a database instance DB and a valuation \(\theta\):

  • \(\text{DB}, \theta \models R(x_{i_1}, \ldots, x_{i_k})\) if \((\theta(x_{i_1}), \ldots, \theta(x_{i_k})) \in R\).
  • \(\text{DB}, \theta \models x_i = x_j\) if \(\theta(x_i) = \theta(x_j)\).
  • \(\text{DB}, \theta \models \phi \land \psi\) if both hold.
  • \(\text{DB}, \theta \models \phi \lor \psi\) if at least one holds.
  • \(\text{DB}, \theta \models \neg\phi\) if \(\phi\) does not hold.
  • \(\text{DB}, \theta \models \exists x_i.\phi\) if there exists some \(v \in D\) with \(\text{DB}, \theta[x_i \mapsto v] \models \phi\).

Logical equivalences carry over from propositional and first-order logic:

\[\forall x.\phi \equiv \neg\exists x.\neg\phi\]

A query in the relational calculus is a set comprehension:

\[\{(x_1, \ldots, x_k) \mid \phi\}\]

Its answer over DB is the relation \(\{(\theta(x_1), \ldots, \theta(x_k)) \mid \text{DB}, \theta \models \phi\}\) where \(\{x_1, \ldots, x_k\} = \text{FV}(\phi)\) (the free variables of \(\phi\)).
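Such a query can be evaluated directly by comprehension over the sample instance; the particular query below (names of authors who wrote at least one publication) is chosen for illustration:

```python
# The sample instance.
AUTHOR = {(1, "John"), (2, "Sue")}
WROTE  = {(1, 1), (1, 4), (2, 3)}

# The answer to {(n) | ∃a. AUTHOR(a, n) ∧ ∃p. WROTE(a, p)}:
# each answer tuple is a valuation of the free variable n that satisfies
# the condition for some witness a (and, inside, some witness p).
answer = {(n,) for (a, n) in AUTHOR
               if any(w[0] == a for w in WROTE)}
print(sorted(answer))
```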

Integrity Constraints

Real-world data is not arbitrary. Integrity constraints are closed formulas in first-order logic that every valid database instance must satisfy.

Types of integrity constraints in the bibliography database:

  • Domain/typing constraints: author IDs are integers, author names are strings.
  • Key constraints (uniqueness): an author ID uniquely determines the author's name: \(\forall a, n_1, n_2.\ \text{AUTHOR}(a, n_1) \land \text{AUTHOR}(a, n_2) \to n_1 = n_2\). (Note the direction: the ID determines the name, not the other way around.)
  • Referential integrity (foreign keys): books, journals, proceedings, and articles are publications (every BOOK.pubid must appear in PUBLICATION).
  • Disjointness: books are different from journals; books are different from proceedings.
  • Coverage: every publication is a book, a journal, proceedings, or an article.

Views are a special form of integrity constraint that defines a derived relation:

\[\forall x_1, \ldots, x_k. R(x_1, \ldots, x_k) \leftrightarrow \phi\]

where \(R\) is a new relation name not in the base schema. Views provide data abstraction and logical independence.
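A view can be tried concretely; a sketch in SQLite, where BOOKTITLE is a hypothetical derived relation defined from PUBLICATION and BOOK:

```python
import sqlite3

# BOOKTITLE(pubid, title) ≡ ∃pub, yr. PUBLICATION(pubid, title) ∧
#                                     BOOK(pubid, pub, yr)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE publication (pubid INTEGER, title TEXT);
    CREATE TABLE book (pubid INTEGER, publisher TEXT, year INTEGER);
    INSERT INTO publication VALUES (1, 'Mathematical Logic'),
                                   (4, 'Query Languages');
    INSERT INTO book VALUES (1, 'AMS', 1990);

    CREATE VIEW booktitle AS
        SELECT p.pubid, p.title
        FROM publication p, book b
        WHERE p.pubid = b.pubid;
""")
rows = con.execute("SELECT * FROM booktitle").fetchall()
print(rows)  # only publications that are books
```

Applications can query BOOKTITLE as if it were stored, which is exactly the logical independence the view provides.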

Safety: Domain Independence and Range Restriction

Not all first-order queries are safe to evaluate over a finite database. The query \(\{(y) \mid \neg\exists x.\text{AUTHOR}(x, y)\}\) asks for “everything that is not an author name,” which is an infinite set over an infinite universe.

A query is domain-independent if its answer does not change when the universe is changed (while leaving the actual relations unchanged). Equivalently, all answers must come from values actually appearing in the relations — the active domain.

Deciding whether a given query is domain-independent is undecidable in general. The practical solution is range-restricted queries, a syntactic fragment of the relational calculus that is guaranteed to be safe and loses no expressive power: every domain-independent query can be written equivalently as a range-restricted one.

Computational Properties of the Relational Calculus

  • Evaluation always terminates: the relational calculus is not Turing-complete.
  • Data complexity (fixed query, varying database): in PTIME, in LOGSPACE, even in AC⁰ (constant parallel time with polynomially many processors).
  • Combined complexity (query and database both varying): PSPACE-complete. This includes the ability to encode NP-hard problems (e.g., SAT) as queries.
  • Query satisfiability (is there any database for which the answer is non-empty?) is undecidable in general, though decidable for important fragments.

Limits of Relational Calculus

Some things cannot be expressed in first-order logic / relational calculus:

  • Arithmetic and ordering: comparing numerical values beyond equality, doing arithmetic.
  • Counting and aggregation: counting how many tuples satisfy a condition, summing values.
  • Graph reachability/connectivity: asking whether there is a path in a graph (a binary relation). This is a transitive closure query, which requires recursion beyond first-order logic.

SQL addresses the first two with built-in operators and aggregation. Recursion is addressed in SQL:1999 and later with common table expressions.
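A reachability query of exactly this kind can be written with a recursive common table expression; a sketch in SQLite over an invented edge relation:

```python
import sqlite3

# Transitive closure via WITH RECURSIVE (SQL:1999-style), which plain
# relational calculus cannot express.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    INSERT INTO edge VALUES (1, 2), (2, 3), (3, 4);
""")
rows = con.execute("""
    WITH RECURSIVE reach(n) AS (
        VALUES (1)                                   -- start at node 1
        UNION
        SELECT e.dst FROM edge e, reach r WHERE e.src = r.n
    )
    SELECT n FROM reach ORDER BY n
""").fetchall()
print(rows)  # every node reachable from node 1
```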


SQL — The Structured Query Language

Overview

SQL (Structured Query Language) is the practical realization of the relational calculus. It is based on the relational calculus but incorporates several pragmatic extensions: conjunctive queries (SELECT blocks), set operations, an update language, and non-first-order features such as bag (multiset) semantics, NULL values, and aggregation.

SQL is a committee design, which means it is sometimes more pragmatic than logically elegant. The standard has evolved continuously: SQL-89, SQL-92, SQL:1999, SQL:2003, SQL:2008, SQL:2011, SQL:2016.

SQL has three major sublanguages:

  1. DML (Data Manipulation Language): the query and update language, including embedded SQL and ODBC for application development.
  2. DDL (Data Definition Language): defines the schema — creates, modifies, and destroys database objects.
  3. DCL (Data Control Language): access control — who can read or write what.

SQL Data Types

SQL attributes take values from standard types:

Type           Description
------------   ------------------------------------------------------------
INTEGER        32-bit signed integer
SMALLINT       16-bit signed integer
DECIMAL(m,n)   Fixed decimal with m total digits, n after the decimal point
FLOAT          32-bit IEEE floating point
CHAR(n)        Fixed-length character string of length n
VARCHAR(n)     Variable-length string of at most n characters
DATE           Year/month/day
TIME           Hours:minutes:seconds

SQL keywords and identifiers are not case-sensitive (string values, of course, are).

The Basic SELECT Block

The fundamental building block of SQL queries is the SELECT block:

SELECT DISTINCT <results>
FROM   <tables>
WHERE  <condition>

This directly corresponds to a conjunctive (existential) query in the relational calculus:

\[\{\langle\text{results}\rangle \mid \exists\langle\text{unused}\rangle. \langle\text{tables}\rangle \land \langle\text{condition}\rangle\}\]

The FROM clause lists the relations involved, creating a conjunction of those relations. You may assign alias names to relations with AS. The WHERE clause adds additional conditions. The SELECT clause specifies which attributes (or expressions) appear in the output.

Example — List all authors:

SELECT DISTINCT *
FROM author;

Example — List all publications with at least two authors (requires two copies of WROTE with different authors):

SELECT DISTINCT r1.publication
FROM wrote r1, wrote r2
WHERE r1.publication = r2.publication
  AND r1.author != r2.author;

Since r1 and r2 are two independent correlation names ranging over wrote, nothing relates their rows automatically; both the equality on publication and the inequality on author must be stated explicitly.

Example — List titles of all books:

SELECT DISTINCT title
FROM publication, book
WHERE publication.pubid = book.pubid;

The SELECT Clause in Detail

The SELECT clause may:

  1. Eliminate unwanted attributes (projection via existential quantification).
  2. Compute expressions using built-in functions: arithmetic (+, -, *, /), string concatenation (||), substr, and constants.
  3. Name resulting attributes using AS.

SELECT DISTINCT pubid AS id,
                endpage - startpage + 1 AS numberofpages
FROM article;

Complex Queries: Set Operations and Nesting

The basic SELECT block handles only \(\exists\) and \(\land\) queries. For full relational calculus coverage, SQL provides set operations:

  • UNION: the set of tuples in Q1 or Q2 (eliminates duplicates, unlike UNION ALL). Expresses disjunction.
  • EXCEPT: the set of tuples in Q1 but not Q2. Expresses negation/set difference.
  • INTERSECT: the set of tuples in both Q1 and Q2 (rarely needed; expressible with joins).

For union compatibility, Q1 and Q2 must have the same number and types of attributes.

Example — List all publication IDs for books or journals:

(SELECT DISTINCT pubid FROM book)
UNION
(SELECT DISTINCT pubid FROM journal);

Example — List all publication IDs except those for articles:

(SELECT DISTINCT pubid FROM publication)
EXCEPT
(SELECT DISTINCT pubid FROM article);

Common Pitfall: OR vs. UNION

A common mistake is using OR in the WHERE clause instead of UNION. Consider:

-- INCORRECT
SELECT DISTINCT title
FROM publication, book, journal
WHERE publication.pubid = book.pubid
   OR publication.pubid = journal.pubid;

This fails whenever book (or journal) is empty: the three-way cross product is then empty, so the OR has no rows left to test. UNION is the correct approach.
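The pitfall is easy to reproduce; a sketch in SQLite with an empty book table (toy data invented for illustration):

```python
import sqlite3

# One journal publication exists, but there are no books at all.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE publication (pubid INTEGER, title TEXT);
    CREATE TABLE book (pubid INTEGER);
    CREATE TABLE journal (pubid INTEGER);
    INSERT INTO publication VALUES (3, 'Trans. Databases');
    INSERT INTO journal VALUES (3);
""")

# Incorrect: the product with the empty book table has no rows.
wrong = con.execute("""
    SELECT DISTINCT title
    FROM publication, book, journal
    WHERE publication.pubid = book.pubid
       OR publication.pubid = journal.pubid
""").fetchall()

# Correct: UNION of two independent joins.
right = con.execute("""
    SELECT DISTINCT title FROM publication, journal
    WHERE publication.pubid = journal.pubid
    UNION
    SELECT DISTINCT title FROM publication, book
    WHERE publication.pubid = book.pubid
""").fetchall()

print(wrong, right)  # the OR version misses the journal; the UNION finds it
```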

Naming Subqueries

The WITH clause (common table expressions) allows naming subquery results for later use:

WITH bookorjournal(pubid) AS (
    (SELECT DISTINCT pubid FROM book)
    UNION
    (SELECT DISTINCT pubid FROM journal)
)
SELECT DISTINCT title
FROM publication, bookorjournal
WHERE publication.pubid = bookorjournal.pubid;

Alternatively, subqueries may be inlined in the FROM clause (SQL-92 and later):

SELECT DISTINCT title
FROM publication,
     ( (SELECT DISTINCT pubid FROM journal)
       UNION
       (SELECT DISTINCT pubid FROM book) ) jb
WHERE publication.pubid = jb.pubid;

WHERE Subqueries

SQL allows complex search conditions in the WHERE clause using subqueries:

  • <attr> IN (<query>) / <attr> NOT IN (<query>): tests membership.
  • <attr> op SOME (<query>) / <attr> op ALL (<query>): compares to some or all values.
  • EXISTS (<query>) / NOT EXISTS (<query>): tests for non-emptiness.

All subquery forms are syntactic sugar reducible to joins and set operations, but they are especially useful for expressing negation clearly.

Example — List publication titles for articles:

SELECT DISTINCT title
FROM publication
WHERE pubid IN (SELECT pubid FROM article);

Example — Find the longest articles (expressing maximum without aggregation):

SELECT DISTINCT pubid
FROM article
WHERE endpage - startpage >= ALL (
    SELECT endpage - startpage
    FROM article
);

<attr> = SOME (<query>) is equivalent to <attr> IN (<query>). <attr> <> ALL (<query>) is equivalent to <attr> NOT IN (<query>).

Correlated (Parametric) Subqueries

Subqueries can reference attributes from the enclosing query, making them correlated. For each row in the outer query, the subquery is re-evaluated:

Example — Find all author-publication pairs for publications with multiple authors:

SELECT *
FROM wrote r
WHERE EXISTS (
    SELECT *
    FROM wrote s
    WHERE r.publication = s.publication
      AND r.author <> s.author
);

Example — Find all publications with exactly one author (using NOT EXISTS):

SELECT *
FROM wrote r
WHERE NOT EXISTS (
    SELECT *
    FROM wrote s
    WHERE r.publication = s.publication
      AND r.author <> s.author
);

Updating Data

SQL supports three forms of data modification:

INSERT

Insert a constant tuple:

INSERT INTO author
VALUES (4, 'Niwinski, Damian', 'zls.mimuw.edu.pl/~niwinski');

Insert tuples from a query (here, auto-incrementing the author ID):

INSERT INTO author (
    SELECT max(aid)+1, 'Snodgrass, Richard T.', 'www.cs.arizona.edu/people/rts'
    FROM author
);

DELETE

Delete all tuples matching a condition:

DELETE FROM publication
WHERE pubid NOT IN (SELECT pubid FROM article)
  AND pubid NOT IN (SELECT crossref FROM article);

UPDATE

Modify tuples matching a condition:

UPDATE author
SET url = 'brics.dk/~david'
WHERE aid IN (
    SELECT aid FROM author WHERE name LIKE 'Toman%'
);

Transaction Support

SQL includes COMMIT and ROLLBACK for transaction control:

COMMIT;    -- make all changes since the last commit permanent
ROLLBACK;  -- discard all changes since the last commit

A transaction begins implicitly with the first database access.
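The effect of COMMIT and ROLLBACK can be sketched in SQLite (using its manual-transaction mode; the table and rows are invented):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so transactions
# are controlled explicitly with BEGIN / COMMIT / ROLLBACK.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE author (aid INTEGER, name TEXT)")

con.execute("BEGIN")
con.execute("INSERT INTO author VALUES (1, 'John')")
con.execute("COMMIT")            # change becomes permanent

con.execute("BEGIN")
con.execute("INSERT INTO author VALUES (2, 'Sue')")
con.execute("ROLLBACK")          # change is discarded

rows = con.execute("SELECT * FROM author").fetchall()
print(rows)  # only the committed row survives
```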

Aggregation

SQL’s aggregation goes beyond first-order logic — it cannot be expressed in pure relational calculus. Aggregate functions include COUNT, SUM, MIN, MAX, and AVG.

The full GROUP BY syntax:

SELECT x1, ..., xk, agg1, ..., aggl
FROM Q
GROUP BY x1, ..., xk

The GROUP BY clause partitions the input relation into groups with equal values for the grouping attributes, then applies aggregate functions to each group. All non-aggregate attributes in the SELECT clause must appear in the GROUP BY clause.

Example — Count authors per publication:

SELECT publication, count(author)
FROM wrote
GROUP BY publication;

Example — Sum article pages per author:

SELECT author, sum(endpage - startpage + 1) AS pgs
FROM wrote, article
WHERE publication = pubid
GROUP BY author;

The HAVING clause filters groups by aggregate values:

SELECT publication, count(author)
FROM wrote
GROUP BY publication
HAVING count(author) = 1;

HAVING is syntactic sugar — it can always be replaced by wrapping the aggregation in a subquery and adding a WHERE clause to the outer query.
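That rewriting can be checked directly; a sketch in SQLite comparing HAVING against the subquery-plus-WHERE form (toy data invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE wrote (author INTEGER, publication INTEGER);
    INSERT INTO wrote VALUES (1, 1), (1, 4), (2, 3), (2, 4);
""")

# Single-author publications, filtered with HAVING.
with_having = con.execute("""
    SELECT publication, count(author) FROM wrote
    GROUP BY publication
    HAVING count(author) = 1
""").fetchall()

# The same query: aggregation wrapped in a subquery, filter in WHERE.
rewritten = con.execute("""
    SELECT * FROM (SELECT publication, count(author) AS c
                   FROM wrote GROUP BY publication)
    WHERE c = 1
""").fetchall()

print(sorted(with_having), sorted(rewritten))  # identical answers
```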


SQL — Ordering, Duplicates, and NULL Values

Ordering Results

No particular order of rows in a table or intermediate result should ever be assumed. However, the ORDER BY clause can order the final result:

SELECT DISTINCT *
FROM author
ORDER BY name;

The direction is ASC (ascending, default) or DESC (descending). Multiple sort keys can be chained.

Multiset (Bag) Semantics

SQL tables are multisets (bags), not sets — duplicate rows are allowed. This is the default because eliminating duplicates requires an extra sort or hash step, which was historically expensive. SELECT DISTINCT adds duplicate elimination; plain SELECT does not.

Without DISTINCT, the same logical tuple can appear multiple times. For example, joining WROTE with itself to find publications with at least two authors returns each such publication once per ordered pair of distinct authors: n(n-1) times for a publication with n authors.

The multiset semantics for relational calculus defines a count for each answer tuple:

  • Conjunction \(\phi \land \psi\): the count is the product of the counts from \(\phi\) and \(\psi\).
  • Existential \(\exists x.\phi\): the count is the sum over all values \(v\) of the count that \(\phi[x \mapsto v]\) produces.
  • Disjunction \(\phi \lor \psi\): the count is the sum of both counts.
  • Difference \(\phi \land \neg\psi\): the count is \(\max(0, m-n)\) where \(m\) is \(\phi\)’s count and \(n\) is \(\psi\)’s.
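These counting rules can be sketched with multiset counters in Python; the relations and counts below are invented for illustration:

```python
from collections import Counter

# Two unary relations as multisets: each tuple maps to its count.
R = Counter({("a",): 2, ("b",): 1})   # ('a',) occurs twice in R
S = Counter({("a",): 3})              # ('a',) occurs three times in S

# Conjunction R(x) ∧ S(x): counts multiply.
conj = Counter({t: R[t] * S[t] for t in R if t in S})

# Existential ∃x. R(x): counts of all witnesses add up.
exists_count = sum(R.values())

print(conj[("a",)], exists_count)  # product 2*3 = 6; sum 2+1 = 3
```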

Bag operations correspond to the set operations but preserve duplicates:

Set operation   Bag counterpart
-------------   ---------------
UNION           UNION ALL
EXCEPT          EXCEPT ALL
INTERSECT       INTERSECT ALL

An important subtlety: duplicates in subqueries in the WHERE clause do not change the results of the outer query (since WHERE conditions are boolean — a tuple either matches or it doesn’t). But duplicates in FROM or WITH subqueries do matter.

NULL Values

NULL represents a missing or inapplicable value. There are two distinct interpretations:

  • Value inapplicable: Sue has no home phone number. Better schema design would simply omit that attribute (put home phones in a separate table). SQL’s NULLs paper over this design flaw.
  • Value unknown: Sue has a home phone, but we don’t know the number. This is semantically richer — there exist possible worlds corresponding to every possible value the unknown might take, and a correct query should return answers that hold in all possible worlds (the certain answer semantics).

Computing certain answers is NP-hard to undecidable in general. SQL’s solution is a crude but practical approximation using three-valued logic (TRUE, FALSE, UNKNOWN).

NULL in Expressions and Comparisons

Any arithmetic or string operation with NULL as an operand returns NULL: 1 + NULL = NULL, 'foo' || NULL = NULL.

Comparisons with NULL return UNKNOWN: 1 = NULL → UNKNOWN, NULL < 5 → UNKNOWN. This means x = 0 OR x <> 0 evaluates to UNKNOWN when x is NULL — which is incorrect logically (it should always be TRUE).
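This behaviour is easy to observe; a sketch in SQLite (toy table invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(0,), (1,), (None,)])

# For the NULL row, x = 0 OR x <> 0 evaluates to UNKNOWN,
# and WHERE keeps only rows whose condition is TRUE.
rows = con.execute("SELECT x FROM t WHERE x = 0 OR x <> 0").fetchall()

# Arithmetic with NULL yields NULL (None in Python).
expr = con.execute("SELECT 1 + NULL").fetchone()[0]

print(rows, expr)  # the NULL row is filtered out; 1 + NULL is NULL
```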

Three-Valued Logic Tables

The truth tables for boolean connectives extended with UNKNOWN:

AND | T  U  F        OR | T  U  F        NOT |
----+---------      ----+---------      -----+---
 T  | T  U  F        T  | T  T  T        T   | F
 U  | U  U  F        U  | T  U  U        U   | U
 F  | F  F  F        F  | T  U  F        F   | T

A WHERE clause filters on TRUE — UNKNOWN does not satisfy a WHERE condition.

IS NULL and IS UNKNOWN

Special syntax handles NULL explicitly:

SELECT aid, name
FROM author
WHERE url IS NULL;

IS NULL tests whether a value is null. IS UNKNOWN tests whether a comparison yielded UNKNOWN.

NULL and Aggregation

Aggregate functions treat NULL as “inapplicable” — they skip NULL values:

SELECT count(*) AS RS, count(url) AS US
FROM author;
-- RS counts all rows; US counts only non-NULL urls

COUNT(*) counts rows; COUNT(attribute) counts non-NULL values of that attribute.

Outer Joins

An outer join allows NULL-padded tuples in the result for rows that fail to match the join condition. This is an extension of the FROM clause:

FROM R <j-type> JOIN S ON C

where <j-type> is FULL, LEFT, RIGHT, or INNER.

  • INNER JOIN (the default): only matching tuples appear.
  • LEFT JOIN: all tuples from R appear; non-matching ones have NULLs for S’s columns.
  • RIGHT JOIN: all tuples from S appear; non-matching ones have NULLs for R’s columns.
  • FULL JOIN: all tuples from both R and S appear.

Example — Count publications per author, including authors with zero publications:

SELECT aid, count(publication) AS pubs
FROM author LEFT JOIN wrote ON aid = author
GROUP BY aid;

The LEFT JOIN ensures authors with no publications appear in the result with a count of 0 (because count(publication) counts non-NULL values, and NULL-padded tuples have NULL for publication).
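The counting pattern can be checked in SQLite; Sue below is an invented author with no publications:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (aid INTEGER, name TEXT);
    CREATE TABLE wrote (author INTEGER, publication INTEGER);
    INSERT INTO author VALUES (1, 'John'), (2, 'Sue');
    INSERT INTO wrote VALUES (1, 1), (1, 4);
""")

# Sue survives the LEFT JOIN as a NULL-padded row, and
# count(publication) skips the NULL, giving her a count of 0.
rows = con.execute("""
    SELECT aid, count(publication) AS pubs
    FROM author LEFT JOIN wrote ON aid = author
    GROUP BY aid
    ORDER BY aid
""").fetchall()
print(rows)
```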

Summary on NULLs

NULLs are a necessary evil. They are used to handle small irregularities in data and can always be avoided with better schema design — but the solution may be inefficient. In practice, NULLs are ubiquitous as a quick fix for schema design oversights or schema evolution. Use them sparingly and always remember their unintuitive behavior in conditions and aggregates.


Application Programming — Embedded SQL

Why Application Programming?

SQL is powerful but not Turing-complete. Real applications need general-purpose programming logic: loops, conditionals, user interaction, file I/O, network communication. The solution is to embed SQL statements within a host programming language (C, C++, FORTRAN, Java) or to call the DBMS through a library.

Embedded SQL (ESQL) inlines SQL statements directly into host language source code, distinguished by the prefix EXEC SQL ... ;. A preprocessor translates the combined source into pure host language code that calls the DBMS library. The advantages are:

  • Static analysis: the preprocessor can type-check SQL statements and compile them into access plans at build time.
  • Ease of use: SQL reads naturally alongside the surrounding code.

The disadvantage is the need for a specialized preprocessor and binding step.

Structure of an Embedded SQL Application

Every ESQL program follows this general structure:

#include "util.h"
EXEC SQL INCLUDE SQLCA;   /* SQL Communication Area */

int main(int argc, char *argv[]) {
    EXEC SQL BEGIN DECLARE SECTION;
    /* host variables declared here */
    char db[8] = "DBCLASS";   /* 7 characters plus the terminating NUL */
    EXEC SQL END DECLARE SECTION;

    EXEC SQL WHENEVER SQLERROR GO TO error;
    EXEC SQL CONNECT TO :db;

    /* ... do work ... */

    EXEC SQL COMMIT;
    EXEC SQL CONNECT reset;
    exit(0);

error:
    check_error("My error", &sqlca);
    EXEC SQL WHENEVER SQLERROR CONTINUE;
    EXEC SQL ROLLBACK;
    EXEC SQL CONNECT reset;
    exit(1);
}

The SQL Communication Area (SQLCA)

EXEC SQL INCLUDE SQLCA includes the SQL Communication Area, a structure that contains:

  • sqlcode: the return code of the last SQL statement (0 = success, negative = error, positive = warning).
  • Error message text.

Every SQL statement should be followed by an error check.

Host Variables

Host variables shuttle values between SQL and the host language. They must be declared in a BEGIN DECLARE SECTION/END DECLARE SECTION block and are referenced in SQL statements prefixed with a colon:

EXEC SQL SELECT title INTO :title
         FROM publication
         WHERE pubid = :pubid;

Error Handling

EXEC SQL WHENEVER SQLERROR    GO TO lbl;
EXEC SQL WHENEVER SQLWARNING  GO TO lbl;
EXEC SQL WHENEVER NOT FOUND   GO TO lbl;

These directives set up automatic branch-on-error behavior. NOT FOUND triggers when a query returns no rows.

Simple Queries: INTO Clause

For queries guaranteed to return exactly one tuple, use SELECT ... INTO :var:

strncpy(pubid, argv[i], 8);
EXEC SQL SELECT title INTO :title
         FROM publication
         WHERE pubid = :pubid;
printf("%10s: %s\n", pubid, title);

If more than one tuple matches, the DBMS raises a cardinality-violation error. If no tuple matches, NOT FOUND is triggered.

NULLs and Indicator Variables

If a host variable might be assigned NULL, an indicator variable must accompany it:

short ind;   /* indicator: SQL SMALLINT corresponds to a C short */
EXEC SQL SELECT firstname INTO :firstname INDICATOR :ind
         FROM ...;
if (ind < 0) { /* firstname is NULL */ }

Without an indicator variable, assigning NULL to a host variable causes a runtime error.

Cursors: Handling Multiple-Tuple Results

The fundamental mismatch between SQL (which returns sets of tuples) and procedural programming (which processes one record at a time) is called the impedance mismatch. The solution is the cursor:

EXEC SQL DECLARE author_cursor CURSOR
         FOR SELECT name, title
             FROM author, wrote, publication
             WHERE name LIKE :apat
               AND aid = author
               AND pubid = publication;

EXEC SQL OPEN author_cursor;
EXEC SQL WHENEVER NOT FOUND GO TO end;
for (;;) {
    EXEC SQL FETCH author_cursor INTO :name, :title;
    printf("%s -> %s\n", name, title);
}
end:
EXEC SQL CLOSE author_cursor;

The cursor protocol: DECLARE the query, OPEN it (executes the query), FETCH one tuple at a time into host variables, CLOSE when done.
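The same OPEN/FETCH/CLOSE protocol survives in call-level APIs. A minimal sketch using Python's sqlite3 module — the schema and data are hypothetical stand-ins for the course's author/wrote/publication tables:

```python
import sqlite3

# Hypothetical in-memory stand-in for the course database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (aid INTEGER, name TEXT);
    CREATE TABLE wrote (author INTEGER, publication INTEGER);
    CREATE TABLE publication (pubid INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Codd');
    INSERT INTO wrote VALUES (1, 10);
    INSERT INTO publication VALUES (10, 'A Relational Model of Data');
""")

cur = conn.cursor()
cur.execute("""SELECT name, title
               FROM author, wrote, publication
               WHERE name LIKE ? AND aid = author AND pubid = publication""",
            ("%Codd%",))               # OPEN: parameter bound, query executed
results = []
while True:
    row = cur.fetchone()               # FETCH one tuple at a time
    if row is None:                    # NOT FOUND
        break
    results.append(row)
cur.close()                            # CLOSE
```

The host-language loop is identical in shape to the ESQL FETCH loop above; only the error-handling mechanism differs (None instead of a WHENEVER branch).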

Cursors for Updates

A cursor can be declared FOR UPDATE to allow modifying the row it currently points to:

EXEC SQL DECLARE author_cursor CURSOR
         FOR SELECT name FROM author WHERE url IS NULL
         FOR UPDATE OF url;
EXEC SQL OPEN author_cursor;
EXEC SQL WHENEVER NOT FOUND GO TO end;
for (;;) {
    EXEC SQL FETCH author_cursor INTO :name;
    /* read new URL from user */
    EXEC SQL UPDATE author
             SET url = :url
             WHERE CURRENT OF author_cursor;
}
end:
EXEC SQL CLOSE author_cursor;

WHERE CURRENT OF <cursor> updates (or deletes) the tuple that the cursor is currently positioned on. This is the only way to distinguish duplicate rows.

Stored Procedures

A stored procedure executes application logic directly inside the DBMS process, minimizing network round-trips and centralizing application code.

Scalar function (returns a single value):

CREATE FUNCTION sumSalaries(dept CHAR(3))
RETURNS DECIMAL(9,2)
LANGUAGE SQL
RETURN SELECT sum(salary)
       FROM employee
       WHERE workdept = dept;

Table-valued function (returns a relation):

CREATE FUNCTION deptSalariesF(dept CHAR(3))
RETURNS TABLE(salary DECIMAL(9,2))
LANGUAGE SQL
RETURN SELECT salary
       FROM employee
       WHERE workdept = dept;

Procedure with branching (SQL/PSM):

CREATE PROCEDURE UPDATE_SALARY_IF
    (IN employee_number CHAR(6), INOUT rating SMALLINT)
LANGUAGE SQL
BEGIN
    IF rating = 1 THEN
        UPDATE employee SET salary = salary * 1.10, bonus = 1000
        WHERE empno = employee_number;
    ELSEIF rating = 2 THEN
        UPDATE employee SET salary = salary * 1.05, bonus = 500
        WHERE empno = employee_number;
    ELSE
        UPDATE employee SET salary = salary * 1.03, bonus = 0
        WHERE empno = employee_number;
    END IF;
END

Dynamic Embedded SQL

The Problem of Dynamic SQL

In static embedded SQL, queries are known at compile time — the preprocessor can type-check them and compile access plans. But some applications need to build and execute queries at runtime, based on user input. This requires dynamic SQL.

The challenges of dynamic SQL:

  • How do we know if a string is a valid SQL statement? (Requires parsing at runtime.)
  • How do we execute queries whose result schema is unknown? (Requires runtime metadata.)
  • How do we pass parameters into a dynamically constructed query?

Dynamic SQL Roadmap

STRING
  → EXECUTE IMMEDIATE  (non-parametric, one-time)
  → PREPARE → EXECUTE  (parametric, non-query)
  → PREPARE → DECLARE CURSOR → OPEN/FETCH/CLOSE (parametric query)

EXECUTE IMMEDIATE

For simple, non-parametric statements executed once:

EXEC SQL EXECUTE IMMEDIATE :string;

The string is compiled and executed every time. Cannot return tuples or contain parameter markers.

PREPARE

Compiles a statement for repeated execution, avoiding recompilation:

EXEC SQL PREPARE stmt FROM :string;

stmt is an identifier used by the preprocessor to track the compiled statement. The string may be a query, may return answers, and may contain parameter markers (?).

EXECUTE with Parameter Markers

Execute a prepared non-query statement, substituting values for ? markers:

EXEC SQL EXECUTE stmt USING :var1, ..., :vark;

Values are substituted in order of appearance.

Cursors for Dynamic Queries

For prepared queries that return multiple tuples:

EXEC SQL DECLARE cname CURSOR FOR stmt;
EXEC SQL OPEN cname USING :var1, ..., :vark;
EXEC SQL FETCH cname INTO :out1, ..., :outn;
EXEC SQL CLOSE cname;

The SQLDA: Handling Unknown Output Schemas

When the number and types of result columns are not known at compile time, the application uses a SQL Descriptor Area (SQLDA). DESCRIBE stmt INTO sqlda populates the SQLDA with metadata about the result columns:

  • Number of result attributes.
  • For each attribute: its name, type, and length.

The application then allocates buffers for each attribute, uses USING DESCRIPTOR :sqlda to receive data, and prints or processes each field:

EXEC SQL DESCRIBE stmt INTO :*select;
/* allocate buffers for each attribute */
for (i = 0; i < select->sqld; i++) {
    select->sqlvar[i].sqldata = malloc(select->sqlvar[i].sqllen);
    select->sqlvar[i].sqlind  = malloc(sizeof(short));
}
EXEC SQL WHENEVER NOT FOUND GO TO done;
EXEC SQL OPEN cstmt;
for (;;) {
    EXEC SQL FETCH cstmt USING DESCRIPTOR :*select;
    /* print each field based on its type */
}
done:
EXEC SQL CLOSE cstmt;

This is the basis for generic SQL tools like interactive query interfaces (the adhoc application in the course examples).
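The same runtime-metadata idea appears in call-level APIs, where the result-column descriptions become available only after execution. A minimal sketch using Python's sqlite3, in which cursor.description plays the role of the SQLDA; the table and data are made up:

```python
import sqlite3

# Hypothetical table for a query whose text is unknown until runtime.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publication (pubid INTEGER, title TEXT);
    INSERT INTO publication VALUES (1, 'Relational Completeness');
""")

def run_adhoc(query):
    """Execute a query string and return (column names, rows) generically."""
    cur = conn.cursor()
    cur.execute(query)                       # text known only at runtime
    cols = [d[0] for d in cur.description]   # "DESCRIBE": result-column names
    rows = cur.fetchall()
    cur.close()
    return cols, rows

cols, rows = run_adhoc("SELECT * FROM publication")
```

A generic tool needs nothing beyond this: it discovers the output schema, then formats each row column by column.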


ODBC — Call-Level Interface

CLI/ODBC Concepts

ODBC (Open Database Connectivity) is an alternative to embedded SQL. Instead of preprocessing SQL statements into the host language, ODBC uses a standard library of function calls. Applications link against the ODBC library and call it directly.

Advantages:

  • No precompiler needed — standard C compilation.
  • Vendor independence: ODBC is a quasi-standard, combining Microsoft's ODBC with the X/Open CLI standard.
  • Database independence: switch databases by changing the connection string.

Disadvantage: all SQL is dynamic — no compile-time type checking or access plan optimization.

Three fundamental objects in every ODBC program:

  • Environment (SQLHENV): the ODBC context.
  • Connection (SQLHDBC): a connection to a specific database.
  • Statement (SQLHSTMT): a SQL statement handle.

Connecting and Disconnecting

SQLHENV henv;
SQLHDBC hdbc;
SQLAllocEnv(&henv);
SQLAllocConnect(henv, &hdbc);
rc = SQLConnect(hdbc, server, SQL_NTS, uid, SQL_NTS, pwd, SQL_NTS);
/* ... do work ... */
SQLDisconnect(hdbc);
SQLFreeConnect(hdbc);
SQLFreeEnv(henv);

Statement Functions

Key ODBC statement functions:

Function            Purpose
SQLAllocStmt        Allocate a statement handle
SQLExecDirect       Prepare and execute in one step
SQLPrepare          Compile a statement
SQLExecute          Execute a compiled statement
SQLBindParameter    Bind a host variable to a ? marker
SQLNumResultCols    Number of result columns
SQLDescribeCol      Metadata about a result column
SQLBindCol          Bind a result column to a host buffer
SQLGetData          Retrieve a result column value after FETCH
SQLFetch            Advance the cursor to the next row
SQLRowCount         Number of rows affected by an update
SQLFreeStmt         Release the statement handle

Parameter Binding

Parameters are bound with SQLBindParameter:

SQLPrepare(hstmt, "UPDATE author SET url = ? WHERE aid = ?", SQL_NTS);
SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_CHAR,
                 0, 0, url_buffer, 70, &url_ind);
SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_SLONG, SQL_INTEGER,
                 0, 0, &aid, 0, NULL);
SQLExecute(hstmt);

Retrieving Results

Two approaches for getting query results:

Bind columns before fetch (SQLBindCol):

SQLExecDirect(hstmt, "SELECT pubid, title FROM publication", SQL_NTS);
SQLBindCol(hstmt, 1, SQL_C_CHAR, pubid.s, 8, &pubid.ind);
SQLBindCol(hstmt, 2, SQL_C_CHAR, title.s, 70, &title.ind);
while (SQLFetch(hstmt) == SQL_SUCCESS)
    printf("%-8.8s %-70.70s\n", pubid.s, title.s);

Get data after fetch (SQLGetData):

while (SQLFetch(hstmt) == SQL_SUCCESS) {
    SQLGetData(hstmt, 1, SQL_C_CHAR, pubid.s, 8, &pubid.ind);
    SQLGetData(hstmt, 2, SQL_C_CHAR, title.s, 70, &title.ind);
    printf("%-8.8s %-70.70s\n", pubid.s, title.s);
}

Transactions in ODBC

Transactions begin implicitly on the first statement execution and end with SQLTransact:

SQLTransact(henv, hdbc, SQL_COMMIT);    /* or SQL_ROLLBACK */

ODBC can do everything embedded SQL can, but all statements are effectively dynamic, and the programmer bears full responsibility for type matching.


The Entity-Relationship Model

Purpose of the E-R Model

The Entity-Relationship (E-R) model, proposed by Peter Chen in 1976, is a tool for conceptual schema design — modeling a real-world domain before translating it into a relational schema. The world is described in terms of:

  • Entities: distinguishable real-world objects.
  • Attributes: properties of entities.
  • Relationships: associations between entities.

E-R diagrams provide a visual representation of the conceptual schema. Many variant notations exist; this course uses the Waterloo notation.

Entities and Entity Sets

An entity is a distinguishable object in the domain of discourse. An entity set is a collection of entities of the same type: all current students at the University of Waterloo, all flights offered by Air Canada, all burglaries in Ontario in 1994.

In the E-R diagram, entity sets are represented as rectangles: [ Student ], [ Flight ], [ Burglary ].

Attributes

Attributes describe the properties of entities. Each attribute has a domain — the set of permitted values. In the E-R diagram, attributes appear as ovals connected to their entity set: Student has attributes StudentNum, StudentName, and Major.

Relationships and Relationship Sets

A relationship records that certain entities are related to each other. A relationship set is a collection of relationships of a given type. Relationships in the E-R model require that the participating entities exist.

Example: the RegisteredIn relationship set records which students are registered in which courses. In the E-R diagram, relationship sets are represented as diamonds connecting entity sets.

(Diagram: entity set Student with attributes SID, Name, Major; entity set Course with attributes CID, Title, Credits; relationship set Enrolls between them with attribute Grade and cardinality N:M. Rectangles denote entity sets, diamonds relationship sets, ovals attributes.)

Relationships can also have attributes — for example, a Match relationship between a home team, a visiting team, and a location might have a Score attribute.

Role Names

When an entity set appears in multiple roles in a single relationship, role names distinguish them. Example: a Match relationship involves Team as both HomeTeam and Visitor.

Constraints in E-R Models

Primary Keys

Each entity must be distinguishable from others in its set by its attributes. The primary key is a minimal set of attributes whose values uniquely identify each entity in the set.

Relationship Types

The cardinality of a relationship describes how many entities on each side can participate:

  • Many-to-many (N:N): each entity in one set can relate to many entities in the other, and vice versa.
  • Many-to-one (N:1): each entity on one side relates to at most one entity on the other side. Example: Employee WorksIn Department (each employee works in one department, each department has many employees).
  • One-to-one (1:1): each entity on each side relates to at most one entity on the other. Example: Employee Manages Department.

Existence Dependencies and Weak Entities

Sometimes an entity’s existence depends on another entity: transactions exist only within accounts.

  • The dominant entity (Account) must exist first.
  • The subordinate entity (Transaction) is existence-dependent.
  • The entity set containing subordinate entities is a weak entity set.
  • Weak entities are identified by their discriminator (attributes that distinguish them among subordinates of a given dominant entity) plus the primary key of their dominant entity.

In the E-R diagram, weak entity sets use double rectangles, and identifying relationships use double diamonds.

General Cardinality Constraints

General cardinality constraints specify lower and upper bounds on participation. Notation: (lower, upper) on each edge of a relationship. Example: Student (3,5) Takes (6,100) Course means each student takes between 3 and 5 courses, and each course has between 6 and 100 students.

Extensions to E-R Modeling

Structured Attributes

  • Composite attributes: composed of a fixed number of sub-attributes. Example: Address composed of Street, City, Province, PostalCode.
  • Multi-valued attributes: set-valued. Example: Hobbies (an employee may have multiple hobbies).

Aggregation

Sometimes a relationship set needs to be treated as an entity in its own right, so it can participate in another relationship. Example: “Accounts are assigned to a given student enrollment” — the EnrolledIn relationship between Student and Course is aggregated into an entity, which then participates in a CourseAccount relationship with Account.

Specialization

A specialized entity set is derived from a more general one. Example: Graduate is a specialization of Student — all graduates are students, but graduates additionally have a supervisor and degrees.

In the diagram, specialization uses ISA (is-a) triangles. The constraint ∀e(Graduate(e) → Student(e)) holds.

Generalization

Generalization is the inverse: multiple entity sets are abstracted into a more general entity set. Example: Car and Truck are generalized to Vehicle.

An additional constraint, COVERS (\(\forall e(Vehicle(e) \to \exists i.E_i(e))\)), asserts that every vehicle is either a car or a truck.

Disjointness

Specialized entity sets are usually disjoint (nothing is both a car and a truck), but they can be declared to overlap (to accommodate utility vehicles).

Design Methodology

A simple six-step process for E-R design:

  1. Recognize entity sets.
  2. Recognize relationship sets and the entity sets they connect.
  3. Recognize attributes of entity and relationship sets.
  4. Define relationship types (cardinality) and existence dependencies.
  5. Define general cardinality constraints, keys, and discriminators.
  6. Draw the diagram.

For each decision, maintain a log of assumptions and the restrictions imposed.

Attribute or Entity Set?

Rules of thumb — model something as a new entity set if:

  • It is a separate object of interest.
  • Information is maintained about it independently.
  • Multiple of its kind can belong to a single entity.
  • It makes sense to delete such an object independently.
  • It can be missing from some entities.
  • It can be shared by different entities.

Translating ER to Relational Tables

Main Translation Rules

The translation from an E-R diagram to a relational schema follows three main rules:

  1. Each entity set maps to a new table.
  2. Each attribute maps to a new column in that table.
  3. Each relationship set maps to either new columns on an existing table (when the cardinality allows) or a new table.

Strong Entity Sets

A strong entity set \(E\) with attributes \(a_1, \ldots, a_n\) translates to a table \(E\) with those columns. The primary key of the entity set becomes the primary key of the table.

Student(StudentNum, StudentName, Major)
  PRIMARY KEY (StudentNum)

Weak Entity Sets

A weak entity set \(E\) translates to a table whose columns include:

  • The attributes of the weak entity set.
  • The attributes of the identifying relationship.
  • The primary key attributes of the dominant entity set.

The primary key of the resulting table is the discriminator plus the primary key of the dominant entity.

Example: Transaction (weak, dominated by Account):

Account(AccNum, Balance)
Transaction(TransNum, AccNum, Date, Amount)
  PRIMARY KEY (TransNum, AccNum)
  FOREIGN KEY (AccNum) REFERENCES Account

Relationship Sets

If a relationship set is the identifying relationship for a weak entity set, no additional table is needed (it’s already folded into the weak entity’s table).

If the cardinality constraint (1,1) holds for one component entity set \(E\), the relationship’s attributes and the primary key of the other entity sets can be added directly to \(E\)’s table (no new table needed).

Otherwise, a new table \(R\) is created with:

  • All attributes of the relationship set.
  • The primary key attributes of each component entity set (with role names prepended if role names are given).

The primary key of \(R\) is determined by cardinality: if (0,1) holds for some component entity, use that entity’s key; otherwise, combine all components’ keys.

Representing Aggregation

The tabular representation of an aggregation of relationship \(R\) is simply the table for \(R\). A relationship involving the aggregation treats it like an entity set whose primary key is the primary key of \(R\)’s table.

Representing Specialization

Create a table for the higher-level entity set. Treat each specialized subset like a weak entity set without discriminators — its table has the primary key of the higher-level entity and any additional attributes.

Representing Generalization

Approach 1: Create a table for each lower-level entity set only. Each table includes both the lower-level and the higher-level attributes. The higher-level entity set can then be reconstructed as a view (union of the lower-level tables).

Approach 2: Treat it like specialization — create a table for the higher-level entity (with shared attributes), plus one table per lower-level entity (with just the specialized attributes and the higher-level primary key as a foreign key).

SQL DDL for Integrity Constraints

Primary Keys

CREATE TABLE DEPT (
    ID       INTEGER NOT NULL,
    DeptName CHAR(20),
    MgrNo    CHAR(3),
    PRIMARY KEY (ID)
);

Inserting a duplicate primary key raises SQLSTATE 23505 (violation of a unique constraint).

Foreign Keys

CREATE TABLE EMP (
    SSN    INTEGER NOT NULL,
    Name   CHAR(20),
    Dept   INTEGER,
    Salary DECIMAL(8,2),
    PRIMARY KEY (SSN),
    FOREIGN KEY (Dept) REFERENCES DEPT(ID)
        ON DELETE CASCADE
        ON UPDATE RESTRICT
);

Actions on referential constraint violation:

  • RESTRICT: produce an error (block the operation).
  • CASCADE: propagate the delete/update to referencing tuples.
  • SET NULL: set the foreign key value to NULL.

CHECK Constraints

CREATE TABLE EMP (
    ...
    CHECK (salary > 0)
);

CHECK constraints enforce arbitrary conditions at the row level. Subqueries are generally not allowed in CHECK conditions.

Views

A view is a relation whose instance is determined by the instances of other base relations. It has the same properties as a table for querying purposes, but updates may or may not be propagated back.

CREATE VIEW ManufacturingProjects AS
    SELECT projno, projname, firstnme, lastname
    FROM project, employee
    WHERE respemp = empno AND deptno = 'D21';

Virtual views are not stored — they are rewritten into the base query at compile time. Materialized views are stored and must be refreshed when base tables change (useful for data warehouses).

A view is updatable in SQL-92 only if it references exactly one table, outputs simple attributes (no expressions), and has no grouping, aggregation, DISTINCT, nested queries, or set operations.

Data Control Language

GRANT SELECT, INSERT ON author TO PUBLIC;
REVOKE DELETE ON author FROM USER dave;

DCL assigns access rights (SELECT, INSERT, DELETE, UPDATE, ALTER, CONTROL, etc.) to database objects for individual users, groups, or the public.


Functional Dependencies and Normal Forms

Schema Design Quality

How do we evaluate whether a relational schema is good? Several warning signs indicate poor design:

  • Update anomalies: changing a supplier’s name requires updating it in every row where that supplier appears.
  • Insert anomalies: cannot record a new item without also recording a supplier.
  • Delete anomalies: deleting the last supply relationship for a supplier erases all information about that supplier.
  • Redundancy: the same fact (e.g., “Supplier S1 is located in Ajax”) is stored in multiple rows.

The cause of all these problems is that the schema captures multiple independent facts in a single table.

Functional Dependencies

A functional dependency (FD) \(X \to Y\) over relation schema \(R\) states that for any two tuples in any valid instance of \(R\), if they agree on all attributes in \(X\), they must also agree on all attributes in \(Y\). Formally:

\[\forall v_1, \ldots, v_k, w_1, \ldots, w_k. R(v_1, \ldots, v_k) \land R(w_1, \ldots, w_k) \land \bigwedge_{j \in X} v_j = w_j \to \bigwedge_{i \in Y} v_i = w_i\]

We say “\(X\) functionally determines \(Y\).”

Example — Schema EmpProj(SIN, PNum, Hours, EName, PName, PLoc, Allowance):

  • SIN → EName (SIN determines employee name)
  • PNum → PName, PLoc (project number determines project name and location)
  • PLoc, Hours → Allowance (allowances are the same for the same hours at the same location)
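As a concrete reading of the definition, an FD X → Y can be checked on a given instance by grouping rows on their X-values. A minimal sketch in Python; rows are dicts over the EmpProj attributes, and the data values are made up:

```python
# Check whether FD X -> Y holds in a table instance.
def holds_fd(rows, X, Y):
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in X)
        val = tuple(r[a] for a in Y)
        if key in seen and seen[key] != val:
            return False          # two rows agree on X but differ on Y
        seen[key] = val
    return True

# Illustrative EmpProj fragment.
emp_proj = [
    {"SIN": 1, "EName": "Ann", "PNum": 10, "PName": "DB", "PLoc": "KW"},
    {"SIN": 1, "EName": "Ann", "PNum": 20, "PName": "OS", "PLoc": "TO"},
    {"SIN": 2, "EName": "Bob", "PNum": 10, "PName": "DB", "PLoc": "KW"},
]
```

Note that an instance can only refute an FD; the dependency itself is a constraint on all valid instances.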

Armstrong’s Axioms

To derive all FDs implied by a set \(F\), we use Armstrong’s axioms:

  1. (Reflexivity): if \(Y \subseteq X\), then \(X \to Y\).
  2. (Augmentation): if \(X \to Y\), then \(XZ \to YZ\).
  3. (Transitivity): if \(X \to Y\) and \(Y \to Z\), then \(X \to Z\).

These are sound and complete: anything derivable from \(F\) using these axioms is in \(F^+\), and every FD in \(F^+\) is derivable.

Additional derived rules:

  • (Union): \(X \to Y\) and \(X \to Z\) imply \(X \to YZ\).
  • (Decomposition): \(X \to YZ\) implies \(X \to Y\) and \(X \to Z\).

Example — Derive SIN, PNum → Allowance from the above FDs:

  1. SIN, PNum → Hours (from F)
  2. PNum → PName, PLoc (from F)
  3. PLoc, Hours → Allowance (from F)
  4. SIN, PNum → PNum (reflexivity)
  5. SIN, PNum → PName, PLoc (transitivity: 4 + 2)
  6. SIN, PNum → PLoc (decomposition: 5)
  7. SIN, PNum → PLoc, Hours (union: 6 + 1)
  8. SIN, PNum → Allowance (transitivity: 7 + 3)

Keys

  • A superkey is a set \(K \subseteq R\) such that \(K \to R\) (the key determines all attributes).
  • A candidate key is a minimal superkey — no proper subset of it is also a superkey.
  • A primary key is a candidate key chosen by the DBA.

Computing the Closure

The attribute closure \(X^+\) of a set of attributes \(X\) with respect to \(F\) is the set of all attributes determined by \(X\):

function ComputeX+(X, F):
    X+ := X
    while true do:
        if ∃(Y → Z) ∈ F such that Y ⊆ X+ and Z ⊄ X+:
            X+ := X+ ∪ Z
        else:
            return X+

\(X\) is a superkey of \(R\) iff \(\text{ComputeX}^+(X, F) = R\). \(X \to Y \in F^+\) iff \(Y \subseteq \text{ComputeX}^+(X, F)\).
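ComputeX+ transcribes directly into code. A sketch in Python over the EmpProj FDs (including SIN, PNum → Hours from the derivation example); FDs are pairs of attribute sets:

```python
# Attribute closure: the ComputeX+ algorithm above.
def closure(X, F):
    Xplus = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= Xplus and not rhs <= Xplus:   # Y ⊆ X+ and Z ⊄ X+
                Xplus |= rhs
                changed = True
    return Xplus

R = {"SIN", "PNum", "Hours", "EName", "PName", "PLoc", "Allowance"}
F = [({"SIN"}, {"EName"}),
     ({"PNum"}, {"PName", "PLoc"}),
     ({"PLoc", "Hours"}, {"Allowance"}),
     ({"SIN", "PNum"}, {"Hours"})]

# Superkey test: {SIN, PNum} determines all of R.
is_superkey = closure({"SIN", "PNum"}, F) == R
```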

Decompositions

A decomposition of \(R\) is a collection \(\{R_1, \ldots, R_n\}\) of schemas with \(R = R_1 \cup \cdots \cup R_n\). A good decomposition:

  1. Does not lose information (lossless-join).
  2. Does not complicate checking constraints (dependency-preserving).
  3. Eliminates anomalies.

Lossless-Join Decompositions

A decomposition \(\{R_1, R_2\}\) of \(R\) is lossless iff the natural join of the two projected instances always reconstructs the original: \(R_1 \cap R_2 \to R_1\) or \(R_1 \cap R_2 \to R_2\) (the shared attributes form a superkey for at least one of the two parts).

A lossy decomposition produces spurious tuples upon joining — information is effectively lost.
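The binary lossless-join test reduces to an attribute-closure computation: the shared attributes must determine one of the two sides. A sketch in Python; the supplier-style attribute names follow the anomaly discussion earlier and are illustrative:

```python
# Attribute closure (same algorithm as ComputeX+).
def closure(X, F):
    Xplus = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= Xplus and not rhs <= Xplus:
                Xplus |= rhs
                changed = True
    return Xplus

def lossless(R1, R2, F):
    """Lossless iff R1 ∩ R2 -> R1 or R1 ∩ R2 -> R2."""
    shared = set(R1) & set(R2)
    c = closure(shared, F)
    return set(R1) <= c or set(R2) <= c

F = [({"SNum"}, {"SName", "City"})]                       # SNum -> SName, City
good = lossless({"SNum", "SName", "City"}, {"SNum", "Item"}, F)
bad = lossless({"SName", "City"}, {"SNum", "Item"}, F)    # no shared key: lossy
```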

Dependency Preservation

A decomposition is dependency-preserving if every FD in \(F\) can be checked by examining only a single table in the decomposed schema (no joins needed to verify a constraint).

Boyce-Codd Normal Form (BCNF)

Schema \(R\) is in BCNF with respect to \(F\) iff for every non-trivial FD \(X \to Y \in F^+\) with \(XY \subseteq R\), \(X\) is a superkey of \(R\).

In other words: the only functional dependencies allowed are those where the left-hand side is a superkey. BCNF eliminates all redundancy due to FDs.

BCNF Decomposition Algorithm:

function ComputeBCNF(R, F):
    Result := {R}
    while ∃ Rᵢ ∈ Result and (X → Y) ∈ F+ violates BCNF on Rᵢ:
        Replace Rᵢ by (Rᵢ - (Y - X))
        Add X ∪ Y (the schema XY) to Result
    return Result

The result depends on the order in which violating FDs are processed. It is not always possible to achieve both lossless-join and dependency preservation in BCNF: for \(R = \{A, B, C\}\) with \(F = \{AB \to C, C \to B\}\), no lossless dependency-preserving BCNF decomposition exists.
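A sketch of ComputeBCNF in Python, run on the R = {A, B, C}, F = {AB → C, C → B} example. For simplicity it looks for violating FDs only in F itself rather than all of F+, which suffices here:

```python
# Attribute closure (ComputeX+).
def closure(X, F):
    Xplus = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= Xplus and not rhs <= Xplus:
                Xplus |= rhs
                changed = True
    return Xplus

def bcnf(R, F):
    result = [frozenset(R)]
    changed = True
    while changed:
        changed = False
        for Ri in result:
            for X, Y in F:
                # Violation: XY ⊆ Ri, X -> Y non-trivial, X not a superkey of Ri.
                if X | Y <= Ri and not Y <= X and not Ri <= closure(X, F):
                    result.remove(Ri)
                    result.append(frozenset(Ri - (Y - X)))   # Ri minus (Y - X)
                    result.append(frozenset(X | Y))          # new schema XY
                    changed = True
                    break
            if changed:
                break
    return result

F = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]
decomp = bcnf({"A", "B", "C"}, F)
```

Here C → B violates BCNF, so R splits into {A, C} and {B, C} — lossless, but the FD AB → C can no longer be checked within a single table.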

Third Normal Form (3NF)

Schema \(R\) is in 3NF with respect to \(F\) iff for every non-trivial FD \(X \to Y \in F^+\) with \(XY \subseteq R\), either \(X\) is a superkey of \(R\) or every attribute in \(Y\) is part of some candidate key of \(R\).

3NF is strictly weaker (more permissive) than BCNF — it allows more redundancy but guarantees the existence of a lossless-join, dependency-preserving decomposition.

Minimal Cover

Two sets of FDs \(F\) and \(G\) are equivalent if \(F^+ = G^+\). A set \(G\) is a minimal cover if:

  1. Every RHS is a single attribute.
  2. No FD can be removed without changing \(G^+\).
  3. No attribute can be removed from any LHS without changing \(G^+\).

Computing a minimal cover (four steps, repeated until stable):

  1. Replace \(X \to YZ\) with \(X \to Y\) and \(X \to Z\).
  2. Remove \(X \to A\) from \(F\) if \(A \in \text{ComputeX}^+(X, F - \{X \to A\})\).
  3. Remove attribute \(A\) from the LHS of \(X \to B\) if \(B \in \text{ComputeX}^+(X - \{A\}, F)\).
  4. Recombine: replace \(X \to Y\) and \(X \to Z\) by \(X \to YZ\).
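The first three steps can be sketched in Python (step 4 merely recombines same-LHS FDs for presentation, so it is omitted); FDs are pairs of attribute sets, and the input FDs are illustrative:

```python
# Attribute closure (ComputeX+); F holds (lhs, rhs) attribute-set pairs.
def closure(X, F):
    Xplus = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= Xplus and not rhs <= Xplus:
                Xplus |= rhs
                changed = True
    return Xplus

def minimal_cover(F):
    # Step 1: split right-hand sides into single attributes.
    G = [(frozenset(l), frozenset({a})) for l, r in F for a in r]
    changed = True
    while changed:
        changed = False
        # Step 3: drop extraneous attributes from left-hand sides.
        for i, (lhs, rhs) in enumerate(G):
            for a in lhs:
                if rhs <= closure(lhs - {a}, G):
                    G[i] = (lhs - {a}, rhs)
                    changed = True
                    break
        # Step 2: drop FDs redundant with respect to the rest.
        for fd in list(G):
            rest = [g for g in G if g is not fd]
            if fd[1] <= closure(fd[0], rest):
                G = rest
                changed = True
    return G

# Illustrative input: {A -> BC, B -> C, AB -> C}.
mc = minimal_cover([({"A"}, {"B", "C"}), ({"B"}, {"C"}), ({"A", "B"}, {"C"})])
```

For this input the cover shrinks to {A → B, B → C}: A → C is implied transitively, and AB → C has an extraneous attribute on its left-hand side.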

3NF Decomposition Algorithm

function Compute3NF(R, F):
    Result := ∅
    F' := minimal cover for F
    for each (X → Y) ∈ F':
        Result := Result ∪ {XY}
    if no Rᵢ ∈ Result contains a candidate key for R:
        compute a candidate key K for R
        Result := Result ∪ {K}
    return Result

This always produces a lossless-join, dependency-preserving 3NF decomposition.

Summary of design goals: a decomposition should be (1) lossless-join, (2) dependency-preserving, and (3) in BCNF — or at least 3NF if BCNF cannot be achieved with the first two properties.


Multivalued Dependencies and Higher Normal Forms

Beyond Functional Dependencies

Functional dependencies capture only one kind of redundancy. Consider the schema CTB(Course, Teacher, Book) with instance:

Course    Teacher   Book
Math      Smith     Algebra
Math      Smith     Calculus
Math      Jones     Algebra
Math      Jones     Calculus
Physics   Black     Mechanics
Physics   Black     Optics

There are no non-trivial FDs here — so this schema is in BCNF. Yet the data is clearly redundant: for each course, the set of teachers and the set of books are independent of each other. If we add a new teacher for Math, we must add one row for each book (and vice versa).

Multivalued Dependencies

A multivalued dependency (MVD) \(X \twoheadrightarrow Y\) on schema \(R\) states that whenever two tuples agree on \(X\), the set of \(Y\)-values associated with any particular \(X\)-value is independent of the remaining attributes \(R - X - Y\):

Formally (instantiated for CTB with \(X = \{Course\}\) and \(Y = \{Teacher\}\)): if \((c, t_1, b_1) \in \text{CTB}\) and \((c, t_2, b_2) \in \text{CTB}\), then \((c, t_1, b_2) \in \text{CTB}\) and \((c, t_2, b_1) \in \text{CTB}\).

For CTB: \(C \twoheadrightarrow T\) and \(C \twoheadrightarrow B\) both hold.
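The tuple-swap condition can be checked directly on an instance. A sketch in Python over the CTB table (rows as dicts):

```python
# Check MVD X ->> Y on an instance: whenever two rows agree on X,
# swapping their Y-values must yield rows that are also present.
def holds_mvd(rows, X, Y):
    present = {tuple(sorted(r.items())) for r in rows}
    for r in rows:
        for s in rows:
            if all(r[a] == s[a] for a in X):
                swapped = dict(r)
                for a in Y:
                    swapped[a] = s[a]   # take s's Y-values, keep r's rest
                if tuple(sorted(swapped.items())) not in present:
                    return False
    return True

ctb = [
    {"Course": "Math", "Teacher": "Smith", "Book": "Algebra"},
    {"Course": "Math", "Teacher": "Smith", "Book": "Calculus"},
    {"Course": "Math", "Teacher": "Jones", "Book": "Algebra"},
    {"Course": "Math", "Teacher": "Jones", "Book": "Calculus"},
    {"Course": "Physics", "Teacher": "Black", "Book": "Mechanics"},
    {"Course": "Physics", "Teacher": "Black", "Book": "Optics"},
]
```

Removing any single Math row breaks the MVD: the teacher and book sets for Math are no longer independent.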

Axioms for MVDs

Reasoning about MVDs uses a combined set of axioms for FDs and MVDs:

  1. (Reflexivity): \(Y \subseteq X \Rightarrow X \twoheadrightarrow Y\)
  2. (Complementation): \(X \twoheadrightarrow Y \Rightarrow X \twoheadrightarrow (R - Y)\)
  3. (Augmentation): \(X \twoheadrightarrow Y \Rightarrow XZ \twoheadrightarrow YZ\)
  4. (Transitivity): \(X \twoheadrightarrow Y, Y \twoheadrightarrow Z \Rightarrow X \twoheadrightarrow (Z - Y)\)
  5. (Conversion): \(X \to Y \Rightarrow X \twoheadrightarrow Y\) (every FD implies an MVD)
  6. (Interaction): \(X \twoheadrightarrow Y, XY \to Z \Rightarrow X \to (Z - Y)\)

These axioms are sound and complete for the logical implication of both FDs and MVDs.

Dependency Basis

The dependency basis of \(X\) with respect to a set \(F\) of FDs and MVDs is a partition of \(R - X\) into sets \(Y_1, \ldots, Y_k\) such that \(F \models X \twoheadrightarrow Z\) if and only if \(Z - X\) is a union of some of the \(Y_i\)s.

For the CTB schema, the dependency basis of \(C\) is \([T, B]\) — Teacher and Book are independent given Course.

Note: unlike FDs, MVDs cannot generally have their right-hand sides split into single attributes.

Fourth Normal Form (4NF)

A lossless-join decomposition \((R_1, R_2)\) of \(R\) with respect to MVDs satisfies \(F \models (R_1 \cap R_2) \twoheadrightarrow (R_1 - R_2)\) (or the symmetric condition).

Schema \(R\) is in 4NF with respect to \(F\) iff for every non-trivial MVD \(X \twoheadrightarrow Y \in F^+\) with \(XY \subseteq R\), \(X\) is a superkey of \(R\). (An MVD is trivial if \(Y \subseteq X\) or \(XY = R\).)

Use a BCNF-like decomposition procedure (substituting MVD violations for FD violations) to obtain a lossless-join 4NF decomposition.

Example — CTB decomposed using \(C \twoheadrightarrow T\):

Course-Teacher            Course-Book
Math      Smith           Math      Algebra
Math      Jones           Math      Calculus
Physics   Black           Physics   Mechanics
                          Physics   Optics

No redundancy remains.

Other Dependencies

Join dependencies generalize MVDs further:

\[\bowtie[R_1, \ldots, R_k] \text{ holds if } I = \pi_{R_1}(I) \bowtie \cdots \bowtie \pi_{R_k}(I)\]

An MVD \(X \twoheadrightarrow Y\) is equivalent to the join dependency \(\bowtie[XY, X(R-Y)]\). Join dependencies cannot always be simulated by MVDs. No finite axiomatization exists for join dependencies. The normal form enforcing join dependencies is 5NF (Project-Join Normal Form).

Inclusion dependencies \(R[X] \subseteq S[Y]\) hold when the projection of \(R\) onto \(X\) is a subset of the projection of \(S\) onto \(Y\). Foreign key constraints are exactly inclusion dependencies.


Basics of Query Execution

How Queries Are Executed

SQL queries go through several transformation stages:

  1. Parsing and type-checking: the SQL is parsed into an abstract syntax tree and type-checked.
  2. Translation to relational algebra: the logical query plan is expressed in a relational algebra.
  3. Optimization: the optimizer generates an efficient query plan using collected statistics.
  4. Plan execution: physical access methods retrieve data; physical operators combine it.

Relational Algebra

The relational algebra is a set of operators on finite relations. Codd proved it is equivalent in expressive power to the relational calculus:

  • Constants: base relation \(R_i\) for each schema relation.
  • Selection \(\sigma_\phi(R)\): keeps only tuples satisfying condition \(\phi\).
  • Projection \(\pi_V(R)\): keeps only the columns in \(V\).
  • Cartesian Product \(R \times S\): all pairs of tuples from \(R\) and \(S\).
  • Union \(R \cup S\): tuples in \(R\) or in \(S\).
  • Set Difference \(R - S\): tuples in \(R\) but not in \(S\).

The translation from relational calculus to algebra follows Codd’s theorem:

\[\text{RCtoRA}(\exists x_i.Q) = \pi_{FV(Q) \setminus \{x_i\}}(\text{RCtoRA}(Q))\]
\[\text{RCtoRA}(Q_1 \land Q_2) = \text{RCtoRA}(Q_1) \times \text{RCtoRA}(Q_2)\]
\[\text{RCtoRA}(Q_1 \lor Q_2) = \text{RCtoRA}(Q_1) \cup \text{RCtoRA}(Q_2)\]
\[\text{RCtoRA}(Q_1 \land \neg Q_2) = \text{RCtoRA}(Q_1) - \text{RCtoRA}(Q_2)\]

The Iterator Model

To avoid storing all intermediate results, each physical operator implements a cursor interface: open(), fetch(), close(). Each operator calls its children through the same interface. This pipelined execution passes tuples one at a time up through the operator tree.

Selection operator (pseudocode):

void open()  { child.open(); }
tuple fetch() {
    tuple t = child.fetch();
    if (t == NULL || t.attr(i) == t.attr(j)) return t;
    return this.fetch();  // recursive: skip non-matching
}
void close() { child.close(); }

Other operators follow the same pattern. Duplicate elimination and aggregation require sorting or hashing and cannot be purely pipelined.
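Python generators give a compact rendering of the same pull-based interface: each operator fetches one tuple at a time from its child, so nothing is materialized. The scan/select/project functions below are sketches, and the employee relation is made up:

```python
# Pipelined operators as generators: each yield hands one tuple upward.
def scan(table):
    for t in table:
        yield t

def select(pred, child):
    for t in child:
        if pred(t):          # keep only tuples satisfying the condition
            yield t

def project(cols, child):
    for t in child:
        yield tuple(t[c] for c in cols)

employee = [("Ann", 28), ("Bob", 41), ("Carol", 35)]
# pi_{name}(sigma_{age > 30}(Employee))
plan = project([0], select(lambda t: t[1] > 30, scan(employee)))
result = list(plan)          # pulls tuples through the whole pipeline
```

Consuming the outermost generator drives the entire tree, exactly as fetch() calls propagate down through the cursor interfaces.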

Making Queries Faster

Naive implementations of the operators are correct but slow. Three categories of improvements:

  1. Indexing: disk-based data structures (B-trees, hash tables) enable efficient lookup without scanning the entire relation.
  2. Better algorithms: sorting and hashing enable more efficient join and duplicate-elimination algorithms than naive nested loops.
  3. Query rewriting: transform the relational algebra expression into an equivalent but cheaper one.

Join Algorithms

Nested-Loop Join (basic):

for t in R do
    for u in S do
        if C(t, u) then output(t, u)

Cost: \(\text{cost}(R) + |R| \cdot \text{cost}(S)\). Very slow for large relations.

Index Join (with a B-tree of depth \(d_S\) on S): Cost: \(\text{cost}(R) + d_S \cdot |R|\).

Sort-Merge Join: sort both relations on the join attribute, then merge. Cost: \(\text{cost}(\text{sort}(R)) + \text{cost}(\text{sort}(S))\) plus a final merging scan; sorting a relation \(E\) adds roughly \((|E|/b)\log(|E|/b)\) block I/Os to the cost of scanning it.

Hash Join: hash both R and S to buckets by the join attribute, then match within each bucket.
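The contrast between the naive and hash-based strategies can be sketched as follows (illustrative Python, joining on the first attribute of each relation; not a DBMS-internals API):

```python
# Nested-loop join vs. hash join on attribute 0 of each tuple.
from collections import defaultdict

def nested_loop_join(R, S):
    # Compares every pair: O(|R| * |S|).
    return [t + u for t in R for u in S if t[0] == u[0]]

def hash_join(R, S):
    # Build a hash table on S's join attribute, then probe with R:
    # roughly O(|R| + |S|) when the table fits in memory.
    buckets = defaultdict(list)
    for u in S:
        buckets[u[0]].append(u)
    return [t + u for t in R for u in buckets[t[0]]]

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (3, "y")]
```

Both produce the same result; the hash join simply avoids comparing tuples that cannot match.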

Query Optimization

A query has many possible plans (different join orders, different operator implementations). The optimizer chooses the best one using:

[Plan diagram: for SELECT * FROM R, S WHERE R.a=S.b AND S.c>5, the selection σ(S.c > 5) is pushed below the join ⋈ R.a = S.b and applied to the scan of S, so fewer rows reach the join.]

“Always Good” Transformations (Heuristics)

  • Push selections down: \(\sigma_\phi(E_1 \bowtie E_2) = \sigma_\phi(E_1) \bowtie E_2\) (if \(\phi\) only involves \(E_1\)’s columns). Reduces intermediate result sizes dramatically.
  • Push projections through joins: eliminate unnecessary columns early.
  • Replace products by joins: \(\sigma_\phi(R \times S) = R \bowtie_\phi S\). Reduces the search space and enables index joins.

Cost-Based Optimization

For conjunctive subqueries, the optimizer must choose join order. Joins are associative and commutative, so \(n\) relations have exponentially many join orderings. The optimizer uses a cost model based on collected statistics.

For each stored relation \(R\) and attribute \(A\), the DBMS maintains:

  • \(|R|\): cardinality (number of tuples).
  • \(b(R)\): blocking factor (tuples per disk block).
  • \(\min(R,A)\), \(\max(R,A)\): value range.
  • \(\text{distinct}(R,A)\): number of distinct values.
From these statistics, the optimizer estimates selectivities and result sizes:

\[\text{sel}(\sigma_{A=c}(R)) \approx \frac{1}{\text{distinct}(R,A)}\]
\[\text{sel}(\sigma_{A \leq c}(R)) \approx \frac{c - \min(R,A)}{\max(R,A) - \min(R,A)}\]
\[|R \bowtie S| \approx |R| \cdot \frac{|S|}{\text{distinct}(S,A)}\]

For foreign-key joins, the estimated size is simply \(|S|\) (the many side).
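The estimates above are simple arithmetic over the stored statistics; a minimal sketch (the statistics values below are made-up examples):

```python
# Textbook selectivity estimates from per-column statistics.

def sel_eq(distinct):
    """sel(sigma_{A=c}) ~ 1 / distinct(R, A), assuming uniform values."""
    return 1.0 / distinct

def sel_leq(c, lo, hi):
    """sel(sigma_{A<=c}) ~ (c - min) / (max - min)."""
    return (c - lo) / (hi - lo)

def join_size(card_R, card_S, distinct_S_A):
    """|R join S| ~ |R| * |S| / distinct(S, A)."""
    return card_R * card_S / distinct_S_A

# Example: R has 10,000 tuples; A ranges over [0, 100] with 50 distinct values.
est_eq = 10_000 * sel_eq(50)            # estimated tuples for A = c
est_rng = 10_000 * sel_leq(25, 0, 100)  # estimated tuples for A <= 25
```

Both rules assume values are uniformly distributed; real optimizers refine this with histograms.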

Pipelined vs. Materialized Plans

Pipelined plans avoid storing intermediate results. A temporary store operator materializes an intermediate result for reuse (e.g., when the same subquery appears multiple times). Left-deep join trees allow full pipelining without recomputation.


Database Tuning and Physical Design

Physical Design and Tuning

Physical design is the process of selecting physical data structures (storage formats, indexes) to implement the conceptual schema efficiently. Tuning is the ongoing process of adjusting those structures as the workload changes.

A workload description specifies:

  • The most important queries and their frequencies.
  • The most important updates and their frequencies.
  • Performance goals for each.
  • For each query: which relations are accessed, which attributes are retrieved, which attributes appear in conditions and how selective those conditions are.

Storage Strategies

For each relation, a storage strategy is chosen:

  • Heap file: unsorted, simplest.
  • Sorted file: sorted on one attribute, enabling binary search.
  • Hash file: hashed on one attribute, enabling O(1) lookup.

Indexes are then added to speed up specific access patterns.

Indexes

The main index types:

  • B⁺-trees: support both equality and range queries; support traversal in sorted order.
  • Hash tables: fast O(1) equality lookup; no support for range queries.
  • ISAM, VSAM: historical static index structures.

Creating an index:

CREATE INDEX LastnameIndex ON Employee(Lastname) [CLUSTER];
DROP INDEX LastnameIndex;

Clustering vs. Non-Clustering Indexes

A clustering index on attribute \(A\) means tuples with similar \(A\)-values are stored together in the same disk block. A relation can have at most one clustering index.

Cost of selection using a clustered index: \(2 + N/b(R)\) (2 for the index lookup, \(N/b(R)\) for retrieving \(N\) tuples from clustered blocks).

Cost of selection using a non-clustered index: \(2 + N\) (pessimistic: each matching tuple may be on a different block).

Cost of a full table scan: \(|R|/b(R)\).
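Plugging numbers into the three cost formulas shows why an unclustered index can lose to a full scan once many tuples match (the figures below are illustrative):

```python
# Block-I/O cost formulas from the text.

def cost_clustered(N, b):
    # 2 index I/Os + N matching tuples packed b per block.
    return 2 + N / b

def cost_unclustered(N):
    # Pessimistic: one block I/O per matching tuple.
    return 2 + N

def cost_scan(card, b):
    # Read every block of the relation.
    return card / b

# Relation with 100,000 tuples, 20 tuples per block, 10,000 matches:
c_clu = cost_clustered(10_000, 20)   # clustered index
c_unc = cost_unclustered(10_000)     # unclustered index
c_scn = cost_scan(100_000, 20)       # full scan
```

Here the clustered index is cheapest, and the unclustered index is worse than simply scanning the relation; the crossover depends on the condition's selectivity.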

Co-Clustering

Two relations are co-clustered when their tuples are interleaved in the same physical file. Useful for hierarchical (1:N) relationships — speeds up foreign-key joins, but slows sequential scans of either relation alone.

Range Queries and B⁺-Trees

B⁺-trees support range queries efficiently: find the leaf for the starting value, then follow forward pointers through the leaf level.

[B⁺-tree diagram: an interior root with keys 30, 60, 90 points to leaf blocks 10|20|25, 30|35|45, 55|60|65, 75|85|95, …; the leaves are chained by forward pointers.]
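The leaf-level access pattern can be simulated over a sorted key list: binary search finds the first qualifying key (standing in for the descent to a leaf), then a forward scan collects the range (standing in for following the leaf chain). A minimal sketch with illustrative key values:

```python
# Simulated B+-tree range query over the sorted leaf level.
import bisect

leaves = [10, 20, 25, 30, 35, 45, 55, 60, 65, 75, 85, 95]

def range_query(keys, lo, hi):
    i = bisect.bisect_left(keys, lo)   # "descend" to the first leaf key >= lo
    out = []
    while i < len(keys) and keys[i] <= hi:
        out.append(keys[i])            # "follow forward pointers" through leaves
        i += 1
    return out
```

The cost is one descent plus one sequential pass over only the qualifying leaves, which is why B⁺-trees handle range predicates so well.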

Multi-Attribute Indexes

An index on (Lastname, Firstname) is useful for queries on Lastname alone or on both Lastname and Firstname. It is not useful for queries on Firstname alone (the second column is not prefix-searchable). It can enable index-only plans where the query’s answer can be read from the index without touching the base table.

Physical Design Guidelines

  1. Don’t add an index unless the performance improvement outweighs the extra update overhead (every insert/update/delete must maintain the index).
  2. Attributes in WHERE clauses are candidates for index search keys.
  3. Consider multi-attribute search keys when a WHERE clause has several conditions, or when an index-only plan is possible.
  4. Choose indexes that benefit many queries simultaneously.
  5. Each relation can have only one clustering scheme — choose it based on the most important query (range queries benefit most from clustering; join queries benefit most from co-clustering).
  6. A multi-attribute index that enables an index-only plan does not benefit from being clustered.

Query Plan Inspection (EXPLAIN)

Most DBMSs provide tools to inspect the optimizer’s chosen plan and its estimated cost. In DB2, db2expln and dynexpln show the plan tree with estimated cost and cardinality. This allows DBAs to:

  • Verify that indexes are being used.
  • Detect poor cost estimates.
  • Update statistics if the optimizer is making bad choices.

Schema Tuning

If physical design alone is insufficient, changes to the conceptual schema may be needed:

  • Denormalization: intentionally merge schemas (reverse of normalization) to eliminate joins. Increases update overhead but decreases query overhead. Appropriate when reads vastly outnumber writes.
  • Partitioning:
    • Horizontal partitioning: split a table into subsets of rows (e.g., separate operational from archival data).
    • Vertical partitioning: split a table into subsets of columns (e.g., separate frequently from infrequently accessed columns, or separate concurrency hot-spots).

Tuning Queries and Applications

Query tuning guidelines:

  1. Sorting is expensive — avoid unnecessary ORDER BY, DISTINCT, or GROUP BY.
  2. Replace subqueries with joins when possible.
  3. Replace correlated subqueries with uncorrelated ones when possible.
  4. Use vendor tools (EXPLAIN) to inspect plans; update statistics if the plan is poor.

Application tuning guidelines:

  1. Minimize communication costs: retrieve the fewest rows and columns necessary; use set-at-a-time updates (WHERE clause) rather than cursor-based row-at-a-time updates.
  2. Minimize lock contention: delay updates as long as possible; delay hot-spot operations; shorten or split transactions; batch inserts/updates/deletes; consider weaker isolation levels for read-heavy workloads.

Transaction Processing

Basics of Transaction Processing

Query and update processing converts high-level SQL requests into reads and writes of physical objects — tuples, pages, or files. The two primary goals are:

  1. Correctness and concurrency: multiple transactions executing simultaneously must not corrupt the data.
  2. Durability: once a transaction commits, its effects are permanent, even if the system fails immediately afterward.

ACID Properties

Transactions must satisfy the ACID properties:

  • Atomicity: all-or-nothing execution. Either all operations in the transaction succeed, or none of them take effect.
  • Consistency: each transaction preserves all integrity constraints. If the database is consistent before the transaction, it is consistent after.
  • Isolation: transactions execute as if they were alone, without seeing each other’s intermediate states.
  • Durability: committed changes are permanent. Hardware failures, power outages, and crashes cannot erase committed work.

Implementation comes in two parts:

  • Concurrency Control: ensures isolation.
  • Recovery Management: ensures durability and atomicity.

Schedules and Serializability

Formally, a schedule is a sequence of operations (reads and writes) from a set of transactions, respecting the internal order of each transaction. A schedule is serializable if it is equivalent to some serial schedule (one where transactions execute one at a time without interleaving).

Example:

  • \(S_a = w_1[x], r_2[x], w_1[y], r_2[y]\) — serializable (equivalent to T1 then T2).
  • \(S_c = w_1[x], r_2[x], r_2[y], w_1[y]\) — not serializable.

Conflict equivalence: Two schedules are conflict-equivalent if all pairs of conflicting operations (operations from different transactions on the same object, at least one of which is a write) appear in the same order. A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.
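The standard test for conflict-serializability builds a precedence graph (an edge \(T_i \to T_j\) for each pair of conflicting operations where \(T_i\)'s comes first) and checks it for cycles. A minimal sketch, using the schedules \(S_a\) and \(S_c\) from the example above:

```python
# Conflict-serializability test: build the precedence graph of a
# schedule and check for cycles. A schedule is a list of
# (transaction, op, object) triples with op in {"r", "w"}.

def conflict_graph(schedule):
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            # Conflict: different transactions, same object, >= one write.
            if ti != tj and xi == xj and "w" in (oi, oj):
                edges.add((ti, tj))   # ti's conflicting op precedes tj's
    return edges

def has_cycle(edges):
    # DFS-based cycle detection over the precedence graph.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {n: 0 for n in graph}     # 0=unseen, 1=in progress, 2=done
    def dfs(n):
        state[n] = 1
        for m in graph[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in graph)

# S_a = w1[x] r2[x] w1[y] r2[y]: all edges point T1 -> T2, no cycle.
Sa = [(1, "w", "x"), (2, "r", "x"), (2, "r", "y"), (1, "w", "y")][:2] + \
     [(1, "w", "y"), (2, "r", "y")]
# S_c = w1[x] r2[x] r2[y] w1[y]: edges T1 -> T2 and T2 -> T1, a cycle.
Sc = [(1, "w", "x"), (2, "r", "x"), (2, "r", "y"), (1, "w", "y")]
```

A DAG can be topologically sorted into an equivalent serial order; a cycle means no serial order exists.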

[Precedence-graph diagrams: in the first schedule, the conflict edges (w1[x] before r2[x], w2[y] before r3[y]) form a DAG over T1, T2, T3, yielding the serial order T1, T2, T3 — serializable. In the second, the edges w1[x]→r2[x] and w2[y]→r1[y] form a cycle between T1 and T2 — a cycle implies the schedule is not conflict-serializable.]

Two-Phase Locking (2PL)

The most widely used concurrency control protocol:

  • Shared lock (S-lock): required to read an object.
  • Exclusive lock (X-lock): required to write an object.

2PL Protocol: a transaction must acquire all its locks before it releases any of them. The transaction has an expanding phase (acquiring locks) followed by a shrinking phase (releasing locks).

Theorem: 2PL guarantees conflict-serializable schedules.

In practice, Strict 2PL is used: all locks are held until the transaction commits or aborts. This additionally guarantees Avoiding Cascading Aborts (ACA) — no transaction reads data written by an uncommitted transaction.

Deadlocks

With 2PL, deadlocks are possible:

T1: r1[x]         w1[y] (blocked by T2)
T2:        r2[y]  w2[x] (blocked by T1)

Strategies:

  • Prevention: only grant locks that cannot lead to a deadlock (e.g., impose an ordering on data items and grant locks only in that order).
  • Detection: build a wait-for graph; detect cycles; abort one transaction. In practice, a timeout is often used as a simpler detection mechanism.

Variations on Locking

  • Multi-granularity locking: lock objects at different granularities (record, page, table) depending on workload. Reduces lock table overhead for bulk operations.
  • Predicate locking: lock based on a predicate (e.g., “all employees in department D”), preventing the phantom problem (where inserting a new row satisfying a predicate retroactively violates serializability).
  • Tree locking: relaxes 2PL for B-tree structures to reduce congestion at the root.

Isolation Levels

Full serializability has performance costs (deadlocks, blocked transactions). SQL supports four isolation levels:

Level | Name             | Guarantee
------+------------------+------------------------------------------------------------------
  3   | Serializable     | Full conflict-serializability (table-level strict 2PL)
  2   | Repeatable Read  | Tuple-level strict 2PL; phantoms may occur
  1   | Cursor Stability | Exclusive locks only at tuple level; non-repeatable reads possible
  0   | (None)           | No locks; may read uncommitted data (“dirty reads”)

MVCC (Multi-Version Concurrency Control) is used by PostgreSQL and other modern databases to implement snapshot isolation efficiently. Each write creates a new version of a row; readers see the snapshot as of their transaction start time without blocking writers.

[MVCC version chain for row employees (id=7): v1 (xmin=100, xmax=200, salary=50000) → v2 (xmin=200, xmax=350, salary=55000) → v3 (xmin=350, xmax=∞, salary=60000, the current live version). Transaction 150 sees v1, transaction 300 sees v2, transaction 400 sees v3.]
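The visibility rule behind the version chain can be sketched in a few lines. This is a simplification: a version created at xmin and superseded at xmax is visible to exactly the transactions that started in between (real systems such as PostgreSQL also track commit status, which is omitted here):

```python
# Simplified MVCC snapshot visibility using xmin/xmax bounds.

INF = float("inf")

def visible(version, txid):
    # Visible iff created at or before txid and not yet superseded.
    return version["xmin"] <= txid < version["xmax"]

# Version chain for one row, matching the example above.
chain = [
    {"xmin": 100, "xmax": 200, "salary": 50000},
    {"xmin": 200, "xmax": 350, "salary": 55000},
    {"xmin": 350, "xmax": INF, "salary": 60000},  # current live version
]

def read(chain, txid):
    # A reader scans the chain and returns its visible version's value.
    for v in chain:
        if visible(v, txid):
            return v["salary"]
```

Readers never block: an old transaction simply keeps seeing the old version while a writer installs a new one at the head of the chain.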

Recovery: Goals and Setting

The recovery manager provides two guarantees:

  1. Committed transactions: their effects are permanent.
  2. Aborted transactions: their effects are completely undone.

Input to the recovery manager: a schedule produced by a 2PL, ACA-compliant scheduler. Output: a schedule of physical reads, writes, and forced writes to disk.

Approaches

Shadowing: copy-on-write and merge-on-commit. Each update creates a shadow copy of the affected page; on commit the shadow pages are merged into the database. It causes poor disk clustering; it was used in early IBM System R but not in modern systems.

Logging (universally used today): a sequential log file records all changes. The log is append-only and stored on a separate disk from the data. Transactions write log records for every modification; on commit, the log entry is forced to disk. On crash, the log is replayed to recover.

Log Records

The log contains:

  • UNDO records: old values of modified objects, used to reverse a transaction that aborts.
  • REDO records: new values of modified objects, used to reapply the effects of a committed transaction.
  • BEGIN/COMMIT/ABORT records: mark the start and end of each transaction.

Write-Ahead Logging (WAL)

To keep the log consistent with the database:

  1. UNDO rule: the log record for an update must reach the log disk before the modified data page reaches the main disk. (Guarantees atomicity: if the system crashes before commit, we can undo the partial changes.)
  2. REDO rule: all log records for a transaction must reach the log disk before the commit record. (Guarantees durability: after commit, all changes can be replayed even if the data pages hadn’t been written.)

WAL allows the DBMS to buffer data writes lazily (keeping data in memory longer for better performance) while still guaranteeing ACID properties through synchronous log writes.
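The two WAL rules can be illustrated with a toy model in which in-memory lists stand in for the log disk, the data disk, and the buffer pool (names and structure are illustrative, not a real recovery manager):

```python
# Toy write-ahead logging model. log_disk and data_disk stand in for
# stable storage; buffer stands in for the in-memory buffer pool.

log_disk, data_disk, buffer = [], {}, {}

def write(tx, key, new):
    old = data_disk.get(key)
    # UNDO rule: the log record (old value for undo, new for redo)
    # is appended BEFORE the data page may reach the main disk.
    log_disk.append(("UPDATE", tx, key, old, new))
    buffer[key] = new              # data page dirtied in memory only

def flush(key):
    # Safe now: the update's log record already reached the log disk.
    data_disk[key] = buffer[key]

def commit(tx):
    # REDO rule: all of tx's update records precede the commit record,
    # so a crash after this point can replay the changes.
    log_disk.append(("COMMIT", tx))

write("T1", "x", 42)
commit("T1")
# If the system crashes here, flush("x") never ran, yet replaying
# log_disk redoes T1's write because the REDO record is durable.
```

The performance win is that data pages can stay dirty in the buffer pool indefinitely; only the sequential log append must be forced at commit.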

Summary

ACID properties guarantee correct concurrent access and persistent storage:

  • Consistency and isolation are achieved through serializability — implemented by the transaction manager using 2PL.
  • Durability and atomicity are achieved through recovery — implemented by the recovery manager using WAL-based logging rather than slow synchronous writes to the main database.