course outline · Designing and Using Databases

Designing and Using Databases

Bachelor in Computer Science & Artificial Intelligence (BCSAI) · IE University · a syllabus-driven outline of all 30 sessions.

Databases are central to almost all technology projects. This course provides enough knowledge to work alongside database designers and programmers in a corporate setting — making better decisions about which kind of system to use and designing a solution that fits a client's needs. It surveys the database landscape in use today, from traditional systems that drive business intelligence to advanced systems that power web-scale applications.

The course lays the foundation for working with structured databases and looks at how organizations actually use them — in Business Intelligence and enterprise systems. It has a strong practical focus: you will produce design documentation, draw diagrams, and implement databases using SQL, NoSQL and Big Data technologies (Spark). The aim is to give future leaders the skills to implement new, innovative systems.

The 30 live sessions move deliberately from theory (data models, storage internals, normalisation) to practice (a four-part SQL tutorial track, MongoDB and Spark labs, a Cassandra lab) and finally to scale (distributed and Big-Data systems). Two threads run throughout: the technical design and implementation of databases, and their organizational impact. Both compulsory texts — Petrov's Database Internals and Harrison's Next Generation Databases — are mapped onto the relevant blocks below.

Program

BCSAI

Course code

DUD-CSAI.2.M.A

Area

Computer Science

Sessions

Credits

6.0 ECTS

Workload

150 h

Academic year

25–26

Degree course

Second

Semester

2º

Learning objectives

The course has two central themes: the technical design & implementation of databases, and the organizational impact of these systems.

The design part focuses on understanding system requirements, building cost scenarios, and analyzing systems using Entity-Relationship diagrams and Data Flow Diagrams. The implementation part looks at how databases are used in corporate environments. By the end of the course you will be able to:

Understand the data structures behind database systems
Produce a database design document
Implement databases using SQL, NoSQL and Spark
Assess security and implementation issues
Understand how web developers use databases
Create simple database-backed web projects
Understand the importance of Business Intelligence platforms
Choose a database system appropriate for different types of projects
Work with teams of programmers and designers and participate in decision-making

Methodology & assessment

IE's teaching method is collaborative, active and applied — a mix of lectures, discussion, in-class exercises, group work and individual study, totalling 150 hours.

Learning activities 150 h total

Lectures28.0%

Group work25.3%

Discussions17.3%

Exercises / async / field17.3%

Individual studying12.0%

Estimated hours: lectures 42 · discussions 26 · exercises/async/field 26 · group work 38 · individual study 18.

Assessment weighting graded /10

Final Exam30%

Individual work20%

Intermediate tests20%

Group Presentation10%

Group Work10%

Class Participation10%

A minimum of 3.5/10 (35%) in both the Midterm and Final exam is required to pass; the overall passing grade is 5.0. Below the minimum sends you to the June re-sit regardless of overall grade.

Components — deliverable & how it is evaluated

Final Exam30%

Deliverable: comprehensive written exam in Session 30 covering the whole course. Evaluated on: correctness of SQL/query answers, data-model design reasoning and conceptual understanding. Pass rule: a minimum 3.5/10 is mandatory or you go to the June re-sit.

Intermediate tests20%

Deliverable: the Midterm (Session 15) plus shorter in-term checks. Evaluated on: mastery of design and SQL fundamentals from Blocks 1–4. Pass rule: the Midterm also carries the 3.5/10 minimum.

Individual work20%

Deliverable: individually-submitted exercises and assignments (e.g. SQL tutorials, design documents, lab write-ups). Evaluated on: technical accuracy, clarity of documentation and independent problem-solving.

Group Work10%

Deliverable: the collaborative database project built over the term across SQL, NoSQL and Spark. Evaluated on: design quality, implementation completeness and teamwork.

Group Presentation10%

Deliverable: the final project presentation in Session 29. Evaluated on: the end-to-end solution demoed, design decisions justified and quality of communication.

Class Participation10%

Deliverable: active, prepared engagement in lectures, discussions and in-class exercises. Evaluated on: consistency and quality of contribution across the 30 live sessions.

Pass, attendance & re-sit rules. The overall passing grade is 5.0/10, with a mandatory 3.5/10 minimum in both the Midterm and the Final exam. Students must meet the 80% attendance rule — falling below it fails both the ordinary and extraordinary calls for the year and forces a re-enrolment. Each student has four exam chances over two academic years. The June/July re-sit is a single comprehensive exam (physical presence required at the enrolled campus, no date/format changes), graded only on that exam and capped at 8.0/10. Grade appeals require attending the post-exam review session first.

AI policy. Generative AI is encouraged to build an informed, critical perspective — but minimum-effort prompts give low-quality results, and you must cross-check and acknowledge any AI use. Unacknowledged use violates academic-honesty policies; the syllabus provides a suggested acknowledgement format.

Program — all 30 sessions

Every session is live and in-person. Grouped into seven thematic blocks, each opening with an overview and its learning outcomes; SQL-heavy sessions cross-link the playground demos.

Block 1

Foundations & Data Models

Sessions 1–2

Sets the scene: why data systems matter, where they appear, and the fundamental trade-offs behind choosing one. Introduces the major data models — relational, document and graph — and the languages used to query and define them, giving you the vocabulary the rest of the course builds on.

Learning outcomes

Explain why databases underpin modern technology projects
Compare relational, document and graph data models
Distinguish declarative query languages from imperative code
Recognise the role of DDL in defining a schema

1
Course Introduction

Frame why databases sit at the centre of nearly every technology project and where you encounter them today.
- Learning & having fun — how the course runs: collaborative, applied, project-driven.
- Why databases are so important — they provide durable, queryable, concurrent, consistent storage that applications cannot get from flat files.
- Where we see databases today — from banking and e-commerce to social feeds, IoT and ML feature stores.
- Trade-offs — no system is best at everything; you balance consistency, availability, latency, cost and developer effort.
Key idea There is no universal database. Every choice trades something — consistency vs availability, read speed vs write speed, flexibility vs structure — so the skill is matching the system to the workload.
2
Data Models and Query Languages

Compare the major data models and the languages used to query them.
- Relational vs document model — tables with rows/columns and joins, versus self-contained nested documents (JSON-like) that favour locality.
- Query languages for data — declarative (SQL: say what you want) vs imperative (code the how); declarative lets the optimiser pick the plan.
- Graph-like data models — nodes and edges for highly-connected data (social networks, recommendations) where joins would be expensive.
- Data Definition Language (DDL) — the SQL subset that creates and alters schema objects: tables, columns, types, keys, constraints.
Core concept · DDL DDL defines structure (CREATE, ALTER, DROP); DML manipulates data (INSERT, UPDATE, DELETE, SELECT).
```
CREATE TABLE customer (
  id     INTEGER PRIMARY KEY,
  email  TEXT NOT NULL UNIQUE,
  joined DATE DEFAULT CURRENT_DATE
);
```
Database Internals — Ch. 1

Block 2

Database Design & Storage Internals

Sessions 3–5, 7

The design backbone of the course. You learn how data is physically stored and indexed (hash indexes, SSTables, LSM-trees, B-trees), then how to model it well through normalisation and relationships, and finally how to turn requirements into documented logical and physical designs with DFDs and ER diagrams. Engine-level concerns — partitioning, indexing and replication — show how designs scale.

Learning outcomes

Explain how B-trees and LSM-trees trade read vs write performance
Normalise a schema to remove redundancy and anomalies
Model many-to-many relationships with junction tables
Produce logical/physical designs, DFDs and ER diagrams
Reason about partitioning, indexing and replication for scale

3
Database Design I — Storage Structures

Key elements of the data structures designers must understand to reason about efficiency.
- Hash indexes — an in-memory map from key to file offset; O(1) point lookups but no range scans.
- SSTables — Sorted String Tables: immutable on-disk files kept sorted by key, enabling efficient merges and range reads.
- LSM-Trees — Log-Structured Merge trees buffer writes in memory then flush sorted runs; write-optimised (Cassandra, RocksDB).
- B-Trees — balanced, page-based trees that keep keys sorted for fast point and range queries; the default for most relational engines.
Core concept · indexing An index is an auxiliary structure that speeds reads at the cost of slower writes and extra space. B-trees favour balanced read/write; LSM-trees favour heavy write throughput.

Database Internals — Pt. Iplayground · index lookup
4
Database Design II — Next-level Efficiencies

Move from single-table structures to relationships and engine-level scaling concerns.
- Normalisation — organising columns/tables so each fact is stored once, removing update/insert/delete anomalies.
- Many-to-many relationships — resolved with a junction (associative) table holding two foreign keys.
- Database engines — the pluggable storage layer (e.g. InnoDB, MyISAM) that decides on-disk layout and transaction support.
- Partitioning & indexes — splitting a large table by key range/hash so each partition is smaller and faster to scan.
- Replication — copying data across nodes for availability and read scaling (leader–follower or multi-leader).
Core concept · normalisation 1NF: atomic values, no repeating groups. 2NF: no partial dependency on part of a composite key. 3NF: no transitive dependency — non-key columns depend only on the key. Higher forms reduce redundancy at the cost of more joins.

Database Internals — Pt. IIplayground · schema
5
Database Design III — The Final Steps

Turn requirements into logical and physical designs documented with standard diagrams.
- Logical vs physical design — logical captures entities/relationships independent of any engine; physical adds tables, types, indexes and storage choices.
- Data Flow Diagrams (DFDs) — show how data moves between processes, stores and external entities.
- Entity-Relationship (ER) diagrams — model entities, attributes and cardinalities (1:1, 1:N, M:N) before they become tables.
- SQL Tutorial I — the SELECT statement — the first hands-on SQL: projecting columns, ordering, limiting.
```
SELECT name, joined
FROM customer
ORDER BY joined DESC
LIMIT 10;
```
playground · query editor
7
Relational Data Models

Consolidate relational design and revisit how engine choices interact with the schema.
- Normalisation (revisited) — applying the normal forms to a fuller schema and weighing when to denormalise for speed.
- Many-to-many relationships — junction tables in practice, including extra attributes on the relationship itself.
- Database engines — how the chosen engine affects transactions, locking and indexing options.
- Partitioning & indexes — choosing partition keys and composite indexes to match query patterns.
Key idea Normalisation removes redundancy; denormalisation deliberately re-introduces it to avoid expensive joins. The right point on that spectrum depends on the read/write mix.

playground · schema

Block 3

SQL Deep Dive

Sessions 6, 8–12

A four-part SQL tutorial track that takes you from filtering single tables to combining them with set-theoretic joins, then to aggregation, window functions and CTEs, temporal queries, and finally getting data in and out at scale. This is the most hands-on, playground-linked block.

Learning outcomes

Filter and categorise rows with WHERE and CASE
Choose the correct JOIN and justify it with set theory
Aggregate, window and modularise queries with CTEs
Query temporal data and time-series
Bulk-load and export data with correct encodings

6
SQL Tutorial II — Filtering & Categorization

Filter and shape result sets with conditional logic.
- WHERE clauses — row-level predicates with AND/OR/NOT, IN, BETWEEN, LIKE and NULL handling.
- CASE clauses — inline conditional expressions to bucket or relabel values in the result set.
```
SELECT name,
  CASE WHEN spend > 100 THEN 'VIP' ELSE 'std' END AS tier
FROM customer
WHERE joined >= '2025-01-01';
```
playground · query editor
8
SQL Joins and Set Theory

Ground JOINs in the set theory that defines their results.
- SQL joins — INNER (matches only), LEFT/RIGHT (keep one side), FULL (keep both), CROSS (cartesian product).
- Set theory — joins and UNION/INTERSECT/EXCEPT map directly onto set union, intersection and difference.
- De Morgan's laws & logical equivalence — ¬(A∧B)=¬A∨¬B; used to rewrite and simplify WHERE predicates correctly.
Core concept · joins An INNER JOIN returns the intersection of matching rows; a LEFT JOIN returns all left rows plus matches (NULLs where none). Picture each join as a region of a Venn diagram.
```
SELECT c.name, o.total
FROM customer c
LEFT JOIN orders o ON o.customer_id = c.id;
```
playground · JOIN Venn
9
SQL Tutorial III — Aggregation & Windows

Summarise and analyse data beyond simple row selection.
- Aggregating results — GROUP BY with COUNT/SUM/AVG/MIN/MAX and HAVING to filter groups.
- Window functions — compute running totals, ranks and moving averages without collapsing rows, using OVER (PARTITION BY … ORDER BY …).
- Common Table Expressions (CTEs) — named WITH subqueries that make complex queries readable and support recursion.
```
SELECT name, spend,
  RANK() OVER (ORDER BY spend DESC) AS rnk
FROM customer;
```
playground · query editor
10
SQL Tutorial IV — Time & Time-series

Work with temporal data and the databases purpose-built for it.
- Date & time functions — truncating, extracting parts, computing intervals and handling time zones.
- Time-series analysis in SQL — bucketing by interval, gap-filling and rolling aggregates over ordered timestamps.
- Time-series databases — engines (e.g. InfluxDB, TimescaleDB) optimised for append-heavy, timestamp-keyed data.
Key idea Time-series workloads are append-mostly and queried by range, so they favour storage sorted by time and downsampling/retention policies over general-purpose indexing.

playground · query editor
11
Advanced SQL Topics

Bridge SQL to analytics and application workflows.
- Exploratory Data Analysis — profiling distributions, nulls and outliers with SQL before modelling.
- Machine Learning datasets — shaping query output into feature tables for training.
- Loading from a database to an application — cursors, connection pooling and parameterised queries from app code.
Key idea The database is increasingly the feature store for ML — clean, well-modelled SQL output is what feeds models, so query quality is data quality.
12
Data Outside the Database

Get data in and out at scale and handle file formats correctly.
- Storing & modifying data — transactional INSERT/UPDATE/DELETE vs bulk paths, and when each is appropriate.
- Bulk data load operations — tools like COPY/LOAD DATA that bypass per-row overhead for large imports.
- SQLDump, file types & encodings — logical dumps for backup/migration; CSV/JSON/Parquet trade-offs; UTF-8 and delimiter pitfalls.
Key idea Most real-world data pain is at the boundary — encodings, delimiters and types. Getting import/export right prevents silent corruption.

Block 4

Data Warehousing, Web Apps & Business Intelligence

Sessions 13–17

Shifts from transactional (OLTP) thinking to analytical (OLAP): dimensional modelling with facts and dimensions, star vs snowflake schemas, and the place of databases in data-engineering pipelines. You then connect databases to dynamic web sites and build a Business-Intelligence data application, with the Midterm sitting at the block's centre.

Learning outcomes

Design star and snowflake schemas from facts and dimensions
Place databases within an enterprise data pipeline
Connect a database to a dynamic, database-driven web site
Explain the role of BI in enterprise IT and build a BI app

13
Data Warehouse and Data Mart Design

Design analytical schemas and place databases within engineering pipelines.
- Facts & dimensions — facts hold measurable events (sales, clicks); dimensions hold the descriptive context (time, product, customer).
- Star vs snowflake schema — star keeps dimensions denormalised in one table each (fast, simple); snowflake normalises them (less redundancy, more joins).
- Enterprise data architectures — warehouses, data marts, lakes and the patterns that connect them.
- Databases in data-engineering pipelines — where ETL/ELT sources and lands data.
Core concept · OLTP vs OLAP OLTP is many small read/write transactions (normalised, app-facing); OLAP is large analytical scans (dimensional, report-facing). Warehouses are OLAP-shaped.
14
Database Applications I — Web Applications

Connect databases to dynamic, database-driven web sites.
- Dynamic, database-driven web sites — how a request becomes a query and a query becomes a rendered page.
- Common stacks — LAMP (Linux/Apache/MySQL/PHP) and Flask-based Python stacks; where the database sits in each.
Key idea The web app is just a thin layer over the database: most user actions map to a CRUD operation, so a clean schema makes the app simpler.
15
Midterm Exam

Comprehensive assessment of design and SQL fundamentals from Blocks 1–4. A minimum 3.5/10 is required to pass the subject.

assessment · exam · 3.5/10 min
16
Database Applications II — Business Intelligence

Understand where BI fits in enterprise IT and start building a data app.
- Role of BI in enterprise IT — turning warehoused data into dashboards and decisions for non-technical users.
- Demo of BI platforms — a tour of how tools sit on top of the warehouse.
- Building a data application with Dash — a Python framework for interactive, database-backed analytics apps.
demo · BI platform
17
Database Applications III — Business Intelligence (cont.)

Continue the BI application and reflect on top-down vs bottom-up data design.
- Building a data application with Dash — extending the app with interactivity and live queries.
- Designing data: top-down vs bottom-up — starting from business questions (top-down) vs from available source data (bottom-up).

Block 5

NoSQL & Document Databases

Sessions 18–21

Introduces the NoSQL family and focuses on document databases (MongoDB). You learn document operations, partitioning/sharding, the MongoDB Query Language, and how to prototype a full document-backed web app on the MEAN stack — contrasting flexible schemas against the relational rigour of Blocks 2–3.

Learning outcomes

Explain when NoSQL beats relational and at what cost
Perform CRUD and queries on document collections (MQL)
Reason about partitioning and sharding for scale
Prototype a Node.js + MongoDB web application

18
Introduction to NoSQL

Survey the NoSQL landscape and its core data structures.
- What is NoSQL & when to use it — non-relational stores (key-value, document, column-family, graph) that trade strict schema/joins for scale and flexibility.
- Relevant NoSQL data structures — a sample implementation of the most common structures and their access patterns.
Core concept · CAP theorem A distributed store can guarantee at most two of Consistency, Availability, Partition-tolerance. Since network partitions are unavoidable, real systems choose between consistency and availability under partition.

Next Generation Databases — Pt. I
19
Document Databases

Work with document collections and the basics of MongoDB.
- Operations on document collections — insert/find/update/delete over BSON documents; querying nested fields and arrays.
- Partitioning & sharding — splitting a collection across nodes by shard key to scale writes and storage horizontally.
- Query processing using JavaScript — aggregation pipelines and expressions evaluated in the engine.
- MongoDB & PyMongo — driving MongoDB from Python.
```
// orders over 100, newest first
db.orders.find(
  { total: { $gt: 100 } }
).sort({ created: -1 })
```
Next Generation Databases — Pt. II
20
MongoDB Lab 1

Hands-on lab answering real questions against real datasets.
- Answering questions with MQL — writing MongoDB Query Language queries (find, filter, project, aggregate) on real-world datasets.
Lab outcome Translate analytical questions into MQL aggregation pipelines and interpret the results — the document-store analogue of the SQL tutorials.

lab · MongoDB
21
Designing Web-Apps with MongoDB

Prototype a full document-backed web application.
- Prototyping with Node.js & MongoDB — wiring a JavaScript server to a document store end-to-end.
- The MEAN stack — MongoDB · Express · Angular · Node.js — an all-JavaScript web stack.
Key idea A document model lets the same JSON shape flow from database to API to front-end, which is why all-JS stacks like MEAN pair naturally with MongoDB.

Block 6

Big Data & Distributed Processing with Spark

Sessions 22–26

Moves to web-scale processing: the Hadoop ecosystem and MapReduce, then Apache Spark's architecture, cluster tuning, and batch-processing labs, finishing with SparkSQL and Spark ML. The theme is distributing computation across a cluster rather than scaling a single machine.

Learning outcomes

Explain MapReduce and massively-parallel processing
Describe Spark's architecture and tune a cluster
Run Spark batch jobs to transform large datasets
Combine SQL and ML on Spark (SparkSQL, Spark ML)

22
Introduction to Big Data & the Hadoop Ecosystem

Understand distributed processing paradigms and the foundations of Hadoop.
- Big Data technologies & frameworks — the "3 Vs" (volume, velocity, variety) and the tools built to handle them.
- Distributed & massively-parallel processing — splitting work across many commodity machines instead of one large server.
- Hadoop & MapReduce — HDFS for distributed storage; MapReduce for distributed computation.
Core concept · MapReduce A two-phase model: Map transforms each input record into key-value pairs; Reduce aggregates all values sharing a key. The framework handles distribution, shuffling and fault tolerance.

Next Generation Databases — Big Data
23
Spark Basics

Learn Spark's architecture and how to configure and tune a cluster.
- Spark architecture & concepts — driver, executors and RDDs/DataFrames; lazy evaluation with a DAG of transformations.
- Cluster configuration & tuning — executor memory/cores, partitions and shuffle settings.
- Basic Spark usage demo — loading and transforming data interactively.
Key idea Spark keeps data in memory across stages and only computes when an action is called (lazy evaluation), making it far faster than disk-bound MapReduce for iterative work.
24
Spark Batch Processing — Lab I

Apply Spark batch-processing functions in a hands-on lab.
- Batch processing functions — map, filter, groupBy, join and aggregate over distributed DataFrames.
lab · Spark
25
Spark Batch Processing — Lab II

Continue and extend the Spark batch-processing exercises.
- Batch processing functions (cont.) — more complex pipelines, joins across datasets and performance considerations.
lab · Spark
26
SparkSQL and Spark ML

Combine SQL and machine learning on Spark and review the latest trends.
- SparkSQL — running SQL queries over distributed DataFrames with the Catalyst optimiser.
- Spark ML — distributed machine-learning pipelines (features, models, evaluation) at scale.
- Latest trends & top tips — where the ecosystem is heading and practical advice.
Key idea SparkSQL lets you express distributed computation in familiar SQL while the optimiser plans execution across the cluster — bridging the SQL of Block 3 with Big-Data scale.

Block 7

Distributed Databases & Final Assessment

Sessions 27–30

Caps the course with a wide-column distributed database (Cassandra): its ring architecture, tunable replication, CQL, and probabilistic structures like Bloom filters. The block closes with the final project presentations and the comprehensive final exam.

Learning outcomes

Describe Cassandra's ring, tokens and replication model
Model and query data with CQL and collections
Explain Bloom filters and tunable consistency
Present and defend an end-to-end database solution

27
Distributed Databases: Cassandra

Master Cassandra's data model, distributed architecture and query language.
- Cassandra data model — wide-column, query-first design: you model tables around the queries you need, not normal forms.
- Distributed architecture — masterless peer-to-peer cluster with no single point of failure.
- Nodes, rings, tokens, replication — data is hashed onto a token ring; the replication factor decides how many nodes hold each row.
- CQL — a SQL-like query language for Cassandra (no joins; partition-key driven).
- Data types & collections — scalars plus set/list/map collection columns.
- Secondary indexes — querying on non-key columns, with their performance caveats.
- Materialised & probabilistic structures · Bloom filters — a Bloom filter is a space-efficient set test: "definitely not present" or "possibly present", used to skip disk reads.
Core concept · tunable consistency Cassandra trades CAP per query: you pick a consistency level (e.g. ONE, QUORUM, ALL) so each read/write balances latency against how many replicas must agree.

Next Generation Databases — Cassandra
28
Cassandra Lab

Hands-on lab applying the Cassandra concepts from Session 27.
- Cassandra Lab — creating keyspaces/tables, writing CQL, and observing replication and consistency in practice.
lab · Cassandra
29
Final Project Presentations

Present the group project, demonstrating an end-to-end database solution and justifying the design choices made across SQL, NoSQL and Spark.

assessment · presentation · 10%
30
Final Exam

Comprehensive final assessment (30% of the grade). A minimum 3.5/10 is required; below it sends you to the June re-sit regardless of overall grade.

assessment · exam · 30% · 3.5/10 min

Key concepts — glossary

A quick reference to the core ideas that recur across the course, from storage internals to distributed systems.

Data model S2

The abstract shape of data — relational (tables), document (nested records), graph (nodes/edges) — that determines how it is stored and queried.

DDL / DML S2

Data Definition Language defines schema objects (CREATE, ALTER); Data Manipulation Language reads and changes data (SELECT, INSERT, UPDATE).

Index S3

An auxiliary structure (often a B-tree) that speeds lookups at the cost of slower writes and extra storage.

B-tree S3

A balanced, page-oriented tree keeping keys sorted for fast point and range queries; the default index in relational engines.

LSM-tree & SSTable S3

A write-optimised design that buffers writes in memory and flushes immutable sorted files (SSTables), merging them in the background.

Normalisation S4

Organising a schema (1NF→3NF and beyond) so each fact is stored once, eliminating insert/update/delete anomalies.

Denormalisation S7

Deliberately re-introducing redundancy to avoid expensive joins and speed up reads.

Primary / foreign key S4

A primary key uniquely identifies a row; a foreign key references a primary key in another table to enforce relationships.

Many-to-many S4

A relationship resolved with a junction (associative) table holding two foreign keys.

ER diagram / DFD S5

ER diagrams model entities, attributes and cardinalities; Data Flow Diagrams show how data moves between processes and stores.

ACID S4

Transaction guarantees: Atomicity, Consistency, Isolation, Durability — the contract that keeps relational data correct under concurrency and failure.

JOIN S8

Combining rows from tables on a condition: INNER (matches only), LEFT/RIGHT (keep one side), FULL (keep both), CROSS (cartesian).

Set operations S8

UNION, INTERSECT and EXCEPT combine result sets as set union, intersection and difference.

Aggregation S9

GROUP BY with functions like SUM/COUNT/AVG, filtered by HAVING, to summarise many rows into groups.

Window function S9

A function computed OVER a partition of rows (ranks, running totals) without collapsing them.

CTE S9

A Common Table Expression — a named WITH subquery that improves readability and enables recursion.

OLTP vs OLAP S13

OLTP handles many small transactions (app-facing); OLAP runs large analytical scans (report-facing); warehouses are OLAP-shaped.

Star / snowflake schema S13

Dimensional models around a central fact table; star keeps dimensions denormalised, snowflake normalises them.

Facts & dimensions S13

Facts record measurable events; dimensions provide descriptive context (time, product, customer) for slicing them.

NoSQL S18

Non-relational stores (key-value, document, column-family, graph) trading strict schema and joins for scale and flexibility.

CAP theorem S18

A distributed store can guarantee at most two of Consistency, Availability and Partition-tolerance; partitions force a choice between C and A.

Sharding / partitioning S19

Splitting data across nodes by a key to scale storage and throughput horizontally.

Replication S4

Keeping copies of data on multiple nodes for availability and read scaling (leader–follower or multi-leader).

MapReduce S22

A distributed model: Map emits key-value pairs per record; Reduce aggregates values per key, with the framework handling distribution.

Spark RDD / DataFrame S23

Spark's distributed datasets evaluated lazily as a DAG, kept in memory across stages for fast iterative processing.

Tunable consistency S27

Choosing per-query consistency levels (e.g. QUORUM) to balance latency against how many replicas must agree (Cassandra).

Bloom filter S27

A space-efficient probabilistic set test answering "definitely not present" or "possibly present" — used to skip unnecessary disk reads.

Bibliography

Compulsory reading — both titles are available digitally and are mapped to the blocks where they are most relevant.

Alex Petrov. Database Internals. O'Reilly. ISBN 978149204034 (Digital) The reference for storage engines and the single-node internals at the heart of Blocks 1–2: B-trees, LSM-trees/SSTables, hash indexes, transactions and how engines store and retrieve data on disk. → Sessions 2–7
Guy Harrison. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. ISBN 9781484213292 (Digital) The companion for the distributed and non-relational half of the course: NoSQL and document stores, Big-Data frameworks and the architecture of systems like Cassandra. → Sessions 18–19, 22, 27

Behaviour, attendance and ethics are governed by the University's Code of Conduct, Attendance Policy and Ethics Code; the Program Director may provide further indications.

Designing and Using Databases

Learning objectives

Methodology & assessment

Learning activities 150 h total

Assessment weighting graded /10

Components — deliverable & how it is evaluated

Program — all 30 sessions

Foundations & Data Models

Course Introduction

Data Models and Query Languages

Database Design & Storage Internals

Database Design I — Storage Structures

Database Design II — Next-level Efficiencies

Database Design III — The Final Steps

Relational Data Models

SQL Deep Dive

SQL Tutorial II — Filtering & Categorization

SQL Joins and Set Theory

SQL Tutorial III — Aggregation & Windows

SQL Tutorial IV — Time & Time-series

Advanced SQL Topics

Data Outside the Database

Data Warehousing, Web Apps & Business Intelligence

Data Warehouse and Data Mart Design

Database Applications I — Web Applications

Midterm Exam

Database Applications II — Business Intelligence

Database Applications III — Business Intelligence (cont.)

NoSQL & Document Databases

Introduction to NoSQL

Document Databases

MongoDB Lab 1

Designing Web-Apps with MongoDB

Big Data & Distributed Processing with Spark

Introduction to Big Data & the Hadoop Ecosystem

Spark Basics

Spark Batch Processing — Lab I

Spark Batch Processing — Lab II

SparkSQL and Spark ML

Distributed Databases & Final Assessment

Distributed Databases: Cassandra

Cassandra Lab

Final Project Presentations

Final Exam

Key concepts — glossary

Bibliography