Probability for Computing Science

Course overview

An in-depth course on probability theory tailored to modern computing — from discrete random variables to Bayesian networks, Markov chains and the Poisson process.

Probability for Computing Science equips students with a solid foundation in probability theory, tailored to the needs of modern computing. Emphasizing both theoretical rigor and practical relevance, the course explores discrete and continuous random variables, conditional probability, Markov chains, Poisson processes and Bayesian networks. These probabilistic tools underpin many core areas in computer science — algorithms, machine learning, data analysis and artificial intelligence. Through a combination of conceptual insights and hands-on examples, students learn to model uncertainty, reason under incomplete information, and apply probabilistic thinking to solve complex computational problems. Every concept below has a matching interactive demo you can play with.

The 15 sessions are grouped into five modules. Module I refreshes the discrete probability toolkit and Bayes' theorem; Module II builds Bayesian networks as a compact, graphical language for joint distributions and inference; Module III turns to stochastic dynamics with Markov chains; Module IV moves to continuous time with the exponential distribution, the Poisson process and queues; and Module V consolidates everything through group project presentations, review and the final exam. The running example — a spam filter — is introduced in session 1 and reappears as a Naive Bayes classifier in session 7, tying probabilistic theory to a concrete computing application.

Program

BCSAI — Computer Science & AI

Code

PCS-CSAI.2.M.A

Area

Mathematics

Sessions

15 (live, in-person)

Credits

3.0 ECTS

Academic year

2025–26

Degree course

Second

Semester

2º

Learning objectives

By the end of the course, students will be able to apply key probabilistic tools in both academic and professional contexts, reasoning under uncertainty and validating their conclusions.

Understand and apply fundamental concepts of probability, including random variables, distributions, independence, and conditional probability — and recognise when each model is appropriate.
Analyze and model stochastic processes such as Markov chains and Poisson processes, computing steady-state behavior, absorption times and event rates.
Construct and interpret Bayesian networks to model complex dependencies and support probabilistic inference, reading independence directly off the graph.
Use probabilistic models to solve problems in computing science domains such as algorithms, machine learning, and artificial intelligence — from spam classification to reliability and queueing.

Methodology & assessment

IE's collaborative, active and applied teaching method — 75 estimated hours of student work — paired with continuous evaluation across four components.

Learning activities · 75.0 h total

Lectures26.7%20.0 h

Group work26.7%20.0 h

Individual studying20.0%15.0 h

Discussions13.3%10.0 h

Exercises, async & field work13.3%10.0 h

Assessment weighting

Final exam · MCQ, 3.5 to pass35%

Group work · prototype25%

Group presentation · non-technical20%

Class participation20%

Components — deliverable & evaluation

Final exam — 35%. A comprehensive multiple-choice exam (session 15) covering all five modules. A minimum mark of 3.5 / 10 on the exam is required to pass the course; below it, the other components cannot compensate. Evaluates conceptual understanding and the ability to apply formulas correctly under time pressure.
Group work — 25%. Submission of a small working prototype (typically a Bayesian network model) together with documentation explaining its construction, modelling choices and validation. Generative AI is encouraged here, provided its use is acknowledged and outputs are validated. Evaluates modelling skill, correctness and critical reasoning.
Group presentation — 20%. Teams present their project to the class in a non-technical way (sessions 13–14), communicating the model and its conclusions to a general audience. Evaluates clarity, structure and the ability to translate technical work for stakeholders.
Class participation — 20%. Active, well-behaved participation across the in-person sessions. Evaluates engagement, contribution to discussions and exercises, and professional conduct.

AI policy: generative AI is encouraged for the group project to build an informed, critical perspective. Low-effort prompts give low-quality results; never take GenAI output at face value — assume it is wrong unless you know the answer or can cross-check it, and you remain responsible for any errors. Its use must be acknowledged (a suggested disclosure format is provided in the syllabus); acknowledging AI use does not affect your grade.

Attendance & re-sit rules: the 80% attendance rule is mandatory — students who fail to meet it fail both the ordinary and the extraordinary (June/July) calls and must re-enrol the following year. Students have four calls across two academic years. The June/July re-sit is a single comprehensive exam; the course grade then depends on that exam only (continuous evaluation is disregarded), with a minimum passing grade of 5 and a maximum grade of 8.0 ("notable"). A grade review session is held after grading; attending it is a prerequisite for any appeal.

Program — 15 sessions

Every session is live and in-person. Each topic carries a one-line explanation, the core formula or definition where technical, and a key idea or worked example. Reading tags link to the two recommended texts; demo tags jump to the matching interactive visualization.

Module I — Foundations discrete probability & Bayes' theorem

The course opens by refreshing the language of probability — sample spaces, events, random variables and distributions — and the single most important rule for reasoning from evidence: Bayes' theorem. The spam filter introduced here is the thread that runs through the whole course.

By the end of this module you can

State the axioms of probability and compute with unions, intersections and complements.
Describe a discrete random variable through its probability mass function and recognise standard families.
Apply conditional probability, the law of total probability and Bayes' theorem to update beliefs from data.
Explain how a probabilistic spam filter turns word counts into a posterior probability of "spam".

1
Introduction to the course

Set expectations and motivate probability through a concrete computing application: the spam filter.
- Course structure & expectations — the five modules, workload, assessment and the in-person, applied format.
- Review of basic probability concepts — sample space $\Omega$, events, and the axioms $P(\Omega)=1$, $0\le P(A)\le 1$, and additivity for disjoint events.
- Discussion of spam filters — how an email can be scored as "spam" or "ham" from the words it contains.
- Implementing a simple spam filter — estimating word probabilities from a labelled corpus and combining them.
Core formula · complement & union$P(A^c)=1-P(A)$ and $P(A\cup B)=P(A)+P(B)-P(A\cap B)$.

Key ideaA spam filter is just Bayes' theorem in disguise: each word shifts the probability that a message is spam up or down, and the words are combined to produce a final score.

Ross ch. 1 demo · probability (Venn) demo · Bayes / base-rate

Reading: Ross ch. 1 — sample spaces, axioms and basic counting; the foundation for everything that follows.
2
Review of discrete probability

Consolidate the discrete toolkit — distributions, conditioning and the rule that powers inference.
- Fundamental principles of discrete probability — outcomes, events and counting on finite or countable sample spaces.
- Probability mass functions & distributions — a pmf assigns $p(x)=P(X=x)$ with $\sum_x p(x)=1$; key families are Bernoulli, Binomial, Geometric and Poisson.
- Conditional probability & independence — $P(A\mid B)=P(A\cap B)/P(B)$; events are independent when $P(A\cap B)=P(A)P(B)$.
- Bayes' theorem — invert a conditional to update a prior into a posterior.
Core formula · Bayes' theorem$P(H\mid E)=\dfrac{P(E\mid H)\,P(H)}{P(E)}$, with $P(E)=\sum_i P(E\mid H_i)P(H_i)$ by the law of total probability.

Worked exampleA test is 99% accurate for a disease affecting 1 in 1000. A positive result gives $P(\text{sick}\mid +)=\dfrac{0.99\cdot 0.001}{0.99\cdot 0.001+0.01\cdot 0.999}\approx 0.09$ — only ~9%, because the base rate is tiny.

Ross ch. 1–2 demo · discrete distributions demo · conditional probability

Reading: Ross ch. 1–2 — conditional probability, independence and discrete random variables with their expectations.

Module II — Bayesian networks structure, independence & inference

A Bayesian network represents a high-dimensional joint distribution as a directed acyclic graph plus a conditional probability table at each node. The module develops how to read independence off the graph (conditional independence and d-separation), how to compute posteriors (exact and by sampling), and the Naive Bayes special case used for classification.

By the end of this module you can

Build a Bayesian network from a problem description and specify its CPTs.
Determine whether two variables are conditionally independent using d-separation.
Distinguish exact from approximate (sampling-based) inference and choose appropriately.
Train and apply a Naive Bayes classifier and explain its independence assumption.

3
Introduction to Bayesian networks

Represent joint distributions compactly as a directed graph plus conditional probability tables.
- Basic concepts — nodes are random variables, directed edges encode direct probabilistic dependence.
- Structure & graphical representation — a directed acyclic graph (DAG) whose edges run from parents to children.
- Conditional probability tables (CPTs) — each node stores $P(X\mid \text{parents}(X))$.
- Simple examples — small networks such as the classic "sprinkler / rain / wet grass" model.
Core formula · chain rule for a DAG$P(X_1,\dots,X_n)=\prod_{i=1}^{n} P\!\left(X_i \mid \text{parents}(X_i)\right)$ — the factorization that makes networks compact.

Key ideaA full joint over $n$ binary variables needs $2^n-1$ numbers; if each node has at most $k$ parents the network needs only $O(n\,2^k)$ — an exponential saving.

Koller & Friedman ch. 3 demo · Bayes

Reading: Koller & Friedman ch. 3 — the Bayesian network representation and its independence semantics.
4
Conditional independence

Read independence relations directly off the structure of a network.
- Definition & importance — $X \perp Y \mid Z$ means knowing $Z$ makes $X$ uninformative about $Y$.
- Identifying it in networks — local Markov property: a node is independent of its non-descendants given its parents.
- Practical examples & exercises — using independence to simplify and to justify a network's factorization.
Definition · conditional independence$X \perp Y \mid Z \iff P(X,Y\mid Z)=P(X\mid Z)\,P(Y\mid Z)$, equivalently $P(X\mid Y,Z)=P(X\mid Z)$.

Key ideaConditional independence is what gives the chain rule its savings: every missing edge in the DAG asserts an independence that removes parameters from the model.

Koller & Friedman ch. 3

Reading: Koller & Friedman ch. 3 — conditional independence and the local/global Markov properties.
5
D-separation

Apply the graphical d-separation criterion to decide which variables are independent given evidence.
- Concept of d-separation — a purely graphical test for conditional independence in a DAG.
- Rules for determining it — examine every path through the three connection types: chain, fork and collider (v-structure).
- Using it to infer independence — if every path between $X$ and $Y$ is blocked given $Z$, then $X \perp Y \mid Z$.
- Hands-on practice — tracing paths and deciding blocking on worked networks.
Rule · collider (v-structure)A chain $A\to B\to C$ or fork $A\leftarrow B\to C$ is blocked when $B$ is observed; a collider $A\to B\leftarrow C$ is blocked unless $B$ or a descendant is observed.

Key idea"Explaining away": at a collider, observing the common effect makes its two causes dependent — conditioning can create dependence, not only remove it.

Koller & Friedman ch. 3

Reading: Koller & Friedman ch. 3 — d-separation and its soundness/completeness for reading independence.
6
Inference and sampling

Compute posteriors exactly where feasible, and approximate them by sampling when it is not.
- Methods of probabilistic inference — computing $P(\text{query}\mid \text{evidence})$ from the network.
- Exact vs. approximate inference — variable elimination is exact but can be intractable; sampling trades exactness for scalability.
- Introduction to sampling — forward sampling, rejection sampling and Monte Carlo estimation of probabilities.
Core formula · Monte Carlo estimate$P(A)\approx \dfrac{1}{N}\sum_{i=1}^{N}\mathbf{1}[A \text{ in sample } i]$, with error shrinking like $1/\sqrt{N}$.

Key ideaExact inference in general networks is NP-hard, so for large models we estimate posteriors by drawing many samples and counting — the same Monte Carlo principle behind the CLT demo.

Koller & Friedman ch. 9–12 demo · Monte Carlo demo · CLT (sampling)

Reading: Koller & Friedman ch. 9–12 — exact inference (variable elimination) and approximate, sampling-based inference.
7
Naive Bayes model

Specialize Bayesian reasoning into a fast, surprisingly effective classifier.
- Structure & assumptions — one class node as the parent of all features, which are assumed conditionally independent given the class.
- Training & testing — estimate class priors and per-feature likelihoods by counting; predict the class with the highest posterior.
- Applications — image classification and spam filtering (closing the loop with session 1).
Core formula · Naive Bayes classifier$\hat{y}=\arg\max_{c}\; P(c)\prod_{j} P(x_j\mid c)$ — the "naive" product follows from conditional independence of features.

Key ideaThe independence assumption is usually false, yet Naive Bayes often classifies well because it only needs the ranking of class scores to be right, not their exact probabilities.

Koller & Friedman ch. 17 demo · Bayes

Reading: Koller & Friedman ch. 17 — learning parameters from data and the Naive Bayes classifier.

Module III — Stochastic processes Markov chains

A Markov chain models a system that moves between states over time, where the next state depends only on the current one. The module covers transition matrices, long-run (steady-state) behavior, and absorbing chains with their expected time to absorption — tools behind PageRank, randomized algorithms and reliability models.

By the end of this module you can

Encode a process as a transition matrix and draw its state diagram.
Solve for the stationary distribution of an irreducible, aperiodic chain.
Identify absorbing states and compute expected time to absorption.
Recognise Markov models in computing applications such as web ranking and queueing.

8
Markov chains — Part I

Model memoryless dynamics with transition matrices and find their long-run behavior.
- Definition & properties — the Markov property: the future depends on the present, not the past.
- Transition matrices & state diagrams — $P_{ij}=P(X_{t+1}=j\mid X_t=i)$, with each row summing to 1.
- Examples in computing — PageRank, randomized algorithms, text generation, weather/queue models.
- Steady-state probabilities — the distribution the chain settles into over time.
Core formula · Markov property & stationarity$P(X_{t+1}=j\mid X_t=i,\dots,X_0)=P_{ij}$; the stationary $\boldsymbol{\pi}$ solves $\boldsymbol{\pi}P=\boldsymbol{\pi}$ with $\sum_i \pi_i = 1$.

Key ideaMultiplying a state distribution by $P$ advances it one step; raising $P$ to higher powers ($P^n$) gives $n$-step transition probabilities, which converge to $\boldsymbol{\pi}$ for a "nice" (irreducible, aperiodic) chain.

Ross ch. 4 demo · random walks

Reading: Ross ch. 4 — Markov chains, Chapman–Kolmogorov equations and limiting probabilities.
9
Markov chains — Part II

Tackle absorbing chains and the expected time until the process is trapped.
- Advanced topics — classification of states (recurrent, transient), periodicity and reducibility.
- Absorbing states & time to absorption — once entered, an absorbing state is never left; compute the expected steps to reach one.
- Hands-on exercises — solving real chain models, e.g. gambler's ruin and random walks.
Core formula · expected absorption timeWith transient-block $Q$, the fundamental matrix is $N=(I-Q)^{-1}$; expected steps to absorption from each transient state is $\mathbf{t}=N\mathbf{1}$.

Worked exampleIn gambler's ruin on $\{0,\dots,N\}$ with states $0$ and $N$ absorbing, $N=(I-Q)^{-1}$ gives both the expected number of bets and the probability of ending broke versus rich.

Ross ch. 4 demo · random walks

Reading: Ross ch. 4 — absorbing chains, mean time in transient states and absorption probabilities.

Module IV — Continuous models exponential, Poisson & queues

The course moves from discrete masses to continuous densities, centred on the memoryless exponential distribution and the Poisson process that counts events in continuous time. These combine into queueing theory — the M/M/1 model — used to size servers, networks and service systems.

By the end of this module you can

Work with probability density functions and compute probabilities by integration.
Use the exponential distribution and its memoryless property in reliability and timing problems.
Model event counts with the Poisson process and link inter-arrival times to the exponential.
Analyse an M/M/1 queue and compute its key performance metrics.

10
Continuous random variables & the exponential distribution

Move from masses to densities and meet the memoryless exponential distribution.
- Continuous random variables — outcomes on a continuum, where single points have probability zero.
- Probability density functions (PDFs) — $P(a\le X\le b)=\int_a^b f(x)\,dx$ with $\int_{-\infty}^{\infty} f = 1$.
- Properties of the exponential distribution — models waiting times; its rate $\lambda$ sets the mean $1/\lambda$.
- Applications — component lifetimes in reliability and inter-arrival times in network modelling.
Core formula · exponential distribution$f(x)=\lambda e^{-\lambda x}$ for $x\ge 0$; $E[X]=1/\lambda$; memoryless: $P(X>s+t\mid X>s)=P(X>t)$.

Key ideaMemorylessness means a used component is "as good as new": the time already waited tells you nothing about the time remaining — the defining property of the exponential.

Ross ch. 5 demo · continuous distributions

Reading: Ross ch. 5 — continuous random variables, densities, and the exponential distribution.
11
The Poisson process

Count events in time and connect the Poisson and exponential distributions.
- Definition & characteristics — a counting process with independent, stationary increments at rate $\lambda$.
- Poisson ↔ exponential — the number of events in any interval is Poisson; the gaps between events are exponential.
- Hands-on problems — modelling arrivals, requests and failures over time.
Core formula · Poisson counts$P(N(t)=k)=\dfrac{(\lambda t)^k e^{-\lambda t}}{k!}$, so $E[N(t)]=\lambda t$; inter-arrival times are $\text{Exponential}(\lambda)$.

Key ideaOne rate $\lambda$ describes two views of the same process: how many events occur (Poisson) and how long between them (exponential).

Ross ch. 5 demo · Poisson (discrete) demo · exponential

Reading: Ross ch. 5 — the Poisson process, inter-arrival and waiting times.
12
Queuing theory

Combine arrivals and service into the canonical M/M/1 queue and read off its performance.
- Basic concepts — arrival rate $\lambda$, service rate $\mu$ and utilization $\rho=\lambda/\mu$.
- The M/M/1 model — Poisson arrivals, exponential service, a single server and an infinite queue.
- Performance metrics — mean queue length, waiting time and time in system; analysed via Little's law.
Core formula · M/M/1 (stable when $\rho<1$)Mean number in system $L=\dfrac{\rho}{1-\rho}$; mean time in system $W=\dfrac{1}{\mu-\lambda}$; Little's law $L=\lambda W$.

Key ideaAs utilization $\rho\to 1$ the queue blows up: waiting time grows without bound, which is why systems are sized to keep $\rho$ comfortably below 1.

Ross ch. 8 demo · exponential (service)

Reading: Ross ch. 8 — queueing theory, the M/M/1 model and Little's law.

Module V — Projects & assessment presentations, review & final exam

The final module brings the course together: teams present their Bayesian-network projects to a non-technical audience, the class reviews past exams and open questions, and the program closes with a comprehensive multiple-choice exam.

By the end of this module you can

Communicate a probabilistic model and its conclusions clearly to a general audience.
Integrate concepts across all five modules and self-assess against past exam questions.
Demonstrate mastery of the full syllabus under exam conditions.

13
Group presentations and review

Present the group Bayesian-network projects and consolidate for the exam.
- Project presentations — groups present their Bayesian network projects in a non-technical way.
- Exam review — discussion of past exams and final questions across all modules.
Key ideaCommunicating uncertainty clearly to a non-technical audience is a skill in itself — the presentation is graded on clarity, not on mathematical depth.

demo · Bayes
14
Group presentations and review

Continue project presentations and final exam preparation.
- Project presentations — remaining groups present their Bayesian network projects.
- Exam review — continued discussion of past exams and final questions.
demo · Bayes
15
Final exam

Comprehensive multiple-choice exam covering the full program (35% of the grade, 3.5 to pass).
- Multiple-choice questions across all five modules — discrete probability and Bayes, Bayesian networks, Markov chains, continuous models and queueing.
NoteA minimum mark of 3.5 / 10 on this exam is required to pass the course; the other components cannot compensate below that threshold.

Key concepts

A quick-reference glossary of the terms that recur across the program. Each is defined in one or two sentences.

Sample space ($\Omega$). The set of all possible outcomes of a random experiment; events are subsets of it.
Random variable. A function mapping outcomes to numbers; discrete ones take countable values, continuous ones take values on a continuum.
Probability mass function (pmf). For a discrete variable, $p(x)=P(X=x)$ with $\sum_x p(x)=1$.
Probability density function (pdf). For a continuous variable, $f(x)\ge 0$ with $\int f=1$; probabilities are areas under $f$.
Conditional probability. $P(A\mid B)=P(A\cap B)/P(B)$ — the probability of $A$ once $B$ is known.
Independence. Events with $P(A\cap B)=P(A)P(B)$; knowing one says nothing about the other.
Conditional independence. $X\perp Y\mid Z$: $X$ and $Y$ are independent once $Z$ is known.
Bayes' theorem. $P(H\mid E)=P(E\mid H)P(H)/P(E)$ — updates a prior into a posterior using evidence.
Prior / posterior. Belief about a hypothesis before / after seeing evidence.
Law of total probability. $P(E)=\sum_i P(E\mid H_i)P(H_i)$ over a partition $\{H_i\}$.
Bayesian network. A DAG plus CPTs that factorizes a joint distribution via the chain rule.
Conditional probability table (CPT). The table $P(X\mid \text{parents}(X))$ stored at each node.
D-separation. A graphical criterion that decides conditional independence by checking whether paths are blocked.
Collider / v-structure. A node with two incoming edges ($A\to B\leftarrow C$); conditioning on it can create dependence ("explaining away").
Exact vs. approximate inference. Computing posteriors precisely (e.g. variable elimination) versus estimating them by sampling.
Monte Carlo. Estimating a quantity by averaging over many random samples; error shrinks like $1/\sqrt{N}$.
Naive Bayes. A classifier assuming features are conditionally independent given the class: $\hat y=\arg\max_c P(c)\prod_j P(x_j\mid c)$.
Markov property. The next state depends only on the current state, not the full history.
Transition matrix. $P_{ij}=P(X_{t+1}=j\mid X_t=i)$; each row sums to 1.
Stationary distribution. A distribution $\boldsymbol{\pi}$ with $\boldsymbol{\pi}P=\boldsymbol{\pi}$ — the chain's long-run behavior.
Absorbing state. A state that, once entered, is never left; used to compute time-to-absorption.
Exponential distribution. $f(x)=\lambda e^{-\lambda x}$; the memoryless model of waiting times, mean $1/\lambda$.
Poisson process. Events occurring at constant rate $\lambda$; counts are Poisson, gaps are exponential.
M/M/1 queue. A single-server queue with Poisson arrivals and exponential service; stable when $\rho=\lambda/\mu<1$.
Little's law. $L=\lambda W$ — average number in a system equals arrival rate times average time spent.
Central limit theorem (CLT). Sums/averages of many independent variables are approximately normal, regardless of the original distribution.

Bibliography

Recommended texts, annotated with the sessions they support. Reading tags throughout the program point back to these.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press. ISBN 9780262013192 (Digital) The definitive reference on Bayesian networks: representation, independence, exact and approximate inference, and learning. The primary text for Module II. Used in: sessions 3–7 (introduction, conditional independence, d-separation, inference & sampling, Naive Bayes).
Sheldon Ross. Introduction to Probability Models. Academic Press. ISBN 9780128143476 (Digital) A classic, application-oriented introduction covering discrete and continuous probability, Markov chains, the Poisson process and queueing. The primary text for Modules I, III and IV. Used in: sessions 1–2 (discrete probability & Bayes), 8–9 (Markov chains, ch. 4), 10–11 (continuous & Poisson, ch. 5), 12 (queueing, ch. 8).

Reading-chapter mappings throughout the program are indicative, aligning each session's topics to the relevant chapters of the recommended texts. Both texts are available in digital form through the IE library.