Probability for Computing Science course outline · IE BCSAI

Course overview

An in-depth course on probability theory tailored to modern computing — from discrete random variables to Bayesian networks, Markov chains and the Poisson process.

Probability for Computing Science equips students with a solid foundation in probability theory, tailored to the needs of modern computing. Emphasizing both theoretical rigor and practical relevance, the course explores discrete and continuous random variables, conditional probability, Markov chains, Poisson processes and Bayesian networks. These probabilistic tools underpin many core areas in computer science — algorithms, machine learning, data analysis and artificial intelligence. Through a combination of conceptual insights and hands-on examples, students learn to model uncertainty, reason under incomplete information, and apply probabilistic thinking to solve complex computational problems. Every concept below has a matching interactive demo you can play with.

The 15 sessions are grouped into five modules. Module I refreshes the discrete probability toolkit and Bayes' theorem; Module II builds Bayesian networks as a compact, graphical language for joint distributions and inference; Module III turns to stochastic dynamics with Markov chains; Module IV moves to continuous time with the exponential distribution, the Poisson process and queues; and Module V consolidates everything through group project presentations, review and the final exam. The running example — a spam filter — is introduced in session 1 and reappears as a Naive Bayes classifier in session 7, tying probabilistic theory to a concrete computing application.

Program
BCSAI — Computer Science & AI
Code
PCS-CSAI.2.M.A
Area
Mathematics
Sessions
15 (live, in-person)
Credits
3.0 ECTS
Academic year
2025–26
Degree course
Second
Semester
Category
Compulsory
Language
English
Professor
Manuele Leonelli
Contact
mleonelli@faculty.ie.edu

About the instructor: Manuele Leonelli is an Assistant Professor in the School of Science and Technology at IE University (PhD in Statistics, University of Warwick, 2015). His research centers on probabilistic graphical models for decision-making under uncertainty and inference over extreme values, with published work in JMLR, Information Sciences and Statistics and Computing. Office hours are on request by email. Estimated student workload is 75 hours total. Class category: compulsory. Campus: Madrid / Segovia.

Learning objectives

By the end of the course, students will be able to apply key probabilistic tools in both academic and professional contexts, reasoning under uncertainty and validating their conclusions.

Methodology & assessment

IE's collaborative, active and applied teaching method — 75 estimated hours of student work — paired with continuous evaluation across four components.

Learning activities · 75.0 h total

Lectures26.7%20.0 h
Group work26.7%20.0 h
Individual studying20.0%15.0 h
Discussions13.3%10.0 h
Exercises, async & field work13.3%10.0 h

Assessment weighting

Final exam · MCQ, 3.5 to pass35%
Group work · prototype25%
Group presentation · non-technical20%
Class participation20%

Components — deliverable & evaluation

AI policy: generative AI is encouraged for the group project to build an informed, critical perspective. Low-effort prompts give low-quality results; never take GenAI output at face value — assume it is wrong unless you know the answer or can cross-check it, and you remain responsible for any errors. Its use must be acknowledged (a suggested disclosure format is provided in the syllabus); acknowledging AI use does not affect your grade.

Attendance & re-sit rules: the 80% attendance rule is mandatory — students who fail to meet it fail both the ordinary and the extraordinary (June/July) calls and must re-enrol the following year. Students have four calls across two academic years. The June/July re-sit is a single comprehensive exam; the course grade then depends on that exam only (continuous evaluation is disregarded), with a minimum passing grade of 5 and a maximum grade of 8.0 ("notable"). A grade review session is held after grading; attending it is a prerequisite for any appeal.

Program — 15 sessions

Every session is live and in-person. Each topic carries a one-line explanation, the core formula or definition where technical, and a key idea or worked example. Reading tags link to the two recommended texts; demo tags jump to the matching interactive visualization.

Module I — Foundations discrete probability & Bayes' theorem

The course opens by refreshing the language of probability — sample spaces, events, random variables and distributions — and the single most important rule for reasoning from evidence: Bayes' theorem. The spam filter introduced here is the thread that runs through the whole course.

By the end of this module you can
  • State the axioms of probability and compute with unions, intersections and complements.
  • Describe a discrete random variable through its probability mass function and recognise standard families.
  • Apply conditional probability, the law of total probability and Bayes' theorem to update beliefs from data.
  • Explain how a probabilistic spam filter turns word counts into a posterior probability of "spam".
  1. 1
    Introduction to the course

    Set expectations and motivate probability through a concrete computing application: the spam filter.

    • Course structure & expectations — the five modules, workload, assessment and the in-person, applied format.
    • Review of basic probability concepts — sample space $\Omega$, events, and the axioms $P(\Omega)=1$, $0\le P(A)\le 1$, and additivity for disjoint events.
    • Discussion of spam filters — how an email can be scored as "spam" or "ham" from the words it contains.
    • Implementing a simple spam filter — estimating word probabilities from a labelled corpus and combining them.
    Core formula · complement & union$P(A^c)=1-P(A)$ and $P(A\cup B)=P(A)+P(B)-P(A\cap B)$.
    Key ideaA spam filter is just Bayes' theorem in disguise: each word shifts the probability that a message is spam up or down, and the words are combined to produce a final score.

    Reading: Ross ch. 1 — sample spaces, axioms and basic counting; the foundation for everything that follows.

  2. 2
    Review of discrete probability

    Consolidate the discrete toolkit — distributions, conditioning and the rule that powers inference.

    • Fundamental principles of discrete probability — outcomes, events and counting on finite or countable sample spaces.
    • Probability mass functions & distributions — a pmf assigns $p(x)=P(X=x)$ with $\sum_x p(x)=1$; key families are Bernoulli, Binomial, Geometric and Poisson.
    • Conditional probability & independence — $P(A\mid B)=P(A\cap B)/P(B)$; events are independent when $P(A\cap B)=P(A)P(B)$.
    • Bayes' theorem — invert a conditional to update a prior into a posterior.
    Core formula · Bayes' theorem$P(H\mid E)=\dfrac{P(E\mid H)\,P(H)}{P(E)}$, with $P(E)=\sum_i P(E\mid H_i)P(H_i)$ by the law of total probability.
    Worked exampleA test is 99% accurate for a disease affecting 1 in 1000. A positive result gives $P(\text{sick}\mid +)=\dfrac{0.99\cdot 0.001}{0.99\cdot 0.001+0.01\cdot 0.999}\approx 0.09$ — only ~9%, because the base rate is tiny.

    Reading: Ross ch. 1–2 — conditional probability, independence and discrete random variables with their expectations.

Module II — Bayesian networks structure, independence & inference

A Bayesian network represents a high-dimensional joint distribution as a directed acyclic graph plus a conditional probability table at each node. The module develops how to read independence off the graph (conditional independence and d-separation), how to compute posteriors (exact and by sampling), and the Naive Bayes special case used for classification.

By the end of this module you can
  • Build a Bayesian network from a problem description and specify its CPTs.
  • Determine whether two variables are conditionally independent using d-separation.
  • Distinguish exact from approximate (sampling-based) inference and choose appropriately.
  • Train and apply a Naive Bayes classifier and explain its independence assumption.
  1. 3
    Introduction to Bayesian networks

    Represent joint distributions compactly as a directed graph plus conditional probability tables.

    • Basic concepts — nodes are random variables, directed edges encode direct probabilistic dependence.
    • Structure & graphical representation — a directed acyclic graph (DAG) whose edges run from parents to children.
    • Conditional probability tables (CPTs) — each node stores $P(X\mid \text{parents}(X))$.
    • Simple examples — small networks such as the classic "sprinkler / rain / wet grass" model.
    Core formula · chain rule for a DAG$P(X_1,\dots,X_n)=\prod_{i=1}^{n} P\!\left(X_i \mid \text{parents}(X_i)\right)$ — the factorization that makes networks compact.
    Key ideaA full joint over $n$ binary variables needs $2^n-1$ numbers; if each node has at most $k$ parents the network needs only $O(n\,2^k)$ — an exponential saving.
    Koller & Friedman ch. 3 demo · Bayes

    Reading: Koller & Friedman ch. 3 — the Bayesian network representation and its independence semantics.

  2. 4
    Conditional independence

    Read independence relations directly off the structure of a network.

    • Definition & importance — $X \perp Y \mid Z$ means knowing $Z$ makes $X$ uninformative about $Y$.
    • Identifying it in networks — local Markov property: a node is independent of its non-descendants given its parents.
    • Practical examples & exercises — using independence to simplify and to justify a network's factorization.
    Definition · conditional independence$X \perp Y \mid Z \iff P(X,Y\mid Z)=P(X\mid Z)\,P(Y\mid Z)$, equivalently $P(X\mid Y,Z)=P(X\mid Z)$.
    Key ideaConditional independence is what gives the chain rule its savings: every missing edge in the DAG asserts an independence that removes parameters from the model.
    Koller & Friedman ch. 3

    Reading: Koller & Friedman ch. 3 — conditional independence and the local/global Markov properties.

  3. 5
    D-separation

    Apply the graphical d-separation criterion to decide which variables are independent given evidence.

    • Concept of d-separation — a purely graphical test for conditional independence in a DAG.
    • Rules for determining it — examine every path through the three connection types: chain, fork and collider (v-structure).
    • Using it to infer independence — if every path between $X$ and $Y$ is blocked given $Z$, then $X \perp Y \mid Z$.
    • Hands-on practice — tracing paths and deciding blocking on worked networks.
    Rule · collider (v-structure)A chain $A\to B\to C$ or fork $A\leftarrow B\to C$ is blocked when $B$ is observed; a collider $A\to B\leftarrow C$ is blocked unless $B$ or a descendant is observed.
    Key idea"Explaining away": at a collider, observing the common effect makes its two causes dependent — conditioning can create dependence, not only remove it.
    Koller & Friedman ch. 3

    Reading: Koller & Friedman ch. 3 — d-separation and its soundness/completeness for reading independence.

  4. 6
    Inference and sampling

    Compute posteriors exactly where feasible, and approximate them by sampling when it is not.

    • Methods of probabilistic inference — computing $P(\text{query}\mid \text{evidence})$ from the network.
    • Exact vs. approximate inference — variable elimination is exact but can be intractable; sampling trades exactness for scalability.
    • Introduction to sampling — forward sampling, rejection sampling and Monte Carlo estimation of probabilities.
    Core formula · Monte Carlo estimate$P(A)\approx \dfrac{1}{N}\sum_{i=1}^{N}\mathbf{1}[A \text{ in sample } i]$, with error shrinking like $1/\sqrt{N}$.
    Key ideaExact inference in general networks is NP-hard, so for large models we estimate posteriors by drawing many samples and counting — the same Monte Carlo principle behind the CLT demo.
    Koller & Friedman ch. 9–12 demo · Monte Carlo demo · CLT (sampling)

    Reading: Koller & Friedman ch. 9–12 — exact inference (variable elimination) and approximate, sampling-based inference.

  5. 7
    Naive Bayes model

    Specialize Bayesian reasoning into a fast, surprisingly effective classifier.

    • Structure & assumptions — one class node as the parent of all features, which are assumed conditionally independent given the class.
    • Training & testing — estimate class priors and per-feature likelihoods by counting; predict the class with the highest posterior.
    • Applications — image classification and spam filtering (closing the loop with session 1).
    Core formula · Naive Bayes classifier$\hat{y}=\arg\max_{c}\; P(c)\prod_{j} P(x_j\mid c)$ — the "naive" product follows from conditional independence of features.
    Key ideaThe independence assumption is usually false, yet Naive Bayes often classifies well because it only needs the ranking of class scores to be right, not their exact probabilities.
    Koller & Friedman ch. 17 demo · Bayes

    Reading: Koller & Friedman ch. 17 — learning parameters from data and the Naive Bayes classifier.

Module III — Stochastic processes Markov chains

A Markov chain models a system that moves between states over time, where the next state depends only on the current one. The module covers transition matrices, long-run (steady-state) behavior, and absorbing chains with their expected time to absorption — tools behind PageRank, randomized algorithms and reliability models.

By the end of this module you can
  • Encode a process as a transition matrix and draw its state diagram.
  • Solve for the stationary distribution of an irreducible, aperiodic chain.
  • Identify absorbing states and compute expected time to absorption.
  • Recognise Markov models in computing applications such as web ranking and queueing.
  1. 8
    Markov chains — Part I

    Model memoryless dynamics with transition matrices and find their long-run behavior.

    • Definition & properties — the Markov property: the future depends on the present, not the past.
    • Transition matrices & state diagrams — $P_{ij}=P(X_{t+1}=j\mid X_t=i)$, with each row summing to 1.
    • Examples in computing — PageRank, randomized algorithms, text generation, weather/queue models.
    • Steady-state probabilities — the distribution the chain settles into over time.
    Core formula · Markov property & stationarity$P(X_{t+1}=j\mid X_t=i,\dots,X_0)=P_{ij}$; the stationary $\boldsymbol{\pi}$ solves $\boldsymbol{\pi}P=\boldsymbol{\pi}$ with $\sum_i \pi_i = 1$.
    Key ideaMultiplying a state distribution by $P$ advances it one step; raising $P$ to higher powers ($P^n$) gives $n$-step transition probabilities, which converge to $\boldsymbol{\pi}$ for a "nice" (irreducible, aperiodic) chain.

    Reading: Ross ch. 4 — Markov chains, Chapman–Kolmogorov equations and limiting probabilities.

  2. 9
    Markov chains — Part II

    Tackle absorbing chains and the expected time until the process is trapped.

    • Advanced topics — classification of states (recurrent, transient), periodicity and reducibility.
    • Absorbing states & time to absorption — once entered, an absorbing state is never left; compute the expected steps to reach one.
    • Hands-on exercises — solving real chain models, e.g. gambler's ruin and random walks.
    Core formula · expected absorption timeWith transient-block $Q$, the fundamental matrix is $N=(I-Q)^{-1}$; expected steps to absorption from each transient state is $\mathbf{t}=N\mathbf{1}$.
    Worked exampleIn gambler's ruin on $\{0,\dots,N\}$ with states $0$ and $N$ absorbing, $N=(I-Q)^{-1}$ gives both the expected number of bets and the probability of ending broke versus rich.

    Reading: Ross ch. 4 — absorbing chains, mean time in transient states and absorption probabilities.

Module IV — Continuous models exponential, Poisson & queues

The course moves from discrete masses to continuous densities, centred on the memoryless exponential distribution and the Poisson process that counts events in continuous time. These combine into queueing theory — the M/M/1 model — used to size servers, networks and service systems.

By the end of this module you can
  • Work with probability density functions and compute probabilities by integration.
  • Use the exponential distribution and its memoryless property in reliability and timing problems.
  • Model event counts with the Poisson process and link inter-arrival times to the exponential.
  • Analyse an M/M/1 queue and compute its key performance metrics.
  1. 10
    Continuous random variables & the exponential distribution

    Move from masses to densities and meet the memoryless exponential distribution.

    • Continuous random variables — outcomes on a continuum, where single points have probability zero.
    • Probability density functions (PDFs) — $P(a\le X\le b)=\int_a^b f(x)\,dx$ with $\int_{-\infty}^{\infty} f = 1$.
    • Properties of the exponential distribution — models waiting times; its rate $\lambda$ sets the mean $1/\lambda$.
    • Applications — component lifetimes in reliability and inter-arrival times in network modelling.
    Core formula · exponential distribution$f(x)=\lambda e^{-\lambda x}$ for $x\ge 0$; $E[X]=1/\lambda$; memoryless: $P(X>s+t\mid X>s)=P(X>t)$.
    Key ideaMemorylessness means a used component is "as good as new": the time already waited tells you nothing about the time remaining — the defining property of the exponential.

    Reading: Ross ch. 5 — continuous random variables, densities, and the exponential distribution.

  2. 11
    The Poisson process

    Count events in time and connect the Poisson and exponential distributions.

    • Definition & characteristics — a counting process with independent, stationary increments at rate $\lambda$.
    • Poisson ↔ exponential — the number of events in any interval is Poisson; the gaps between events are exponential.
    • Hands-on problems — modelling arrivals, requests and failures over time.
    Core formula · Poisson counts$P(N(t)=k)=\dfrac{(\lambda t)^k e^{-\lambda t}}{k!}$, so $E[N(t)]=\lambda t$; inter-arrival times are $\text{Exponential}(\lambda)$.
    Key ideaOne rate $\lambda$ describes two views of the same process: how many events occur (Poisson) and how long between them (exponential).

    Reading: Ross ch. 5 — the Poisson process, inter-arrival and waiting times.

  3. 12
    Queuing theory

    Combine arrivals and service into the canonical M/M/1 queue and read off its performance.

    • Basic concepts — arrival rate $\lambda$, service rate $\mu$ and utilization $\rho=\lambda/\mu$.
    • The M/M/1 model — Poisson arrivals, exponential service, a single server and an infinite queue.
    • Performance metrics — mean queue length, waiting time and time in system; analysed via Little's law.
    Core formula · M/M/1 (stable when $\rho<1$)Mean number in system $L=\dfrac{\rho}{1-\rho}$; mean time in system $W=\dfrac{1}{\mu-\lambda}$; Little's law $L=\lambda W$.
    Key ideaAs utilization $\rho\to 1$ the queue blows up: waiting time grows without bound, which is why systems are sized to keep $\rho$ comfortably below 1.

    Reading: Ross ch. 8 — queueing theory, the M/M/1 model and Little's law.

Module V — Projects & assessment presentations, review & final exam

The final module brings the course together: teams present their Bayesian-network projects to a non-technical audience, the class reviews past exams and open questions, and the program closes with a comprehensive multiple-choice exam.

By the end of this module you can
  • Communicate a probabilistic model and its conclusions clearly to a general audience.
  • Integrate concepts across all five modules and self-assess against past exam questions.
  • Demonstrate mastery of the full syllabus under exam conditions.
  1. 13
    Group presentations and review

    Present the group Bayesian-network projects and consolidate for the exam.

    • Project presentations — groups present their Bayesian network projects in a non-technical way.
    • Exam review — discussion of past exams and final questions across all modules.
    Key ideaCommunicating uncertainty clearly to a non-technical audience is a skill in itself — the presentation is graded on clarity, not on mathematical depth.
  2. 14
    Group presentations and review

    Continue project presentations and final exam preparation.

    • Project presentations — remaining groups present their Bayesian network projects.
    • Exam review — continued discussion of past exams and final questions.
  3. 15
    Final exam

    Comprehensive multiple-choice exam covering the full program (35% of the grade, 3.5 to pass).

    • Multiple-choice questions across all five modules — discrete probability and Bayes, Bayesian networks, Markov chains, continuous models and queueing.
    NoteA minimum mark of 3.5 / 10 on this exam is required to pass the course; the other components cannot compensate below that threshold.

Key concepts

A quick-reference glossary of the terms that recur across the program. Each is defined in one or two sentences.

Bibliography

Recommended texts, annotated with the sessions they support. Reading tags throughout the program point back to these.

Reading-chapter mappings throughout the program are indicative, aligning each session's topics to the relevant chapters of the recommended texts. Both texts are available in digital form through the IE library.