Syllabus-driven course map · 25–26

AI: Natural Language
Processing & Semantic Analysis

The complete structure of the course — five modules, thirty live sessions, and the statistical-to-neural arc of modern NLP — laid out session by session with the core concept behind each one, and cross-linked to the 23 interactive demos and flashcards in this lab.

5 modules · 30 sessions · 6.0 ECTS credits · 150 h student workload
Code: AINLP-CSAI.3.M.A
Programme: BCSAI · Computer Science & AI
Area: Computer Science
Category: Compulsory
Year / Semester: Third · 2nd
Language: English
Professor: Juan José Manjarín Colón
Contact: jjmanjarin@faculty.ie.edu
Subject description

What this course is about

Natural Language Processing sits at the centre of today's technology revolution — the breakthroughs behind ChatGPT, Bard and their successors are just the starting line. Over the next few years NLP and text analysis will permeate every sector, reshaping data-driven decisions and letting us decode human communication in ways we never thought possible. This course turns that promise into skill: it walks the full arc from rule-based and statistical methods through machine learning to the deep-learning and transformer architectures that power modern language models, always pairing theory with hands-on practice so you can apply these tools to real data-science problems.

Prerequisites & how this course connects. You should arrive comfortable with Python (functions, NumPy, pandas), linear algebra (vectors, dot products, matrices — the language of embeddings and attention), probability (conditional probability and Bayes' rule, used in n-grams and Naive Bayes) and calculus (the chain rule, which is backpropagation). The course builds directly on the BCSAI Machine Learning and Mathematics for AI strands and feeds forward into Deep Learning, Generative AI and any capstone that touches text. Conceptually it is one continuous story — counting words → weighting them → embedding them → attending over them — so each module is a prerequisite for the next.
Weekly study-load (6 ECTS ≈ 150 h). Across ~15 teaching weeks that is roughly 10 hours per week: about 2.7 h of live lecture, ~3.7 h of in-class/asynchronous exercises and field work, ~1.7 h of group-project work, ~1 h of discussion and ~1 h of individual study and review. The single largest block is applied exercise work (36.7 %) — this is a build-it course, not a lecture course. Budget extra time in the deep-learning weeks (Module 4), where the conceptual density per session peaks.
Learning objectives

What you will be able to do

The course is organised around ten thematic strands. By the end you should be fluent in moving a problem across all of them — from raw text to a deployed, evaluated model.

01 · Introduction & history of NLP: situate the field, its applications and its evolution from rules to foundation models.
02 · NLP tools & resources: work confidently with NLTK, spaCy, Gensim and Hugging Face Transformers.
03 · Text processing & statistics: tokenize, normalize and quantify text with TF-IDF and frequency measures.
04 · Information retrieval & sentiment: rank documents and detect opinion polarity.
05 · Machine learning for text: classify text with Naive Bayes and logistic regression.
06 · Word representations: understand vector space models and learned embeddings (Word2Vec).
07 · Advanced NLP with neural networks: build feed-forward, recurrent and LSTM language models.
08 · Transformers & pre-trained models: use self-attention, BERT and GPT via transfer learning.
09 · Linguistic features in NLP: apply POS tagging and Named Entity Recognition.
10 · Question answering systems: design extractive and generative QA pipelines.
Teaching methodology

How the course is taught

IE University's method is collaborative, active and applied: the professor leads and guides while students build knowledge through a mix of lectures, hands-on practice, projects and peer learning. The table below shows how the 150-hour workload is distributed across learning activities.

Activity · Description · Share of workload
Lectures · Interactive lectures with worked examples · 26.7%
Exercises, async sessions & field work · Largest single block (applied work) · 36.7%
Group work · Collaborative project building · 16.7%
Discussions · In-class debate & reflection · 10.0%
Individual studying · Pre/post-work and review · 10.0%

Lectures & hands-on practice

Concepts are introduced in interactive lectures, then immediately practiced — students write, run and debug code in class, working in groups and sharing knowledge.

Project-based learning

A semester-long group project applies course techniques to a real NLP problem: identify, design, implement, then present to the class for feedback.

Critical GenAI use

Generative AI is encouraged — but you must verify its output, never take it at face value, and acknowledge its use. Acknowledging AI never lowers your grade; failing to is an integrity violation.

Pre and post-work. Necessary readings are announced before each session, and selected exercises are indicated at the end of each session. The program below is tentative — pace adapts to group performance, so some variation in topics may occur.
Evaluation criteria

How you are graded

Five components make up the final grade. The final exam and the group project together carry the most weight; continuous work (exercises, quizzes, participation) rewards steady engagement across the semester.

Final exam · 35%
Group project · 30%
Exercises · 20%
Quizzes · 10%
Participation · 5%
35%

Final exam

A comprehensive Blackboard quiz on all course content, combining multiple-choice and open-ended questions.

30%

Group project

Implement an NLP application or research a studied tool in depth; submit via Turnitin and deliver a 15-minute class presentation.

20%

Exercises

In-class exercises submitted individually through Turnitin — one week from the start of each exercise to submit.

10%

Quizzes

A quiz after each module, completed before the next synchronous session.

5%

Participation

Active engagement in in-class activities, discussions and exercises — central to an applied course.


Note: 100% total

35 + 30 + 20 + 10 + 5 = 100%. The exam and project alone account for 65% of the grade.

Attendance

Students who do not meet the 80% attendance rule fail both the ordinary and extraordinary calls for the year and must re-enroll the following academic year.

Late assignments

Penalised 5% per 24-hour period from the due date. Changes to due dates must be agreed with the professor before the deadline.

Re-sit / re-take (June–July)

A single comprehensive exam; continuous evaluation is not counted. Pass mark is 5, capped at 8.0 ("notable"). Students who failed on attendance cannot re-sit.

Calls & retakers

Four allowed calls over two academic years. Retake (3rd call) is capped at 10.0. Failing >18 ECTS in a year after re-sits means leaving the programme.

Per-component rubric

What "good" looks like

Each component rewards a different skill. Use these criteria as a checklist before you submit.

Final exam · 35%
Format: Blackboard quiz, multiple-choice + open-ended, covering all five modules. Graded on: conceptual accuracy, ability to read a formula and say what it does, and reasoning across the pipeline (not rote recall). Tips: for each method know generative vs discriminative, the core formula, one failure mode, and which demo in the lab illustrates it. Practise mapping a raw problem end-to-end.
Group project · 30%
Format: implement an NLP application or research a studied tool in depth; submit via Turnitin + a 15-minute class presentation. Graded on: problem framing, soundness of method, quality of evaluation (use a real metric — accuracy/F1, BLEU/ROUGE, perplexity — not just "it looks good"), reproducibility, and clarity of the talk. Tips: scope small but evaluate properly; show a baseline before your model; acknowledge any GenAI use explicitly.
Exercises · 20%
Format: in-class exercises submitted individually through Turnitin; one week from the start of each to submit. Graded on: correctness, clean reproducible code, and a short written interpretation of the result. Tips: commit working code early, then refine; a labelled plot or confusion matrix beats a wall of numbers.
Quizzes · 10%
Format: a short quiz after each module, completed before the next synchronous session. Graded on: recall of the module's key definitions and formulas. Tips: take the quiz while the material is fresh; treat it as low-stakes spaced repetition — the flashcards in the lab map onto it directly.
Participation · 5%
Format: active engagement in activities, discussions and exercises. Graded on: consistent, substantive contribution — questions, peer help, discussion. Tips: small steady contributions across the semester beat one big push; helping a peer debug counts.
Acknowledging GenAI. Use of generative AI is encouraged but must be disclosed. The syllabus's suggested format: "I acknowledge the use of [tool] to [how]. The prompts used include [list]. The output was used to [how it entered your work]." If you used none: "No content generated by AI technologies has been used in this assignment." Disclosure never lowers your grade; omitting it is an integrity violation.
Full program · 30 sessions

The course, session by session

All sessions are live and in person. Each carries a core NLP concept (many with the underlying formula), a key idea to take away, and links to the matching interactive demo and flashcards.

Module 1 · Sessions 1–5

Foundations of NLP and Historical Overview

The on-ramp to the field: what NLP is, where it came from, the mathematical building blocks (perceptrons, n-grams, backpropagation), the vector-space view of meaning, and the Python toolchain you'll use all semester.

History & motivation · n-grams · Perceptron / MLP · Vector space models · NLTK / spaCy / Gensim / HF
By the end of this module you can
  • Explain what NLP is and recognise its presence in everyday technology.
  • Trace the field's history from ELIZA to modern statistical and neural methods.
  • Build a basic n-gram next-word predictor and reason about perceptrons and backpropagation.
  • Represent text in a vector space (LSA/HAL) and reduce dimensions with PCA.
  • Preprocess raw text using mainstream Python NLP libraries.
1
Live in-person · Module 1

Introduction to the Course

Objective: open the door to NLP — what it encompasses, its applications, and its role in bridging human–machine communication (chatbots, voice assistants, automated translation).

What is NLP
The interdisciplinary field at the intersection of linguistics, computer science and AI that lets machines read, understand and generate language. It is hard because language is ambiguous ("I saw the man with the telescope"), compositional (meaning builds from parts) and context-dependent (the same word shifts meaning across sentences). NLP gives machines representations and algorithms to cope with all three.
Applications
Machine translation, search, recommendation, conversational agents, summarization, spam filtering and autocomplete — the systems you already use daily. A useful framing: most tasks reduce to classification (sentiment), sequence labelling (POS/NER), generation (translation, chat) or retrieval (search, QA). Recognising which family a problem belongs to tells you which tool to reach for.
Course logistics
Slots: 10 min introductions · 40 min course introduction · 30 min class discussion. Practice: list your own daily encounters with NLP (voice commands, recommendations, support chatbots) to feel its omnipresence.
Key idea: NLP is everywhere; this opening discussion is revisited in the final session so you can measure how far you've come.
Connects to: Every later session is one of these four task families dressed in different mathematics. Hold the map in your head; it is the scaffold the whole course hangs on.
Key takeaway: NLP turns messy human language into structured, computable objects, and the choice of representation is what makes everything downstream possible.
2
Live in-person

The Dawn of Computational Linguistics

Objective: trace NLP's formative phases — from its embryonic stages to landmarks like ELIZA, one of the first programs to mimic human conversation — and chart the key shifts that shaped the field.

ELIZA & early systems
Weizenbaum's ELIZA (1966) matched simple patterns and reflected the user's words back ("I am sad" → "Why are you sad?"). It held no model of meaning, yet users felt understood — the ELIZA effect, our tendency to over-attribute intelligence to fluent output. A cautionary tale that is strikingly relevant in the LLM era.
Rule-based era
Hand-written grammars and symbolic methods (SHRDLU, expert systems, the 1954 Georgetown–IBM translation demo) dominated into the 1980s. They were precise and explainable but brittle: every exception needs a new rule, so coverage never scales. The 1990s "statistical turn" replaced rules with probabilities learned from corpora.
Practice
Each student picks a pivotal historical NLP model (ELIZA, SHRDLU, IBM Models, Word2Vec, the Transformer…) and prepares a concise 5-minute overview video on its contribution and lasting impact.
Key idea: The history of NLP is a swing from hand-crafted rules → statistics → learned representations → foundation models.
Pitfall: The ELIZA effect never went away; fluent text is not evidence of understanding. Keep this in mind when you reach LLMs in Module 4.
Key takeaway: No era erased the one before it: regex (Session 6) and HMMs (Session 15) are living descendants of the rule-based and statistical eras, and you will still use them today.
Reading
  • Jurafsky & Martin, Speech and Language Processing — Ch. 1 (introduction & brief history). Why: sets the timeline and vocabulary for the whole course.
3
Live in-person

Fundamental Concepts

Objective: unpack the foundational pillars — perceptrons and Multi-Layer Perceptrons, n-grams, and the backpropagation algorithm — the building blocks under much of today's NLP.

Perceptron / MLP
A perceptron computes $\hat{y}=f(\mathbf{w}^\top\mathbf{x}+b)$: a weighted sum passed through a step or other non-linear function. A single perceptron can only draw a straight decision boundary (it famously cannot learn XOR), but stacking layers with non-linearities yields a Multi-Layer Perceptron, a universal function approximator: given enough hidden units it can approximate any continuous mapping on a bounded input domain to arbitrary accuracy.
n-grams
Estimate the next word from the previous (n−1) words by counting how often sequences co-occur. A bigram (n=2) conditions on one prior word; a trigram on two. Practice: build an n-gram model from provided sentences and predict the subsequent word.
Backpropagation
The chain rule applied layer by layer: compute the loss, then propagate its gradient backwards so each weight learns how it contributed to the error. With gradient descent, $w \leftarrow w-\eta\,\partial L/\partial w$, this is how every neural net in the course is trained.
Core concept · n-gram probability
$P(w_i \mid w_{i-(n-1)},\dots,w_{i-1}) = \dfrac{\text{count}(w_{i-(n-1)},\dots,w_i)}{\text{count}(w_{i-(n-1)},\dots,w_{i-1})}$

An n-gram model makes the Markov assumption: the next word depends only on the previous n−1 words, not the whole history.

Worked mini-example · a bigram

Corpus: "I like NLP. I like cats." Count: "I like" appears 2×, "like NLP" 1×, "like cats" 1×. So $P(\text{NLP}\mid\text{like})=\tfrac{1}{2}=0.5$ and $P(\text{cats}\mid\text{like})=\tfrac{1}{2}=0.5$ — the model is equally torn. A word never seen after "like" gets probability 0, which is exactly the sparsity problem Laplace (add-one) smoothing fixes by pretending every word was seen once more.
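The bigram arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not the course's reference implementation: the function names, the lowercased whitespace tokenisation and the out-of-corpus probe word "dogs" are all assumptions made for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count contexts and bigrams over lowercased, whitespace-split sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1          # times `prev` is followed by something
            bigram_counts[(prev, nxt)] += 1    # times `prev nxt` occurs
    return context_counts, bigram_counts

def bigram_prob(prev, nxt, context_counts, bigram_counts, vocab_size=0, k=0.0):
    """P(nxt | prev). k=0 gives the raw estimate, k=1 gives add-one smoothing."""
    denom = context_counts[prev] + k * vocab_size
    return (bigram_counts[(prev, nxt)] + k) / denom if denom else 0.0

# Same toy corpus as the worked example (punctuation dropped for simplicity).
sentences = ["I like NLP", "I like cats"]
ctx, bi = train_bigrams(sentences)
vocab = {tok for s in sentences for tok in s.lower().split()}

p_nlp = bigram_prob("like", "nlp", ctx, bi)            # 1 / 2
p_dogs_raw = bigram_prob("like", "dogs", ctx, bi)      # unseen bigram: 0
p_dogs_smoothed = bigram_prob("like", "dogs", ctx, bi,
                              vocab_size=len(vocab), k=1.0)  # (0 + 1) / (2 + 4)
```

With add-one smoothing the unseen continuation gets a small non-zero probability, at the cost of taking a little mass away from the continuations that were actually observed.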

Key idea: Counting works until contexts get sparse, which is exactly why we later need smoothing, then learned representations.
Pitfall: Raising n captures more context but explodes the count table and starves it of data (the curse of dimensionality). Most words never co-occur, so most counts are zero; this is the motivation for the dense embeddings of Session 13.
Key takeaway: Two ideas seeded here run the whole course: predict the next token (n-grams → RNNs → GPT) and learn weights by backprop (MLP → transformers).
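The single-perceptron limitation mentioned in this session (a straight boundary, hence no XOR) can be checked empirically. The sketch below uses the classic perceptron update rule with a step activation; the function names, the learning rate of 1 and the 20-epoch budget are illustrative assumptions rather than course-prescribed settings.

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """Perceptron learning rule on 2-D inputs; returns (weights, bias)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0  # step activation
            err = target - pred            # 0 when correct, +1 or -1 when wrong
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def accuracy(data, w, b):
    """Fraction of examples the trained perceptron classifies correctly."""
    hits = sum((1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) == t
               for x, t in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))  # linearly separable: learnable
xor_acc = accuracy(XOR, *train_perceptron(XOR))  # not separable: never perfect
```

AND is linearly separable, so the rule converges to a perfect boundary; no single line separates the XOR classes, so at least one of the four points is always misclassified however long training runs.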
4
Live in-person

Advanced Concepts in NLP

Objective: go deeper into Vector Space Models (VSMs) through Latent Semantic Analysis (LSA) and the Hyperspace Analogue to Language (HAL), exploring their principles and applications.

Vector space models
Represent documents/words as vectors in a high-dimensional space so that geometry encodes meaning — texts about similar things point in similar directions. Each dimension is typically a vocabulary term; the whole corpus becomes a term–document matrix.
LSA & HAL
LSA applies SVD ($A \approx U\Sigma V^\top$) to a term–document matrix, keeping the top-k singular values to surface latent topics and merge synonyms (car/automobile collapse to one direction). HAL instead slides a window over text and counts word co-occurrences to build word vectors directly — two routes to the same distributional insight.
Practice · PCA
Visualise high-dimensional text data by projecting onto the directions of greatest variance with Principal Component Analysis — a cousin of LSA's SVD that turns thousands of dimensions into a readable 2-D scatter.
Core concept · cosine similarity
$\cos(\theta) = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$

Similarity between two text vectors is the cosine of the angle between them — independent of document length.

Worked mini-example · cosine

Let $\mathbf{a}=(1,1,0)$ and $\mathbf{b}=(1,0,1)$ over vocabulary {dog, cat, fish}. Dot product $=1$; each norm $=\sqrt2$. So $\cos\theta = 1/(\sqrt2\cdot\sqrt2)=0.5$ — the documents share one of two terms each, a 60° angle. Same direction ⇒ 1, no shared terms ⇒ 0.
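The same calculation in code, as a small sketch using only the standard library (the function name `cosine` and the zero-vector convention of returning 0 are assumptions for illustration):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty document matches nothing
    return dot / (norm_a * norm_b)

# Vectors from the worked example, over the vocabulary {dog, cat, fish}.
a = (1, 1, 0)
b = (1, 0, 1)
sim = cosine(a, b)                         # about 0.5
angle_deg = math.degrees(math.acos(sim))   # about 60 degrees
```

Scaling either vector (for example doubling every count in a longer document) leaves `sim` unchanged, which is the length-independence property stated above.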

Key idea: "You shall know a word by the company it keeps" (Firth, 1957): meaning emerges from co-occurrence statistics.
Connects to: LSA/HAL are the count-based ancestors of Word2Vec (Session 13), which learns the same geometry by prediction instead of factorisation. This session also connects to the IR ranking of Session 8.
Key takeaway: Once meaning is geometry, "similar" becomes "close" and you can compute it. This is the single most reused idea in the course.
5
Live in-person

NLP Libraries and Resources in Python

Objective: meet the workhorse libraries — NLTK, spaCy, Gensim and Hugging Face Transformers — the backbone of most text-processing and modelling tasks.

NLTK
The classic teaching toolkit: tokenizers, stemmers (Porter), lemmatizers (WordNet), built-in corpora and textbook algorithms. Transparent and great for learning, but slower and less suited to production.
spaCy
Production-grade, opinionated pipelines: fast tokenization, POS, dependency parsing and NER out of the box, with a clean Doc → Token object model. The default when you need to ship a robust pipeline.
Gensim · Hugging Face
Gensim specialises in topic models (LDA) and Word2Vec/Doc2Vec embeddings at scale. Hugging Face transformers gives one-line access to thousands of pre-trained models (BERT, GPT, T5) via a pipeline() API. Practice: clean and preprocess a raw dataset using one or more of these.
Key idea: Choosing the right library is half the battle: NLTK to learn, spaCy to ship, Hugging Face to reach the state of the art.
Pitfall: Mixing libraries' tokenizers silently breaks pipelines: a spaCy token boundary ≠ an NLTK one ≠ a transformer's subword. Standardise the tokenizer end-to-end, and match it to the model you'll feed.
Key takeaway: Preprocessing choices (lowercasing, stopword removal, stemming) are not neutral. Merging word variants typically trades precision for recall, so decide them with the downstream task in mind.
Reading
  • Bird, Klein & Loper, NLP with Python (NLTK book) — Ch. 1–3. Why: the hands-on companion for this session's preprocessing work.
Module 2 · Sessions 6–10

Machine Learning and Text Analysis Techniques

From precise pattern matching to learned classifiers: regular expressions, the statistics that drive search (TF-IDF), information retrieval, and the two canonical text classifiers — Naive Bayes and logistic regression.

Regex · TF-IDF · Information retrieval · Naive Bayes · Logistic regression
By the end of this module you can
  • Write regex patterns to extract structured data from raw text.
  • Compute TF-IDF weights and use them to rank documents.
  • Build a small search engine grounded in vector space models.
  • Train and interpret a Naive Bayes text classifier.
  • Use logistic regression for binary sentiment classification.
6
Live in-person · Module 2

Regular Expressions in NLP

Objective: master regular expressions ("regex") — precise, efficient patterns for searching, extracting and manipulating text.

Pattern syntax
The building blocks: character classes (\d \w \s), quantifiers (* + ? {n,m}), anchors (^ $ \b), alternation (|) and capture groups ((...)). Compose them into a small grammar — e.g. \b\w+@\w+\.\w+\b matches a simple email.
Extraction
Practice: craft patterns for phone numbers, emails, URLs and monetary values. Worked example: [$]\d{1,3}(,\d{3})*(\.\d{2})? matches a currency amount like 1,250.00 (with a leading dollar sign) — thousands-separated, optional cents.
Role in pipelines
Regex underlies tokenization, cleaning, normalisation and scraping in almost every NLP system — it is the unglamorous layer that turns raw text into something a model can ingest.
Key idea: Regex is deterministic and explainable; when a rule is enough, you don't need a model.
Pitfall: Greedy quantifiers (.*) over-match across boundaries; use the lazy form (.*?). And never try to parse nested structure (HTML, balanced brackets) with regex: classical regular expressions describe regular languages, which cannot count nesting depth.
Key takeaway: Reach for the simplest tool that solves the problem; a precise rule beats a fragile model when the pattern is genuinely regular.
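The two patterns from this session, plus the greedy-versus-lazy pitfall, can be tried with Python's built-in `re` module. This is a sketch: the sample strings are invented, and the capture groups of the currency pattern are written as non-capturing `(?:...)` groups here so that `findall` returns whole matches.

```python
import re

# Simple email and currency patterns from the session (groups made non-capturing).
EMAIL = re.compile(r"\b\w+@\w+\.\w+\b")
MONEY = re.compile(r"[$]\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

text = "Invoice for ana@example.com: total $1,250.00, deposit $75."
emails = EMAIL.findall(text)    # ['ana@example.com']
amounts = MONEY.findall(text)   # ['$1,250.00', '$75']

# Greedy vs lazy: .* runs to the LAST '>', .*? stops at the FIRST one.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.search(r"<.*>", html).group(0)   # the entire string
lazy = re.search(r"<.*?>", html).group(0)    # '<b>'
```

The email pattern is deliberately simple: it would miss addresses containing dots or hyphens before the `@`, which is a reminder to test any extraction pattern against realistic data.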
7
Live in-person

Text-Based Statistics

Objective: see how statistics underpin language processing — TF-IDF, lexical density and word-frequency distributions — and why they matter for analysing text.

Term frequency
How often a term appears in a document, often dampened by a log ($1+\log\text{tf}$) so a word seen 100× isn't 100× as important as one seen once.
Inverse document frequency
$\log(N/\text{df})$ down-weights terms common across the whole corpus (the, data in a CS corpus) and up-weights distinctive ones — the part that makes a word a good index term.
Lexical density & Zipf
Word frequencies follow Zipf's law — a few words dominate, a long tail is rare; IDF is essentially a correction for that skew. Lexical density (content words ÷ total) gauges how "information-rich" a text is.
Practice
Compute TF-IDF scores across a set of documents and find the most relevant document for a query.
Core concept · TF-IDF
$\text{tfidf}(t,d) = \text{tf}(t,d)\times \log\dfrac{N}{\text{df}(t)}$

A term scores high when it is frequent in a document but rare across the corpus — the signal that makes a word a good index term.

Worked mini-example · IDF

Corpus of $N=1000$ documents, using base-10 logarithms. The word "the" appears in all 1000, so $\text{idf}=\log_{10}(1000/1000)=0$: it contributes nothing. "transformer" appears in 10, so $\text{idf}=\log_{10}(1000/10)=\log_{10} 100 = 2$. If "transformer" occurs 3× in a document, its weight is $3\times2=6$, while "the" scores 0 no matter how often it appears. Distinctiveness wins.
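A short sketch of the same numbers in Python. It assumes a base-10 logarithm, which is what yields the value 2 for "transformer" (a natural log would give about 4.6 and change the weights but not the ranking); the function names and the count of 50 for "the" are illustrative.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-10 log (an assumption of this sketch)."""
    return math.log10(n_docs / doc_freq)

def tfidf(term_count, n_docs, doc_freq):
    """Raw term frequency times IDF, as in the core formula above."""
    return term_count * idf(n_docs, doc_freq)

idf_the = idf(1000, 1000)            # 0.0: appears everywhere, carries no signal
idf_transformer = idf(1000, 10)      # 2.0: rare, hence distinctive
w_transformer = tfidf(3, 1000, 10)   # 3 * 2 = 6
w_the = tfidf(50, 1000, 1000)        # 0 regardless of how often it occurs
```

Library implementations often differ in detail (smoothed IDF, log-dampened TF, vector normalisation), so the absolute values from a toolkit may not match this hand calculation.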

Key idea: Distinctiveness, not raw frequency, is what makes a word informative.
Pitfall: TF-IDF is bag-of-words: it ignores order and synonymy, so "good" and "great" share nothing. That gap is exactly what embeddings (Session 13) close.
Key takeaway: TF-IDF is the workhorse baseline for search and classification: cheap, interpretable, and surprisingly hard to beat on short factual queries. Its weighting idea lives on in BM25 and in the lexical retrievers used in many RAG systems.
8
Live in-person

Information Retrieval

Objective: introduce Information Retrieval (IR) — fetching relevant data from large repositories — and the central role vector space models play in accuracy and efficiency.

IR fundamentals
The pipeline: index the corpus (often an inverted index mapping each term → the documents containing it), match the query, rank by relevance and return the top-k. The inverted index is what makes search fast at web scale.
VSM ranking
Score each document by the cosine between its TF-IDF vector and the query vector. The modern refinement, BM25, adds term-frequency saturation and document-length normalisation and is still the default lexical ranker in production search.
Evaluation
Quality is measured with precision@k, recall and MAP/nDCG against human relevance judgements — you cannot improve search you do not measure.
Practice
Build a rudimentary search engine that sifts a collection of articles and returns those closest to a user query.
Core concept · cosine ranking
$\text{score}(q,d) = \cos(\mathbf{q},\mathbf{d}) = \dfrac{\sum_t q_t d_t}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}$

Rank documents by the cosine between their TF-IDF vectors and the query vector.
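Putting TF-IDF and cosine ranking together gives a toy search engine. The sketch below is one possible arrangement under simplifying assumptions: whitespace tokenisation, natural-log IDF, sparse vectors stored as dictionaries, and a three-document corpus invented for the example. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, docs, top_k=2):
    """Rank documents by cosine(query, doc); returns (doc_index, score) pairs."""
    vectors, idf = build_index(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return [(i, round(s, 3)) for s, i in scored[:top_k]]

docs = [
    "transformers use self attention",
    "naive bayes is a text classifier",
    "attention is all you need",
]
results = search("self attention", docs)  # document 0 first, then document 2
```

Document 0 ranks first because it contains both query terms, including the rarer "self"; document 2 shares only "attention", which is down-weighted because it appears in two of the three documents. A production system would precompute the index once and use an inverted index instead of scoring every document.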

Key idea: Search = represent query and documents in the same vector space, then sort by similarity.
Connects to: Replace TF-IDF vectors with dense embeddings and cosine ranking becomes semantic / vector search, the retriever half of retrieval-augmented QA in Session 24.
Key takeaway: Lexical (BM25) and semantic (embedding) retrieval are complementary; hybrid search combines both for the strongest results.
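Two of the evaluation metrics named in this session, precision@k and recall, are simple enough to compute by hand. The following sketch assumes binary relevance judgements held in a set; the document IDs and function names are made up for the example.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}             # human relevance judgements

p3 = precision_at_k(ranked, relevant, 3)  # 1 of the top 3 is relevant
r5 = recall_at_k(ranked, relevant, 5)     # 2 of the 3 relevant docs were found
```

Precision rewards a clean first page of results while recall rewards completeness; MAP and nDCG extend the idea by also accounting for where in the ranking the relevant documents appear.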
Reading
  • Manning, Raghavan & Schütze, Introduction to Information Retrieval — Ch. 6 (scoring & the vector space model). Why: the canonical treatment of TF-IDF ranking.
9
Live in-person

Introduction to Naive Bayes

Objective: learn the mathematics behind Naive Bayes — a fundamental, efficient text classifier — and use it to categorise content.

Bayes' theorem
$P(c\mid d)=\dfrac{P(d\mid c)\,P(c)}{P(d)}$ — posterior ∝ prior × likelihood. Since $P(d)$ is the same for every class, we drop it and just compare numerators.
Conditional independence
The "naive" assumption: features (words) are independent given the class, so $P(d\mid c)=\prod_i P(w_i\mid c)$. False in reality ("New" and "York" are not independent) — yet the classifier still works well.
Laplace smoothing
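A word unseen in a class would give probability 0 and zero out the whole product; add-one smoothing $P(w\mid c)=\tfrac{\text{count}(w,c)+1}{\text{count}(c)+|V|}$ prevents that. Here $\text{count}(c)$ is the total number of word tokens in class $c$ and $|V|$ is the vocabulary size.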
Practice
Classify news articles into categories with multinomial Naive Bayes.
Core concept · Naive Bayes
$\hat{c} = \arg\max_c\; P(c)\prod_{i} P(w_i \mid c)$

Pick the class that maximises the prior times the product of per-word likelihoods (Laplace-smoothed to handle unseen words).

Worked mini-example · spam

Priors $P(\text{spam})=P(\text{ham})=0.5$. Suppose $P(\text{"free"}\mid\text{spam})=0.4$, $P(\text{"free"}\mid\text{ham})=0.05$. For the message "free": spam score $=0.5\times0.4=0.20$ vs ham $=0.5\times0.05=0.025$. Spam wins 8-to-1. In practice we sum logs ($\log P(c)+\sum\log P(w_i\mid c)$) to avoid floating-point underflow when multiplying many small probabilities.
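The spam example can be scored in log space as described. This is a sketch with the probabilities from the worked example hard-coded; the function names are hypothetical, and the counts passed to the smoothing helper at the end are invented purely to show the formula.

```python
import math

def nb_log_scores(tokens, priors, likelihoods):
    """log P(c) + sum_i log P(w_i | c) for every class c."""
    return {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in tokens)
        for c in priors
    }

def laplace(count_w_c, total_words_c, vocab_size):
    """Add-one smoothed estimate of P(w | c)."""
    return (count_w_c + 1) / (total_words_c + vocab_size)

priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": {"free": 0.4}, "ham": {"free": 0.05}}

scores = nb_log_scores(["free"], priors, likelihoods)
predicted = max(scores, key=scores.get)              # 'spam'
odds = math.exp(scores["spam"] - scores["ham"])      # 0.20 / 0.025 = 8

# Hypothetical counts: a word never seen in a class with 8 tokens and |V| = 2.
p_unseen = laplace(0, 8, 2)                          # 1 / 10 instead of 0
```

Subtracting log scores and exponentiating recovers the 8-to-1 ratio without ever multiplying the small probabilities directly, which is what keeps long documents from underflowing.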

Key idea: A "wrong" independence assumption can still classify remarkably well, and the model stays fully interpretable.
Pitfall: Because words are treated as independent, correlated features (synonyms, phrases) get double-counted, making Naive Bayes over-confident. Its probabilities are good for ranking but poorly calibrated.
Key takeaway: Naive Bayes is generative (it models how documents are produced); next session's logistic regression is its discriminative counterpart, with the same features but a different philosophy.
10
Live in-person

Logistic Regression in Text Analysis

Objective: demystify logistic regression and apply it to text — especially binary classification problems.

The sigmoid
$\sigma(z)=1/(1+e^{-z})$ squashes any real score into a probability in (0,1); at $z=0$ it gives exactly 0.5, the decision threshold.
Decision boundary
A linear hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ in feature space; the weights set its orientation, the bias its offset. Each weight is the (log-odds) contribution of one word — directly readable.
Training
Fit weights by minimising cross-entropy loss $-\sum[y\log\hat{y}+(1-y)\log(1-\hat{y})]$ via gradient descent, usually with L2 regularisation to curb over-fitting on rare words.
Practice
Build a sentiment tool that labels movie reviews positive or negative.
Core concept · logistic model
$P(y{=}1\mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x}+b),\quad \sigma(z)=\dfrac{1}{1+e^{-z}}$

Unlike Naive Bayes (generative), logistic regression is discriminative — it models the boundary directly and relaxes feature independence.

Worked mini-example · scoring a review

Suppose learned weights give "great" $w=+1.5$, "boring" $w=-2.0$, bias $b=0$. A review with both: $z=1.5-2.0=-0.5$, so $\hat{y}=\sigma(-0.5)\approx0.38$ → predicted negative (below 0.5). Flip "boring" to "fun" ($w=+1.2$): $z=2.7$, $\hat{y}\approx0.94$ → confidently positive.
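The review-scoring arithmetic can be checked in code. This sketch treats each word as a binary presence feature and uses the illustrative weights from the example; the function names are assumptions, and words without a learned weight are simply given weight 0.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(tokens, weights, bias=0.0):
    """P(positive | review): sum the weights of the words present, then squash."""
    z = bias + sum(weights.get(t, 0.0) for t in tokens)
    return sigmoid(z)

# Weights from the worked example above.
weights = {"great": 1.5, "boring": -2.0, "fun": 1.2}

p_mixed = predict_proba(["great", "boring"], weights)   # z = -0.5, about 0.38
p_positive = predict_proba(["great", "fun"], weights)   # z = 2.7, about 0.94
```

Because the score is a plain sum of per-word weights, reading the weight table directly tells you which words push a review towards each label.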

Key idea: Generative vs discriminative: Naive Bayes models how data is generated; logistic regression models the decision directly.
Connects to: Logistic regression is a single-neuron network: $\sigma(\mathbf{w}^\top\mathbf{x}+b)$. Stack neurons and add hidden layers and you get the feed-forward net of Session 14; a softmax over many classes generalises the sigmoid.
Key takeaway: With enough data, the discriminative model usually edges out the generative one; with little data, Naive Bayes's stronger assumptions can win.
Module 3 · Sessions 11–16

Advanced Representations in NLP

How meaning becomes geometry. This module covers sentiment analysis, the deep theory of vector space models and cosine similarity, learned word embeddings (Word2Vec), the first neural language models, and the two key linguistic-feature tasks: POS tagging and NER.

Sentiment · Cosine similarity · Word2Vec embeddings · Feed-forward LMs · POS tagging · NER
By the end of this module you can
  • Run an end-to-end sentiment analysis on real social data.
  • Measure document similarity with cosine similarity over VSMs.
  • Train and interpret Word2Vec embeddings and solve word analogies.
  • Build a feed-forward neural text classifier.
  • Apply POS tagging and Named Entity Recognition to extract structure from text.
11
Live in-person · Module 3

Sentiment Analysis

Objective: read the subtle cues and emotions in text to gauge whether opinion is positive, negative or neutral.

Polarity
Scoring text on a positive–negative axis (and sometimes intensity/arousal). Granularity ranges from document-level (whole review) to aspect-based (the food was great but the service was slow → +food, −service).
Lexicon vs ML
Valence dictionaries (VADER, SentiWordNet) sum per-word scores with negation/booster rules — fast and transparent but brittle on slang and sarcasm. Learned classifiers (the logistic regression of Session 10, or a fine-tuned transformer) adapt to the domain but need labelled data.
Practice
Collect real-time posts on a chosen topic from a social network and analyse the overall sentiment of the collected posts.
Core concept · lexicon scoring
$\text{sentiment} = \sum_i \nu(w_i)\cdot \text{neg}_i \cdot \text{boost}_i$

Each word's valence ν is flipped by nearby negators and amplified by intensifiers, then summed.

Worked mini-example · negation

Valence of "good" $=+2$. Phrase "not good": the negator within a 1–3 word window flips the sign → $-2$. "Really not good": the booster "really" scales by ~1.5 → $-3$. This is why a naive bag-of-words that sees only the token "good" mislabels the phrase as positive.
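A minimal lexicon scorer that follows the formula above. The tiny dictionaries, the 3-word look-back window and the 1.5 booster factor are assumptions taken from the worked example; this is a sketch of the idea, not a reimplementation of VADER's actual rule set.

```python
# Toy lexicon for illustration only; real valence dictionaries are far larger.
VALENCE = {"good": 2.0, "bad": -2.0}
NEGATORS = {"not", "never", "no"}
BOOSTERS = {"really": 1.5, "very": 1.5}
WINDOW = 3  # how many preceding tokens can modify a sentiment word

def lexicon_score(text):
    """Sum of valence * negation sign * booster factor over sentiment words."""
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in VALENCE:
            continue
        context = tokens[max(0, i - WINDOW):i]
        neg = -1.0 if any(t in NEGATORS for t in context) else 1.0
        boost = 1.0
        for t in context:
            boost *= BOOSTERS.get(t, 1.0)
        total += VALENCE[tok] * neg * boost
    return total

plain = lexicon_score("good")               # +2
negated = lexicon_score("not good")         # -2
boosted = lexicon_score("really not good")  # -3
```

The fixed window is the weak point: a negator further back than three tokens is missed, and a double negative is treated as a single flip, which is the kind of brittleness that motivates learned classifiers.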

Key idea: Context matters: "not good" must score negative even though "good" is positive.
Pitfall: Sarcasm ("oh great, another bug"), comparatives and domain shift (a "small" phone = good, a "small" portion = bad) defeat lexicons. These are precisely the long-range, contextual cases where transformers later shine.
Key takeaway: Sentiment is the canonical text-classification task. Every classifier in the course (NB, LR, FFN, BERT) can be pointed at it, so it is a perfect benchmark for comparing them.
12
Live in-person

Deep Dive into Vector Space Models

Objective: understand why VSMs are a cornerstone of NLP — how they represent text and capture relations between pieces of text.

Document vectors
Bag-of-words / TF-IDF representations as points in a space with one axis per vocabulary term. These vectors are sparse (mostly zeros) and very high-dimensional (|V| can be 50k+), which has consequences for both storage and similarity.
Cosine similarity
Angle-based closeness, robust to document length — a long article and a short tweet on the same topic still point the same way. Compare with edit distance (string-level) and Jaccard (set overlap): each measures a different notion of "similar".
Practice
Build a tool that measures how similar two documents are using cosine similarity over their vectors.
Core concept · cosine similarity
$\cos(\theta) = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} \in [-1,1]$

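1 means identical direction (same content mix), 0 means orthogonal (no shared terms). For non-negative count or TF-IDF vectors the value stays in $[0,1]$; negative values only arise for vectors with negative components, such as learned embeddings.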
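A minimal sketch of the cosine formula over raw bag-of-words counts, using only the standard library. Whitespace tokenisation and raw counts (rather than TF-IDF weights) are simplifying assumptions; the function name `cosine` is illustrative.

```python
import math
from collections import Counter


def cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term-count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)


short = "cats chase mice"
long_ = "cats chase mice cats chase mice cats chase mice"
print(cosine(short, long_))          # same direction, different length
print(cosine(short, "stocks fell"))  # no shared terms
```

The first pair differs only in length, so its similarity is 1 up to floating-point rounding; the second pair shares no terms, so it is 0.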

Key idea
Direction encodes meaning; magnitude (length) is mostly noise, so we compare angles.
Pitfall
The curse of dimensionality: in very high-dimensional sparse spaces, distances concentrate and almost everything looks equally far. Dense low-dimensional embeddings (next session) restore meaningful neighbourhoods.
Key takeaway
Pick the similarity measure that matches your unit of meaning: cosine for topical content, edit distance for typos/spelling, Jaccard for set membership.
13
Live in-person

Introduction to Word Embeddings

Objective: meet Word Embeddings starting with the groundbreaking Word2Vec — and see how it changed the way machines interpret words.

Distributional semantics
Dense, low-dimensional vectors (typically 100–300 dims) learned so that words appearing in similar contexts get similar vectors — the distributional hypothesis made computable, and far more compact than the sparse |V|-dim count vectors of earlier sessions.
Skip-gram / CBOW
Word2Vec's two training objectives: skip-gram predicts the surrounding context words from a centre word (better on rare words); CBOW predicts the centre word from its context (faster). Both learn the embedding as a by-product of getting the prediction right.
Negative sampling
The full softmax over |V| is too costly, so Word2Vec trains a few "is this a real context pair?" binary tasks against random negatives — the trick that made it fast enough to scale.
Practice
Map and visualise Word2Vec embeddings of emotion words in 2D (via PCA/t-SNE) and inspect which words cluster.
Core concept · skip-gram objective
$P(w_O\mid w_I) = \dfrac{\exp(\mathbf{v}'_{w_O}\!\cdot\mathbf{v}_{w_I})}{\sum_{w\in V}\exp(\mathbf{v}'_{w}\!\cdot\mathbf{v}_{w_I})}$

A softmax over the vocabulary: maximise the dot product between a centre word's input vector and its true context words' output vectors. Words sharing contexts are pushed together.
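To make the skip-gram softmax concrete, here is a small sketch with a three-word vocabulary. The vectors are made-up 2-D values, not trained embeddings, and `V_IN` / `V_OUT` are hypothetical names for the input (centre) and output (context) tables.

```python
import math

# Made-up 2-D vectors; real Word2Vec learns these during training.
V_IN = {"coffee": [1.0, 0.0]}  # input (centre-word) vectors v
V_OUT = {                      # output (context-word) vectors v'
    "cup": [2.0, 0.0],
    "drink": [1.0, 1.0],
    "tractor": [-1.0, 0.5],
}


def skipgram_probs(centre: str) -> dict:
    """P(w_O | w_I): softmax over dot products v'_w . v_centre."""
    v = V_IN[centre]
    scores = {w: sum(a * b for a, b in zip(out, v)) for w, out in V_OUT.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}


probs = skipgram_probs("coffee")
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {p:.3f}")
```

The denominator sums over the whole vocabulary, which is cheap for three words but is exactly the cost that negative sampling avoids when |V| is in the tens of thousands.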

Core concept · analogy arithmetic
$\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$

Embeddings place words in a space where linear directions encode relations like gender, tense or plurality — solved by nearest-neighbour search after the arithmetic.
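The arithmetic-then-nearest-neighbour procedure can be sketched as follows. The 2-D vectors are hand-made so that one axis roughly encodes "royalty" and the other "gender"; this is an assumption for illustration only, since real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (axis 0 ~ royalty, axis 1 ~ gender). Illustrative only.
EMB = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.5, 0.1),
}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy(a: str, b: str, c: str) -> str:
    """Return the word nearest (by cosine) to v_a - v_b + v_c."""
    target = tuple(x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c]))
    # The three query words are excluded, as is usual in analogy evaluation.
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMB[w], target))


print(analogy("king", "man", "woman"))  # queen
```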

Key idea
Dense learned vectors beat sparse counts: similar words sit close together and relationships become vector arithmetic.
Pitfall
A Word2Vec vector is static: "bank" gets one vector for both river and money senses. Embeddings also absorb social bias from the corpus (man:doctor :: woman:nurse). Contextual models (BERT, Session 20) fix the first; bias (Session 25) remains a live concern.
Key takeaway
Word2Vec replaced counting (LSA/HAL) with prediction and won, the conceptual leap that set up every neural model that follows.
14
Live in-person

Feedforward Neural Language Models

Objective: enter neural networks for language — understand how these networks "think" and process language data.

Architecture
Concatenate the embeddings of the previous (n−1) words → one or more hidden layers → softmax over the whole vocabulary. Bengio et al. (2003) showed this beats n-grams precisely because embeddings let unseen contexts borrow strength from similar seen ones.
Activations
Non-linearities are what make depth worthwhile: ReLU $\max(0,z)$ (cheap, default), tanh (zero-centred), sigmoid (gates). Without them, stacked linear layers collapse to a single linear map.
Evaluation · perplexity
Language models are scored by perplexity — the exponential of the average per-word cross-entropy; lower is better. It is the standard yardstick from n-grams through to GPT.
Practice
Build a text classifier using a simple feed-forward network.
Core concept · softmax output
$P(w_i) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

Bengio's neural LM replaced sparse n-gram counts with dense embeddings fed through an MLP — the conceptual ancestor of every modern LM.

Core concept · perplexity
$\text{PPL} = \exp\!\Big(-\tfrac{1}{N}\textstyle\sum_{i=1}^{N}\log P(w_i\mid w_{<i})\Big)$

Intuition: perplexity is the model's average "branching factor" — how many equally-likely words it is choosing among at each step. A PPL of 1 is a perfect predictor; uniform over a 10k vocabulary gives PPL 10,000.
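The two reference points in the intuition above (a perfect predictor and a uniform guesser) are easy to check numerically. This sketch assumes you already have the probability the model assigned to each observed token; `perplexity` is an illustrative helper name.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)


# A model that always assigns probability 1 to the true next word.
print(perplexity([1.0, 1.0, 1.0]))

# A model that is uniform over a 10,000-word vocabulary.
print(perplexity([1 / 10_000] * 5))

# A mixed case: confident on some tokens, unsure on others.
print(perplexity([0.5, 0.25, 0.125]))
```

The first call gives 1 and the second gives 10,000 (up to floating-point rounding), matching the "branching factor" reading. The third is the geometric mean of the inverse probabilities 2, 4 and 8, which is 4.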

Key idea
Learn the representation and the predictor jointly, instead of hand-engineering features.
Pitfall
A feed-forward LM still uses a fixed-size context window, just like an n-gram, so it cannot model arbitrarily long dependencies. That limitation is exactly what RNNs (Session 17) set out to remove.
Key takeaway
The softmax-over-vocabulary output head here is the same head GPT uses; only the body (MLP → RNN → transformer) changes across the course.
15
Live in-person

Part-of-Speech (POS) Tagging

Objective: identify the grammatical role of each word — noun, verb, adjective and more — from its context and meaning.

Tag sets
DET, NOUN, VERB, ADJ… — the 17-tag Universal set or the finer ~36-tag Penn Treebank. Tagging resolves ambiguity from context: "book" is a NOUN in "read a book" but a VERB in "book a flight".
HMM & Viterbi
A Hidden Markov Model has transition probabilities $a_{ij}$ (tag→tag) and emission probabilities $b_j(o)$ (tag→word). Viterbi finds the single most probable tag sequence in $O(T\cdot|S|^2)$ time by dynamic programming instead of enumerating $|S|^T$ paths.
Practice
Tag an article and list all its nouns, verbs and adjectives.
Core concept · Viterbi recursion
$\delta_t(j) = \max_i\big[\delta_{t-1}(i)\,a_{ij}\big]\,b_j(o_t)$

Each cell holds the best score of reaching tag j at position t; back-tracking recovers the winning path.

Worked mini-example · "they fish"

Tags {PRON, NOUN, VERB}. "they" is almost surely PRON. For "fish", compare two paths: PRON→VERB (high transition, fish-as-verb emission) vs PRON→NOUN. Because a pronoun is far more often followed by a verb, $\delta(\text{VERB})$ wins and "fish" is tagged VERB — a decision driven by the neighbour, not the word alone.
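The "they fish" example can be run end to end with a short Viterbi implementation. All probabilities below are invented for illustration (they are not estimated from a corpus); they are chosen so that "fish" is more likely as a NOUN in isolation, yet the PRON→VERB transition still wins.

```python
TAGS = ["PRON", "NOUN", "VERB"]
# Hypothetical start, transition and emission probabilities.
START = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "PRON": {"PRON": 0.05, "NOUN": 0.15, "VERB": 0.80},
    "NOUN": {"PRON": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"PRON": 0.30, "NOUN": 0.60, "VERB": 0.10},
}
EMIT = {
    "PRON": {"they": 0.50},
    "NOUN": {"fish": 0.02},
    "VERB": {"fish": 0.01},
}


def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence under an HMM, by dynamic programming."""
    # delta[t][j]: best score of any tag path ending in tag j at position t.
    delta = [{j: start[j] * emit[j].get(words[0], 0.0) for j in tags}]
    back = []  # back[t-1][j]: best previous tag for tag j at position t
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for j in tags:
            best_i = max(tags, key=lambda i: delta[t - 1][i] * trans[i][j])
            score = delta[t - 1][best_i] * trans[best_i][j]
            delta[t][j] = score * emit[j].get(words[t], 0.0)
            back[t - 1][j] = best_i
    # Back-track from the best final tag.
    path = [max(tags, key=lambda j: delta[-1][j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


print(viterbi(["they", "fish"], TAGS, START, TRANS, EMIT))  # ['PRON', 'VERB']
```

With these numbers the NOUN path scores 0.3 × 0.15 × 0.02 = 0.0009 and the VERB path 0.3 × 0.80 × 0.01 = 0.0024, so the neighbour decides the tag. For long sentences, a real implementation would work in log space to avoid underflow.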

Key idea
POS tagging is sequence labelling: the tag of a word depends on its neighbours, not just itself.
Connects to…
Viterbi and beam search (decoding, Session 21) both look for a best path through a lattice, but Viterbi is exact dynamic programming over a small tag set, while beam search is an approximation that keeps only the k best partial paths. NER (next session) is sequence labelling too, usually solved today with a CRF or transformer instead of an HMM.
Key takeaway
Many NLP tasks are "label every token in order". Once you see a problem as sequence labelling, Viterbi/CRF/transformer-tagging all become available.
16
Live in-person

Named Entity Recognition (NER)

Objective: extract specific information — people, places, organizations — by spotlighting the words that stand out because of their importance.

Entity types
PERSON, LOCATION, ORGANIZATION, DATE, MONEY and more. Domain-specific NER adds custom types — genes in biomedicine, tickers in finance, products in e-commerce.
BIO tagging
Encode multi-word spans by marking each token Begin / Inside / Outside an entity. This recasts a span-finding problem as per-token classification — the same sequence-labelling frame as POS.
Evaluation
Scored by span-level precision/recall/F1: a prediction counts only if the whole span and its type are exactly right — partial overlaps don't earn credit.
Practice
Analyse a news article and label every named entity it contains.
Core concept · span labelling (BIO)
Tim/B-PER Cook/I-PER visited/O New/B-LOC York/I-LOC

NER turns free text into structured records — the bridge from language to a knowledge base.
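Going from per-token BIO tags back to entity spans is a small state machine. The sketch below is one reasonable way to do it; `bio_to_spans` is an illustrative name, and treating a stray `I-` tag as "outside" is a design assumption (some toolkits start a new span instead).

```python
def bio_to_spans(tokens, tags):
    """Collect (entity text, entity type) pairs from BIO-tagged tokens."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins: close any span that is still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open span
        else:
            # "O", or an I- tag that does not continue the open span.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["Tim", "Cook", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('Tim Cook', 'PER'), ('New York', 'LOC')]
```

Span-level evaluation then compares these (text, type) pairs with the gold ones, which is why a prediction that finds only "York" earns no credit.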

Key idea
Entities are the nouns that matter; finding them is the first step in turning text into data.
Pitfall
Ambiguity bites: "Washington" can be PERSON, LOCATION or ORG; "Apple" the company vs the fruit. Resolution needs context, and after detection, entity linking (mapping "Tim Cook" → a knowledge-base ID) is a separate, harder step.
Key takeaway
NER + relation extraction is how unstructured text becomes a queryable database, the foundation of knowledge graphs and information extraction pipelines.
Module 4 · Sessions 17–24

Deep Learning for NLP

The modern stack. Recurrent networks and LSTMs for sequences, the transformer architecture and self-attention that replaced them, pre-trained models (BERT, GPT) and transfer learning, advanced sequence applications, and question-answering systems.

RNNs, LSTMs, Transformers · attention, BERT · GPT, Transfer learning, Seq2Seq, Question answering
By the end of this module you can
  • Explain and step through an RNN and an LSTM for sequential data.
  • Describe self-attention and fine-tune a transformer for classification.
  • Use pre-trained models (BERT, GPT) and the idea of transfer learning.
  • Apply Seq2Seq models to translation and build a GPT-based chatbot.
  • Design extractive and generative QA systems and fine-tune them.
17
Live in-person · Module 4

Introduction to RNNs

Objective: explore Recurrent Neural Networks — built for sequential data, with memory components that carry information from previous inputs.

Hidden state
A fixed-size vector $h_t$ that is a running summary of everything seen so far, re-computed at each step from the previous state and the new input. Unlike an n-gram or feed-forward LM, the context window is in principle unbounded.
Weight sharing
The same parameter matrices are reused at every time step, so the model handles sequences of any length with a fixed parameter count — and is trained by backpropagation through time (unrolling the recurrence, then back-propagating).
Practice
Use an RNN to predict the upcoming word in a given sentence.
Core concept · recurrence
$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b)$

The new hidden state mixes the previous state with the current input through a non-linearity.
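A single recurrence step, and the loop that applies it with shared weights, can be written in plain Python. The weight values are arbitrary assumptions (hidden size 2, input size 1); in practice you would use NumPy or a deep-learning framework and learn the weights.

```python
import math


def rnn_step(h_prev, x, W_hh, W_xh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b), computed row by row."""
    h = []
    for row_h, row_x, bias in zip(W_hh, W_xh, b):
        z = sum(w * v for w, v in zip(row_h, h_prev))
        z += sum(w * v for w, v in zip(row_x, x))
        h.append(math.tanh(z + bias))
    return h


def run_rnn(xs, W_hh, W_xh, b):
    """Apply the same weights at every time step, starting from h_0 = 0."""
    h = [0.0] * len(b)
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh, b)
    return h


# Arbitrary illustrative weights.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]

print(run_rnn([[1.0]], W_hh, W_xh, b))               # one-step sequence
print(run_rnn([[1.0], [0.0], [0.0]], W_hh, W_xh, b)) # same weights, longer input
```

The second call shows the first input fading as the state is repeatedly squashed and rescaled, a small-scale view of why vanilla RNNs forget.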

Key idea
Recurrence gives a network memory, but vanilla RNNs forget quickly over long sequences.
Pitfall
The vanishing/exploding gradient problem: repeatedly multiplying by $W_{hh}$ through time shrinks gradients toward 0 (or blows them up), so the RNN can't learn dependencies more than a handful of steps back. This single weakness motivates Sessions 18 (LSTM) and 19 (attention).
Key takeaway
RNNs process sequentially: powerful but inherently un-parallelisable across time, which later becomes the practical reason transformers win at scale.
18
Live in-person

Deep Dive into LSTMs

Objective: study Long Short-Term Memory networks — RNNs that remember long-range dependencies and resist the vanishing-gradient problem.

Cell state
A protected memory channel $C_t$ that runs straight through the chain with only minor linear interactions. Because information is added rather than repeatedly multiplied, gradients survive — the constant-error-carousel intuition behind solving vanishing gradients.
Gates
Three learned sigmoid gates regulate the cell: the forget gate drops stale memory, the input gate writes new information, the output gate decides what to expose as $h_t$. GRUs are a lighter two-gate variant.
Practice
Predict the next word with an LSTM and compare directly against the plain RNN to feel the long-dependency gain.
Core concept · gating
$f_t=\sigma(W_f[h_{t-1},x_t]),\;\; C_t=f_t\odot C_{t-1}+i_t\odot \tilde{C}_t$

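Gates are sigmoid outputs in $(0,1)$: the forget gate $f_t$ scales the old memory $C_{t-1}$, and the input gate $i_t$ (computed like $f_t$, with its own weights) scales the candidate memory $\tilde{C}_t$. The update is additive, which keeps gradients alive over long ranges.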
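The gating equations can be checked with a tiny sketch. It covers only the forget-gate computation and the cell-state update shown above, not a full LSTM cell; function names and the example values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, h_prev, x):
    """sigma(W [h_{t-1}, x_t]): one sigmoid unit per row of W (bias omitted)."""
    concat = h_prev + x  # list concatenation plays the role of [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat))) for row in W]


def cell_update(f, i, c_tilde, c_prev):
    """C_t = f_t * C_{t-1} + i_t * C~_t, element-wise."""
    return [ft * cp + it * ct for ft, it, ct, cp in zip(f, i, c_tilde, c_prev)]


c_prev = [3.0]
candidate = [5.0]

# Forget gate open, input gate closed: the old memory passes through untouched.
print(cell_update([1.0], [0.0], candidate, c_prev))  # [3.0]

# Forget gate closed, input gate open: the memory is overwritten.
print(cell_update([0.0], [1.0], candidate, c_prev))  # [5.0]

# With all-zero weights a gate outputs 0.5 for every unit.
print(gate([[0.0, 0.0]], [1.0], [1.0]))              # [0.5]
```

When the forget gate stays near 1, the cell state is copied forward unchanged, which is the additive path that lets gradients survive many steps.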

Key idea
Gates let the network choose what to remember and what to forget, solving the RNN's short memory.
Pitfall
LSTMs extend memory but are still sequential (no time-parallelism) and still degrade over very long contexts. For machine translation they also struggle to compress a whole sentence into one fixed context vector, the bottleneck that attention removes.
Key takeaway
The additive cell-state path is the same trick as a residual connection in transformers: keep a clean gradient highway through the network.
19
Live in-person

Transformers in NLP

Objective: meet Transformers — the architecture that revolutionised NLP — and their self-attention mechanism for capturing intricate patterns.

Self-attention
Each token emits a query, a key and a value. A token's new representation is a weighted average of all tokens' values, where the weight is how well its query matches each key. Every position can reach every other in one step — and all positions compute in parallel.
Multi-head & positional encoding
Multi-head runs several attention maps in parallel so different heads specialise (syntax, coreference, long-range links). Because attention is order-agnostic, positional encodings are added to inject word order. Residual connections + layer-norm + a feed-forward sublayer complete the block.
Practice
Fine-tune a pre-trained transformer for a specific text-classification task.
Core concept · scaled dot-product attention
$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$

Query–key dot products, divided by $\sqrt{d_k}$, become attention weights after the softmax and mix the value vectors. From "Attention Is All You Need" (Vaswani et al., 2017).

Worked mini-example · why √d

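For the pronoun "it" in "the animal didn't cross the street because it was tired", the query of "it" dots highest with the key of "animal", so attention concentrates there and the new representation of "it" is dominated by the value of "animal", resolving the reference. The $\sqrt{d_k}$ divisor matters: if query and key components have roughly unit variance, their dot product has standard deviation about $\sqrt{d_k}$, so with $d_k=64$ raw scores are spread ~√64 = 8× wider than a single component product. That pushes the softmax into saturated regions with near-zero gradients; scaling keeps it trainable.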
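Scaled dot-product attention fits in a dozen lines of plain Python. The sketch below is single-head, with no masking and no learned projections; the toy Q, K and V values are assumptions chosen so that the query clearly matches the first key.

```python
import math


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors, one output dimension at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# One query that matches the first key much better than the second.
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]

out, w = attention(Q, K, V)
print(w[0])    # almost all weight on the first key
print(out[0])  # so the output is close to the first value vector
```

Each query row is handled independently of the others, which is why all positions can be computed in parallel as one matrix product in a real implementation.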

Key idea
Attention replaces recurrence: any token can directly influence any other, in parallel.
Pitfall
Self-attention is $O(n^2)$ in sequence length: doubling the context quadruples the compute and memory. This is the central efficiency problem behind long-context research (sparse, linear and flash attention).
Key takeaway
One mechanism (soft, content-based lookup) underlies BERT, GPT and essentially every modern LLM. Understand this block and you understand the architecture of the field.
20
Live in-person

Pre-Trained Language Models (PLMs)

Objective: understand PLMs like BERT and GPT — their ability to understand and generate human-like text — and the concept of transfer learning.

Pre-training objectives
BERT uses masked language modelling: hide ~15% of tokens and predict them from both sides, giving deeply bidirectional representations ideal for understanding tasks. GPT uses autoregressive left-to-right next-token prediction, ideal for generation. Both are transformers, but BERT is an encoder that reads in both directions while GPT is a decoder that only looks left.
Transfer learning
Adapt a giant pre-trained model to your task without training from scratch — either full fine-tuning, lightweight adapters/LoRA, or just prompting a frozen model. The cost asymmetry is huge: pre-training is millions of dollars, fine-tuning can be minutes on one GPU.
Practice
Extract sentence embeddings with BERT and visualise them; observe how semantically related sentences cluster.
Core concept · pre-train then fine-tune
$\theta^\ast = \text{fine-tune}\big(\text{pretrain}(\theta_0,\,\mathcal{D}_{\text{large}}),\,\mathcal{D}_{\text{task}}\big)$

Learn general language knowledge once on huge corpora, then specialise cheaply on a small labelled set.

Key idea
Transfer learning is why a pre-trained model fine-tuned on a few hundred labelled examples can now beat task-specific models trained from scratch on millions.
Connects to…
Choose the model by task: BERT-family (encoders) for classification/NER/extractive QA; GPT-family (decoders) for generation/chat; encoder–decoders (T5, BART) for translation/summarisation, the Seq2Seq pattern of Session 21.
Key takeaway
The pre-train-then-adapt recipe is the dominant paradigm in modern NLP; almost no one trains a language model from scratch anymore.
21
Live in-person

Advanced Applications of RNNs and LSTMs

Objective: see how RNNs and LSTMs solve real-world problems, from time-series prediction to language modelling.

Seq2Seq
An encoder reads the input sequence into a context representation; a decoder generates the output sequence one token at a time. Adding attention lets the decoder look back at every encoder state instead of squeezing everything through one fixed vector — the fix for the bottleneck.
Decoding
At generation time, greedy decoding takes the top token each step (myopic); beam search keeps the k best partial sequences and usually finds a higher-probability overall output. Temperature and top-k/top-p sampling trade quality for diversity.
Practice
Design a mini translator using Seq2Seq LSTM models to translate specific phrases.
Core concept · encoder–decoder
$\mathbf{c}=\text{Enc}(x_{1:n}),\quad y_t=\text{Dec}(y_{<t},\,\mathbf{c})$

The encoder produces a context vector; the decoder generates one token at a time conditioned on it.

Core concept · BLEU
$\text{BLEU} = \text{BP}\cdot\exp\!\Big(\textstyle\sum_{n=1}^{4} w_n \log p_n\Big)$

Weighted geometric mean of the clipped n-gram precisions $p_n$ (1- to 4-grams, usually with uniform weights $w_n=\tfrac14$) against reference translations, times a brevity penalty BP that punishes too-short output. Example: a hypothesis sharing 3 of its 4 unigrams with the reference has $p_1=0.75$. ROUGE is the recall-oriented sibling used for summarisation.
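The $p_1 = 0.75$ example can be reproduced with a short function for the n-gram precision that BLEU uses. Counts are clipped to the reference counts, as in the standard BLEU definition; the sentences are invented, a single reference is assumed, and the brevity penalty is left out.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision p_n of a hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference, so repeating a matching word does not help.
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


ref = "the cat sat on the mat".split()
hyp = "the cat sat down".split()

print(modified_precision(hyp, ref, 1))  # 0.75 (3 of 4 unigrams match)
print(modified_precision(hyp, ref, 2))  # 2 of 3 bigrams match

# Clipping in action: "the" occurs once in this reference.
print(modified_precision("the the the the".split(), "the cat".split(), 1))  # 0.25
```

Full BLEU combines $p_1$ to $p_4$ and multiplies by BP. For real evaluation, an established implementation is the safer choice.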

Key idea
Seq2Seq reframed translation as "read the whole input, then write the whole output" and motivated attention.
Pitfall
BLEU rewards surface n-gram overlap, so a fluent paraphrase with different wording can score low while a clunky literal match scores high. Always pair automatic metrics with a few human reads.
Key takeaway
Encoder–decoder + attention is the bridge from RNNs to transformers: "Attention Is All You Need" simply replaced the recurrent encoder and decoder with stacked self-attention.
22
Live in-person

Advanced Applications of Transformers and PLMs

Objective: unravel the transformative power of transformers in real-world applications, backed by self-attention.

Generative models
GPT-style decoder-only transformers that produce coherent, context-aware text by sampling next tokens. Scaling parameters and data yields emergent abilities — few-shot learning, reasoning, instruction following — not present in smaller models.
Prompting
Steer a frozen model through its input: zero-shot (just ask), few-shot (show examples in-context), and chain-of-thought ("think step by step") to elicit reasoning. Instruction-tuning and RLHF align the base model with what users actually want.
Practice
Use GPT models to design a chatbot tailored for course-content inquiries.
Core concept · autoregressive generation
$P(x_{1:n}) = \prod_{t=1}^{n} P(x_t \mid x_{<t})$

A language model factorises the probability of a sequence into a product of next-token predictions.
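The factorisation can be illustrated with a toy next-token table. The table is hypothetical, and it conditions only on the previous token to stay small, whereas a real language model conditions on the whole prefix. Summing log-probabilities instead of multiplying probabilities is the usual way to avoid underflow on long sequences.

```python
import math

# Hypothetical P(next | previous token). "<s>" marks the start of the sequence.
NEXT = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.4, "dog": 0.6},
    "cat": {"sleeps": 0.5, "runs": 0.5},
}


def sequence_log_prob(tokens):
    """log P(x_1..x_n) as a sum of next-token log-probabilities."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(NEXT[prev][tok])
        prev = tok
    return log_p


log_p = sequence_log_prob(["the", "cat", "sleeps"])
print(math.exp(log_p))  # 0.5 * 0.4 * 0.5, up to floating-point rounding
```

Generation runs the same table in the other direction: sample a token from `NEXT[prev]`, append it, and repeat.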

Key idea
The same next-token objective, scaled up, gives the emergent abilities behind modern chatbots.
Pitfall
Hallucination: a model optimised to produce plausible text will confidently state falsehoods, because fluency is not factuality. Grounding (retrieval, citations) and verification are essential; never ship raw generation as truth.
Key takeaway
Notice the full circle: the next-token objective of Session 3's n-gram is exactly what GPT optimises; only the model capacity and training scale changed.
23
Live in-person

Introduction to Question Answering (QA)

Objective: understand the principles behind QA systems — architected to extract precise answers from large volumes of data.

Extractive QA
Given a question and a passage, return the answer as a contiguous span copied from the passage (SQuAD-style) — contrast with generative/abstractive QA, which writes a fresh answer that may not appear verbatim.
Start/end pointers
A fine-tuned encoder (e.g. BERT) outputs two distributions over passage positions — one for where the answer starts, one for where it ends — and the best valid span is selected.
Evaluation
Scored by Exact Match (string equal to a reference, after normalisation) and token-level F1 (partial overlap credit) — the SQuAD standard.
Practice
Build a rudimentary QA system over a fixed FAQ-style dataset.
Core concept · span selection
$(\hat{s},\hat{e}) = \arg\max_{s\le e}\; P_{\text{start}}(s)\,P_{\text{end}}(e)$

Score each possible (start, end) pair (with $s\le e$) and pick the most probable valid span.
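The constraint $s\le e$ is what makes span selection more than two independent argmaxes. In the sketch below the start and end distributions are made-up numbers, picked so that the most probable end position lies before the most probable start position.

```python
def best_span(p_start, p_end):
    """Return the (start, end) pair with s <= e maximising P_start(s) * P_end(e)."""
    best, best_score = None, -1.0
    for s, ps in enumerate(p_start):
        for e in range(s, len(p_end)):  # only ends at or after the start
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Made-up distributions over a 4-token passage.
p_start = [0.1, 0.2, 0.6, 0.1]
p_end = [0.5, 0.1, 0.1, 0.3]

span, score = best_span(p_start, p_end)
print(span)  # (2, 3)
```

Taking the two argmaxes separately would give start 2 and end 0, which is not a valid span. Real systems usually also cap the span length, which turns the inner loop into a short window.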

Key idea
Extractive QA doesn't invent answers; it locates them, which makes it verifiable.
Pitfall
Pure extractive QA can only answer what is literally present in the given passage; it cannot synthesise across documents or handle "no answer" gracefully unless trained to (SQuAD 2.0 adds unanswerable questions for exactly this reason).
Key takeaway
Because the answer is a verifiable pointer into a source, extractive QA is the trustworthy backbone you extend with retrieval (next session) to scale beyond a single passage.
24
Live in-person

Advanced QA Techniques and Strategies

Objective: navigate state-of-the-art QA strategies as the demands of information retrieval evolve.

Retrieval-augmented QA
For open-domain questions there is no given passage, so first retrieve the top-k relevant documents (dense embeddings in a vector store, or lexical BM25 over an inverted index), then read them to extract or generate the answer. This is RAG, the dominant pattern for grounding LLMs in private or up-to-date knowledge.
Fine-tuning
Adapt a transformer specifically for the QA objective — or fine-tune the retriever and reader jointly so they learn to cooperate.
Practice
Fine-tune a transformer to enhance the chatbot from the previous session so it answers course-content questions precisely.
Core concept · retrieve-then-read
$P(a\mid q) = \sum_{d} P(d\mid q)\,P(a\mid q,d)$

Combine a retriever (find the right documents) with a reader (extract/generate the answer) — the backbone of RAG.
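The retrieve-then-read sum can be traced with two documents and two candidate answers. The retriever and reader probabilities below are invented for illustration; in a real system they would come from a retriever's normalised scores over the top-k documents and from the reader model.

```python
def answer_distribution(retriever_probs, reader_probs):
    """P(a|q) = sum over retrieved d of P(d|q) * P(a|q,d)."""
    totals = {}
    for doc, p_doc in retriever_probs.items():
        for answer, p_ans in reader_probs[doc].items():
            totals[answer] = totals.get(answer, 0.0) + p_doc * p_ans
    return totals


# Hypothetical scores for the question "What is the capital of France?"
retriever = {"doc_france": 0.7, "doc_rhone": 0.3}  # P(d | q)
reader = {                                         # P(a | q, d)
    "doc_france": {"Paris": 0.9, "Lyon": 0.1},
    "doc_rhone": {"Paris": 0.2, "Lyon": 0.8},
}

dist = answer_distribution(retriever, reader)
print(max(dist, key=dist.get))  # Paris
print(dist)
```

If the retriever had put most of its mass on `doc_rhone`, the same reader would favour "Lyon", which is the sense in which the reader cannot recover from a bad retrieval.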

Key idea
Grounding answers in retrieved evidence reduces hallucination and keeps knowledge current.
Pitfall
RAG is only as good as its retriever: if the relevant passage isn't in the top-k, the reader cannot recover it and may hallucinate to fill the gap. Retrieval quality, chunking strategy and re-ranking matter as much as the LLM.
Key takeaway
RAG closes the loop of the whole course: TF-IDF/embedding retrieval (Modules 2–3) feeds a transformer reader (Module 4), so every technique you learned works together.
Module 5 · Sessions 25–30

Conclusion and Integrative Practices

Stepping back from the math: the ethical stakes of NLP, the open research frontier, and the integrative work that closes the course — group presentations, a full review, and the final exam.

Ethics & bias, Privacy, Open problems, Group project, Review & exam
By the end of this module you can
  • Critically audit an NLP system for bias and ethical risk.
  • Discuss open problems and emerging trends in the field.
  • Present an NLP application or case study you built and researched.
  • Consolidate the whole course and sit the comprehensive final exam.
25
Live in-person · Module 5

Ethics, Bias, and Real-world Challenges in NLP

Objective: confront the critical issues — biases in models and their implications, the environmental footprint of large-scale training, and privacy concerns around language data.

Bias
Models inherit and amplify the biases in their training data — gendered occupation associations, dialect and accent disparities, toxic stereotypes. Bias enters at data, annotation and modelling stages; auditing means measuring outcomes across demographic slices, not just overall accuracy.
Environmental & privacy cost
Training a large model can emit hundreds of tonnes of CO₂; the field is pushing efficiency (distillation, quantisation, smaller fine-tuned models). Privacy risks are real too: LLMs can memorise and regurgitate training data, so personal information leaks unless filtered or trained with differential privacy.
Practice
Critically examine a provided NLP model/application for potential biases and ethical issues.
Key idea
A model is only as fair as its data, so measuring and mitigating bias is part of building it.
Connects to…
The static embeddings of Session 13 are where bias was first quantified (man:programmer :: woman:homemaker). Mitigation (debiasing, balanced data, evaluation suites) applies to every model in the course.
Key takeaway
Ethics is not an add-on at the end; fairness, transparency and accountability are engineering requirements that belong in every stage of the pipeline.
26
Live in-person

Cutting-edge Trends and Open Problems in NLP

Objective: survey the latest breakthroughs, the open problems the research community is tackling, and the future trajectory of NLP.

Frontier topics
Scaling laws (predictable gains from more data/compute/parameters), multimodality (text+image+audio in one model), retrieval-augmentation, agents & tool use, alignment (RLHF, constitutional methods), and efficiency (mixture-of-experts, distillation, long-context attention).
Open problems
Hallucination & factuality, robust multi-step reasoning, low-resource languages (most of the world's ~7000 languages have little data), evaluation beyond leaderboards, interpretability, and the safety/governance of increasingly capable systems.
Practice
Group discussion: propose how an emerging trend could solve a real-world challenge; groups present a brief solution proposal.
Key idea
The field moves fast; what matters most is the ability to read, evaluate and adapt new ideas.
Key takeaway
The fundamentals you built this semester (representation, attention, evaluation, grounding) are exactly the lenses you'll use to read tomorrow's papers. Tools change; the principles transfer.
27
Live in-person

Group Presentations — Session 1

Objective: present the group project — NLP applications you've built or case studies you've researched (15 minutes per group).

Format
15 minutes per group, followed by Q&A and peer feedback. Submit through Turnitin per the professor's guidelines.
Scope
An implemented NLP application or an in-depth case study of a tool you have researched.
Talk structure
Problem → data → method → baseline → results with a real metric → limitations → demo. Lead with the result, then explain how you got there.
Key idea
Communicating a result clearly is as much a skill as building it.
Key takeaway
A clear baseline and an honest error analysis impress more than a flashy model with no evaluation.
28
Live in-person

Group Presentations — Session 2

Objective: the second round of group project presentations (15 minutes per group).

Format
Remaining groups present; continued peer and instructor feedback.
Key idea
Seeing many projects side by side reveals the breadth of what NLP can do.
29
Live in-person

Review Session

Objective: recap the entire course and clarify any remaining doubts before the exam.

Synthesis
Tie together the statistical → neural → transformer arc: counting (n-grams) → weighting (TF-IDF) → classifying (NB/LR) → embedding (Word2Vec) → sequence models (RNN/LSTM) → attention (transformers) → grounding (RAG).
Q&A
Targeted review of the hardest concepts — bring the formulas you find slipperiest.
Exam strategy
For each method be ready to state: what problem it solves, its core formula, one pitfall, and how it connects to its neighbours.
Key idea
The whole course is one story, from counting words to attention.
Key takeaway
If you can redraw the arc above from memory and place every demo on it, you are ready for the exam.
30
Live in-person

Final Exam

Objective: the comprehensive final exam — a Blackboard quiz covering all course content, with multiple-choice and open-ended questions (35% of the grade).

Coverage
All five modules, from preprocessing and TF-IDF to transformers and QA.
Format
Multiple-choice + open-ended questions on the Blackboard platform.
Key idea
Mastery = being able to move a problem across the whole pipeline, not just recall definitions.
Key concepts

Glossary of core terms

A quick reference for the recurring vocabulary of the course — roughly in the order it appears.

Tokenization
Splitting text into units (words, subwords or characters). The first step of almost every pipeline.
Stemming vs Lemmatization
Stemming chops suffixes (jumping→jump); lemmatization maps to dictionary forms using POS (was→be).
Stopwords
Very common words (the, of) often removed because they carry little distinguishing signal.
n-gram
A contiguous sequence of n tokens; the basis of count-based language models under the Markov assumption.
Regular expression
A formal pattern language for matching and extracting text — emails, dates, URLs, etc.
Bag of words
A representation that counts word occurrences while ignoring order.
TF-IDF
Term frequency × inverse document frequency — weights words by how distinctive they are to a document.
Vector space model
Representing text as vectors so that geometric closeness reflects semantic similarity.
Cosine similarity
The cosine of the angle between two vectors; a length-independent similarity measure.
Information retrieval
Finding documents relevant to a query from a large collection (the science behind search).
Naive Bayes
A generative classifier applying Bayes' theorem with a conditional-independence assumption between features.
Logistic regression
A discriminative linear classifier whose sigmoid output is a class probability.
Sentiment analysis
Classifying the opinion/emotion of text as positive, negative or neutral.
Word embedding
A dense, learned vector for a word where geometry encodes meaning (e.g. Word2Vec).
Word2Vec
Skip-gram/CBOW model that learns embeddings from context; famous for analogy arithmetic.
Softmax
Turns a vector of scores into a probability distribution; the standard output for classification.
Backpropagation
Computing gradients of the loss via the chain rule to train neural networks.
POS tagging
Labelling each word with its part of speech (noun, verb, adjective…).
Viterbi algorithm
Dynamic programming to find the most probable hidden-state (tag) sequence in an HMM.
Named Entity Recognition
Detecting and classifying entity spans — PERSON, LOCATION, ORGANIZATION, DATE, MONEY.
RNN
A network with a recurrent hidden state for processing sequences one step at a time.
LSTM
A gated RNN with a protected cell state that captures long-range dependencies.
Self-attention
A mechanism where each token weighs all others via $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ to build its representation.
Transformer
An attention-based architecture (no recurrence) that parallelises training; basis of modern LLMs.
Positional encoding
Information added to embeddings so a transformer knows word order.
Pre-trained language model
A model (BERT, GPT) trained on large corpora, then adapted to tasks.
Transfer learning
Reusing knowledge from a pre-trained model to solve a new task with little data.
BERT vs GPT
BERT is a masked-LM encoder (understanding); GPT is an autoregressive decoder (generation).
Seq2Seq
An encoder–decoder model mapping an input sequence to an output sequence (e.g. translation).
Beam search
Decoding that keeps the k best partial sequences at each step instead of just the top one.
Question answering
Returning a precise answer to a question, by extracting a span or generating text.
BLEU / ROUGE
n-gram precision (BLEU) and recall (ROUGE) metrics for generation tasks.
Corpus
A structured collection of text used to train or evaluate models; its size and quality cap what a model can learn.
Markov assumption
The simplifying claim that the next token depends only on a fixed window of recent tokens — the basis of n-gram models.
Laplace (add-one) smoothing
Adding a pseudo-count to every event so unseen n-grams or words get non-zero probability instead of zeroing a product.
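A small sketch of add-one smoothing for bigrams, assuming whitespace tokenisation and a vocabulary taken from the training text itself. The helper name `laplace_bigram_prob` is hypothetical.

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev, word, alpha=1.0):
    """P(word | prev) with add-alpha smoothing (alpha=1 is Laplace)."""
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Number of bigrams that start with `prev`.
    context_count = sum(c for (p, _), c in bigrams.items() if p == prev)
    return (bigrams[(prev, word)] + alpha) / (context_count + alpha * len(vocab))

tokens = "the cat sat on the mat".split()
p_seen = laplace_bigram_prob(tokens, "the", "cat")    # (1 + 1) / (2 + 5) = 2/7
p_unseen = laplace_bigram_prob(tokens, "the", "sat")  # (0 + 1) / (2 + 5) = 1/7
```

The unseen bigram "the sat" gets a small non-zero probability, and the probabilities over the whole vocabulary still sum to 1.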
Perplexity
$\exp$ of average per-word cross-entropy; the standard language-model score — the model's effective "branching factor" (lower is better).
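To make the "branching factor" reading concrete, here is a minimal sketch. It assumes `token_probs` holds the probability the model assigned to each token that actually occurred.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

uniform = perplexity([0.25] * 10)  # a model choosing uniformly among 4 tokens
certain = perplexity([1.0] * 3)    # a model that is always right
```

A model that spreads its probability evenly over four options has perplexity 4 (up to floating-point rounding), and a perfect model has perplexity 1.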
Cross-entropy loss
The training objective for classifiers and LMs: penalises low probability assigned to the correct label/token.
Sigmoid
$\sigma(z)=1/(1+e^{-z})$ — squashes a real score to a probability; the activation behind logistic regression and LSTM gates.
Activation function
A non-linearity (ReLU, tanh, sigmoid) that gives a network expressive power; without it, depth collapses to a linear map.
Generative vs discriminative
A generative model learns the joint $P(x,y)$, i.e. how the data is produced (Naive Bayes); a discriminative model learns the boundary $P(y\mid x)$ directly (logistic regression).
SVD / LSA
Singular Value Decomposition factorises a matrix; applied to a term–document matrix it yields Latent Semantic Analysis topics.
PCA
Principal Component Analysis — projects high-dimensional data onto its directions of greatest variance for visualisation/compression.
Curse of dimensionality
In very high-dimensional spaces such as bag-of-words vectors, distances between points concentrate and the data covers the space ever more thinly, which motivates dense, low-dimensional embeddings.
Distributional hypothesis
"You shall know a word by the company it keeps" — meaning is inferred from co-occurrence context.
Skip-gram / CBOW
Word2Vec's objectives: predict context from a word (skip-gram) or a word from its context (CBOW).
Inverted index
A map from each term to the documents containing it — the data structure that makes large-scale search fast.
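A toy version of the data structure, assuming lower-cased whitespace tokens and using list positions as document ids. The function names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
index = build_inverted_index(docs)
hits = search_all(index, "cat dog")  # {2}
```

A query only touches the postings of its own terms, so lookup cost depends on those postings rather than on the size of the whole collection.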
BM25
A refined TF-IDF ranking with term-frequency saturation and length normalisation; the standard lexical search baseline.
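The sketch below shows one common BM25 variant with a non-negative IDF. The defaults `k1=1.5` and `b=0.75` are typical values chosen for illustration, not values prescribed by the course.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simple BM25."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in tokenised for t in set(d))
    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation, b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats are pets"]
scores = bm25_scores("cat", docs)
```

Documents 0 and 1 both contain "cat" once, but the shorter document 1 scores higher, which is the length normalisation at work. Document 2 does not contain the exact term and scores 0.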
HMM
Hidden Markov Model — transition + emission probabilities over hidden states; classic model for POS tagging, decoded by Viterbi.
BIO tagging
Encoding spans as Begin/Inside/Outside per token, turning entity detection into per-token classification.
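For instance, entity spans given as token offsets can be converted to BIO tags like this. The `(start, end_exclusive, label)` span format is an assumption of this sketch.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end_exclusive, label) token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"   # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"   # remaining tokens of the same entity
    return tags

tokens = ["Ada", "Lovelace", "visited", "Madrid"]
tags = spans_to_bio(tokens, [(0, 2, "PERSON"), (3, 4, "LOCATION")])
# ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
```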
Vanishing/exploding gradient
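Gradients shrinking towards zero or blowing up as they pass through many layers or time-steps; the core RNN weakness. LSTM gating mitigates the vanishing case, while exploding gradients are usually tamed by gradient clipping.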
BPTT
Backpropagation Through Time — unrolling a recurrent net across steps and back-propagating to train shared weights.
Query / Key / Value
The three projections each token produces in attention; query·key sets the weight, value is what gets mixed.
Multi-head attention
Running several attention maps in parallel so different heads capture different relations (syntax, coreference, …).
Masked LM (MLM)
BERT's pre-training: predict randomly masked tokens using the context on both sides, yielding bidirectional representations.
Autoregressive LM
GPT's pre-training: predict the next token left-to-right, $P(x_{1:n})=\prod_t P(x_t\mid x_{<t})$.
Fine-tuning
Continuing training of a pre-trained model on a small task dataset; lightweight variants include adapters and LoRA.
Prompting (zero/few-shot)
Steering a frozen model via its input — instructions alone (zero-shot) or with in-context examples (few-shot).
RAG
Retrieval-Augmented Generation — retrieve relevant documents, then read/generate the answer, grounding it in evidence.
Hallucination
A model producing fluent but false content, because it optimises for plausibility, not truth.
Precision / Recall / F1
Core classification metrics: correctness of positives (precision), coverage of positives (recall), and their harmonic mean (F1).
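The three metrics follow directly from the counts of true positives, false positives and false negatives. A minimal sketch for binary labels, with invented example data:

```python
def precision_recall_f1(gold, pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
# tp=2, fp=1, fn=2, so precision = 2/3, recall = 1/2, F1 = 4/7
```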
Beam search vs greedy
Decoding strategies: greedy takes the top token each step; beam search keeps the k best partial sequences for a better overall path.
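The difference shows up even on a tiny, hypothetical model. `MODEL` below is an invented lookup table of next-token probabilities, and greedy decoding is simply beam search with k = 1.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, k):
    """Return the best (sequence, log-probability) found with beam width k."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + math.log(p))
            for prefix, logp in beams
            for tok, p in model[prefix].items()
        ]
        # Keep only the k highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]

greedy_seq, greedy_logp = beam_search(MODEL, steps=2, k=1)  # ("a", "x"), p = 0.33
beam_seq, beam_logp = beam_search(MODEL, steps=2, k=2)      # ("b", "x"), p = 0.36
```

Greedy commits to "a" because it looks best at step one, while the beam keeps "b" alive and finds the more probable full sequence.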
Bibliography

Readings

Specific session readings are announced beforehand; these are the course's foundational texts.

Compulsory
Jurafsky, D., & Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. ISBN 9780131873216. Why it matters: the definitive NLP textbook and the spine of this course, covering everything from n-grams and HMMs to neural models.
Compulsory
de la Cruz Echeandía, M., Elhaddad, Y. R. SH., Awinat, S., & Ortega, A. (2018). Handbook of Grammatical Evolution, chapter "GE and Semantics". Springer. ISBN 9783030087722. Why it matters: explains the semantic context in formal language theory and how Grammatical Evolution is extended to handle semantics, which directly supports the "semantic analysis" half of the course.
Recommended
Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. ISBN 9780262133609. Why it matters: the classic reference for the statistical methods (Modules 1–2): n-grams, smoothing, classification.
Recommended
Manning, C. D., Raghavan, P., & Schütze, H. (2012). Introduction to Information Retrieval. Cambridge University Press. ISBN 9780511809071. Why it matters: the go-to text for the IR and TF-IDF material in Module 2 (Sessions 7–8).
Recommended
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media. ISBN 9780596516499. Why it matters: the hands-on NLTK companion for the Python practice work, especially Session 5.