AI: Natural Language Processing & Semantic Analysis
The complete structure of the course — five modules, thirty live sessions, and the statistical-to-neural arc of modern NLP — laid out session by session with the core concept behind each one, and cross-linked to the 23 interactive demos and flashcards in this lab.
What this course is about
Natural Language Processing sits at the centre of today's technology revolution — the breakthroughs behind ChatGPT, Bard and their successors are just the starting line. Over the next few years NLP and text analysis will permeate every sector, reshaping data-driven decisions and letting us decode human communication in ways we never thought possible. This course turns that promise into skill: it walks the full arc from rule-based and statistical methods through machine learning to the deep-learning and transformer architectures that power modern language models, always pairing theory with hands-on practice so you can apply these tools to real data-science problems.
Foundations & History
From ELIZA and n-grams to vector space models and the Python NLP toolchain.
ML for Text
Regex, TF-IDF, information retrieval, Naive Bayes and logistic regression.
Representations
Sentiment, cosine similarity, word embeddings, feed-forward nets, POS & NER.
Deep Learning
RNNs, LSTMs, transformers, pre-trained models and question answering.
Integration
Ethics & bias, cutting-edge trends, group projects, review and final exam.
Interactive demos →
Every concept below has a live, client-side visualization in the lab.
What you will be able to do
The course is organised around ten thematic strands. By the end you should be fluent in moving a problem across all of them — from raw text to a deployed, evaluated model.
How the course is taught
IE University's method is collaborative, active and applied: the professor leads and guides while students build knowledge through a mix of lectures, hands-on practice, projects and peer learning. The table below shows how the 150-hour workload is distributed across learning activities.
Lectures & hands-on practice
Concepts are introduced in interactive lectures, then immediately practiced — students write, run and debug code in class, working in groups and sharing knowledge.
Project-based learning
A semester-long group project applies course techniques to a real NLP problem: identify, design, implement, then present to the class for feedback.
Critical GenAI use
Generative AI is encouraged — but you must verify its output, never take it at face value, and acknowledge its use. Acknowledging AI never lowers your grade; failing to is an integrity violation.
How you are graded
Five components make up the final grade. The final exam and the group project together carry the most weight; continuous work (exercises, quizzes, participation) rewards steady engagement across the semester.
Final exam
A comprehensive Blackboard quiz on all course content, combining multiple-choice and open-ended questions.
Group project
Implement an NLP application or research a studied tool in depth; submit via Turnitin and deliver a 15-minute class presentation.
Exercises
In-class exercises submitted individually through Turnitin — one week from the start of each exercise to submit.
Quizzes
A quiz after each module, completed before the next synchronous session.
Participation
Active engagement in in-class activities, discussions and exercises — central to an applied course.
Note: 100% total
35 + 30 + 20 + 10 + 5 = 100%. The exam and project alone account for 65% of the grade.
Attendance
Students who do not meet the 80% attendance rule fail both the ordinary and extraordinary calls for the year and must re-enroll the following academic year.
Late assignments
Penalised 5% per 24-hour period from the due date. Changes to due dates must be agreed with the professor before the deadline.
Re-sit / re-take (June–July)
A single comprehensive exam; continuous evaluation is not counted. Pass mark is 5, capped at 8.0 ("notable"). Students who failed on attendance cannot re-sit.
Calls & retakers
Four allowed calls over two academic years. Retake (3rd call) is capped at 10.0. Failing >18 ECTS in a year after re-sits means leaving the programme.
What "good" looks like
Each component rewards a different skill. Use these criteria as a checklist before you submit.
The course, session by session
All sessions are live and in person. Each carries a core NLP concept (many with the underlying formula), a key idea to take away, and links to the matching interactive demo and flashcards.
Foundations of NLP and Historical Overview
The on-ramp to the field: what NLP is, where it came from, the mathematical building blocks (perceptrons, n-grams, backpropagation), the vector-space view of meaning, and the Python toolchain you'll use all semester.
- Explain what NLP is and recognise its presence in everyday technology.
- Trace the field's history from ELIZA to modern statistical and neural methods.
- Build a basic n-gram next-word predictor and reason about perceptrons and backpropagation.
- Represent text in a vector space (LSA/HAL) and reduce dimensions with PCA.
- Preprocess raw text using mainstream Python NLP libraries.
Introduction to the Course
Objective: open the door to NLP — what it encompasses, its applications, and its role in bridging human–machine communication (chatbots, voice assistants, automated translation).
The Dawn of Computational Linguistics
Objective: trace NLP's formative phases — from its embryonic stages to landmarks like ELIZA, one of the first programs to mimic human conversation — and chart the key shifts that shaped the field.
- Jurafsky & Martin, Speech and Language Processing — Ch. 1 (introduction & brief history). Why: sets the timeline and vocabulary for the whole course.
Fundamental Concepts
Objective: unpack the foundational pillars — perceptrons and Multi-Layer Perceptrons, n-grams, and the backpropagation algorithm — the building blocks under much of today's NLP.
An n-gram model makes the Markov assumption: the next word depends only on the previous n−1 words, not the whole history.
Corpus: "I like NLP. I like cats." Count: "I like" appears 2×, "like NLP" 1×, "like cats" 1×. So $P(\text{NLP}\mid\text{like})=\tfrac{1}{2}=0.5$ and $P(\text{cats}\mid\text{like})=\tfrac{1}{2}=0.5$ — the model is equally torn. A word never seen after "like" gets probability 0, which is exactly the sparsity problem Laplace (add-one) smoothing fixes by pretending every word was seen once more.
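The counts in this example can be reproduced with a few lines of Python. This is a minimal sketch, assuming a toy whitespace tokenizer and treating each sentence separately so that no bigram crosses a full stop; the function names and the `k` parameter are illustrative, not part of any course library.

```python
from collections import Counter

def tokenize(text):
    # Toy tokenizer: drop full stops, then split on whitespace.
    return text.replace(".", " ").split()

def bigram_counts(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, k=0.0):
    # k = 0 is the raw (maximum-likelihood) estimate; k = 1 is Laplace add-one.
    # The unigram count of `prev` stands in for its count as a history, which
    # is exact here because "like" never ends a sentence.
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

sentences = ["I like NLP.", "I like cats."]
unigrams, bigrams = bigram_counts(sentences)

p_nlp = bigram_prob("like", "NLP", unigrams, bigrams)                   # 1/2
p_unseen = bigram_prob("like", "I", unigrams, bigrams)                  # 0/2
p_unseen_smoothed = bigram_prob("like", "I", unigrams, bigrams, k=1.0)  # (0+1)/(2+4)
```

With add-one smoothing the unseen continuation receives 1/6 instead of 0, at the cost of lowering the seen continuations from 1/2 to 1/3.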
Advanced Concepts in NLP
Objective: go deeper into Vector Space Models (VSMs) through Latent Semantic Analysis (LSA) and the Hyperspace Analogue to Language (HAL), exploring their principles and applications.
Similarity between two text vectors is the cosine of the angle between them — independent of document length.
Let $\mathbf{a}=(1,1,0)$ and $\mathbf{b}=(1,0,1)$ over vocabulary {dog, cat, fish}. Dot product $=1$; each norm $=\sqrt2$. So $\cos\theta = 1/(\sqrt2\cdot\sqrt2)=0.5$ — the documents share one of two terms each, a 60° angle. Same direction ⇒ 1, no shared terms ⇒ 0.
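The same arithmetic as a short Python sketch, using only the standard library. Returning 0 for an all-zero vector is a convention chosen here for the demo, not something the session prescribes.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty documents (an assumption of this sketch)
    return dot / (norm_a * norm_b)

# Vocabulary order: dog, cat, fish
a = [1, 1, 0]
b = [1, 0, 1]

similarity = cosine_similarity(a, b)
angle_degrees = math.degrees(math.acos(similarity))

# Doubling every count in a document changes its length, not its direction.
similarity_scaled = cosine_similarity([2, 2, 0], b)
```

Because of floating-point rounding, `similarity` comes out within a hair of 0.5 rather than exactly 0.5, so compare with a tolerance (for example `math.isclose`).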
NLP Libraries and Resources in Python
Objective: meet the workhorse libraries — NLTK, spaCy, Gensim and Hugging Face Transformers — the backbone of most text-processing and modelling tasks.
Doc → Token object model. The default when you need to ship a robust pipeline.
transformers gives one-line access to thousands of pre-trained models (BERT, GPT, T5) via a pipeline() API.
Practice: clean and preprocess a raw dataset using one or more of these.
- Bird, Klein & Loper, NLP with Python (NLTK book), Ch. 1–3. Why: the hands-on companion for this session's preprocessing work.
Machine Learning and Text Analysis Techniques
From precise pattern matching to learned classifiers: regular expressions, the statistics that drive search (TF-IDF), information retrieval, and the two canonical text classifiers — Naive Bayes and logistic regression.
- Write regex patterns to extract structured data from raw text.
- Compute TF-IDF weights and use them to rank documents.
- Build a small search engine grounded in vector space models.
- Train and interpret a Naive Bayes text classifier.
- Use logistic regression for binary sentiment classification.
Regular Expressions in NLP
Objective: master regular expressions ("regex") — precise, efficient patterns for searching, extracting and manipulating text.
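A minimal sketch of how such patterns behave in Python's built-in `re` module, using a simple email pattern and a dollar-amount pattern as illustrations. The sample sentence and address are invented for the demo, and the currency pattern uses non-capturing groups `(?:...)` so that `findall` returns whole matches rather than group contents.

```python
import re

EMAIL = re.compile(r"\b\w+@\w+\.\w+\b")
CURRENCY = re.compile(r"[$]\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

text = "Contact ana@example.org about the $1,250.00 invoice and the $75 fee."
emails = EMAIL.findall(text)
amounts = CURRENCY.findall(text)

# Greedy vs lazy quantifiers on the same input.
html = "<b>one</b> and <b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)    # one match spanning both tags
lazy = re.findall(r"<b>(.*?)</b>", html)   # one match per tag pair
```

The email pattern is deliberately simple: it would miss addresses containing dots or hyphens before the `@`, which is a reasonable trade-off for a first exercise but not for production validation.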
Character classes (\d \w \s), quantifiers (* + ? {n,m}), anchors (^ $ \b), alternation (|) and capture groups ((...)). Compose them into a small grammar: e.g. \b\w+@\w+\.\w+\b matches a simple email.
[$]\d{1,3}(,\d{3})*(\.\d{2})? matches a currency amount like 1,250.00 (with a leading dollar sign), thousands-separated with optional cents.
Greedy quantifiers (.*) over-match across boundaries; use the lazy form (.*?). And never try to parse nested structure (HTML, balanced brackets) with regex: regular expressions describe regular languages, which cannot count nesting depth.
Text-Based Statistics
Objective: see how statistics underpin language processing — TF-IDF, lexical density and word-frequency distributions — and why they matter for analysing text.
A term scores high when it is frequent in a document but rare across the corpus — the signal that makes a word a good index term.
Corpus of $N=1000$ documents. The word "the" appears in all 1000, so $\text{idf}=\log_{10}(1000/1000)=0$ and it contributes nothing. "transformer" appears in 10, so $\text{idf}=\log_{10}(1000/10)=\log_{10}100=2$. If "transformer" occurs 3× in a document, its weight is $3\times2=6$, while "the" scores 0 no matter how often it appears. Distinctiveness wins.
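The worked numbers can be checked in a few lines. This sketch assumes a base-10 logarithm (the base that makes the idf of "transformer" come out to 2) and raw term counts for tf; libraries often use natural logs and smoothing, so their absolute values will differ. The count of 50 for "the" is an arbitrary illustration.

```python
import math

def idf(num_docs, doc_freq):
    # Base-10 log, matching the worked example above.
    return math.log10(num_docs / doc_freq)

def tf_idf(term_count, num_docs, doc_freq):
    return term_count * idf(num_docs, doc_freq)

N = 1000
weight_the = tf_idf(term_count=50, num_docs=N, doc_freq=1000)         # 50 * 0
weight_transformer = tf_idf(term_count=3, num_docs=N, doc_freq=10)    # 3 * 2
```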
Information Retrieval
Objective: introduce Information Retrieval (IR) — fetching relevant data from large repositories — and the central role vector space models play in accuracy and efficiency.
Rank documents by the cosine between their TF-IDF vectors and the query vector.
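A toy search engine along these lines fits in a page of standard-library Python. The three documents, the whitespace tokenizer and the base-10 idf are all assumptions made for this sketch; a real system would add an inverted index and better tokenization.

```python
import math
from collections import Counter

docs = {
    "d1": "the transformer uses attention",
    "d2": "the cat sat on the mat",
    "d3": "attention is all the transformer needs",
}

def tfidf_vector(tokens, doc_freq, num_docs):
    # Sparse vector as a dict: term -> tf * idf. Unknown terms are skipped.
    counts = Counter(tokens)
    return {
        term: count * math.log10(num_docs / doc_freq[term])
        for term, count in counts.items()
        if term in doc_freq
    }

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

tokenized = {name: text.split() for name, text in docs.items()}
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
num_docs = len(docs)
doc_vectors = {
    name: tfidf_vector(tokens, doc_freq, num_docs)
    for name, tokens in tokenized.items()
}

def search(query):
    query_vector = tfidf_vector(query.split(), doc_freq, num_docs)
    scores = {name: cosine(query_vector, vec) for name, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = search("transformer attention")
```

Both d1 and d3 contain the two query terms, but d1 ranks first because it is shorter, so the shared terms make up a larger share of its vector; d2 shares only "the", whose idf is 0, and scores 0.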
- Manning, Raghavan & Schütze, Introduction to Information Retrieval — Ch. 6 (scoring & the vector space model). Why: the canonical treatment of TF-IDF ranking.
Introduction to Naive Bayes
Objective: learn the mathematics behind Naive Bayes — a fundamental, efficient text classifier — and use it to categorise content.
Pick the class that maximises the prior times the product of per-word likelihoods (Laplace-smoothed to handle unseen words).
Priors $P(\text{spam})=P(\text{ham})=0.5$. Suppose $P(\text{"free"}\mid\text{spam})=0.4$, $P(\text{"free"}\mid\text{ham})=0.05$. For the message "free": spam score $=0.5\times0.4=0.20$ vs ham $=0.5\times0.05=0.025$. Spam wins 8-to-1. In practice we sum logs ($\log P(c)+\sum\log P(w_i\mid c)$) to avoid floating-point underflow when multiplying many small probabilities.
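The same decision, computed in log space as the text recommends. The priors and likelihoods are the ones from the example; the 200-token loop at the end is an invented illustration of why raw products fail.

```python
import math

priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 0.4},
    "ham": {"free": 0.05},
}

def log_score(message_tokens, label):
    # log P(c) + sum of log P(w_i | c)
    score = math.log(priors[label])
    for token in message_tokens:
        score += math.log(likelihoods[label][token])
    return score

def classify(message_tokens):
    return max(priors, key=lambda label: log_score(message_tokens, label))

tokens = ["free"]
spam_score = math.exp(log_score(tokens, "spam"))   # about 0.20
ham_score = math.exp(log_score(tokens, "ham"))     # about 0.025
odds = spam_score / ham_score                      # about 8
predicted = classify(tokens)

# Underflow illustration: 200 tokens with likelihood 0.01 each.
product = 1.0
for _ in range(200):
    product *= 0.01            # collapses to 0.0 in double precision
log_sum = 200 * math.log(0.01)  # stays a finite, comparable number
```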
Logistic Regression in Text Analysis
Objective: demystify logistic regression and apply it to text — especially binary classification problems.
Unlike Naive Bayes (generative), logistic regression is discriminative: it models the decision boundary directly and does not assume that features are independent.
Suppose learned weights give "great" $w=+1.5$, "boring" $w=-2.0$, bias $b=0$. A review with both: $z=1.5-2.0=-0.5$, so $\hat{y}=\sigma(-0.5)\approx0.38$ → predicted negative (below 0.5). Flip "boring" to "fun" ($w=+1.2$): $z=2.7$, $\hat{y}\approx0.94$ → confidently positive.
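A short sketch of the prediction step with the weights from the example. Only inference is shown; the weights are taken as given rather than learned, and unknown words are assumed to contribute 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights = {"great": 1.5, "boring": -2.0, "fun": 1.2}
bias = 0.0

def predict_proba(tokens):
    # z = b + sum of the weights of the words present in the review
    z = bias + sum(weights.get(token, 0.0) for token in tokens)
    return sigmoid(z)

p_mixed = predict_proba(["great", "boring"])   # z = -0.5
p_positive = predict_proba(["great", "fun"])   # z = 2.7
```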
Advanced Representations in NLP
How meaning becomes geometry. This module covers sentiment analysis, the deep theory of vector space models and cosine similarity, learned word embeddings (Word2Vec), the first neural language models, and the two key linguistic-feature tasks: POS tagging and NER.
- Run an end-to-end sentiment analysis on real social data.
- Measure document similarity with cosine similarity over VSMs.
- Train and interpret Word2Vec embeddings and solve word analogies.
- Build a feed-forward neural text classifier.
- Apply POS tagging and Named Entity Recognition to extract structure from text.
Sentiment Analysis
Objective: read the subtle cues and emotions in text to gauge whether opinion is positive, negative or neutral.
Each word's valence ν is flipped by nearby negators and amplified by intensifiers, then summed.
Valence of "good" $=+2$. Phrase "not good": the negator within a 1–3 word window flips the sign → $-2$. "Really not good": the booster "really" scales by ~1.5 → $-3$. This is why a naive bag-of-words that sees only the token "good" mislabels the phrase as positive.
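The rule can be sketched as a tiny lexicon-based scorer. The lexicon here contains only the entries from the example, and the look-back window of three tokens is one reading of the "1–3 word window"; real lexicon tools use far larger word lists and more careful rules.

```python
VALENCE = {"good": 2.0}
NEGATORS = {"not"}
BOOSTERS = {"really": 1.5}
WINDOW = 3  # how many preceding tokens can modify a sentiment word

def phrase_score(tokens):
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in VALENCE:
            continue
        score = VALENCE[token]
        # Look back over the window for negators and intensifiers.
        for word in tokens[max(0, i - WINDOW):i]:
            if word in NEGATORS:
                score = -score
            elif word in BOOSTERS:
                score *= BOOSTERS[word]
        total += score
    return total

plain = phrase_score(["good"])                        # +2
negated = phrase_score(["not", "good"])               # -2
boosted = phrase_score(["really", "not", "good"])     # -3
```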
Deep Dive into Vector Space Models
Objective: understand why VSMs are a cornerstone of NLP — how they represent text and capture relations between pieces of text.
1 means identical direction (same content mix), 0 means orthogonal (no shared terms).
Introduction to Word Embeddings
Objective: meet Word Embeddings starting with the groundbreaking Word2Vec — and see how it changed the way machines interpret words.
A softmax over the vocabulary: maximise the dot product between a centre word's input vector and its true context words' output vectors. Words sharing contexts are pushed together.
Embeddings place words in a space where linear directions encode relations like gender, tense or plurality — solved by nearest-neighbour search after the arithmetic.
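To make the analogy arithmetic concrete, here is a sketch with hand-made two-dimensional vectors. These numbers are invented so that one axis loosely means "royalty" and the other "gender"; trained Word2Vec vectors have hundreds of dimensions and no such clean axes.

```python
import math

# Hypothetical toy embeddings: [royalty, gender]
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # "a is to b as c is to ?": compute vector(b) - vector(a) + vector(c),
    # then return the nearest remaining word by cosine similarity.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman")
```

Excluding the three input words from the candidates is standard practice; without it the nearest neighbour is often one of the inputs.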
Feedforward Neural Language Models
Objective: enter neural networks for language — understand how these networks "think" and process language data.
Bengio's neural LM replaced sparse n-gram counts with dense embeddings fed through an MLP — the conceptual ancestor of every modern LM.
Intuition: perplexity is the model's average "branching factor" — how many equally-likely words it is choosing among at each step. A PPL of 1 is a perfect predictor; uniform over a 10k vocabulary gives PPL 10,000.
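Both endpoints of this intuition can be verified directly. The sketch below assumes perplexity is computed as the exponential of the average negative log-probability the model assigned to each actual next token; the probability lists are illustrative inputs.

```python
import math

def perplexity(token_probs):
    # token_probs[i] is the probability the model gave to the i-th true token.
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

perfect = perplexity([1.0, 1.0, 1.0])        # always certain and always right
uniform = perplexity([1 / 10_000] * 5)       # uniform guess over 10k words
mixed = perplexity([0.5, 0.25])              # geometric-mean behaviour
```

The `mixed` case shows that perplexity is the inverse geometric mean of the token probabilities: sqrt(2 × 4) ≈ 2.83, not the arithmetic mean of 2 and 4.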
Part-of-Speech (POS) Tagging
Objective: identify the grammatical role of each word — noun, verb, adjective and more — from its context and meaning.
Each cell holds the best score of reaching tag j at position t; back-tracking recovers the winning path.
Tags {PRON, NOUN, VERB}. "they" is almost surely PRON. For "fish", compare two paths: PRON→VERB (high transition, fish-as-verb emission) vs PRON→NOUN. Because a pronoun is far more often followed by a verb, $\delta(\text{VERB})$ wins and "fish" is tagged VERB — a decision driven by the neighbour, not the word alone.
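A compact Viterbi decoder makes the "they fish" comparison explicit. All start, transition and emission probabilities below are invented for illustration (chosen so that a verb is the likeliest successor of a pronoun); a practical implementation would also work with log-probabilities.

```python
TAGS = ["PRON", "NOUN", "VERB"]

# Hypothetical HMM parameters for the two-word example.
start = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {
    "PRON": {"PRON": 0.05, "NOUN": 0.15, "VERB": 0.8},
    "NOUN": {"PRON": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"PRON": 0.3, "NOUN": 0.5, "VERB": 0.2},
}
emit = {
    "PRON": {"they": 0.9, "fish": 0.0},
    "NOUN": {"they": 0.0, "fish": 0.6},
    "VERB": {"they": 0.0, "fish": 0.4},
}

def viterbi(words):
    # delta[t][tag]: best probability of any tag path ending in `tag` at position t
    delta = [{tag: start[tag] * emit[tag][words[0]] for tag in TAGS}]
    backpointer = [{}]
    for t in range(1, len(words)):
        delta.append({})
        backpointer.append({})
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda prev: delta[t - 1][prev] * trans[prev][tag])
            delta[t][tag] = delta[t - 1][best_prev] * trans[best_prev][tag] * emit[tag][words[t]]
            backpointer[t][tag] = best_prev
    # Back-track from the best final tag to recover the winning path.
    last = max(TAGS, key=lambda tag: delta[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), delta

path, delta = viterbi(["they", "fish"])
```

Note that "fish" is emitted more readily by NOUN (0.6) than by VERB (0.4) in these made-up numbers, yet VERB still wins because the PRON→VERB transition dominates.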
Named Entity Recognition (NER)
Objective: extract specific information (people, places, organizations) by locating the spans of text that name them and assigning each an entity type.
NER turns free text into structured records — the bridge from language to a knowledge base.
Deep Learning for NLP
The modern stack. Recurrent networks and LSTMs for sequences, the transformer architecture and self-attention that replaced them, pre-trained models (BERT, GPT) and transfer learning, advanced sequence applications, and question-answering systems.
- Explain and step through an RNN and an LSTM for sequential data.
- Describe self-attention and fine-tune a transformer for classification.
- Use pre-trained models (BERT, GPT) and the idea of transfer learning.
- Apply Seq2Seq models to translation and build a GPT-based chatbot.
- Design extractive and generative QA systems and fine-tune them.
Introduction to RNNs
Objective: explore Recurrent Neural Networks — built for sequential data, with memory components that carry information from previous inputs.
The new hidden state mixes the previous state with the current input through a non-linearity.
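One common way to write this update is the Elman form, h_t = tanh(W_h h_{t-1} + W_x x_t + b). That formula and the tiny weight matrices below are assumptions of this sketch rather than something stated above; plain lists are used so that nothing beyond the standard library is needed.

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b), written out element by element.
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical parameters: 2 hidden units, 1 input feature.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]
states = []
for x_t in ([1.0], [0.0], [0.0]):   # one pulse of input, then silence
    h = rnn_step(h, x_t, W_h, W_x, b)
    states.append(h)
```

After the single input pulse the first hidden unit decays step by step, which is a small-scale view of how a plain RNN gradually forgets earlier inputs.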
Deep Dive into LSTMs
Objective: study Long Short-Term Memory networks — RNNs that remember long-range dependencies and resist the vanishing-gradient problem.
Gates (sigmoids in 0–1) control the additive flow of the cell state, keeping gradients alive over long ranges.
Transformers in NLP
Objective: meet Transformers — the architecture that revolutionised NLP — and their self-attention mechanism for capturing intricate patterns.
Query–key dot products, divided by $\sqrt{d_k}$, pass through a softmax to become attention weights that mix the value vectors. From "Attention Is All You Need" (Vaswani et al., 2017).
For the pronoun "it" in "the animal didn't cross the street because it was tired", the query of "it" dots highest with the key of "animal", so attention concentrates there and the output is dominated by that token's value, resolving the reference. The $\sqrt{d_k}$ divisor matters: with $d_k=64$ and unit-variance query and key components, the standard deviation of raw dot products grows to about $\sqrt{64}=8$, pushing the softmax into saturated regions with near-zero gradients; scaling keeps it trainable.
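A single-query version of scaled dot-product attention can be sketched without any libraries. The two-dimensional keys, values and query are invented so that the query for "it" aligns with the key for "animal"; real models learn these projections and use many more dimensions and heads.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d_k = len(query)
    # Scaled dot product between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the attention-weighted mix of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

tokens = ["animal", "street"]
keys = [[2.0, 0.0], [0.0, 2.0]]      # hypothetical key vectors
values = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical value vectors
query_it = [2.0, 0.0]                # hypothetical query for "it"

weights, output = attention(query_it, keys, values)
```

With these numbers roughly 94% of the attention weight lands on "animal", so the output vector is close to, but not an exact copy of, that token's value.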
Pre-Trained Language Models (PLMs)
Objective: understand PLMs like BERT and GPT — their ability to understand and generate human-like text — and the concept of transfer learning.
Learn general language knowledge once on huge corpora, then specialise cheaply on a small labelled set.
Advanced Applications of RNNs and LSTMs
Objective: see how RNNs and LSTMs solve real-world problems, from time-series prediction to language modelling.
The encoder produces a context vector; the decoder generates one token at a time conditioned on it.
Geometric mean of n-gram precisions $p_n$ (1- to 4-grams) against reference translations, times a brevity penalty BP that punishes too-short output. Example: a hypothesis sharing 3 of its 4 unigrams with the reference has $p_1=0.75$. ROUGE is the recall-oriented sibling used for summarisation.
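The pieces of this metric can be sketched for the single-reference case. The two sentences are invented to reproduce the 3-of-4 unigram example; this is an unsmoothed, sentence-level illustration, whereas standard BLEU is computed over a whole corpus and supports multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, reference, n):
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each hypothesis n-gram count by its count in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def brevity_penalty(hyp_len, ref_len):
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)

def bleu(hypothesis, reference, max_n=4):
    precisions = [modified_precision(hypothesis, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hypothesis), len(reference)) * math.exp(log_mean)

reference = "the cat sat down".split()
hypothesis = "the cat sat up".split()

p1 = modified_precision(hypothesis, reference, 1)   # 3 of 4 unigrams match
bleu_2 = bleu(hypothesis, reference, max_n=2)       # geometric mean of p1 and p2
bleu_4 = bleu(hypothesis, reference)                # no matching 4-gram
```

The full 4-gram score is 0 here because the one hypothesis 4-gram does not appear in the reference, which is why short-sentence BLEU is usually smoothed in practice.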
Advanced Applications of Transformers and PLMs
Objective: unravel the transformative power of transformers in real-world applications, backed by self-attention.
A language model factorises the probability of a sequence into a product of next-token predictions.
Introduction to Question Answering (QA)
Objective: understand the principles behind QA systems — architected to extract precise answers from large volumes of data.
Score each possible (start, end) pair (with $s\le e$) and pick the most probable valid span.
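A minimal sketch of that span search, assuming the model outputs one start score and one end score per token and that a span's score is their sum (equivalent to multiplying independent start and end probabilities). The scores and the optional `max_len` cap are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=None):
    best, best_score = None, float("-inf")
    for s, start in enumerate(start_scores):
        for e in range(s, len(end_scores)):      # enforce s <= e
            if max_len is not None and e - s + 1 > max_len:
                break
            score = start + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

tokens = ["Paris", "is", "the", "capital", "of", "France"]
start_scores = [0.1, 0.0, 0.2, 0.3, 0.0, 2.5]   # made-up model outputs
end_scores = [3.0, 0.0, 0.0, 0.1, 0.0, 2.0]

span, score = best_span(start_scores, end_scores)
answer = tokens[span[0]:span[1] + 1]
```

Taking the best start and best end independently would pair start 5 with end 0, an invalid span; searching over valid pairs returns the single-token span at position 5 instead.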
Advanced QA Techniques and Strategies
Objective: navigate state-of-the-art QA strategies as the demands of information retrieval evolve.
Combine a retriever (find the right documents) with a reader (extract/generate the answer) — the backbone of RAG.
Conclusion and Integrative Practices
Stepping back from the math: the ethical stakes of NLP, the open research frontier, and the integrative work that closes the course — group presentations, a full review, and the final exam.
- Critically audit an NLP system for bias and ethical risk.
- Discuss open problems and emerging trends in the field.
- Present an NLP application or case study you built and researched.
- Consolidate the whole course and sit the comprehensive final exam.
Ethics, Bias, and Real-world Challenges in NLP
Objective: confront the critical issues — biases in models and their implications, the environmental footprint of large-scale training, and privacy concerns around language data.
Cutting-edge Trends and Open Problems in NLP
Objective: survey the latest breakthroughs, the open problems the research community is tackling, and the future trajectory of NLP.
Group Presentations — Session 1
Objective: present the group project — NLP applications you've built or case studies you've researched (15 minutes per group).
Group Presentations — Session 2
Objective: the second round of group project presentations (15 minutes per group).
Review Session
Objective: recap the entire course and clarify any remaining doubts before the exam.
Final Exam
Objective: the comprehensive final exam — a Blackboard quiz covering all course content, with multiple-choice and open-ended questions (35% of the grade).
Glossary of core terms
A quick reference for the recurring vocabulary of the course — roughly in the order it appears.
- Tokenization
- Splitting text into units (words, subwords or characters). The first step of almost every pipeline.
- Stemming vs Lemmatization
- Stemming chops suffixes (jumping→jump); lemmatization maps to dictionary forms using POS (was→be).
- Stopwords
- Very common words (the, of) often removed because they carry little distinguishing signal.
- n-gram
- A contiguous sequence of n tokens; the basis of count-based language models under the Markov assumption.
- Regular expression
- A formal pattern language for matching and extracting text — emails, dates, URLs, etc.
- Bag of words
- A representation that counts word occurrences while ignoring order.
- TF-IDF
- Term frequency × inverse document frequency — weights words by how distinctive they are to a document.
- Vector space model
- Representing text as vectors so that geometric closeness reflects semantic similarity.
- Cosine similarity
- The cosine of the angle between two vectors; a length-independent similarity measure.
- Information retrieval
- Finding documents relevant to a query from a large collection (the science behind search).
- Naive Bayes
- A generative classifier applying Bayes' theorem with a conditional-independence assumption between features.
- Logistic regression
- A discriminative linear classifier whose sigmoid output is a class probability.
- Sentiment analysis
- Classifying the opinion/emotion of text as positive, negative or neutral.
- Word embedding
- A dense, learned vector for a word where geometry encodes meaning (e.g. Word2Vec).
- Word2Vec
- Skip-gram/CBOW model that learns embeddings from context; famous for analogy arithmetic.
- Softmax
- Turns a vector of scores into a probability distribution; the standard output for classification.
- Backpropagation
- Computing gradients of the loss via the chain rule to train neural networks.
- POS tagging
- Labelling each word with its part of speech (noun, verb, adjective…).
- Viterbi algorithm
- Dynamic programming to find the most probable hidden-state (tag) sequence in an HMM.
- Named Entity Recognition
- Detecting and classifying entity spans — PERSON, LOCATION, ORGANIZATION, DATE, MONEY.
- RNN
- A network with a recurrent hidden state for processing sequences one step at a time.
- LSTM
- A gated RNN with a protected cell state that captures long-range dependencies.
- Self-attention
- A mechanism where each token weighs all others via softmax(QKᵀ/√d)V to build its representation.
- Transformer
- An attention-based architecture (no recurrence) that parallelises training; basis of modern LLMs.
- Positional encoding
- Information added to embeddings so a transformer knows word order.
- Pre-trained language model
- A model (BERT, GPT) trained on large corpora, then adapted to tasks.
- Transfer learning
- Reusing knowledge from a pre-trained model to solve a new task with little data.
- BERT vs GPT
- BERT is a masked-LM encoder (understanding); GPT is an autoregressive decoder (generation).
- Seq2Seq
- An encoder–decoder model mapping an input sequence to an output sequence (e.g. translation).
- Beam search
- Decoding that keeps the k best partial sequences at each step instead of just the top one.
- Question answering
- Returning a precise answer to a question, by extracting a span or generating text.
- BLEU / ROUGE
- n-gram precision (BLEU) and recall (ROUGE) metrics for generation tasks.
- Corpus
- A structured collection of text used to train or evaluate models; its size and quality cap what a model can learn.
- Markov assumption
- The simplifying claim that the next token depends only on a fixed window of recent tokens — the basis of n-gram models.
- Laplace (add-one) smoothing
- Adding a pseudo-count to every event so unseen n-grams or words get non-zero probability instead of zeroing a product.
- Perplexity
- $\exp$ of average per-word cross-entropy; the standard language-model score — the model's effective "branching factor" (lower is better).
- Cross-entropy loss
- The training objective for classifiers and LMs: penalises low probability assigned to the correct label/token.
- Sigmoid
- $\sigma(z)=1/(1+e^{-z})$ — squashes a real score to a probability; the activation behind logistic regression and LSTM gates.
- Activation function
- A non-linearity (ReLU, tanh, sigmoid) that gives a network expressive power; without it, depth collapses to a linear map.
- Generative vs discriminative
- A generative model learns $P(x,y)$, i.e. how the data is produced (Naive Bayes); a discriminative model learns the boundary $P(y\mid x)$ directly (logistic regression).
- SVD / LSA
- Singular Value Decomposition factorises a matrix; applied to a term–document matrix it yields Latent Semantic Analysis topics.
- PCA
- Principal Component Analysis — projects high-dimensional data onto its directions of greatest variance for visualisation/compression.
- Curse of dimensionality
- In very high-dimensional spaces, distances concentrate and data becomes sparse, which motivates dense embeddings.
- Distributional hypothesis
- "You shall know a word by the company it keeps" — meaning is inferred from co-occurrence context.
- Skip-gram / CBOW
- Word2Vec's objectives: predict context from a word (skip-gram) or a word from its context (CBOW).
- Inverted index
- A map from each term to the documents containing it — the data structure that makes large-scale search fast.
- BM25
- A refined TF-IDF ranking with term-frequency saturation and length normalisation; the standard lexical search baseline.
- HMM
- Hidden Markov Model — transition + emission probabilities over hidden states; classic model for POS tagging, decoded by Viterbi.
- BIO tagging
- Encoding spans as Begin/Inside/Outside per token, turning entity detection into per-token classification.
- Vanishing/exploding gradient
- Gradients shrinking to zero or blowing up through many layers/time-steps; the core RNN weakness LSTMs address.
- BPTT
- Backpropagation Through Time — unrolling a recurrent net across steps and back-propagating to train shared weights.
- Query / Key / Value
- The three projections each token produces in attention; query·key sets the weight, value is what gets mixed.
- Multi-head attention
- Running several attention maps in parallel so different heads capture different relations (syntax, coreference, …).
- Masked LM (MLM)
- BERT's pre-training: predict randomly masked tokens from both sides, yielding bidirectional representations.
- Autoregressive LM
- GPT's pre-training: predict the next token left-to-right, $P(x_{1:n})=\prod_t P(x_t\mid x_{<t})$.
- Fine-tuning
- Continuing training of a pre-trained model on a small task dataset; lightweight variants include adapters and LoRA.
- Prompting (zero/few-shot)
- Steering a frozen model via its input — instructions alone (zero-shot) or with in-context examples (few-shot).
- RAG
- Retrieval-Augmented Generation — retrieve relevant documents, then read/generate the answer, grounding it in evidence.
- Hallucination
- A model producing fluent but false content, because it optimises for plausibility, not truth.
- Precision / Recall / F1
- Core classification metrics: correctness of positives (precision), coverage of positives (recall), and their harmonic mean (F1).
- Beam search vs greedy
- Decoding strategies: greedy takes the top token each step; beam search keeps the k best partial sequences for a better overall path.
Readings
Specific session readings are announced beforehand; these are the course's foundational texts.