Introduction to Natural Language Processing (DS 207)

Instructors: Danish Pruthi and Aditya Gopalan

Term: January 2026 – May 2026

Time: Tuesdays & Thursdays (11:30-13:00)

Venue: Biological Sciences Auditorium

Credits: 3:1

Outline: This course is a graduate-level introduction to the field of Natural Language Processing (NLP), which involves building computational systems to handle human languages. Why care about NLP systems? We interact with them on a daily basis—such systems answer the questions we ask (using Google, or other search engines), curate the content we read, autocomplete words we are likely to type, translate text from languages we don’t know, flag content on social media that we might find harmful, etc. Such systems are prominently used in industry as well as academia, especially for analyzing textual data.

Learning Outcomes: The course is structured to emphasize practical learning. With three assignments, students will get a good sense on challenges involved in building models that deal with human languages. Post completion, students will feel comfortable developing models for problems involving textual data. The course also spends a considerable amount of time on language models, so students would be able to participate (in an informed way) in the current wave of large language models (LLMs). After the course, students would also be able to pick up most recent published research, and understand majority of the ideas in them.

Prerequisites: The class is intended for graduate students and senior undergraduates. We do not plan to impose any strict requisites on IISc courses that one should have completed to register for this course. However, students are expected to know the basics of linear algebra, probability, and calculus. Programming assignments would require proficiency in Python, familiarity with PyTorch would also be useful. There will be no pre-requisite quiz, we expect students to self-determine whether they are prepared to take this course or not.

Announcements

Jan 30: Assignment 1 is out now, due Feb 20, 16:59 IST.
First class on Jan 8, in the Biological Sciences Auditorium.

Course Schedule

The course schedule is as follows. This is subject to changes based on student feedback and pace of the instruction.

Date	Topic (Instructor)	Reading Material
Jan 8	Course introduction (DP)
Jan 13	Text classification I + annotations (DP)	Eisenstein Sec 2-2.1
Jan 15	Text classification II (DP) + annotations	Eisenstein Sec 2-2.1
Jan 20	Word Representations + annotations (DP)	Word2Vec: 1, 2; J&M Ch 5
Jan 22	N-gram Language Models + annotations (DP)	J&M Ch 3
Jan 27	N-gram models II (DP)	J&M Ch 3
Jan 29	ML Basics (AG)
Feb 3	Neural nets (AG)	Patterns, Predictions, and Actions Ch 5
Feb 5	Backpropagation (AG)	Deep Learning Book Ch 6
Feb 10	Neural n-grams (AG)	Bengio et al. (2003)
Feb 12	RNNs (AG)	J&M Chapter 13
Feb 17	LSTMs, Seq2Seq Models, Attention + annotations (DP)	J&M Chapter 13
Feb 19	No class
Feb 24	Transformers + annotations (DP)	The Illustrated Transformer, Attention is all you need
Feb 26	Pre-training + annotations (DP)	BERT, GPT 1, 2, 3
Mar 3	Midterm
Mar 5	RL for Post Training (AG)
Mar 10	RL training of LLMs (AG)
Mar 12	RL with Human Feedback (RLHF) (AG)
Mar 17	Tokenization (DP)	Karpathy’s tutorial
Mar 19	Institute Holiday
Mar 24	RAGs
Mar 26	Reasoning
Mar 31	Institute Holiday
Apr 2	Diffusion Models I
Apr 7	Diffusion Models II
Apr 9	Evaluation (DP)
Apr 14	Fairness, Biases and Ethics (DP)	FairML Book Chapters 1-4

Course Evaluation

The evaluation comprises:

Three programming assignments (3 x 15% = 45%),
Two exams (or quizzes):
- mid-semester (25%), and
- finals (30%).

Assignments

The programming assignments will tentatively involve building systems for learning word representations, text classification, language modeling, and post-training language models. The assignments will be implemented using interactive Python notebooks intended to run on Google’s Colab infrastructure. This allows students to use GPUs for free and with minimal setup. The notebooks will contain instructions interleaved with code blocks for students to fill in. These assignments are meant to be solved individually.

Important dates:

Assignment #1 (Out: Jan 30. Due: Feb 20, 16:59)
Assignment #2 (Out: Feb 23. Due: Mar 13, 16:59)
Assignment #3 (Out: Mar 16. Due: Apr 3, 16:59)

For a total of three assignments, you would get three late days, no extensions will be offered (please don’t even ask). There are no restrictions on how the late days can be used, for example, you can use all the three late days for one assignment. If you run out of late days, you can still submit your assignment, but your obtained score would be divided by 2 if submitting after 1 day, and by 4 if submitting after 2 days. No submissions would be entertained after that.

Discussions & (Anonymous) Feedback

We will use Teams for all discussions on course-related matters. Registered students should have received the joining link/passkey.

If you have any feedback, you can share it (anonymously or otherwise) through this link: http://tinyurl.com/feedback-for-danish

Teaching Assistants

Dayita Chaudhari (Thu 2-3 PM, CDS 308)
Savyasachi Deval (Mon 11:30-12:30, CDS 402)
Purva Parmar
Harshit Rawat
Sudharshan T R
Victor Azad

Reference Books

Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin
Introduction to Natural Language Processing by Jacob Eisenstein
Neural Network Methods for Natural Language Processing by Yoav Goldberg