Term: January 2025 – May 2025

Time: Mondays & Wednesdays (3:30-5:00)

Venue: CDS 102

Credits: 3:1

Outline: This course is a graduate-level introduction to the field of Natural Language Processing (NLP), which involves building computational systems to handle human languages. Why care about NLP systems? We interact with them on a daily basis: such systems answer the questions we ask (using Google or other search engines), curate the content we read, autocomplete words we are likely to type, translate text from languages we don’t know, flag content on social media that we might find harmful, and so on. Such systems are prominently used in industry as well as academia, especially for analyzing textual data.

Learning Outcomes: The course is structured to emphasize practical learning. Through four assignments, students will get a good sense of the challenges involved in building models that deal with human languages. After completing the course, students should feel comfortable developing models for problems involving textual data. The course also spends a considerable amount of time on language models, so that students can participate (in an informed way) in the current wave of large language models (LLMs). Students should also be able to pick up recently published research and understand the majority of the ideas in it.

Prerequisites: The class is intended for graduate students and senior undergraduates. We do not plan to impose any strict prerequisites in terms of IISc courses that one should have completed to register for this course. However, students are expected to know the basics of linear algebra, probability, and calculus. Programming assignments will require proficiency in Python; familiarity with PyTorch would also be useful. There will be no prerequisite quiz; we expect students to self-determine whether they are prepared to take this course.

Announcements

  • No class on March 17, 2025
  • Classes from Feb 25 onwards will be held in CDS 102
  • No class on Feb 17
  • First class is on Jan 6 in CDS 102 at 3:30 PM

Course Schedule

The course schedule is as follows; it is subject to change based on student feedback and the pace of instruction.

Date | Topic | Reading Material
Jan 6 | Course introduction |
Jan 8 | Text classification I + annotations | Eisenstein Sec 2-2.1
Jan 13 | Text classification II + annotations | Eisenstein Sec 2-2.1
Jan 15 | Word Representations + annotations | Word2Vec: 1, 2; J & M Chapter 6
Jan 20 | N-gram Language Models + annotations | J & M Chapter 3
Jan 22 | N-gram Language Models + Neural Nets + annotations | J & M Chapters 3 & 7
Jan 27 | Feed-forward Nets + Recurrent Nets + annotations | J & M Chapters 7 & 8
Jan 29 | RNNs + LSTMs + annotations | J & M Chapter 8
Feb 3 | PyTorch Tutorial by Kinshuk Vasisht (+ iPynb) | Tensor Puzzles
Feb 5 | Seq-to-seq Models + Machine Translation + annotations | J & M Chapter 9
Feb 10 | Transformers + annotations | Transformers: 1, 2; J & M Chapter 10
Feb 12 | Pre-training Transformers | BERT; GPT 1, 2, 3
Feb 17 | No class |
Feb 19 | Post-training + watermarking + annotations | InstructGPT, Watermarking
Feb 24 | Decoding Algorithms (+ iPynb) by Dr. Apoorv Saxena |
Feb 26 | RL for Post-training by Prof. Aditya Gopalan |
Mar 3 | Review |
Mar 5 | Midterm |
Mar 10 | RL for Post-training II by Prof. Aditya Gopalan |
Mar 12 | Tokenization: Code + Slides | Karpathy’s tutorial
Mar 17 | No class |
Mar 19 | Evaluation Benchmarks |
Mar 24 | Tagging: HMMs + Viterbi Decoding |
Mar 26 | Tagging II (CRFs) |
Mar 31 | Institute Holiday |
Apr 2 | Fairness, Biases and Ethics I |
Apr 7 | Fairness, Biases and Ethics II |
Apr 9 | Course Summary (last class) |

Course Evaluation

The evaluation comprises:

  • Four programming assignments (4 x 15% = 60%),
  • Two exams (or quizzes):
    • mid-semester (15%), and
    • final (25%).

Assignments

The four programming assignments will tentatively involve building systems for learning word representations, text classification, language modeling, machine translation, and named entity recognition. The assignments will be implemented using interactive Python notebooks intended to run on Google’s Colab infrastructure. This allows students to use GPUs for free and with minimal setup. The notebooks will contain instructions interleaved with code blocks for students to fill in.

These assignments are meant to be solved individually. Across the four assignments, you get a total of four late days; no extensions will be offered (please don’t even ask). There are no restrictions on how the late days can be used; for example, you may spend all four on a single assignment. If you run out of late days, you can still submit your assignment, but your score will be divided by 2 if you submit up to one day late, and by 4 if up to two days late. No submissions will be entertained after that.
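To make the late-day rule concrete, here is a minimal sketch of how a submission’s score would be adjusted; the function and its signature are illustrative only, not part of the course tooling:

```python
def adjusted_score(raw_score, days_late, late_days_left):
    """Illustrative sketch of the late-submission policy (not official tooling)."""
    # Remaining late days cover the delay with no penalty.
    if days_late <= late_days_left:
        return raw_score
    uncovered = days_late - late_days_left
    if uncovered == 1:       # one day beyond the late-day budget: score halved
        return raw_score / 2
    if uncovered == 2:       # two days beyond: score quartered
        return raw_score / 4
    return 0                 # no submissions entertained after that
```

For instance, a student with all four late days left who submits five days late would have one uncovered day, so a score of 100 becomes 50.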

Important dates:

  • Assignment #1 (Out: Jan 20. Due: Feb 07, 16:59)
  • Assignment #2 (Out: Feb 10. Due: Feb 28, 16:59)
  • Assignment #3 (Out: Mar 03. Due: Mar 28, 16:59; extended from Mar 21)
  • Assignment #4 (Out: Mar 31, pushed from Mar 24. Due: Apr 18, 16:59; extended from Apr 11)

Discussions & (Anonymous) Feedback

We will use Teams for all discussions on course-related matters. Registered students should have received the joining link/passkey.

If you have any feedback, you can share it (anonymously or otherwise) through this link: http://tinyurl.com/feedback-for-danish

Teaching Staff

  1. Yash Patel (OH: Mondays 2 PM - 3 PM; Venue: CDS 208)
  2. Tarun Gupta (OH: Tuesdays 2 PM - 3 PM; Venue: CDS 308)
  3. Shivashish Naithani (OH: Fridays 2 PM - 3 PM; Venue: CDS 308)
  4. Karan Raj Bagri (OH: by appointment; Venue: TBD)
  5. Danish Pruthi (OH: Tuesdays 4 PM - 5 PM; Venue: CDS 401)

Reference Books

  1. Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin
  2. Introduction to Natural Language Processing by Jacob Eisenstein
  3. Neural Network Methods for Natural Language Processing by Yoav Goldberg