Introduction to Natural Language Processing (DS 207)

Term: January 2024 – May 2024

Time: Tuesdays & Thursdays (10-11:30)

Venue: CDS 102

Credits: 3:1

Outline: This course is a graduate-level introduction to the field of Natural Language Processing (NLP), which involves building computational systems to handle human languages. We interact with NLP systems daily: they answer the questions we ask (through Google or other search engines), curate the content we read, autocomplete the words we are likely to type, translate text from languages we don’t know, and flag content on social media that we might find harmful. Such systems are widely used in industry as well as academia, especially for analyzing textual data.

Prerequisites: The class is intended for graduate students and senior undergraduates. We do not require students to have completed any specific IISc courses before registering for this course. However, students are expected to know the basics of linear algebra, probability, calculus, and neural networks. The programming assignments will require proficiency in Python.

Announcements

Feb 27, 2024: Assignment #3 is out now, due on Mar 22, 16:59 IST.
Feb 11, 2024: Included the template for the project.
Feb 6, 2024: A few broad project directions are here.
Feb 6, 2024: Assignment #2 is out now, due on Feb 23, 16:59 IST.
Jan 22, 2024: Assignment #1 is out now, due on Feb 6, 16:59 IST.
Jan 20, 2024: The class on Thursday (Jan 25) will happen at EE B-308.
Jan 9, 2024: The quiz to assess the prerequisites will be held in CDS 102 ~~CDS 202~~ (and, if required, CDS rooms 419 and 208) during class hours on Thursday (Jan 11). The quiz will be conducted through Google Forms, so don’t forget to bring your laptop or phone.

Course Schedule

The course schedule is as follows. It is subject to change based on student feedback and the pace of instruction.

Date	Topic	Reading Material
Jan 9	Course introduction
Jan 11	Text classification + In-class quiz	Eisenstein Sec 2-2.1
Jan 16	Generative Naive Bayes Classification w/ annotations	Eisenstein Sec 2.2, J & M Chapter 4
Jan 18	Discriminative Classifiers & Word Representations w/ annotations	Word2Vec: 1, 2; J & M Chapter 6
Jan 23	Word Representations + n-gram models	J & M Chapter 3
Jan 25	LMs + Neural Nets w/ annotations	J & M Chapters 3 & 7
Jan 30	Tutorial on PyTorch
Feb 1	No class
Feb 6	Neural Nets (RNNs) + Applications w/ annotations	J & M Chapter 7
Feb 8	RNNs, LSTMs & Attention	J & M Chapter 9
Feb 13	Discussion on Research Projects
Feb 15	Attention & Transformers	Transformers 1, 2; J & M Chapter 10
Feb 20	Transformers & Pre-training	BERT paper
Feb 22	Transformers & Pre-training II	1, 2
Feb 27	No class
Feb 29	Tokenization (by Rankit Kachroo)
Mar 5	Pre-trained Decoders + Post-Training	GPT 1, 2 & 3
Mar 7	Tagging + HMMs	Notes from Michael Collins
Mar 12	CRFs (class work)	A tutorial on CRFs
Mar 14	Beyond Scaling Pretraining Data (by Saurabh Garg; CMU)
Mar 19	No class
Mar 21	Fairness, biases & ethics
Mar 26	AI X Community (by Shachi Dave; Google Research)
Mar 28	Multilinguality (by Prof. Monojit Choudhury; MBZUAI)
Apr 2	No class
Apr 4	Ethics in AI
Apr 9	Poster Session #1
Apr 10	Poster Session #2

Course Evaluation

The evaluation comprises 3 programming assignments (3 x 15% = 45% of the overall grade), 2 exams (2 x 10% = 20% of the overall grade), and a final group course project (35% of the overall grade).

The two exams aim to evaluate what students have learned through the lectures and assignments. One exam will be held around the middle of the semester and the other towards the end. Each exam is worth 10% of the grade.
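The weighting described above can be sketched as a short calculation. This is only an illustration: the component names, the `overall_score` helper, and the assumption that every component is scored on a 0-100 scale are hypothetical and not part of the official grading procedure.

```python
# Weights in percent, taken from the evaluation breakdown above.
# The component names themselves are hypothetical labels.
WEIGHTS_PERCENT = {
    "assignment_1": 15,
    "assignment_2": 15,
    "assignment_3": 15,
    "exam_1": 10,
    "exam_2": 10,
    "project": 35,
}

# The weights should account for the whole grade.
assert sum(WEIGHTS_PERCENT.values()) == 100


def overall_score(scores: dict) -> float:
    """Combine per-component scores (assumed to be on a 0-100 scale)
    into a single weighted score on the same scale."""
    missing = set(WEIGHTS_PERCENT) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return total / 100


example = {
    "assignment_1": 80,
    "assignment_2": 90,
    "assignment_3": 70,
    "exam_1": 60,
    "exam_2": 75,
    "project": 85,
}
print(overall_score(example))  # 79.25
```

The project carries more weight than any other single component, so in this example a 10-point change in the project score moves the overall score by 3.5 points.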

Projects

The course project constitutes 35% of the overall score. Students, working in groups of three, get a chance to apply what they have learned to an application of their choice. Projects would typically involve human languages and deep learning. The project includes three milestones: (1) an initial proposal (which will require a rough action plan and associated timelines); (2) a mid-term report; and (3) a final report. Towards the end of the course, students will get a chance to showcase their research through poster presentations.

Each team gets three late days for the project. No extensions will be offered (please don’t even ask). After your late days expire, you can still submit your project, but your score will be divided by 2 if you submit 1 day late, and by 4 if you submit 2 days late. No submissions will be accepted after that.

Some project directions are available here. Please note that these directions are only suggestions; students should not limit their explorations to just these directions.

New: Included the template for the project that also includes a few guidelines.

Important dates:

~~Feb 13, 16:59~~ Feb 15, 16:59 Project proposals due
Mar 14, 16:59 Mid-term report due
~~Apr 16, 16:59~~ Apr 22, 16:59 Final project reports due

Assignments

The three programming assignments will involve building systems for (1) text classification and learning word representations; (2) language modeling; (3) TBD (possibly machine translation and/or named entity recognition). The assignments will be implemented using interactive Python notebooks intended to run on Google’s Colab infrastructure. This allows students to use GPUs for free and with minimal setup. The notebooks will contain instructions interleaved with code blocks for students to fill in.

These assignments are meant to be solved individually. You get three late days in total across the three assignments. No extensions will be offered (please don’t even ask). There are no restrictions on how the late days can be used; for example, you can use all three late days for one assignment. If you run out of late days, you can still submit your assignment, but your score will be divided by 2 if you submit 1 day late, and by 4 if you submit 2 days late. No submissions will be accepted after that.
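The late-day arithmetic above can be illustrated with a small sketch. It rests on one reading of the policy, which is an assumption: late days are applied automatically to cover lateness first, and the division by 2 or 4 applies to the first and second day not covered by late days. The function name `penalized_score`, its parameters, and the choice to return a score of 0 for a submission that would not be accepted are all hypothetical.

```python
def penalized_score(raw_score: float, days_late: int, late_days_left: int):
    """Return (final_score, late_days_remaining) under an assumed reading
    of the late policy: late days cover lateness first, then each
    uncovered day incurs the stated penalty."""
    if days_late < 0 or late_days_left < 0:
        raise ValueError("day counts must be non-negative")

    covered = min(days_late, late_days_left)   # days absorbed by late days
    uncovered = days_late - covered            # days beyond the late-day budget
    remaining = late_days_left - covered

    if uncovered == 0:
        return float(raw_score), remaining     # no penalty
    if uncovered == 1:
        return raw_score / 2, remaining        # score divided by 2
    if uncovered == 2:
        return raw_score / 4, remaining        # score divided by 4
    # Assumption: a submission that would not be accepted is modeled as 0.
    return 0.0, remaining


# Submitting 4 days late with all 3 late days still available:
# 3 days are covered, 1 is not, so a raw score of 80 becomes 40.
print(penalized_score(80, days_late=4, late_days_left=3))  # (40.0, 0)
```

Because the three late days are shared across all assignments, the second value returned is the budget left for later submissions.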

Important dates:

Assignment #1 (Out: Jan 22. Due: Feb 6, 16:59)
Assignment #2 (Out: Feb 6. Due: Feb 23, 16:59)
Assignment #3 (Out: Feb 27. Due: Mar 22, 16:59)

Discussions & (Anonymous) Feedback

We will use Teams for all discussions on course-related matters. Registered students should have received the joining link/passkey.

If you have any feedback, you can share it (anonymously or otherwise) through this link: http://tinyurl.com/feedback-for-danish

Teaching Staff

Debarpan Bhattacharya (OH: Thursdays 3 PM - 4 PM; Venue: CDS 208; notify them)
Kinshuk Vasisht (OH: Wednesdays 10 AM - 11 AM; Venue: CDS 208; notify them)
Navreet Kaur (OH: Tuesdays 4 PM - 5 PM; Venue: CDS 208; notify them)
Nicy Scaria (OH: Tuesdays 11:30 AM - 12:30 PM; Venue: CDS 208; notify them)
Rankit Kachroo (OH: Wednesdays 4 PM - 5 PM; Venue: CDS 208; notify them)
Danish Pruthi (OH: Fridays 10 AM - 11 AM; Venue: CDS 401; notify them)

Reference Books

Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin
Introduction to Natural Language Processing by Jacob Eisenstein
Neural Network Methods for Natural Language Processing by Yoav Goldberg
