COMP6211B (L1) - Statistical Learning for Text Data Analytics
Course Syllabus
Instructor: Yangqiu Song
Email:
Telephone: 6987
Office: 3518 (lift 25/26)
Office hours: TuTh 3pm-5pm
Course code: 6211B
Class Meeting Time and Classroom: Monday and Wednesday, 6:00pm - 7:20pm, Rm 2404, Lift 17-18
Course Description: Statistical machine learning is widely used in natural language processing. Unlike the data handled by general-purpose data processing, natural language data is discrete, symbolic, noisy, ambiguous, and large in scale. These characteristics pose challenges for machine learning algorithms applied to natural language. This course gives an overview of the key challenges and of the modeling principles developed over the past decades to address them. It is a postgraduate-level introductory course covering fundamental algorithms and solutions. It also discusses open problems that the community has not yet solved, to inspire new research topics.
Course Outcomes: On successful completion of this course, students should be able to:
- Demonstrate machine learning algorithm design skills for NLP tasks;
- Analyze the quality of NLP results on domain problems;
- Develop a program that can handle existing real-world problems.
Course Prerequisites: Computer science: object-oriented programming and data structures, design and analysis of algorithms; Mathematics: multivariable calculus, linear algebra and matrix analysis, probability and statistics.
Course Topics: Generally, this course will include the topics listed below. The actual topics covered may evolve somewhat over the semester based on the need to elaborate or focus on specific issues and subtopics.
- Representation: language models, word embeddings, topic models;
- Learning: supervised learning, semi-supervised learning, sequence models, deep learning, optimization techniques;
- Inference: constrained modeling, joint inference, search algorithms.
Performance Evaluation: The course grade is based on the total points earned, weighted as follows:
Activity or Task | Max Point Value |
---|---|
Weekly reading notes | 40% |
Mid-term project proposal | 10% |
Project report | 30% |
Final project presentation | 20% |
Total | 100% |
Tentative Schedule:
Week | Date | Topic |
---|---|---|
1 | 05/02 | Introduction |
1 | 07/02 | Language Modeling (1) |
2 | 12/02 | Language Modeling (2) |
2 | 14/02 | No class meeting (will be made up) |
3 | 19/02 | Holiday |
3 | 21/02 | Featurized Language Modeling |
4 | 26/02 | Neural Language Modeling |
4 | 28/02 | SGD Optimization (1) |
5 | 05/03 | SGD Optimization (2) |
5 | 07/03 | SGD Optimization (3) |
6 | 12/03 | Word Embedding |
6 | 14/03 | Topic Model (1): Introduction |
7 | 19/03 | Topic Model (2): Dirichlet Processes |
7 | 21/03 | Text Classification (1): Introduction |
8 | 26/03 | Text Classification (2): Generative vs. Discriminative |
8 | 28/03 | Semi-supervised Learning (1) |
9 | 02/04 | Midterm Break |
9 | 04/04 | Midterm Break |
10 | 09/04 | Semi-supervised Learning (2) |
10 | 11/04 | Sequence Tagging (1): HMM |
11 | 16/04 | Sequence Tagging (2): CRF |
11 | 18/04 | Sequence Tagging (3): Structural SVM |
12 | 23/04 | Constraint Models (1): Posterior Regularization |
12 | 25/04 | Constraint Models (2): ILP for NLP |
13 | 30/04 | Student Project Presentation |
13 | 02/05 | Student Project Presentation |
14 | 07/05 | Student Project Presentation |
14 | 09/05 | Student Project Presentation |