COMP6211B (L1) - Statistical Learning for Text Data Analytics


 

Course Syllabus

 

Instructor: Yangqiu Song
Email: yqsong@cse.ust.hk
Telephone: 6987
Office: 3518 (lift 25/26)
Office hours: Tuesday and Thursday, 3pm-5pm

 

Course code: 6211B

 

Class Meeting Time and Classroom: Monday and Wednesday, 6:00pm - 7:20pm, Rm 2404, Lift 17-18

 

Course Description: Statistical machine learning has been widely used in natural language processing (NLP). Unlike general-purpose data, natural language data is discrete, symbolic, noisy, ambiguous, and large in scale. These characteristics pose challenges for machine learning algorithms applied to NLP. This course provides an overview of the key challenges and of the typical modeling principles developed over the past decades to address them. It is a postgraduate-level introductory course covering fundamental algorithms and solutions. It also discusses open problems that the community has not yet solved, to inspire new research topics.

 

Course Outcomes: On successful completion of this course, the students should:

  • Demonstrate machine learning algorithm design skills for NLP tasks;
  • Analyze the quality of NLP results for domain problems;
  • Develop a program that can handle existing real-world problems.

 

Course Prerequisites:  Computer science: object-oriented programming and data structures, design and analysis of algorithms; Mathematics: multivariable calculus, linear algebra and matrix analysis, probability and statistics.

 

Course Topics: This course will generally cover the topics listed below. The actual topics covered may evolve somewhat over the semester as specific issues and subtopics call for elaboration or focus.

  • Representation: language models, word embeddings, topic models;
  • Learning: supervised learning, semi-supervised learning, sequence models, deep learning, optimization techniques;
  • Inference: constrained modeling, joint inference, search algorithms.

 

Performance Evaluation: In general, the course grade will be based on the total points earned, weighted according to the following schedule:

 

Activity or Task              Max Point Value
Weekly reading notes          40%
Mid-term project proposal     10%
Project report                30%
Final project presentation    20%
Total                         100%

 

Tentative Schedule:

Week  Date   Topic
1     05/02  Introduction
      07/02  Language Modeling (1)
2     12/02  Language Modeling (2)
      14/02  No class meeting (will be made up)
3     19/02  Holiday
      21/02  Featurized Language Modeling
4     26/02  Neural Language Modeling
      28/02  SGD Optimization (1)
5     05/03  SGD Optimization (2)
      07/03  SGD Optimization (3)
6     12/03  Word Embedding
      14/03  Topic Model (1): Introduction
7     19/03  Topic Model (2): Dirichlet Processes
      21/03  Text Classification (1): Introduction
8     26/03  Text Classification (2): Generative vs. Discriminative
      28/03  Semi-supervised Learning (1)
9     02/04  Midterm Break
      04/04  Midterm Break
10    09/04  Semi-supervised Learning (2)
      11/04  Sequence Tagging (1): HMM
11    16/04  Sequence Tagging (2): CRF
      18/04  Sequence Tagging (3): Structural SVM
12    23/04  Constraint Models (1): Posterior Regularization
      25/04  Constraint Models (2): ILP for NLP
13    30/04  Student Project Presentation
      02/05  Student Project Presentation
14    07/05  Student Project Presentation
      09/05  Student Project Presentation

 

 
