CSCI 8980 (Spring 2023): Advanced Topics in Databases
Data Systems for Dirty, Private, and Federated Data

Course Overview

This course studies the algorithmic and system design of modern data systems that address emerging needs in data science. This semester-long seminar covers three of those needs. First, we will discuss detecting and repairing data errors, along with data cleaning systems. Second, we will discuss security and privacy requirements for data analytics and the tools and systems for meeting them. Third, we will discuss the challenges of consolidating information from siloed data and investigate the design principles and optimization opportunities in data federation.

Format

The course is organized as a series of seminars presented by the instructor and students. In the opening weeks, the instructor will present an overview of all three topics and their fundamental techniques. After that, each seminar session will cover three papers, one per topic, each presented by a different student. Each presenter is expected to give a conference-style paper presentation first and then lead the discussion with all participants. Other students are strongly encouraged to read the papers before the seminar and, after attending, are expected to submit a one-page summary of each presented paper that highlights its merits and challenges.

There will be no exams. Instead, each student will identify a concrete problem related to the topics of this course and complete a semester-long project, either independently or in a group of no more than two students. Each project has three milestones: at the beginning, middle, and final stages of the project. The project will involve implementing some of the techniques covered in class, with modifications tailored to the specific problem, and performing comparative studies between alternative techniques. At the end of the semester, each project is expected to be fully summarized in a technical report. A strong project may lead to a publishable paper.

Optional Textbooks

The course will focus on reading and understanding recent papers from top venues in the database, security and privacy, and machine learning communities. The following optional textbooks serve as useful references, and their PDFs should be accessible on the campus network:

Grading

Late submissions without prior consent will not be considered by default. All deadlines refer to the end of the day (11:59 PM Central Time).

Schedule

Week Date Topic Presenter Slides
1 Wednesday, January 18 Lecture Chang Ge 1.1
2 Monday, January 23 Lecture Chang Ge 2.1
Wednesday, January 25
Paper selection is due by January 25
Lecture Chang Ge 2.2
2.3
3 Monday, January 30 Li et al.: Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14(1): 50-60 (2020)
Abadi et al.: Deep Learning with Differential Privacy. CCS 2016: 308-318
McMahan et al.: Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017: 1273-1282
Jonathan Leibovich
Chris Liu
May Lin
3.1.A
3.1.B
3.1.C
Wednesday, February 1 Wu et al.: ZeroER: Entity Resolution using Zero Labeled Examples. SIGMOD Conference 2020: 1149-1164
Liu et al.: Dealer: An End-to-End Model Marketplace with Differential Privacy. Proc. VLDB Endow. 14(6): 957-969 (2021)
Wang et al.: Federated Learning with Matched Averaging. ICLR 2020
Krithika Sundaram
Mohammed Guiga
Shunichi Sawamura
3.2.A
3.2.B
3.2.C
4 Monday, February 6 Ahmadi et al.: Unsupervised Matching of Data and Text. ICDE 2022: 1058-1070
Yu et al.: Differentially Private Fine-tuning of Language Models. ICLR 2022
Rothchild et al.: FetchSGD: Communication-Efficient Federated Learning with Sketching. ICML 2020: 8253-8265
Pratik Nehete
Faizel Khan
Zahara Spilka
4.1.A
4.1.B
4.1.C
Wednesday, February 8 Jin et al.: Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation. Proc. VLDB Endow. 15(3): 465-477 (2021)
Chowdhury et al.: Strengthening Order Preserving Encryption with Differential Privacy. CCS 2022: 2519-2533
Smith et al.: Federated Multi-Task Learning. NIPS 2017: 4424-4434
Shunichi Sawamura
Joe Conroy
Mohammed Guiga
4.2.A
4.2.B
4.2.C
5 Monday, February 13 Tu et al.: Domain Adaptation for Deep Entity Resolution. SIGMOD Conference 2022: 443-457
Böhler et al.: Secure Multi-party Computation of Differentially Private Heavy Hitters. CCS 2021: 2361-2377
Jiang et al.: Improving Federated Learning Personalization via Model Agnostic Meta Learning. CoRR abs/1909.12488 (2019)
Faizel Khan
Zahara Spilka
Yutong Lei
5.1.A
5.1.B
5.1.C
Wednesday, February 15
Project proposal is due by February 19
Galhotra et al.: Hierarchical Entity Resolution using an Oracle. SIGMOD Conference 2022: 414-428
Lu et al.: A General Framework for Auditing Differentially Private Machine Learning. NeurIPS 2022
Bui et al.: Federated User Representation Learning. CoRR abs/1909.12535 (2019)

Mike Cao
Pratik Nehete

5.2.B
5.2.C
6 Monday, February 20 Simonini et al.: Entity Resolution On-Demand. Proc. VLDB Endow. 15(7): 1506-1518 (2022)
Zhang et al.: LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning. SIGMOD Conference 2022: 945-958
Bagdasaryan et al.: How To Backdoor Federated Learning. AISTATS 2020: 2938-2948
May Lin
Shunichi Sawamura
Tisbia Mpoyo
6.1.A
6.1.B
6.1.C
Wednesday, February 22 Zhou et al.: Serving Deep Learning Models with Deduplication from Relational Databases. Proc. VLDB Endow. 15(10): 2230-2243 (2022)
Yang et al.: SAM: Database Generation from Query Workloads with Supervised Autoregressive Models. SIGMOD Conference 2022: 1542-1555
Bao et al.: Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy. Proc. VLDB Endow. 15(11): 2348-2360 (2022)
Maximilian
Nicole Sullivan
6.2.A
6.2.B
7 Monday, February 27 Pena et al.: Fast Detection of Denial Constraint Violations. Proc. VLDB Endow. 15(4): 859-871 (2021)
Yang et al.: Auto-Pipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search. Proc. VLDB Endow. 14(11): 2563-2575 (2021)
Roth et al.: Honeycrisp: large-scale differentially private aggregation without a trusted core. SOSP 2019: 196-210
Mike Cao
Pratik Nehete
Jonathan Leibovich
7.1.A
7.1.B
7.1.C
Wednesday, March 1 Hilprecht et al.: ReStore - Neural Data Completion for Relational Databases. SIGMOD Conference 2021: 710-722
Takenouchi et al.: PATSQL: Efficient Synthesis of SQL Queries from Example Tables with Quick Inference of Projected Columns. Proc. VLDB Endow. 14(11): 1937-1949 (2021)
Liu et al.: Enabling SQL-based Training Data Debugging for Federated Learning. Proc. VLDB Endow. 15(3): 388-400 (2021)
Zahara Spilka
Derrick Gnana
Dat Vy Luong
7.2.A
7.2.B
7.2.C
8 Monday, March 6 Spring break, no classes
Wednesday, March 8
9 Monday, March 13 Project midterm presentations
  1. Maximilian Scheder-Bieschin, Faizel Khan
  2. Pratik Nehete
  3. Joe Conroy
  4. Mohammed Guiga, Nicole Sullivan
  5. Yutong Lei
Wednesday, March 15 Project midterm presentations
  1. Chris Liu, Jonathan Leibovich, Mike Cao
  2. Tisbia Mpoyo
  3. Derrick Gnana, Zahara Spilka
  4. Dat Luong, Shunichi Sawamura
  5. May Lin, Krithika Sundaram
10 Monday, March 20 Li et al.: Unsupervised Contextual Anomaly Detection for Database Systems. SIGMOD Conference 2022: 788-802
Sanghi et al.: Projection-Compliant Database Generation. Proc. VLDB Endow. 15(5): 998-1010 (2022)
Li et al.: Federated Matrix Factorization with Privacy Guarantee. Proc. VLDB Endow. 15(4): 900-913 (2021)
Tisbia Mpoyo

Joe Conroy
10.1.A

10.1.C
Wednesday, March 22 Cao et al.: Efficient Discovery of Sequence Outlier Patterns. Proc. VLDB Endow. 12(8): 920-932 (2019)
Mughees et al.: OnionPIR: Response Efficient Single-Server PIR. CCS 2021: 2292-2306
Tong et al.: Hu-Fu: Efficient and Secure Spatial Queries over Data Federation. Proc. VLDB Endow. 15(6): 1159-1172 (2022)
Mohammed Guiga
May Lin
Derrick Gnana
10.2.A
10.2.B

11 Monday, March 27 Wang et al.: Uni-Detect: A Unified Approach to Automated Error Detection in Tables. SIGMOD Conference 2019: 811-828
Dauterman et al.: Waldo: A Private Time-Series Database from Function Secret Sharing. IEEE Symposium on Security and Privacy 2022: 2450-2468
Xie et al.: FederatedScope: A Comprehensive and Flexible Federated Learning Platform via Message Passing. CoRR abs/2204.05011 (2022)
Nicole Sullivan
Maximilian
Mike Cao
11.1.A
11.1.B
11.1.C
Wednesday, March 29 Yan et al.: SCODED: Statistical Constraint Oriented Data Error Detection. SIGMOD Conference 2020: 845-860
Unnibhavi et al.: Secure and Policy-Compliant Query Processing on Heterogeneous Computational Storage Architectures. SIGMOD Conference 2022: 1462-1477
Zhang et al.: Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches. Proc. VLDB Endow. 14(10): 1769-1782 (2021)
Yutong Lei
Dat Vy Luong
Krithika Sundaram
11.2.A
11.2.B
11.2.C
12 Monday, April 3 Galhotra et al.: DataPrism: Exposing Disconnect between Data and Systems. SIGMOD Conference 2022: 217-231
Tan et al.: CryptGPU: Fast Privacy-Preserving Machine Learning on the GPU. IEEE Symposium on Security and Privacy 2021: 1021-1038
Stoddard et al.: Tanium Reveal: A Federated Search Engine for Querying Unstructured File Data on Large Enterprise Networks. Proc. VLDB Endow. 14(12): 3096-3109 (2021)
Dat Vy Luong
Tisbia Mpoyo
Nicole Sullivan
12.1.A
12.1.B
12.1.C
Wednesday, April 5 Mahdavi et al.: Raha: A Configuration-Free Error Detection System. SIGMOD Conference 2019: 865-882
Shastri et al.: Understanding and Benchmarking the Impact of GDPR on Database Systems. Proc. VLDB Endow. 13(7): 1064-1077 (2020)
He et al.: TransNet: Training Privacy-Preserving Neural Network over Transformed Layer. Proc. VLDB Endow. 13(11): 1849-1862 (2020)
Joe Conroy
Yutong Lei
Chris Liu
12.2.A
12.2.B
12.2.C
13 Monday, April 10 Cheng et al.: PGE: Robust Product Graph Embedding Learning for Error Detection. Proc. VLDB Endow. 15(6): 1288-1296 (2022)
Pan et al.: Privacy Risks of General-Purpose Language Models. IEEE Symposium on Security and Privacy 2020: 1314-1331
Cheng et al.: HAFLO: GPU-Based Acceleration for Federated Logistic Regression. CoRR abs/2107.13797 (2021)
Derrick Gnana
Jonathan Leibovich
Maximilian
13.1.A
13.1.B
13.1.C
Wednesday, April 12 No class, work on project
14 Monday, April 17 Rekatsinas et al.: HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10(11): 1190-1201 (2017)
Eskandarian et al.: ObliDB: Oblivious Query Processing for Secure Databases. Proc. VLDB Endow. 13(2): 169-183 (2019)
Fu et al.: BlindFL: Vertical Federated Machine Learning without Peeking into Your Data. SIGMOD Conference 2022: 1316-1330
Chris Liu
Krithika Sundaram
Faizel Khan
14.1.A
14.1.B
14.1.C
Wednesday, April 19 No class, open office hour for AMA
15 Monday, April 24 Project final presentations
  1. May Lin, Krithika Sundaram
  2. Mohammed Guiga, Nicole Sullivan
  3. Chris Liu, Jonathan Leibovich, Mike Cao
Wednesday, April 26 Project final presentations
  1. Pratik Nehete
  2. Dat Luong, Shunichi Sawamura
  3. Joe Conroy
16 Monday, May 1
Project report is due by May 10
Project final presentations
  1. Derrick Gnana, Zahara Spilka
  2. Yutong Lei
  3. Tisbia Mpoyo
  4. Maximilian Scheder-Bieschin, Faizel Khan
Class photo

University Policies