This course studies the algorithmic and system design of modern data systems that address emerging needs in data science. This semester-long seminar covers three such needs. First, we will discuss detecting and repairing data errors, along with the data cleaning systems built for this purpose. Second, we will discuss security and privacy requirements for data analytics, and cover the tools and systems for meeting them. Third, we will discuss the challenges of consolidating information from siloed data, and investigate the design principles and optimization opportunities in data federation.
The course is organized as a series of seminars presented by the instructor and students. The instructor will present an overview and the fundamental techniques for all three topics in the first weeks of the course. After that, each seminar session will cover three papers, one on each of the three topics, each presented by a different student. Each presenter is expected to give a conference-style paper presentation first, and then lead the discussion with all participants. Other students are strongly encouraged to read the papers before the seminar and, after attending, are expected to submit a one-page summary of each presented paper that highlights its merits and challenges.
There will be no exams. Instead, each student will identify a concrete problem related to the topics of this course and complete a semester-long project, either independently or in a group of no more than two students. Each project has three milestones, at the beginning, middle, and final stages of the project. The project will involve implementing some of the techniques covered in class, with modifications tailored to the specific problem, and performing comparative studies of alternative techniques. At the end of the semester, each project is expected to be fully summarized in a technical report. A strong project may lead to a publishable paper.
Late submissions without prior consent will not be considered by default. All deadlines refer to the end of the day (11:59 PM Central Time).
Week | Date | Topic | Presenter | Slides |
---|---|---|---|---|
1 | Wednesday, January 18 | Lecture | Chang Ge | 1.1 |
2 | Monday, January 23 | Lecture | Chang Ge | 2.1 |
2 | Wednesday, January 25 (paper selection is due by January 25) | Lecture | Chang Ge | 2.2, 2.3 |
3 | Monday, January 30 | Li et al.: Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14(1): 50-60 (2020); Abadi et al.: Deep Learning with Differential Privacy. CCS 2016: 308-318; McMahan et al.: Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017: 1273-1282 | Jonathan Leibovich, Chris Liu, May Lin | 3.1.A, 3.1.B, 3.1.C |
3 | Wednesday, February 1 | Wu et al.: ZeroER: Entity Resolution using Zero Labeled Examples. SIGMOD Conference 2020: 1149-1164; Liu et al.: Dealer: An End-to-End Model Marketplace with Differential Privacy. Proc. VLDB Endow. 14(6): 957-969 (2021); Wang et al.: Federated Learning with Matched Averaging. ICLR 2020 | Krithika Sundaram, Mohammed Guiga, Shunichi Sawamura | 3.2.A, 3.2.B, 3.2.C |
4 | Monday, February 6 | Ahmadi et al.: Unsupervised Matching of Data and Text. ICDE 2022: 1058-1070; Yu et al.: Differentially Private Fine-tuning of Language Models. ICLR 2022; Rothchild et al.: FetchSGD: Communication-Efficient Federated Learning with Sketching. ICML 2020: 8253-8265 | Pratik Nehete, Faizel Khan, Zahara Spilka | 4.1.A, 4.1.B, 4.1.C |
4 | Wednesday, February 8 | Jin et al.: Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation. Proc. VLDB Endow. 15(3): 465-477 (2021); Chowdhury et al.: Strengthening Order Preserving Encryption with Differential Privacy. CCS 2022: 2519-2533; Smith et al.: Federated Multi-Task Learning. NIPS 2017: 4424-4434 | Shunichi Sawamura, Joe Conroy, Mohammed Guiga | 4.2.A, 4.2.B, 4.2.C |
5 | Monday, February 13 | Tu et al.: Domain Adaptation for Deep Entity Resolution. SIGMOD Conference 2022: 443-457; Böhler et al.: Secure Multi-party Computation of Differentially Private Heavy Hitters. CCS 2021: 2361-2377; Jiang et al.: Improving Federated Learning Personalization via Model Agnostic Meta Learning. CoRR abs/1909.12488 (2019) | Faizel Khan, Zahara Spilka, Yutong Lei | 5.1.A, 5.1.B, 5.1.C |
5 | Wednesday, February 15 (project proposal is due by February 19) | Galhotra et al.: Hierarchical Entity Resolution using an Oracle. SIGMOD Conference 2022: 414-428; Lu et al.: A General Framework for Auditing Differentially Private Machine Learning. NeurIPS 2022; Bui et al.: Federated User Representation Learning. CoRR abs/1909.12535 (2019) | Mike Cao, Pratik Nehete | 5.2.B, 5.2.C |
6 | Monday, February 20 | Simonini et al.: Entity Resolution On-Demand. Proc. VLDB Endow. 15(7): 1506-1518 (2022); Zhang et al.: LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning. SIGMOD Conference 2022: 945-958; Bagdasaryan et al.: How To Backdoor Federated Learning. AISTATS 2020: 2938-2948 | May Lin, Shunichi Sawamura, Tisbia Mpoyo | 6.1.A, 6.1.B, 6.1.C |
6 | Wednesday, February 22 | Zhou et al.: Serving Deep Learning Models with Deduplication from Relational Databases. Proc. VLDB Endow. 15(10): 2230-2243 (2022); Yang et al.: SAM: Database Generation from Query Workloads with Supervised Autoregressive Models. SIGMOD Conference 2022: 1542-1555; Bao et al.: Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy. Proc. VLDB Endow. 15(11): 2348-2360 (2022) | Maximilian, Nicole Sullivan | 6.2.A, 6.2.B |
7 | Monday, February 27 | Pena et al.: Fast Detection of Denial Constraint Violations. Proc. VLDB Endow. 15(4): 859-871 (2021); Yang et al.: Auto-Pipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search. Proc. VLDB Endow. 14(11): 2563-2575 (2021); Roth et al.: Honeycrisp: large-scale differentially private aggregation without a trusted core. SOSP 2019: 196-210 | Mike Cao, Pratik Nehete, Jonathan Leibovich | 7.1.A, 7.1.B, 7.1.C |
7 | Wednesday, March 1 | Hilprecht et al.: ReStore - Neural Data Completion for Relational Databases. SIGMOD Conference 2021: 710-722; Takenouchi et al.: PATSQL: Efficient Synthesis of SQL Queries from Example Tables with Quick Inference of Projected Columns. Proc. VLDB Endow. 14(11): 1937-1949 (2021); Liu et al.: Enabling SQL-based Training Data Debugging for Federated Learning. Proc. VLDB Endow. 15(3): 388-400 (2021) | Zahara Spilka, Derrick Gnana, Dat Vy Luong | 7.2.A, 7.2.B, 7.2.C |
8 | Monday, March 6 | Spring break, no classes | | |
8 | Wednesday, March 8 | Spring break, no classes | | |
9 | Monday, March 13 | Project midterm presentations | | |
9 | Wednesday, March 15 | Project midterm presentations | | |
10 | Monday, March 20 | Li et al.: Unsupervised Contextual Anomaly Detection for Database Systems. SIGMOD Conference 2022: 788-802; Sanghi et al.: Projection-Compliant Database Generation. Proc. VLDB Endow. 15(5): 998-1010 (2022); Li et al.: Federated Matrix Factorization with Privacy Guarantee. Proc. VLDB Endow. 15(4): 900-913 (2021) | Tisbia Mpoyo, Joe Conroy | 10.1.A, 10.1.C |
10 | Wednesday, March 22 | Cao et al.: Efficient Discovery of Sequence Outlier Patterns. Proc. VLDB Endow. 12(8): 920-932 (2019); Mughees et al.: OnionPIR: Response Efficient Single-Server PIR. CCS 2021: 2292-2306; Tong et al.: Hu-Fu: Efficient and Secure Spatial Queries over Data Federation. Proc. VLDB Endow. 15(6): 1159-1172 (2022) | Mohammed Guiga, May Lin, Derrick Gnana | 10.2.A, 10.2.B |
11 | Monday, March 27 | Wang et al.: Uni-Detect: A Unified Approach to Automated Error Detection in Tables. SIGMOD Conference 2019: 811-828; Dauterman et al.: Waldo: A Private Time-Series Database from Function Secret Sharing. IEEE Symposium on Security and Privacy 2022: 2450-2468; Xie et al.: FederatedScope: A Comprehensive and Flexible Federated Learning Platform via Message Passing. CoRR abs/2204.05011 (2022) | Nicole Sullivan, Maximilian, Mike Cao | 11.1.A, 11.1.B, 11.1.C |
11 | Wednesday, March 29 | Yan et al.: SCODED: Statistical Constraint Oriented Data Error Detection. SIGMOD Conference 2020: 845-860; Unnibhavi et al.: Secure and Policy-Compliant Query Processing on Heterogeneous Computational Storage Architectures. SIGMOD Conference 2022: 1462-1477; Zhang et al.: Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches. Proc. VLDB Endow. 14(10): 1769-1782 (2021) | Yutong Lei, Dat Vy Luong, Krithika Sundaram | 11.2.A, 11.2.B, 11.2.C |
12 | Monday, April 3 | Galhotra et al.: DataPrism: Exposing Disconnect between Data and Systems. SIGMOD Conference 2022: 217-231; Tan et al.: CryptGPU: Fast Privacy-Preserving Machine Learning on the GPU. IEEE Symposium on Security and Privacy 2021: 1021-1038; Stoddard et al.: Tanium Reveal: A Federated Search Engine for Querying Unstructured File Data on Large Enterprise Networks. Proc. VLDB Endow. 14(12): 3096-3109 (2021) | Dat Vy Luong, Tisbia Mpoyo, Nicole Sullivan | 12.1.A, 12.1.B, 12.1.C |
12 | Wednesday, April 5 | Mahdavi et al.: Raha: A Configuration-Free Error Detection System. SIGMOD Conference 2019: 865-882; Shastri et al.: Understanding and Benchmarking the Impact of GDPR on Database Systems. Proc. VLDB Endow. 13(7): 1064-1077 (2020); He et al.: TransNet: Training Privacy-Preserving Neural Network over Transformed Layer. Proc. VLDB Endow. 13(11): 1849-1862 (2020) | Joe Conroy, Yutong Lei, Chris Liu | 12.2.A, 12.2.B, 12.2.C |
13 | Monday, April 10 | Cheng et al.: PGE: Robust Product Graph Embedding Learning for Error Detection. Proc. VLDB Endow. 15(6): 1288-1296 (2022); Pan et al.: Privacy Risks of General-Purpose Language Models. IEEE Symposium on Security and Privacy 2020: 1314-1331; Cheng et al.: HAFLO: GPU-Based Acceleration for Federated Logistic Regression. CoRR abs/2107.13797 (2021) | Derrick Gnana, Jonathan Leibovich, Maximilian | 13.1.A, 13.1.B, 13.1.C |
13 | Wednesday, April 12 | No class, work on project | | |
14 | Monday, April 17 | Rekatsinas et al.: HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10(11): 1190-1201 (2017); Eskandarian et al.: ObliDB: Oblivious Query Processing for Secure Databases. Proc. VLDB Endow. 13(2): 169-183 (2019); Fu et al.: BlindFL: Vertical Federated Machine Learning without Peeking into Your Data. SIGMOD Conference 2022: 1316-1330 | Chris Liu, Krithika Sundaram, Faizel Khan | 14.1.A, 14.1.B, 14.1.C |
14 | Wednesday, April 19 | No class, open office hour for AMA | | |
15 | Monday, April 24 | Project final presentations | | |
15 | Wednesday, April 26 | Project final presentations | | |
16 | Monday, May 1 (project report is due by May 10) | Project final presentations | | |