Hoàng Anh Just | PhD Student

Hi, I'm Hoàng Anh Just.

A student at Virginia Tech with a passion for data valuation in machine learning.

About

I am a PhD student in computer engineering at the Responsible Data Science Lab (ReDS Lab) at Virginia Polytechnic Institute and State University (Virginia Tech).
I am fortunate to be advised by Prof. Ruoxi Jia.
I completed bachelor's degrees in mathematics and computer science at Gettysburg College, where I had the pleasure of working with Prof. Béla Bajnok and Prof. Todd Neller, respectively.
I enjoy working on data-centric AI, especially on measuring the importance of each data point used to train a model.

Focus Questions:

[Data Valuation]
How much should the data cost?
[Data Selection]
How to choose the best data to meet the model owner's expectations?
[Model Prediction]
How to predict the model performance given training data?
[AI Privacy]
How to protect your data used to train a model?
[Data Leakage]
How to extract data used to train a model?

Research

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

ICLR 2024

Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Sahu, Ruoxi Jia

Developed a scalable data selection method to pre-fine-tune a pretrained large language model (LLM) by selecting (unlabeled) data that can shift the source distribution to better align with the target distribution.

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources

NeurIPS 2023

Feiyang Kang, Hoang Anh Just, Anit Sahu, Ruoxi Jia

Proposed a performance estimator for models trained on any data composition, given only sample information, together with a scaling law that predicts performance at larger scales. Combined, they effectively find the optimal composition of data sources for any target data size.

NARCISSUS: A Practical Clean-Label Backdoor Attack with Limited Information

ACM CCS 2023

Yi Zeng, Minzhou Pan, Hoang Anh Just, Lingjuan Lyu, Meikang Qiu and Ruoxi Jia

Developed an efficient (poisoning 0.5% of the target class and 0.05% of the entire training dataset) and stealthy (hard to detect) backdoor attack that requires only knowledge of the target class to deploy successfully.

PrivMon: A Stream-Based System for Real-Time Privacy Attack Detection for Machine Learning Models

RAID 2023

Myeongseob Ko, Xinyu Yang, Zhengjie Ji, Hoang Anh Just, Peng Gao, Ruoxi Jia

Established an efficient real-time detection system for membership inference attacks, which prevents attackers from inferring sensitive data used for model training.

2D-Shapley: A Framework for Fragmented Data Valuation

ICML 2023

Liu Zhihong, Hoang Anh Just, Xiangyu Chang, Xi Chen, Ruoxi Jia

Proposed a novel, efficient, and theoretically grounded approach to fine-grained data analysis that values the quality of each feature of each data point.

LAVA: Data Valuation without Pre-Specified Learning Algorithms

ICLR 2023

Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin and Ruoxi Jia

Introduced an efficient data quality valuation method based on a modified class-wise Wasserstein distance, which is robust to noisy, mislabeled, and poisoned data and requires no model training.

SPOTLIGHT

ModelPred: A Framework for Predicting Trained Model from Training Data

IEEE SaTML 2023

Yingyan Zeng, Tianhao Wang, Si Chen, Hoang Anh Just, Ran Jin, Ruoxi Jia

Developed a set-function-based neural network that can predict model weights from a training dataset of any size. This method enables efficient data valuation, data selection, and data memorization analysis, all of which otherwise require multiple model retrainings.

Label-Only Model Inversion Attacks via Boundary Repulsion

CVPR 2022

Mostafa Kahla, Si Chen, Hoang Anh Just, Ruoxi Jia

Designed a novel, practical model inversion attack that recovers sensitive data by accessing only the labels of the model output, without additional information.

On perfect bases in finite Abelian groups

Involve 2022

Bela Bajnok, Connor Berson and Hoang Anh Just

Proved that for sets of size greater than 3, there is no perfect restricted 2-basis in Z_n. Showed that a perfect restricted 2-basis in Z_n exists only for sets of size at most 3; the proof is by contradiction, using the fact that Z_n is closed under both addition and subtraction.

Involve, a Journal of Mathematics, 12/2022

Opponent Hand Estimation in the Game of Gin Rummy

AAAI 2021

Peter Francis, Hoang Anh Just, Todd Neller

We describe various approaches to opponent hand estimation in the card game Gin Rummy. We use an application of Bayes' rule, as well as both simple and convolutional neural networks, to recognize patterns in simulated game play and predict the opponent's hand. We also present a new minimal-sized construction for using arrays to pre-populate hand representation images.