Deep learning models have made remarkable progress in recent years, achieving human-parity or even superhuman performance on various visual tasks such as recognition and detection. However, one major limitation of these deep-learning-based AI models is that they can be brittle, susceptible to adversarial attacks and distribution shifts. By contrast, humans typically have little difficulty handling visual inputs with small perturbations, unseen objects, occlusions, or variations in viewpoint and illumination. To mitigate this issue, this research proposal studies how to equip deep networks with robust representations from different perspectives, including architecture design and effective scaling of models and data. Our findings include: (1) pure CNN architectures without any attention-like operations that are more robust than Transformers on out-of-distribution benchmarks, challenging the prevailing belief in the superiority of self-attention-based architectures; and (2) better-than-ever adversarial robustness, achieved through effective and comprehensive scaling of model size, data, and training schedule. Finally, we also plan to study the application of our robust vision models in recent Multi-modal Large Language Models.

Event Host: Zeyu Wang, Ph.D. Student, Computer Science & Engineering

Advisor: Cihang Xie
