UPCOMING EVENTS (Campus Events)


Prof. Trac-Duy Tran | Transformer-Driven U-Net Models for Image Processing

Time: 2026-04-17, 14:00 to 15:00

Venue: Teaching and Laboratory Building 108

Online link: 911-701-350

Speaker: Prof. Trac-Duy Tran

Host: Liang Jie (梁杰), Chair Professor

Lecture language: English

Organizer: Faculty of Information Science (信息学部)

Speaker
Trac D. Tran received the B.S. and M.S. degrees from the Massachusetts Institute of Technology, and the Ph.D. degree from the University of Wisconsin, Madison. Since 1998, he has been a Professor at the Department of Electrical and Computer Engineering, Johns Hopkins University. His research interests are in the field of digital signal processing and its applications in image/video analysis, compression, processing, and communications. His research results have been adopted by Microsoft Windows Media Video 9 and JPEG XR. Dr. Tran has served as Associate Editor / Senior Associate Editor of various IEEE Transactions, including PAMI. Dr. Tran received the NSF CAREER award in 2001 and was a co-recipient of the IEEE Mikio Takagi Best Paper Award in 2012 and the IEEE GRSS Highest Impact Paper Award in 2018. He is an IEEE Fellow.
Abstract

In this talk, we introduce a hybrid U-Net architecture that pairs a multi-resolution Vision Transformer (ViT) encoder with a CNN decoder. The ViT encoder captures the global sparse support, whereas the CNN decoder concentrates reconstruction capacity on support-consistent regions, enabling the model to combine global high-level context with fine low-level local detail. We demonstrate that this framework consistently outperforms existing networks, improving representation accuracy and reducing hallucination artifacts while requiring substantially less training data. These gains are observed across multiple image processing tasks and benchmarks, including optical imaging, MRI, and ImageNet. Overall, our results show that pairing attention-guided, transformer-based signal representation with local CNN kernels provides a principled and effective solution for low-level image processing.
