ArXiv Computer Vision research for Tuesday, June 11, 2024.
00:21: DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses
01:44: Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration
02:49: Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning
04:04: OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
06:01: 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
07:24: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
08:58: Image Neural Field Diffusion Models
10:11: Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery
12:29: GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection
14:26: ReduceFormer: Attention with Tensor Reduction by Summation
15:23: Trim 3D Gaussian Splatting for Accurate Geometry Representation
16:44: SPIN: Spacecraft Imagery for Navigation
18:24: Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
20:00: Understanding Visual Concepts Across Models
21:12: Instant 3D Human Avatar Generation using Image Diffusion Models
22:47: Neural Gaffer: Relighting Any Object via Diffusion
24:19: Autoregressive Pretraining with Mamba in Vision
25:51: Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
27:19: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
28:50: Situational Awareness Matters in 3D Vision Language Reasoning
30:10: Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
31:46: Zero-shot Image Editing with Reference Imitation
33:08: Image and Video Tokenization with Binary Spherical Quantization
34:18: An Image is Worth 32 Tokens for Reconstruction and Generation
36:28: Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring