Episodes

  • Ep. 247 - Part 3 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

    01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

    03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

    04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

    06:34: Towards Vision-Language Geo-Foundation Model: A Survey

    08:11: SimGen: Simulator-conditioned Driving Scene Generation

    09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

    11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

    12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

    13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

    15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

    16:29: Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

    17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    19:39: Real-Time Deepfake Detection in the Real-World

    21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

    23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

    24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

    26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

    28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

    31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    33:16: Towards Evaluating the Robustness of Visual State Space Models

    34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

    36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

    37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

    40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    41:40: Explore the Limits of Omni-modal Pretraining at Scale

    42:46: Interpreting the Weight Space of Customized Diffusion Models

    43:58: Depth Anything V2

    45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

    48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

    49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

    52 mins
  • Ep. 247 - Part 2 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

    02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques

    03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

    05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

    06:41: Auto-Vocabulary Segmentation for LiDAR Points

    07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

    08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

    10:23: Fine-Grained Domain Generalization with Feature Structuralization

    12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

    14:13: ReMI: A Dataset for Reasoning with Multiple Images

    15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

    17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition

    18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

    20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

    22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

    24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

    25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

    26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

    27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction

    29:28: Comparison Visual Instruction Tuning

    30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV

    33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

    36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

    37:30: Parameter-Efficient Active Learning for Foundational models

    38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

    40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans

    44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

    46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation

    48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression

    50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

    52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

    53 mins
  • Ep. 247 - Part 1 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: FouRA: Fourier Low Rank Adaptation

    01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

    03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

    04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

    06:46: ToSA: Token Selective Attention for Efficient Vision Transformers

    08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways

    09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint

    10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

    12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering

    13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

    15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

    16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

    18:16: Fusion of regional and sparse attention in Vision Transformers

    19:26: Zoom and Shift are All You Need

    20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

    21:49: The Penalized Inverse Probability Measure for Conformal Classification

    23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

    24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    26:30: Computer Vision Approaches for Automated Bee Counting Application

    27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

    28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras

    29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

    31:25: Neural NeRF Compression

    32:29: Preserving Identity with Variational Score for General-purpose 3D Editing

    33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

    34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

    36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

    37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring

    38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

    40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

    42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

    43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

    45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models

    47:08: Suitability of KANs for Computer Vision: A preliminary investigation

    48 mins
  • Ep. 246 - Part 3 - June 12, 2024
    Jun 13 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024.


    00:20: From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

    02:09: APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation

    03:57: 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction

    05:47: DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

    06:58: Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

    08:02: LaneCPP: Continuous 3D Lane Detection using Physical Priors

    09:23: FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

    11:10: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

    12:46: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    14:39: OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    16:49: AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images

    18:15: Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

    19:58: Coherent Optical Modems for Full-Wavefield Lidar

    21:32: Transformation-Dependent Adversarial Attacks

    22:45: PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement

    24:10: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    25:57: ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

    27:26: Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement

    28:51: Real2Code: Reconstruct Articulated Objects via Code Generation

    30:02: Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

    31:42: RMem: Restricted Memory Banks Improve Video Object Segmentation

    33:12: What If We Recaption Billions of Web Images with LLaMA-3?

    34:42: Real3D: Scaling Up Large Reconstruction Models with Real-World Images

    36:07: Enhancing End-to-End Autonomous Driving with Latent World Model

    37:12: Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation

    38:43: On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models

    40:16: Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

    42:15: ICE-G: Image Conditional Editing of 3D Gaussian Splats

    44 mins
  • Ep. 246 - Part 2 - June 12, 2024
    Jun 13 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024.


    00:21: From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization

    01:44: Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

    03:20: Adversarial Patch for 3D Local Feature Extractor

    04:00: Valeo4Cast: A Modular Approach to End-to-End Forecasting

    05:38: The impact of deep learning aid on the workload and interpretation accuracy of radiologists on chest computed tomography: a cross-over reader study

    08:50: Universal Scale Laws for Colors and Patterns in Imagery

    10:11: CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

    11:44: ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

    13:25: Continuous fake media detection: adapting deepfake detectors to new generative techniques

    15:18: Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment

    16:23: One-Step Effective Diffusion Network for Real-World Image Super-Resolution

    18:12: 2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

    19:22: Diffusion-Promoted HDR Video Reconstruction

    21:09: Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation

    21:52: A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion

    23:54: DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

    25:28: Using Deep Convolutional Neural Networks to Detect Rendered Glitches in Video Games

    26:39: OpenCOLE: Towards Reproducible Automatic Graphic Design Generation

    27:23: Dataset Enhancement with Instance-Level Augmentations

    28:33: Interpretable Representation Learning of Cardiac MRI via Attribute Regularization

    29:33: A New Class Biorthogonal Spline Wavelet for Image Edge Detection

    30:48: Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata

    32:10: Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance

    33:32: AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

    35:09: From Chaos to Clarity: 3DGS in the Dark

    36:32: LaMOT: Language-Guided Multi-Object Tracking

    38:07: UDON: Universal Dynamic Online distillatioN for generic image representations

    39:49: WMAdapter: Adding WaterMark Control to Latent Diffusion Models

    40:48: Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration

    42:06: DocSynthv2: A Practical Autoregressive Modeling for Document Generation

    43 mins
  • Ep. 246 - Part 1 - June 12, 2024
    Jun 13 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024.


    00:20: FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

    01:21: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

    02:49: Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification

    04:26: Flexible Music-Conditioned Dance Generation with Style Description Prompts

    05:52: Robust 3D Face Alignment with Multi-Path Neural Architecture Search

    07:00: Small Scale Data-Free Knowledge Distillation

    08:48: KernelWarehouse: Rethinking the Design of Dynamic Convolution

    10:31: A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

    12:34: Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

    14:02: IFTD: Image Feature Triangle Descriptor for Loop Detection in Driving Scenes

    14:54: Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection

    16:30: DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera

    18:10: Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation

    20:07: Accurate Explanation Model for Image Classifiers using Class Association Embedding

    21:55: Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network

    23:11: SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation

    24:06: Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

    25:34: OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

    26:58: Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model

    28:26: Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

    29:52: Deep Learning for Slum Mapping in Remote Sensing Images: A Meta-analysis and Review

    31:49: LVBench: An Extreme Long Video Understanding Benchmark

    33:14: Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

    34:48: A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

    36:23: 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement

    37:29: MWIRSTD: A MWIR Small Target Detection Dataset

    38:34: CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

    40:27: A²-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

    42:35: Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

    44:26: Identification of Conversation Partners from Egocentric Video

    46 mins
  • Ep. 245 - Part 3 - June 11, 2024
    Jun 13 2024

    ArXiv Computer Vision research for Tuesday, June 11, 2024.


    00:21: DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses

    01:44: Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration

    02:49: Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

    04:04: OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

    06:01: 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

    07:24: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    08:58: Image Neural Field Diffusion Models

    10:11: Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery

    12:29: GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

    14:26: ReduceFormer: Attention with Tensor Reduction by Summation

    15:23: Trim 3D Gaussian Splatting for Accurate Geometry Representation

    16:44: SPIN: Spacecraft Imagery for Navigation

    18:24: Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

    20:00: Understanding Visual Concepts Across Models

    21:12: Instant 3D Human Avatar Generation using Image Diffusion Models

    22:47: Neural Gaffer: Relighting Any Object via Diffusion

    24:19: Autoregressive Pretraining with Mamba in Vision

    25:51: Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

    27:19: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

    28:50: Situational Awareness Matters in 3D Vision Language Reasoning

    30:10: Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

    31:46: Zero-shot Image Editing with Reference Imitation

    33:08: Image and Video Tokenization with Binary Spherical Quantization

    34:18: An Image is Worth 32 Tokens for Reconstruction and Generation

    36:28: Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

    38 mins
  • Ep. 245 - Part 2 - June 11, 2024
    Jun 13 2024

    ArXiv Computer Vision research for Tuesday, June 11, 2024.


    00:21: NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

    01:27: Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

    03:14: T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text

    04:45: Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images

    06:23: FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

    07:52: RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation

    09:15: VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation

    10:51: RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection

    12:05: RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker

    13:52: MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD

    15:15: Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation

    16:56: MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

    18:20: Open-World Human-Object Interaction Detection via Multi-modal Prompts

    20:03: Which Country Is This? Automatic Country Ranking of Street View Photos

    20:44: Needle In A Multimodal Haystack

    22:10: Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models

    23:24: Towards Realistic Data Generation for Real-World Super-Resolution

    24:37: Unsupervised Object Detection with Theoretical Guarantees

    25:43: Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

    27:45: A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

    29:01: Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of Field

    30:24: Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach

    32:09: Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection

    33:52: Deep Implicit Optimization for Robust and Flexible Image Registration

    35:28: Visual Representation Learning with Stochastic Frame Prediction

    37 mins