ResQ Residual Quantization for Video Perception

CDUL CLIP-Driven Unsupervised Learning for Multi-Label Image Classification

SATR Zero-Shot Semantic Segmentation of 3D Shapes

GePSAn Generative Procedure Step Anticipation in Cooking Videos

CLIPTER Looking at the Bigger Picture in Scene Text Recognition

A-STAR Test-time Attention Segregation and Retention for Text-to-image Synthesis

Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding

Person Re-Identification without Identification via Event anonymization

SSDA Secure Source-Free Domain Adaptation

Sample-wise Label Confidence Incorporation for Learning with Noisy Labels

Story Visualization by Online Text Augmentation with Context Memory

Continual Learning for Personalized Co-speech Gesture Generation

Data-Free Class-Incremental Hand Gesture Recognition

Efficient Controllable Multi-Task Architectures

Yes we CANN Constrained Approximate Nearest Neighbors for Local Feature-Bas

Self-Supervised Object Detection from Egocentric Videos

Iterative Superquadric Recomposition of 3D Objects from Multiple Views

StyleDomain Efficient and Lightweight Parameterizations of StyleGAN for One-shot an

HMD-NeMo Online 3D Avatar Motion Generation From Sparse Observations

Zenseact Open Dataset A Large-Scale and Diverse Multimodal Dataset fo

Task Agnostic Restoration of Natural Video Dynamics

VidStyleODE Disentangled Video Editing via StyleGAN and NeuralODEs

Learning Human-Human Interactions in Images from Weak Textual Supervision

Rapid Adaptation in Online Continual Learning Are We Evaluating It

3D Instance Segmentation via Enhanced Spatial and Semantic Supervision

XiNet Efficient Neural Networks for tinyML

BEVBert Multimodal Map Pre-training for Language-guided Navigation

MiniROAD Minimal RNN Framework for Online Action Detection

Towards Content-based Pixel Retrieval in Revisited Oxford and Paris

Long-range Multimodal Pretraining for Movie Understanding

Viewing Graph Solvability in Practice

LIST Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction

MixBag Bag-Level Data Augmentation for Learning from Label Proportions

uSplit Image Decomposition for Fluorescence Microscopy

SINC Spatial Composition of 3D Human Motions for Simultaneous Action

DarSwin Distortion Aware Radial Swin Transform

Unified Out-Of-Distribution Detection A Model-Specific Perspective

ADAPT Efficient Multi-Agent Trajectory Prediction with Adaptation

Make-An-Animation Large-Scale Text-conditional 3D Human Motion Generation

Markov Game Video Augmentation for Action Segmentation

Adaptive Spiral Layers for Efficient 3D Representation Learning on Meshes

Luminance-aware Color Transform for Multiple Exposure Correction

EigenTrajectory Low-Rank Descriptors for Multi-Modal Trajectory Forecasting

PNI Industrial Anomaly Detection using Position and Neighborhood Information

CC3D Layout-Conditioned Generation of Compositional 3D Scenes

How Much Temporal Long-Term Context is Needed for Action Segmentation

Cross-Domain Product Representation Learning for Rich-Content E-Commerce

Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF

Unified Data-Free Compression Pruning and Quantization without Fine-Tuning

HRS-Bench Holistic Reliable and Scalable Benchmark for Text-to-Image Models

Towards Improved Input Masking for Convolutional Neural Networks

Multimodal Garment Designer Human-Centric Latent Diffusion Models for Fashion Imag

Zero-Shot Composed Image Retrieval with Textual Inversion

CleanCLIP Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning

Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

DEDRIFT Robust Similarity Search under Content Drift

Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity A

Visual Explanations via Iterated Integrated Attributions

BeLFusion Latent Diffusion for Behavior-Driven Human Motion Prediction

With a Little Help from Your Own Past Prototypical Memory

Localizing Moments in Long Video Via Multimodal Guidance

Zip-NeRF Anti-Aliased Grid-Based Neural Radiance Fields

Active Stereo Without Pattern Projector

SatlasPretrain A Large-Scale Dataset for Remote Sensing Image Understanding

Inspecting the Geographical Representativeness of Images from Text-to-Image Models

XMem Production-level Video Segmentation From Few Annotated Frames

A Game of Bundle Adjustment - Learning Efficient Convergence

MapFormer Boosting Change Detection by Using Pre-change Information

EigenPlaces Training Viewpoint Robust Models for Visual Place Recognition

Vision Transformer Adapters for Generalizable Multitask Learning

Self-Supervised Burst Super-Resolution

Detecting Objects with Context-Likelihood Graphs and Graph Refinement

Breaking Common Sense WHOOPS A Vision-and-Language Benchmark of Synthetic an

VL-Match Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching

VADER Video Alignment Differencing and Retrieval

Mesh2Tex Generating Mesh Textures from Image Queries

Beyond the Pixel a Photometrically Calibrated HDR Dataset for Luminanc

Distilling from Similar Tasks for Transfer Learning on a Budget

HyperReenact One-Shot Reenactment via Jointly Learning to Refine and Retarget

IDiff-Face Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Model

Plausible Uncertainties for Human Pose Regression

Compatibility of Fundamental Matrices for Complete Viewing Graphs

A Multidimensional Analysis of Social Biases in Vision Transformers

Contrastive Model Adaptation for Cross-Condition Robustness in Semantic Segmentation

Preface A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Fac

FS-DETR Few-Shot DEtection TRansformer with Prompting and without Re-Training

ReGen A good Generative Zero-Shot Video Classifier Should be Rew

V-FUSE Volumetric Depth Map Fusion with Long-Range Constraints

UniverSeg Universal Medical Image Segmentation

Towards Building More Robust Models with Frequency Bias

Building a Winning Team Selecting Source Model Ensembles using

Active Self-Supervised Learning A Few Low-Cost Relationships Are All You

CLNeRF Continual Learning Meets NeRF

Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Cam

DiffDreamer Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion

Doppelgangers Learning to Disambiguate Images of Similar Structures

EfficientViT Lightweight Multi-Scale Attention for High-Resolution Dense Prediction

IIEU Rethinking Neural Feature Activation from Decision-Making

MixReorg Cross-Modal Mixed Patch Reorganization is a Good Mask Learn

ObjectFusion Multi-modal 3D Object Detection with Object-Centric Fusion

Rehearsal-Free Domain Continual Face Anti-Spoofing Generalize More and Forget Less

Retinexformer One-stage Retinex-based Transformer for Low-light Image Enhancement

Robust Object Modeling for Visual Tracking

Exploiting Proximity-Aware Tasks for Embodied Social Navigation

Improving Online Lane Graph Extraction by Object-Lane Clustering

Anomaly Detection Under Distribution Shift

Attention Where It Matters Rethinking Visual Document Understanding with Selectiv

E2E-LOAD End-to-End Long-form Online Action Detection

Efficient-VQGAN Towards High-Resolution Image Generation with Efficient Vision Transformers

Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints

Knowledge-Aware Federated Active Learning with Non-IID Data

MasaCtrl Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis an

Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation

Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion

OmniZoomer Learning to Move and Zoom in on Sphere at

Re-mine Learn and Reason Exploring the Cross-modal Semantic Correlations fo

SceneRF Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

Strip-MLP Efficient Token Interaction for Vision MLP

TexFusion Synthesizing 3D Textures with Text-Guided Image Diffusion Models

Going Beyond Nouns With Vision Language Models Using Synthetic

A Simple Recipe to Meta-Learn Forward and Backward Trans

Pix2Video Video Editing using Image Diffusion

Global Adaptation Meets Local Generalization Unsupervised Domain Adaptation for 3D

HiFace High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic

StableVideo Text-driven Consistency-aware Diffusion Video Editing

DETRDistill A Universal Knowledge Distillation Framework for DETR-families

HairNeRF Geometry-Aware Image Synthesis for Hairstyle Trans

Neural Radiance Field with LiDAR maps

Revisiting Vision Transformer from the View of Path Ensemble

Generative Novel View Synthesis with 3D-Aware Diffusion Models

Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Tim

ReLeaPS Reinforcement Learning-based Illumination Planning for Generalized Photometric Stereo

SpinCam High-Speed Imaging via a Rotating Point-Spread Function

Shape Analysis of Euclidean Curves under Frenet-Serret Framework

PASTA Proportional Amplitude Spectrum Training Augmentation for Syn-to-Real Domain Generalization

Towards Realistic Evaluation of Industrial Continual Learning Scenarios with an

Quality Diversity for Visual Pre-Training

3DMiner Discovering Shapes from Large-Scale Unannotated Image Datasets

Adversarial Bayesian Augmentation for Single-Source Domain Generalization

ChartReader A Unified Framework for Chart Derendering and Comprehension without

Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning

DNA-Rendering A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering

DVGaze Dual-View Gaze Estimation

Forecast-MAE Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders

Frequency Guidance Matters in Few-Shot Learning

General Image-to-Image Translation with One-Shot Image Guidance

HandR2N2 Iterative 3D Hand Pose Estimation Using a Residual Recurrent

LISTER Neighbor Decoding for Length-Insensitive Scene Text Recognition

LU-NeRF Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

MixSpeech Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech

Multi-Scale Bidirectional Recurrent Network with Hybrid Correlation for Point Clou

PRIOR Prototype Representation Joint Learning from Medical Images and Reports

ReST A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking

Score Priors Guided Deep Variational Inference for Unsupervised Real-World Singl

Tracking Anything with Decoupled Video Segmentation

Activate and Reject Towards Safe Domain Generalization under Category Shift

AdaMV-MoE Adaptive Multi-Task Vision Mixture-of-Experts

AdvDiffuser Natural Adversarial Example Synthesis with Diffusion Models

AGG-Net Attention Guided Gated-Convolutional Network for Depth Image Completion

An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability

AREA Adaptive Reweighting via Effective Area for Long-Tailed Classification

Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with

A Generalist Framework for Panoptic Segmentation of Images and Videos

A Retrospect to Multi-prompt Learning across Vision and Language

Be Everywhere - Hear Everything BEE Audio Scene Reconstruction by

BoMD Bag of Multi-label Descriptors for Noisy Chest X-ray Classification

Building Vision Transformers with Hierarchy Aware Feature Aggregation

CancerUniT Towards a Single Unified Model for Effective Detection Segmentation

Category-aware Allocation Transformer for Weakly Supervised Object Localization

CuNeRF Cube-Based Neural Radiance Field for Zero-Shot Medical Image Arbitrary-Scal

Deep Multiview Clustering by Contrasting Cluster Assignments

DiffRate Differentiable Compression Rate for Efficient Vision Transformers

DiffusionDet Diffusion Model for Object Detection

Domain Generalization via Rationale Invariance

DReg-NeRF Deep Registration for Neural Radiance Fields

Dual Aggregation Transformer for Image Super-Resolution

Dynamic Residual Classifier for Class Incremental Learning

Editable Image Geometric Abstraction via Neural Primitive Assembly

Efficient Deep Space Filling Curve

Efficient Video Action Detection with Token Dropout and Context Refinement

Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only

Fan-Beam Binarization Difference Projection FB-BDP A Novel Local Object Descripto

Fantasia3D Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation

FBLNet FeedBack Loop Network for Driver Attention Prediction

FocalFormer3D Focusing on Hard Instance for 3D Object Detection

FPR False Positive Rectification for Weakly Supervised Semantic Segmentation

FRAug Tackling Federated Learning with Non-IID Features via Representation Augmentation

Generating Dynamic Kernels via Transformers for Lane Detection

GridPull Towards Scalability in Learning Implicit Representations from 3D Point

Group DETR Fast DETR Training with Group-Wise One-to-Many Assignment

HumanMAC Masked Motion Completion for Human Motion Prediction

Joint Implicit Neural Representation for High-fidelity and Compact Vector Fonts

Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction

Learning from Noisy Data for Semi-Supervised 3D Object Detection

MHEntropy Entropy Meets Multiple Hypotheses for Pose and Shape Recovery

Mimic3D Thriving 3D-Aware GANs via 3D-to-2D Imitation

MoTIF Learning Motion Trajectories with Local Implicit Neural Functions fo

Multi-view Self-supervised Disentanglement for General Image Denoising

NeuRBF A Neural Fields Representation with Adaptive Radial Basis Functions

Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation

Open-vocabulary Panoptic Segmentation with Embedding Modulation

Overcoming Forgetting Catastrophe in Quantization-Aware Training

PointDC Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-Modal

Ray Conditioning Trading Photo-consistency for Photo-realism in Multi-view Image Generation

Rethinking Point Cloud Registration as Masking and Reconstruction

Revisiting Domain-Adaptive 3D Object Detection by Reliable Diverse and Class-balanc

SHIFT3D Synthesizing Hard Inputs For Tricking 3D Detectors

SINC Self-Supervised In-Context Learning for Vision-Language Tasks

Single-Stage Diffusion NeRF A Unified Approach to 3D Generation an

SIRA-PCR Sim-to-Real Adaptation for 3D Point Cloud Registration

Size Does Matter Size-aware Virtual Try-on via Clothing-oriented Transformation Try-on

SMMix Self-Motivated Image Mixing for Vision Transformers

Snow Removal in Video A New Dataset and A Novel

Sound Localization from Motion Jointly Learning Sound Direction and Cam

Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal o

SVQNet Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic

Tem-Adapter Adapting Image-Text Pretraining for Video Question Answ

Text2Tex Text-driven Texture Synthesis via Diffusion Models

The Devil is in the Crack Orientation A New Perspectiv

Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts

Traj-MAE Masked Autoencoders for Trajectory Prediction

TrajectoryFormer 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

TransIFF An Instance-Level Feature Fusion Framework for Vehicle-Infrastructure Cooperative 3D

TransTIC Transferring Transformer-based Image Compression from Human Perception to Machin

UniT3D A Unified Transformer for 3D Dense Captioning and Visual

VeRi3D Generative Vertex-based Radiance Fields for 3D Controllable Human Imag

Video Action Recognition with Attentive Semantic Units

VQA Therapy Exploring Answer Differences by Visually Grounding Answers

WDiscOOD Out-of-Distribution Detection via Whitened Linear Discriminant Analysis

Weakly-supervised 3D Pose Transfer with Keypoints

Workie-Talkie Accelerating Federated Learning by Overlapping Computing and Communications vi

Parametric Information Maximization for Generalized Category Discovery

Muscles in Action

Better May Not Be Fairer A Study on Subgroup Discrepancy

Spacetime Surface Regularization for Neural Dynamic Scene Reconstruction

DiffV2S Diffusion-Based Video-to-Speech Synthesis with Vision-Guided Speaker Embedding

Environment Agnostic Representation for Visual Reinforcement Learning

Exploring Positional Characteristics of Dual-Pixel Data for Camera Autofocus

ORC Network Group-based Knowledge Distillation using Online Role Change

R-Pred Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement

TEMPO Efficient Multi-View Pose Estimation Tracking and Forecasting

Diffusion-SDF Conditional Generative Modeling of Signed Distance Functions

AdVerb Visually Guided Audio Dereverberation

Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting

Complementary Domain Adaptation and Generalization for Unsupervised Continual Domain Shift

DALL-Eval Probing the Reasoning Skills and Social Biases of Text-to-Imag

Distribution-Aware Prompt Tuning for Vision-Language Models

Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction

Local or Global Selective Knowledge Assimilation for Federated Learning with

Non-Coaxial Event-Guided Motion Deblurring with Spatial Alignment

PromptStyler Prompt-driven Style Generation for Source-free Domain Generalization

Image-Free Classifier Injection for Zero-Shot Classification

LAN-HDR Luminance-based Alignment Network for High Dynamic Range Video Reconstruction

Shortcut-V2V Compression Framework for Video-to-Video Translation Based on Temporal Redundancy

MixPath A Unified Approach for One-shot Neural Architecture Search

Rethinking Fast Fourier Convolution in Image Inpainting

A2Q Accumulator-Aware Quantization with Guaranteed Overflow Avoidance

To Adapt or Not to Adapt Real-Time Adaptation for Semantic

Enhancing NeRF akin to Enhancing LLMs Generalizable NeRF Transformer with

Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion

Learning Depth Estimation for Transparent and Mirror Surfaces

Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models

Moment Detection in Long Tutorial Videos

Focal Network for Image Restoration

Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation

Learning Hierarchical Features with Joint Latent Space Energy-Based Prior

P2C Self-Supervised Point Cloud Completion from Single Partial Clouds

SportsMOT A Large Multi-Object Tracking Dataset in Multiple Sports Scenes

Test-time Personalizable Forecasting of 3D Human Poses

Cloth2Body Generating 3D Human Body Mesh from 2D Clothing

Indoor Depth Recovery Based on Deep Unfolding with Non-Local Prior

X-VoE Measuring eXplanatory Violation of Expectation in Physical Events

Multi-body Depth and Camera Pose Estimation from Multiple Views

AutoSynth Learning to Generate 3D Training Data for Object Point

Search for or Navigate to Dual Adaptive Thinking for Object

TransFace Calibrating Transformer Training for Face Recognition from a Data-Centric

EverLight Indoor-Outdoor Editable HDR Lighting Estimation

Efficient Video Prediction via Sparsely Conditioned Flow Matching

LIMITR Leveraging Local Information for Medical Image-Text Representation

Vision Grid Transformer for Document Layout Analysis

Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes

PoseFix Correcting 3D Human Poses with Natural Language

A Large-scale Study of Spatiotemporal Representation Learning with a New

Explicit Motion Disentangling for Efficient Optical Flow Estimation

GrowCLIP Data-Aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-Training

Identity-Consistent Aggregation for Video Object Detection

Learning Neural Eigenfunctions for Unsupervised Semantic Segmentation

NeRF-LOAM Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry an

PIRNet Privacy-Preserving Image Restoration Network via Wavelet Lifting

Prompt Switch Efficient CLIP Adaptation for Text-Video Retrieval

Towards Inadequately Pre-trained Models in Transfer Learning

Bayesian Prompt Learning for Image-Language Model Generalization

Sample4Geo Hard Negative Sampling For Cross-View Geo-Localisation

Cross-modal Latent Space Alignment for Image to Avatar Translation

Strata-NeRF Neural Radiance Fields for Stratified Scenes

General Planar Motion from a Pair of 3D Correspondences

3DMOTFormer Graph Transformer for Online 3D Multi-Object Tracking

MeViS A Large-scale Benchmark for Video Segmentation with Motion Expressions

Minimal Solutions to Generalized Three-View Relative Pose Problem

MOSE A New Dataset for Video Object Segmentation in Complex

PivotNet Vectorized Pivot Learning for End-to-end HD Map Construction

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Unsupervised Manifold Linearizing and Clustering

VertexSerum Poisoning Graph Neural Networks for Link Inference

SFHarmony Source Free Domain Adaptation for Distributed Neuroimaging Analysis

U-RED Unsupervised 3D Shape Retrieval and Deformation for Partial Point

Lip2Vec Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual

TAPIR Tracking Any Point with Per-Frame Initialization and Temporal Refinement

Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models

AG3D Learning to Generate 3D Avatars from 2D Image Collections

Boosting Long-tailed Object Detection via Step-wise Learning on Smooth-tail Data

Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation

Cross-view Topology Based Consistent and Complementary Information for Deep Multi-view

CVSformer Cross-View Synthesis Transformer for Semantic Scene Completion

Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video

EMQ Evolving Training-free Proxies for Automated Mixed Precision Quantization

Heterogeneous Forgetting Compensation for Class-Incremental Learning

iVS-Net Learning Human View Synthesis from Internet Videos

Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning

Large-Scale Land Cover Mapping with Fine-Grained Classes via Class-Aware Semi-Supervis

Multi-Scale Residual Low-Pass Filter Network for Image Deblurring

One-bit Flip is All You Need When Bit-flip Attack Meets

Preserving Tumor Volumes for Unsupervised Medical Image Registration

Prompt Tuning Inversion for Text-driven Image Editing Using Diffusion Models

Shape Anchor Guided Holistic Indoor Scene Understanding

Sparse Instance Conditioned Multimodal Trajectory Prediction

Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-Identification

TORE Token Reduction for Efficient Human Mesh Recovery with Transform

Reducing Training Time in Cross-Silo Federated Learning Using Multigraph Topology

Rosetta Neurons Mining the Common Units in a Model Zoo

One-Shot Recognition of Any Material Anywhere Using Contrastive Learning with

SkeleTR Towards Skeleton-based Action Recognition in the Wild

Towards Saner Deep Image Registration

Towards Semi-supervised Learning with Non-random Missing Labels

A Low-Shot Object Counting Network With Iterative Prototype Adaptation

SAFE Machine Unlearning With Shard Graphs

Eventful Transformers Leveraging Temporal Redundancy in Vision Transformers

Multi-View Active Fine-Grained Visual Recognition

σ-Adaptive Decoupled Prototype for Few-Shot Object Detection

Semi-Supervised Learning via Weight-Aware Distillation under Class Distribution Mismatch

HyperDiffusion Generating Implicit Neural Fields with Weight-Space Diffusion

Physically-Plausible Illumination Distribution Estimation

Structure and Content-Guided Video Synthesis with Diffusion Models

All4One Symbiotic Neighbour Contrastive Learning via Self-Attention and Redundancy Reduction

Diffusion in Style

Reinforce Data Multiply Impact Improved Model Accuracy and Robustness with

PODA Prompt-driven Zero-shot Domain Adaptation

FastRecon Few-shot Industrial Anomaly Detection via Fast Feature Reconstruction

GIFD A Generative Gradient Inversion Method with Feature Domain Optimization

Locating Noise is Halfway Denoising for Semi-Supervised Segmentation

Robust Heterogeneous Federated Learning under Data Corruption

SQAD Automatic Smartphone Camera Quality Assessment and Benchmarking

Tracing the Origin of Adversarial Attack for Forensic Investigation an

UATVR Uncertainty-Adaptive Text-Video Retrieval

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object

Visible-Infrared Person Re-Identification via Semantic Alignment and Affinity Inference

Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Motion-Guided Masking for Spatiotemporal Representation Learning

Occ2Net Robust Image Matching Based on 3D Occupancy Estimation fo

Once Detected Never Lost Surpassing Human Performance in Offline LiDAR

RCA-NOC Relative Contrastive Alignment for Novel Object Captioning

Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric

Simulating Fluids in Real-World Still Images

SSB Simple but Strong Baseline for Boosting Performance of Open-Set

Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory

Unpaired Multi-domain Attribute Translation of 3D Facial Shapes with

Unsupervised Open-Vocabulary Object Localization in Videos

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning

3D Motion Magnification Visualizing Subtle Motions from Time-Varying Radiance Fields

Clustering based Point Cloud Representation Learning for 3D Analysis

CVRecon Rethinking 3D Geometric Feature Learning For Neural Reconstruction

DiffPose SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

Generalizing Neural Human Fitting to Unseen Poses With Articulated SE

Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection

Score-Based Diffusion Models as Principled Priors for Inverse Imaging

Semantically Structured Image Compression via Irregular Group-Based Decoupling

SimFIR A Simple Framework for Fisheye Image Rectification with Self-supervis

Towards Instance-adaptive Inference for Federated Learning

ViM Vision Middleware for Unified Downstream Transferring

The Stable Signature Rooting Watermarks in Latent Diffusion Models

TeD-SPAD Temporal Distinctiveness for Self-Supervised Privacy-Preservation for Video Anomaly Detection

Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection

Distribution-Aligned Diffusion for Human Mesh Recovery

Jumping through Local Minima Quantization in the Loss Landscape o

NLOS-NeuS Non-line-of-sight Neural Implicit Surface

ASAG Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Ancho

Dancing in the Dark A Benchmark towards General Low-light Video

Deformer Dynamic Fusion Transformer for Robust Hand Pose Estimation

GPGait Generalized Pose-based Gait Recognition

Towards High-Quality Specular Highlight Removal by Leveraging Large-Scale Synthetic Data

TripLe Revisiting Pretrained Model Reuse and Progressive Learning for Efficient

UnitedHuman Harnessing Multi-Source Data for High-Resolution Human Generation

VAPCNet Viewpoint-Aware 3D Point Cloud Completion

Erasing Concepts from Diffusion Models

Improving Unsupervised Visual Program Inference with Code Rewriting Families

Towards Models that Can See and R

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

Towards Robust Model Watermark via Reducing Parametric Vulnerability

Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields

Adaptive Testing of Computer Vision Models

A 5-Point Minimal Solver for Event Camera Relative Motion Estimation

A Unified Continual Learning Framework with General Parameter-Efficient Tuning

Coarse-to-Fine Amodal Segmentation with Shape Prior

Controllable Visual-Tactile Synthesis

CSDA Learning Category-Scale Joint Feature for Domain Adaptive Object Detection

DIFFGUARD Semantic Mismatch-Guided Out-of-Distribution Detection Using Pre-Trained Diffusion Models

DQS3D Densely-matched Quantization-aware Semi-supervised 3D Detection

Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation

Masked Diffusion Transformer is a Strong Image Synthesizer

MeMOTR Long-Term Memory-Augmented Transformer for Multi-Object Tracking

SIGMA Scale-Invariant Global Sparse Shape Matching

Strivec Sparse Tri-Vector Radiance Fields

Structural Alignment for Network Pruning through Partial Regularization

Towards Better Robustness against Common Corruptions for Unsupervised Domain Adaptation

Tuning Pre-trained Model via Moment Probing

Robust Monocular Depth Estimation under Challenging Conditions

Segmenting Known Objects and Unseen Unknowns without Prior Knowledge

Tree-Structured Shading Decomposition

Audiovisual Masked Autoencoders

Advancing Example Exploitation Can Alleviate Critical Challenges in Adversarial Training

CLR Channel-wise Lightweight Reprogramming for Continual Learning

Expressive Text-to-Image Generation with Rich Text

MetaBEV Solving Sensor Failures for 3D Detection and Map Segmentation

Preserve Your Own Correlation A Noise Prior for Video Diffusion

Ref-NeuS Ambiguity-Reduced Neural Implicit Surface Learning for Multi-View Reconstruction with

Weakly-Supervised Action Segmentation and Unseen Error Detection in Anomalous Instructional

zPROBE Zero Peek Robustness Checks for Federated Learning

Handwritten and Printed Text Segmentation A Signature Case Study

ETran Energy-Based Transferability Estimation

SHACIRA Scalable HAsh-grid Compression for Implicit Neural Representations

SiLK Simple Learned Keypoints

Humans in 4D Reconstructing and Tracking Humans with Transformers

Who Are You Referring To Coreference Resolution In Image Narrations

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

ARNOLD A Benchmark for Language-Grounded Task Learning with Continuous States

TM2D Bimodality Driven 3D Dance Generation via Music-Text Integration

ToonTalker Cross-Domain Face Reenactment

SYENet A Simple Yet Effective Network for Multiple Low-Level Vision

Semantify Simplifying the Control of 3D Morphable Models Using CLIP

CrossLoc3D Aerial-Ground Cross-Source 3D Place Recognition

PIDRo Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval

Revisit PCA-based Technique for Out-of-Distribution Detection

Self-Supervised Character-to-Character Distillation for Text Recognition

DeLiRa Self-Supervised Depth Light and Radiance Fields

Towards Zero-Shot Scale-Aware Monocular Depth Estimation

Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning

Audio-Visual Deception Detection DOLOS Dataset and Parameter-Efficient Crossmodal Learning

Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information

Boundary-Aware Divide and Conquer A Diffusion-Based Solution for Unsupervised Shadow

Controllable Guide-Space for Generalizable Face Forgery Detection

DomainDrop Suppressing Domain-Sensitive Channels for Domain Generalization

EGC Image Generation and Classification via a Diffusion Energy-Based Model

Forward Flow for Novel View Synthesis of Dynamic Scenes

From Sky to the Ground A Large-scale Benchmark and Simpl

FSAR Federated Skeleton-based Action Recognition with Adaptive Topology Structure an

Membrane Potential Batch Normalization for Spiking Neural Networks

Physics-Augmented Autoencoder for 3D Skeleton-Based Gait Recognition

PolicyCleanse Backdoor Detection and Mitigation for Competitive Reinforcement Learning

RMP-Loss Regularizing Membrane Potential Distribution for Spiking Neural Networks

Robustifying Token Attention for Vision Transformers

Task-aware Adaptive Learning for Cross-domain Few-shot Learning

Template-guided Hierarchical Feature Restoration for Anomaly Detection

ViewRefer Grasp the Multi-view Knowledge for 3D Visual Grounding

Visual Traffic Knowledge Graph Generation from Scene Images

ASIC Aligning Sparse in-the-wild Image Collections

CLIPTrans Transferring Visual Knowledge with Pre-trained Models for Multimodal Machin

Eulerian Single-Photon Vision

Generalized Sum Pooling for Metric Learning

SPACE Speech-driven Portrait Animation with Controllable Expression

FACET Fairness in Computer Vision Evaluation Benchmark

Learned Compressive Representations for Single-Photon 3D Imaging

Class-relation Knowledge Distillation for Novel Class Discovery

Few-shot Continual Infomax Learning

Generalizable Neural Fields as Partially Observed Neural Processes

I Cant Believe Theres No Images Learning Visual Tasks Using

Remembering Normality Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection

Two Birds One Stone A Unified Framework for Joint Learning

Deep Geometry-Aware Camera Self-Calibration from Video

Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation

Fast Globally Optimal Surface Normal Estimation from an Affine Correspondenc

ClusT3 Information Invariant Test-Time Training

Efficient Diffusion Training via Min-SNR Weighting Strategy

AutoAD II The Sequel - Who When and What in

CHAMPAGNE Learning Real-world Conversation from Large-Scale Web Videos

CHORUS Learning Canonicalized 3D Human-Object Spatial Relations from Unboun

Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion

Dynamic Perceiver for Efficient Visual Recognition

E2VPT An Effective and Efficient Approach for Visual Prompt Tuning

FLatten Transformer Vision Transformer using Focused Linear Attention

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

HTML Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object

Neglected Free Lunch - Learning Image Classifiers Using Annotation Byproducts

Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network

Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network

STEERER Resolving Scale Variations for Counting and Localization via Selectiv

SVDiff Compact Parameter Space for Diffusion Fine-Tuning

Towards Attack-tolerant Federated Learning via Critical Parameter Analysis

Vision HGNN An Image is More than a Graph o

Class-Aware Patch Embedding Adaptation for Few-Shot Image Classification

Instruct-NeRF2NeRF Editing 3D Scenes with Instructions

BaRe-ESA A Riemannian Framework for Unregistered Human Body Shapes

FeatEnHancer Enhancing Hierarchical Features for Object Detection and Beyond Un

Will Large-scale Generative Models Corrupt Future Datasets

Point-TTA Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary

EgoTV Egocentric Task Verification from Natural Language Task Descriptions

Video OWL-ViT Temporally-consistent Open-world Localization in Video

Chasing Clouds Differentiable Volumetric Rasterisation of Point Clouds as

A Fast Unified System for 3D Object Detection and Tracking

Understanding Hessian Alignment for Domain Generalization

Energy-based Self-Training and Normalization for Unsupervised Domain Adaptation

Delta Denoising Sco

FunnyBirds A Synthetic Vision Dataset for a Part-Based Analysis o

Bidirectional Alignment for Domain Adaptive Detection with Transformers

BiViT Extremely Compressed Binary Vision Transformers

Candidate-aware Selective Disambiguation Based On Normalized Entropy for Instance-dependent Partial-label

Degradation-Resistant Unfolding Network for Heterogeneous Image Fusion

GlobalMapper Arbitrary-Shaped Urban Layout Generation

ICL-D3IE In-Context Learning with Diverse Demonstrations Updating for Document Information

OrthoPlanes A Novel Representation for Better 3D-Awareness of GANs

Pyramid Dual Domain Injection Network for Pan-sharpening

Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning

Shift from Texture-bias to Shape-bias Edge Deformation-based Augmentation for Robust

Speech4Mesh Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial

Thinking Image Color Aesthetics Assessment Models Datasets and Benchmarks

TopoSeg Topology-Aware Nuclear Instance Segmentation

Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning

Unsupervised Prompt Tuning for Text-Driven Object Detection

REAP A Large-Scale Realistic Adversarial Patch Benchmark

Normalizing Flows for Human Pose Anomaly Detection

Text2Room Extracting Textured 3D Meshes from 2D Text-to-Image Models

DiffPose Multi-hypothesis Human Pose Estimation using Diffusion Models

AesPA-Net Aesthetic Pattern-Aware Style Transfer Networks

Attention Discriminant Sampling for Point Clouds

Hyperbolic Audio-visual Zero-shot Learning

Implicit Identity Representation Conditioned Memory Compensation Network for Talking H

Improving Sample Quality of Diffusion Models Using Self-Attention Guidanc

Learning Navigational Visual Representations with Semantic Map Supervision

LVOS A Benchmark for Long-term Video Object Segmentation

On the Robustness of Normalizing Flows for Inverse Problems in

Out-of-Distribution Detection for Monocular Depth Estimation

Subclass-balancing Contrastive Learning for Long-tailed Recognition

When to Learn What Model-Adaptive Data Augmentation Curriculum

Class-incremental Continual Learning for Instance Segmentation with Image-level Weak Supervision

360VOT A New Benchmark Dataset for Omnidirectional Visual Object Tracking

Adaptive Frequency Filters As Efficient Global Token Mixers

Adaptive Nonlinear Latent Transformation for Conditional Face Editing

Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection

A Sentence Speaks a Thousand Images Domain Generalization through Distilling

CLIP2Point Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training

ConSlide Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual

Counting Crowds in Bad Weath

Delving into Motion-Aware Matching for Monocular 3D Object Tracking

DiffDis Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

ESTextSpotter Towards Better Scene Text Spotting with Explicit Synergy in

Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks

FULLER Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration

GameFormer Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction an

iDAG Invariant DAG Searching for Domain Generalization

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

Interactive Class-Agnostic Object Counting

InterFormer Real-time Interactive Image Segmentation

Learning Shape Primitives via Implicit Convexity Regularization

MGMAE Motion Guided Masking for Video Masked Autoencoding

Multi-Metrics Adaptively Identifies Backdoors in Federated Learning

Neural LiDAR Fields for Novel View Synthesis

One-shot Implicit Animatable Avatars with Model-based Priors

PADDLES Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy

PHRIT Parametric Hand Representation with Implicit Templat

Pixel-Wise Contrastive Distillation

Ponder Point Cloud Pre-training via Neural Rendering

Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot

Reconstructing Groups of People with Hypergraph Relational Reasoning

SAFARI Versatile and Efficient Evaluations for Robustness of Interpretability

Simoun Synergizing Interactive Motion-appearance Understanding for Vision-based Reinforcement Learning

Skill Transformer A Monolithic Policy for Mobile Manipulation

Understanding Self-attention Mechanism via Dynamical System Perspectiv

Video Task Decathlon Unifying Image and Video Tasks in Autonomous

Weakly Supervised Learning of Semantic Correspondence through Cascaded Online Correspondenc

What can Discriminator do Towards Box-free Ownership Verification of Generativ

Efficient LiDAR Point Cloud Oversegmentation Network

Focus on Your Target A Dual Teacher-Student Framework for Domain-Adaptiv

Beyond One-to-One Rethinking the Referring Image Segmentation

DandelionNet Domain Composition with Instance Adaptive Classification for Domain Generalization

DRAW Defending Camera-shooted RAW Against Image Manipulation

Explore and Tell Embodied Visual Captioning in 3D Environments

Federated Learning Over Images Vertical Decompositions and Pre-Trained Backbones A

Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering

Open-domain Visual Entity Recognition Towards Recognizing Millions of Wikipedia Entities

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency fo

PlankAssembly Robust 3D Reconstruction from Three Orthographic Views with Learnt

PromptCap Prompt-Guided Image Captioning for VQA with GPT-

Pseudo-label Alignment for Semi-supervised Instance Segmentation

SHERF Generalizable Human NeRF from a Single Imag

Single Image Reflection Separation via Component Synergy

TIFA Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

Tri-MipRF Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields

Unsupervised Feature Representation Learning for Domain-generalized Cross-domain Image Retrieval

VL-PET Vision-and-Language Parameter-Efficient Tuning via Granularity Control

FaceCLIPNeRF Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

UpCycling Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes

Scratching Visual Transformers Back with Uniform Attention

Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

RANA Relightable Articulated Neural Avatars

NeO 360 Neural Fields for Sparse View Synthesis of Outdoo

RED-PSM Regularization by Denoising of Partially Separable Models for Dynamic

Hidden Biases of End-to-End Driving Models

Efficiently Robustify Pre-Trained Models

UMFuse Unified Multi View Fusion for Human Editing Applications

Physics-Driven Turbulence Image Restoration with Stochastic Refinement

Dynamic Mesh Recovery from Partial Point Cloud Sequenc

Knowing Where to Focus Event-aware Transformer for Video Grounding

Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot

BlindHarmony Blind Harmonization for MR Images via Flow Model

The Power of Sound TPoS Audio Reactive Video Generation with

A Unified Framework for Robustness on Diverse Sampling Errors

Beyond Single Path Integrated Gradients for Reliable Input Attribution vi

Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations

Anatomical Invariance Modeling and Semantic Alignment for Self-supervised Learning in

AvatarCraft Transforming Text into Neural Human Avatars with Parameterized Sh

BUS Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization

Center-Based Decoupled Point-cloud Registration for 6D Object Pose Estimation

Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction

Diffuse3D Wide-Angle 3D Photography via Bilateral Diffusion

Domain Generalization via Balancing Training Difficulty and Model Capability

Efficient Decision-based Black-box Patch Attacks on Video Recognition

EMR-MSF Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity

Full-Body Articulated Human-Object Interaction

MEFLUT Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion

Optimizing the Placement of Roadside LiDARs for Autonomous Driving

Personalized Image Generation for Color Vision Deficiency Population

Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation

Revisiting Scene Text Recognition A Data Perspectiv

Scenimefy Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

Structure-Aware Surface Reconstruction via Primitive Assembly

Supervised Homography Learning with Realistic Dataset Generation

Text2Performer Text-Driven Human Video Generation

VAD Vectorized Scene Representation for Efficient Autonomous Driving

Video Action Segmentation via Contextually Refined Temporal Keypoints

AffordPose A Large-Scale Dataset of Hand-Object Interactions with Affordance-Driven Han

Unsupervised Domain Adaptation for Training Event-Based Networks Using Contrastive Learning

CoSign Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition

Semi-supervised Semantics-guided Adversarial Training for Robust Trajectory Prediction

DriveAdapter Breaking the Coupling Barrier of Perception and Planning in

Revisiting the Parameter Efficiency of Adapters from the Perspective o

Order-preserving Consistency Regularization for Domain Adaptation and Generalization

Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching

DiffusionRet Generative Text-Video Retrieval with Diffusion Model

Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspectiv

Growing a Brain with Sparsity-Inducing Generation for Continual Learning

Lighting Every Darkness in Two Pairs A Calibration-Free Pipeline fo

Recursive Video Lane Detection

Anchor Structure Regularization Induced Multi-view Subspace Clustering via Enhanced Tenso

Benchmarking and Analyzing Robust Point Cloud Recognition Bag of Tricks

Continual Segment Towards a Single Unified and Non-forgetting Continual Segmentation

DDP Diffusion Model for Dense Visual Prediction

Rethinking Video Frame Interpolation from Shutter Mode Induced Degradation

Single Image Deblurring with Row-dependent Blur Magnitu

Uncertainty-guided Learning for Improving Image Manipulation Detection

3D-Aware Generative Model for Improved Side-View Image Synthesis

MARS Model-agnostic Biased Object Removal without Additional Supervision for Weakly-Supervis

Panoramas from Photons

CAFA Class-Aware Feature Alignment for Test-Time Adaptation

Generating Instance-level Prompts for Rehearsal-free Continual Learning

DG-Recon Depth-Guided Neural 3D Scene Reconstruction

HumanSD A Native Skeleton-Guided Diffusion Model for Human Image Generation

MIMO-NeRF Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields

Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class an

A Soft Nearest-Neighbor Framework for Continual Semi-Supervised Learning

DDColor Towards Photo-Realistic Image Colorization via Dual Decoders

Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

Essential Matrix Estimation using Convex Relaxations in Orthogonal Spac

HoloFusion Towards Photo-realistic 3D Generative Modeling

DreamPose Fashion Video Synthesis with Stable Diffusion

Guided Motion Diffusion for Controllable Human Motion Synthesis

EMDB The Electromagnetic Database of Global 3D Human Pose an

LERF Language Embedded Radiance Fields

Text2Video-Zero Text-to-Image Diffusion Models are Zero-Shot Video Generators

FishNet A Large-scale Dataset and Benchmark for Fish Recognition Detection

Introducing Language Guidance in Prompt-based Continual Learning

Tiled Multiplane Images for Practical 3D Photography

Self-regulating Prompts Foundational Model Adaptation without Forgetting

Ego-Humans An Ego-Centric 3D Multi-Human Benchmark

Sentence Attention Blocks for Answer Grounding

Unsupervised Facial Performance Editing via Vector-Quantized StyleGAN Representations

PreSTU Pre-Training for Scene-Text Understanding

3D-aware Blending with Generative NeRFs

Adaptive Superpixel for Active Learning in Semantic Segmentation

Breaking Temporal Consistency Generating Video Universal Adversarial Perturbations Using Imag

Calibrating Panoramic Depth Estimation for Practical Localization and Mapping

Chupa Carving 3D Clothed Humans from Skinned Shape Priors using

Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents

Contrastive Feature Masking Open-Vocabulary Vision Transform

CRN Camera Radar Net for Accurate Robust Efficient 3D Perception

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

Dense Text-to-Image Generation with Attention Modulation

EP2P-Loc End-to-End 3D Point to 2D Pixel Localization for Large-Scal

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

Joint Demosaicing and Deghosting of Time-Varying Exposures for Single-Shot HDR

LDL Line Distance Functions for Panoramic Localization

Learning Point Cloud Completion without Complete Point Clouds A Pose-Aw

Lip Reading for Low-resource Languages by Learning and Combining General

Misalign Contrast then Distill Rethinking Misalignments in Language-Image Pre-training

NCHO Unsupervised Learning for Neural 3D Composition of Humans an

PODIA-3D Domain Adaptation of 3D Generative Model Across Large Domain

Predict to Detect Prediction-guided 3D Object Detection using Sequential Images

ProtoFL Unsupervised Federated Learning via Prototypical Distillation

Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery

SCOB Universal Text Understanding via Character-wise Supervised Contrastive Learning with

Self-Feedback DETR for Temporal Action Detection

Semantic-Aware Implicit Template Learning via Part Deformation Consistency

Shatter and Gather Learning Referring Image Segmentation with Text Supervision

Texture Learning Domain Randomization for Domain Generalized Segmentation

Convolutional Networks with Oriented 1D Kernels

Segment Anything

StyleLipSync Style-based Personalized Lip-sync Video Generation

DISeR Designing Imaging Systems with Reinforcement Learning

Towards Viewpoint Robustness in Birds Eye View Segmentation

LoCUS Learning Multiscale 3D-consistent Features from Posed Images

Computational 3D Imaging with Position Sensors

Disposable Transfer Learning for Selective Source Task Unlearning

Priority-Centric Human Motion Generation in Discrete Latent Spac

Rethinking Range View Representation for LiDAR Segmentation

Robo3D Towards Robust and Reliable 3D Perception against Corruptions

Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation

PG-RCNN Semantic Surface Point Generation for 3D Object Detection

SALAD Part-Level Latent Diffusion for 3D Shape Generation and Manipulation

Guiding Image Captioning Models Toward More Specific Captions

ENTL Embodied Navigation Trajectory Learn

Continuously Masked Transformer for Image Inpainting

Open-vocabulary Video Question Answering A New Benchmark for Evaluating th

Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models A Pilot

Navigating to Objects Specified by Images

Tetra-NeRF Representing Neural Radiance Fields Using Tetrah

Ablating Concepts in Text-to-Image Diffusion Models

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

RefEgo Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

TiDAL Learning Training Dynamics for Active Learning

COOL-CHIC Coordinate-based Low Complexity Hierarchical Image Codec

Hybrid Spectral Denoising Transformer with Guided Attention

Mask-Attention-Free Transformer for 3D Instance Segmentation

PADCLIP Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain

XVO Generalized Visual Odometry via Cross-Modal Self-Training

The Making and Breaking of Camouflag

Efficient Converted Spiking Neural Network for 3D and 2D Classification

Masked Autoencoders Are Stronger Knowledge Distillers

UniKD Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object

SeeABLE Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes

Adaptive Similarity Bootstrapping for Self-Distillation Based Representation Learning

Bayesian Optimization Meets Self-Distillation

Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification

DetermiNet A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using

Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors

ExBluRF Efficient Radiance Fields for Extreme Motion Blurred Images

Few-Shot Common Action Localization via Cross-Attentional Fusion of Context an

Generating Realistic Images from In-the-wild Sounds

Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition

Human Part-wise 3D Motion Context Learning for Sign Language Recognition

ICE-NeRF Interactive Color Editing of NeRFs via Decomposition-Aware Weight Optimization

Improving 3D Imaging with Pre-Trained Perpendicular 2D Diffusion Models

INSTA-BNN Binary Neural Network with INSTAnce-aware Threshol

Latent-OFER Detect Mask and Reconstruct with Latent Vectors for Occlu

Lecture Presentations Multimodal Dataset Towards Understanding Multimodality in Educational Videos

Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Locomotion-Action-Manipulation Synthesizing Human-Scene Interactions in Complex 3D Environments

Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Doubl

Neural Collage Transfer Artistic Reconstruction via Material Manipulation

Online Continual Learning on Hierarchical Label Expansion

Read-only Prompt Optimization for Vision-Language Few-shot Learning

Robust Evaluation of Diffusion-Based Adversarial Purification

Semantic-Aware Dynamic Parameter for Video Inpainting Transform

SlaBins Fisheye Depth Estimation using Slanted Bins on Road Environments

Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models

Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in

Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial

Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency

Decomposition-Based Variational Network for Multi-Contrast MRI Super-Resolution and Reconstruction

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction

DLT Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transform

EPiC Ensemble of Partial Point Clouds for Robust Classification

WALDO Future Video Synthesis Using Object Layer Decomposition and Parametric

Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning

Benchmarking Algorithmic Bias in Face Recognition An Experimental Approach Using

Coherent Event Guided Low-Light Video Enhancement

ENVIDR Implicit Differentiable Renderer with Neural Environment Lighting

Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing

Iterative Prompt Learning for Unsupervised Backlit Image Enhancement

Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution

Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation

MAAL Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects

MPI-Flow Learning Realistic Optical Flow with Multiplane Images

Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition

Simple Baselines for Interactive Video Retrieval with Questions and Answers

CheckerPose Progressive Dense Keypoint Localization for Object Pose Estimation with

WaterMask Instance Segmentation for Underwater Imagery

DocTr Document Transformer for Structured Information Extraction in Documents

RecRecNet Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model an

Segmentation of Tubular Structures Using Iterative Training with Tailored Samples

LightGlue Local Feature Matching at Light S

Algebraically Rigorous Quaternion Framework for the Neural Network Pose Estimation

A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions

DETR Does Not Need Multi-Scale or Locality Design

Exploring Group Video Captioning with Efficient Relational Approximation

Graph Matching with Bi-level Noisy Correspondenc

Hyperbolic Chamfer Distance for Point Cloud Completion

InfiniCity Infinite-Scale City Synthesis

Learning Vision-and-Language Navigation from YouTube Videos

Leveraging Intrinsic Properties for Non-Rigid Garment Alignment

MAtch eXpand and Improve Unsupervised Finetuning for Zero-Shot Action Recognition

MHCN A Hyperbolic Neural Network Model for Multi-view Hierarchical Clustering

MMST-ViT Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision

OmnimatteRF Robust Omnimatte with 3D Background Modeling

PourIt Weakly-Supervised Liquid Perception from a Single Image for Visual

Preparing the Future for Continual Semantic Segmentation

RealGraph A Multiview Dataset for 4D Real-world Context Graph Generation

Scale-Aware Modulation Meet Transform

Self-supervised Pre-training for Mirror Detection

SMAUG Sparse Masked Autoencoder for Efficient Video-Language Pre-Training

UniVTG Towards Unified Video-Language Temporal Grounding

Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generativ

VI-Net Boosting Category-level 6D Object Pose Estimation via Learning Decoupl

AerialVLN Vision-and-Language Navigation for UAVs

Augmented Box Replay Overcoming Foreground Shift for Incremental Object Detection

Beating Backdoor Attack at Its Own Gam

Beyond Image Borders Learning Feature Extrapolation for Unbounded Image Composition

Birds-Eye-View Scene Graph for Vision-Language Navigation

Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings

CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection

Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking

Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding

ContactGen Generative Contact Modeling for Grasp Generation

CPCM Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic

DeFormer Integrating Transformers with Deformable Models for 3D Shape Abstraction

Density-invariant Features for Distant Point Cloud Registration

Detection Transformer with Stable Matching

Diffusion Action Segmentation

DOLCE A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction

DREAM Efficient Dataset Distillation by Representative Matching

Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation

Few-Shot Dataset Distillation via Translative Pre-Training

Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation

FSI Frequency and Spatial Interactive Learning for Image Restoration in

Geometrized Transformer for Self-Supervised Homography Estimation

GeoMIM Towards Better 3D Knowledge Transfer via Masked Image Modeling

Group Pose A Simple Baseline for End-to-End Multi-Person Pose Estimation

HOSNeRF Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Instance Neural Radiance Fiel

Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

IST-Net Prior-Free Category-Level Pose Estimation with Implicit Space Transformation

Landscape Learning for Neural Network Inversion

LeaF Learning Frames for 4D Point Cloud Sequence Understanding

Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term

Learning Cross-Representation Affinity Consistency for Sparsely Supervised Biomedical Instance Segmentation

Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration

Learning to Identify Critical States for Reinforcement Learning from Videos

Learning to Upsample by Learning to Sampl

Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation

LoTE-Animal A Long Time-span Dataset for Endangered Animal Behavior Understanding

Low-Light Image Enhancement with Multi-Stage Residue Quantization and Brightness-Aware Attention

MODA Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Model Calibration in Dense Classification with Adaptive Label Perturbation

Monocular 3D Object Detection with Bounding Box Denoising in 3D

Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Imag

Multi-Modal Neural Radiance Field for Monocular Dense SLAM with

MUter Machine Unlearning on Adversarially Trained Models

MV-DeepSDF Implicit Modeling with Multi-Sweep Point Clouds for 3D Vehicl

Objects Do Not Disappear Video Object Detection by Single-Frame Object

Parallel Attention Interaction Network for Few-Shot Skeleton-Based Action Recognition

PARIS Part-level Reconstruction and Motion Analysis for Articulated Objects

Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increas

Periodically Exchange Teacher-Student for Source-Free Object Detection

PETRv2 A Unified Framework for 3D Perception from Multi-Camera Images

PlanarTrack A Large-scale Challenging Benchmark for Planar Object Tracking

Point-Query Quadtree for Crowd Counting Localization and Mo

Real-Time Neural Rasterization for Large Scenes

Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution

Referring Image Segmentation Using Text Supervision

RegFormer An Efficient Projection-Aware Transformer Network for Large-Scale Point Clou

Residual Pattern Learning for Pixel-Wise Out-of-Distribution Detection in Semantic Segmentation

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization

Seeing Beyond the Patch Scale-Adaptive Semantic Segmentation of High-resolution Remot

SimpleClick Interactive Image Segmentation with Simple Vision Transformers

SKiT a Fast Key Information Video Transformer for Online Surgical

SparseBEV High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Tangent Model Composition for Ensembling and Continual Fine-tuning

Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization

The Devil is in the Upsampling Architectural Decisions Made Simpl

TMA Temporal Motion Aggregation for Event-based Optical Flow

Towards Unsupervised Domain Generalization for Face Anti-Spoofing

TRM-UAP Enhancing the Transferability of Data-Free Universal Adversarial Perturbation vi

Uncertainty-aware Unsupervised Multi-Object Tracking

UniSeg A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg

Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models

When Epipolar Constraint Meets Non-Local Operators in Multi-View Stereo

Zero-1-to-3 Zero-shot One Image to 3D Object

2D3D-MATR 2D-3D Matching Transformer for Detection-Free Registration Between Images an

Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking

AlignDet Aligning Pre-training and Fine-tuning in Object Detection

Among Us Adversarially Robust Collaborative Perception by Consensus

An Embarrassingly Simple Backdoor Attack on Self-supervised Learning

AutoDiffusion Training-Free Optimization of Time Steps and Architectures for Automat

Automated Knowledge Distillation via Monte Carlo Tree Search

BEV-DG Cross-Modal Learning under Birds-Eye View for Domain Generalization o

Beyond Object Recognition A New Benchmark towards Object Concept Learning

Boosting Multi-modal Model Performance with Adaptive Gradient Modulation

Calibrating Uncertainty for Semi-Supervised Crowd Counting

CFCG Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision

CHORD Category-level Hand-held Object Reconstruction via Shape Deformation

CiteTracker Correlating Image and Text for Visual Tracking

ClimateNeRF Extreme Weather Synthesis in Neural Radiance Fiel

Collecting The Puzzle Pieces Disentangled Self-Driven Human Pose Transfer by

Compositional Feature Augmentation for Unbiased Scene Graph Generation

Contactless Pulse Estimation Leveraging Pseudo Labels and Self-Supervision

Coordinate Transformer Achieving Single-stage Multi-person Mesh Recovery from Videos

CORE Co-planarity Regularized Monocular Geometry Estimation with Weak Supervision

Cross Contrasting Feature Perturbation for Domain Generalization

D3G Exploring Gaussian Prior for Temporal Sentence Grounding with Glanc

DDIT Semantic Scene Completion via Deformable Deep Implicit Templates

DenseShift Towards Accurate and Efficient Low-Bit Power-of-Two Quantization

DFA3D 3D Deformable Attention For 2D-to-3D Feature Lifting

Differentiable Transportation Pruning

Discovering Spatio-Temporal Rationales for Video Question Answering

Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning

Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

Diverse Cotraining Makes Strong Semi-Supervised Segmento

DLGSANet Lightweight Dynamic Local and Global Self-Attention Networks for Imag

Do DALL-E and Flamingo Understand Each Oth

DPM-OT A New Diffusion Probabilistic Model Based on Optimal Transport

DreamTeacher Pretraining Image Backbones with Deep Generative Models

E3Sym Leveraging E3 Invariance for Unsupervised 3D Planar Reflective Symmetry

Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

End-to-end 3D Tracking with Decoupled Queries

Exploring Model Transferability through the Lens of Potential Energy

Exploring the Benefits of Visual Prompting in Differential Privacy

Extensible and Efficient Proxy for Neural Architecture Search

Fast Neural Scene Flow

FB-BEV BEV Representation from Forward-Backward View Transformations

Feature Modulation Transformer Cross-Refinement of Global Representation via High-Frequency Prio

FineDance A Fine-grained Choreography Dataset for 3D Full Body Danc

Foreground and Text-lines Aware Document Image Rectification

G2L Semantically Aligned and Uniform Video Grounding via Geodesic an

GPA-3D Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object

Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

Heterogeneous Diversity Driven Active Learning for Multi-Object Tracking

Hierarchical Visual Categories Modeling A Joint Representation Learning and Density

High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset an

I-ViT Integer-only Quantization for Efficient Vision Transformer Inferenc

IntentQA Context-aware Video Intent Reasoning

Inverse Compositional Learning for Weakly-supervised Relation Grounding

IOMatch Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers

JOTR 3D Joint Contrastive Learning with Transformers for Occluded Human

Knowledge-Spreader Learning Semi-Supervised Facial Action Dynamics by Consistifying Knowledge Granularity

Knowledge Proxy Intervention for Deconfounded Video Question Answering

Large Selective Kernel Network for Remote Sensing Object Detection

Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limit

Learning Fine-Grained Features for Pixel-Wise Video Correspondences

Learning Robust Representations with Information Bottleneck and Memory Network fo

Learning to Distill Global Representation for Sparse-View CT

Leveraging Inpainting for Single-Image Shadow Removal

LogicSeg Parsing Visual Semantics with Neural Logic Learning and Reasoning

MatrixCity A Large-scale City Dataset for City-scale Neural Rendering an

MemorySeg Online LiDAR Semantic Segmentation with a Latent Memory

Mitigating and Evaluating Static Bias of Action Representations in th

Monte Carlo Linear Clustering with Single-Point Supervision is Enough fo

Multi-Frequency Representation Enhancement with Privilege Information for Video Super-Resolution

Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation

MUVA A New Large-Scale Benchmark for Multi-View Amodal Instance Segmentation

NeRF-MS Neural Radiance Fields with Multi-Sequenc

NerfAcc Efficient Sampling Accelerates NeRFs

NeTO Neural Reconstruction of Transparent Objects with Self-Occlusion Aware Refraction-Tracing

Neural Characteristic Function Learning for Conditional Image Generation

Novel Scenes Classes Towards Adaptive Open-set Object Detection

No Fear of Classifier Biases Neural Collapse Inspired Federated Learning

On the Robustness of Open-World Test-Time Training Self-Training with Dynamic

Open-vocabulary Object Segmentation with Diffusion Models

OxfordTVG-HIC Can Machine Make Humorous Captions from Images

Partition-And-Debias Agnostic Biases Mitigation via a Mixture of Biases-Specific Experts

PatchCT Aligning Patch Set and Label Set with Conditional Transport

Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction

Pluralistic Aging Diffusion Autoenco

Point2Mask Point-supervised Panoptic Segmentation via Optimal Transport

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval

PVT A Simple End-to-End Latency-Aware Visual Tracking Framework

Q-Diffusion Quantizing Diffusion Models

ReactioNet Learning High-Order Facial Behavior from Universal Stimulus-Reaction by Dyadic

RenderIH A Large-Scale Synthetic Dataset for 3D Interacting Hand Pos

RepQ-ViT Scale Reparameterization for Post-Training Quantization of Vision Transformers

Representation Disparity-aware Distillation for 3D Object Detection

Rethinking Multi-Contrast MRI Super-Resolution Rectangle-Window Cross-Attention Transformer and Arbitrary-Scale Upsampling

Rethinking Vision Transformers for MobileNet Size and S

RFD-ECNet Extreme Underwater Image Compression with Reference to Feature Dictionary

RICO Regularizing the Unobservable for Indoor Compositional Reconstruction

Robust Referring Video Object Segmentation with Cyclic Structural Consensus

Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups

Sequential Texts Driven Cohesive Motions Synthesis with Natural Transitions

Skip-Plan Procedure Planning in Instructional Videos via Condensed Action Spac

StegaNeRF Embedding Invisible Information within Neural Radiance Fields

STPrivacy Spatio-Temporal Privacy-Preserving Action Recognition

TCOVIS Temporally Consistent Online Video Instance Segmentation

The Euclidean Space is Evil Hyperbolic Attribute Editing for Few-shot

Tube-Link A Flexible Cross Tube Framework for Universal Video Segmentation

UHDNeRF Ultra-High-Definition Neural Radiance Fields

UniFormerV2 Unlocking the Potential of Image ViTs for Video Understanding

Unify Align and Refine Multi-Level Semantic Alignment for Radiology Report

Unleashing the Potential of Spiking Neural Networks with Dynamic Confidenc

Unmasked Teacher Towards Training-Efficient Video Foundation Models

Variational Degeneration to Structural Refinement A Unified Framework for Superimpos

Virtual Try-On with Pose-Garment Keypoints Guided Inpainting

Your Diffusion Model is Secretly a Zero-Shot Classifi

Cross-modal Scalable Hierarchical Clustering in Hyperbolic spac

Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models

ATT3D Amortized Text-to-3D Object Synthesis

ELFNet Evidential Local-global Fusion for Stereo Matching

Robust e-NeRF NeRF from Sparse Noisy Events under Non-Uniform

3D VR Sketch Guided 3D Shape Prototyping and Exploration

BEVPlace Learning LiDAR-based Place Recognition using Birds Eye View Images

CopyRNeRF Protecting the CopyRight of Neural Radiance Fields

GAFlow Incorporating Gaussian Attention into Optical Flow

Harvard Glaucoma Detection and Progression A Multimodal Multitask Dataset an

KECOR Kernel Coding Rate Maximization for Active 3D Object Detection

LATR 3D Lane Detection from Monocular Images with Transform

Learning Optical Flow from Event Camera with Rendered Dataset

Learning Versatile 3D Shape Generation with Improved Auto-regressive Models

LexLIP Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval

On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement

Perpetual Humanoid Control for Real-time Simulated Avatars

PGFed Personalize Each Clients Global Objective for Federated Learning

Similarity Min-Max Zero-Shot Day-Night Domain Adaptation

A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View

Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with

Holistic Geometric Feature Learning for Structured Reconstruction

Label-Noise Learning with Intrinsically Long-Tailed Dat

Query Refinement Transformer for 3D Instance Segmentation

Removing Anomalies as Noises for Industrial Defect Localization

Scene-Aware Feature Matching

See More and Know More Zero-shot Point Cloud Segmentation vi

Set-level Guidance Attack Boosting Adversarial Transferability of Vision-Language Pre-training Models

TF-ICON Diffusion-Based Training-Free Cross-Domain Image Composition

Translating Images to Road Network A Non-Autoregressive Sequence-to-Sequence Approach

Urban Radiance Field Representation with Deformable Neural Mesh Primitives

Anchor-Intermediate Detector Decoupling and Coupling Bounding Boxes for Accurate Object

Aperture Diffraction for Compact Snapshot Spectral Imaging

Learning a Room with the Occ-SDF Hybrid Signed Distance Function

Measuring Asymmetric Gradient Discrepancy in Parallel Continual Learning

Fast Inference and Update of Probabilistic Density Estimation on Trajectory

How to Choose your Best Allies for a Transferable Attack

EgoLoc Revisiting 3D Object Localization from Egocentric Videos with Visual

Neural Microfacet Fields for Inverse Rendering

NAPA-VQ Neighborhood-Aware Prototype Augmentation with Vector Quantization for Continual Learning

TrackFlow Multi-Object tracking with Normalizing Flows

CAD-Estate Large-scale CAD Model Annotation in RGB Videos

Towards Zero Domain Gap A Comprehensive Study of Realistic LiDAR

SurfsUP Learning Fluid Simulation for Novel Surfaces

Chordal Averaging on Flag Manifolds and Its Applications

COCO-O A Benchmark for Object Detectors under Natural Distribution Shifts

Masked Motion Predictors are Strong 3D Action Representation Learners

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

VoroMesh Learning Watertight Surface Meshes with Voronoi Diagrams

Multi-Object Navigation with Dynamically Learned Neural Implicit Representations

Learning to Ground Instructional Articles in Videos through Narrations

A Benchmark for Chinese-English Scene Text Image Super-Resolution

Borrowing Knowledge From Pre-trained Language Model A New Data-efficient Visual

Deformable Neural Radiance Fields using RGB and Event Cameras

DetZero Rethinking Offboard 3D Object Detection with Long-term Sequential Point

Enhanced Soft Label for Semi-Supervised Semantic Segmentation

Fine-grained Unsupervised Domain Adaptation for Gait Recognition

GaFET Learning Geometry-aware Facial Expression Translation from In-The-Wild Images

Invariant Feature Regularization for Fair Face Recognition

Order-Prompted Tag Sequence Generation for Video Tagging

Rethinking Safe Semi-supervised Learning Transferring the Open-set Problem to A

Synchronize Feature Extracting and Matching A Single Branch Framework fo

Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Tracking by Natural Language Specification with Long Short-term Context Decoupling

Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks

WaveIPT Joint Attention and Flow Alignment in the Wavelet domain

X-Mesh Towards Fast and Accurate Text-driven 3D Stylization via Dynamic

Inter-Realization Channels Unsupervised Anomaly Detection Beyond One-Class Classification

A Theory of Topological Derivatives for Inverse Rendering of Geometry

Gender Artifacts in Visual Datasets

Towards Geospatial Foundation Models via Continual Pretraining

Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks

Tracking without Label Unsupervised Multiple Object Tracking via Contrastive Similarity

Encyclopedic VQA Visual Questions About Detailed Properties of Fine-Grained Categories

A Skeletonization Algorithm for Gradient-Based Optimization

M2T Masking Transformers Twice for Faster Decoding

Efficient Neural Supersampling on a Novel Gaming Dataset

Identification of Systematic Errors of Image Classifiers on Rare Subgroups

CauSSL Causality-inspired Semi-supervised Learning for Medical Image Segmentation

DDS2M Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration

Spectrum-guided Multi-granularity Referring Video Object Segmentation

Domain Generalization Guided by Gradient Signal to Noise Ratio o

SKED Sketch-guided Text-based 3D Editing

Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation

Geometric Viewpoint Learning with Hyper-Rays and Harmonics Encoding

Reference-guided Controllable Inpainting of Neural Radiance Fields

MATE Masked Autoencoders are Online 3D Test-Time Learners

Privacy-Preserving Face Recognition Using Random Frequency Components

Dark Side Augmentation Generating Diverse Night Examples for Metric Learning

Verbs in Action Improving Verb Understanding in Video-Language Models

MSI Maximize Support-Set Information for Few-Shot Segmentation

Online Class Incremental Learning on Stochastic Blurry Task Boundary vi

CROSSFIRE Camera Relocalization On Self-Supervised Features from an Implicit Representation

MolGrapher Graph-based Visual Recognition of Chemical Structures

PATMAT Person Aware Tuning of Mask-Aware Transformer for Face Inpainting

Class-Incremental Grouping Network for Continual Audio-Visual Learning

SIDGAN High-Resolution Dubbed Video Generation via Shift-Invariant Learning

LiveHand Real-time and Photorealistic Neural Hand Rendering

Multi-label Affordance Mapping from Egocentric Vision

ActorsNeRF Animatable Few-shot Human Rendering with Generalizable NeRFs

DiffTAD Temporal Action Detection with Proposal Denoising Diffusion

Mining bias-target Alignment from Voronoi Cells

Steered Diffusion A Generalized Framework for Plug-and-Play Conditional Image Synthesis

Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

DeePoint Visual Pointing Recognition and Direction Estimation

Pre-training Vision Transformers with Very Limited Synthesized Images

Representation Uncertainty in Self-Supervised Learning as Variational Inferenc

Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles

Interaction-aware Joint Attention Estimation Using People Attributes

DiffFacto Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion

CO-PILOT Dynamic Top-Down Point Cloud with Conditional Neighborhood Aggregation fo

Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh

Unmasking Anomalies in Road-Scene Segmentation

Multi-Directional Subspace Editing in Style-Spac

RbA Segmenting Unknown Regions Rejected by All

Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features

GaPro Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes

Improved Knowledge Transfer for Semi-Supervised Domain Adaptation via Trico Training

Can Language Models Learn to Listen

Parallax-Tolerant Unsupervised Deep Image Stitching

PARTNER Level up the Polar Representation for LiDAR 3D Object

RLSAC Reinforcement Learning Enhanced Sample Consensus for End-to-End Robust Estimation

All in Tokens Unifying Output Space of Visual Tasks vi

Deep Image Harmonization with Globally Guided Feature Transformation and Relation

Deep Image Harmonization with Learnable Augmentation

Fine-grained Visible Watermark Removal

NIR-assisted Video Enhancement via Unpaired 24-hour Dat

On the Audio-visual Synchronization for Lip-to-Speech Synthesis

Deep Incubation Training Large Models by Divide-and-Conquering

Part-Aware Transformer for Generalizable Person Re-identification

RankMixup Ranking-Based Mixup Training for Network Calibration

Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss

PRANC Pseudo RAndom Networks for Compacting Deep Models

Neural Implicit Surface Evolution

Audio-Visual Glance Network for Efficient Video Recognition

Time-to-Contact Map by Joint Estimation of Up-to-Scale Inverse Depth an

Chaotic World A Large and Challenging Benchmark for Human Behavio

Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces o

Editing Implicit Assumptions in Text-to-Image Diffusion Models

Black Box Few-Shot Adaptation for Vision-Language Models

RSFNet A White-Box Image Retouching Approach using Region-Specific Color Filters

Troubleshooting Ethnic Quality Bias with Curriculum Domain Adaptation for Fac

Conceptual and Hierarchical Latent Space Decomposition for Face Editing

Teaching CLIP to Count to Ten

NeSS-ST Detecting Good and Stable Keypoints with a Neural Stability

Domain Adaptive Few-Shot Open-Set Learning

FashionNTM Multi-turn Fashion Image Retrieval via Cascaded Memory

A Complete Recipe for Diffusion Generative Models

Locally Stylized Neural Radiance Fields

First Session Adaptation A Strong Replay-Free Baseline for Class-Incremental Learning

Adaptive Template Transformer for Mitochondria Segmentation in Electron Microscopy Images

Aria Digital Twin A New Benchmark Dataset for Egocentric 3D

COPILOT Human-Environment Collision Prediction and Localization from Egocentric Videos

Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

Few Shot Font Generation Via Transferring Similarity Guided Global Styl

Privacy Preserving Localization via Coordinate Permutations

Random Sub-Samples Generation for Self-Supervised Real Image Denoising

Scanning Only Once An End-to-end Framework for Fast Temporal Grounding

TransHuman A Transformer-based Human Representation for Generalizable Neural Human Rendering

Relightify Relightable 3D Faces from a Single Image via Diffusion

Taming Contrast Maximization for Learning Sequential Low-latency Event-based Optical Flow

MotionDeltaCNN Sparse CNN Inference of Frame Differences in Moving Cam

ACLS Adaptive and Conditional Label Smoothing for Network Calibration

COMPASS High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability

Content-Aware Local GAN for Photo-Realistic Super-Resolution

Label Shift Adapter for Test-Time Adaptation under Covariate and Label

Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in

Nearest Neighbor Guidance for Out-of-Distribution Detection

PC-Adapter Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds

Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models

SeiT Storage-Efficient Vision Training with Tokens Using 1 of Pixel

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocul

Understanding the Feature Norm for Out-of-Distribution Detection

Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models

Pretrained Language Models as Visual Planners for Human Assistanc

Multi-weather Image Restoration via Domain Translation

GlueStick Robust Image Matching by Sticking Points and Lines Togeth

Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction

Scalable Diffusion Models with Transformers

Clusterformer Cluster-based Transformer for 3D Object Detection in Point Clouds

Space-time Prompting for Video Class-incremental Learning

AutoReP Automatic ReLU Replacement for Fast Private Network Inferenc

CAME Contrastive Automated Model Evaluation

DELFlow Dense Efficient Learning of Scene Flow for Large-Scale Point

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic

EmoTalk Speech-Driven Emotional Disentanglement for 3D Face Animation

GET Group Event Transformer for Event-Based Vision

Source-free Domain Adaptive Human Pose Estimation

USAGE A Unified Seed Area Generation Paradigm for Weakly Supervis

TMR Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

Audio-Visual Class-Incremental Learning

Lens Parameter Estimation for Realistic Depth of Field Modeling

BANSAC A Dynamic BAyesian Network for Adaptive SAmple Consensus

A step towards understanding why classification helps regression

LDP-Feat Image Features with Local Differential Privacy

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

What Can a Cook in Italy Teach a Mechanic in

LD-ZNet A Latent Diffusion Approach for Text-Based Image Segmentation

Event-based Temporally Dense Optical Flow Estimation with Sequential Learning

DiFaReli Diffusion Face Relighting

Surface Normal Clustering for Implicit Representation of Manhattan Scenes

Learn TAROT with MENTOR A Meta-Learned Self-Supervised Approach for Trajectory

EgoVLPv2 Egocentric Video-Language Pre-training with Fusion in the Backbon

What Does a Platypus Look Like Generating Customized Prompts fo

Dynamic Point Fields

Inverse Problem Regularization with Hierarchical Variational Autoencoders

Keep It SimPool Who Said Supervised Transformers Suffer from Attention

Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation

Adaptive Rotated Convolution for Rotated Object Detection

Breaking The Limits of Text-conditioned 3D Motion Synthesis with Elaborativ

Decouple Before Interact Multi-Modal Prompt Learning for Continual Visual Question

LEA2 A Lightweight Ensemble Adversarial Attack via Non-overlapping Vulnerable Frequency

Sat2Density Faithful Density Learning from Satellite-Ground Image Pairs

Semantics Meets Temporal Correspondence Self-supervised Object-centric Learning in Videos

Stable Cluster Discrimination for Deep Clustering

Understanding 3D Object Interaction from a Single Imag

Dynamic Mesh-Aware Radiance Fields

March in Chat Interactive Prompting for Remote Embodied Referring Expression

Multi-view Spectral Polarization Propagation for Video Glass Segmentation

VLN-PETL Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

GlueGen Plug and Play Multi-modal Encoders for X-to-image Generation

SupFusion Supervised LiDAR-Camera Fusion for 3D Object Detection

UniFusion Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Birds-Eye-View

Gram-based Attentive Neural Ordinary Differential Equations Network for Video Nystagmography

MB-TaylorFormer Multi-Branch Efficient Transformer Expanded by Taylor Formula for Imag

Scratch Each Others Back Incomplete Multi-Modal Brain Tumor Segmentation vi

Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubul

E2NeRF Event Enhanced Neural Radiance Fields from Blurry Images

FateZero Fusing Attentions for Zero-shot Text-based Video Editing

High Quality Entity Segmentation

Deep Video Demoireing via Compact Invertible Dyadic Decomposition

Fingerprinting Deep Image Restoration Models

Semantic Information in Contrastive Learning

Single Image Defocus Deblurring via Implicit Neural Inverse Kernels

Boosting Whole Slide Image Classification from the Perspectives of Distribution

Novel-View Synthesis and Pose Estimation for Hand-Object Interaction from Spars

Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction

Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing

Multimodal Distillation for Egocentric Action Recognition

DreamBooth3D Subject-Driven Text-to-3D Generation

ScatterNeRF Seeing Through Fog with Physically-Based Inverse Neural Rendering

MOST Multiple Object Localization with Self-Supervised Transformers for Object Discovery

Perceptual Grouping in Contrastive Vision-Language Models

DynaMITe Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transform

Studying How to Efficiently and Effectively Guide Models with Explanations

SEMPART Self-supervised Multi-resolution Partitioning of Image Semantics

Prior-guided Source-free Domain Adaptation for Human Pose Estimation

Scale-MAE A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning

L-DAWA Layer-wise Divergence Aware Weight Aggregation in Federated Self-Supervised Visual

Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from

GeoUDF Surface Reconstruction from 3D Point Clouds via Geometry-guided Distanc

Hierarchical Prior Mining for Non-local Multi-View Stereo

Multiscale Structure Guided Diffusion for Image Deblurring

Reinforced Disentanglement for Face Swapping without Skip Connection

SG-Former Self-guided Transformer with Evolving Token Reallocation

UGC Unified GAN Compression for Efficient Image-to-Image Translation

Zero-guidance Segmentation Using Zero Segment Labels

CGBA Curvature-aware Geometric Black-box Attack

Efficient 3D Semantic Segmentation with Superpoint Transform

LightDepth Single-View Depth Self-Supervision from Illumination Declin

End2End Multi-View Feature Matching with Differentiable Pose Optimization

Re-ReND Real-Time Rendering of NeRFs across Devices

Waffling Around for Performance Visual Classification with Random Words an

Exemplar-Free Continual Transformer with Convolutions

Test Time Adaptation for Blind Image Quality Assessment

Tracking by 3D Model Estimation of Unknown Objects in Videos

Towards Viewpoint-Invariant Visual Recognition via Adversarial Training

Theoretical and Numerical Analysis of 3D Reconstruction Using Point an

ICICLE Interpretable Class Incremental Continual Learning

Gramian Attention Heads are Strong yet Efficient Vision Learners

MEGA Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation

Multi-Object Discovery by Low-Dimensional Object Motion

EDAPS Enhanced Domain-Adaptive Panoptic Segmentation

Learning Adaptive Neighborhoods for Graph Neural Networks

Chop Learn Recognizing and Generating Object-State Compositions

DataDAM Efficient Dataset Distillation with Attention Matching

Time Does Tell Self-Supervised Time-Tuning of Dense Image Representations

Walking Your LiDOG A Journey Through Multiple Domains for LiDAR

CDFSL-V Cross-Domain Few-Shot Learning for Videos

Spatio-Temporal Crop Aggregation for Video Representation Learning

You Never Get a Second Chance To Make a Goo

Domain Generalization of 3D Semantic Segmentation in Autonomous Driving

Point-SLAM Dense Neural Point Cloud-based SLAM

S-TREK Sequential Translation and Rotation Equivariant Keypoints for Local Featu

Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation

Curvature-Aware Training for Coordinate Networks

VQ3D Learning a 3D-Aware Generative Model on ImageNet

MI-GAN A Simple Baseline for Image Inpainting on Mobile Devices

SGAligner 3D Scene Alignment with Scene Graphs

Self-supervised Monocular Depth Estimation Lets Talk About The Weath

GACE Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors

Distracting Downpour Adversarial Weather Attacks for Motion Estimation

Probabilistic Modeling of Inter- and Intra-observer Variability in Medical Imag

R3D3 Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras

OmniLabel A Challenging Benchmark for Language-Based Object Detection

Discriminative Class Tokens for Text-to-Image Diffusion Models

MotionLM Multi-Agent Motion Forecasting as Language Modeling

DARTH Holistic Test-time Adaptation for Multiple Object Tracking

Vox-E Text-Guided Voxel Editing of 3D Objects

Sound Source Localization is All about Cross-Modal Alignment

FlipNeRF Flipped Reflection Rays for Few-shot Novel View Synthesis

Graphics2RAW Mapping Computer Graphics Images to Sensor RAW Images

LFS-GAN Lifelong Few-Shot Image Generation

How to Boost Face Recognition with StyleGAN

LiDAR-UDA Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation

Template Inversion Attack against Face Recognition Systems using 3D Fac

STEPs Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural

TiDy-PSFs Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions

SwiftFormer Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Neural Fields for Structured Lighting

Causal-DFQ Causality Guided Data-Free Network Quantization

Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Aliv

Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation

Action Sensitivity Learning for Temporal Action Localization

Building Bridge Across the Time Disruption and Restoration of Murals

Data-free Knowledge Distillation for Fine-grained Visual Categorization

Global Features are All You Need for Image Retrieval an

HiVLP Hierarchical Interactive Video-Language Pre-Training

LNPL-MIL Learning from Noisy Pseudo Labels for Promoting Multiple Instanc

NDDepth Normal-Distance Assisted Monocular Depth Estimation

Towards Multi-Layered 3D Garments Animation

Transparent Shape from a Single View Polarization Imag

Unified Pre-Training with Pseudo Texts for Text-To-Image Person Re-Identification

Replay Multi-modal Multi-view Acted Videos for Casual Holography

The Perils of Learning From Unlabeled Data Backdoor Attacks on

AdaptGuard Defending Against Universal Attacks for Model Adaptation

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on

Accurate and Fast Compressed Video Captioning

CLIP-Cluster CLIP-Guided Attribute Hallucination for Face Clustering

Dec-Adapter Exploring Efficient Decoder-Side Adapter for Bridging Screen Content an

FerKD Surgical Label Adaptation for Efficient Distillation

Learning Global-aware Kernel for Image Harmonization

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Clou

RPG-Palm Realistic Pseudo-data Generation for Palmprint Recognition

SegRCDB Semantic Segmentation via Formula-Driven Supervised Learning

Anomaly Detection using Score-based Perturbation Resilienc

BallGAN 3D-aware Image Synthesis with a Spherical Backgroun

BlendFace Re-designing Identity Encoders for Face-Swapping

3D Distillation Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces

Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transform

Deep Multitask Learning with Progressive Parameter Sharing

Dual Pseudo-Labels Interactive Self-Training for Semi-Supervised Visible-Infrared Person Re-Identification

EdaDet Open-Vocabulary Object Detection Using Early Dense Alignment

FreeCOS Self-Supervised Learning from Fractals and Unlabeled Images for Curvilin

LoGoPrompt Synthetic Text Images Can Be Good Visual Prompts fo

Lossy and Lossless L2 Post-training Model Size Compression

PhaseMP Robust 3D Pose Estimation via Phase-conditioned Human Motion Prio

PlaneRecTR Unified Query Learning for 3D Plane Recovery from

Prototype Reminiscence and Augmented Asymmetric Knowledge Aggregation for Non-Exemplar Class-Incremental

Trajectory Unified Transformer for Pedestrian Trajectory Prediction

VideoFlow Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

Video Anomaly Detection via Sequentially Learning Multiple Pretext Tasks

Efficient Computation Sharing for Multi-Task Visual Scene Understanding

What does CLIP know about a red circle Visual prompt

DPF-Net Combining Explicit Shape Priors in Deformable Primitive Field fo

eP-ALM Efficient Perceptual Augmentation of Language Models

Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration

3DPPE 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object

Adaptive Image Anonymization in the Context of Image Classification with

In-Style Bridging Text and Uncurated Videos with Style Transfer fo

Learning by Sorting Self-supervised Learning with Group Ordering Constraints

MosaiQ Quantum Generative Adversarial Networks for Image Generation on NISQ

SUMMIT Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets

Learning to Transform for Generalizable Instance-wise Invariance

Benchmarking Low-Shot Robustness to Natural Distribution Shifts

Learning to Learn How to Continuously Teach Humans and Machines

Scene Graph Contrastive Learning for Embodied Navigation

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

Deep Geometrized Cartoon Line Inbetweening

Neural Haircut Prior-Guided Strand-Based Hair Reconstruction

VLSlice Interactive Vision-and-Language Slice Discovery

Blending-NeRF Text-Driven Localized Editing in Neural Radiance Fields

Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in

Emotional Listener Portrait Neural Listener Head Generation with Emotion

Feature Proliferation -- the Cancer in StyleGAN and its Treatments

GraphAlign Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal

Householder Projector for Unsupervised Latent Semantics Discovery

LLM-Planner Few-Shot Grounded Planning for Embodied Agents with Large Languag

ModelGiF Gradient Fields for Model Functional Distance

Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning

Total-Recon Deformable Scene Reconstruction for Embodied View Synthesis

Under-Display Camera Image Restoration with Scattering Effect

Unsupervised Object Localization with Representer Point Selection

Mastering Spatial Graph Prediction of Road Networks

Kick Back Relax Learning to Reconstruct the World by

Corrupting Neuron Explanations of Deep Visual Features

FLIP Cross-domain Face Anti-spoofing with Language Guidance

Leaping Into Memories Space-Time Deep Feature Synthesis

FineRecon Depth-aware Feed-forward Network for Detailed 3D Reconstruction

LivePose Online 3D Reconstruction from Monocular Video with Dynamic Cam

SAGA Spectral Adversarial Geometric Attack on 3D Meshes

Agile Modeling From Concept to Classifier in Minutes

Rickrolling the Artist Injecting Backdoors into Text Encoders for Text-to-Imag

Vision Relation Transformer for Unbiased Scene Graph Generation

Exploring the Sim2Real Gap Using Digital Twins

SoDaCam Software-defined Cameras via Single-Photon Imaging

Adaptive Illumination Mapping for Shadow Detection in Raw Images

Alignment Before Aggregation Trajectory Memory Retrieval Network for Video Object

Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples

Contrastive Pseudo Learning for Open-World DeepFake Attribution

DIME-FM DIstilling Multimodal and Efficient Foundation Models

Dual Meta-Learning with Longitudinally Consistent Regularization for One-Shot Brain Tissu

FedPerfix Towards Partial Model Personalization of Vision Transformers in Federat

Going Denser with Open-Vocabulary Part Segmentation

Local Context-Aware Active Domain Adaptation

MAPConNet Self-supervised 3D Pose Transfer with Mesh and Point Contrastiv

MixSynthFormer A Transformer Encoder-like Structure with Mixed Synthetic Self-attention fo

Neural-PBIR Reconstruction of Shape Material and Illumination

Neural Reconstruction of Relightable Human Model from Monocular Video

SAFL-Net Semantic-Agnostic Feature Learning Network with Auxiliary Plugins for Imag

Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution

Spatially and Spectrally Consistent Deep Functional Maps

Spatio-temporal Prompting Network for Robust Video Feature Extraction

Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS

ViperGPT Visual Inference via Python Execution for Reasoning

SparseDet Improving Sparsely Annotated Object Detection with Pseudo-positive Mining

ACTIVE Towards Highly Transferable 3D Physical Camouflage for Universal an

TIJO Trigger Inversion with Joint Optimization for Defending Multimodal Backdoo

Smoothness Similarity Regularization for Few-Shot GAN Adaptation

Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff

CaPhy Capturing Physical Properties for Animatable Human Avatars

Deep Directly-Trained Spiking Neural Networks for Object Detection

Hiding Visual Information via Obfuscating Adversarial Perturbations

Name Your Colour For the Task Artificially Discover Colour Naming

NPC Neural Point Characters from Video

Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning

DINAR Diffusion Inpainting of Neural Textures for One-Shot Human Avatars

Preserving Modality Structure Improves Multi-Modal Learning

Viewset Diffusion 0-Image-Conditioned 3D Generative Models from 2D Data

ChildPlay A New Benchmark for Understanding Children's Gaze Behaviour

Global Perception Based Autoregressive Neural Processes

3D Segmentation of Humans in Point Clouds with Synthetic Data

Role-Aware Interaction Generation from Textual Description

CoTDet Affordance Knowledge Prompting for Task Driven Object Detection

DDG-Net Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization

Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement

Distribution Shift Matters for Knowledge Distillation with Webly Collected Images

Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation

ElasticViT Conflict-aware Supernet Training for Deploying Fast Vision Transformer on

Make-It-3D High-fidelity 3D Creation from A Single Image with Diffusion

Multiple Instance Learning Framework with Masked Hard Instance Mining fo

ProtoTransfer Cross-Modal Prototype Transfer for Point Cloud Segmentation

Scene Matters Model-based Deep Video Compression

SwinLSTM Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM

Temporal Collection and Distribution for Referring Video Object Segmentation

When Prompt-based Incremental Learning Does Not Meet Strong Pretraining

Social Diffusion Long-term Multiple Human Motion Anticipation

DS-Fusion Artistic Typography via Discriminated and Stylized Diffusion

EMMN Emotional Motion Memory Network for Audio-driven Emotional Talking Fac

3DHacker Spectrum-based Decision Boundary Generation for Hard-label 3D Point Clou

AdaNIC Towards Practical Neural Image Compression via Dynamic Transform Routing

Local and Global Logit Adjustments for Long-Tailed Learning

Enhanced Meta Label Correction for Coping with Label Corruption

Examining Autoexposure for Challenging Scenes

Alignment-free HDR Deghosting with Semantics Consistent Transform

StageInteractor Query-based Object Detector with Cross-stage Interaction

Tangent Sampson Error Fast Approximate Two-view Reprojection Error for Central

Imitator Personalized Speech-driven 3D Facial Animation

Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

Beyond Skin Tone A Multidimensional Measure of Apparent Skin Color

DPS-Net Deep Polarimetric Stereo Depth Estimation

Instance and Category Supervision are Alternate Learners for Continual Learning

MonoNeRF Learning a Generalizable Dynamic Radiance Field from Monocular Videos

Non-Semantics Suppressed Mask Learning for Unsupervised Video Semantic Compression

Prototypes-oriented Transductive Few-shot Learning with Conditional Transport

ShapeScaffolder Structure-Aware 3D Shape Generation from Text

Scene as Occupancy

Object-aware Gaze Target Detection

Linear Spaces of Meanings Compositional Structures in Vision-Language Models

Persistent-Transient Duality A Multi-Mechanism Approach for Modeling Human-Object Interaction

DECO Dense Estimation of 3D Human-Scene Contact In The Wild

DivideClassify Fine-Grained Classification for City-Wide Visual Geo-Localization

Spectral Graphormer Spectral Graph-Based Transformer for Egocentric Two-Hand Reconstruction using

Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution

Agglomerative Transformer for Human-Object Interaction Detection

FemtoDet An Object Detection Baseline for Energy Versus Performance Tradeoffs

ImGeoNet Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

MULLER Multilayer Laplacian Resizer for Vision

Self-supervised Cross-view Representation Reconstruction for Change Captioning

GECCO Geometrically-Conditioned Point Diffusion Models

SuS-X Training-Free Name-Only Transfer of Vision-Language Models

ProbVLM Probabilistic Adapter for Frozen Vison-Language Models

When Do Curricula Work in Federated Learning

PDiscoNet Semantically consistent part discovery for fine-grained recognition

Document Understanding Dataset and Evaluation DUDE

Anti-DreamBooth Protecting Users from Personalized Text-to-image Synthesis

Prototype-based Dataset Comparison

Poincare ResNet

Self-supervised Monocular Underwater Depth Recovery Image Restoration and a Real-s

ViLLA Fine-Grained Vision-Language Representation Learning from Real-World Data

FastViT A Fast Hybrid Vision Transformer Using Structural Reparameterization

Convex Decomposition of Indoor Scenes

P1AC Revisiting Absolute Pose From a Single Affine Correspondence

CLIPascene Scene Sketching with Different Types and Levels of Abstraction

MST-compression Compressing and Accelerating Binary Neural Networks with Minimum Spanning

End-to-End Diffusion Latent Optimization Improves Classifier Guidance

3D Human Mesh Recovery with Sequentially Global Rotation Estimation

3D Semantic Subspace Traverser Empowering 3D Generative Model with Sh

ALWOD Active Learning for Weakly-Supervised Object Detection

Batch-based Model Registration for Fast 3D Sherd Reconstruction

Building3D A Urban-Scale Dataset and Benchmarks for Learning Roof Structures

CBA Improving Online Continual Learning via Continual Bias Adaptor

CDAC Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic

CLIPN for Zero-Shot OOD Detection Teaching CLIP to Say No

CORE Cooperative Reconstruction for Multi-Agent Perception

Counterfactual-based Saliency Map Towards Visual Contrastive Explanations for Neural Networks

Creative Birds Self-Supervised Single-View 3D Style Transfer

Deep Active Contours for Real-time 6-DoF Object Tracking

Deep Equilibrium Object Detection

Deep Optics for Video Snapshot Compressive Imaging

DiLiGenT-Pi Photometric Stereo for Planar Surfaces with Rich Details -

DIRE for Diffusion-Generated Image Detection

DistillBEV Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual

Distribution-Consistent Modal Recovering for Incomplete Multimodal Learning

Does Physical Adversarial Example Really Matter to Autonomous Driving Towards

Domain Specified Optimization for Deployment Authorization

DREAMWALKER Mental Planning for Continuous Vision-Language Navigation

DyGait Exploiting Dynamic Representations for High-performance Gait Recognition

EfficientTrain Exploring Generalized Curriculum Learning for Training Visual Backbones

Ego-Only Egocentric Action Detection without Exocentric Transferring

Equivariant Similarity for Vision-Language Foundation Models

Evaluating Data Attribution for Text-to-Image Models

Event-Guided Procedure Planning from Instructional Videos with Text Supervision

Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

ExposureDiffusion Learning to Expose for Low-light Image Enhancement

Fg-T2M Fine-Grained Text-Driven Human Motion Generation via Diffusion Model

Generalizable Decision Boundaries Dualistic Meta-Learning for Open Set Domain Generalization

Get the Best of Both Worlds Improving Accuracy and Transferability

GlowGAN Unsupervised Learning of HDR Images from LDR Images in

GridMM Grid Memory Map for Vision-and-Language Navigation

Guiding Local Feature Matching with Surface Curvature

Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

HoloAssist an Egocentric Human Interaction Dataset for Interactive AI Assistants

Homography Guided Temporal Fusion for Road Line and Marking Segmentation

How Far Pre-trained Models Are from Neural Collapse on th

IHNet Iterative Hierarchical Network Guided by High-Resolution Estimated Information fo

Improved Visual Fine-tuning with Natural Language Supervision

Improving Zero-Shot Generalization for CLIP with Synthesized Prompts

Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation

Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution

Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video

Learning Human Dynamics in Autonomous Driving Scenarios

Learning Long-Range Information with Dual-Scale Transformers for Indoor Scene Completion

Learning Support and Trivial Prototypes for Interpretable Image Classification

Learning Unified Decompositional and Compositional NeRF for Editable Novel View

Lighting up NeRF via Unsupervised Decomposition and Enhancement

LoLep Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion

Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Imag

LRRU Long-short Range Recurrent Updating Networks for Depth Completion

Manipulate by Seeing Creating Manipulation Controllers from Pre-Trained Representations

Masked Spiking Transformer

Memory-and-Anticipation Transformer for Online Action Understanding

Mixed Neural Voxels for Fast Multi-view Video Synthesis

NEMTO Neural Environment Matting for Novel View and Relighting Synthesis

Neural Video Depth Stabilizer

NeuS2 Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction

Noise2Info Noisy Image to Information of Noise for Self-Supervised Imag

Not All Steps are Created Equal Selective Diffusion Distillation fo

Not Every Side Is Equal Localization Uncertainty Estimation for Semi-Supervis

Object as Query Lifting Any 2D Object Detector to 3D

Open-Vocabulary Object Detection With an Open Corpus

OpenOccupancy A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

OPERA Omni-Supervised Representation Learning with Hierarchical Supervisions

Ord2Seq Regarding Ordinal Regression as Label Sequence Prediction

Overwriting Pretrained Bias with Finetuning Data

PoseDiffusion Solving Pose Estimation via Diffusion-aided Bundle Adjustment

Query6DoF Learning Sparse Queries as Implicit Shape Prior for Category-Level

Random Boxes Are Open-world Object Detectors

ReFit Recurrent Fitting Network for 3D Human Recovery

Regularized Primitive Graph Learning for Unified Vector Mapping

RFLA A Stealthy Reflected Light Adversarial Attack in the Physical

ROME Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation

Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocul

Saliency Regularization for Self-Training with Partial Annotations

Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions

Scaling Data Generation in Vision-and-Language Navigation

Seal-3D Interactive Pixel-Level Editing for Neural Radiance Fields

SegGPT Towards Segmenting Everything in Context

Self-similarity Driven Scale-invariant Learning for Weakly Supervised Person Search

SpaceEvo Hardware-Friendly Search Space Design for Efficient INT8 Inference

Space Engage Collaborative Space Supervision for Contrastive-Based Semi-Supervised Semantic Segmentation

SparseNeRF Distilling Depth Ranking for Few-shot Novel View Synthesis

SSF Accelerating Training of Spiking Neural Networks with Stabilized Spiking

Structure Invariant Transformation for better Adversarial Transferability

StyleDiffusion Controllable Disentangled Style Transfer via Diffusion Models

StyleInV A Temporal Style Modulated Inversion Network for Unconditional Video

Take-A-Photo 3D-to-2D Generative Pre-training of Point Cloud Models

Too Large Data Reduction for Vision-Language Pre-Training

Towards Open-Vocabulary Video Instance Segmentation

Tracking Everything Everywhere All at Once

Treating Pseudo-labels Generation as Image Matting for Weakly Supervised Semantic

UMC A Unified Bandwidth-efficient and Multi-resolution based Collaborative Perception Framework

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

UniTR A Unified and Efficient Multi-Modal Transformer for Birds-Eye-View Representation

Unsupervised Video Deraining with An Event Camera

V3Det Vast Vocabulary Visual Detection Dataset

View Consistent Purification for Accurate Cross-View Localization

ViLTA Enhancing Vision-Language Pre-training through Textual Augmentation

VQA-GNN Reasoning with Multimodal Knowledge via Graph Neural Networks fo

Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling

What do neural networks learn in image classification A frequency

Why do networks have inhibitory/negative connections

Zolly Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction

Enhancing Privacy Preservation in Federated Learning via Learning Rate Perturbation

RPEFlow Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow an

SOCS Semantically-Aware Object Coordinate Space for Category-Level 6D Object Pos

UniDexGrasp Improving Dexterous Grasping Policy Learning via Geometry-Aware Curriculum an

Nerfbusters Removing Ghostly Artifacts from Casually Captured NeRFs

Video-FocalNets Spatio-Temporal Focal Modulation for Video Action Recognition

CroCo v2 Improved Cross-view Completion Pre-training for Stereo Matching an

Adaptive Reordering Sampler with Neurally Guided MAGSAC

Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting

DCPB Deformable Convolution Based on the Poincare Ball for Top-view

Diffusion Models as Masked Autoencoders

Disentangle then Parse Night-time Semantic Segmentation with Illumination Disentanglement

ELITE Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Imag

Generalized Differentiable RANSAC

HairCLIPv2 Unifying Hair Editing via Proxy Feature Blending

Improving CLIP Fine-tuning Performance

Improving Continuous Sign Language Recognition with Cross-Lingual Signs

Is Imitation All You Need Generalized Decision-Making with Dual-Phase Training

Multimodal High-order Relation Transformer for Scene Boundary Detection

Online Prototype Learning for Online Continual Learning

Passive Ultra-Wideband Single-Photon Imaging

SurroundOcc Multi-camera 3D Occupancy Prediction for Autonomous Driving

Temporal-Coded Spiking Neural Networks with Dynamic Firing Threshold Learning with

Towards Real-World Burst Image Super-Resolution Benchmark and Method

Unified Adversarial Patch for Cross-Modal Attacks in the Physical World

Affective Image Filter Reflecting Emotions from Text to Images

Joint Metrics Matter A Better Standard for Trajectory Forecasting

Divide and Conquer a Two-Step Method for High Quality Fac

Ordinal Label Distribution Learning

Pairwise Similarity Learning is SimPLE

Parametric Classification for Generalized Category Discovery A Baseline Study

SimNP Learning Self-Similarity Priors Between Neural Points

SAFE Sensitivity-Aware Features for Out-of-Distribution Object Detection

Unsupervised Learning of Object-Centric Embeddings for Cell Instance Segmentation in

AccFlow Backward Accumulation for Long-Range Optical Flow

Advancing Referring Expression Segmentation Beyond Single Image

A Latent Space of Stochastic Diffusion Models for Zero-Shot Imag

Betrayed by Captions Joint Caption Grounding and Generation for Open

Bold but Cautious Unlocking the Potential of Personalized Federated Learning

Computation and Data Efficient Backdoor Attacks

Deep Feature Deblurring Diffusion for Detecting Out-of-Distribution Objects

DiffuMask Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using

Efficient View Synthesis with Neural Radiance Distribution Field

Estimator Meets Equilibrium Perspective A Rectified Straight Through Estimator fo

Exploring Transformers for Open-world Instance Segmentation

Exploring Video Quality Assessment on User Generated Contents from Aesthetic

Face Clustering via Graph Convolutional Networks with Confidence Edges

Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation

Grounded Image Text Matching with Mismatched Relation Reasoning

Hallucination Improves the Performance of Unsupervised Visual Representation Learning

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Imag

HSR-Diff Hyperspectral Image Super-Resolution via Conditional Diffusion Models

Human Preference Score Better Aligning Text-to-Image Models with Human Preference

Improving Representation Learning for Histopathologic Images with Cluster Constraints

LA-Net Landmark-Aware Learning for Reliable Facial Expression Recognition under Label

Label-Efficient Online Continual Object Detection in Streaming Video

Learning Concordant Attention via Target-aware Alignment for Visible-Infrared Person Re-identification

Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation

Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly

LPFF A Portrait Dataset for Face Generators Across Large Poses

MedKLIP Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

MetaGCD Learning to Continually Learn in Generalized Category Discovery

Meta OOD Learning For Continuously Adaptive OOD Detection

MixCycle Mixup Assisted Semi-Supervised 3D Single Object Tracking with Cycl

ObjectSDF Improved Object-Compositional Neural Implicit Surfaces

OnlineRefer A Simple Online Baseline for Referring Video Object Segmentation

Randomized Quantization A Generic Augmentation for Data Agnostic Self-supervised Learning

S-VolSDF Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces

Scalable Video Object Segmentation with Simplified Framework

Segment Every Reference Object in Spatial and Temporal Spaces

Sketch and Text Guided Diffusion Model for Colored Point Clou

Source-free Depth for Object Pop-out

Spatial-Aware Token for Weakly Supervised Object Localization

Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes

Speech2Lip High-fidelity Speech to Lip Generation by Learning from

TinyCLIP CLIP Distillation via Affinity Mimicking and Weight Inheritance

Towards Universal LiDAR-Based 3D Object Detection by Multi-Domain Knowledge Transfer

Tune-A-Video One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

What Can Simple Arithmetic Operations Do for Temporal Modeling

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy

AssetField Assets Mining and Reconfiguration in Ground Feature Plane Representation

3D-aware Image Generation using 2D Diffusion Models

Denoising Diffusion Autoencoders are Unified Self-supervised Learners

Generative Action Description Prompts for Skeleton-based Action Recognition

GRAM-HD 3D-Consistent Image Generation at High Resolution with Generative Radianc

HM-ViT Hetero-Modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer

Rendering Humans from Object-Occluded Monocular Videos

Retro-FPN Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation

ADNet Lane Shape Prediction via Anchor Decomposition

Automatic Animation of Hair Blowing in Still Portrait Photos

Token-Label Alignment for Vision Transformers

CASSPR Cross Attention Single Scan Place Recognition

CMDA Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation

CoIn Contrastive Instance Feature Mining for Outdoor 3D Object Detection

Combating Noisy Labels with Sample Selection by Mining High-Discrepancy Examples

DiffIR Efficient Diffusion Model for Image Restoration

Few-Shot Video Classification via Representation Fusion and Promotion Learning

Holistic Label Correction for Noisy Multi-Label Classification

Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization

Personalized Semantics Excitation for Federated Image Classification

Window-Based Early-Exit Cascades for Uncertainty Estimation When Deep Ensembles

BoxDiff Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

CO-Net Learning Multiple Point Cloud Tasks at Once with A

DiffFit Unlocking Transferability of Large Diffusion Models via Simple Parameter-efficient

GAIT Generating Aesthetic Indoor Tours with Deep Reinforcement Learning

HollowNeRF Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation

Most Important Person-Guided Dual-Branch Cross-Patch Attention for Group Affect Recognition

MV-Map Offboard HD-Map Generation with Multi-view Consistency

NaviNeRF NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation

Nonrigid Object Contact Estimation With Regional Unwrapping Transform

OFVL-MS Once for Visual Localization across Multiple Indoor Scenes

Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection

S3IM Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural

SparseFusion Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching

HDG-ODE A Hierarchical Continuous-Time Model for Human Pose Forecasting

CL-MVSNet Unsupervised Multi-View Stereo with Dual-Level Contrastive Learning

Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation

Get3DHuman Lifting StyleGAN-Human into a 3D Generative Model Using Pixel-Align

Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting

Narrator Towards Natural Control of Human-Scene Interaction Generation via Relationshi

NSF Neural Surface Fields for Human Modeling from Monocular Depth

Variational Causal Inference Network for Explanatory Visual Question Answering

ActFormer A GAN-based Transformer towards General Action-Conditioned 3D Human Motion

Animal3D A Comprehensive Dataset of 3D Animal Pose and Shape

Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation

Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

Backpropagation Path Search On Adversarial Transferability

Bridging Vision and Language Encoders Parameter-Efficient Tuning for Referring Imag

C2F2NeUS Cascade Cost Frustum Fusion for High Fidelity and Generalizabl

CiT Curation in Training for Effective Vision-Language Data

ClothPose A Real-world Benchmark for Visual Analysis of Garment Pos

DeepChange A Long-Term Person Re-Identification Benchmark with Clothes Change

Deformable Model-Driven Neural Rendering for High-Fidelity 3D Reconstruction of Human

Downscaled Representation Matters Improving Image Rescaling with Collaborative Downscaled Images

Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural

EgoPCA A New Framework for Egocentric Hand-Object Interaction Understanding

EQ-Net Elastic Quantization Neural Networks

Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance

FDViT Improve the Hierarchical Architecture of Vision Transformer

FrozenRecon Pose-free 3D Scene Reconstruction with Frozen Depth Models

Generalized Few-Shot Point Cloud Segmentation via Geometric Words

Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation

Human-centric Scene Understanding for 3D Large-scale Scenarios

Integrating Boxes and Masks A Multi-Object Framework for Unified Visual

InterDiff Generating 3D Human-Object Interactions with Physics-Informed Diffusion

Joint-Relation Transformer for Multi-Person Motion Prediction

Learning Image Harmonization in the Linear Color Space

MasQCLIP for Open-Vocabulary Universal Image Segmentation

MBPTrack Improving 3D Point Cloud Tracking with Memory Networks an

MonoNeRD NeRF-like Representations for Monocular 3D Object Detection

Multi-Task Learning with Knowledge Distillation for Dense Prediction

Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency fo

NeRF-Det Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

ParCNetV2 Oversized Kernel with Enhanced Attention

ReNeRF Relightable Neural Radiance Fields with Nearfield Lighting

RIGID Recurrent GAN Inversion and Editing of Real Face Videos

Self-Calibrated Cross Attention Network for Few-Shot Segmentation

StylerDALLE Language-Guided Style Transfer Using a Vector-Quantized Tokenizer o

TALL Thumbnail Layout for Deepfake Video Detection

Versatile Diffusion Text Images and Variations All in One Diffusion

WaveNeRF Wavelet-based Generalizable Neural Radiance Fields

FCCNs Fully Complex-valued Convolutional Networks using Complex-valued Color Model an

2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision

3DHumanGAN 3D-Aware Human Image Generation with 3D Pose Mapping

AIDE A Vision-Driven Multi-View Multi-Modal Multi-Tasking Dataset for Assistive Driving

ALIP Adaptive Language-Image Pre-Training with Synthetic Caption

ASM Adaptive Skinning Model for High-Quality 3D Face Modeling

Attentive Mask CLIP

Beyond the Limitation of Monocular 3D Detector via Knowledge Distillation

BoxSnake Polygonal Instance Segmentation with Box Supervision

Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection

Computationally-Efficient Neural Image Compression with Shallow Decoders

Concept-wise Fine-tuning Matters in Preventing Negative Transfer

Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Imag

Cross-view Semantic Alignment for Livestreaming Product Recognition

D-IF Uncertainty-aware Human Digitization via Implicit Distribution Field

Data Augmented Flatness-aware Gradient Projection for Continual Learning

Designing Phase Masks for Under-Display Cameras

Diffusion Model as Representation Learner

Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation

EmoSet A Large-scale Visual Emotion Dataset with Rich Attributes

Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization

Event Camera Data Pre-training

FedPD Federated Open Set Recognition with Parameter Disentanglement

Foreground-Background Distribution Modeling Transformer for Visual Object Tracking

From Knowledge Distillation to Self-Knowledge Distillation A Unified Approach with

GEDepth Ground Embedding for Monocular Depth Estimation

Generating Visual Scenes from Touch

GraphEcho Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation

Grounding 3D Object Affordance from 2D Interactions in Images

HSE Hybrid Species Embedding for Deep Metric Learning

Implicit Neural Representation for Cooperative Low-light Image Enhancement

Innovating Real Fisheye Image Correction with Dual Diffusion Architecture

Label-Guided Knowledge Distillation for Continual Semantic Segmentation on 2D Images

LAC - Latent Action Composition for Skeleton-based Action Segmentation

Large-Scale Person Detection and Localization Using Overhead Fisheye Cameras

LAW-Diffusion Complex Scene Generation by Diffusion with Layouts

Learning Trajectory-Word Alignments for Video-Language Tasks

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

MRM Masked Relation Modeling for Medical Image Pre-Training with Genetics

Multi-Label Knowledge Distillation

Neural Interactive Keypoint Detection

One-Shot Generative Domain Adaptation

Out-of-Domain GAN Inversion via Invertibility Decomposition for Photo-Realistic Human Fac

PanFlowNet A Flow-Based Deep Network for Pan-Sharpening

Parametric Depth Based Feature Representation Learning for Object Detection an

PPR Physically Plausible Reconstruction from Monocular Videos

Prototypical Mixing and Retrieval-Based Refinement for Label Noise-Resistant Image Retrieval

SEFD Learning to Distill Complex Pose and Occlusion

Self-Ordering Point Clouds

Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

SILT Shadow-Aware Iterative Label Tuning for Learning to Detect Shadows

Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations

StyleGANEX StyleGAN-Based Manipulation Beyond Cropped Aligned Faces

SynBody Synthetic Dataset with Layered Human Models for 3D Human

Towards Grand Unified Representation Learning for Unsupervised Visible-Infrared Person Re-Identification

UrbanGIRAFFE Representing Urban Scenes as Compositional Generative Neural Feature Fields

Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation

Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer

Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis

Active Neural Mapping

Cross Modal Transformer Towards Fast and Robust 3D Object Detection

Deep Homography Mixture for Single Image Rolling Shutter Correction

Feature Prediction Diffusion Model for Video Anomaly Detection

Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning

INT2 Interactive Trajectory Prediction at Intersections

Learning Concise and Descriptive Attributes for Visual Recognition

Learning with Diversity Self-Expanded Equalization for Better Generalized Deep Metric

SkeletonMAE Graph-based Masked Autoencoder for Skeleton Sequence Pre-training

UCF Uncovering Common Features for Generalizable Deepfake Detection

UnLoc A Unified Framework for Video Localization Tasks

Focus the Discrepancy Intra- and Inter-Correlation Learning for Image Anomaly

Generalized Lightness Adaptation with Channel Selective Normalization

Inherent Redundancy in Spiking Neural Networks

NDC-Scene Boost Monocular 3D Semantic Scene Completion in Normalized Devic

Sign Language Translation with Iterative Prototype

Sparse Point Guided 3D Lane Detection

Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical

MAMo Leveraging Memory and Attention for Monocular Video Depth Estimation

TextManiA Enriching Visual Feature by Text-driven Manifold Augmentation

FACTS First Amplify Correlations and Then Slice to Discover Bias

Rapid Network Adaptation Learning to Adapt Neural Networks Using Test-Tim

ScanNet++ A High-Fidelity Dataset of 3D Indoor Scenes

Adverse Weather Removal with Codebook Priors

Bootstrap Motion Forecasting With Self-Consistent Constraints

Cascade-DETR Delving into High-Quality Universal Object Detection

Constraining Depth Map Geometry for Multi-View Stereo A Dual-Depth Approach

Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

Efficient Transformer-based 3D Object Detection with Dynamic Token Halting

FeatureNeRF Learning Generalizable NeRFs by Distilling Foundation Models

HiTeA Hierarchical Temporal-Aware Video-Language Pre-training

IntrinsicNeRF Learning Intrinsic Neural Radiance Fields for Editable Novel View

Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction an

Recovering a Molecules 3D Dynamics from Liquid-phase Electron Microscopy Movies

Self-Evolved Dynamic Expansion Model for Task-Free Continual Learning

TaskExpert Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts

Wasserstein Expansible Variational Autoencoder for Discriminative and Generative Continual Learning

Diverse Inpainting and Editing with GAN Inversion

CTVIS Consistent Training for Online Video Instance Segmentation

PARF Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis

CrossMatch Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction

MetaF2N Blind Image Super-Resolution by Learning Efficient Model Adaptation from

Metric3D Towards Zero-shot Metric 3D Prediction from A Single Image

Canonical Factors for Hybrid Neural Fields

Diff-Retinex Rethinking Low-light Image Enhancement with A Generative Diffusion Model

Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Clou

SCANet Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval

Video Object Segmentation-aware Video Frame Interpolation

DynamicISP Dynamically Controlled Image Signal Processor for Image Recognition

Co-Evolution of Pose and Mesh for 3D Human Body Estimation

Towards Universal Image Embeddings A Large-Scale Dataset and Challenge fo

4D Myocardium Reconstruction with Decoupled Motion and Shape Model

Isomer Isomerous Transformer for Zero-shot Video Object Segmentation

Late Stopping Avoiding Confidently Learning from Mislabeled Examples

Make Encoder Great Again in 3D GAN Inversion through Geometry

PhysDiff Physics-Guided Human Motion Diffusion Model

PointMBF A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point

RLIPv2 Fast Scaling of Relational Language-Image Pre-Training

SemARFlow Injecting Semantics into Unsupervised Optical Flow Estimation for Autonomous

Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning

HybridAugment++ Unified Frequency Spectra Perturbations for Model Robustness

Achievement-Based Training Progress Balancing for Multi-Task Learning

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

EGformer Equirectangular Geometry-biased Transformer for 360 Depth Estimation

SPANet Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation

Aggregating Feature Point Cloud for Depth Completion

Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer

Both Diverse and Realism Matter Physical Attribute and Style Alignment

Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS

Enhancing Non-line-of-sight Imaging via Learnable Inverse Kernel and Attention Mechanisms

FreeDoM Training-Free Energy-Guided Conditional Diffusion Model

GLA-GCN Global-local Adaptive Graph Convolutional Network for 3D Human Pos

HAL3D Hierarchical Active Learning for Fine-Grained 3D Part Labeling

ICD-Face Intra-class Compactness Distillation for Face Recognition

LaPE Layer-adaptive Position Embedding for Vision Transformers with Independent Lay

Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models

Modality Unifying Network for Visible-Infrared Person Re-Identification

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Texture Generation on 3D Meshes with Point-UV Diffusion

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only

Video State-Changing Object Segmentation

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an

Boosting Novel Category Discovery Over Domains with Soft Contrastive Learning

The Unreasonable Effectiveness of Large Language-Vision Models for Source-Free Video

Stochastic Segmentation with Conditional Categorical Diffusion Models

Global Balanced Experts for Federated Long-Tailed Learning

MPCViT Searching for Accurate and Efficient MPC-Friendly Vision Transformer with

Parameterized Cost Volume for Stereo Matching

HopFIR Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human

Masked Autoencoders are Efficient Class Incremental Learners

PEANUT Predicting and Navigating to Unseen Targets

Sigmoid Loss for Language Image Pre-Training

SLAN Self-Locator Aided Network for Vision-Language Understanding

SOAR Scene-debiasing Open-set Action Recognition

Stabilizing Visual Reinforcement Learning via Asymmetric Interactive Cooperation

Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning

3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pos

Accurate 3D Face Reconstruction with Facial Component Tokens

Adding Conditional Control to Text-to-Image Diffusion Models

A Dynamic Dual-Processing Object Detection Framework Inspired by the Brains

A Simple Framework for Open-Vocabulary Segmentation and Detection

A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection

Black-Box Unsupervised Domain Adaptation with Bi-Directional Atkinson-Shiffrin Memory

Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body

Boosting Single Image Super-Resolution via Partial Channel Shifting

C2ST Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition

CoinSeg Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Continual Zero-Shot Learning through Semantically Guided Generative Random Walks

Decoupled DETR Spatially Disentangling Localization and Classification for Improved End-to-En

DeformToon3D Deformable Neural Radiance Fields for 3D Toonification

DETA Denoised Task Adaptation for Few-Shot Learning

DiffCloth Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal

DMNet Delaunay Meshing Network for 3D Shape Representation

DomainAdaptor A Novel Approach to Test-time Adaptation

DVIS Decoupled Video Instance Segmentation Framework

ESSAformer Efficient Transformer for Hyperspectral Image Super-resolution

Exploring Predicate Visual Context in Detecting of Human-Object Interactions

Exploring Temporal Concurrency for Video-Language Representation Learning

Fcaformer Forward Cross Attention in Hybrid Vision Transformer

Flatness-Aware Minimization for Domain Generalization

Foreground Object Search by Distilling Composite Image Feature

Generalizing Event-Based Motion Deblurring in Real-World Scenarios

Generative Gradient Inversion via Over-Parameterized Networks in Federated Learning

GETAvatar Generative Textured Meshes for Animatable Human Avatars

GeT Generative Target Structure Debiasing for Domain Adaptation

GO-SLAM Global Optimization for Consistent 3D Instant Reconstruction

GPFL Simultaneously Learning Global and Personalized Feature Information for Personaliz

Helping Hands An Object-Aware Ego-Centric Video Recognition Model

ITI-GEN Inclusive Text-to-Image Generation

LayoutDiffusion Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models

Learning in Imperfect Environment Multi-Label Classification with Long-Tailed Distribution an

Learning Neural Implicit Surfaces with Object-Aware Radiance Fields

Learning Rain Location Prior for Nighttime Deraining

Learning Spatial-context-aware Global Visual Feature Representation for Instance Image Retrieval

LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Lightweight Image Super-Resolution with Superpixel Token Interaction

LMR A Large-Scale Multi-Reference Dataset for Reference-Based Super-Resolution

MAGI Multi-Annotated Explanation-Guided Learning

MAP Towards Balanced Generalization of IID and OOD through Model-Agnostic

Meta-ZSDETR Zero-shot DETR with Meta-learning

Minimum Latency Deep Online Video Stabilization

MonoDETR Depth-guided Transformer for Monocular 3D Object Detection

MoreauGrad Sparse and Robust Interpretation of Neural Networks via Moreau

Multi-Event Video-Text Retrieval

Multi3DRefer Grounding Text Description to Multiple 3D Objects

Multiple Planar Object Tracking

NeILF++ Inter-Reflectable Light Fields for Geometry and Material Estimation

NeMF Inverse Volume Rendering with Neural Microflake Field

OccFormer Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

OCHID-Fi Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision

Perceptual Artifacts Localization for Image Synthesis Tasks

Pose-Free Neural Radiance Fields via Implicit Pose Regularization

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

QD-BEV Quantization-aware View-guided Distillation for Multi-view 3D Object Detection

RankMatch Fostering Confidence and Consistency in Learning with Noisy Labels

Reconciling Object-Level and Global-Level Objectives for Long-Tail Detection

ReMoDiffuse Retrieval-Augmented Motion Diffusion Model

Rethinking Mobile Block for Efficient Attention-based Models

Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation

Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering

Robust Mixture-of-Expert Training for Convolutional Neural Networks

SA-BEV Generating Semantic-Aware Birds-Eye-View Feature for Multi-view 3D Object Detection

SAL-ViT Towards Latency Efficient Private Inference on ViT using Selectiv

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence fo

ShiftNAS Improving One-shot NAS via Probability Shift

Single Depth-image 3D Reflection Symmetry and Shape Prediction

SLCA Slow Learner with Classifier Alignment for Continual Learning on

Surface Extraction from Neural Unsigned Distance Fields

TARGET Federated Class-Continual Learning via Exemplar-Free Distillation

Tiny Updater Towards Efficient Neural Network-Driven Software Updating

Towards Effective Instance Discrimination Contrastive Loss for Unsupervised Domain Adaptation

Towards Fairness-aware Adversarial Network Pruning

Towards General Low-Light Raw Noise Synthesis and Modeling

Toward Multi-Granularity Decision-Making Explicit Visual Reasoning with Hierarchical Knowledg

Toward Unsupervised Realistic Visual Question Answering

TrajPAC Towards Robustness Verification of Pedestrian Trajectory Prediction Models

Uni-3D A Universal Model for Panoptic 3D Scene Reconstruction

Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model

Weakly-Supervised Text-Driven Contrastive Learning for Facial Behavior Understanding

When Noisy Labels Meet Long Tail Dilemmas A Representation Calibration

NeRFrac Neural Radiance Fields through Refractive Surface

Ada3D Exploiting the Spatial Redundancy with Adaptive Inference fo

Bring Clipart to Life

Class Prior-Free Positive-Unlabeled Learning with Taylor Variational Loss for Hyperspectral

Cumulative Spatial Knowledge Distillation for Vision Transformers

DDFM Denoising Diffusion Model for Multi-Modality Image Fusion

Divide and Conquer 3D Point Cloud Instance Segmentation With Point-Wis

DOT A Distillation-Oriented Trainer

Fast Adversarial Training with Smooth Convergence

Fast Full-frame Video Stabilization with Iterative Optimization

Fully Attentional Networks with Self-emerging Token Labeling

GasMono Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes

Generative Prompt Model for Weakly Supervised Object Localization

Human from Blur Human Pose Tracking from Blurry Images

Incremental Generalized Category Discovery

Learning Pseudo-Relations for Cross-domain Semantic Segmentation

Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery

Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation

MagicFusion Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Masked Retraining Teacher-Student Framework for Domain Adaptive Object Detection

MDCS More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition

Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action

MVPSNet Fast Generalizable Multi-view Photometric Stereo

Object-Centric Multiple Object Tracking

RecursiveDet End-to-End Region-Based Recursive Object Detection

Spherical Space Feature Decomposition for Guided Depth Map Super-Resolution

Synthesizing Diverse Human Motions in 3D Indoor Scenes

TextPSG Panoptic Scene Graph Generation from Textual Descriptions

Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond

Unified Visual Relationship Detection with Vision and Language Models

Unleashing Text-to-Image Diffusion Models for Visual Perception

Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models

CIRI Curricular Inactivation for Residue-aware One-shot Video Inpainting

COOP Decoupling and Coupling of Whole-Body Grasping Pose Generation

Distributed Bundle Adjustment with Block-Based Sparse Matrix Compression for Su

Empowering Low-Light Image Enhancer through Customized Learnable Priors

HaMuCo Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning

Less is More Focus Attention for Efficient DETR

Look at the Neighbor Distortion-aware Unsupervised Domain Adaptation for Panoramic

MRN Multiplexed Routing Network for Incremental Multilingual Text Recognition

Multi-task View Synthesis with Neural Radiance Fields

Online Clustered Codebook

PointOdyssey A Large-Scale Synthetic Dataset for Long-Term Point Tracking

Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models

Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling

Regularized Mask Tuning Uncovering Hidden Knowledge in Pre-Trained Vision-Language Models

Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic

SimMatchV2 Semi-Supervised Learning with Graph Consistency

Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation

LivelySpeaker Towards Semantic-Aware Co-Speech Gesture Generation

3D Implicit Transporter for Temporally Consistent Keypoint Discovery

AttT2M Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Contrastive Learning Relies More on Spatial Inductive Bias Than Supervis

Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors

MMVP Motion-Matrix-Based Video Prediction

3D Neural Embedding Likelihood Probabilistic Inverse Graphics for Robust 6D

BT2 Backward-compatible Training with Basis Transformation

ClothesNet An Information-Rich 3D Garment Model Repository with Simulated Clothes

Communication-efficient Federated Learning with Single-Step Synthetic Features Compressor for Fast

Cross-Modal Translation and Alignment for Survival Analysis

Dataset Quantization

Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting fo

Downstream-agnostic Adversarial Examples

DR-Tune Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization

FF Attack Adversarial Attack against Multiple Object Trackers by Inducing

Gloss-Free Sign Language Translation Improving from Visual-Language Pretraining

HiLo Exploiting High Low Frequency Relations for Unbiased Panoptic Scen

Homeomorphism Alignment for Unsupervised Domain Adaptation

ImbSAM A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition

Improving Lens Flare Removal with General-Purpose Pipeline and Multiple Light

Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic

Learning a More Continuous Zero Level Set in Unsigned Distanc

Learning Correction Filter via Degradation-Adaptive Regression for Blind Single Imag

MatrixVT Efficient Multi-Camera to BEV Transformation for 3D Perception

MSRA-SR Image Super-resolution Transformer with Multi-scale Shared Representation Acquisition

Pre-Training-Free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning

ProPainter Improving Propagation and Transformer for Video Inpainting

Rethinking Pose Estimation in Crowds Overcoming the Detection Information Bottleneck

SAMPLING Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis

SparseMAE Sparse Training Meets Masked Autoencoders

SRFormer Permuted Self-Attention for Single Image Super-Resolution

Two-in-One Depth Bridging the Gap Between Monocular and Binocular Self-Supervis

UniFace Unified Cross-Entropy Loss for Deep Face Recognition

Unsupervised Domain Adaptive Detection with Network Stability Analysis

XNet Wavelet-Based Low and High Frequency Fusion Networks for Fully-

MAS Towards Resource-Efficient Federated Multiple-Task Learning

Video Background Music Generation Dataset Method and Evaluation

3D-VisTA Pre-trained Transformer for 3D Vision and Text Alignment

4D Panoptic Segmentation as Invariant and Equivariant Field Prediction

All-to-Key Attention for Arbitrary Style Transfer

A Good Student is Cooperative and Reliable CNN-Transformer Collaborative Learning

BiFF Bi-level Future Fusion with Polyline-based Coordinate for Interactive Trajectory

Boosting Adversarial Transferability via Gradient Relevance Attack

Coarse-to-Fine Learning Compact Discriminative Representation for Single-Stage Image Retrieval

Cross-Modal Orthogonal High-Rank Augmentation for RGB-Event Transformer-Trackers

CTP Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology

EgoObjects A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding

Enhancing Fine-Tuning Based Backdoor Defense with Sharpness-Aware Minimization

Exploring Temporal Frequency Spectrum in Deep Video Deblurring

Frequency-aware GAN for Adversarial Manipulation Generation

H3WB Human3.6M 3D WholeBody Dataset and Benchmark

Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning

Learning Gabor Texture Features for Fine-Grained Recognition

LinkGAN Linking GAN Latents to Pixels for Controllable Image Synthesis

MapPrior Birds-Eye View Map Layout Estimation with Generative Models

Modeling the Relative Visual Tempo for Self-supervised Skeleton-based Action Recognition

MotionBERT A Unified Perspective on Learning Human Motion Representations

Multi-Label Self-Supervised Learning with Scene Images

Not All Features Matter Enhancing Few-shot CLIP with Adaptive Prio

PointCLIP V2 Prompting CLIP and GPT for Powerful 3D Open-worl

Prompt-aligned Gradient for Prompt Tuning

Rethinking Data Distillation Do Not Overlook Calibration

Scene-Aware Label Graph Learning for Multi-Label Image Classification

SegPrompt Boosting Open-World Segmentation via Category-Level Prompt Learning

Self-Organizing Pathway Expansion for Non-Exemplar Class-Incremental Learning

SVDFormer Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator

The Victim and The Beneficiary Exploiting a Poisoned Model to

UMIFormer Mining the Correlations between Similar Tokens for Multi-View 3D

Universal Domain Adaptation via Compressive Attention Matching

Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding

SC3K Self-supervised and Coherent 3D Keypoints Estimation from Rotated Noisy

DETRs with Collaborative Hybrid Assignments Training

Temporal Enhanced Training of Multi-view 3D Object Detector via Historical

RePolyWorld - A Graph Neural Network for Polygonal Scene Parsing

Adaptive Calibrator Ensemble Navigating Test Set Difficulty in Out-of-Distribution Scenarios

Discrepant and Multi-Instance Proxies for Unsupervised Person Re-Identification

Iterative Denoiser and Noise Estimator for Self-Supervised Image Denoising

RawHDR High Dynamic Range Image Reconstruction from a Single Raw

From Chaos Comes Order Ordering Event Representations for Object Recognition

DG3D Generating High Quality 3D Textured Shapes by Learning to

Reconstructing Interacting Hands with Interaction Prior from Monocular Images

LaRS A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark

Category: ICCV Guide