Unnoticed Yet Effective: A Hybrid Physical Camouflage Framework Against DNNs and Human Perception

While adversarial attacks can effectively deceive deep neural networks, their real-world applicability is often limited by complex and conspicuous patterns that reveal their attack intent to human observers. To overcome this limitation, we propose …

MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the …

SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback

RAG systems consist of multiple modules that work together. However, these modules are usually trained separately. We argue that a system like RAG that incorporates multiple modules should be jointly optimized to achieve optimal performance. To …

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models

Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. …

GIST: Improving Parameter Efficient Fine Tuning via Knowledge Interaction

The Parameter-Efficient Fine-Tuning (PEFT) method, which adjusts or introduces fewer trainable parameters to calibrate pre-trained models on downstream tasks, has become a recent research interest. However, existing PEFT methods within the …

TTE: Two Tokens are Enough to Improve Parameter-Efficient Tuning

Existing fine-tuning paradigms are predominantly characterized by Full Parameter Tuning (FPT) and Parameter-Efficient Tuning (PET). FPT fine-tunes all parameters of a pre-trained model on downstream tasks, whereas PET freezes the pretrained model and …

From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis

Understanding student behavior in educational settings is critical in improving both the quality of pedagogy and the level of student engagement. While various AI-based models exist for classroom analysis, they tend to specialize in limited tasks and …

LAMM: Label Alignment for Multi-Modal Prompt Learning

With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration …