publications | Liangli Zhen

2025

Cross-modal obfuscation for jailbreak attacks on large vision-language models

Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen*, and Xiaohua Xu*

arXiv preprint arXiv:2506.16760, 2025

Abs PDF

Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs’ cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO’s effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
A survey and evaluation of adversarial attacks in object detection

Khoi Nguyen Tiet Nguyen*, Wenyu Zhang*, Kangkang Lu, Yuhuan Wu, Xingjian Zheng, Hui Li Tan, and Liangli Zhen*

IEEE Transactions on Neural Networks and Learning Systems, 2025

Abs PDF

Deep learning models achieve remarkable accuracy in computer vision tasks, yet remain vulnerable to adversarial examples–carefully crafted perturbations to input images that can deceive these models into making confident but incorrect predictions. This vulnerability pose significant risks in high-stakes applications such as autonomous vehicles, security surveillance, and safety-critical inspection systems. While the existing literature extensively covers adversarial attacks in image classification, comprehensive analyses of such attacks on object detection systems remain limited. This paper presents a novel taxonomic framework for categorizing adversarial attacks specific to object detection architectures, synthesizes existing robustness metrics, and provides a comprehensive empirical evaluation of state-of-the-art attack methodologies on popular object detection models, including both traditional detectors and modern detectors with vision-language pretraining. Through rigorous analysis of open-source attack implementations and their effectiveness across diverse detection architectures, we derive key insights into attack characteristics. Furthermore, we delineate critical research gaps and emerging challenges to guide future investigations in securing object detection systems against adversarial threats. Our findings establish a foundation for developing more robust detection models while highlighting the urgent need for standardized evaluation protocols in this rapidly evolving domain.
Continuous disentangled joint space learning for domain generalization

Zizhou Wang, Yan Wang, Yangqin Feng, Jiawei Du, Yong Liu, Rick Siow Mong Goh, and Liangli Zhen*

IEEE Transactions on Neural Networks and Learning Systems, 2025

Abs PDF

Domain generalization aims to learn a model on one or multiple observed source domains that can generalize to unseen target test domains. Previous approaches have focused on extracting domain-invariant information from multiple source domains, but domain-specific information is also closely tied to semantics in individual domains and is not well-suited for generalization to the target domain. In this paper, we propose a novel domain generalization method called Continuous Disentangled Joint Space Learning (CJSL), which leverages both domain-invariant and domain-specific information for more effective domain generalization. The key idea behind CJSL is to formulate and learn a continuous joint space for domain-specific representations from source domains through iterative feature disentanglement. This learned continuous joint space can then be used to simulate domain-specific representations for test samples from a mixture of multiple domains via Monte Carlo sampling during the inference stage. Unlike existing approaches, which exploit domain-invariant feature vectors only or aim to learn a universal domain-specific feature extractor, we simulate domain-specific representations via sampling the latent vectors in the learned continuous joint space for the test sample to fully utilize the power of multiple domain-specific classifiers for the robust prediction. Empirical results demonstrate that CJSL outperforms 19 state-of-the-art methods on seven benchmarks, indicating the effectiveness of our proposed method.
Generative image reconstruction from gradients

Ekanut Sotthiwat, Liangli Zhen*, Chi Zhang, Zengxiang Li, and Rick Goh

IEEE Transactions on Neural Networks and Learning Systems, 2025

Abs PDF

In this paper, we propose a method, Generative Image Reconstruction from Gradients (GIRG), for recovering training images from gradients in a federated learning setting, where privacy is preserved by sharing model weights and gradients rather than raw training data. Previous studies have shown the potential for revealing clients’ private information or even pixel-level recovery of training images from shared gradients. However, existing methods are limited to low-resolution images and small batch sizes or require prior knowledge about the client data. GIRG utilizes a conditional generative model to reconstruct training images and their corresponding labels from the shared gradients. Unlike previous generative model-based methods, GIRG does not require prior knowledge of the training data. Furthermore, GIRG optimizes the weights of the conditional generative model to generate highly accurate “dummy” images instead of optimizing the input vectors of the generative model. Comprehensive empirical results show that GIRG is able to recover high-resolution images with large batch sizes and can even recover images from the aggregation of gradients from multiple participants. These results reveal the vulnerability of current federated learning practices and call for immediate efforts to prevent inversion attacks in gradient-sharing-based collaborative training.
Low-resolution self-attention for semantic segmentation

Yu-Huan Wu, Shi-Chen Zhang, Yun Liu, Le Zhang, Xin Zhan, Daquan Zhou, Jiashi Feng, Ming-Ming Cheng, and Liangli Zhen

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Abs PDF Code

Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high-resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost, \ie, FLOPs. Our approach involves computing self-attention in a fixed low-resolution space, regardless of the input image’s resolution, with additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models.
Deep evidential hashing for trustworthy cross-modal retrieval

Yuan Li, Liangli Zhen, Yuan Sun, Dezhong Peng, Xi Peng, and Peng Hu

AAAI, 2025

Abs PDF

Cross-modal hashing provides an efficient solution for retrieval tasks across various modalities, such as images and text. However, most existing methods are deterministic models, which overlook the reliability associated with the retrieved results. This omission renders them unreliable for determining matches between data pairs based solely on Hamming distance. To bridge the gap, in this paper, we propose a novel method called Deep Evidential Cross-modal Hashing (DECH). This method equips hashing models with the ability to quantify the reliability level of the association between a query sample and each corresponding retrieved sample, bringing a new dimension of reliability to the cross-modal retrieval process. To achieve this, our method addresses two key challenges: i) To leverage evidential theory in guiding the model to learn hash codes, we design a novel evidence acquisition module to collect evidence and place the evidence captured by hash codes on a Beta distribution to derive a binomial opinion. Unlike existing evidential learning approaches that rely on classifiers, our method collects evidence directly through hash codes. ii) To tackle the task-oriented challenge, we first introduce a method to update the derived binomial opinion, allowing it to present the uncertainty caused by conflicting evidence. Following this manner, we present a strategy to precisely evaluate the reliability level of retrieved results, culminating in performance improvement. We validate the efficacy of our DECH through extensive experimentation on four benchmark datasets. The experimental results demonstrate our superior performance compared to 12 state-of-the-art methods.

2024

Geometric correspondence-based multimodal learning for ophthalmic image analysis

Yan Wang, Liangli Zhen*, Tien-En Tan, Huazhu Fu, Yangqin Feng, Zizhou Wang, Xinxing Xu, Rick Siow Mong Goh, Yipin Ng, Claire Calhoun, Gavin SW Tan, Jennifer K Sun, Yong Liu, and Daniel SW Ting

IEEE Transactions on Medical Imaging, 2024

Abs PDF

Color fundus photography (CFP) and Optical coherence tomography (OCT) images are two of the most widely used modalities in the clinical diagnosis and management of retinal diseases. Despite the widespread use of multimodal imaging in clinical practice, few methods for automated diagnosis of eye diseases utilize correlated and complementary information from multiple modalities effectively. This paper explores how to leverage the information from CFP and OCT images to improve the automated diagnosis of retinal diseases. We propose a novel multimodal learning method, named geometric correspondence-based multimodal learning network (GeCoM-Net), to achieve the fusion of CFP and OCT images. Specifically, inspired by clinical observations, we consider the geometric correspondence between the OCT slice and the CFP region to learn the correlated features of the two modalities for robust fusion. Furthermore, we design a new feature selection strategy to extract discriminative OCT representations by automatically selecting the important feature maps from OCT slices. Unlike the existing multimodal learning methods, GeCoM-Net is the first method that formulates the geometric relationships between the OCT slice and the corresponding region of the CFP image explicitly for CFP and OCT fusion. Experiments have been conducted on a large-scale private dataset and a publicly available dataset to evaluate the effectiveness of GeCoM-Net for diagnosing diabetic macular edema (DME), impaired visual acuity (VA) and glaucoma. The empirical results show that our method outperforms the current state-of-the-art multimodal learning methods by improving the AUROC score 0.4%, 1.9% and 2.9% for DME, VA and glaucoma detection, respectively.
Deep supervised multi-view learning with graph priors

Peng Hu, Liangli Zhen*, Xi Peng, Hongyuan Zhu, Jie Lin, Xu Wang, and Dezhong Peng

IEEE Transactions on Image Processing, 2024

Abs PDF Code

This paper presents a novel method for supervised multi-view representation learning, which projects multiple views into a latent common space while preserving the discrimination and intrinsic structure of each view. Specifically, an \textitapriori discriminant similarity graph is first constructed based on labels and pairwise relationships of multi-view inputs. Then, view-specific networks progressively map inputs to common representations whose affinity approximates the constructed graph. To achieve graph consistency, discrimination, and cross-view invariance, the similarity graph is enforced to meet the following constraints: 1) pairwise relationship should be consistent between the input space and common space for each view; 2) within-class similarity is larger than any between-class similarity for each view; 3) the inter-view samples from the same (or different) classes are mutually similar (or dissimilar). Consequently, the intrinsic structure and discrimination are preserved in the latent common space using an apriori approximation schema. Moreover, we present a sampling strategy to approach a sub-graph sampled from the whole similarity structure instead of approximating the graph of the whole dataset explicitly, thus benefiting lower space complexity and the capability of handling large-scale multi-view datasets. Extensive experiments show the promising performance of our method on five datasets by comparing it with 18 state-of-the-art methods.
MedNAS: Multi-scale training-free neural architecture search for medical image analysis

Yan Wang, Liangli Zhen*, Jianwei Zhang, Miqing Li, Lei Zhang, Zizhou Wang, Yangqin Feng, Yu Xue, Xiao Wang, Zheng Chen, Tao Luo, Rick Siow Mong Goh, and Yong Liu

IEEE Transactions on Evolutionary Computation, 2024

Abs PDF

Deep neural networks have demonstrated impressive results in medical image analysis, but designing suitable architectures for each specific task is expertise-dependent and time-consuming. Neural architecture search (NAS) offers an effective means of discovering architectures. It has been highly successful in numerous applications, particularly in natural image classification. Yet, medical images possess unique characteristics, such as small regions and a wide variety of lesion sizes, that differentiate them from natural images. Furthermore, most current NAS methods struggle with high computational costs, especially when dealing with high-resolution image datasets. In this paper, we present a novel evolutionary neural architecture search method called Multi-Scale Training-Free Neural Architecture Search to address these challenges. Specifically, to accommodate the broad range of lesion region sizes in disease diagnosis, we develop a new reduction cell search space that enables the search algorithm to explicitly identify the optimal scale combination for multi-scale feature extraction. To overcome the issue of high computational costs, we utilize training-free indicators as performance measures for candidate architectures, which allows us to search for the optimal architecture more efficiently. More specifically, by considering the capability and simplicity of various networks, we formulate a multi-objective optimization problem that involves two training-free indicators and model complexity for candidate architectures. Extensive experiments on a large medical image benchmark and a publicly available breast cancer detection dataset are conducted. The empirical results demonstrate that our MSTF-NAS outperforms both human-designed architectures and current state-of-the-art NAS algorithms on both datasets, indicating the effectiveness of our proposed method.
Neural architecture search with progressive evaluation and sub-population preservation

Yu Xue, Jiajie Zha, Danilo Pelusi, Peng Chen, Tao Luo, Liangli Zhen, Yan Wang, and Mohamed Wahib

IEEE Transactions on Evolutionary Computation, 2024

Abs

Neural architecture search (NAS) is an effective approach for automating the design of deep neural networks. Evolutionary computation (EC) is commonly used in NAS due to its global optimization capability. However, the evaluation phase of architecture candidates in EC-based NAS is compute-intensive, limiting its application for many real-world problems. To overcome this challenge, we propose a novel progressive evaluation strategy for the evaluation phase in convolutional neural network architecture search, in which the number of training epochs of network individuals is progressively increased. In addition, a sub-population preservation strategy is proposed to preserve medium and large models to avoid prematurely discarding networks that may not perform well in the early stages but have the potential to excel with further optimization. Our proposed algorithm reduces the computational cost of the evaluation phase and promotes population diversity and fairness by preserving promising networks based on their distribution. We evaluate the proposed progressive evaluation and sub-population preservation of neural architecture search (PEPNAS) algorithm on the CIFAR10, CIFAR100, and ImageNet benchmark datasets, and compare it with 36 state-of-the-art algorithms, including manually designed networks, reinforcement learning (RL) algorithms, gradient-based algorithms, and other EC-based ones. The experimental results demonstrate that PEPNAS effectively identifies networks with competitive accuracy while also markedly improving the efficiency of the search process. For instance, PEPNAS discovers the architecture on CIFAR10 with a low error rate of 2.38% using only 0.7 GPU days. We directly adopt the searched architecture for the image classification on the CIFAR100 and ImageNet datasets, which achieves the top 1 error rates of 16.46% and 26.25%, respectively.
Evolutionary architecture search for generative adversarial networks based on weight sharing

Yu Xue, Weinan Tong, Ferrante Neri, Peng Chen, Tao Luo, Liangli Zhen, and Xiao Wang

IEEE Transactions on Evolutionary Computation, 2024

Abs PDF

Generative adversarial networks (GANs) are a powerful generative technique but frequently face challenges with training stability. Network architecture plays a significant role in determining the final output of GANs, but designing a fine architecture demands extensive domain expertise. This paper aims to address this issue by searching for high-performance generator’s architectures through neural architecture search (NAS). The proposed approach, called evolutionary weight sharing generative adversarial networks (EWSGAN), is based on weight sharing and comprises two steps. First, a supernet of the generator is trained using weight sharing. Second, a multi-objective evolutionary algorithm (MOEA) is employed to identify optimal subnets from the supernet. These subnets inherit weights directly from the supernet for fitness assessment. Two strategies are used to stabilise the training of the generator supernet: a fair single-path sampling strategy and a discarding strategy. Experimental results indicate that the architecture searched by our method achieved a new state-of-the-art among NAS-GAN methods with a Fréchet inception distance (FID) of 9.09 and an inception score (IS) of 8.99 on the CIFAR-10 dataset. It also demonstrates competitive performance on the STL-10 dataset, achieving FID of 21.89 and IS of 10.51.
Global challenge for safe and secure LLMs track 1

Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, and Others

arXiv preprint arXiv:2411.14502, 2024

Abs PDF

This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models. The results of Track 1 highlighted significant advances in jailbreak methods and security testing for LLMs. Competing teams were evaluated based on their models’ resistance to 85 predefined undesirable behaviors, spanning categories such as prejudice, offensive content, misinformation, and promotion of illegal activities. Notably, top-performing teams achieved high attack success rates by introducing innovative techniques, including scenario induction templates that systematically generated context-sensitive prompts and re-suffix attack mechanisms, which adapted suffixes to bypass model filters across multiple LLMs. These techniques demonstrated not only effectiveness in circumventing safeguards but also transferability across different model types, underscoring the adaptability and sophistication of modern adversarial methods. Track 2, scheduled to begin in 2025, will emphasize the development of model-agnostic defense strategies aimed at countering advanced jailbreak attacks. The primary objective of this track is to advance adaptable frameworks that can effectively mitigate adversarial attacks across various LLM architectures.

2023

Generative gradient inversion via over-parameterized convolutional networks in federated learning

Chi Zhang, Xiaoman Zhang, Ekanut Sotthiwat, Yanyu Xu, Ping Liu, Liangli Zhen*, and Yong Liu

ICCV, 2023

Abs PDF Code

Federated learning has gained recognition as a secure approach for safeguarding local private data in collaborative learning. But the advent of gradient inversion research has posed significant challenges to this premise by enabling a third-party to recover groundtruth images via gradients. While prior research has predominantly focused on low-resolution images and small batch sizes, this study highlights the feasibility of reconstructing complex images with high resolutions and large batch sizes. The success of the proposed method is contingent on three crucial components: a convolutional generative model, an over-parameterized network, and a well-designed architecture. Practical experiments demonstrate that the proposed algorithm achieves high-fidelity image recovery, surpassing state-of-the-art competitors that commonly fail in more intricate scenarios. Consequently, our study shows that local participants in a federated learning system are vulnerable to potential data leakage issues. The source code will be available upon publication.
Contrastive domain adaptation with consistency match for automated pneumonia diagnosis

Yangqin Feng, Zizhou Wang, Xinxing Xu, Yan Wang, Huazhu Fu, Shaohua Li, Liangli Zhen, Xiaofeng Lei, Yingnan Cui, Jordan Sim Zheng Ting, and others

Medical Image Analysis, 2023

Abs PDF

Pneumonia can be difficult to diagnose since its symptoms are too variable, and the radiographic signs are often very similar to those seen in other illnesses such as a cold or influenza. Deep neural networks have shown promising performance in automated pneumonia diagnosis using chest X-ray radiography, allowing mass screening and early intervention to reduce the severe cases and death toll. However, they usually require many well-labelled chest X-ray images for training to achieve high diagnostic accuracy. To reduce the need for training data and annotation resources, we propose a novel method called Contrastive Domain Adaptation with Consistency Match (CDACM). It transfers the knowledge from different but relevant datasets to the unlabelled small-size target dataset and improves the semantic quality of the learnt representations. Specifically, we design a conditional domain adversarial network to exploit discriminative information conveyed in the predictions to mitigate the domain gap between the source and target datasets. Furthermore, due to the small scale of the target dataset, we construct a feature cloud for each target sample and leverage contrastive learning to extract more discriminative features. Lastly, we propose adaptive feature cloud expansion to push the decision boundary to a low-density area. Unlike most existing transfer learning methods that aim only to mitigate the domain gap, our method instead simultaneously considers the domain gap and the data deficiency problem of the target dataset. The conditional domain adaptation and the feature cloud generation of our method are learning jointly to extract discriminative features in an end-to-end manner. Besides, the adaptive feature cloud expansion improves the model’s generalisation ability in the target domain. Extensive experiments on pneumonia and COVID-19 diagnosis tasks demonstrate that our method outperforms several state-of-the-art unsupervised domain adaptation approaches, which verifies the effectiveness of CDACM for automated pneumonia diagnosis using chest X-ray imaging.

2022

Deep multimodal transfer learning for cross-modal retrieval

Liangli Zhen, Peng Hu, Xi Peng, Rick Siow Mong Goh, and Joey Tianyi Zhou

IEEE Transactions on Neural Networks and Learning Systems, 2022

Abs PDF

Cross-modal retrieval (CMR) enables flexible retrieval experience across different modalities (e.g., texts versus images), which maximally benefits us from the abundance of multimedia data. Existing deep CMR approaches commonly require a large amount of labeled data for training to achieve high performance. However, it is time-consuming and expensive to annotate the multimedia data manually. Thus, how to transfer valuable knowledge from existing annotated data to new data, especially from the known categories to new categories, becomes attractive for real-world applications. To achieve this end, we propose a deep multimodal transfer learning (DMTL) approach to transfer the knowledge from the previously labeled categories (source domain) to improve the retrieval performance on the unlabeled new categories (target domain). Specifically, we employ a joint learning paradigm to transfer knowledge by assigning a pseudolabel to each target sample. During training, the pseudolabel is iteratively updated and passed through our model in a self-supervised manner. At the same time, to reduce the domain discrepancy of different modalities, we construct multiple modality-specific neural networks to learn a shared semantic space for different modalities by enforcing the compactness of homoinstance samples and the scatters of heteroinstance samples. Our method is remarkably different from most of the existing transfer learning approaches. To be specific, previous works usually assume that the source domain and the target domain have the same label set. In contrast, our method considers a more challenging multimodal learning situation where the label sets of the two domains are different or even disjoint. Experimental studies on four widely used benchmarks validate the effectiveness of the proposed method in multimodal transfer learning and demonstrate its superior performance in CMR compared with 11 state-of-the-art methods.
Deep semi-supervised multi-view learning with increasing views

Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, Jie Lin, Huaibai Yan, and Dezhong Peng

IEEE Transactions on Cybernetics, 2022

Abs PDF

In this article, we study two challenging problems in semisupervised cross-view learning. On the one hand, most existing methods assume that the samples in all views have a pairwise relationship, that is, it is necessary to capture or establish the correspondence of different views at the sample level. Such an assumption is easily isolated even in the semisupervised setting wherein only a few samples have labels that could be used to establish the correspondence. On the other hand, almost all existing multiview methods, including semisupervised ones, usually train a model using a fixed dataset, which cannot handle the data of increasing views. In practice, the view number will increase when new sensors are deployed. To address the above two challenges, we propose a novel method that employs multiple independent semisupervised view-specific networks (ISVNs) to learn representation for multiple views in a view-decoupling fashion. The advantages of our method are two-fold. Thanks to our specifically designed autoencoder and pseudolabel learning paradigm, our method shows an effective way to utilize both the labeled and unlabeled data while relaxing the data assumption of the pairwise relationship, that is, correspondence. Furthermore, with our view decoupling strategy, the proposed ISVNs could be separately trained, thus efficiently handling the data of increasing views without retraining the entire model. To the best of our knowledge, our ISVN could be one of the first attempts to make handling increasing views in the semisupervised setting possible, as well as an effective solution to the noncorresponding problem. To verify the effectiveness and efficiency of our method, we conduct comprehensive experiments by comparing 13 state-of-the-art approaches on four multiview datasets in terms of retrieval and classification.
Natural language video localization: A revisit in span-based question answering framework

Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

Abs PDF Code

Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address the NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; then, it predicts which clip segment contains the target moment and suppresses the importance of other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem.
Augmented multi-party computation against gradient leakage in federated learning

Chi Zhang, Ekanut Sotthiwat, Liangli Zhen*, and Zengxiang Li

IEEE Transactions on Big Data, 2022

Abs PDF

Multi-Party Computation (MPC) provides an effective cryptographic solution for distributed computing systems so that local models with sensitive information are encrypted before sending to the centralized servers for aggregation. Though direct local knowledge leakages are eliminated in MPC-based algorithms, we observe the server can still obtain the local information indirectly in many scenarios, or even reveal the groundtruth images through methods like Deep Leakage from Gradients (DLG). To eliminate such possibilities and provide stronger protections, we propose an augmented MPC approach by encrypting local models with two rounds of decomposition before transmitting to the server. The proposed solution allows us to remove the constraint that servers must be honest in the general federated learning settings since the true global model is hidden from the servers. Specifically, the augmented MPC algorithm encodes local models into multiple secret shares in the first round, then each share is furthermore split into a public share and a private share. The consequences of such a two-round decomposition are that the augmented algorithm fully inherits the advantages of standard MPC by providing lossless encryption and decryption while simultaneously rendering the global model invisible to the central server. Both theoretical analysis and experimental verification demonstrate that such an augmented solution can provide stronger protections for the security and privacy of the training data, with minimal extra communication and computation costs incurred.
Deep supervised domain adaptation for pneumonia diagnosis from chest X-ray images

Yangqin Feng, Xinxing Xu, Yan Wang, Xiaofeng Lei, Soo Kng Teo, Jordan Sim Zheng Ting, Yonghan Ting, Liangli Zhen, Joey Tianyi Zhou, Yong Liu, and others

IEEE Journal of Biomedical and Health Informatics, 2022

Abs PDF

Pneumonia is one of the most common treatable causes of death, and early diagnosis allows for early intervention. Automated diagnosis of pneumonia can therefore improve outcomes. However, it is challenging to develop high-performance deep learning models due to the lack of well-annotated data for training. This paper proposes a novel method, called Deep Supervised Domain Adaptation (DSDA), to automatically diagnose pneumonia from chest X-ray images. Specifically, we propose to transfer the knowledge from a publicly available large-scale source dataset (ChestX-ray14) to a well-annotated but small-scale target dataset (the TTSH dataset). DSDA aligns the distributions of the source domain and the target domain according to the underlying semantics of the training samples. It includes two task-specific sub-networks for the source domain and the target domain, respectively. These two sub-networks share the feature extraction layers and are trained in an end-to-end manner. Unlike most existing domain adaptation approaches that perform the same tasks in the source domain and the target domain, we attempt to transfer the knowledge from a multi-label classification task in the source domain to a binary classification task in the target domain. To evaluate the effectiveness of our method, we compare it with several existing peer methods. The experimental results show that our method can achieve promising performance for automated pneumonia diagnosis.
Adversarial multimodal fusion with attention mechanism for skin lesion classification using clinical and dermoscopic images

Yan Wang, Yangqin Feng, Lei Zhang, Joey Tianyi Zhou, Yong Liu, Rick Siow Mong Goh, and Liangli Zhen*

Medical Image Analysis, 2022

Abs PDF

Accurate skin lesion diagnosis requires a great effort from experts to identify the characteristics from clinical and dermoscopic images. Deep multimodal learning-based methods can reduce intra- and inter-reader variability and improve diagnostic accuracy compared to the single modality-based methods. This study develops a novel method, named adversarial multimodal fusion with attention mechanism (AMFAM), to perform multimodal skin lesion classification. Specifically, we adopt a discriminator that uses adversarial learning to enforce the feature extractor to learn the correlated information explicitly. Moreover, we design an attention-based reconstruction strategy to encourage the feature extractor to concentrate on learning the features of the lesion area, thus, enhancing the feature vector from each modality with more discriminative information. Unlike existing multimodal-based approaches, which only focus on learning complementary features from dermoscopic and clinical images, our method considers both correlated and complementary information of the two modalities for multimodal fusion. To verify the effectiveness of our method, we conduct comprehensive experiments on a publicly available multimodal and multi-task skin lesion classification dataset: 7-point criteria evaluation database. The experimental results demonstrate that our proposed method outperforms the current state-of-the-art methods and improves the average AUC score by above 2% on the test set.
Efficient sharpness-aware minimization for improved training of neural networks

Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent YF Tan

ICLR, 2022

Abs PDF Code

Overparametrized Deep Neural Networks (DNNs) often achieve astounding performances, but may potentially result in severe generalization error. Recently, the relation between the sharpness of the loss landscape and the generalization error has been established by Foret et al. (2020), in which the Sharpness Aware Minimizer (SAM) was proposed to mitigate the degradation of the generalization. Unfortunately, SAM’s computational cost is roughly double that of base optimizers, such as Stochastic Gradient Descent (SGD). This paper thus proposes Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM’s efficiency at no cost to its generalization performance. ESAM includes two novel and efficient training strategies—StochasticWeight Perturbation and Sharpness-Sensitive Data Selection. In the former, the sharpness measure is approximated by perturbing a stochastically chosen set of weights in each iteration; in the latter, the SAM loss is optimized using only a judiciously selected subset of data that is sensitive to the sharpness. We provide theoretical explanations as to why these strategies perform well. We also show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM enhances the efficiency over SAM from requiring 100% extra computations to 40% vis-‘a-vis base optimizers, while test accuracies are preserved or even improved.

2021

Joint versus independent multi-view hashing for cross-view retrieval

Peng Hu, Xi Peng, Hongyuan Zhu, Jie Lin, Liangli Zhen, and Dezhong Peng

IEEE Transactions on Cybernetics, 2021

Abs PDF

Thanks to the low storage cost and high query speed, cross-view hashing (CVH) has been successfully used for similarity search in multimedia retrieval. However, most existing CVH methods use all views to learn a common Hamming space, thus making it difficult to handle the data with increasing views or a large number of views. To overcome these difficulties, we propose a decoupled CVH network (DCHN) approach which consists of a semantic hashing autoencoder module (SHAM) and multiple multiview hashing networks (MHNs). To be specific, SHAM adopts a hashing encoder and decoder to learn a discriminative Hamming space using either a few labels or the number of classes, that is, the so-called flexible inputs. After that, MHN independently projects all samples into the discriminative Hamming space that is treated as an alternative ground truth. In brief, the Hamming space is learned from the semantic space induced from the flexible inputs, which is further used to guide view-specific hashing in an independent fashion. Thanks to such an independent/decoupled paradigm, our method could enjoy high computational efficiency and the capacity of handling the increasing number of views by only using a few labels or the number of classes. For a newly coming view, we only need to add a view-specific network into our model and avoid retraining the entire model using the new and previous views. Extensive experiments are carried out on five widely used multiview databases compared with 15 state-of-the-art approaches. The results show that the proposed independent hashing paradigm is superior to the common joint ones while enjoying high efficiency and the capacity of handling newly coming views.
Evolutionary multi-objective model compression for deep neural networks

Zhehui Wang, Tao Luo, Miqing Li, Joey Tianyi Zhou, Rick Siow Mong Goh, and Liangli Zhen*

IEEE Computational Intelligence Magazine, 2021

Abs PDF Code

While deep neural networks (DNNs) deliver state-of-the-art accuracy on various applications from face recognition to language translation, it comes at the cost of high computational and space complexity, hindering their deployment on edge devices. To enable efficient processing of DNNs in inference, a novel approach, called Evolutionary Multi-Objective Model Compression (EMOMC), is proposed to optimize energy efficiency (or model size) and accuracy simultaneously. Specifically, the network pruning and quantization space are explored and exploited by using architecture population evolution. Furthermore, by taking advantage of the orthogonality between pruning and quantization, a two-stage pruning and quantization co-optimization strategy is developed, which considerably reduces time cost of the architecture search. Lastly, different dataflow designs and parameter coding schemes are considered in the optimization process since they have a significant impact on energy consumption and the model size. Owing to the cooperation of the evolution between different architectures in the population, a set of compact DNNs that offer trade-offs on different objectives (e.g., accuracy, energy efficiency and model size) can be obtained in a single run. Unlike most existing approaches designed to reduce the size of weight parameters with no significant loss of accuracy, the proposed method aims to achieve a trade-off between desirable objectives, for meeting different requirements of various edge devices. Experimental results demonstrate that the proposed approach can obtain a diverse population of compact DNNs that are suitable for a broad range of different memory usage and energy consumption requirements. Under negligible accuracy loss, EMOMC improves the energy efficiency and model compression rate of VGG-16 on CIFAR-10 by a factor of more than 8.9 X and 2.4 X, respectively.
Distributed monitoring for energy infrastructures: A two-tier analysis over wireless networks

Chi Zhang, Liangli Zhen, Joey Tianyi Zhou, and Cen Chen

IEEE Wireless Communication, 2021

Abs PDF

Wireless networks (e.g., 5G networks) enable distributed energy infrastructures to be connected even when they are geometrically isolated. Intelligent monitoring from remote sites therefore becomes possible, allowing decision makers to examine the status of distributed energy infrastructures from a central location. The major challenge is when local devices cannot perform the monitoring independently; transmitting every signal back to the central server triggers enormous amounts of wireless communication. To address this, we propose a two-tier AI system by offloading computations to multiple devices. Specifically, we build lightweight AI models for deployment on edge clients (i.e., edge sensors) and a large-scale AI model for the central server. These two types of AI models are trained with different criteria: the models on the edges act as the filtering tools to detect abnormal events and maximally avoid making false negative predictions, whereas the server model is supposed to be an expert for accurate predictions. By validating on a power theft dataset, we show that such a cascading methodology could filter out sufficient negative examples on the edge side while still being able to provide precise predictions on the second-round analysis.
Cross-modal discriminant adversarial network

Peng Hu, Xi Peng, Hongyuan Zhu, Jie Lin, Liangli Zhen, Wei Wang, and Dezhong Peng

Pattern Recognition, 2021

Abs PDF

Cross-modal retrieval aims at retrieving relevant points across different modalities, such as retrieving images via texts. One key challenge of cross-modal retrieval is narrowing the heterogeneous gap across diverse modalities. To overcome this challenge, we propose a novel method termed as Cross-modal discriminant Adversarial Network (CAN). Taking bi-modal data as a showcase, CAN consists of two parallel modality-specific generators, two modality-specific discriminators, and a Cross-modal Discriminant Mechanism (CDM). To be specific, the generators project diverse modalities into a latent cross-modal discriminant space. Meanwhile, the discriminators compete against the generators to alleviate the heterogeneous discrepancy in this space, i.e., the generators try to generate unified features to confuse the discriminators, and the discriminators aim to classify the generated results. To further remove the redundancy and preserve the discrimination, we propose CDM to project the generated results into a single common space, accompanying with a novel eigenvalue-based loss. Thanks to the eigenvalue-based loss, CDM could push as much discriminative power as possible into all latent directions. To demonstrate the effectiveness of our CAN, comprehensive experiments are conducted on four multimedia datasets comparing with 15 state-of-the-art approaches.
Automated building extraction using satellite remote sensing imagery

Qintao Hu, Liangli Zhen, Yao Mao, Xi Zhou, and Guozhong Zhou

Automation in Construction, 2021

Abs PDF

Automatic extraction of buildings from remote sensing images plays a critical role in urban planning and digital city construction applications. In real-world applications, however, real scenes can be highly complex (e.g., various building structures and shapes, presence of obstacles, and low contrast between buildings and surrounding regions), making automatic building extraction extremely challenging. To conquer this challenge, we propose a novel method called Deep Automatic Building Extraction Network (DABE-Net). It adopts squeeze-and-excitation (SE) operations and the residual recurrent convolutional neural network (RRCNN) to construct building-blocks. Furthermore, an attention mechanism is introduced into the network to improve segmentation accuracy. Specifically, to handle small buildings, we highlight small buildings and develop a multi-scale segmentation loss function. The theoretical analysis and experimental results show that the proposed method is effective in building extraction and outperforms several peer methods on the dataset of Mapping challenge competition.
DRSL: Deep relational similarity learning for cross-modal retrieval

Xu Wang, Peng Hu, Liangli Zhen, and Dezhong Peng

Information Sciences, 2021

Abs PDF Code

Cross-modal retrieval aims to retrieve relevant samples across different media modalities. Existing cross-modal retrieval approaches are contingent on learning common representations of all modalities by assuming that an equal amount of information exists in different modalities. However, since the quantity of information among cross-modal samples is unbalanced and unequal, it is inappropriate to directly match the obtained modality-specific representations across different modalities in a common space. In this paper, we propose a new method called Deep Relational Similarity Learning (DRSL) for cross-modal retrieval. Unlike existing approaches, the proposed DRSL aims to effectively bridge the heterogeneity gap of different modalities by directly learning the natural pairwise similarities instead of explicitly learning a common space. DRSL is a deep hybrid framework that integrates the relation networks module for relation learning, capturing the implicit nonlinear distance metric. To the best of our knowledge, DRSL is the first approach that incorporates relation networks into the cross-modal learning scenario. Comprehensive experimental results show that the proposed DRSL model achieves state-of-the-art results in cross-modal retrieval tasks on four widely-used benchmark datasets, i.e., Wikipedia, Pascal Sentences, NUS-WIDE-10K, and XMediaNet.
Learning cross-modal retrieval with noisy labels

Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, and Jie Lin

CVPR, 2021

Abs PDF

Recently, cross-modal retrieval is emerging with the help of deep multimodal learning. However, even for unimodal data, collecting large-scale well-annotated data is expensive and time-consuming, and not to mention the additional challenges from multiple modalities. Although crowd-sourcing annotation, e.g., Amazon’s Mechanical Turk, can be utilized to mitigate the labeling cost, but leading to the unavoidable noise in labels for the non-expert annotating. To tackle the challenge, this paper presents a general Multi-modal Robust Learning framework (MRL) for learning with multimodal noisy labels to mitigate noisy samples and correlate distinct modalities simultaneously. To be specific, we propose a Robust Clustering loss (RC) to make the deep networks focus on clean samples instead of noisy ones. Besides, a simple yet effective multimodal loss function, called Multimodal Contrastive loss (MC), is proposed to maxi-mize the mutual information between different modalities, thus alleviating the interference of noisy samples and cross-modal discrepancy. Extensive experiments are conducted on four widely-used multimodal datasets to demonstrate the effectiveness of the proposed approach by comparing to 14 state-of-the-art methods.
Video corpus moment retrieval with contrastive learning

Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh

SIGIR, 2021

Abs PDF Code

Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. As video and text are from two distinct feature spaces, there are two general approaches to address VCMR: (i) to separately encode each modality representations, then align the two modality representations for query processing, and (ii) to adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first approach is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives to refine video encoder and text encoder to learn video and text representations separately but with better alignment for VCMR. The video contrastive learning (VideoCL) is to maximize mutual information between query and candidate video at video-level. The frame contrastive learning (FrameCL) aims to highlight the moment region corresponds to the query at frame-level, within a video. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
Parallel attention network with sequence matching for video grounding

Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh

Findings of ACL, 2021

Abs PDF Code

Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query. In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task: multi-modal representation learning, and target moment boundary prediction. We design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text. Inspired by sequence labeling tasks in natural language processing, we split the ground truth moment into begin, inside, and end regions. We then propose a sequence matching strategy to guide start/end boundary predictions using region labels. Experimental results on three datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore, the effectiveness of the self-guided parallel attention module and the sequence matching module is verified.
Automated deepfake detection

Ping Liu, Yuewei Lin, Yang He, Yunchao Wei, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh, and Jingen Liu

arXiv preprint arXiv:2106.10705, 2021

Abs PDF

In this paper, we propose to utilize Automated Machine Learning to adaptively search a neural architecture for deepfake detection. This is the first time to employ automated machine learning for deepfake detection. Based on our explored search space, our proposed method achieves competitive prediction accuracy compared to previous methods. To improve the generalizability of our method, especially when training data and testing data are manipulated by different methods, we propose a simple yet effective strategy in our network learning process: making it to estimate potential manipulation regions besides predicting the real/fake labels. Unlike previous works manually design neural networks, our method can relieve us from the high labor cost in network construction. More than that, compared to previous works, our method depends much less on prior knowledge, e.g., which manipulation method is utilized or where exactly the fake image is manipulated. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method for deepfake detection.

2020

Kernel truncated regression representation for robust subspace clustering

Liangli Zhen, Dezhong Peng, Wei Wang, and Xin Yao

Information Sciences, 2020

Abs PDF Code

Subspace clustering aims to group data points into multiple clusters of which each corresponds to one subspace. Most existing subspace clustering approaches assume that input data lie on linear subspaces. In practice, however, this assumption usually does not hold. To achieve nonlinear subspace clustering, we propose a novel method, called kernel truncated regression representation. Our method consists of the following four steps: 1) projecting the input data into a hidden space, where each data point can be linearly represented by other data points; 2) calculating the linear representation coefficients of the data representations in the hidden space; 3) truncating the trivial coefficients to achieve robustness and block-diagonality; and 4) executing the graph cutting operation on the coefficient matrix by solving a graph Laplacian problem. Our method has the advantages of a closed-form solution and the capacity of clustering data points that lie on nonlinear subspaces. The first advantage makes our method efficient in handling large-scale datasets, and the second one enables the proposed method to conquer the nonlinear subspace clustering challenge. Extensive experiments on six benchmarks demonstrate the effectiveness and the efficiency of the proposed method in comparison with current state-of-the-art approaches.
Objective reduction for visualising many-objective solution sets

Liangli Zhen, Miqing Li, Dezhong Peng, and Xin Yao

Information Sciences, 2020

Abs PDF Code

Visualising a solution set is of high importance in many-objective optimisation. It can help algorithm designers understand the performance of search algorithms and decision makers select their preferred solution(s). In this paper, an objective reduction-based visualisation method (ORV) is proposed to view many-objective solution sets. ORV attempts to map a solution set from a high-dimensional objective space into a low-dimensional space while preserving the distribution and the Pareto dominance relation between solutions in the set. Specifically, ORV sequentially decomposes objective vectors which can be linearly represented by their positively correlated objective vectors until the expected number of preserved objective vectors is reached. ORV formulates the objective reduction as a solvable convex problem. Extensive experiments on both synthetic and real-world problems have verified the effectiveness of the proposed method.
An adaptive stochastic parallel gradient descent approach for efficient fiber coupling

Qintao Hu, Liangli Zhen, Yao Mao, Shiwei Zhu, Xi Zhou, and Guozhong Zhou

Optics Express, 2020

Abs PDF

In high-speed free-space optical communication systems, the received laser beam must be coupled into a single-mode fiber at the input of the receiver module. However, propagation through atmospheric turbulence degrades the spatial coherence of a laser beam and poses challenges for fiber coupling. In this paper, we propose a novel method, called as adaptive stochastic parallel gradient descent (ASPGD), to achieve efficient fiber coupling. To be specific, we formulate the fiber coupling problem as a model-free optimization problem and solve it using ASPGD in parallel. To avoid converging to the extremum points and accelerate its convergence speed, we integrate the momentum and the adaptive gain coefficient estimation to the original stochastic parallel gradient descent (SPGD) method. Simulation and experimental results demonstrate that the proposed method reduces 50% of iterations, while keeping the stability by comparing it with the original SPGD method.

2019

Deep supervised cross-modal retrieval

Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng

CVPR, 2019

Abs PDF Code

Cross-modal retrieval aims to enable flexible retrieval across different modalities. The core of cross-modal retrieval is how to measure the content similarity between different types of data. In this paper, we present a novel cross-modal retrieval method, called Deep Supervised Cross-modal Retrieval (DSCMR). It aims to find a common representation space, in which the samples from different modalities can be compared directly. Specifically, DSCMR minimises the discrimination loss in both the label space and the common representation space to supervise the model learning discriminative features. Furthermore, it simultaneously minimises the modality invariance loss and uses a weight sharing strategy to eliminate the cross-modal discrepancy of multimedia data in the common representation space to learn modality-invariant features. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective in cross-modal learning and significantly outperforms the state-of-the-art cross-modal retrieval methods.
Scalable deep multimodal learning for cross-modal retrieval

Peng Hu, Liangli Zhen, Dezhong Peng, and Pei Liu

SIGIR, 2019

Abs PDF

Cross-modal retrieval takes one type of data as the query to retrieve relevant data of another type. Most of existing cross-modal retrieval approaches were proposed to learn a common subspace in a joint manner, where the data from all modalities have to be involved during the whole training process. For these approaches, the optimal parameters of different modality-specific transformations are dependent on each other and the whole model has to be retrained when handling samples from new modalities. In this paper, we present a novel cross-modal retrieval method, called Scalable Deep Multimodal Learning (SDML). It proposes to predefine a common subspace, in which the between-class variation is maximized while the within-class variation is minimized. Then, it trains m modality-specific networks for m modalities (one network for each modality) to transform the multimodal data into the predefined common subspace to achieve multimodal learning. Unlike many of the existing methods, our method can train different modality-specific networks independently and thus be scalable to the number of modalities. To the best of our knowledge, the proposed SDML could be one of the first works to independently project data of an unfixed number of modalities into a predefined common subspace. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective and efficient in multimodal learning and outperforms the state-of-the-art methods in cross-modal retrieval.
Separated variational hashing networks for cross-modal retrieval

Peng Hu, Xu Wang, Liangli Zhen, and Dezhong Peng

ACM-MM, 2019

Abs PDF

Cross-modal hashing, due to its low storage cost and high query speed, has been successfully used for similarity search in multimedia retrieval applications. It projects high-dimensional data into a shared isomorphic Hamming space with similar binary codes for semantically-similar data. In some applications, all modalities may not be obtained or trained simultaneously for some reasons, such as privacy, secret, storage limitation, and computational resource limitation. However, most existing cross-modal hashing methods need all modalities to jointly learn the common Hamming space, thus hindering them from handling these problems. In this paper, we propose a novel approach called Separated Variational Hashing Networks (SVHNs) to overcome the above challenge. Firstly, it adopts a label network (LabNet) to exploit available and nonspecific label annotations to learn a latent common Hamming space by projecting each semantic label into a common binary representation. Then, each modality-specific network can separately map the samples of the corresponding modality into their binary semantic codes learned by LabNet. We achieve it by conducting variational inference to match the aggregated posterior of the hashing code of LabNet with an arbitrary prior distribution. The effectiveness and efficiency of our SVHNs are verified by extensive experiments carried out on four widely-used multimedia databases, in comparison with 11 state-of-the-art approaches.

2018

Local feature based multi-view discriminant analysis

Peng Hu, Dezhong Peng, Jixiang Guo, and Liangli Zhen

Knowledge-Based Systems, 2018

Abs PDF

In many real-world applications, an object can be represented from multiple views or styles. Thus, it is important to design algorithms that are able to recognize objects from distinct views. To the end, a large number of approaches have been proposed to achieve the heterogeneous recognition tasks through the use of local features. However, most of them only focus on binary views and thus cannot be applied to multi-view analysis. In this paper, we propose a novel local feature based multi-view discriminant analysis approach (FMDA). The proposed approach consists of three steps: First, the input images are represented using representation matrices and local feature descriptor (LFD) matrices of their overlapping patches, where the representation matrices are the linear coefficients of the LFDs for different views. In this way, it brings two advantages, i.e., addressing the small sample size (SSS) problem and preserving the discriminative information while reducing the redundant information in the LFD matrices. Second, the multi-view discriminant representation and feature projections are learned by projecting the LFDs of different views into a common space using the Fisher criterion. Finally, a simple but effective view-similarity constraint is proposed to adaptively learn the relationships between different views. To verify the effectiveness of the proposed method, extensive experiments are carried out on the FERET, CAS-PEAL-R1, CUFSF and HFB databases comparing with some state-of-the-art methods.
Multiobjective test problems with degenerate Pareto fronts

Liangli Zhen, Miqing Li, Ran Cheng, Dezhong Peng, and Xin Yao

arXiv preprint arXiv:1806.02706, 2018

Abs PDF Code

In multiobjective optimisation, a set of scalable test problems with a variety of features allow researchers to investigate and evaluate the abilities of different optimisation algorithms, and thus can help them to design and develop more effective and efficient approaches. Existing test problem suites mainly focus on situations where all the objectives are fully conflicting with each other. In such cases, an m-objective optimisation problem has an (m-1)-dimensional Pareto front in the objective space. However, in some optimisation problems, there may be unexpected characteristics among objectives, e.g., redundancy. The redundancy of some objectives can lead to the multiobjective problem having a degenerate Pareto front, i.e., the dimension of the Pareto front of the m-objective problem be less than (m-1). In this paper, we systematically study degenerate multiobjective problems. We abstract three general characteristics of degenerate problems, which are not formulated and systematically investigated in the literature. Based on these characteristics, we present a set of test problems to support the investigation of multiobjective optimisation algorithms under situations with redundant objectives. To the best of our knowledge, this work is the first one that explicitly formulates these three characteristics of degenerate problems, thus allowing the resulting test problems to be featured by their generality, in contrast to existing test problems designed for specific purposes (e.g., visualisation).

2017

Adjusting parallel coordinates for investigating multi-objective search

Liangli Zhen, Miqing Li, Ran Cheng, Dezhong Peng, and Xin Yao

SEAL, 2017

Abs PDF Code

Visualizing a high-dimensional solution set over the evolution process is a viable way to investigate the search behavior of evolutionary multi-objective optimization. The parallel coordinates plot which scales well to the data dimensionality is frequently used to observe solution sets in multi-objective optimization. However, the solution sets in parallel coordinates are typically presented by the natural order of the optimized objectives, with rare information of the relation between these objectives and also the Pareto dominance relation between solutions. In this paper, we attempt to adjust parallel coordinates to incorporate this information. Systematic experiments have shown the effectiveness of the proposed method.
How to read many-objective solution sets in parallel coordinates

Miqing Li, Liangli Zhen, and Xin Yao

IEEE Computational Intelligence Magazine, 2017

Abs PDF

Rapid development of evolutionary algor ithms in handling many-objective optimization problems requires viable methods of visualizing a high-dimensional solution set. The parallel coordinates plot which scales well to high-dimensional data is such a method, and has been frequently used in evolutionary many-objective optimization. However, the parallel coordinates plot is not as straightforward as the classic scatter plot to present the information contained in a solution set. In this paper, we make some observations of the parallel coordinates plot, in terms of comparing the quality of solution sets, understanding the shape and distribution of a solution set, and reflecting the relation between objectives. We hope that these observations could provide some guidelines as to the proper use of the parallel coordinates plot in evolutionary manyobjective optimization.
Underdetermined blind source separation using sparse coding

Liangli Zhen, Dezhong Peng, Zhang Yi, Yong Xiang, and Peng Chen

IEEE Transactions on Neural Networks and Learning Systems, 2017

Abs PDF Code

In an underdetermined mixture system with n unknown sources, it is a challenging task to separate these sources from their m observed mixture signals, where m . n. By exploiting the technique of sparse coding, we propose an effective approach to discover some 1-D subspaces from the set consisting of all the time-frequency (TF) representation vectors of observed mixture signals. We show that these 1-D subspaces are associated with TF points where only single source possesses dominant energy. By grouping the vectors in these subspaces via hierarchical clustering algorithm, we obtain the estimation of the mixing matrix. Finally, the source signals could be recovered by solving a series of least squares problems. Since the sparse coding strategy considers the linear representation relations among all the TF representation vectors of mixing signals, the proposed algorithm can provide an accurate estimation of the mixing matrix and is robust to the noises compared with the existing underdetermined blind source separation approaches. Theoretical analysis and experimental results demonstrate the effectiveness of the proposed method.