publications
The asterisk symbol (*) denotes the corresponding author. Google Scholar Profile
2025
- Deep evidential hashing for trustworthy cross-modal retrieval. Yuan Li, Liangli Zhen, Yuan Sun, Dezhong Peng, Xi Peng, and Peng Hu. AAAI, 2025
Cross-modal hashing provides an efficient solution for retrieval tasks across various modalities, such as images and text. However, most existing methods are deterministic models, which overlook the reliability associated with the retrieved results. This omission renders them unreliable for determining matches between data pairs based solely on Hamming distance. To bridge the gap, in this paper, we propose a novel method called Deep Evidential Cross-modal Hashing (DECH). This method equips hashing models with the ability to quantify the reliability level of the association between a query sample and each corresponding retrieved sample, bringing a new dimension of reliability to the cross-modal retrieval process. To achieve this, our method addresses two key challenges: i) To leverage evidential theory in guiding the model to learn hash codes, we design a novel evidence acquisition module to collect evidence and place the evidence captured by hash codes on a Beta distribution to derive a binomial opinion. Unlike existing evidential learning approaches that rely on classifiers, our method collects evidence directly through hash codes. ii) To tackle the task-oriented challenge, we first introduce a method to update the derived binomial opinion, allowing it to present the uncertainty caused by conflicting evidence. Following this manner, we present a strategy to precisely evaluate the reliability level of retrieved results, culminating in performance improvement. We validate the efficacy of our DECH through extensive experimentation on four benchmark datasets. The experimental results demonstrate our superior performance compared to 12 state-of-the-art methods.
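As a rough illustration of the "evidence on a Beta distribution, then a binomial opinion" step mentioned above, the sketch below uses the standard subjective-logic mapping from evidence counts to belief, disbelief, and uncertainty. How DECH actually collects evidence from hash codes is not described here, so the `positive` and `negative` inputs, the prior weight of 2, and the base rate of 0.5 are assumptions made for the example, not details taken from the paper.

```python
from dataclasses import dataclass

PRIOR_WEIGHT = 2.0  # non-informative prior weight W (a common subjective-logic default; assumed here)


@dataclass(frozen=True)
class BinomialOpinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float

    def expected_probability(self) -> float:
        # Projected probability: belief plus the base-rate share of the uncertainty mass.
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(positive: float, negative: float, base_rate: float = 0.5) -> BinomialOpinion:
    """Map hypothetical positive/negative evidence amounts to a binomial opinion."""
    if positive < 0 or negative < 0:
        raise ValueError("evidence must be non-negative")
    total = positive + negative + PRIOR_WEIGHT
    return BinomialOpinion(
        belief=positive / total,
        disbelief=negative / total,
        uncertainty=PRIOR_WEIGHT / total,
        base_rate=base_rate,
    )


def beta_parameters(positive: float, negative: float, base_rate: float = 0.5) -> tuple[float, float]:
    """Parameters (alpha, beta) of the Beta distribution equivalent to the opinion above."""
    return positive + PRIOR_WEIGHT * base_rate, negative + PRIOR_WEIGHT * (1.0 - base_rate)
```

With no evidence at all, the opinion is fully uncertain; as evidence accumulates, the uncertainty mass shrinks. The mean of the Beta distribution equals the opinion's expected probability.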
- Generative image reconstruction from gradients. Ekanut Sotthiwat, Liangli Zhen*, Chi Zhang, Zengxiang Li, and Rick Goh. IEEE Transactions on Neural Networks and Learning Systems, 2025
In this paper, we propose a method, Generative Image Reconstruction from Gradients (GIRG), for recovering training images from gradients in a federated learning setting, where privacy is preserved by sharing model weights and gradients rather than raw training data. Previous studies have shown the potential for revealing clients’ private information or even pixel-level recovery of training images from shared gradients. However, existing methods are limited to low-resolution images and small batch sizes or require prior knowledge about the client data. GIRG utilizes a conditional generative model to reconstruct training images and their corresponding labels from the shared gradients. Unlike previous generative model-based methods, GIRG does not require prior knowledge of the training data. Furthermore, GIRG optimizes the weights of the conditional generative model to generate highly accurate “dummy” images instead of optimizing the input vectors of the generative model. Comprehensive empirical results show that GIRG is able to recover high-resolution images with large batch sizes and can even recover images from the aggregation of gradients from multiple participants. These results reveal the vulnerability of current federated learning practices and call for immediate efforts to prevent inversion attacks in gradient-sharing-based collaborative training.
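GIRG itself trains a conditional generative model, which does not fit in a short listing. The sketch below is a much simpler and well-known illustration of why shared gradients can leak inputs at all: for a single fully connected layer with a bias, the weight gradient is the outer product of the bias gradient and the input, so the input can be read off by division. This is not the GIRG method; the squared-error loss and all function names are assumptions chosen for the example.

```python
def linear_layer_gradients(weights, bias, x, target):
    """Gradients of 0.5 * ||W x + b - target||^2 with respect to W and b."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    grad_z = [zi - ti for zi, ti in zip(z, target)]    # dL/dz
    grad_w = [[g * xi for xi in x] for g in grad_z]    # dL/dW = dL/dz * x^T
    grad_b = list(grad_z)                              # dL/db = dL/dz
    return grad_w, grad_b


def recover_input(grad_w, grad_b, eps=1e-12):
    """Recover x from one row of dL/dW divided by the matching entry of dL/db."""
    for row, g in zip(grad_w, grad_b):
        if abs(g) > eps:
            return [v / g for v in row]
    raise ValueError("all bias gradients are (near) zero; input cannot be recovered this way")
```

The recovery is exact for a single sample, up to floating-point rounding. Batched or aggregated gradients mix several inputs together, which is the harder setting the abstract refers to.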
2024
- Continuous disentangled joint space learning for domain generalization. Zizhou Wang, Yan Wang, Yangqin Feng, Jiawei Du, Yong Liu, Rick Siow Mong Goh, and Liangli Zhen*. IEEE Transactions on Neural Networks and Learning Systems, 2024
Domain generalization aims to learn a model on one or multiple observed source domains that can generalize to unseen target test domains. Previous approaches have focused on extracting domain-invariant information from multiple source domains, but domain-specific information is also closely tied to semantics in individual domains and is not well-suited for generalization to the target domain. In this paper, we propose a novel domain generalization method called Continuous Disentangled Joint Space Learning (CJSL), which leverages both domain-invariant and domain-specific information for more effective domain generalization. The key idea behind CJSL is to formulate and learn a continuous joint space for domain-specific representations from source domains through iterative feature disentanglement. This learned continuous joint space can then be used to simulate domain-specific representations for test samples from a mixture of multiple domains via Monte Carlo sampling during the inference stage. Unlike existing approaches, which exploit domain-invariant feature vectors only or aim to learn a universal domain-specific feature extractor, we simulate domain-specific representations via sampling the latent vectors in the learned continuous joint space for the test sample to fully utilize the power of multiple domain-specific classifiers for the robust prediction. Empirical results demonstrate that CJSL outperforms 19 state-of-the-art methods on seven benchmarks, indicating the effectiveness of our proposed method.
- MedNAS: Multi-scale training-free neural architecture search for medical image analysis. Yan Wang, Liangli Zhen*, Jianwei Zhang, Miqing Li, Lei Zhang, Zizhou Wang, Yangqin Feng, Yu Xue, Xiao Wang, Zheng Chen, Tao Luo, Rick Siow Mong Goh, and Yong Liu. IEEE Transactions on Evolutionary Computation, 2024
Deep neural networks have demonstrated impressive results in medical image analysis, but designing suitable architectures for each specific task is expertise-dependent and time-consuming. Neural architecture search (NAS) offers an effective means of discovering architectures. It has been highly successful in numerous applications, particularly in natural image classification. Yet, medical images possess unique characteristics, such as small regions and a wide variety of lesion sizes, that differentiate them from natural images. Furthermore, most current NAS methods struggle with high computational costs, especially when dealing with high-resolution image datasets. In this paper, we present a novel evolutionary neural architecture search method called Multi-Scale Training-Free Neural Architecture Search to address these challenges. Specifically, to accommodate the broad range of lesion region sizes in disease diagnosis, we develop a new reduction cell search space that enables the search algorithm to explicitly identify the optimal scale combination for multi-scale feature extraction. To overcome the issue of high computational costs, we utilize training-free indicators as performance measures for candidate architectures, which allows us to search for the optimal architecture more efficiently. More specifically, by considering the capability and simplicity of various networks, we formulate a multi-objective optimization problem that involves two training-free indicators and model complexity for candidate architectures. Extensive experiments on a large medical image benchmark and a publicly available breast cancer detection dataset are conducted. The empirical results demonstrate that our MSTF-NAS outperforms both human-designed architectures and current state-of-the-art NAS algorithms on both datasets, indicating the effectiveness of our proposed method.
- Geometric correspondence-based multimodal learning for ophthalmic image analysis. Yan Wang, Liangli Zhen*, Tien-En Tan, Huazhu Fu, Yangqin Feng, Zizhou Wang, Xinxing Xu, Rick Siow Mong Goh, Yipin Ng, Claire Calhoun, Gavin SW Tan, Jennifer K Sun, Yong Liu, and Daniel SW Ting. IEEE Transactions on Medical Imaging, 2024
Color fundus photography (CFP) and optical coherence tomography (OCT) images are two of the most widely used modalities in the clinical diagnosis and management of retinal diseases. Despite the widespread use of multimodal imaging in clinical practice, few methods for automated diagnosis of eye diseases utilize correlated and complementary information from multiple modalities effectively. This paper explores how to leverage the information from CFP and OCT images to improve the automated diagnosis of retinal diseases. We propose a novel multimodal learning method, named geometric correspondence-based multimodal learning network (GeCoM-Net), to achieve the fusion of CFP and OCT images. Specifically, inspired by clinical observations, we consider the geometric correspondence between the OCT slice and the CFP region to learn the correlated features of the two modalities for robust fusion. Furthermore, we design a new feature selection strategy to extract discriminative OCT representations by automatically selecting the important feature maps from OCT slices. Unlike the existing multimodal learning methods, GeCoM-Net is the first method that formulates the geometric relationships between the OCT slice and the corresponding region of the CFP image explicitly for CFP and OCT fusion. Experiments have been conducted on a large-scale private dataset and a publicly available dataset to evaluate the effectiveness of GeCoM-Net for diagnosing diabetic macular edema (DME), impaired visual acuity (VA) and glaucoma. The empirical results show that our method outperforms the current state-of-the-art multimodal learning methods by improving the AUROC score by 0.4%, 1.9% and 2.9% for DME, VA and glaucoma detection, respectively.
- Deep supervised multi-view learning with graph priors. Peng Hu, Liangli Zhen*, Xi Peng, Hongyuan Zhu, Jie Lin, Xu Wang, and Dezhong Peng. IEEE Transactions on Image Processing, 2024
This paper presents a novel method for supervised multi-view representation learning, which projects multiple views into a latent common space while preserving the discrimination and intrinsic structure of each view. Specifically, an a priori discriminant similarity graph is first constructed based on labels and pairwise relationships of multi-view inputs. Then, view-specific networks progressively map inputs to common representations whose affinity approximates the constructed graph. To achieve graph consistency, discrimination, and cross-view invariance, the similarity graph is enforced to meet the following constraints: 1) pairwise relationship should be consistent between the input space and common space for each view; 2) within-class similarity is larger than any between-class similarity for each view; 3) the inter-view samples from the same (or different) classes are mutually similar (or dissimilar). Consequently, the intrinsic structure and discrimination are preserved in the latent common space using an a priori approximation schema. Moreover, we present a sampling strategy to approach a sub-graph sampled from the whole similarity structure instead of approximating the graph of the whole dataset explicitly, thus benefiting lower space complexity and the capability of handling large-scale multi-view datasets. Extensive experiments show the promising performance of our method on five datasets by comparing it with 18 state-of-the-art methods.
- Global challenge for safe and secure LLMs track 1. Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, and others. arXiv preprint arXiv:2411.14502, 2024
This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models. The results of Track 1 highlighted significant advances in jailbreak methods and security testing for LLMs. Competing teams were evaluated based on their models’ resistance to 85 predefined undesirable behaviors, spanning categories such as prejudice, offensive content, misinformation, and promotion of illegal activities. Notably, top-performing teams achieved high attack success rates by introducing innovative techniques, including scenario induction templates that systematically generated context-sensitive prompts and re-suffix attack mechanisms, which adapted suffixes to bypass model filters across multiple LLMs. 
These techniques demonstrated not only effectiveness in circumventing safeguards but also transferability across different model types, underscoring the adaptability and sophistication of modern adversarial methods. Track 2, scheduled to begin in 2025, will emphasize the development of model-agnostic defense strategies aimed at countering advanced jailbreak attacks. The primary objective of this track is to advance adaptable frameworks that can effectively mitigate adversarial attacks across various LLM architectures.
- An end-to-end contrastive deep-learning framework for remote physiological signal measurement. Bingjie Wu, Menghan Zhou, Wei Liu, Xingyao Wang, Xingjian Zheng, Yiping Xie, Chaoqi Luo, and Liangli Zhen*. 1st Place in Track 1 of RePSS Challenge at IJCAI-2024. IJCAI Vision-based Remote Physiological Signal Sensing Challenge, 2024
Heart rate measurements based on remote physiological signals could significantly facilitate health monitoring in daily life. However, the ground-truth labels of the physiological signals are expensive and hard to collect. In this paper, we present a contrastive self-supervised learning framework to extract discriminative remote physiological features by leveraging periodic signal priors without ground-truth labels in the pre-training stage. Specifically, a ranking loss and a contrastive learning loss are constructed to extract knowledge with resampling of the video clips. In addition, data augmentation and ensemble learning strategies are designed to fine-tune the pre-trained model and fuse the results to improve the heart rate measurement. Our final solution achieves the 1st place in track 1 of the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge.
- Ensemble deep learning for blood pressure estimation using facial videos. Wei Liu, Bingjie Wu, Menghan Zhou, Xingyao Wang, Xingjian Zheng, Yiping Xie, Chaoqi Luo, and Liangli Zhen*. 1st Place in Track 2 of RePSS Challenge at IJCAI-2024. IJCAI Vision-based Remote Physiological Signal Sensing Challenge, 2024
Blood pressure (BP) estimation is a standard and critical component of routine health assessment, especially for cardiac disease patients. Traditional methods typically require direct contact with the patient, which can cause discomfort and inconvenience. Remote photoplethysmography (rPPG) that enables non-contact measurement of the blood volume pulse using trivial cues from facial videos has drawn attention to measure vital signs. This paper presents an ensemble deep learning approach for estimating BP remotely using facial videos. Specifically, to address the vulnerabilities and biases in deep learning models for BP measurement, we emphasize both the accuracy of individual models and the diversity within the ensemble. We utilize advanced deep learning architectures to construct several regression models incorporating convolutional neural networks and transformer blocks, which learn the spatiotemporal relationships between different frames and locations. These trained models are then combined to measure BP readings. Additionally, to enhance the system’s robustness under varying lighting conditions, data augmentation techniques are employed to generate more training data. The proposed method is tested on an unseen dataset and the average root of mean squared error (RMSE) is 12.95 mmHg, ranking 1st in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge.
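To make the evaluation metric concrete, here is a minimal sketch of combining several regressors' outputs and scoring the result with RMSE. The abstract does not say how the trained models are combined, so the plain arithmetic mean used here is an assumption, and the function names are hypothetical.

```python
import math


def ensemble_mean(predictions_per_model):
    """Average per-sample predictions across models (one inner list per model)."""
    n_models = len(predictions_per_model)
    return [sum(values) / n_models for values in zip(*predictions_per_model)]


def rmse(predicted, truth):
    """Root of the mean squared error between two equal-length sequences."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))
```

For example, two models predicting `[120, 80]` and `[130, 70]` mmHg average to `[125, 75]`. Against reference readings of `[122, 79]`, the errors are 3 and -4, giving an RMSE of the square root of 12.5.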
- MedMLP: An efficient MLP-like network for zero-shot retinal image classification. Menghan Zhou, Yanyu Xu, Zhi Da Soh, Huazhu Fu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu, and Liangli Zhen. MICCAI, 2024
Deep neural networks (DNNs) have demonstrated superior performance compared to humans across various tasks. However, DNNs often face the challenge of domain shift, where their performance notably deteriorates when applied to medical images with distributions differing from those seen during training. To address this issue and achieve high performance in new target domains under zero-shot settings, we leverage the ability of self-attention mechanisms to capture global dependencies. We introduce a novel MLP-like model designed for superior efficiency and zero-shot robustness. Specifically, we propose an adaptive fully-connected (AdaFC) layer to overcome the fundamental limitation of traditional fully-connected layers in adapting to inputs of various sizes while maintaining computational efficiency. Building upon AdaFC, we present a new MLP-based network architecture named MedMLP. Through our proposed training pipeline, we achieve a significant 20.1% increase in model testing accuracy on an out-of-distribution dataset, surpassing the widely used ResNet-50 model.
- Neural architecture search with progressive evaluation and sub-population preservation. Yu Xue, Jiajie Zha, Danilo Pelusi, Peng Chen, Tao Luo, Liangli Zhen, Yan Wang, and Mohamed Wahib. IEEE Transactions on Evolutionary Computation, 2024
Neural architecture search (NAS) is an effective approach for automating the design of deep neural networks. Evolutionary computation (EC) is commonly used in NAS due to its global optimization capability. However, the evaluation phase of architecture candidates in EC-based NAS is compute-intensive, limiting its application for many real-world problems. To overcome this challenge, we propose a novel progressive evaluation strategy for the evaluation phase in convolutional neural network architecture search, in which the number of training epochs of network individuals is progressively increased. In addition, a sub-population preservation strategy is proposed to preserve medium and large models to avoid prematurely discarding networks that may not perform well in the early stages but have the potential to excel with further optimization. Our proposed algorithm reduces the computational cost of the evaluation phase and promotes population diversity and fairness by preserving promising networks based on their distribution. We evaluate the proposed progressive evaluation and sub-population preservation of neural architecture search (PEPNAS) algorithm on the CIFAR10, CIFAR100, and ImageNet benchmark datasets, and compare it with 36 state-of-the-art algorithms, including manually designed networks, reinforcement learning (RL) algorithms, gradient-based algorithms, and other EC-based ones. The experimental results demonstrate that PEPNAS effectively identifies networks with competitive accuracy while also markedly improving the efficiency of the search process. For instance, PEPNAS discovers the architecture on CIFAR10 with a low error rate of 2.38% using only 0.7 GPU days. We directly adopt the searched architecture for the image classification on the CIFAR100 and ImageNet datasets, which achieves the top 1 error rates of 16.46% and 26.25%, respectively.
- Evolutionary architecture search for generative adversarial networks based on weight sharing. Yu Xue, Weinan Tong, Ferrante Neri, Peng Chen, Tao Luo, Liangli Zhen, and Xiao Wang. IEEE Transactions on Evolutionary Computation, 2024
Generative adversarial networks (GANs) are a powerful generative technique but frequently face challenges with training stability. Network architecture plays a significant role in determining the final output of GANs, but designing a fine architecture demands extensive domain expertise. This paper aims to address this issue by searching for high-performance generator architectures through neural architecture search (NAS). The proposed approach, called evolutionary weight sharing generative adversarial networks (EWSGAN), is based on weight sharing and comprises two steps. First, a supernet of the generator is trained using weight sharing. Second, a multi-objective evolutionary algorithm (MOEA) is employed to identify optimal subnets from the supernet. These subnets inherit weights directly from the supernet for fitness assessment. Two strategies are used to stabilise the training of the generator supernet: a fair single-path sampling strategy and a discarding strategy. Experimental results indicate that the architecture searched by our method achieved a new state-of-the-art among NAS-GAN methods with a Fréchet inception distance (FID) of 9.09 and an inception score (IS) of 8.99 on the CIFAR-10 dataset. It also demonstrates competitive performance on the STL-10 dataset, achieving an FID of 21.89 and an IS of 10.51.
2023
- Generative gradient inversion via over-parameterized convolutional networks in federated learning. Chi Zhang, Xiaoman Zhang, Ekanut Sotthiwat, Yanyu Xu, Ping Liu, Liangli Zhen*, and Yong Liu. ICCV, 2023
Federated learning has gained recognition as a secure approach for safeguarding local private data in collaborative learning. However, the advent of gradient inversion research has posed significant challenges to this premise by enabling a third party to recover ground-truth images via gradients. While prior research has predominantly focused on low-resolution images and small batch sizes, this study highlights the feasibility of reconstructing complex images with high resolutions and large batch sizes. The success of the proposed method is contingent on three crucial components: a convolutional generative model, an over-parameterized network, and a well-designed architecture. Practical experiments demonstrate that the proposed algorithm achieves high-fidelity image recovery, surpassing state-of-the-art competitors that commonly fail in more intricate scenarios. Consequently, our study shows that local participants in a federated learning system are vulnerable to potential data leakage issues. The source code will be available upon publication.
- Low-resolution self-attention for semantic segmentation. Yu-Huan Wu, Shi-Chen Zhang, Yun Liu, Le Zhang, Xin Zhan, Daquan Zhou, Jiashi Feng, Ming-Ming Cheng, and Liangli Zhen. arXiv preprint arXiv:2310.05026, 2023
Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image’s resolution, with additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models.
- Deep neural network augments performance of junior residents in diagnosing COVID-19 pneumonia on chest radiographs. Yangqin Feng, Jordan Sim Zheng Ting, Xinxing Xu, and others. Diagnostics, 2023
Chest X-rays (CXRs) are essential in the preliminary radiographic assessment of patients affected by COVID-19. Junior residents, as the first point-of-contact in the diagnostic process, are expected to interpret these CXRs accurately. We aimed to assess the effectiveness of a deep neural network in distinguishing COVID-19 from other types of pneumonia, and to determine its potential contribution to improving the diagnostic precision of less experienced residents. A total of 5051 CXRs were utilized to develop and assess an artificial intelligence (AI) model capable of performing three-class classification, namely non-pneumonia, non-COVID-19 pneumonia, and COVID-19 pneumonia. Additionally, an external dataset comprising 500 distinct CXRs was examined by three junior residents with differing levels of training. The CXRs were evaluated both with and without AI assistance. The AI model demonstrated impressive performance, with an Area under the ROC Curve (AUC) of 0.9518 on the internal test set and 0.8594 on the external test set, which improves the AUC score of the current state-of-the-art algorithms by 1.25% and 4.26%, respectively. When assisted by the AI model, the performance of the junior residents improved in a manner that was inversely proportional to their level of training. Among the three junior residents, two showed significant improvement with the assistance of AI. This research highlights the novel development of an AI model for three-class CXR classification and its potential to augment junior residents’ diagnostic accuracy, with validation on external data to demonstrate real-world applicability. In practical use, the AI model effectively supported junior residents in interpreting CXRs, boosting their confidence in diagnosis. While the AI model improved junior residents’ performance, a decline in performance was observed on the external test compared to the internal test set. 
This suggests a domain shift between the patient dataset and the external dataset, highlighting the need for future research on test-time training domain adaptation to address this issue.
- Contrastive domain adaptation with consistency match for automated pneumonia diagnosis. Yangqin Feng, Zizhou Wang, Xinxing Xu, Yan Wang, Huazhu Fu, Shaohua Li, Liangli Zhen, Xiaofeng Lei, Yingnan Cui, Jordan Sim Zheng Ting, and others. Medical Image Analysis, 2023
Pneumonia can be difficult to diagnose because its symptoms are highly variable, and the radiographic signs are often very similar to those seen in other illnesses such as a cold or influenza. Deep neural networks have shown promising performance in automated pneumonia diagnosis using chest X-ray radiography, allowing mass screening and early intervention to reduce the severe cases and death toll. However, they usually require many well-labelled chest X-ray images for training to achieve high diagnostic accuracy. To reduce the need for training data and annotation resources, we propose a novel method called Contrastive Domain Adaptation with Consistency Match (CDACM). It transfers the knowledge from different but relevant datasets to the unlabelled small-size target dataset and improves the semantic quality of the learnt representations. Specifically, we design a conditional domain adversarial network to exploit discriminative information conveyed in the predictions to mitigate the domain gap between the source and target datasets. Furthermore, due to the small scale of the target dataset, we construct a feature cloud for each target sample and leverage contrastive learning to extract more discriminative features. Lastly, we propose adaptive feature cloud expansion to push the decision boundary to a low-density area. Unlike most existing transfer learning methods that aim only to mitigate the domain gap, our method instead simultaneously considers the domain gap and the data deficiency problem of the target dataset. The conditional domain adaptation and the feature cloud generation of our method are learned jointly to extract discriminative features in an end-to-end manner. Besides, the adaptive feature cloud expansion improves the model’s generalisation ability in the target domain. 
Extensive experiments on pneumonia and COVID-19 diagnosis tasks demonstrate that our method outperforms several state-of-the-art unsupervised domain adaptation approaches, which verifies the effectiveness of CDACM for automated pneumonia diagnosis using chest X-ray imaging.
2022
- Adversarial multimodal fusion with attention mechanism for skin lesion classification using clinical and dermoscopic images. Yan Wang, Yangqin Feng, Lei Zhang, Joey Tianyi Zhou, Yong Liu, Rick Siow Mong Goh, and Liangli Zhen*. Medical Image Analysis, 2022
Accurate skin lesion diagnosis requires a great effort from experts to identify the characteristics from clinical and dermoscopic images. Deep multimodal learning-based methods can reduce intra- and inter-reader variability and improve diagnostic accuracy compared to the single modality-based methods. This study develops a novel method, named adversarial multimodal fusion with attention mechanism (AMFAM), to perform multimodal skin lesion classification. Specifically, we adopt a discriminator that uses adversarial learning to enforce the feature extractor to learn the correlated information explicitly. Moreover, we design an attention-based reconstruction strategy to encourage the feature extractor to concentrate on learning the features of the lesion area, thus, enhancing the feature vector from each modality with more discriminative information. Unlike existing multimodal-based approaches, which only focus on learning complementary features from dermoscopic and clinical images, our method considers both correlated and complementary information of the two modalities for multimodal fusion. To verify the effectiveness of our method, we conduct comprehensive experiments on a publicly available multimodal and multi-task skin lesion classification dataset: 7-point criteria evaluation database. The experimental results demonstrate that our proposed method outperforms the current state-of-the-art methods and improves the average AUC score by above 2% on the test set.
- Deep multimodal transfer learning for cross-modal retrieval. Liangli Zhen, Peng Hu, Xi Peng, Rick Siow Mong Goh, and Joey Tianyi Zhou. IEEE Transactions on Neural Networks and Learning Systems, 2022
Cross-modal retrieval (CMR) enables flexible retrieval experience across different modalities (e.g., texts versus images), which maximally benefits us from the abundance of multimedia data. Existing deep CMR approaches commonly require a large amount of labeled data for training to achieve high performance. However, it is time-consuming and expensive to annotate the multimedia data manually. Thus, how to transfer valuable knowledge from existing annotated data to new data, especially from the known categories to new categories, becomes attractive for real-world applications. To achieve this end, we propose a deep multimodal transfer learning (DMTL) approach to transfer the knowledge from the previously labeled categories (source domain) to improve the retrieval performance on the unlabeled new categories (target domain). Specifically, we employ a joint learning paradigm to transfer knowledge by assigning a pseudolabel to each target sample. During training, the pseudolabel is iteratively updated and passed through our model in a self-supervised manner. At the same time, to reduce the domain discrepancy of different modalities, we construct multiple modality-specific neural networks to learn a shared semantic space for different modalities by enforcing the compactness of homoinstance samples and the scatters of heteroinstance samples. Our method is remarkably different from most of the existing transfer learning approaches. To be specific, previous works usually assume that the source domain and the target domain have the same label set. In contrast, our method considers a more challenging multimodal learning situation where the label sets of the two domains are different or even disjoint. Experimental studies on four widely used benchmarks validate the effectiveness of the proposed method in multimodal transfer learning and demonstrate its superior performance in CMR compared with 11 state-of-the-art methods.
- Augmented multi-party computation against gradient leakage in federated learning. Chi Zhang, Ekanut Sotthiwat, Liangli Zhen*, and Zengxiang Li. IEEE Transactions on Big Data, 2022
Multi-Party Computation (MPC) provides an effective cryptographic solution for distributed computing systems so that local models with sensitive information are encrypted before sending to the centralized servers for aggregation. Though direct local knowledge leakages are eliminated in MPC-based algorithms, we observe the server can still obtain the local information indirectly in many scenarios, or even reveal the groundtruth images through methods like Deep Leakage from Gradients (DLG). To eliminate such possibilities and provide stronger protections, we propose an augmented MPC approach by encrypting local models with two rounds of decomposition before transmitting to the server. The proposed solution allows us to remove the constraint that servers must be honest in the general federated learning settings since the true global model is hidden from the servers. Specifically, the augmented MPC algorithm encodes local models into multiple secret shares in the first round, then each share is furthermore split into a public share and a private share. The consequences of such a two-round decomposition are that the augmented algorithm fully inherits the advantages of standard MPC by providing lossless encryption and decryption while simultaneously rendering the global model invisible to the central server. Both theoretical analysis and experimental verification demonstrate that such an augmented solution can provide stronger protections for the security and privacy of the training data, with minimal extra communication and computation costs incurred.
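As a rough illustration of the secret-sharing idea behind MPC-based aggregation, the sketch below uses standard additive secret sharing over a prime field. It corresponds only loosely to the first decomposition round described above; the second round (splitting each share into a public and a private share) is not shown. The modulus `PRIME`, the function names, and the integer-valued client updates are all assumptions made for this example, not details from the paper.

```python
import random

PRIME = 2**61 - 1  # field modulus; an arbitrary choice for this sketch


def split_into_shares(value, n_shares, rng):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_shares - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recover the secret by summing all shares mod PRIME."""
    return sum(shares) % PRIME


def secure_sum(client_values, rng):
    """Aggregate client values without any party seeing an individual value."""
    n = len(client_values)
    # Client i splits its value into n shares and sends share j to party j.
    all_shares = [split_into_shares(v, n, rng) for v in client_values]
    # Party j only ever sees the j-th share of each client and adds them up.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Combining the partial sums reveals the total, not the individual inputs.
    return reconstruct(partial_sums)


rng = random.Random(0)
client_updates = [12, 7, 30]  # hypothetical quantized model updates
total = secure_sum(client_updates, rng)
print(total)
```

Because the shares are exact integers modulo `PRIME`, reconstruction is lossless, which is the property the abstract refers to as lossless encryption and decryption.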
- Efficient sharpness-aware minimization for improved training of neural networks. Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent YF Tan. ICLR, 2022
Overparametrized Deep Neural Networks (DNNs) often achieve astounding performances, but may potentially result in severe generalization error. Recently, the relation between the sharpness of the loss landscape and the generalization error has been established by Foret et al. (2020), in which the Sharpness Aware Minimizer (SAM) was proposed to mitigate the degradation of the generalization. Unfortunately, SAM’s computational cost is roughly double that of base optimizers, such as Stochastic Gradient Descent (SGD). This paper thus proposes Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM’s efficiency at no cost to its generalization performance. ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection. In the former, the sharpness measure is approximated by perturbing a stochastically chosen set of weights in each iteration; in the latter, the SAM loss is optimized using only a judiciously selected subset of data that is sensitive to the sharpness. We provide theoretical explanations as to why these strategies perform well. We also show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM enhances the efficiency over SAM from requiring 100% extra computations to 40% vis-à-vis base optimizers, while test accuracies are preserved or even improved.
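To make the idea of perturbing only a random subset of weights concrete, here is a minimal sketch of a SAM-style update on a toy quadratic loss. It is not the ESAM implementation: the loss, the hyperparameters `lr`, `rho`, and `beta`, and the `1 / beta` rescaling of the masked perturbation are assumptions chosen for illustration, and the data-selection strategy is omitted entirely.

```python
import math
import random

COEFFS = [1.0, 10.0]  # curvatures of a toy quadratic loss (hypothetical)


def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(COEFFS, w))


def grad(w):
    return [c * wi for c, wi in zip(COEFFS, w)]


def sam_step(w, lr=0.05, rho=0.05, beta=0.5, rng=random):
    """One SAM-style step with a randomly masked weight perturbation."""
    g = grad(w)  # first gradient evaluation
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    # SAM perturbation: move a distance rho along the normalized gradient.
    eps = [rho * gi / norm for gi in g]
    # Keep each coordinate of the perturbation with probability beta and
    # rescale the kept ones by 1 / beta (an assumption of this sketch).
    mask = [1.0 if rng.random() < beta else 0.0 for _ in w]
    eps = [m * e / beta for m, e in zip(mask, eps)]
    # Second gradient evaluation, taken at the perturbed weights.
    g_adv = grad([wi + e for wi, e in zip(w, eps)])
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


rng = random.Random(0)
w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(200):
    w = sam_step(w, rng=rng)
final_loss = loss(w)
print(initial_loss, final_loss)
```

The two calls to `grad` per step are what make plain SAM roughly twice as expensive as the base optimizer. In this toy example the mask saves no computation; it only shows where a stochastically chosen subset of weights would enter the update.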
- Deep supervised domain adaptation for pneumonia diagnosis from chest X-ray images. Yangqin Feng, Xinxing Xu, Yan Wang, Xiaofeng Lei, Soo Kng Teo, Jordan Sim Zheng Ting, Yonghan Ting, Liangli Zhen, Joey Tianyi Zhou, Yong Liu, and others. IEEE Journal of Biomedical and Health Informatics, 2022
Pneumonia is one of the most common treatable causes of death, and early diagnosis allows for early intervention. Automated diagnosis of pneumonia can therefore improve outcomes. However, it is challenging to develop high-performance deep learning models due to the lack of well-annotated data for training. This paper proposes a novel method, called Deep Supervised Domain Adaptation (DSDA), to automatically diagnose pneumonia from chest X-ray images. Specifically, we propose to transfer the knowledge from a publicly available large-scale source dataset (ChestX-ray14) to a well-annotated but small-scale target dataset (the TTSH dataset). DSDA aligns the distributions of the source domain and the target domain according to the underlying semantics of the training samples. It includes two task-specific sub-networks for the source domain and the target domain, respectively. These two sub-networks share the feature extraction layers and are trained in an end-to-end manner. Unlike most existing domain adaptation approaches that perform the same tasks in the source domain and the target domain, we attempt to transfer the knowledge from a multi-label classification task in the source domain to a binary classification task in the target domain. To evaluate the effectiveness of our method, we compare it with several existing peer methods. The experimental results show that our method can achieve promising performance for automated pneumonia diagnosis.
- Deep semi-supervised multi-view learning with increasing views. Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, Jie Lin, Huaibai Yan, and Dezhong Peng. IEEE Transactions on Cybernetics, 2022
In this article, we study two challenging problems in semisupervised cross-view learning. On the one hand, most existing methods assume that the samples in all views have a pairwise relationship, that is, it is necessary to capture or establish the correspondence of different views at the sample level. Such an assumption is easily violated even in the semisupervised setting wherein only a few samples have labels that could be used to establish the correspondence. On the other hand, almost all existing multiview methods, including semisupervised ones, usually train a model using a fixed dataset, which cannot handle the data of increasing views. In practice, the view number will increase when new sensors are deployed. To address the above two challenges, we propose a novel method that employs multiple independent semisupervised view-specific networks (ISVNs) to learn representation for multiple views in a view-decoupling fashion. The advantages of our method are two-fold. Thanks to our specifically designed autoencoder and pseudolabel learning paradigm, our method shows an effective way to utilize both the labeled and unlabeled data while relaxing the data assumption of the pairwise relationship, that is, correspondence. Furthermore, with our view decoupling strategy, the proposed ISVNs could be separately trained, thus efficiently handling the data of increasing views without retraining the entire model. To the best of our knowledge, our ISVN could be one of the first attempts to make handling increasing views in the semisupervised setting possible, as well as an effective solution to the noncorresponding problem. To verify the effectiveness and efficiency of our method, we conduct comprehensive experiments comparing it with 13 state-of-the-art approaches on four multiview datasets in terms of retrieval and classification.
- Natural language video localization: A revisit in span-based question answering framework. Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address the NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; then, it predicts which clip segment contains the target moment and suppresses the importance of other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem.
2021
- Evolutionary multi-objective model compression for deep neural networks. Zhehui Wang, Tao Luo, Miqing Li, Joey Tianyi Zhou, Rick Siow Mong Goh, and Liangli Zhen*. IEEE Computational Intelligence Magazine, 2021
While deep neural networks (DNNs) deliver state-of-the-art accuracy on various applications from face recognition to language translation, it comes at the cost of high computational and space complexity, hindering their deployment on edge devices. To enable efficient processing of DNNs in inference, a novel approach, called Evolutionary Multi-Objective Model Compression (EMOMC), is proposed to optimize energy efficiency (or model size) and accuracy simultaneously. Specifically, the network pruning and quantization space are explored and exploited by using architecture population evolution. Furthermore, by taking advantage of the orthogonality between pruning and quantization, a two-stage pruning and quantization co-optimization strategy is developed, which considerably reduces time cost of the architecture search. Lastly, different dataflow designs and parameter coding schemes are considered in the optimization process since they have a significant impact on energy consumption and the model size. Owing to the cooperation of the evolution between different architectures in the population, a set of compact DNNs that offer trade-offs on different objectives (e.g., accuracy, energy efficiency and model size) can be obtained in a single run. Unlike most existing approaches designed to reduce the size of weight parameters with no significant loss of accuracy, the proposed method aims to achieve a trade-off between desirable objectives, for meeting different requirements of various edge devices. Experimental results demonstrate that the proposed approach can obtain a diverse population of compact DNNs that are suitable for a broad range of different memory usage and energy consumption requirements. Under negligible accuracy loss, EMOMC improves the energy efficiency and model compression rate of VGG-16 on CIFAR-10 by a factor of more than 8.9× and 2.4×, respectively.
- Learning cross-modal retrieval with noisy labels. Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, and Jie Lin. CVPR, 2021
Recently, cross-modal retrieval has been emerging with the help of deep multimodal learning. However, even for unimodal data, collecting large-scale well-annotated data is expensive and time-consuming, not to mention the additional challenges from multiple modalities. Although crowd-sourced annotation, e.g., Amazon’s Mechanical Turk, can be utilized to mitigate the labeling cost, it leads to unavoidable label noise because the annotators are non-experts. To tackle the challenge, this paper presents a general Multi-modal Robust Learning framework (MRL) for learning with multimodal noisy labels to mitigate noisy samples and correlate distinct modalities simultaneously. To be specific, we propose a Robust Clustering loss (RC) to make the deep networks focus on clean samples instead of noisy ones. Besides, a simple yet effective multimodal loss function, called Multimodal Contrastive loss (MC), is proposed to maximize the mutual information between different modalities, thus alleviating the interference of noisy samples and cross-modal discrepancy. Extensive experiments are conducted on four widely-used multimodal datasets to demonstrate the effectiveness of the proposed approach by comparing it with 14 state-of-the-art methods.
- Video corpus moment retrieval with contrastive learning. Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. SIGIR, 2021
Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. As video and text are from two distinct feature spaces, there are two general approaches to address VCMR: (i) to encode each modality's representation separately, then align the two modality representations for query processing, and (ii) to adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first approach is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives to refine the video encoder and the text encoder so that they learn video and text representations separately but with better alignment for VCMR. The video contrastive learning (VideoCL) objective maximizes mutual information between the query and the candidate video at the video level. The frame contrastive learning (FrameCL) objective aims to highlight the moment region that corresponds to the query at the frame level, within a video. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
- Cross-modal discriminant adversarial network. Peng Hu, Xi Peng, Hongyuan Zhu, Jie Lin, Liangli Zhen, Wei Wang, and Dezhong Peng. Pattern Recognition, 2021
Cross-modal retrieval aims at retrieving relevant points across different modalities, such as retrieving images via texts. One key challenge of cross-modal retrieval is narrowing the heterogeneous gap across diverse modalities. To overcome this challenge, we propose a novel method termed Cross-modal discriminant Adversarial Network (CAN). Taking bi-modal data as a showcase, CAN consists of two parallel modality-specific generators, two modality-specific discriminators, and a Cross-modal Discriminant Mechanism (CDM). To be specific, the generators project diverse modalities into a latent cross-modal discriminant space. Meanwhile, the discriminators compete against the generators to alleviate the heterogeneous discrepancy in this space, i.e., the generators try to generate unified features to confuse the discriminators, and the discriminators aim to classify the generated results. To further remove the redundancy and preserve the discrimination, we propose CDM to project the generated results into a single common space, accompanied by a novel eigenvalue-based loss. Thanks to the eigenvalue-based loss, CDM could push as much discriminative power as possible into all latent directions. To demonstrate the effectiveness of our CAN, comprehensive experiments are conducted on four multimedia datasets in comparison with 15 state-of-the-art approaches.
- Automated building extraction using satellite remote sensing imagery. Qintao Hu, Liangli Zhen, Yao Mao, Xi Zhou, and Guozhong Zhou. Automation in Construction, 2021
Automatic extraction of buildings from remote sensing images plays a critical role in urban planning and digital city construction applications. In real-world applications, however, real scenes can be highly complex (e.g., various building structures and shapes, presence of obstacles, and low contrast between buildings and surrounding regions), making automatic building extraction extremely challenging. To conquer this challenge, we propose a novel method called Deep Automatic Building Extraction Network (DABE-Net). It adopts squeeze-and-excitation (SE) operations and the residual recurrent convolutional neural network (RRCNN) to construct building-blocks. Furthermore, an attention mechanism is introduced into the network to improve segmentation accuracy. Specifically, to handle small buildings, we highlight small buildings and develop a multi-scale segmentation loss function. The theoretical analysis and experimental results show that the proposed method is effective in building extraction and outperforms several peer methods on the dataset of Mapping challenge competition.
- DRSL: Deep relational similarity learning for cross-modal retrieval. Xu Wang, Peng Hu, Liangli Zhen, and Dezhong Peng. Information Sciences, 2021
Cross-modal retrieval aims to retrieve relevant samples across different media modalities. Existing cross-modal retrieval approaches are contingent on learning common representations of all modalities by assuming that an equal amount of information exists in different modalities. However, since the quantity of information among cross-modal samples is unbalanced and unequal, it is inappropriate to directly match the obtained modality-specific representations across different modalities in a common space. In this paper, we propose a new method called Deep Relational Similarity Learning (DRSL) for cross-modal retrieval. Unlike existing approaches, the proposed DRSL aims to effectively bridge the heterogeneity gap of different modalities by directly learning the natural pairwise similarities instead of explicitly learning a common space. DRSL is a deep hybrid framework that integrates the relation networks module for relation learning, capturing the implicit nonlinear distance metric. To the best of our knowledge, DRSL is the first approach that incorporates relation networks into the cross-modal learning scenario. Comprehensive experimental results show that the proposed DRSL model achieves state-of-the-art results in cross-modal retrieval tasks on four widely-used benchmark datasets, i.e., Wikipedia, Pascal Sentences, NUS-WIDE-10K, and XMediaNet.
- Joint versus independent multi-view hashing for cross-view retrieval. Peng Hu, Xi Peng, Hongyuan Zhu, Jie Lin, Liangli Zhen, and Dezhong Peng. IEEE Transactions on Cybernetics, 2021
Thanks to the low storage cost and high query speed, cross-view hashing (CVH) has been successfully used for similarity search in multimedia retrieval. However, most existing CVH methods use all views to learn a common Hamming space, thus making it difficult to handle the data with increasing views or a large number of views. To overcome these difficulties, we propose a decoupled CVH network (DCHN) approach which consists of a semantic hashing autoencoder module (SHAM) and multiple multiview hashing networks (MHNs). To be specific, SHAM adopts a hashing encoder and decoder to learn a discriminative Hamming space using either a few labels or the number of classes, that is, the so-called flexible inputs. After that, MHN independently projects all samples into the discriminative Hamming space that is treated as an alternative ground truth. In brief, the Hamming space is learned from the semantic space induced from the flexible inputs, which is further used to guide view-specific hashing in an independent fashion. Thanks to such an independent/decoupled paradigm, our method could enjoy high computational efficiency and the capacity of handling the increasing number of views by only using a few labels or the number of classes. For a newly coming view, we only need to add a view-specific network into our model and avoid retraining the entire model using the new and previous views. Extensive experiments are carried out on five widely used multiview databases compared with 15 state-of-the-art approaches. The results show that the proposed independent hashing paradigm is superior to the common joint ones while enjoying high efficiency and the capacity of handling newly coming views.
- Partially encrypted multi-party computation for federated learning. Ekanut Sotthiwat, Liangli Zhen*, Zengxiang Li, and Chi Zhang*. CCGrid, 2021
Multi-party computation (MPC) allows distributed machine learning to be performed in a privacy-preserving manner so that end-hosts are unaware of the true models on the clients. However, the standard MPC algorithm also incurs additional communication and computation costs, due to its expensive cryptographic operations and protocols. In this paper, instead of applying heavy MPC over the entire local models for secure model aggregation, we propose to encrypt only the critical part of the model parameters (gradients) to reduce communication cost, while maintaining MPC’s privacy-preserving advantages without sacrificing the accuracy of the learnt joint model. Theoretical analysis and experimental results are provided to verify that our proposed method could prevent deep leakage from gradients attacks from reconstructing the original data of individual participants. Experiments using deep learning models over the MNIST and CIFAR-10 datasets empirically demonstrate that our proposed partially encrypted MPC method can reduce the communication and computation cost significantly when compared with conventional MPC, and it achieves as high accuracy as traditional distributed learning which aggregates local models using plain text.
- Parallel attention network with sequence matching for video grounding. Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. Findings of ACL, 2021
Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query. In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task: multi-modal representation learning, and target moment boundary prediction. We design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text. Inspired by sequence labeling tasks in natural language processing, we split the ground truth moment into begin, inside, and end regions. We then propose a sequence matching strategy to guide start/end boundary predictions using region labels. Experimental results on three datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore, the effectiveness of the self-guided parallel attention module and the sequence matching module is verified.
- Automated deepfake detection. Ping Liu, Yuewei Lin, Yang He, Yunchao Wei, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh, and Jingen Liu. arXiv preprint arXiv:2106.10705, 2021
In this paper, we propose to utilize Automated Machine Learning to adaptively search a neural architecture for deepfake detection. This is the first time automated machine learning has been employed for deepfake detection. Based on our explored search space, our proposed method achieves competitive prediction accuracy compared to previous methods. To improve the generalizability of our method, especially when training data and testing data are manipulated by different methods, we propose a simple yet effective strategy in our network learning process: making it estimate potential manipulation regions in addition to predicting the real/fake labels. Unlike previous works that manually design neural networks, our method can relieve us from the high labor cost in network construction. Moreover, compared to previous works, our method depends much less on prior knowledge, e.g., which manipulation method is utilized or where exactly the fake image is manipulated. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method for deepfake detection.
- Distributed monitoring for energy infrastructures: A two-tier analysis over wireless networks. Chi Zhang, Liangli Zhen, Joey Tianyi Zhou, and Cen Chen. IEEE Wireless Communications, 2021
Wireless networks (e.g., 5G networks) enable distributed energy infrastructures to be connected even when they are geometrically isolated. Intelligent monitoring from remote sites therefore becomes possible, allowing decision makers to examine the status of distributed energy infrastructures from a central location. The major challenge is when local devices cannot perform the monitoring independently; transmitting every signal back to the central server triggers enormous amounts of wireless communication. To address this, we propose a two-tier AI system by offloading computations to multiple devices. Specifically, we build lightweight AI models for deployment on edge clients (i.e., edge sensors) and a large-scale AI model for the central server. These two types of AI models are trained with different criteria: the models on the edges act as the filtering tools to detect abnormal events and maximally avoid making false negative predictions, whereas the server model is supposed to be an expert for accurate predictions. By validating on a power theft dataset, we show that such a cascading methodology could filter out sufficient negative examples on the edge side while still being able to provide precise predictions on the second-round analysis.
2020
- Kernel truncated regression representation for robust subspace clustering. Liangli Zhen, Dezhong Peng, Wei Wang, and Xin Yao. Information Sciences, 2020
Subspace clustering aims to group data points into multiple clusters of which each corresponds to one subspace. Most existing subspace clustering approaches assume that input data lie on linear subspaces. In practice, however, this assumption usually does not hold. To achieve nonlinear subspace clustering, we propose a novel method, called kernel truncated regression representation. Our method consists of the following four steps: 1) projecting the input data into a hidden space, where each data point can be linearly represented by other data points; 2) calculating the linear representation coefficients of the data representations in the hidden space; 3) truncating the trivial coefficients to achieve robustness and block-diagonality; and 4) executing the graph cutting operation on the coefficient matrix by solving a graph Laplacian problem. Our method has the advantages of a closed-form solution and the capacity of clustering data points that lie on nonlinear subspaces. The first advantage makes our method efficient in handling large-scale datasets, and the second one enables the proposed method to conquer the nonlinear subspace clustering challenge. Extensive experiments on six benchmarks demonstrate the effectiveness and the efficiency of the proposed method in comparison with current state-of-the-art approaches.
- Objective reduction for visualising many-objective solution sets. Liangli Zhen, Miqing Li, Dezhong Peng, and Xin Yao. Information Sciences, 2020
Visualising a solution set is of high importance in many-objective optimisation. It can help algorithm designers understand the performance of search algorithms and decision makers select their preferred solution(s). In this paper, an objective reduction-based visualisation method (ORV) is proposed to view many-objective solution sets. ORV attempts to map a solution set from a high-dimensional objective space into a low-dimensional space while preserving the distribution and the Pareto dominance relation between solutions in the set. Specifically, ORV sequentially decomposes objective vectors which can be linearly represented by their positively correlated objective vectors until the expected number of preserved objective vectors is reached. ORV formulates the objective reduction as a solvable convex problem. Extensive experiments on both synthetic and real-world problems have verified the effectiveness of the proposed method.
- Underdetermined mixing matrix estimation by exploiting sparsity of sources. Liangli Zhen, Dezhong Peng, Haixian Zhang, Yongsheng Sang, and Lijun Zhang. Measurement, 2020
To estimate the mixing matrix in underdetermined mixing systems, we propose a novel method by exploiting the sparsity of sources. We utilize the pairwise relationships among all of the mixture representations to detect the single source points in the time-frequency (TF) domain, i.e., the positions where only one source contributed dominantly. The mixture representations at these single source points are then clustered to estimate the underlying mixing matrix. Since the pairwise relationships among all mixtures are considered in the TF domain, the proposed method can achieve an accurate mixing matrix estimation and be robust in noisy cases. Experimental results indicate that our method is effective in mixing matrix estimation and outperforms five peer methods.
- An adaptive stochastic parallel gradient descent approach for efficient fiber coupling. Qintao Hu, Liangli Zhen, Yao Mao, Shiwei Zhu, Xi Zhou, and Guozhong Zhou. Optics Express, 2020
In high-speed free-space optical communication systems, the received laser beam must be coupled into a single-mode fiber at the input of the receiver module. However, propagation through atmospheric turbulence degrades the spatial coherence of a laser beam and poses challenges for fiber coupling. In this paper, we propose a novel method, called adaptive stochastic parallel gradient descent (ASPGD), to achieve efficient fiber coupling. To be specific, we formulate the fiber coupling problem as a model-free optimization problem and solve it using ASPGD in parallel. To avoid converging to the extremum points and accelerate its convergence speed, we integrate the momentum and the adaptive gain coefficient estimation into the original stochastic parallel gradient descent (SPGD) method. Simulation and experimental results demonstrate that, compared with the original SPGD method, the proposed method reduces the number of iterations by 50% while maintaining stability.
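For readers unfamiliar with the baseline, the following is a minimal sketch of the original two-sided SPGD update on a toy problem. It shows only plain SPGD; the momentum term and adaptive gain estimation that distinguish ASPGD are not reproduced here. The quadratic `coupling_metric`, the `TARGET` vector, and the values of `gain` and `sigma` are hypothetical stand-ins for a real, model-free coupling-efficiency measurement.

```python
import random

TARGET = [1.0, -2.0, 0.5, 3.0]  # hypothetical optimal control settings


def coupling_metric(u):
    """Toy stand-in for measured coupling efficiency (higher is better)."""
    return -sum((ui - ti) ** 2 for ui, ti in zip(u, TARGET))


def spgd_step(u, metric, gain=5.0, sigma=0.1, rng=random):
    """One two-sided SPGD update that uses only metric evaluations."""
    # Perturb all control channels in parallel with random signs.
    delta = [sigma if rng.random() < 0.5 else -sigma for _ in u]
    j_plus = metric([ui + di for ui, di in zip(u, delta)])
    j_minus = metric([ui - di for ui, di in zip(u, delta)])
    # The metric difference acts as a stochastic gradient estimate.
    dj = j_plus - j_minus
    return [ui + gain * dj * di for ui, di in zip(u, delta)]


rng = random.Random(0)
u = [0.0, 0.0, 0.0, 0.0]
initial_metric = coupling_metric(u)
for _ in range(200):
    u = spgd_step(u, coupling_metric, rng=rng)
final_metric = coupling_metric(u)
print(initial_metric, final_metric)
```

No gradient of the metric is ever computed: each step needs only two measurements, which is what makes the method suitable for model-free settings such as fiber coupling.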
2019
- Deep supervised cross-modal retrieval. Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. CVPR, 2019
Cross-modal retrieval aims to enable flexible retrieval across different modalities. The core of cross-modal retrieval is how to measure the content similarity between different types of data. In this paper, we present a novel cross-modal retrieval method, called Deep Supervised Cross-modal Retrieval (DSCMR). It aims to find a common representation space, in which the samples from different modalities can be compared directly. Specifically, DSCMR minimises the discrimination loss in both the label space and the common representation space to supervise the model learning discriminative features. Furthermore, it simultaneously minimises the modality invariance loss and uses a weight sharing strategy to eliminate the cross-modal discrepancy of multimedia data in the common representation space to learn modality-invariant features. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective in cross-modal learning and significantly outperforms the state-of-the-art cross-modal retrieval methods.
- Scalable deep multimodal learning for cross-modal retrieval. Peng Hu, Liangli Zhen, Dezhong Peng, and Pei Liu. SIGIR, 2019
Cross-modal retrieval takes one type of data as the query to retrieve relevant data of another type. Most existing cross-modal retrieval approaches were proposed to learn a common subspace in a joint manner, where the data from all modalities have to be involved during the whole training process. For these approaches, the optimal parameters of different modality-specific transformations are dependent on each other and the whole model has to be retrained when handling samples from new modalities. In this paper, we present a novel cross-modal retrieval method, called Scalable Deep Multimodal Learning (SDML). It proposes to predefine a common subspace, in which the between-class variation is maximized while the within-class variation is minimized. Then, it trains m modality-specific networks for m modalities (one network for each modality) to transform the multimodal data into the predefined common subspace to achieve multimodal learning. Unlike many of the existing methods, our method can train different modality-specific networks independently and thus be scalable to the number of modalities. To the best of our knowledge, the proposed SDML could be one of the first works to independently project data of an unfixed number of modalities into a predefined common subspace. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective and efficient in multimodal learning and outperforms the state-of-the-art methods in cross-modal retrieval.
- Separated variational hashing networks for cross-modal retrieval. Peng Hu, Xu Wang, Liangli Zhen, and Dezhong Peng. ACM-MM, 2019
Cross-modal hashing, due to its low storage cost and high query speed, has been successfully used for similarity search in multimedia retrieval applications. It projects high-dimensional data into a shared isomorphic Hamming space with similar binary codes for semantically-similar data. In some applications, not all modalities can be obtained or trained simultaneously, for reasons such as privacy, secrecy, storage limitation, and computational resource limitation. However, most existing cross-modal hashing methods need all modalities to jointly learn the common Hamming space, which hinders them from handling these problems. In this paper, we propose a novel approach called Separated Variational Hashing Networks (SVHNs) to overcome the above challenge. Firstly, it adopts a label network (LabNet) to exploit available and nonspecific label annotations to learn a latent common Hamming space by projecting each semantic label into a common binary representation. Then, each modality-specific network can separately map the samples of the corresponding modality into their binary semantic codes learned by LabNet. We achieve this by conducting variational inference to match the aggregated posterior of the hashing code of LabNet with an arbitrary prior distribution. The effectiveness and efficiency of our SVHNs are verified by extensive experiments carried out on four widely-used multimedia databases, in comparison with 11 state-of-the-art approaches.
2018
- Local feature based multi-view discriminant analysis. Peng Hu, Dezhong Peng, Jixiang Guo, and Liangli Zhen. Knowledge-Based Systems, 2018
In many real-world applications, an object can be represented from multiple views or styles. Thus, it is important to design algorithms that are able to recognize objects from distinct views. To this end, a large number of approaches have been proposed to achieve heterogeneous recognition tasks through the use of local features. However, most of them only focus on binary views and thus cannot be applied to multi-view analysis. In this paper, we propose a novel local feature based multi-view discriminant analysis approach (FMDA). The proposed approach consists of three steps: First, the input images are represented using representation matrices and local feature descriptor (LFD) matrices of their overlapping patches, where the representation matrices are the linear coefficients of the LFDs for different views. This brings two advantages, i.e., addressing the small sample size (SSS) problem and preserving the discriminative information while reducing the redundant information in the LFD matrices. Second, the multi-view discriminant representation and feature projections are learned by projecting the LFDs of different views into a common space using the Fisher criterion. Finally, a simple but effective view-similarity constraint is proposed to adaptively learn the relationships between different views. To verify the effectiveness of the proposed method, extensive experiments are carried out on the FERET, CAS-PEAL-R1, CUFSF and HFB databases in comparison with some state-of-the-art methods.
- Multiobjective test problems with degenerate Pareto fronts. Liangli Zhen, Miqing Li, Ran Cheng, Dezhong Peng, and Xin Yao. arXiv preprint arXiv:1806.02706, 2018
In multiobjective optimisation, a set of scalable test problems with a variety of features allows researchers to investigate and evaluate the abilities of different optimisation algorithms, and thus can help them to design and develop more effective and efficient approaches. Existing test problem suites mainly focus on situations where all the objectives are fully conflicting with each other. In such cases, an m-objective optimisation problem has an (m-1)-dimensional Pareto front in the objective space. However, in some optimisation problems, there may be unexpected characteristics among objectives, e.g., redundancy. The redundancy of some objectives can lead to the multiobjective problem having a degenerate Pareto front, i.e., the dimension of the Pareto front of the m-objective problem is less than (m-1). In this paper, we systematically study degenerate multiobjective problems. We abstract three general characteristics of degenerate problems, which have not been formulated and systematically investigated in the literature. Based on these characteristics, we present a set of test problems to support the investigation of multiobjective optimisation algorithms under situations with redundant objectives. To the best of our knowledge, this work is the first one that explicitly formulates these three characteristics of degenerate problems, thus allowing the resulting test problems to be featured by their generality, in contrast to existing test problems designed for specific purposes (e.g., visualisation).
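As an illustration of a degenerate Pareto front, the sketch below builds a toy 3-objective minimisation problem in which the third objective is redundant. The problem itself (`objectives`, the grid resolution, and the helper names) is a hypothetical example chosen for clarity and is not one of the test problems from the paper. With m = 3, a non-degenerate front would be 2-dimensional; here the non-dominated points lie on a 1-dimensional curve.

```python
from itertools import product


def objectives(x1, x2):
    """Toy problem (assumption, not from the paper): f3 is redundant, f3 = 2 * f1."""
    f1 = x1 + x2
    f2 = 1.0 - x1 + x2
    f3 = 2.0 * f1
    return (f1, f2, f3)


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimisation)."""
    no_worse = all(u <= v for u, v in zip(a, b))
    strictly_better = any(u < v for u, v in zip(a, b))
    return no_worse and strictly_better


def nondominated(points):
    """Keep the points that no other point dominates (brute force, O(n^2))."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Sample the decision space [0, 1] x [0, 1] on an 11 x 11 grid.
grid = [i / 10 for i in range(11)]
samples = [objectives(x1, x2) for x1, x2 in product(grid, grid)]
front = nondominated(samples)

# Every non-dominated point satisfies f1 + f2 = 1 and f3 = 2 * f1,
# so the front is a curve parameterised by a single value (x1).
for f1, f2, f3 in sorted(front):
    print(f"f1={f1:.1f}  f2={f2:.1f}  f3={f3:.1f}")
```

Only the samples with x2 = 0 survive the non-dominated filter, so the front is fully described by one free parameter even though the problem has three objectives.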
2017
- Underdetermined blind source separation using sparse coding. Liangli Zhen, Dezhong Peng, Zhang Yi, Yong Xiang, and Peng Chen. IEEE Transactions on Neural Networks and Learning Systems, 2017
In an underdetermined mixture system with n unknown sources, it is a challenging task to separate these sources from their m observed mixture signals, where m < n. By exploiting the technique of sparse coding, we propose an effective approach to discover some 1-D subspaces from the set consisting of all the time-frequency (TF) representation vectors of observed mixture signals. We show that these 1-D subspaces are associated with TF points where only a single source possesses dominant energy. By grouping the vectors in these subspaces via a hierarchical clustering algorithm, we obtain an estimate of the mixing matrix. Finally, the source signals can be recovered by solving a series of least squares problems. Since the sparse coding strategy considers the linear representation relations among all the TF representation vectors of the mixture signals, the proposed algorithm can provide an accurate estimate of the mixing matrix and is robust to noise compared with the existing underdetermined blind source separation approaches. Theoretical analysis and experimental results demonstrate the effectiveness of the proposed method.
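The final recovery step can be illustrated with a deliberately simplified sketch. The abstract does not spell out the least squares problems that are solved, so the code below assumes the simplest case: the mixing matrix is already known, and at most one source is active at each TF point. The matrix `A_COLUMNS` and the function `recover_single_source` are hypothetical names and values introduced only for this example, with m = 2 mixtures and n = 3 sources.

```python
import math

# Hypothetical 2 x 3 mixing matrix, stored column by column (assumption).
A_COLUMNS = [
    (1.0, 0.0),
    (0.0, 1.0),
    (math.sqrt(0.5), math.sqrt(0.5)),
]


def recover_single_source(x, columns=A_COLUMNS):
    """Estimate the sources at one TF point, assuming one active source.

    For each column a_j, the 1-D least squares problem
    min_s ||x - a_j * s||^2 has the closed-form solution
    s = <a_j, x> / <a_j, a_j>. The column with the smallest residual
    is taken as the active source; all other sources are set to zero.
    """
    best_j, best_s, best_residual = 0, 0.0, float("inf")
    for j, a in enumerate(columns):
        s = sum(ai * xi for ai, xi in zip(a, x)) / sum(ai * ai for ai in a)
        residual = sum((xi - ai * s) ** 2 for ai, xi in zip(a, x))
        if residual < best_residual:
            best_j, best_s, best_residual = j, s, residual
    estimate = [0.0] * len(columns)
    estimate[best_j] = best_s
    return estimate


# A mixture vector produced by the third source alone, with amplitude 3.
x = (3.0 * math.sqrt(0.5), 3.0 * math.sqrt(0.5))
print(recover_single_source(x))
```

Handling TF points where several sources are active at once would require solving a least squares problem over a subset of columns rather than a single column.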
- How to read many-objective solution sets in parallel coordinates. Miqing Li, Liangli Zhen, and Xin Yao. IEEE Computational Intelligence Magazine, 2017
Rapid development of evolutionary algorithms in handling many-objective optimization problems requires viable methods of visualizing a high-dimensional solution set. The parallel coordinates plot, which scales well to high-dimensional data, is such a method, and has been frequently used in evolutionary many-objective optimization. However, the parallel coordinates plot is not as straightforward as the classic scatter plot to present the information contained in a solution set. In this paper, we make some observations of the parallel coordinates plot, in terms of comparing the quality of solution sets, understanding the shape and distribution of a solution set, and reflecting the relation between objectives. We hope that these observations could provide some guidelines as to the proper use of the parallel coordinates plot in evolutionary many-objective optimization.
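For readers unfamiliar with the plot, the sketch below shows how a solution set is typically mapped to parallel coordinates: each objective becomes a vertical axis, and each solution becomes a polyline whose vertices are its objective values. Min-max normalisation per objective is a common convention assumed here, not something stated in the abstract, and `to_parallel_coordinates` and the sample data are hypothetical.

```python
def to_parallel_coordinates(solutions):
    """Return one polyline per solution.

    Each polyline is a list of (axis_index, normalised_value) vertices,
    where every objective is min-max normalised to [0, 1] over the set.
    An objective with a single distinct value is drawn at mid-height.
    """
    n_obj = len(solutions[0])
    lows = [min(s[k] for s in solutions) for k in range(n_obj)]
    highs = [max(s[k] for s in solutions) for k in range(n_obj)]
    polylines = []
    for s in solutions:
        line = []
        for k in range(n_obj):
            span = highs[k] - lows[k]
            value = (s[k] - lows[k]) / span if span > 0 else 0.5
            line.append((k, value))
        polylines.append(line)
    return polylines


# Three solutions of a hypothetical 4-objective problem.
solutions = [
    (0.0, 10.0, 3.0, 1.0),
    (5.0, 5.0, 3.0, 2.0),
    (10.0, 0.0, 3.0, 4.0),
]
lines = to_parallel_coordinates(solutions)
for line in lines:
    print(line)
```

The resulting vertex lists can be passed to any plotting library as line segments between adjacent axes.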
- Adjusting parallel coordinates for investigating multi-objective search. Liangli Zhen, Miqing Li, Ran Cheng, Dezhong Peng, and Xin Yao. SEAL, 2017
Visualizing a high-dimensional solution set over the evolution process is a viable way to investigate the search behavior of evolutionary multi-objective optimization. The parallel coordinates plot, which scales well to the data dimensionality, is frequently used to observe solution sets in multi-objective optimization. However, the solution sets in parallel coordinates are typically presented in the natural order of the optimized objectives, with little information about the relation between these objectives or the Pareto dominance relation between solutions. In this paper, we attempt to adjust parallel coordinates to incorporate this information. Systematic experiments have shown the effectiveness of the proposed method.
- Underdetermined blind separation by combining sparsity and independence of sources. Peng Chen, Dezhong Peng, Liangli Zhen, Yifan Luo, and Yong Xiang. IEEE Access, 2017
In this paper, we address underdetermined blind separation of N sources from their M instantaneous mixtures, where N > M, by combining the sparsity and independence of sources. First, we propose an effective scheme to search some sample segments with the local sparsity, which means that in these sample segments, only Q (Q < M) sources are active. By grouping these sample segments into different sets such that each set has the same Q active sources, the original underdetermined BSS problem can be transformed into a series of locally overdetermined BSS problems. Thus, the blind channel identification task can be achieved by solving these overdetermined problems in each set by exploiting the independence of sources. In the second stage, we will achieve source recovery by exploiting a mild sparsity constraint, which is proven to be a sufficient and necessary condition to guarantee recovery of source signals. Compared with some sparsity-based UBSS approaches, this paper relaxes the sparsity restriction about sources to some extent by assuming that different source signals are mutually independent. At the same time, the proposed UBSS approach does not impose any richness constraint on sources. Theoretical analysis and simulation results illustrate the effectiveness of our approach.
2014
- Locally linear representation for image clustering. Liangli Zhen, Zhang Yi, Xi Peng, and Dezhong Peng. Electronics Letters, 2014
The construction of the similarity graph plays an essential role in a spectral clustering (SC) algorithm. There exist two popular schemes to construct a similarity graph, i.e. the pairwise distance-based scheme (PDS) and the linear representation-based scheme (LRS). Notably, each of these schemes suffers from its own limitations. Specifically, the PDS is sensitive to noise and outliers, while the LRS may incorrectly select inter-subspace points to represent the objective point. These drawbacks greatly degrade the performance of SC algorithms. To overcome these problems, a novel scheme to construct the similarity graph is proposed, where the similarity computation among different data points depends on both their pairwise distances and the linear representation relationships. This proposed scheme, called locally linear representation (LLR), encodes each data point using a collection of data points that not only produce the minimal reconstruction error but also are close to the objective point, which makes it robust to noise and outliers, and largely avoids selecting inter-subspace points to represent the objective point.
2013
- Local neighborhood embedding for unsupervised nonlinear dimension reduction. Liangli Zhen, Xi Peng, and Dezhong Peng. Journal of Software, 2013
The construction of similarity relationships among data points plays a critical role in manifold learning. There exist two popular schemes, i.e., pairwise-distance based similarity and reconstruction coefficient based similarity. Existing works have involved only one of these schemes, and the two schemes have different drawbacks. Pairwise-distance based similarity graph algorithms are sensitive to noise and outliers. Reconstruction coefficient based similarity graph algorithms need sufficient sampled data and are sensitive to the neighborhood size. This paper proposes a novel algorithm, called Local Neighborhood Embedding (LNE), which preserves both pairwise-distance based similarity and reconstruction coefficient based similarity for finding the latent low-dimensional structure of data. It has the following three advantages: firstly, it is insensitive to the choice of neighborhood size; secondly, it is robust to noise; thirdly, it works well even in the under-sampled case. Furthermore, the proposed objective function has a closed-form solution, which means it has a low computational complexity, and the experimental results illustrate that LNE has a competitive performance in dimensionality reduction.
© The copyright of the papers above is owned by the respective publishers. The electronic versions here are made available under the CC-BY-NC-ND 4.0 license.