Yichi_zhang Danli Deliberative Agent for Following Natural Language Instructions 2022

Title: DANLI: Deliberative Agent for Following Natural Language Instructions. Author: Yichi Zhang. Publish Year: 22 Oct 2022. Review Date: Sun, Nov 20, 2022. Summary of paper. Motivation: reactive agents simply learn and imitate behaviours encountered in the training data, which is insufficient for long-horizon, complex tasks. To address this limitation, the authors propose a neuro-symbolic deliberative agent that, while following language instructions, proactively applies reasoning and planning based on neural and symbolic representations acquired from past experience. Contribution: the deliberative agent achieves a greater than 70% improvement over reactive baselines on the challenging TEACh benchmark. Some key terms: natural language instruction following with embodied AI agents ...

November 20, 2022 · 2 min · 343 words · Sukai Huang

Xiang_li Diffusion-LM Improves Controllable Text Generation 2022

Title: Diffusion-LM Improves Controllable Text Generation. Author: Xiang Lisa Li. Publish Year: May 2022. Review Date: Mon, Nov 14, 2022. https://arxiv.org/pdf/2205.14217.pdf Summary of paper. Motivation: can language tokens be represented as floating-point numbers? The authors develop a new non-autoregressive language model based on continuous diffusion. Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. To convert continuous embeddings back to words, they use rounding, along with many other tricks to stabilise training. Contribution: they apply a diffusion model to language modelling. Incomprehension: not sure whether the model is good at text generation. ...
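The rounding step mentioned above can be pictured as a nearest-neighbour lookup in embedding space. The sketch below is only an illustration under assumptions: the three-word vocabulary, the 2-D embeddings, and the function names are all hypothetical, and it shows plain Euclidean rounding rather than the paper's full set of training tricks.

```python
import math

# Hypothetical 2-D word embeddings; a real model learns a far larger table.
EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "sat": (-1.0, 0.0),
}


def round_to_word(vector, embeddings=EMBEDDINGS):
    """Return the vocabulary word whose embedding is closest (Euclidean) to vector."""
    return min(embeddings, key=lambda word: math.dist(vector, embeddings[word]))


def decode(latents, embeddings=EMBEDDINGS):
    """Round every continuous latent vector in a sequence to a discrete word."""
    return [round_to_word(v, embeddings) for v in latents]


# Denoised vectors rarely land exactly on an embedding, so each one is snapped
# to its nearest neighbour in the vocabulary.
noisy_latents = [(0.9, 0.2), (-0.8, 0.1), (0.1, 0.7)]
print(decode(noisy_latents))  # ['cat', 'sat', 'dog']
```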

November 14, 2022 · 1 min · 104 words · Sukai Huang

Consider incremental publication of results Nov, 2022

You need a password to access this content; go to Slack *#phdsukai to find out more. ...

November 13, 2022 · 7 min · Sukai Huang
Relatedness and naturalness

Jie_huang Can Language Models Be Specific How 2022

Title: Can Language Models Be Specific? How? Author: Jie Huang et al. Publish Year: 11 Oct 2022. Review Date: Tue, Nov 8, 2022. Summary of paper. Motivation: the authors propose to measure how specific the language of pre-trained language models (PLMs) is. To achieve this, they introduce a novel approach to building a benchmark for specificity testing by forming masked token prediction tasks with prompts. For instance, given “J.K. Rowling was born in [MASK]”, the test is whether PLMs prefer to fill in the more specific answer, e.g. Yate instead of England. It is known that if the prediction is more specific, more fine-grained information can be retrieved from language models. Viewer’s opinion: this is not to say that summarisation is easy or carries less useful information; there are cases where abstract information is more useful. Contribution: although there is work on measuring how much knowledge is stored in PLMs and on improving the correctness of their predictions, none has attempted to measure or improve the specificity of predictions made by PLMs. Understanding how specific the language of PLMs is can help us better understand the behaviour of language models and facilitate downstream applications such as question answering. The authors set up a benchmark dataset for specificity; its quality is high, with the judgment on which answer is more specific being ∼97% consistent with humans. Discovery: in general, PLMs prefer less specific answers when no subject is given, and they have only a weak ability to differentiate coarse-grained from fine-grained objects by measuring their (cosine) similarities to subjects. The results indicate that specificity has been neglected by existing research on language models. Improving the specificity of predictions: few-shot prompting ...

November 8, 2022 · 3 min · 429 words · Sukai Huang

Yizhou_zhao Semantic Aligned Fusion Transformer for One Shot Object Detection 2022

Title: Semantic-Aligned Fusion Transformer for One Shot Object Detection. Author: Yizhou Zhao et al. Publish Year: 2022. Review Date: Mon, Oct 24, 2022. https://arxiv.org/pdf/2203.09093v2.pdf Summary of paper. Motivation: under extreme data scarcity, current approaches explore various feature fusions to obtain directly transferable meta-knowledge. In this paper, the authors attribute the limitations of previous work to inappropriate correlation methods that misalign query-support semantics by overlooking spatial structure and scale variance.

October 24, 2022 · 1 min · 67 words · Sukai Huang

Ting_i_hsieh One Shot Object Detection With Co Attention and Co Excitation 2019

Title: One-Shot Object Detection With Co-Attention and Co-Excitation. Author: Ting-I Hsieh et al. Publish Year: Nov 2019. Review Date: Mon, Oct 24, 2022. https://arxiv.org/pdf/1911.12529.pdf Summary of paper. Motivation: this paper tackles the challenging problem of one-shot object detection: given a query image patch whose class label is not included in the training data, the goal is to detect instances of the same class in a target image. To this end, the authors develop a novel co-attention and co-excitation (CoAE) framework that contributes in three key technical aspects. First, it uses the non-local operation to explore the co-attention embodied in each query-target pair and yield region proposals that account for the one-shot situation. Second, it formulates a squeeze-and-co-excitation scheme that adaptively emphasises correlated feature channels to help uncover relevant object proposals and, eventually, the target objects. Third, it introduces a margin-based ranking loss for implicitly learning a metric that predicts the similarity of a region proposal to the underlying query, regardless of whether its class label was seen or unseen during training. ...
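To make the idea of a margin-based ranking loss concrete, here is a minimal sketch of a generic pairwise hinge formulation. It is not the exact loss from the paper: the function name, the margin value of 0.3, and the example scores are assumptions chosen for illustration only.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=0.3):
    """Average hinge loss over every (positive, negative) pair of proposals.

    A pair contributes zero loss once the positive proposal's similarity
    score exceeds the negative proposal's score by at least `margin`.
    """
    losses = [
        max(0.0, margin - (pos - neg))
        for pos in pos_scores
        for neg in neg_scores
    ]
    return sum(losses) / len(losses)


# Hypothetical similarity scores between region proposals and the query patch.
well_separated = margin_ranking_loss([0.9], [0.1])   # gap 0.8 >= margin, no loss
poorly_separated = margin_ranking_loss([0.5], [0.4])  # gap 0.1 < margin, penalised
print(well_separated, poorly_separated)
```

Because the loss only compares scores, it does not need to know the class label of the query, which fits the seen-or-unseen setting described above.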

October 24, 2022 · 1 min · 158 words · Sukai Huang

Ayan_kumar_bhunia a Deep One Shot Network for Query Based Logo Retrieval 2019

Title: A Deep-One Shot Network for Query-Based Logo Retrieval. Author: Ayan Kumar Bhunia et al. Publish Year: Jul 2019. Review Date: Mon, Oct 24, 2022. https://arxiv.org/pdf/1811.01395.pdf Summary of paper. Motivation: existing general-purpose methods cannot handle unseen new logos (logos without labels). In this work, the authors develop an easy-to-implement query-based logo detection and localisation system by employing a one-shot learning technique built from off-the-shelf neural network components. Limitation of current work: deep-learning-based frameworks are largely data-driven, whereas logo datasets have many image classes but few images per class. To be robust to new unseen logos, the model should be designed to satisfy the incremental demand for logo classes, in contrast to existing methods, which are limited to a set of seen logos. Contribution: the authors propose a scalable solution to the logo detection problem. They present a query-based logo search and detection system that employs a simple, fully differentiable one-shot learning framework, which can be used for new logo classes without further training of the whole network. To deal with logos of varying sizes, they propose a novel one-shot framework with multi-scale conditioning, specially designed to learn the similarity between the query image and the target image at multiple scales and resolutions. Architecture ...

October 24, 2022 · 2 min · 258 words · Sukai Huang

Yuetian_weng an Efficient Spatio Temporal Pyramid Transformer for Action Detection 2022

Title: An Efficient Spatio-Temporal Pyramid Transformer for Action Detection. Author: Yuetian Weng et al. Publish Year: Jul 2022. Review Date: Thu, Oct 20, 2022. Summary of paper. Motivation: the task of action detection aims to deduce both the action category and the start and end moments of each action instance in a long, untrimmed video. Designing an efficient architecture for action detection is non-trivial because self-attention over a long sequence of video clips is prohibitively expensive. To this end, the authors present an efficient hierarchical spatio-temporal transformer for action detection, building on the observation that the early self-attention layers in a Transformer still focus on local patterns. Background: to date, the majority of action detection methods are driven by 3D convolutional neural networks (CNNs), e.g. C3D and I3D, which encode video segment features from RGB frames and optical flow. However, the limited receptive field prevents CNN-based models from capturing long-term spatio-temporal dependencies. Alternatively, vision transformers have shown an advantage in capturing global dependencies via the self-attention mechanism. Hierarchical ViTs divide Transformer blocks into several stages and progressively reduce the spatial size of the feature maps as the network goes deeper, but self-attention over a sequence of images is expensive. The authors also found that global attention in the early layers actually encodes only local visual patterns (i.e., a token attends only to nearby tokens in adjacent frames and rarely interacts with tokens in distant frames). Efficient Spatio-temporal Pyramid Transformer ...

October 20, 2022 · 4 min · 649 words · Sukai Huang
MEME agent network architecture

Steven_kapturowski Human Level Atari 200x Faster 2022

Title: Human Level Atari 200x Faster. Author: Steven Kapturowski et al. (DeepMind). Publish Year: September 2022. Review Date: Wed, Oct 5, 2022. Summary of paper. https://arxiv.org/pdf/2209.07550.pdf Motivation: Agent57's performance came at the cost of poor data efficiency, requiring nearly 80,000 million frames of experience. The agent in this paper achieves the same performance in 390 million frames. Contribution. Some key terms: NFNet (Normalisation-Free Network) https://towardsdatascience.com/nfnets-explained-deepminds-new-state-of-the-art-image-classifier-10430c8599ee Batch normalisation, the bad: it is expensive, and it breaks the assumption of data independence. NFNet applies three techniques: modified residual branches and convolutions with Scaled Weight Standardisation; Adaptive Gradient Clipping; and architecture optimisation for improved accuracy and training speed. https://github.com/vballoli/nfnets-pytorch Previous Non-Image features ...
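The "200x" in the title can be sanity-checked from the two frame counts quoted in the summary. The snippet below is just that arithmetic; the variable names are made up for this example, and the first figure is the summary's rounded "nearly 80,000 million", so the result is an upper-end estimate.

```python
# Frame counts as quoted in the summary above (the first one is approximate).
agent57_frames = 80_000 * 1_000_000   # "nearly 80,000 million" frames
new_agent_frames = 390 * 1_000_000    # 390 million frames

speedup = agent57_frames / new_agent_frames
print(f"about {speedup:.0f}x fewer frames")  # about 205x fewer frames
```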

October 5, 2022 · 2 min · 357 words · Sukai Huang
CoBERL architecture

Andrea_banino Coberl Contrastive Bert for Reinforcement Learning 2022

Title: CoBERL: Contrastive BERT for Reinforcement Learning. Author: Andrea Banino et al. (DeepMind). Publish Year: Feb 2022. Review Date: Wed, Oct 5, 2022. Summary of paper. https://arxiv.org/pdf/2107.05431.pdf Motivation. Contribution. Some key terms: representation learning in reinforcement learning. Motivation: if state information could be effectively extracted from raw observations, it might be possible to learn from observations as fast as from states. However, given the often sparse reward signal coming from the environment, learning representations in RL has to be achieved with little to no supervision. Approach types: class 1 adds auxiliary self-supervised losses to accelerate learning in model-free RL algorithms; class 2 learns a world model and uses it to collect imagined rollouts, which then act as extra data to train the RL algorithm, reducing the samples required from the environment. CoBERL is in class 1: it uses both masked language modelling and contrastive learning. RL using BERT architecture: RELIC ...

October 5, 2022 · 2 min · 258 words · Sukai Huang

Alex_petrekno Sample Factory Asynchronous Rl at Very High Fps 2020

Title: Sample Factory: Asynchronous RL at Very High FPS. Author: Alex Petrenko. Publish Year: Oct 2020. Review Date: Sun, Sep 25, 2022. Summary of paper. Motivation: identifying performance bottlenecks. RL involves three workloads: environment simulation, inference, and backpropagation. Overall performance depends on the slowest workload. In existing methods (A2C/PPO/IMPALA) the computational workloads are dependent on one another, which leads to under-utilisation of system resources. Existing high-throughput methods focus on distributed training and therefore introduce a lot of overhead, such as networking and serialisation. ...
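The bottleneck argument above can be illustrated with an idealised throughput model. This is a back-of-the-envelope sketch, not Sample Factory's implementation: the per-batch timings and function names are hypothetical, and the decoupled case assumes every workload has its own resources and overlaps perfectly with the others.

```python
def coupled_fps(frames_per_batch, stage_seconds):
    """Workloads run one after another, so their times add up."""
    return frames_per_batch / sum(stage_seconds.values())


def decoupled_fps(frames_per_batch, stage_seconds):
    """Workloads run concurrently, so the slowest one sets the pace."""
    return frames_per_batch / max(stage_seconds.values())


# Hypothetical seconds spent per batch of 1000 frames in each workload.
stages = {"simulation": 0.04, "inference": 0.01, "backpropagation": 0.02}

print(f"coupled:   {coupled_fps(1000, stages):,.0f} FPS")
print(f"decoupled: {decoupled_fps(1000, stages):,.0f} FPS")
```

Under these assumptions the decoupled pipeline is limited only by simulation, while the coupled one also pays for inference and backpropagation on every batch.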

September 25, 2022 · 1 min · 154 words · Sukai Huang
3D U-Net

Jonathan_ho Video Diffusion Models 2022

Title: Video Diffusion Models. Author: Jonathan Ho et al. (Google). Publish Year: 22 Jun 2022. Review Date: Thu, Sep 22, 2022. Summary of paper. Motivation: the authors propose a diffusion model for video generation that shows very promising initial results. Contribution: this is an extension of the image diffusion model; they introduce a new conditional sampling technique for spatial and temporal video extension that performs better. Some key terms: diffusion model. A diffusion model specified in continuous time is a generative model with latents. Training diffusion model ...

September 22, 2022 · 3 min · 471 words · Sukai Huang

Dongwon Fire Burns Sword Cuts Commonsense Inductive Bias for Exploration in Text Based Games 2022

Title: Fire Burns, Sword Cuts: Commonsense Inductive Bias for Exploration in Text Based Games. Author: Dongwon Kelvin Ryu et al. Publish Year: ACL 2022. Review Date: Thu, Sep 22, 2022. Summary of paper. Motivation: text-based games (TGs) are exciting testbeds for developing deep reinforcement learning techniques because of their partially observed environments and large action spaces. A fundamental challenge in TGs is efficient exploration of the large action space when the agent has not yet acquired enough knowledge about the environment. The authors therefore inject external commonsense knowledge into the agent during training, at the points where the agent is most uncertain about its next action. Contribution: in addition to a performance increase, the produced action trajectories exhibit lower perplexity when tested with a pre-trained LM, indicating that they are closer to human language. Some key terms: exploration efficiency ...

September 22, 2022 · 2 min · 276 words · Sukai Huang

Wenlong_huang Language Models as Zero Shot Planners Extracting Actionable Knowledge for Embodied Agents 2022

Title: Language Models as Zero Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Author: Wenlong Huang et al. Publish Year: Mar 2022. Review Date: Mon, Sep 19, 2022. Summary of paper. Motivation: large language models learn general commonsense world knowledge. In this paper, the authors investigate the possibility of grounding high-level tasks expressed in natural language (e.g., “make breakfast”) to a chosen set of action steps (e.g., “open fridge”). Contribution: they found that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into mid-level plans without any further training. They propose several tools to improve the executability of the model's generations without invasive probing or modification of the model. Some key terms: what is prompt learning ...

September 19, 2022 · 2 min · 253 words · Sukai Huang
add object detection pretrain model

Pengchuan_zhang Vinvl Revisiting Visual Representations in Vision Language Models 2021

Title: VinVL: Revisiting Visual Representations in Vision Language Models. Author: Pengchuan Zhang et al. Publish Year: 10 Mar 2021. Review Date: Sat, Sep 3, 2022. Summary of paper. Motivation: in the experiments, the visual features generated by the new object detection model are fed into Oscar, a Transformer-based VL fusion model, and an improved approach, Oscar+, is used to pretrain the VL model. Contribution: a bigger object detection model, called “ResNeXt-152 C4”, trained on a larger amount of data. Some key terms: vision-language pretraining ...

September 3, 2022 · 2 min · 332 words · Sukai Huang
illustration of Oscar model

Xiujun_li Oscar Object Semantic Aligned Pro Training for Vision Language Tasks 2020

Title: Oscar: Object Semantic Aligned Pre-training for Vision Language Tasks. Author: Xiujun Li et al. Publish Year: 26 Jul 2020. Review Date: Sat, Sep 3, 2022. Summary of paper. Motivation: existing methods simply concatenate image region features (patch features) and text features as input to the model to be pre-trained, and use self-attention to learn image-text semantic alignments in a brute-force manner. The lack of explicit alignment information between the image regions and the text makes alignment modelling a weakly supervised learning task. ...

September 3, 2022 · 3 min · 462 words · Sukai Huang
Illustration of DiffCSE

Yung_sung_chuang Diffcse Difference Based Contrastive Learning for Sentence Embeddings 2022

Title: DiffCSE: Difference Based Contrastive Learning for Sentence Embeddings. Author: Yung-Sung Chuang et al. Publish Year: 21 Apr 2022. Review Date: Sat, Aug 27, 2022. Summary of paper. Motivation: DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence. Contribution: the authors propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. Some key terms: DiffCSE is an unsupervised contrastive learning framework rather than a model architecture. Contrastive learning in single modality data ...

August 27, 2022 · 2 min · 351 words · Sukai Huang
Different architectures for image and text retrieval

Gregor_geigle Retrieve Fast Rerank Smart Cooperative and Joint Approaches for Improved Cross Modal Retrieval 2022

Title: Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval. Author: Gregor Geigle et al. Publish Year: 19 Feb 2022. Review Date: Sat, Aug 27, 2022. Summary of paper. Motivation: the authors want to combine the advantages of the cross-encoder (CE) and the bi-encoder (BE) for more efficient cross-modal search and retrieval: the efficiency and simplicity of the BE approach, which is based on a twin network, and the expressiveness and cutting-edge performance of CE methods. Contribution: they propose a novel joint cross-encoding and bi-encoding model (Joint-Coop), which is trained to simultaneously cross-encode and embed multi-modal input; it achieves the highest scores overall while maintaining retrieval efficiency ...
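The "retrieve fast, rerank smart" idea can be sketched as a two-stage search: a cheap embedding comparison shortlists candidates, and an expensive joint scorer reorders only that shortlist. Everything below is a toy under assumptions: the vectors, the `cross_scores` table standing in for a cross-encoder, and the function names are hypothetical and not taken from the paper's code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query_vec, item_vecs, cross_score, k=2):
    """Shortlist k items by embedding similarity, then rerank the shortlist.

    item_vecs maps item ids to pre-computed embeddings (bi-encoder style).
    cross_score is a callable giving a costly joint score for one item id
    (cross-encoder style); it is only called on the shortlisted items.
    """
    shortlist = sorted(
        item_vecs, key=lambda item: dot(query_vec, item_vecs[item]), reverse=True
    )[:k]
    return sorted(shortlist, key=cross_score, reverse=True)


# Hypothetical pre-computed image embeddings and a text query embedding.
images = {"img_a": (1.0, 0.0), "img_b": (0.8, 0.6), "img_c": (0.0, 1.0)}
query = (1.0, 0.2)

# Stand-in for a cross-encoder: a fixed table of joint relevance scores.
cross_scores = {"img_a": 0.3, "img_b": 0.9, "img_c": 0.99}

print(retrieve_then_rerank(query, images, cross_scores.get))  # ['img_b', 'img_a']
```

Note that `img_c` is never reranked even though its joint score is the highest: the quality of the first stage bounds what the second stage can recover, which is the trade-off between efficiency and expressiveness described above.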

August 27, 2022 · 3 min · 453 words · Sukai Huang
MP-Net structure

Kaitao_song Mpnet Masked and Permuted Retrain for Language Understanding 2020

Title: MPNet: Masked and Permuted Pre-training for Language Understanding. Author: Kaitao Song et al. Publish Year: 2020. Review Date: Thu, Aug 25, 2022. Summary of paper. Motivation: BERT adopts masked language modelling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT consists entirely of attention blocks, and the positional embedding is the only information that captures ordering, BERT neglects the dependency among predicted tokens ...

August 25, 2022 · 2 min · 378 words · Sukai Huang
multimodal framework

Sergios_karagiannakos Vision Language Models Towards Multimodal Dl 2022

Title: Vision Language Models Towards Multimodal Deep Learning. Author: Sergios Karagiannakos. Publish Year: 03 Mar 2022. Review Date: Tue, Aug 9, 2022. https://theaisummer.com/vision-language-models/

August 9, 2022 · 1 min · 24 words · Sukai Huang
learnable codebook

Jiali_duan Multimodal Alignment Using Representation Codebook 2022

Title: Multi-modal Alignment Using Representation Codebook. Author: Jiali Duan, Liqun Chen et al. Publish Year: 2022 (CVPR). Review Date: Tue, Aug 9, 2022. Summary of paper. Motivation: aligning signals from different modalities is an important step, as it affects the performance of later stages such as cross-modality fusion. Since image and text often reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when the features are still evolving during training. Contribution: the authors treat image and text as two “views” of the same entity and encode them into a joint vision-language coding space spanned by a dictionary of cluster centres (a codebook). To further smooth the learning process, they adopt a teacher-student distillation paradigm, in which the momentum teacher of one view guides the student learning of the other. Some key terms: types of vision-language pre-training tasks ...
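One way to picture a "coding space spanned by a dictionary of cluster centres" is to describe each feature by how strongly it matches every centre. The sketch below uses a temperature-scaled softmax over dot-product similarities; this is an assumption made for illustration (the paper's actual assignment procedure may differ), and the two-entry codebook, the features, and the function names are all hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def code_assignment(feature, codebook, temperature=0.1):
    """Soft assignment of one feature vector over the codebook entries."""
    sims = [sum(f * c for f, c in zip(feature, centre)) for centre in codebook]
    return softmax(sims, temperature)


# Hypothetical codebook with two cluster centres, plus an image feature and a
# text feature that are meant to describe the same entity.
codebook = [(1.0, 0.0), (0.0, 1.0)]
image_codes = code_assignment((0.9, 0.1), codebook)
text_codes = code_assignment((0.8, 0.3), codebook)

# The two views differ as raw features but agree on the dominant code, which
# is the kind of cluster-level agreement the summary above describes.
print(image_codes, text_codes)
```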

August 9, 2022 · 3 min · 513 words · Sukai Huang

A preliminary idea about using instruction following as an intermediate training step towards a general learning-based agent

This page is not complete yet. You need a password to access this content; go to Slack *#phdsukai to find out more. ...

August 7, 2022 · 5 min · Sukai Huang

Supplementary explanations for proposed methods and PhD thesis structure

You need a password to access this content; go to Slack *#phdsukai to find out more. ...

August 4, 2022 · 11 min · Sukai Huang

Younggyo_seo Masked World Models for Visual Control 2022

Title: Masked World Models for Visual Control. Author: Younggyo Seo et al. Publish Year: 2022. Review Date: Fri, Jul 1, 2022. https://arxiv.org/abs/2206.14244?context=cs.AI https://sites.google.com/view/mwm-rl Summary of paper. Motivation. TL;DR: masked autoencoders (MAE) have emerged as a scalable and effective self-supervised learning technique. Can MAE also be effective for visual model-based RL? Yes, with a recipe of convolutional feature masking and reward prediction to capture fine-grained and task-relevant information. Some key terms: decoupling visual representation learning and dynamics learning ...

July 1, 2022 · 2 min · 227 words · Sukai Huang

A Brief Overview of Rank Based Prioritized Experience Replay 2016

Title: Prioritised Experience Replay. Author: Neuralnet.ai. Publish Year: 25 Feb 2016. Review Date: Thu, Jun 2, 2022. https://www.neuralnet.ai/a-brief-overview-of-rank-based-prioritized-experience-replay/ Replay memory is essential in RL. Replay memory has been deployed to great success in both value-based and policy-gradient-based reinforcement learning algorithms. The reasons for this success cut right to the heart of reinforcement learning. In particular, replay memory simultaneously solves two outstanding problems in the field. ...
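Since the post's title refers to rank-based prioritisation, a small sketch of that sampling rule may help. It follows the commonly cited formulation in which a transition's priority is 1/rank (ranked by absolute TD error) raised to an exponent alpha; that formulation is not spelled out in the excerpt above, so treat it as an assumption, along with the function names and the alpha value of 0.7. A practical implementation would also use a more efficient data structure than a full sort.

```python
import random


def rank_based_probabilities(td_errors, alpha=0.7):
    """Sampling probability per transition: proportional to (1 / rank) ** alpha.

    Rank 1 goes to the transition with the largest absolute TD error.
    alpha = 0 recovers uniform sampling.
    """
    order = sorted(
        range(len(td_errors)), key=lambda i: abs(td_errors[i]), reverse=True
    )
    priorities = [0.0] * len(td_errors)
    for rank, index in enumerate(order, start=1):
        priorities[index] = (1.0 / rank) ** alpha
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.7, seed=None):
    """Draw a batch of buffer indices (with replacement) using the rank-based rule."""
    rng = random.Random(seed)
    weights = rank_based_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=weights, k=batch_size)


# Hypothetical TD errors for three stored transitions.
probs = rank_based_probabilities([0.1, -2.0, 0.5])
print(probs)  # the second transition (largest |TD error|) is sampled most often
```

Using ranks instead of raw error magnitudes makes the distribution insensitive to outliers: a single huge TD error changes which transition is ranked first, not how dominant the first rank is.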

June 2, 2022 · 2 min · 365 words · Sukai Huang