Allen Z Ren Robots That Ask for Help Uncertainty Alignment 2023

Title: Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners Author: Allen Z. Ren et al. Publish Year: 4 Sep 2023 Review Date: Fri, Jan 26, 2024 url: arXiv:2307.01928v2
Summary of paper
Motivation: LLMs have a wide range of capabilities but often make overly confident yet incorrect predictions. KnowNo aims to measure and align this uncertainty, enabling LLM-based planners to recognise their own limitations and request assistance when necessary.
Contribution: built on the theory of conformal prediction.
Some key terms: ambiguity in NL ...
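The conformal-prediction piece can be made concrete with a small sketch: calibrate a score threshold on held-out examples, build a prediction set over the candidate next actions, and ask for help whenever the set is not a single action. This is a minimal illustration of the general technique, not the paper's implementation; the scoring rule, `alpha`, and all numbers below are assumptions.

```python
# Minimal sketch of split conformal prediction for "asking for help".
# Calibration: nonconformity score = 1 - model probability of the true option.
# At test time, keep every candidate whose score is within the calibrated
# threshold; unless exactly one candidate survives, the planner asks for help.
import numpy as np

def conformal_threshold(cal_scores, alpha=0.15):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def prediction_set(option_probs, threshold):
    """Keep all options whose nonconformity score 1 - p is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]

# Hypothetical calibration scores and option probabilities, purely for illustration.
cal_scores = np.array([0.20, 0.55, 0.40, 0.70, 0.30, 0.65, 0.25, 0.50])
tau = conformal_threshold(cal_scores)
option_probs = {"put bowl in microwave": 0.55, "put cup in microwave": 0.40, "do nothing": 0.05}
options = prediction_set(option_probs, tau)
if len(options) != 1:
    print("ambiguous or uncertain, ask for help:", options)
else:
    print("confident, execute:", options[0])
```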

January 26, 2024 · 3 min · 510 words · Sukai Huang

Harsh_jhamtani Natural Language Decomposition and Interpretation of Complex Utterances 2023

Title: Natural Language Decomposition and Interpretation of Complex Utterances Author: Harsh Jhamtani, Jacob Andreas, et al. Publish Year: 15 May 2023 Review Date: Mon, May 22, 2023 url: https://arxiv.org/pdf/2305.08677.pdf
Summary of paper
Motivation: natural language interfaces often require supervised data to translate user requests into structured intent representations; however, during data collection it can be difficult to anticipate and formalise the full range of user needs. We introduce an approach for equipping a simple language-to-code model to handle complex utterances via a process of hierarchical natural language decomposition.
Contribution: experiments show that the proposed approach enables the interpretation of complex utterances with almost no complex training data, while outperforming standard few-shot prompting approaches.
Some key terms: Methodology ...
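A rough sketch of the decompose-then-interpret loop described above, assuming a hypothetical `llm(prompt)` text-completion helper; the prompts and parsing are illustrative, not the paper's actual pipeline.

```python
# A complex user request is first decomposed into simpler natural-language
# steps, and each step is then translated to code by a small language-to-code
# model/prompt.  `llm` is a placeholder for any text-completion model.

def llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for any text-completion model")

def interpret(utterance: str) -> list[str]:
    # Step 1: decompose the complex utterance into simpler sub-requests.
    decomposition = llm(
        "Rewrite the request as a numbered list of simple steps:\n" + utterance
    )
    steps = [line.split(".", 1)[1].strip()
             for line in decomposition.splitlines() if "." in line]
    # Step 2: translate each simple step into code with a few-shot prompt.
    return [llm("Translate to code:\n" + step) for step in steps]
```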

May 22, 2023 · 10 min · 2088 words · Sukai Huang

Siddharth_karamcheti Language Driven Representation Learning for Robotics 2023

Title: Language-Driven Representation Learning for Robotics Author: Siddharth Karamcheti et al. Publish Year: 24 Feb 2023 Review Date: Fri, Mar 3, 2023 url: https://arxiv.org/pdf/2302.12766.pdf
Summary of paper
Motivation: recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But robot learning encompasses a diverse set of problems beyond control, including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others.
Contribution: first, we demonstrate that existing representations yield inconsistent results across these tasks: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches capture the opposite (i.e., high-level semantics). We then introduce Voltron, a framework for language-driven representation learning from human videos and associated captions. Voltron trades off language-conditioned visual reconstruction to learn low-level visual patterns (masked autoencoding) against visually grounded language generation to encode high-level semantics (hindsight relabelling and contrastive learning).
Some key terms: how can we learn visual representations that generalise across the diverse spectrum of problems in robot learning? ...
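A toy sketch of the trade-off described above: a language-conditioned reconstruction loss for low-level detail mixed with a caption-generation loss for high-level semantics, weighted by `alpha`. Every module here is a deliberately tiny placeholder, not the Voltron architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoltronStyleObjective(nn.Module):
    """Toy encoder/decoders standing in for the real architecture."""
    def __init__(self, patch_dim=192, text_vocab=1000, dim=128, alpha=0.5):
        super().__init__()
        self.alpha = alpha                                  # trade-off weight
        self.patch_proj = nn.Linear(patch_dim, dim)         # encode (masked) patches
        self.text_embed = nn.Embedding(text_vocab, dim)     # encode caption tokens
        self.pixel_decoder = nn.Linear(dim, patch_dim)      # low-level reconstruction
        self.caption_decoder = nn.Linear(dim, text_vocab)   # high-level generation

    def forward(self, masked_patches, caption_ids, target_patches):
        vis = self.patch_proj(masked_patches)                            # (B, P, dim)
        txt = self.text_embed(caption_ids)                               # (B, T, dim)
        joint = torch.cat([vis, txt], dim=1).mean(dim=1, keepdim=True)   # crude fusion
        recon = self.pixel_decoder(vis + joint)          # language-conditioned reconstruction
        recon_loss = F.mse_loss(recon, target_patches)
        logits = self.caption_decoder(txt + joint)       # visually grounded caption logits
        gen_loss = F.cross_entropy(logits.flatten(0, 1), caption_ids.flatten())
        return self.alpha * recon_loss + (1 - self.alpha) * gen_loss

obj = VoltronStyleObjective()
loss = obj(torch.randn(2, 16, 192), torch.randint(0, 1000, (2, 8)), torch.randn(2, 16, 192))
print(loss.item())
```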

March 3, 2023 · 3 min · 463 words · Sukai Huang

Jie_huang Can Language Models Be Specific How 2022

Title: Can Language Models Be Specific? How? Author: Jie Huang et al. Publish Year: 11 Oct 2022 Review Date: Tue, Nov 8, 2022
Summary of paper
Motivation: they propose to measure how specific the language of pre-trained language models (PLMs) is. To achieve this, they introduce a novel approach to building a benchmark for specificity testing by forming masked-token prediction tasks with prompts. For instance, given "J.K. Rowling was born in [MASK]", they test whether PLMs fill in a more specific answer, e.g., Yate instead of England. If the prediction is more specific, we can retrieve more fine-grained information from language models, and in turn acquire more information. Reviewer's opinion: this is not saying that summarisation is easy or carries less useful information; there are cases where abstract information is more useful.
Contribution: although there are works on measuring how much knowledge is stored in PLMs or improving the correctness of their predictions, none attempted to measure or improve the specificity of predictions made by PLMs. Understanding how specific the language of PLMs is can help us better understand the behaviour of language models and facilitate downstream applications such as question answering. They set up a benchmark dataset for specificity; the quality of the benchmark is high, with the judgment of which answer is more specific being ~97% consistent with humans.
Discovery: in general, PLMs prefer less specific answers when no subject is given, and they only have a weak ability to differentiate coarse-grained from fine-grained objects by measuring their (cosine) similarities to subjects. The results indicate that specificity has been neglected by existing research on language models.
Improving specificity of the prediction: few-shot prompting ...
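The masked-token probe is easy to reproduce in spirit with a fill-mask model; the sketch below compares the scores a masked LM assigns to a fine-grained answer versus a coarse one. The model choice and example are illustrative, not the paper's benchmark, and the scores are not results from the paper.

```python
# Compare how a masked LM scores "Yate" (specific) against "England" (coarse)
# for the same prompt, using the HuggingFace fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "J. K. Rowling was born in [MASK]."

# Note: if a target is not a single token in the vocabulary, the pipeline
# warns and falls back to its first subword.
for pred in fill(prompt, targets=["yate", "england"]):
    print(f'{pred["token_str"]:>10s}  p={pred["score"]:.4f}')
# If the coarse answer consistently outscores the fine-grained one,
# the model is preferring less specific predictions.
```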

November 8, 2022 · 3 min · 429 words · Sukai Huang

Yuetian_weng An Efficient Spatio Temporal Pyramid Transformer for Action Detection 2022

Title: An Efficient Spatio-Temporal Pyramid Transformer for Action Detection Author: Yuetian Weng et al. Publish Year: Jul 2022 Review Date: Thu, Oct 20, 2022
Summary of paper
Motivation: the task of action detection aims at deducing both the action category and the localisation of the start and end moments of each action instance in a long, untrimmed video. It is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attention over a long sequence of video clips. To this end, they present an efficient hierarchical spatio-temporal pyramid transformer for action detection, building upon the fact that early self-attention layers in Transformers still focus on local patterns.
Background: to date, the majority of action detection methods are driven by 3D convolutional neural networks (CNNs), e.g., C3D, I3D, which encode video segment features from RGB frames and optical flow. However, the limited receptive field hinders CNN-based models from capturing long-term spatio-temporal dependencies. Alternatively, vision transformers have shown the advantage of capturing global dependencies via the self-attention mechanism. Hierarchical ViTs divide Transformer blocks into several stages and progressively reduce the spatial size of feature maps as the network goes deeper, but self-attention over a sequence of images is still expensive. They also found that the global attention in the early layers actually only encodes local visual patterns (i.e., it only attends to nearby tokens in adjacent frames while rarely interacting with tokens in distant frames).
Efficient Spatio-temporal Pyramid Transformer ...
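The locality observation above can be illustrated with a generic windowed temporal attention, where each query only attends to keys within a few neighbouring time steps. This is a sketch of the constraint itself (expressed via masking), not the paper's block design; a real efficient implementation would compute only the windowed scores rather than masking a full attention matrix.

```python
import torch

def local_attention(q, k, v, window: int):
    """q, k, v: (batch, T, dim); each query attends only to keys within
    `window` time steps, instead of full T x T global attention."""
    T, d = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, T, T)
    idx = torch.arange(T)
    mask = (idx[None, :] - idx[:, None]).abs() > window    # True = masked out
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 16, 64)
out = local_attention(q, k, v, window=2)   # early layers: small temporal window
print(out.shape)                           # torch.Size([2, 16, 64])
```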

October 20, 2022 · 4 min · 649 words · Sukai Huang

Pengchuan_zhang Vinvl Revisiting Visual Representations in Vision Language Models 2021

Title: VinVL: Revisiting Visual Representations in Vision-Language Models Author: Pengchuan Zhang et al. Publish Year: 10 Mar 2021 Review Date: Sat, Sep 3, 2022
Summary of paper
Motivation: in their experiments they feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR, and utilise an improved approach, OSCAR+, to pretrain the VL model.
Contribution: a bigger object detection model ("ResNeXt-152 C4") trained on a larger amount of training data.
Some key terms: Vision-Language Pretraining ...
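A rough sketch of the pipeline shape implied above: region features from an object detector are projected and concatenated with text embeddings before a transformer fusion encoder. Module names and sizes are placeholders, not the OSCAR+ code.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, vocab=30522, dim=768, region_feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.region_proj = nn.Linear(region_feat_dim, dim)   # detector features -> shared dim
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, token_ids, region_feats):
        text = self.text_embed(token_ids)              # (B, L, dim) caption tokens
        regions = self.region_proj(region_feats)       # (B, R, dim) detected regions
        fused = torch.cat([text, regions], dim=1)      # single joint input sequence
        return self.encoder(fused)

model = FusionSketch()
out = model(torch.randint(0, 30522, (2, 12)), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 48, 768])
```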

September 3, 2022 · 2 min · 332 words · Sukai Huang

Yung_sung_chuang Diffcse Difference Based Contrastive Learning for Sentence Embeddings 2022

Title: DiffCSE: Difference-Based Contrastive Learning for Sentence Embeddings Author: Yung-Sung Chuang et al. Publish Year: 21 Apr 2022 Review Date: Sat, Aug 27, 2022
Summary of paper
Motivation: DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence.
Contribution: we propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings.
Some key terms: DiffCSE is an unsupervised contrastive learning framework rather than a model architecture. Contrastive learning on single-modality data ...
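A sketch of the kind of objective this implies: a SimCSE-style in-batch contrastive loss on two dropout views, plus a difference-detection loss in which a discriminator, conditioned on the sentence embedding, must spot which tokens of an edited sentence were replaced. All components and the weight `lam` are placeholders, not the released DiffCSE code.

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temperature=0.05):
    """z1, z2: (B, dim) embeddings of two dropout views of the same sentences."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)

def diffcse_style_loss(z1, z2, rtd_logits, replaced_mask, lam=0.005):
    """rtd_logits: (B, L) discriminator outputs conditioned on the embedding;
    replaced_mask: (B, L), 1 where a token was edited, 0 where original."""
    rtd = F.binary_cross_entropy_with_logits(rtd_logits, replaced_mask.float())
    return simcse_loss(z1, z2) + lam * rtd
```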

August 27, 2022 · 2 min · 351 words · Sukai Huang

Gregor_geigle Retrieve Fast Rerank Smart Cooperative and Joint Approaches for Improved Cross Modal Retrieval 2022

Title: Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval Author: Gregor Geigle et al. Publish Year: 19 Feb 2022 Review Date: Sat, Aug 27, 2022
Summary of paper
Motivation: they want to combine the advantages of cross-encoders (CE) and bi-encoders (BE) for more efficient cross-modal search and retrieval: the efficiency and simplicity of the BE approach based on twin networks, and the expressiveness and cutting-edge performance of CE methods.
Contribution: we propose a novel joint Cross Encoding and Binary Encoding model (Joint-Coop), which is trained to simultaneously cross-encode and embed multi-modal input; it achieves the highest scores overall while maintaining retrieval efficiency ...
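The cooperative retrieve-and-rerank idea can be sketched directly: the cheap bi-encoder prunes the gallery to a top-k shortlist by embedding similarity, and the expensive cross-encoder rescores only that shortlist. `bi_encode_text`, `bi_encode_image`, and `cross_score` are hypothetical callables standing in for the actual encoders.

```python
import numpy as np

def retrieve_then_rerank(query_text, image_ids, bi_encode_text, bi_encode_image,
                         cross_score, k=20):
    q = bi_encode_text(query_text)                                # (dim,)
    gallery = np.stack([bi_encode_image(i) for i in image_ids])   # (N, dim)
    sims = gallery @ q / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(q))
    top_k = [image_ids[i] for i in np.argsort(-sims)[:k]]         # cheap BE stage over all N
    # expensive CE stage, but only over the k shortlisted candidates
    return sorted(top_k, key=lambda img: cross_score(query_text, img), reverse=True)
```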

August 27, 2022 · 3 min · 453 words · Sukai Huang

Kaitao_song Mpnet Masked and Permuted Pretraining for Language Understanding 2020

Title: MPNet: Masked and Permuted Pre-training for Language Understanding Author: Kaitao Song et al. Publish Year: 2020 Review Date: Thu, Aug 25, 2022
Summary of paper
Motivation: BERT adopts masked language modelling (MLM) for pre-training and is one of the most successful pre-training models. However, since BERT is built entirely from attention blocks and positional embeddings are the only information that encodes token ordering, BERT neglects dependencies among the predicted (masked) tokens ...
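A toy illustration of that dependency gap: under MLM, each masked position is predicted given only the unmasked context, whereas permuted prediction (as in MPNet/XLNet-style objectives) lets later targets also condition on earlier predicted targets. The example sentence and conditioning sets below are purely illustrative.

```python
sentence = ["the", "task", "is", "sentence", "classification"]
masked_positions = [3, 4]                       # predict "sentence", "classification"
context = [t for i, t in enumerate(sentence) if i not in masked_positions]

# MLM (BERT): both predictions see only the unmasked context.
mlm_conditioning = {pos: list(context) for pos in masked_positions}

# Permuted prediction (MPNet-style): predict positions in some order, and each
# later target additionally conditions on the targets already predicted.
order = [3, 4]
perm_conditioning, seen = {}, []
for pos in order:
    perm_conditioning[pos] = list(context) + [sentence[p] for p in seen]
    seen.append(pos)

print(mlm_conditioning)   # position 4 never sees "sentence"
print(perm_conditioning)  # position 4 also conditions on "sentence"
```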

August 25, 2022 · 2 min · 378 words · Sukai Huang

Jiali_duan Multimodal Alignment Using Representation Codebook 2022

Title: Multi-Modal Alignment Using Representation Codebook Author: Jiali Duan, Liqun Chen et al. Publish Year: 2022 CVPR Review Date: Tue, Aug 9, 2022
Summary of paper
Motivation: aligning signals from different modalities is an important step, as it affects the performance of later stages such as cross-modality fusion. Since image and text often reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially while features are still evolving during training.
Contribution: in this paper, we treat image and text as two "views" of the same entity, and encode them into a joint vision-language coding space spanned by a dictionary of cluster centres (a codebook). To further smooth out the learning process, we adopt a teacher-student distillation paradigm, where the momentum teacher of one view guides the student learning of the other.
Some key terms: types of vision-language pre-training tasks ...
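A minimal sketch of the two ingredients mentioned above: soft assignments of image/text features over a shared codebook of cluster centres, with the student's assignments for one view trained to match the momentum (EMA) teacher's assignments for the other view. Sizes, the temperature, and the momentum value are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

codebook = torch.randn(512, 256)   # 512 cluster centres (learnable in the real model)

def code_assignments(features, temperature=0.1):
    """features: (B, 256) -> soft assignment over codewords, (B, 512)."""
    sims = F.normalize(features, dim=-1) @ F.normalize(codebook, dim=-1).t()
    return F.softmax(sims / temperature, dim=-1)

def alignment_loss(img_feats, txt_feats_teacher):
    """Student image assignments should match the momentum teacher's text assignments."""
    p_img = code_assignments(img_feats)
    with torch.no_grad():
        q_txt = code_assignments(txt_feats_teacher)
    return F.kl_div(p_img.log(), q_txt, reduction="batchmean")

def ema_update(teacher_params, student_params, m=0.995):
    """Momentum-teacher update: slow exponential moving average of the student."""
    for t, s in zip(teacher_params, student_params):
        t.data.mul_(m).add_(s.data, alpha=1 - m)
```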

August 9, 2022 · 3 min · 513 words · Sukai Huang

Younggyo_seo Masked World Models for Visual Control 2022

Title: Masked World Models for Visual Control Author: Younggyo Seo et al. Publish Year: 2022 Review Date: Fri, Jul 1, 2022 url: https://arxiv.org/abs/2206.14244?context=cs.AI https://sites.google.com/view/mwm-rl
Summary of paper
Motivation: TL;DR: masked autoencoders (MAE) have emerged as a scalable and effective self-supervised learning technique. Can MAE also be effective for visual model-based RL? Yes, with the recipe of convolutional feature masking and reward prediction to capture fine-grained and task-relevant information.
Some key terms: decouple visual representation learning and dynamics learning ...
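A toy sketch of that recipe: mask features coming out of a convolutional stem (rather than raw pixels), reconstruct the masked features with an autoencoder, and attach a reward-prediction head so the representation retains task-relevant information. Shapes and modules here are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MaskedWorldModelSketch(nn.Module):
    def __init__(self, dim=256, mask_ratio=0.75):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=8, stride=8)   # 64x64 frame -> 8x8 feature tokens
        self.autoencoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.reward_head = nn.Linear(dim, 1)
        self.mask_ratio = mask_ratio

    def forward(self, frames):                                    # frames: (B, 3, 64, 64)
        feats = self.conv(frames).flatten(2).transpose(1, 2)      # (B, 64, dim) conv features
        mask = torch.rand(feats.shape[:2], device=feats.device) < self.mask_ratio
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)       # drop masked feature tokens
        recon = self.autoencoder(masked)
        recon_loss = ((recon - feats) ** 2)[mask].mean()          # reconstruct only masked tokens
        reward_pred = self.reward_head(recon.mean(dim=1))         # task-relevant signal, (B, 1)
        return recon_loss, reward_pred

model = MaskedWorldModelSketch()
loss, reward = model(torch.randn(4, 3, 64, 64))
print(loss.item(), reward.shape)
```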

July 1, 2022 · 2 min · 227 words · Sukai Huang

Deepmind Flamingo a Visual Language Model for Few Shot Learning 2022

Title: Flamingo: a Visual Language Model for Few-Shot Learning Author: Jean-Baptiste Alayrac et al. Publish Year: Apr 2022 Review Date: May 2022
Summary of paper
Flamingo architecture
Pretrained vision encoder (from pixels to features): the model's vision encoder is a pretrained Normalizer-Free ResNet (NFNet). They pretrain the vision encoder using a contrastive objective on their dataset of image-text pairs, using the two-term contrastive loss from the paper "Learning Transferable Visual Models From Natural Language Supervision" ...
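That two-term contrastive loss is the symmetric CLIP-style objective: an image-to-text cross-entropy term plus a text-to-image term over the in-batch similarity matrix. The sketch below is a generic version of it; the temperature value is a placeholder.

```python
import torch
import torch.nn.functional as F

def two_term_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, dim) paired embeddings; positives lie on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(img.size(0))
    loss_i2t = F.cross_entropy(logits, labels)         # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), labels)     # text  -> matching image
    return (loss_i2t + loss_t2i) / 2

loss = two_term_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```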

May 11, 2022 · 3 min · Sukai Huang

Shivam_miglani Nltopddl Learning From Nlp Manuals 2020

Title: NLtoPDDL: One-Shot Learning of PDDL Models from Natural Language Process Manuals Author: Shivam Miglani et al. Publish Year: 2020 Review Date: Mar 2022
Summary of paper
Motivation: a pipeline approach.
Pipeline architecture
Phase 1: a DQN learns to extract the words that represent action names, action arguments, and the sequence of actions present in annotated NL process manuals. (Why only action names; do we need to extract other information?) Also, why is this called DQN RL? Is it just normal supervised learning? (Check the EASDRL paper to understand Phase 1) ...

March 14, 2022 · 2 min · Sukai Huang