Review: End-to-End Multi-Task Learning with Attention

This paper extracts shared image features with a single network and applies a soft-attention module for each task.

Liu, S., Johns, E., & Davison, A. J. (2019). End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1871-1880).

Motivation

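In multi-task learning, learning a shared representation raises two challenges:

Network architecture (how should shared features be extracted?): the shared representation has to capture features common to all tasks while still letting each task extract its own features. Put differently, shared features must generalize well, whereas task-specific features require fitting each task more closely, to the point of slight overfitting.
Loss function (how should the tasks be balanced?): choosing the relative weight of each task's loss is not easy, and tuning the weights manually takes considerable effort.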

As of 2019, most studies addressed only one of these two problems. This paper aims to solve both at once.

Method: MTAN (Multi-task Attention Network)

MTAN consists of two components:

Shared network
Task-specific attention networks (one per task)

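In the first attention layer, the data flows as follows:

$p^{(j)}$: the feature map produced by the $j$-th block of the shared network.
$a_{i}^{(j)}$: the attention mask (i.e. attention weights) for task $i$. It is not a scalar: it has one entry per feature element and is applied by element-wise multiplication.
$\hat{a}_{i}^{(j)}$: the task-specific features, obtained by multiplying the shared features by the attention mask.

$\hat{a}_{i}^{(j)} = a_{i}^{(j)} \odot p^{(j)}$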
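The element-wise masking above can be illustrated with a minimal sketch. It assumes a tiny single-channel feature map stored as nested Python lists. The function name `apply_attention_mask`, the task name, and all values are made up for illustration and do not come from the paper's code.

```python
def apply_attention_mask(mask, features):
    """Element-wise product a ⊙ p for two equally shaped 2-D lists."""
    if len(mask) != len(features):
        raise ValueError("mask and features must have the same shape")
    attended = []
    for mask_row, feat_row in zip(mask, features):
        if len(mask_row) != len(feat_row):
            raise ValueError("mask and features must have the same shape")
        attended.append([m * p for m, p in zip(mask_row, feat_row)])
    return attended


# Illustrative values only (not taken from the paper).
p = [[2.0, 4.0],
     [6.0, 8.0]]            # shared features p^(j)
a_seg = [[1.0, 0.5],
         [0.0, 0.25]]       # hypothetical mask a_i^(j) for a segmentation task

a_hat = apply_attention_mask(a_seg, p)
print(a_hat)                # [[2.0, 2.0], [0.0, 2.0]]
```

A mask entry of 1 passes the shared feature through unchanged, while an entry of 0 suppresses it for that task.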

In the subsequent attention layers, the mask is computed as follows:

$a_{i}^{(j)} = h_{i}^{(j)}(g_{i}^{(j)}([u^{(j)}; f^{(j)}(\hat{a}_{i}^{(j-1)})])), \quad j \geq 2$

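$f^{(j)}, g_{i}^{(j)}, h_{i}^{(j)}$: layers that each combine a convolution, batch normalization, and an activation.
The mask is produced by concatenating the shared features of block $j$ ($u^{(j)}$) with the previous block's task-specific features ($\hat{a}_{i}^{(j-1)}$) and passing the result through task-specific convolutions.
$g$ and $h$ use [1x1] kernels (a per-pixel linear combination of channels); $f$ uses [3x3] kernels.
The attention mask is the output of a sigmoid, so its values lie in [0, 1]. When a mask value approaches 1, the module acts as an identity map and the task-specific features equal the shared (global) features.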
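Because a [1x1] convolution mixes channels independently at every pixel, the mask computation can be sketched for a single pixel with plain Python lists. This is a simplified illustration, not the authors' implementation: batch normalization is omitted, ReLU is assumed as the activation of $g$, the output of the [3x3] layer $f$ is assumed to be given, and all names and weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1_pixel(weights, bias, channels):
    """1x1 convolution at one pixel: a linear combination of input channels."""
    return [sum(w * c for w, c in zip(row, channels)) + b
            for row, b in zip(weights, bias)]


def attention_mask_pixel(u, f_prev, g_w, g_b, h_w, h_b):
    """Mask at one pixel: sigmoid(h(relu(g([u; f_prev])))), with BN omitted."""
    merged = u + f_prev  # channel-wise concatenation [u; f(a_hat_prev)]
    hidden = [max(0.0, v) for v in conv1x1_pixel(g_w, g_b, merged)]
    return [sigmoid(v) for v in conv1x1_pixel(h_w, h_b, hidden)]


# Hypothetical two-channel example.
u = [1.0, -1.0]          # shared features u^(j) at this pixel
f_prev = [0.5, 0.5]      # f^(j)(a_hat^(j-1)) at this pixel, assumed given
g_w = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
g_b = [0.0, 0.0]
h_w = [[1.0, 0.0],
       [0.0, 1.0]]

mask = attention_mask_pixel(u, f_prev, g_w, g_b, h_w, [0.0, 0.0])

# A large bias in h saturates the sigmoid, so the mask is close to 1
# and the attended features are close to the shared features p.
saturated = attention_mask_pixel(u, f_prev, g_w, g_b, h_w, [20.0, 20.0])
p = [3.0, 4.0]
a_hat = [m * x for m, x in zip(saturated, p)]
```

Every mask entry stays strictly between 0 and 1, and the saturated case shows the identity behaviour described above: `a_hat` is numerically almost equal to `p`.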

Dynamic Weight Average (DWA)

A task weighting function that adjusts the weight of each task $k$.
It uses past loss values to determine the importance of each task.

$\lambda_k(t) := K \frac{\exp(w_k(t - 1)/T )}{\sum\limits_{i} \exp(w_i(t - 1)/T )}$

$ w_k(t - 1) = \frac{L_k(t - 1)}{L_k(t - 2)}$

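In other words, the ratio of each task's losses over the two preceding epochs is passed through a softmax to decide how much weight that task receives in the next epoch. Here $K$ is the number of tasks, so the weights sum to $K$, and $T$ is a temperature that controls how soft the weighting is.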
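The two DWA formulas can be worked through with a short sketch in plain Python. The function name `dwa_weights`, the default temperature of 2.0, and the loss values are assumptions made for illustration. The first two epochs, for which no loss ratio exists yet, are not handled here.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return lambda_k(t) per task from the losses at epochs t-1 and t-2."""
    if len(prev_losses) != len(prev_prev_losses):
        raise ValueError("both loss lists must have one entry per task")
    num_tasks = len(prev_losses)
    # w_k(t-1) = L_k(t-1) / L_k(t-2): relative descent rate of each task
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]
    # Softmax over w_k / T, rescaled so that the weights sum to K
    exps = [math.exp(w / temperature) for w in ratios]
    total = sum(exps)
    return [num_tasks * e / total for e in exps]


# Hypothetical losses: task 0 halved its loss, task 1 barely improved.
weights = dwa_weights(prev_losses=[0.5, 0.9], prev_prev_losses=[1.0, 1.0])
print(weights)
```

In this example the second task, whose loss is decreasing more slowly, receives the larger weight. If all tasks improve at the same rate, every weight equals 1.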

Results

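Several baseline models were designed for the experiments:

Single-Task, One Task: a SegNet trained on a single task.
Single-Task, STAN: the MTAN attention mechanism applied to SegNet for a single task only.
Multi-Task, Split (Wide, Deep): Wide adjusts the number of convolutional filters; Deep increases the number of convolutional layers.
Multi-Task, Dense: a shared network plus task-specific networks, without attention modules (plain feature sharing).
Multi-Task, Cross-Stitch: a previously proposed method.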

