Overview
1. Model description | Used to

Model description
  • Model name: MediaPipe Attention Mesh
  • Detailed description
A lightweight model for real-time prediction of 3D facial surface landmarks from video captured by a front-facing smartphone camera.
Designed for applications like AR makeup, eye tracking and AR puppeteering that rely on highly accurate landmarks for eye (+ iris) and lips regions predicted by the model. Runs at over 50 FPS on Pixel 2 phone.

 

Used to
  • Note that the model does not store faces or facial features tied to any specific identity
Applications
 - Detection of human facial surface landmarks from monocular video.
 - Optimized for videos captured on front-facing cameras of smartphones.
 - Well suited for mobile AR (augmented reality) applications.

Domain & Users
 - The primary intended application is AR entertainment.
 - Intended users are people who use augmented reality for entertainment purposes.

Out-of-scope applications
Not appropriate for:
 - This model is not intended for human life-critical decisions.
 - Predicted face landmarks do not provide facial recognition or identification and do not store any unique face representation.
2. Model type and Model architecture

Quoted from the model card
MobileNetV2-like with customized blocks for real-time performance and an attention mechanism to refine lips and eye regions and to predict irises.

 

Model type: CNN

 

Model architecture
  • In face-landmarks-detection, these are not called detector-model and landmark-model; each is referred to as a sub model
  • Detection and landmark prediction for the parts of the face inside the box, excluding the lips, eyes, and irises: MobileNetV2
  • Detection and landmark prediction for the lips, eyes, and irises: attention mechanism
    • In particular, a spatial transformer module is used in this mechanism (see the sketch after the quoted passages below)
    • This module is controlled by an affine transformation matrix and lets the sampled grid of keypoints inside the box be zoomed, rotated, translated, and skewed
attention mechanism
An attention mechanism pulls out visual features of a given region of interest by sampling a grid of 2D points in the feature space and extracting the features under the sampled points. This allows to train architectures end-to-end and to enrich the features that are used by the attention mechanism.

spatial transformer
Specifically, we use a spatial transformer module which is controlled by an affine transformation matrix and allows us to zoom, rotate, translate, and skew the sampled grid of points.
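
Conceptually, the spatial transformer samples a regular grid of 2D points whose positions are controlled by a 2x3 affine matrix, and features are then read off under the transformed points. The sketch below is my own illustration of that idea in TypeScript (hypothetical helper names, not the MediaPipe implementation); the affine matrix is what zooms, rotates, translates, and skews the grid.

// Minimal sketch of affine-controlled grid sampling (illustration only, not the MediaPipe code).
// A 2x3 affine matrix [[a, b, tx], [c, d, ty]] maps normalized grid coordinates
// (u, v) in [-1, 1] to sampling positions in feature-map space.
type Affine = [[number, number, number], [number, number, number]];

function makeGrid(size: number): Array<[number, number]> {
  // Regular (size x size) grid of normalized coordinates in [-1, 1].
  const points: Array<[number, number]> = [];
  for (let i = 0; i < size; i++) {
    for (let j = 0; j < size; j++) {
      points.push([-1 + (2 * j) / (size - 1), -1 + (2 * i) / (size - 1)]);
    }
  }
  return points;
}

function transformGrid(grid: Array<[number, number]>, m: Affine): Array<[number, number]> {
  // Zoom/rotation/skew come from the 2x2 part of the matrix, translation from the last column.
  return grid.map(([u, v]): [number, number] => [
    m[0][0] * u + m[0][1] * v + m[0][2],
    m[1][0] * u + m[1][1] * v + m[1][2],
  ]);
}

function bilinearSample(featureMap: number[][], x: number, y: number): number {
  // Read a single-channel feature map at a fractional position (x, y).
  const x0 = Math.floor(x), y0 = Math.floor(y);
  const x1 = Math.min(x0 + 1, featureMap[0].length - 1);
  const y1 = Math.min(y0 + 1, featureMap.length - 1);
  const dx = x - x0, dy = y - y0;
  return featureMap[y0][x0] * (1 - dx) * (1 - dy) + featureMap[y0][x1] * dx * (1 - dy)
       + featureMap[y1][x0] * (1 - dx) * dy + featureMap[y1][x1] * dx * dy;
}

// Example: extract a 16x16 sample of a 64x64 feature map around a (made-up) eye region.
const featureMap: number[][] = Array.from({ length: 64 }, () => new Array(64).fill(0));
const eyeRegion: Affine = [[8, 0, 20], [0, 8, 24]]; // scale 8, centered at (20, 24)
const eyeFeatures = transformGrid(makeGrid(16), eyeRegion)
  .map(([x, y]) => bilinearSample(featureMap, x, y)); // 256 values for the refined eye branch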
3. input, output

Inputs
Image of cropped face with 25% margin on each side and size 192x192 px.
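
As a rough illustration of that input spec (a hypothetical helper, not part of the actual pipeline), the face box returned by the detector would be expanded by 25% of its width/height on each side and the resulting region resized to 192x192 px:

// Sketch only: expand a detected face box by a 25% margin on each side.
interface Box { xMin: number; yMin: number; width: number; height: number; }

function expandBox(box: Box, margin = 0.25): Box {
  return {
    xMin: box.xMin - box.width * margin,
    yMin: box.yMin - box.height * margin,
    width: box.width * (1 + 2 * margin),
    height: box.height * (1 + 2 * margin),
  };
}

// The expanded region is then cropped from the frame and resized to 192x192 px
// (e.g. with canvas drawImage in a browser) before being passed to the model.
const inputRegion = expandBox({ xMin: 304.6, yMin: 102.2, width: 197.9, height: 246.9 });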

Output(s)
 - Facial surface represented as 468 3D landmarks flattened into a 1D tensor: (x1, y1, z1), (x2, y2, z2), ... x- and y-coordinates follow the image pixel coordinates;
z-coordinates are relative to the face center of mass and are scaled proportionally to the face width.
 - Lips refined region surface represented as 80 2D landmarks (inner and outer contours and an intermediate line) flattened into a 1D tensor.
 - Eye with eyebrow refined region surface (x2) represented as 71 2D landmarks (eye and eyebrow contours with surrounding areas) flattened into a 1D tensor.
 - Iris refined region surface (x2) represented as 5 2D landmarks (1 for pupil center and 4 for iris contour) flattened into a 1D tensor.
 - Face flag indicating the likelihood of the face being present in the input image. Used in tracking mode to detect that the face was lost and the face detector should be applied to obtain a new face position. Face probability threshold is set at 0.5 by default and can be adjusted.
[
  {
    box: {
      xMin: 304.6476503248806,
      xMax: 502.5079975897382,
      yMin: 102.16298762367356,
      yMax: 349.035215984403,
      width: 197.86034726485758,
      height: 246.87222836072945
    },
    keypoints: [
      {x: 406.53152857172876, y: 256.8054528661723, z: 10.2, name: "lips"},
      {x: 406.544237446397, y: 230.06933367750395, z: 8},
      ...
    ],
  }
]
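
The object above matches the output format of the face-landmarks-detection package mentioned earlier. A minimal sketch of producing it (assuming the @tensorflow-models/face-landmarks-detection v1 API with the TFJS runtime; the parameter values are illustrative):

// Sketch assuming the @tensorflow-models/face-landmarks-detection v1 API.
import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection';
import '@tensorflow/tfjs-backend-webgl'; // TFJS backend for the 'tfjs' runtime

async function detect(video: HTMLVideoElement) {
  const detector = await faceLandmarksDetection.createDetector(
    faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh,
    // refineLandmarks enables the attention outputs (lips, eyes/eyebrows, irises).
    { runtime: 'tfjs', refineLandmarks: true, maxFaces: 1 },
  );

  // Each face carries a bounding box and keypoints {x, y, z, name?} as shown above.
  const faces = await detector.estimateFaces(video);
  for (const face of faces) {
    const lipPoints = face.keypoints.filter((kp) => kp.name === 'lips');
    console.log(face.box, lipPoints.length);
  }
}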
4. evaluation metric

Metric for face: IOD MAE
  • The mean absolute landmark error, expressed relative to the distance between the two eyes
  • IOD (normalization by interocular distance)
Normalization by interocular distance (IOD) is applied to unify the scale of the samples.
IOD is calculated as the distance between the eye centers (which are estimated as the centers of segments connecting eye corners) and is taken as 100%.
To accommodate head rotations, 3D IOD from the ground truth is employed.
  • MAE (Mean Absolute Error normalized by interocular distance): same as for handpose
Mean absolute error is calculated as the pixel distance between ground truth and predicted face mesh.
The model provides 3D coordinates, but since the z screen coordinates as well as the metric world coordinates are obtained from synthetic data, only the 2D screen-coordinate MNAE is employed for a fair comparison with human annotations.
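
As a worked illustration of the definition above (my own sketch, with hypothetical helper names): estimate each eye center as the midpoint of its eye corners, take the distance between the two centers as 100%, and report the mean pixel error as a percentage of that distance.

// Illustrative sketch of IOD-normalized MAE (not the evaluation code itself).
interface Point2D { x: number; y: number; }

const dist = (a: Point2D, b: Point2D) => Math.hypot(a.x - b.x, a.y - b.y);
const mid = (a: Point2D, b: Point2D): Point2D => ({ x: (a.x + b.x) / 2, y: (a.y + b.y) / 2 });

// Eye centers are estimated as the centers of the segments connecting the eye corners.
function interocularDistance(leftCorners: [Point2D, Point2D], rightCorners: [Point2D, Point2D]): number {
  return dist(mid(...leftCorners), mid(...rightCorners));
}

// Mean absolute pixel error between ground truth and prediction, expressed as % of IOD.
function iodMae(groundTruth: Point2D[], predicted: Point2D[], iod: number): number {
  const meanPixelError =
    groundTruth.reduce((sum, gt, i) => sum + dist(gt, predicted[i]), 0) / groundTruth.length;
  return (meanPixelError / iod) * 100;
}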

 

evaluation results
  • Details
Comparison with fairness goal of 2.56% IOD MAE discrepancy across 17 regions:
 - Tracking mode: from 2.73% to 3.95% (difference of 1.22%)
 - Reacquisition mode: from 3.01% to 4.28% (difference of 1.27%)

Comparison with our fairness criteria yields a maximum discrepancy between best and worst performing regions of 1.22% for the tracking mode and 1.27% for the reacquisition mode.
We therefore consider the models performing well across groups.
  • Tracking mode and Reacquisition mode
Tracking mode
 - When it runs: the main mode, active most of the time
   = typical situations where the face can be detected
   = when a highly accurate face crop is available from the previous frame
 - Tool used: the MediaPipe Python Solution API for Face Mesh

Reacquisition mode
 - When it runs: when no face information is available from the previous frame
   = the first frame, or when face tracking has been lost
 - Tool used: BlazeFace Detector

Tracking mode
The main mode that takes place most of the time and is based on obtaining a highly accurate face crop from the prediction on the previous frame (frames 2, 3, ... on the image below). Underneath we utilize the MediaPipe Python Solution API for Face Mesh and run the pipeline for several frames on the same image before measuring the tracking accuracy (thus the crop region is determined from the model predictions as in a video stream).

Reacquisition mode
Takes place when there is no information about the face from previous frames. It happens either on the first frame (image below) or on the frames where the face tracking is lost. In this case, an external face detector is being run over the whole frame. We used BlazeFace Detector for the evaluation of the reacquisition mode.
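
Putting the two modes together with the face flag and its 0.5 default threshold from section 3, the per-frame logic is roughly the following (a sketch of the flow described above, with hypothetical function names, not actual MediaPipe code):

// Sketch of the tracking / reacquisition flow; runFaceDetector and runAttentionMesh are placeholders.
interface Crop { xMin: number; yMin: number; width: number; height: number; }
interface MeshResult { faceFlag: number; crop: Crop; } // plus landmarks and refined regions

declare function runFaceDetector(frame: ImageData): Crop | null;      // e.g. BlazeFace over the whole frame
declare function runAttentionMesh(frame: ImageData, crop: Crop): MeshResult;

const FACE_FLAG_THRESHOLD = 0.5; // default face probability threshold

let previousCrop: Crop | null = null;

function processFrame(frame: ImageData): MeshResult | null {
  // Reacquisition mode: first frame, or tracking was lost on the previous frame.
  if (previousCrop === null) {
    previousCrop = runFaceDetector(frame);
    if (previousCrop === null) return null;
  }
  // Tracking mode: reuse the crop derived from the previous prediction.
  const result = runAttentionMesh(frame, previousCrop);
  previousCrop = result.faceFlag >= FACE_FLAG_THRESHOLD ? result.crop : null;
  return result;
}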

 
