Overview

 


 

1. Model description | Used to

Model description
  • Model name: MediaPipe BlazePose GHUM 3D
  • Detailed description
Lite (3 MB), Full (6 MB) and Heavy (26 MB) models that estimate the full 3D body pose of an individual in videos captured by a smartphone or web camera.
Optimized for on-device, real-time fitness applications: the Lite model runs at ~44 FPS on a CPU via XNNPack TFLite and ~49 FPS via TFLite GPU on a Pixel 3. The Full model runs at ~18 FPS on a CPU via XNNPack TFLite and ~40 FPS via TFLite GPU on a Pixel 3.
The Heavy model runs at ~4 FPS on a CPU via XNNPack TFLite and ~19 FPS via TFLite GPU on a Pixel 3.
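
A minimal usage sketch, assuming the model is consumed through the @tensorflow-models/pose-detection package with the tfjs runtime (this package and configuration are not part of the model card itself; the modelType option selects the Lite / Full / Heavy variant described above):

import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs-backend-webgl'; // assumes the tfjs backend packages are installed

async function createBlazePoseDetector() {
  // 'lite' | 'full' | 'heavy' corresponds to the three variants above.
  return poseDetection.createDetector(
    poseDetection.SupportedModels.BlazePose,
    {runtime: 'tfjs', modelType: 'full'},
  );
}

async function estimate(video: HTMLVideoElement) {
  const detector = await createBlazePoseDetector();
  // Single-person model: the returned array contains at most one pose.
  const poses = await detector.estimatePoses(video);
  console.log(poses);
}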

 

Used to
Applications
3D full body pose estimation for single-person videos on mobile, desktop and in browser.

Domain & Users
 - Augmented reality
 - 3D Pose and gesture recognition
 - Fitness and repetition counting
 - 3D pose measurements (angles / distances)

Out-of-scope applications
 - Multiple people in an image
 - People too far away from the camera (e.g. further than 14 feet / 4 meters)
 - Head is not visible
 - Applications requiring metric-accurate depth
 - Any form of surveillance or identity recognition is explicitly out of scope and not enabled by this technology

 


 

2. Model type and Model architecture

Quoted from the model card
Convolutional Neural Network: MobileNetV2-like with customized blocks for real-time performance.

 

Model type: CNN

 

Model architecture: MobileNetV2

 


 

3. input, output

Inputs
Regions in the video frames where a person has been detected.
Represented as a 256x256x3 array with an aligned human full body part, centered by the mid-hip in a vertical body pose, with rotation distortion of (-10, 10). Channels order: RGB with values in [0.0, 1.0].
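
A minimal preprocessing sketch for building that input, assuming TensorFlow.js; the actual alignment of the crop around the mid-hip (and the detector stage that produces the person region) is not shown here:

import * as tf from '@tensorflow/tfjs';

// personCrop: an already-detected, person-aligned region of the frame.
function toBlazePoseInput(personCrop: HTMLCanvasElement): tf.Tensor4D {
  return tf.tidy(() => {
    const rgb = tf.browser.fromPixels(personCrop);             // HxWx3 uint8, RGB order
    const resized = tf.image.resizeBilinear(rgb, [256, 256]);  // 256x256x3
    const normalized = resized.toFloat().div(255.0);           // values in [0.0, 1.0]
    return normalized.expandDims(0) as tf.Tensor4D;            // 1x256x256x3 batch
  });
}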

Output(s)
33x5 array corresponding to (x, y, z, visibility, presence).
 - X, Y coordinates are local to the region of interest and range from [0.0, 255.0].
 - Z coordinate is measured in "image pixels" like the X and Y coordinates and represents the distance relative to the plane of the subject's hips, which is the origin of the Z axis. Negative values are between the hips and the camera; positive values are behind the hips.
The Z-coordinate scale is similar to the X and Y scales, but it is different in nature: it is obtained not via human annotation but by fitting synthetic data (the GHUM model) to the 2D annotations. Note that Z is not metric, only defined up to scale.
 - Visibility is in the range of [min_float, max_float] and after user-applied sigmoid denotes the probability that a keypoint is located within the frame and not occluded by another bigger body part or another object.
 - Presence is in the range of [min_float, max_float] and after user-applied sigmoid denotes the probability that a keypoint is located within the frame.
[
  {
    score: 0.8,
    keypoints: [
      {x: 230, y: 220, score: 0.99, name: "nose"},
      {x: 212, y: 190, score: 0.91, name: "left_eye"},
      ...
    ],
    keypoints3D: [
      {x: 0.65, y: 0.11, z: 0.05, score: 0.99, name: "nose"},
      ...
    ],
    segmentation: {
      maskValueToLabel: (maskValue: number) => { return 'person' },
      mask: {
        toCanvasImageSource(): ...
        toImageData(): ...
        toTensor(): ...
        getUnderlyingType(): ...
      }
    }
  }
]
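
The raw visibility and presence values above are logits; a small sketch of the "user-applied sigmoid" that turns them into probabilities (the [x, y, z, visibility, presence] row ordering follows the output description above):

function sigmoid(logit: number): number {
  return 1 / (1 + Math.exp(-logit));
}

// row: one of the 33 keypoint rows, ordered (x, y, z, visibility, presence).
function keypointProbabilities(row: number[]): {visibility: number; presence: number} {
  const [, , , visibilityLogit, presenceLogit] = row;
  return {
    visibility: sigmoid(visibilityLogit), // P(in frame and not occluded)
    presence: sigmoid(presenceLogit),     // P(in frame)
  };
}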

 


 

4. evaluation metric

Metric for pose: PDJ
  • average Percentage of Detected Joints: the average fraction of keypoints that are counted as correctly detected
We consider a keypoint to be correctly detected if predicted visibility for it matches ground truth and the absolute 2D Euclidean error between the reference and target keypoint normalized by the 2D torso diameter projection
is smaller than 20%.
This value was determined during development as the maximum value that does not degrade accuracy in classifying pose / asana based solely on the key points without perceiving the original RGB image. The model is providing 3D coordinates, but the z-coordinate is obtained from synthetic data, so for a fair comparison with human annotations, only 2D coordinates are employed.
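
A sketch of that PDJ criterion, under some assumptions not spelled out in the card: 2D keypoints only, ground-truth visibility as a boolean, predicted visibility thresholded beforehand, and the 2D torso diameter projection passed in precomputed:

interface Keypoint2D { x: number; y: number; visible: boolean; }

function pdj(
  predicted: Keypoint2D[],
  groundTruth: Keypoint2D[],
  torsoDiameter2D: number,
  threshold = 0.2, // 20% of the torso diameter
): number {
  let detected = 0;
  for (let i = 0; i < groundTruth.length; i++) {
    const visibilityMatches = predicted[i].visible === groundTruth[i].visible;
    const error =
      Math.hypot(predicted[i].x - groundTruth[i].x, predicted[i].y - groundTruth[i].y) /
      torsoDiameter2D;
    if (visibilityMatches && error < threshold) detected++;
  }
  return detected / groundTruth.length; // fraction of correctly detected joints
}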

 

evaluation results
  • geographical
Evaluation across 14 regions of the heavy, full and lite models on a smartphone back-facing camera photos dataset results in an average performance of 94.2% +/- 1.3% stdev with a range of [91.4%, 96.2%] across regions for the heavy model, an average performance of 91.8% +/- 1.4% stdev with a range of [89.2%, 94.0%] across regions for the full model and an average performance of 87.0% +/- 2.0% stdev with a range of [83.2%, 89.7%] across regions for the lite model.
Comparison with our fairness criteria yields a maximum discrepancy between the average and worst performing regions of 4.8% for the heavy, 4.8% for the full and 6.5% for the lite model.
  • skin tone and gender
Evaluation on a smartphone back-facing camera photos dataset results in an average performance of 93.6% with a range of [89.3%, 95.0%] across all skin tones for the heavy model, an average performance of 91.1% with a range of [85.9%, 92.9%] across all skin tones for the full model and an average performance of 86.4% with a range of [80.5%, 87.8%] across all skin tones for the lite model. The maximum discrepancy between the worst and best performing categories is 5.7% for the heavy model, 7.0% for the full model and 7.3% for the lite model.
Evaluation across gender yields an average performance of 94.8% with a range of [94.2%, 95.3%] for the heavy model, an average performance of 92.3% with a range of [91.2%, 93.4%] for the full model, and an average of 83.7% with a range of [86.0%, 89.1%] for the lite model.
The maximum discrepancy is 1.1% for the heavy model, 2.2% for the full model and 3.1% for the lite model.

Overview

 


 

1. Model description | Used to

Model description
  • Model name: MediaPipe Attention Mesh
  • Detailed description
A lightweight model for real-time prediction of 3D facial surface landmarks from video captured by a front-facing smartphone camera.
Designed for applications like AR makeup, eye tracking and AR puppeteering that rely on highly accurate landmarks for eye (+ iris) and lips regions predicted by the model. Runs at over 50 FPS on Pixel 2 phone.
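
A minimal usage sketch, assuming the model is consumed through the @tensorflow-models/face-landmarks-detection package (not quoted from the model card); refineLandmarks enables the attention-refined lips, eyes and iris landmarks described above:

import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection';
import '@tensorflow/tfjs-backend-webgl'; // assumes the tfjs backend packages are installed

async function createFaceMeshDetector() {
  return faceLandmarksDetection.createDetector(
    faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh,
    {runtime: 'tfjs', refineLandmarks: true, maxFaces: 1},
  );
}

async function estimate(video: HTMLVideoElement) {
  const detector = await createFaceMeshDetector();
  const faces = await detector.estimateFaces(video);
  console.log(faces);
}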

 

Used to
  • Note that the model does not store faces or facial features tied to any specific identity
Applications
 - Detection of human facial surface landmarks from monocular video.
 - Optimized for videos captured on front-facing cameras of smartphones.
 - Well suitable for mobile AR (augmented reality) applications.

Domain & Users
 - The primary intended application is AR entertainment.
 - Intended users are people who use augmented reality for entertainment purposes.

Out-of-scope applications
Not appropriate for:
 - Human life-critical decisions: this model is not intended for them.
 - Facial recognition or identification: the predicted face landmarks do not provide these capabilities and do not store any unique face representation.

 


 

2. Model type and Model architecture

Quoted from the model card
MobileNetV2 - like with customized blocks for real-time performance and an attention mechanism to refine lips and eye regions and to predict irises.

 

Model type: CNN

 

Model architecture
  • In face-landmarks-detection, the two networks are not called detector-model and landmark-model; each is referred to as a sub model
  • Detection and coordinate prediction of the facial regions inside the box, excluding the lips, eyes and irises: MobileNetV2
  • Detection and coordinate prediction of the lips, eyes and irises: attention mechanism
    • In particular, this mechanism uses a spatial transformer module
    • The module is controlled by an affine transformation matrix and lets the sampled grid of keypoints inside the box be zoomed, rotated, translated and skewed
attention mechanism
An attention mechanism pulls out visual features of a given region of interest by sampling a grid of 2D points in the feature space and extracting the features under the sampled points. This makes it possible to train architectures end-to-end and to enrich the features that are used by the attention mechanism.

spatial transformer
Specifically, we use a spatial transformer module which is controlled by an affine transformation matrix and allows us to zoom, rotate, translate, and skew the sampled grid of points.
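
To make the role of the affine matrix concrete, here is an illustrative sketch of how a 2x3 affine transformation acts on a sampled grid of 2D points; the real spatial transformer samples in feature space inside the network, so this is only the coordinate math:

type Affine = [[number, number, number], [number, number, number]]; // 2x3 matrix

// Normalized grid of size x size points in [-1, 1] x [-1, 1] (size >= 2).
function makeGrid(size: number): Array<[number, number]> {
  const points: Array<[number, number]> = [];
  for (let i = 0; i < size; i++) {
    for (let j = 0; j < size; j++) {
      points.push([-1 + (2 * j) / (size - 1), -1 + (2 * i) / (size - 1)]);
    }
  }
  return points;
}

// [x', y'] = M * [x, y, 1]^T: the 2x2 part zooms/rotates/skews, the last column translates.
function transformGrid(points: Array<[number, number]>, m: Affine): Array<[number, number]> {
  return points.map(([x, y]) => [
    m[0][0] * x + m[0][1] * y + m[0][2],
    m[1][0] * x + m[1][1] * y + m[1][2],
  ]);
}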

 


 

3. input, output

Inputs
Image of cropped face with 25% margin on each side and size 192x192 px.
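
A small sketch of how such a crop box could be derived from a detected face bounding box (the box format here is an assumption; only the 25% margin per side comes from the model card):

interface Box { xMin: number; yMin: number; width: number; height: number; }

function cropBoxWithMargin(face: Box, margin = 0.25): Box {
  return {
    xMin: face.xMin - face.width * margin,
    yMin: face.yMin - face.height * margin,
    width: face.width * (1 + 2 * margin),   // 25% added on the left and right
    height: face.height * (1 + 2 * margin), // 25% added on the top and bottom
  };
}
// The resulting region is then resampled to 192x192 px before being fed to the model.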

Output(s)
 - Facial surface represented as 468 3D landmarks flattened into a 1D tensor: (x1, y1, z1), (x2, y2, z2), ... The x- and y-coordinates follow the image pixel coordinates;
z-coordinates are relative to the face center of mass and are scaled proportionally to the face width.
 - Lips refined region surface represented as 80 2D landmarks (inner and outer contours and an intermediate line) flattened into a 1D tensor.
 - Eye with eyebrow refined region surface (x2) represented as 71 2D landmarks (eye and eyebrow contours with surrounding areas) flattened into a 1D tensor.
 - Iris refined region surface (x2) represented as 5 2D landmarks (1 for pupil center and 4 for iris contour) flattened into a 1D tensor.
 - Face flag indicating the likelihood of the face being present in the input image. Used in tracking mode to detect that the face was lost and the face detector should be applied to obtain a new face position. Face probability threshold is set at 0.5 by default and can be adjusted.
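
A sketch of unpacking the flattened facial-surface tensor described above (assuming the raw 1D output is ordered (x1, y1, z1), (x2, y2, z2), ...); the JSON example that follows shows the higher-level output of the TF.js face-landmarks-detection wrapper rather than this raw tensor:

function unflattenLandmarks(flat: Float32Array, numLandmarks = 468): Array<[number, number, number]> {
  const landmarks: Array<[number, number, number]> = [];
  for (let i = 0; i < numLandmarks; i++) {
    // x, y in image pixel coordinates; z relative to the face center of mass.
    landmarks.push([flat[3 * i], flat[3 * i + 1], flat[3 * i + 2]]);
  }
  return landmarks;
}
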
[
  {
    box: {
      xMin: 304.6476503248806,
      xMax: 502.5079975897382,
      yMin: 102.16298762367356,
      yMax: 349.035215984403,
      width: 197.86034726485758,
      height: 246.87222836072945
    },
    keypoints: [
      {x: 406.53152857172876, y: 256.8054528661723, z: 10.2, name: "lips"},
      {x: 406.544237446397, y: 230.06933367750395, z: 8},
      ...
    ],
  }
]

 


 

4. evaluation metric

Metric for face: IOD MAE
  • Mean absolute error of the landmarks, normalized by the distance between the two eyes
  • IOD (Normalization by Interocular Distance)
Normalization by interocular distance (IOD) is applied to unify the scale of the samples.
IOD is calculated as the distance between the eye centers (which are estimated as the centers of segments connecting eye corners) and is taken as 100%.
To accommodate head rotations, 3D IOD from the ground truth is employed.
  • MAE (Mean Absolute Error normalized by interocular distance): same as for handpose
Mean absolute error is calculated as the pixel distance between ground truth and predicted face mesh.
The model provides 3D coordinates, but the z screen coordinates as well as the metric world coordinates are obtained from synthetic data, so for a fair comparison with human annotations only the 2D screen coordinates MNAE is employed.
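
A sketch of that IOD-normalized error, under simplifying assumptions: 2D screen coordinates only (the card notes that 3D IOD from the ground truth is used to accommodate head rotations), eye centers estimated as midpoints of the eye-corner segments, and the result reported as a percentage of IOD:

interface Point2D { x: number; y: number; }

const dist = (a: Point2D, b: Point2D) => Math.hypot(a.x - b.x, a.y - b.y);
const mid = (a: Point2D, b: Point2D): Point2D => ({x: (a.x + b.x) / 2, y: (a.y + b.y) / 2});

function iodMae(
  predicted: Point2D[],
  groundTruth: Point2D[],
  gtEyeCorners: {leftInner: Point2D; leftOuter: Point2D; rightInner: Point2D; rightOuter: Point2D},
): number {
  const leftCenter = mid(gtEyeCorners.leftInner, gtEyeCorners.leftOuter);
  const rightCenter = mid(gtEyeCorners.rightInner, gtEyeCorners.rightOuter);
  const iod = dist(leftCenter, rightCenter); // taken as 100%
  const meanError =
    predicted.reduce((sum, p, i) => sum + dist(p, groundTruth[i]), 0) / predicted.length;
  return (meanError / iod) * 100; // IOD MAE in percent
}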

 

evaluation results
  • Details
Comparison with fairness goal of 2.56% IOD MAE discrepancy across 17 regions:
 - Tracking mode: from 2.73% to 3.95% (difference of 1.22%)
 - Reacquisition mode: from 3.01% to 4.28% (difference of 1.27%)

Comparison with our fairness criteria yields a maximum discrepancy between best and worst performing regions of 1.22% for the tracking mode and 1.27% for the reacquisition mode.
We therefore consider the models performing well across groups.
  • Tracking mode and Reacquisition mode

    Tracking mode
     - When it runs: the main mode, i.e. the usual situation in which the face can be detected, i.e. when highly accurate face information is available from the previous frame
     - Tool used: the MediaPipe Python Solution API for Face Mesh

    Reacquisition mode
     - When it runs: when no face information is available from the previous frame, i.e. on the first frame or when face tracking has been lost
     - Tool used: BlazeFace Detector
Tracking mode
The main mode that takes place most of the time and is based on obtaining a highly accurate face crop from the prediction on the previous frame (frames 2, 3, ... on the image below). Underneath we utilize the MediaPipe Python Solution API for Face Mesh and run the pipeline for several frames on the same image before measuring the tracking accuracy (thus the crop region is determined from the model predictions as in a video stream).

Reacquisition mode
Takes place when there is no information about the face from previous frames. It happens either on the first frame (image below) or on the frames where the face tracking is lost. In this case, an external face detector is being run over the whole frame. We used BlazeFace Detector for the evaluation of the reacquisition mode.
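
A sketch of how the two modes fit together in a per-frame loop; runFaceDetector, runFaceMesh and cropFromLandmarks are hypothetical helpers standing in for the BlazeFace Detector and the Face Mesh pipeline mentioned above:

interface Crop { xMin: number; yMin: number; width: number; height: number; }
interface FaceResult { landmarks: Array<[number, number, number]>; faceProbability: number; }

declare function runFaceDetector(frame: ImageData): Crop | null;   // whole-frame detector
declare function cropFromLandmarks(prev: FaceResult): Crop;        // crop from previous prediction
declare function runFaceMesh(frame: ImageData, crop: Crop): FaceResult | null;

let previousResult: FaceResult | null = null;

function processFrame(frame: ImageData): FaceResult | null {
  // Reacquisition mode: first frame, or tracking was lost on the previous frame.
  const crop = previousResult === null
    ? runFaceDetector(frame)
    : cropFromLandmarks(previousResult); // tracking mode

  const result = crop === null ? null : runFaceMesh(frame, crop);
  // The face flag (threshold 0.5 by default) decides whether tracking continues.
  previousResult = result !== null && result.faceProbability >= 0.5 ? result : null;
  return result;
}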

 
