Overview

 


 

1. Model description | Used to

Model description
  • Model name: MediaPipe BlazePose GHUM 3D
  • Detailed description
Lite (3 MB), Full (6 MB) and Heavy (26 MB) models, to estimate the full 3D body pose of an individual in videos captured by a smartphone or web camera.
Optimized for on-device, real-time fitness applications: the Lite model runs at ~44 FPS on CPU via XNNPack TFLite and ~49 FPS via TFLite GPU on a Pixel 3. The Full model runs at ~18 FPS on CPU via XNNPack TFLite and ~40 FPS via TFLite GPU on a Pixel 3.
The Heavy model runs at ~4 FPS on CPU via XNNPack TFLite and ~19 FPS via TFLite GPU on a Pixel 3.

 

Used to
Applications
3D full body pose estimation for single-person videos on mobile, desktop and in browser.

Domain & Users
 - Augmented reality
 - 3D Pose and gesture recognition
 - Fitness and repetition counting
 - 3D pose measurements (angles / distances)

Out-of-scope applications
 - Multiple people in an image
 - People too far away from the camera (e.g. further than 14 feet / 4 meters)
 - Head is not visible
 - Applications requiring metric-accurate depth
 - Any form of surveillance or identity recognition is explicitly out of scope and not enabled by this technology

 


 

2. Model type and Model architecture

Quote from the model card
Convolutional Neural Network: MobileNetV2-like with customized blocks for real-time performance.

 

Model type: CNN

 

Model architecture: MobileNetV2

 


 

3. input, output

Inputs
Regions in the video frames where a person has been detected.
Represented as a 256x256x3 array with an aligned full human body, centered by mid-hip in a vertical body pose, with rotation distortion of (-10, 10). Channels order: RGB with values in [0.0, 1.0].

Output(s)
33x5 array corresponding to (x, y, z, visibility, presence).
 - X, Y coordinates are local to the region of interest and lie in the range [0.0, 255.0].
 - The Z coordinate is measured in "image pixels" like the X and Y coordinates and represents the distance relative to the plane of the subject's hips, which is the origin of the Z axis. Negative values are between the hips and the camera; positive values are behind the hips.
The Z coordinate scale is similar to the X and Y scales but of a different nature: it is obtained not via human annotation but by fitting synthetic data (the GHUM model) to the 2D annotations. Note that Z is not metric, but only defined up to scale.
 - Visibility is in the range of [min_float, max_float] and after user-applied sigmoid denotes the probability that a keypoint is located within the frame and not occluded by another bigger body part or another object.
 - Presence is in the range of [min_float, max_float] and after user-applied sigmoid denotes the probability that a keypoint is located within the frame.
[
  {
    score: 0.8,
    keypoints: [
      {x: 230, y: 220, score: 0.99, name: "nose"},
      {x: 212, y: 190, score: 0.91, name: "left_eye"},
      ...
    ],
    keypoints3D: [
      {x: 0.65, y: 0.11, z: 0.05, score: 0.99, name: "nose"},
      ...
    ],
    segmentation: {
      maskValueToLabel: (maskValue: number) => { return 'person' },
      mask: {
        toCanvasImageSource(): ...
        toImageData(): ...
        toTensor(): ...
        getUnderlyingType(): ...
      }
    }
  }
]
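For reference, a minimal sketch of how this output can be obtained in the browser with the TFJS pose-detection package (the package and option names below reflect my assumption of the standard @tensorflow-models/pose-detection setup, not anything stated in the model card):

import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs-backend-webgl';

async function runBlazePose(video: HTMLVideoElement) {
  // modelType selects the Lite / Full / Heavy variant described above.
  const detector = await poseDetection.createDetector(
      poseDetection.SupportedModels.BlazePose,
      {runtime: 'tfjs', modelType: 'full'});

  // Returns the array shown above: per-pose score, 2D keypoints, and keypoints3D.
  const poses = await detector.estimatePoses(video);
  const nose = poses[0]?.keypoints.find(k => k.name === 'nose');
  console.log(nose);
}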

 


 

4. evaluation metric

Metric for pose: PDJ
  • average Percentage of Detected Joints: the average fraction of keypoints that are detected correctly
We consider a keypoint to be correctly detected if its predicted visibility matches the ground truth and the absolute 2D Euclidean error between the reference and target keypoint, normalized by the 2D torso diameter projection, is smaller than 20%.
This value was determined during development as the maximum value that does not degrade accuracy in classifying pose / asana based solely on the key points, without perceiving the original RGB image. The model provides 3D coordinates, but the z-coordinate is obtained from synthetic data, so for a fair comparison with human annotations, only 2D coordinates are employed.
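As a rough illustration of how this metric could be computed (a sketch under my own assumptions about the data layout; the model card does not provide code):

interface Keypoint2D { x: number; y: number; visible: boolean; }

// PDJ: the fraction of keypoints whose predicted visibility matches the ground
// truth and whose 2D error, normalized by the torso diameter, is below 20%.
function pdj(pred: Keypoint2D[], gt: Keypoint2D[], torsoDiameter: number): number {
  let detected = 0;
  for (let i = 0; i < gt.length; i++) {
    const err = Math.hypot(pred[i].x - gt[i].x, pred[i].y - gt[i].y);
    if (pred[i].visible === gt[i].visible && err / torsoDiameter < 0.2) {
      detected++;
    }
  }
  return detected / gt.length;
}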

 

evaluation results
  • geographical
Evaluation across 14 regions of the heavy, full and lite models on the smartphone back-facing camera photos dataset results in an average performance of 94.2% +/- 1.3% stdev with a range of [91.4%, 96.2%] across regions for the heavy model, an average performance of 91.8% +/- 1.4% stdev with a range of [89.2%, 94.0%] across regions for the full model and an average performance of 87.0% +/- 2.0% stdev with a range of [83.2%, 89.7%] across regions for the lite model.
Comparison with our fairness criteria yields a maximum discrepancy between best and worst performing regions of 4.8% for the heavy, 4.8% for the full and 6.5% for the lite model.
  • skin tone and gender
Evaluation on the smartphone back-facing camera photos dataset results in an average performance of 93.6% with a range of [89.3%, 95.0%] across all skin tones for the heavy model, an average performance of 91.1% with a range of [85.9%, 92.9%] across all skin tones for the full model and an average performance of 86.4% with a range of [80.5%, 87.8%] across all skin tones for the lite model. The maximum discrepancy between worst and best performing categories is 5.7% for the heavy model, 7.0% for the full model and 7.3% for the lite model.
Evaluation across gender yields an average performance of 94.8% with a range of [94.2%, 95.3%] for the heavy model, an average performance of 92.3% with a range of [91.2%, 93.4%] for the full model, and an average of 83.7% with a range of [86.0%, 89.1%] for the lite model.
The maximum discrepancy is 1.1% for the heavy model, 2.2% for the full model and 3.1% for the lite model.

Overview

 


 

1. Model description | Used to

Model description
  • Model name: MediaPipe Attention Mesh
  • Detailed description
A lightweight model for real-time prediction of 3D facial surface landmarks from video captured by a front-facing smartphone camera.
Designed for applications like AR makeup, eye tracking and AR puppeteering that rely on highly accurate landmarks for eye (+ iris) and lips regions predicted by the model. Runs at over 50 FPS on Pixel 2 phone.

 

Used to
  • Note that the model does not store faces or facial features tied to any specific identity
Applications
 - Detection of human facial surface landmarks from monocular video.
 - Optimized for videos captured on front-facing cameras of smartphones.
 - Well suited for mobile AR (augmented reality) applications.

Domain & Users
 - The primary intended application is AR entertainment.
 - Intended users are people who use augmented reality for entertainment purposes.

Out-of-scope applications
Not appropriate for:
 - This model is not intended for human life-critical decisions.
 - Predicted face landmarks do not provide facial recognition or identification and do not store any unique face representation.

 


 

2. Model type and Model architecture

Quote from the model card
MobileNetV2-like with customized blocks for real-time performance and an attention mechanism to refine lips and eye regions and to predict irises.

 

Model type: CNN

 

Model architecture
  • In face-landmarks-detection, the two parts are not called a detector model and a landmark model; each is simply referred to as a sub model
  • Detection and coordinate prediction of the facial regions inside the box, excluding the lips, eyes, and irises: MobileNetV2
  • Detection and coordinate prediction of the lips, eyes, and irises: attention mechanism
    • In particular, this mechanism uses a spatial transformer module
    • The module is controlled by an affine transformation matrix and makes it possible to zoom, rotate, translate, and skew the sampled grid of keypoints inside the box
attention mechanism
An attention mechanism pulls out visual features of a given region of interest by sampling a grid of 2D points in the feature space and extracting the features under the sampled points. This allows the architecture to be trained end-to-end and enriches the features used by the attention mechanism.

spatial transformer
Specifically, we use a spatial transformer module which is controlled by an affine transformation matrix and allows us to zoom, rotate, translate, and skew the sampled grid of points.
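As a toy illustration of what an affine-controlled sampling grid means (purely illustrative numbers; the real module operates on feature maps inside the network):

// A 2x3 affine matrix [a, b, tx, c, d, ty] maps a grid point (x, y) to
// (a*x + b*y + tx, c*x + d*y + ty). Zoom, rotation, translation, and skew
// are all special cases of this matrix.
type Affine = [number, number, number, number, number, number];

function transformGrid(grid: Array<[number, number]>, m: Affine): Array<[number, number]> {
  const [a, b, tx, c, d, ty] = m;
  return grid.map(([x, y]) => [a * x + b * y + tx, c * x + d * y + ty]);
}

// Example: zoom in 2x and shift the sampled grid 10 px to the right.
const zoomedGrid = transformGrid([[0, 0], [1, 0], [0, 1]], [2, 0, 10, 0, 2, 0]);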

 


 

3. input, output

Inputs
Image of cropped face with 25% margin on each side and size 192x192 px.

Output(s)
 - Facial surface represented as 468 3D landmarks flattened into a 1D tensor: (x1, y1, z1), (x2, y2, z2), ... x- and y-coordinates follow the image pixel coordinates;
z-coordinates are relative to the face center of mass and are scaled proportionally to the face width.
 - Lips refined region surface represented as 80 2D landmarks (inner and outer contours and an intermediate line) flattened into a 1D tensor.
 - Eye with eyebrow refined region surface (x2) represented as 71 2D landmarks (eye and eyebrow contours with surrounding areas) flattened into a 1D tensor.
 - Iris refined region surface (x2) represented as 5 2D landmarks (1 for pupil center and 4 for iris contour) flattened into a 1D tensor.
 - Face flag indicating the likelihood of the face being present in the input image. Used in tracking mode to detect that the face was lost and the face detector should be applied to obtain a new face position. Face probability threshold is set at 0.5 by default and can be adjusted.
[
  {
    box: {
      xMin: 304.6476503248806,
      xMax: 502.5079975897382,
      yMin: 102.16298762367356,
      yMax: 349.035215984403,
      width: 197.86034726485758,
      height: 246.87222836072945
    },
    keypoints: [
      {x: 406.53152857172876, y: 256.8054528661723, z: 10.2, name: "lips"},
      {x: 406.544237446397, y: 230.06933367750395, z: 8},
      ...
    ],
  }
]
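For reference, a minimal sketch of how this output can be obtained with the TFJS face-landmarks-detection package (my assumption of the standard @tensorflow-models/face-landmarks-detection setup; refineLandmarks enables the attention-refined lips, eye and iris keypoints):

import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection';
import '@tensorflow/tfjs-backend-webgl';

async function runFaceMesh(video: HTMLVideoElement) {
  const detector = await faceLandmarksDetection.createDetector(
      faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh,
      {runtime: 'tfjs', refineLandmarks: true, maxFaces: 1});

  // Each detected face carries the bounding box and keypoints shown above.
  const faces = await detector.estimateFaces(video);
  console.log(faces[0]?.box, faces[0]?.keypoints.length);
}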

 


 

4. evaluation metric

Metric for face: IOD MAE
  • The mean landmark error normalized by the distance between the two eyes
  • IOD (Normalization by Interocular Distance)
Normalization by interocular distance (IOD) is applied to unify the scale of the samples.
IOD is calculated as the distance between the eye centers (which are estimated as the centers of segments connecting eye corners) and is taken as 100%.
To accommodate head rotations, 3D IOD from the ground truth is employed.
  • MAE (Mean Absolute Error normalized by interocular distance): same approach as for handpose
Mean absolute error is calculated as the pixel distance between ground truth and predicted face mesh.
The model provides 3D coordinates, but because the z screen coordinates, as well as the metric world coordinates, are obtained from synthetic data, only the 2D screen-coordinate MNAE is employed for a fair comparison with human annotations.
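A rough sketch of the normalization described above (my own simplification using 2D eye centers; as noted, the model card uses the 3D IOD from the ground truth):

interface Pt { x: number; y: number; }
const dist = (a: Pt, b: Pt) => Math.hypot(a.x - b.x, a.y - b.y);

// IOD MAE: mean landmark error expressed as a percentage of the interocular
// distance (distance between the two eye centers, taken as 100%).
function iodMae(pred: Pt[], gt: Pt[], leftEyeCenter: Pt, rightEyeCenter: Pt): number {
  const iod = dist(leftEyeCenter, rightEyeCenter);
  const mae = pred.reduce((sum, p, i) => sum + dist(p, gt[i]), 0) / pred.length;
  return 100 * mae / iod;
}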

 

evaluation results
  • 구체적인 내용
Comparison with fairness goal of 2.56% IOD MAE discrepancy across 17 regions:
 - Tracking mode: from 2.73% to 3.95% (difference of 1.22%)
 - Reacquisition mode: from 3.01% to 4.28% (difference of 1.27%)

Comparison with our fairness criteria yields a maximum discrepancy between best and worst performing regions of 1.22% for the tracking mode and 1.27% for the reacquisition mode.
We therefore consider the models performing well across groups.
  • Tracking mode and Reacquisition mode
    • When each mode is used
      - Tracking mode: the main mode, i.e. the usual situation where the face can be detected and highly accurate face information is available from the previous frame
      - Reacquisition mode: when no face information is available from the previous frame, i.e. on the first frame or when face tracking has been lost
    • Tool used
      - Tracking mode: the MediaPipe Python Solution API for Face Mesh
      - Reacquisition mode: BlazeFace Detector
Tracking mode
The main mode that takes place most of the time and is based on obtaining a highly accurate face crop from the prediction on the previous frame (frames 2, 3, ... on the image below). Underneath we utilize the MediaPipe Python Solution API for Face Mesh and run the pipeline for several frames on the same image before measuring the tracking accuracy (thus the crop region is determined from the model predictions as in a video stream).

Reacquisition mode
Takes place when there is no information about the face from previous frames. It happens either on the first frame (image below) or on the frames where the face tracking is lost. In this case, an external face detector is being run over the whole frame. We used BlazeFace Detector for the evaluation of the reacquisition mode.
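The two modes can be pictured as a simple per-frame loop. The sketch below is illustrative pseudocode under my own assumptions (the detector and mesh runners are stand-ins for BlazeFace and the face mesh sub model; the real logic lives inside the MediaPipe graph):

interface Box { xMin: number; yMin: number; xMax: number; yMax: number; }
interface MeshResult { keypoints: Array<{x: number; y: number}>; faceFlag: number; }
type FaceDetector = (frame: ImageData) => Promise<Box | null>;
type FaceMeshRunner = (frame: ImageData, crop: Box) => Promise<MeshResult>;

async function processFrame(
    frame: ImageData, prevCrop: Box | null,
    detectFace: FaceDetector, runMesh: FaceMeshRunner): Promise<Box | null> {
  // Reacquisition mode: no crop from the previous frame, so run the detector
  // over the whole frame (first frame, or tracking was lost).
  const crop = prevCrop ?? await detectFace(frame);
  if (!crop) return null;

  // Tracking mode: run the face mesh on the crop derived from the previous prediction.
  const {keypoints, faceFlag} = await runMesh(frame, crop);
  if (faceFlag < 0.5) return null;  // face lost; the next frame reacquires

  // The next frame's crop is the bounding box of the predicted keypoints.
  const xs = keypoints.map(k => k.x);
  const ys = keypoints.map(k => k.y);
  return {xMin: Math.min(...xs), yMin: Math.min(...ys),
          xMax: Math.max(...xs), yMax: Math.max(...ys)};
}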

 

Overview

Source: Model Card Hand Tracking (Lite_Full) with Fairness Oct 2021.pdf (drive.google.com)

0. What is a model card?

Meaning
  • A document that shares a model's various properties, performance, and other characteristics, giving developers a broader basis for choosing a model
  • Based on the paper Model Cards for Model Reporting published by Google

 

Purpose
  • Clarify the intended use cases of machine learning models
  • Minimize situations where a machine learning model is applied in contexts it is not suited for
In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics

 

Contents
  • Benchmarked evaluation across a variety of conditions: cultural groups, geographic region, sex, etc.
  • How the model is intended to be used
  • Details of the performance evaluation procedures
  • Other relevant information
Model cards are short documents accompanying trained machine learning models that provide
benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains.
Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.

 


 

1. Model description | Used to

Model name: MediaPipe Hands (Lite / Full)
  • Detailed description
Hand tracking neural network pipelines: Lite and Full, to predict 2D and 3D hand landmarks on an image / video sequence.
Both pipelines consist of:
 - Hand detector model, which locates the hand region
 - Hand tracking model, which predicts 2D keypoints, 3D world keypoints, and handedness on a cropped area around the hand
 - MediaPipe graph, with hand tracking logic.
  • Difference between mediapipe and tfjs
    • When loading the hand-pose-detection model and setting its options, you will notice that the mediapipe options and the tfjs options are separate. Since the difference between the two can be confusing, it is summarized below (see the configuration sketch after this list)
    • The same applies to handpose, face, and pose-detection
    • In common: both are tools for running machine learning models on the web or other platforms
    • mediapipe
      - A cross-platform framework developed by Google
      - Builds and runs various types of computation graphs, including machine learning models
      - Provides a variety of prebuilt machine learning solutions (e.g. face detection, hand gesture recognition), so users can easily implement complex machine learning features
      - Supports multiple languages such as C++, Python, and JavaScript → runs on many platforms: desktop, mobile, and the web
    • tfjs
      - The web version of TensorFlow: lets you run TensorFlow models in the web browser
      - Focused on building and training models directly in JavaScript
      - Runs in the web browser or a Node.js environment
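A minimal sketch of how the two options differ when creating a detector (assuming the standard @tensorflow-models/hand-pose-detection API; the CDN path and option values are illustrative):

import * as handPoseDetection from '@tensorflow-models/hand-pose-detection';
import '@tensorflow/tfjs-backend-webgl';

async function createDetectors() {
  const model = handPoseDetection.SupportedModels.MediaPipeHands;

  // mediapipe runtime: runs the prebuilt MediaPipe Hands solution (WASM graph).
  const mediapipeDetector = await handPoseDetection.createDetector(model, {
    runtime: 'mediapipe',
    solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/hands',
  });

  // tfjs runtime: runs the converted TensorFlow.js models on a TFJS backend.
  const tfjsDetector = await handPoseDetection.createDetector(model, {
    runtime: 'tfjs',
    modelType: 'full',
  });

  return {mediapipeDetector, tfjsDetector};
}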

 


 

Used to
Application
Predicting landmarks within the crop of prominently displayed hands in images or videos captured by a smartphone camera.

Domain & Users
 - Mobile AR (augmented reality) applications
 - Gesture recognition
 - Hand control

Out-of-scope applications not appropriate for:
 - Counting the number of hands in a crowd
 - Predicting hand landmarks with gloves or occlusions. For example when the hand is holding objects or there is decoration on the hand including jewelry, tattoo and henna.
 - Any form of surveillance or identity recognition is explicitly out of scope and not enabled by this technology.

 


 

2. model type | model architecture

Overview
  • Composed of two models: ① a detector model and ② a landmark (tracking) model
  • Both share the same model type, CNN (Convolutional Neural Network), but their model architectures differ
  • In other words, the architecture is a two-step pipeline (run the detector model → run the tracking model)
Two step neural network pipeline with single-shot detector and following regression model running on the cropped region.
  • The same structure applies to face and pose detection as well, but the hand model card provides the detector model and tracking model information separately

 


 

model type: CNN

 


 

model architecture
  • detector model: SSD (Single-Shot Detector) -> detects the hand region
  • tracker model: regression model -> predicts the landmark coordinates of the detected hand

 


3. input, output

  • detector model
Inputs
A frame of video or an image, represented as a 192 x 192 x 3 tensor. Channels order: RGB with values in [0.0, 1.0].

Output(s)
A float tensor of size 2016 x 18 containing predicted embeddings that represent anchor transformations, which are further used in the Non-Maximum Suppression algorithm.
  • tracking model (landmark model)
Inputs
A crop of a frame of video or an image, represented as a 224 x 224 x 3 tensor. Channels order: RGB with values in [0.0, 1.0].

Output(s)
 - A float scalar represents the presence of a hand in the given input image.
 - 21 3-dimensional screen landmarks represented as a 1 x 63 tensor and normalized by image size. This output should only be considered valid when the presence score is higher than a threshold.
 - A float scalar represents the handedness of the predicted hand. This output should only be considered valid when the presence score is higher than a threshold.
 - 21 3-dimensional metric scale world landmarks represented as a 1 x 63 tensor. Predictions are based on the GHUM hand model. This output should only be considered valid when the presence score is higher than a threshold.
  • In summary, from the user's point of view the input and output are as follows
Inputs
A video stream or an image of arbitrary size. Channels order: RGB with values in [0.0, 1.0].

Output(s)
List of detected hands, each containing:
 - 21 3-dimensional screen landmarks
 - A float scalar representing the handedness probability of the predicted hand
 - 21 3-dimensional metric scale world landmarks, with predictions based on the GHUM hand model
[
  {
    score: 0.8,
    handedness: 'Right',
    keypoints: [
      {x: 105, y: 107, name: "wrist"},
      {x: 108, y: 160, name: "pinky_finger_tip"},
      ...
    ],
    keypoints3D: [
      {x: 0.00388, y: -0.0205, z: 0.0217, name: "wrist"},
      {x: -0.025138, y: -0.0255, z: -0.0051, name: "pinky_finger_tip"},
      ...
    ]
  }
]
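As a small example of consuming this output (the interfaces below are my reading of the example above, not the library's published typings):

interface HandKeypoint { x: number; y: number; z?: number; name?: string; }
interface Hand {
  score: number;
  handedness: 'Left' | 'Right';
  keypoints: HandKeypoint[];
  keypoints3D: HandKeypoint[];
}

// Distance between thumb tip and index finger tip in the metric-scale world
// landmarks, e.g. as a crude pinch-gesture signal.
function pinchDistance(hand: Hand): number | null {
  const byName = (name: string) => hand.keypoints3D.find(k => k.name === name);
  const thumb = byName('thumb_tip');
  const index = byName('index_finger_tip');
  if (!thumb || !index) return null;
  return Math.hypot(thumb.x - index.x, thumb.y - index.y, (thumb.z ?? 0) - (index.z ?? 0));
}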

 


4. evaluation metric

Meaning
  • How model performance is measured (Model Performance Measures)
  • This is a 'fairness evaluation': a procedure that checks whether the model works equally well across different groups (e.g. skin tone, gender, geographic region)
    -> for handpose, the model was tested on populations across various regions / skin tones / genders
  • The metric is therefore a number computed per group, so that differences between the groups can be compared
  • The above applies to handpose, face, and pose-detection alike

 


Metric for handpose: MNAE (Mean of Normalized Absolute Error by palm size)
  • The mean landmark error normalized by palm size
  • Normalization by palm size
Normalization by palm size is applied to unify the scale of the samples. Palm size is calculated as the distance between the wrist and the first joint (MCP) of the middle finger.
  • Mean absolute error
Mean absolute error is calculated as the pixel distance between ground truth and predicted hand landmarks.
The model provides 3D coordinates, but because the z screen coordinates, as well as the metric world coordinates, are obtained from synthetic data, only the 2D screen-coordinate MNAE is employed for a fair comparison with human annotations.
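A rough sketch of this normalization (keypoint names follow the output format above; the helper itself is my own assumption, not code from the model card):

interface Landmark { x: number; y: number; name?: string; }
const d2 = (a: Landmark, b: Landmark) => Math.hypot(a.x - b.x, a.y - b.y);

// MNAE: mean pixel error between predicted and ground-truth 2D landmarks,
// normalized by palm size (wrist to middle finger MCP distance).
function mnae(pred: Landmark[], gt: Landmark[]): number {
  const find = (pts: Landmark[], name: string) => pts.find(p => p.name === name)!;
  const palmSize = d2(find(gt, 'wrist'), find(gt, 'middle_finger_mcp'));
  const mae = pred.reduce((sum, p, i) => sum + d2(p, gt[i]), 0) / pred.length;
  return mae / palmSize;
}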

 


evaluation results
  • How to interpret the results
    • The range, mean and error margin of MNAE, the difference between MNAE and its stdev, and the differences between the categories of each group are interpreted statistically to reach the conclusions below
    • In the end, this is the same as interpreting fairness results
    • The same holds for face and pose-detection
  • geographical
Evaluation across 14 regions on the validation dataset yields an average performance of 12.02% +/- 1.6% stdev with a range of [8.43%, 13.42%] across regions for the lite model and an average performance of 10.09% +/- 1.73% stdev with a range of [6.10%, 13.00%] across regions for the full model.
We found that per-joint MNAE is the smallest at the base of each finger, and gets larger toward the fingertip. We conjecture that the prediction is easier around the palm which is more rigid than the fingers. We also found that the normalized absolute error is larger for blurry or occluded joints.
The findings are consistent across all regions. We didn’t find any error pattern with regard to the regions.
  • skin tone
Evaluation across 6 skin tone types on the validation dataset yields an average performance of 5.67% +/- 0.94% stdev with a range of [4.88%, 7.25%] across types for lite model and an average performance of 5.08% +/- 0.72% stdev with a range of [4.53%, 6.21%] across types for full model.
  • gender
Evaluation across genders on the validation dataset yields an average performance of 5.67% with a range of [5.29%, 6.05%] for lite model and an average performance of 5.09% with a range of [4.80%, 5.38%] for full model.
Our findings are the same as in geographical fairness evaluation results above. We didn’t find any error pattern with regard to the skin tone types or the gender.
