GUI Agent Data Annotation and Synthesis in Practice

July 2025

Posted by franztao on July 18, 2025


OpenCUA Approach

image.png

In-House Approach

image.png

1 Human Annotation: AgentNetTool – Annotation & Verification Tool (humans annotate, producing videos plus click annotation data)

The operations shown in red in the figure above can be obtained directly through human annotation with AgentNetTool.

image.png

image.png

2 Direct Video Recording (no annotation labor; only a video of the operations)

Without human annotation, the operations shown in red in the figure above have to be generated with an LLM.

Action Scene Generation

image.png

Video2Action: Reducing Human Interactions in Action Annotation of App Tutorial Videos

  1. PySceneDetect (OpenCV-based): https://github.com/Breakthrough/PySceneDetect

    Scene-cut detection algorithms

    detect-content

    As the name suggests, this method detects cuts from the image content of consecutive frames, matching the everyday notion of a scene transition. The algorithm compares two adjacent frames and computes how much of the image has changed; if the changed area exceeds a preset value (default 30, configurable via the --threshold option), a scene cut is declared.

    detect-threshold

    This is the more traditional detection method, somewhat like the blackframe filter in ffmpeg. It compares each frame's brightness against a preset value; if the brightness exceeds that value, a scene cut is declared. In PySceneDetect, this brightness is computed as the average of the RGB values over all pixels in the frame.

        
    import os

    from scenedetect import detect, AdaptiveDetector, split_video_ffmpeg
        
        
    def f1():
        src = r'C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\2025-08-20 13-56-53.mp4'
        detector = AdaptiveDetector()
        dst = r'C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\AdaptiveDetector'
        
        os.makedirs(dst, exist_ok=True)
        stats_file_path = os.path.join(dst, 'stats_file_path.txt')
        scene_list = detect(src, detector, stats_file_path=stats_file_path, show_progress=True)
        
        print(scene_list)
        split_video_ffmpeg(src, scene_list, output_dir=dst)
        
        
    if __name__ == '__main__':
        # https://mp.weixin.qq.com/s/TavPYa-7vFtBxBQWdkowQA
        #  scenedetect --input "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\2025-08-20 13-56-53.mp4" --output "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\AdaptiveDetector" --stats my_video.stats.csv  detect-content list-scenes save-images
        
        # `detect-adaptive` or `detect-content` to find fast cuts,
        # and `detect-threshold`
        #  scenedetect --input "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\2025-08-20 13-56-53.mp4" --output "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\detect_content" --stats my_video.stats.csv  detect-content list-scenes save-images
        
        #  scenedetect --input "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\2025-08-20 13-56-53.mp4" --output "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\detect_adaptive" --stats my_video.stats.csv  detect-adaptive list-scenes save-images
        
        #  scenedetect --input "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\2025-08-20 13-56-53.mp4" --output "C:\Users\m01216.METAX-TECH\Downloads\agentnet-annotator-win32-x64-0811\agentnet-annotator-win32-x64\resources\backend\_internal\Recordings\b3289d91-aac7-400c-b864-f53982bc3654\detect_threshold" --stats my_video.stats.csv  detect-threshold list-scenes save-images
        f1()
        
    
  2. CV model prediction: not open-sourced; an alternative still needs to be found.
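The detect-content idea above can be sketched as a simple frame-difference check. This is a toy illustration only, not PySceneDetect's actual implementation (which works on HSV-space deltas with more robust statistics); `find_cuts` and its frame representation are made up for the example:

```python
# Toy content-based cut detection: flag a cut when the mean absolute
# pixel difference between consecutive frames exceeds a threshold.
def find_cuts(frames, threshold=30.0):
    """frames: list of equal-length flat pixel lists; returns cut frame indices."""
    cuts = []
    for i in range(1, len(frames)):
        diff = sum(abs(a - b) for a, b in zip(frames[i - 1], frames[i]))
        if diff / len(frames[i]) > threshold:
            cuts.append(i)  # the scene changes at frame i
    return cuts

dark = [10] * 16      # a 4x4 "dark" frame
bright = [200] * 16   # a 4x4 "bright" frame
print(find_cuts([dark, dark, bright, bright]))  # -> [2]
```

The `--threshold` CLI option plays the same role as the `threshold` parameter here: larger values demand a bigger inter-frame change before declaring a cut.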

Action prediction (predicting action locations)

image.png

https://openreview.net/pdf?id=PcwaP4o7vk

image.png

3 DataProcessor – Action Reduction & State–Action Matching (cluster to reduce action states, filter out meaningless actions, producing the action text)

https://github.com/xlang-ai/OpenCUA/tree/main/data/data-process

image.png
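As a rough illustration of what action reduction can look like, here is a minimal sketch; the `Action` class and `reduce_actions` function are hypothetical, and the real logic lives in the data-process directory linked above:

```python
# Hypothetical action-reduction sketch: collapse runs of mouse moves
# (only the final position before a click matters) and merge repeated scrolls.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "move", "click", "scroll", "type"
    x: int = 0
    y: int = 0

def reduce_actions(actions):
    reduced = []
    for a in actions:
        if reduced and a.kind == "move" and reduced[-1].kind == "move":
            reduced[-1] = a    # keep only the last position of a move run
        elif reduced and a.kind == "scroll" and reduced[-1].kind == "scroll":
            pass               # drop repeated scrolls
        else:
            reduced.append(a)
    return reduced

trace = [Action("move", 10, 10), Action("move", 50, 60),
         Action("scroll"), Action("scroll"), Action("click", 50, 60)]
print([a.kind for a in reduce_actions(trace)])  # -> ['move', 'scroll', 'click']
```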

4 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue (generate the CoT for each action, producing the thought text)

https://github.com/xlang-ai/OpenCUA/tree/main/data/cot-generate

image.png
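A minimal sketch of how such CoT generation can be prompted; the template and the `build_cot_prompt` helper are hypothetical, and the actual prompts are in the cot-generate directory linked above:

```python
# Hypothetical prompt-building sketch for per-action CoT generation.
COT_TEMPLATE = """You are annotating a GUI interaction trace.
Task: {task}
Previous thought: {prev_thought}
Current screenshot: <image>
Next action: {action}

Write a short reflective inner monologue (the "thought") explaining why this
action is taken, referring to what is visible on screen."""

def build_cot_prompt(task: str, prev_thought: str, action: str) -> str:
    # The filled prompt would be sent to an LLM together with the screenshot.
    return COT_TEMPLATE.format(task=task, prev_thought=prev_thought, action=action)

print(build_cot_prompt("Export the slide deck as PDF",
                       "I opened the File menu.",
                       "click(x=412, y=318)"))
```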

5 AgentNet Training Data Ready

https://huggingface.co/datasets/xlangai/AgentNet

6 Training Code

Not open-sourced; built on a Qwen base model.

7 Inference Code

https://github.com/xlang-ai/OpenCUA/tree/main/model/inference

8 Deployment Tools and Code for Commercial Use

Uses ByteDance's UI-TARS-desktop:

https://github.com/bytedance/UI-TARS-desktop/blob/main/README.zh-CN.md

9 Evaluation

https://mp.weixin.qq.com/s/1smJ0pgptP2UPhT4Fx970Q

References

https://agentnet-tool.xlang.ai/requirements/annotation/annotation/

Annotation Pipeline Overview

Annotation Pipeline contains 5 steps:

  • Step 1: Open OBS and log in to AgentNet

  • Step 2: Record a task using AgentNet tool

  • Step 3: Review the recorded task

  • Step 4: Write task description

  • Step 5: Upload the task

https://opencua.xlang.ai/

https://mp.weixin.qq.com/s/DrVO8xp3z-OWIESP7q-Vjg

https://docs.google.com/presentation/d/10hC_ek-fmJVQnBj-99K0jVJwjG2xhHcfYV1F_iSUFlo/edit?pli=1&slide=id.g31e0b40d258_0_110#slide=id.g31e0b40d258_0_110

https://arxiv.org/pdf/2508.09123

https://mp.weixin.qq.com/s/DGwJjOPTyBxlsYBpeqLcqw