Diffusion Transformer

  • Tencent’s HY-Motion 1.0: Text-to-3D Motion Model


    Tencent Released HY-Motion 1.0: A Billion-Parameter Text-to-Motion Model Built on the Diffusion Transformer (DiT) Architecture and Flow Matching

    Tencent Hunyuan's 3D Digital Human team has introduced HY-Motion 1.0, a billion-parameter text-to-3D motion generation model built on the Diffusion Transformer (DiT) architecture with Flow Matching. The model translates natural-language prompts into 3D human motion clips on a unified SMPL-H skeleton, making it suitable for digital humans, game characters, and cinematics. It was trained on over 3,000 hours of motion data, including high-quality motion capture and animation assets, and uses reinforcement learning to improve instruction following and motion realism. HY-Motion 1.0 is available on GitHub and Hugging Face, with tools and interfaces for integrating it into animation and game development pipelines. Why this matters: HY-Motion 1.0 enables more realistic and diverse character motion from simple text prompts, a significant step for AI-driven 3D animation and digital content creation across industries. A sketch of the flow-matching objective behind this model family follows the article link below.

    Read Full Article: Tencent’s HY-Motion 1.0: Text-to-3D Motion Model
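
    Since the summary above gives no implementation details, here is a minimal PyTorch sketch of the flow-matching objective this model family trains on: a transformer regresses the velocity of a straight path from noise to a motion clip. The MotionDiT module, its dimensions (156 pose values, roughly 52 SMPL-H joints x 3 axis-angle parameters), and the conditioning scheme are illustrative assumptions, not Tencent's released code.

      import torch
      import torch.nn as nn

      # Illustrative stand-in for a DiT-style motion denoiser; the real
      # HY-Motion backbone, tokenizer, and text encoder are assumptions here.
      class MotionDiT(nn.Module):
          def __init__(self, motion_dim=156, text_dim=768, hidden=512):
              super().__init__()
              self.proj_in = nn.Linear(motion_dim + text_dim + 1, hidden)
              layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                                 batch_first=True)
              self.backbone = nn.TransformerEncoder(layer, num_layers=4)
              self.proj_out = nn.Linear(hidden, motion_dim)

          def forward(self, x_t, t, text_emb):
              # Broadcast the scalar timestep and the pooled text embedding
              # onto every frame token before the transformer.
              B, T, _ = x_t.shape
              t_tok = t.view(B, 1, 1).expand(B, T, 1)
              c_tok = text_emb.unsqueeze(1).expand(B, T, text_emb.shape[-1])
              h = self.proj_in(torch.cat([x_t, t_tok, c_tok], dim=-1))
              return self.proj_out(self.backbone(h))

      def flow_matching_loss(model, x1, text_emb):
          # Conditional flow matching: regress the constant velocity
          # (x1 - x0) of the straight path x_t = (1 - t) * x0 + t * x1.
          x0 = torch.randn_like(x1)                      # noise endpoint
          t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep
          tb = t.view(-1, 1, 1)
          x_t = (1 - tb) * x0 + tb * x1
          v_pred = model(x_t, t, text_emb)
          return ((v_pred - (x1 - x0)) ** 2).mean()

    Here x1 would be a (batch, frames, 156) tensor of per-frame SMPL-H pose parameters and text_emb a pooled prompt encoding; the reinforcement learning stage mentioned above sits on top of this base objective.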

  • Tencent HY-Motion 1.0: Text-to-Motion Model


    Tencent HY-Motion 1.0: a billion-parameter text-to-motion model

    Tencent HY-Motion 1.0 is an open-source, billion-parameter model that converts text into 3D character animation using the Diffusion Transformer (DiT) architecture and flow matching. It gives developers and creators high-fidelity, fluid, and diverse animations that integrate readily into existing 3D animation workflows. A full-stage training strategy, spanning pre-training, supervised fine-tuning, and reinforcement learning, enforces physical plausibility and semantic accuracy across more than 200 motion categories, and the team positions it as a new standard for instruction-following capability and motion quality. This matters because generating complex, realistic 3D animation directly from natural language broadens what is possible in digital content creation. A sketch of how clips are sampled from such a model appears after the link below.

    Read Full Article: Tencent HY-Motion 1.0: Text-to-Motion Model
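
    As a companion to the training sketch above, this is the matching sampler: flow-matching models generate by integrating the learned velocity field from noise (t = 0) to data (t = 1). Plain Euler steps are shown as a minimal sketch; the step count, clip length, and pose dimensionality are assumptions, and the production model may use a different or adaptive solver.

      import torch

      @torch.no_grad()
      def sample_motion(velocity_model, text_emb,
                        num_frames=120, motion_dim=156, steps=50):
          # Euler integration of dx/dt = v(x, t, text) from noise to motion.
          B = text_emb.shape[0]
          x = torch.randn(B, num_frames, motion_dim, device=text_emb.device)
          dt = 1.0 / steps
          for i in range(steps):
              t = torch.full((B,), i * dt, device=x.device)
              x = x + velocity_model(x, t, text_emb) * dt  # x <- x + v*dt
          return x  # per-frame pose parameters for the target skeleton

    Because flow matching is trained on straight noise-to-data paths, the learned ODE can typically be integrated in relatively few steps, which helps keep sampling fast enough for animation workflows.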

  • Egocentric Video Prediction with PEVA


    Whole-Body Conditioned Egocentric Video Prediction

    Predicting Ego-centric Video from human Actions (PEVA) is a model that predicts future video frames from past frames and specified whole-body actions. It is trained on Nymeria, a large dataset pairing real-world egocentric video with body-pose capture, which lets it simulate physical human actions from a first-person perspective. The backbone is an autoregressive conditional diffusion transformer, chosen to handle the high-dimensional, temporally extended structure of human motion.

    Each action is represented as a 48-dimensional vector capturing full-body dynamics and joint movements. Training uses random timeskips, sequence-level objectives, and action embeddings to better capture motion dynamics and activity patterns. At test time, PEVA generates future frames autoregressively: it conditions on past frames, samples the next frame, appends it to the history, and repeats, maintaining visual and semantic consistency over extended prediction horizons. A sketch of this rollout loop follows the article link below.

    On standard metrics, PEVA outperforms baseline models at generating high-quality egocentric video and staying coherent over long time horizons. The authors note it is an early step toward fully embodied planning, with limitations in long-horizon planning and task-intent conditioning; future directions include extending PEVA to interactive environments and integrating high-level goal conditioning. Why this matters: understanding and predicting human actions in egocentric video is crucial for AI systems that interact seamlessly with humans in real-world environments, with applications in robotics, virtual reality, and autonomous systems.

    Read Full Article: Egocentric Video Prediction with PEVA
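
    The rollout described above reduces to a simple loop: condition on a sliding window of recent frames plus the next 48-dimensional action, sample one frame, append it to the history, and repeat. The sketch below is illustrative, not the paper's API: sample_frame stands in for PEVA's per-frame conditional diffusion sampling, and the context window size is an assumption.

      import torch

      @torch.no_grad()
      def peva_rollout(sample_frame, past_frames, actions, context_len=16):
          # past_frames: (B, T, C, H, W) observed egocentric frames
          # actions:     (B, horizon, 48) whole-body action vectors
          frames = list(past_frames.unbind(dim=1))
          n_past = len(frames)
          for step in range(actions.shape[1]):
              # Condition on the most recent frames and this step's action,
              # then sample the next frame with the diffusion model.
              context = torch.stack(frames[-context_len:], dim=1)
              frames.append(sample_frame(context, actions[:, step]))
          # Return only the newly generated frames.
          return torch.stack(frames[n_past:], dim=1)

    Because each sampled frame re-enters the conditioning context, errors can compound over long rollouts, which is one reason the paper emphasizes sequence-level training and evaluates coherence over extended horizons.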