ByteDance Introduces Astra: A Dual-Model Architecture for…
The increasing integration of robots across various sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches. Addressing the fundamental questions of “Where am I?”, “Where am I going?”, and “How do I get there?”, ByteDance has developed Astra, an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.
Traditional navigation systems typically consist of multiple, smaller, and often rule-based modules to handle the core challenges of target localization, self-localization, and path planning. Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, especially challenging in repetitive environments like warehouses where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.
While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question.
ByteDance’s Astra, detailed in their paper “Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning” (website: https://astra-mobility.github.io/), addresses these limitations. Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.
Astra-Global: The Intelligent Brain for Global Localization
Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global positioning within a map. Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.
The construction of this robust localization system begins with offline mapping. The research team developed an offline method to build a hybrid topological-semantic graph G=(V,E,L):
- V (Nodes): Keyframes, obtained by temporal downsampling of input video and SfM-estimated 6-Degrees-of-Freedom (DoF) camera poses, act as nodes encoding camera poses and landmark references.
- E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.
- L (Landmarks): Semantic landmark information is extracted by Astra-Global from visual data at each node, enriching the map’s semantic understanding. These landmarks store semantic attributes and are connected to multiple nodes via co-visibility relationships.
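To make the structure concrete, here is a minimal Python sketch of how such a graph could be represented. The class names, fields and the `nodes_for_landmark` helper are illustrative assumptions that mirror the V/E/L roles above, not Astra's actual data structures.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    """A keyframe node: a temporally downsampled frame with its SfM-estimated 6-DoF pose."""
    node_id: int
    pose: np.ndarray                                        # 4x4 homogeneous camera pose
    landmark_ids: list[int] = field(default_factory=list)   # landmarks observed from this keyframe


@dataclass
class Landmark:
    """A semantic landmark extracted by the MLLM, linked to every node that observes it."""
    landmark_id: int
    name: str                                                # e.g. "reception desk"
    attributes: dict[str, str] = field(default_factory=dict)
    node_ids: set[int] = field(default_factory=set)          # co-visibility links


class TopoSemanticMap:
    """Hybrid topological-semantic graph G = (V, E, L)."""

    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}          # V: keyframe nodes
        self.edges: dict[int, set[int]] = {}      # E: undirected adjacency between nodes
        self.landmarks: dict[int, Landmark] = {}  # L: semantic landmarks

    def add_edge(self, a: int, b: int) -> None:
        # Connectivity derived from relative node poses, later used for global path planning.
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def nodes_for_landmark(self, landmark_id: int) -> list[Node]:
        # Landmark-to-node association: every keyframe that observes the landmark,
        # which yields candidate target images and 6-DoF poses.
        return [self.nodes[i] for i in self.landmarks[landmark_id].node_ids]
```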
In practical localization, Astra-Global’s self-localization and target localization capabilities leverage a coarse-to-fine two-stage process for visual-language localization. The coarse stage analyzes input images and localization prompts, detects landmarks, establishes correspondence with a pre-built landmark map, and filters candidates based on visual consistency. The fine stage then uses the query image and coarse output to sample reference map nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.
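The same two-stage flow can be written down as a short pseudocode-style sketch. The `mllm.detect_landmarks`, `mllm.matches` and `mllm.predict_pose` calls below are hypothetical placeholders for prompts to Astra-Global, and `topo_map` is the graph structure sketched above; none of these names come from the paper.

```python
import numpy as np


def localize(query_image, prompt, mllm, topo_map, top_k: int = 5) -> np.ndarray:
    """Coarse-to-fine visual-language localization (illustrative flow only)."""
    # Coarse stage: detect landmarks in the query and match them to the map,
    # keeping only candidates that pass a visual-consistency check.
    detected = mllm.detect_landmarks(query_image, prompt)
    candidates = [
        lm for lm in topo_map.landmarks.values()
        if mllm.matches(detected, lm)
    ]

    # Fine stage: sample reference nodes that observe the candidate landmarks
    # and let the model compare their appearance and poses against the query image.
    ref_nodes = [
        node
        for lm in candidates[:top_k]
        for node in topo_map.nodes_for_landmark(lm.landmark_id)
    ]
    return mllm.predict_pose(query_image, ref_nodes)   # predicted 6-DoF pose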
For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.
To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology. Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization. Experiments showed GRPO significantly improved Astra-Global’s zero-shot generalization, achieving 99.9% localization accuracy in unseen home environments, surpassing SFT-only methods.
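As a rough illustration of what such a rule-based reward can look like, the sketch below combines the four components named above. The weights and matching rules are assumptions made for this example; the paper's exact reward terms are not reproduced here.

```python
def localization_reward(format_ok: bool,
                        predicted_landmarks: set[str],
                        predicted_node_id: int,
                        gt_landmarks: set[str],
                        gt_node_id: int) -> float:
    """Illustrative rule-based reward for GRPO training of visual-language localization."""
    reward = 0.0
    if format_ok:                              # format reward: output follows the expected structure
        reward += 0.25
    if gt_landmarks:                           # landmark-extraction reward: recall of ground-truth landmarks
        recall = len(predicted_landmarks & gt_landmarks) / len(gt_landmarks)
        reward += 0.25 * recall
    if predicted_node_id == gt_node_id:        # map-matching reward: correct map node identified
        reward += 0.4
    if predicted_landmarks - gt_landmarks:     # extra-landmark term (this interpretation is assumed)
        reward += 0.1
    return reward
```

In GRPO, several sampled responses per query would be scored with a function like this and advantages computed relative to the group mean, so the reward mainly needs to separate better responses from worse ones rather than be calibrated in absolute terms.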
Astra-Local: The Intelligent Assistant for Local Planning
Astra-Local acts as the intelligent assistant for Astra’s high-frequency tasks, a multi-task network capable of efficiently generating local paths and accurately estimating odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.
The 4D spatio-temporal encoder replaces traditional mobile stack perception and prediction modules. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering. The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.
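The skeleton below only illustrates that data flow at the shape level: surround-view images are encoded, lifted onto a voxel grid, and a future representation is predicted from a timestamp. The simple conv, linear and MLP layers are stand-ins for the ViT, Lift-Splat-Shoot and ResNet/DiT modules the paper describes, and every dimension here is an assumption.

```python
import torch
import torch.nn as nn


class SpatioTemporalEncoder4D(nn.Module):
    """Shape-level sketch of the 4D encoder's data flow (not Astra's actual layers)."""

    def __init__(self, feat_dim: int = 64, voxel_grid=(16, 64, 64)):
        super().__init__()
        self.voxel_grid = voxel_grid
        # Stand-in for the ViT image backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stand-in for Lift-Splat-Shoot: map fused image features to voxel features.
        self.to_voxels = nn.Linear(feat_dim, feat_dim)
        # Stand-in for the ResNet/DiT future predictor, conditioned on a future timestamp.
        self.future_predictor = nn.Sequential(
            nn.Linear(feat_dim + 1, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, images: torch.Tensor, future_t: torch.Tensor) -> torch.Tensor:
        # images: (B, N, 3, H, W) omnidirectional views; future_t: (B, 1) future timestamp.
        b, n, _, _, _ = images.shape
        feats = self.image_encoder(images.flatten(0, 1)).flatten(1)   # (B*N, feat_dim)
        feats = feats.view(b, n, -1).mean(dim=1)                      # fuse the N views
        voxels = self.to_voxels(feats)                                # (B, feat_dim)
        # Broadcast over the voxel grid and predict the future voxel representation.
        d, h, w = self.voxel_grid
        voxels = voxels[:, None, None, None, :].expand(b, d, h, w, voxels.shape[-1])
        t = future_t[:, None, None, None, :].expand(b, d, h, w, 1)
        return self.future_predictor(torch.cat([voxels, t], dim=-1))  # (B, d, h, w, feat_dim)
```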
The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching. To prevent collisions, the planning head incorporates a masked ESDF loss (Euclidean Signed Distance Field). This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground truth trajectory mask, significantly reducing collision rates. Experiments demonstrate its superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.
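One way such a masked ESDF penalty could be written is sketched below: the distance field is computed from the occupancy grid, and a hinge penalty is applied only where the 2D ground-truth trajectory mask is set. The masking rule and margin are this sketch's interpretation, not necessarily the paper's exact loss.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def masked_esdf_loss(occupancy: np.ndarray,    # (D, H, W) boolean grid, True = occupied
                     trajectory: np.ndarray,   # (T, 3) planned waypoints as (x, y, z) voxel coords
                     gt_mask: np.ndarray,      # (H, W) 2D ground-truth trajectory mask
                     margin: float = 0.5,      # desired clearance in metres
                     resolution: float = 0.1   # metres per voxel
                     ) -> float:
    """Illustrative masked-ESDF collision penalty for planned trajectories."""
    # Signed distance field: positive in free space, negative inside obstacles.
    esdf = (distance_transform_edt(~occupancy) - distance_transform_edt(occupancy)) * resolution

    loss, used = 0.0, 0
    for x, y, z in trajectory:
        d, h, w = int(round(z)), int(round(y)), int(round(x))
        if gt_mask[h, w]:                                  # only penalize cells covered by the GT mask
            loss += max(0.0, margin - esdf[d, h, w])       # hinge: waypoint too close to an obstacle
            used += 1
    return loss / max(used, 1)
```

In training, a term like this would be added to the flow-matching objective, pushing generated trajectories away from obstacles without affecting regions outside the masked corridor.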
The odometry head predicts the robot’s relative pose using current and past 4D features and additional sensor data (e.g., IMU, wheel data). It trains a Transformer model to fuse information from different sensors. Each sensor modality is processed by a specific tokenizer, combined with modality embeddings and temporal positional embeddings.
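A compact sketch of that fusion pattern is below: each modality has its own tokenizer, tokens get learned modality embeddings plus temporal positional embeddings, and a Transformer encoder fuses them before a relative-pose head. Layer sizes, the chosen modalities, and the mean-pooling readout are assumptions for this example.

```python
import torch
import torch.nn as nn


class OdometryFusion(nn.Module):
    """Illustrative multi-sensor odometry head (dimensions are assumed, not Astra's)."""

    def __init__(self, d_model: int = 128, voxel_dim: int = 64,
                 imu_dim: int = 6, wheel_dim: int = 2, max_steps: int = 32):
        super().__init__()
        # One tokenizer per sensor modality.
        self.tokenizers = nn.ModuleDict({
            "voxel": nn.Linear(voxel_dim, d_model),   # pooled 4D voxel features
            "imu": nn.Linear(imu_dim, d_model),
            "wheel": nn.Linear(wheel_dim, d_model),
        })
        # Learned modality embeddings and temporal positional embeddings.
        self.modality_emb = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(d_model)) for name in self.tokenizers
        })
        self.time_emb = nn.Embedding(max_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pose_head = nn.Linear(d_model, 6)        # relative pose: 3 translation + 3 rotation

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Each entry of `inputs` is a (B, T, dim) sequence for one modality, with T < max_steps.
        tokens = []
        for name, x in inputs.items():
            steps = torch.arange(x.shape[1], device=x.device)
            tokens.append(self.tokenizers[name](x)
                          + self.modality_emb[name]
                          + self.time_emb(steps))
        fused = self.encoder(torch.cat(tokens, dim=1))    # fuse all modalities in one sequence
        return self.pose_head(fused.mean(dim=1))          # (B, 6) relative pose estimate
```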
🔗 Source: syncedreview.com
Ai2’s Olmo 3 family challenges Qwen and Llama with efficient…
The Allen Institute for AI (Ai2) hopes to take advantage of an increased demand for customized models and enterprises seeking more transparency from AI models with its latest release.
Ai2 made the latest addition to its Olmo family of large language models available to organizations, continuing to focus on openness and customization.
Olmo 3 has a longer context window, more reasoning traces and is better at coding than its previous iteration. This latest version, like the other Olmo releases, is open-sourced under the Apache 2.0 license. Enterprises will have complete transparency into and control over the training data and checkpointing.
Ai2 will release three versions of Olmo 3:
- Olmo 3-Think, in both 7B and 32B, is the flagship reasoning model for advanced research.
- Olmo 3-Base, also in both parameter sizes, targets programming, comprehension, math and long-context reasoning. Ai2 said this version is “ideal for continued pre-training or fine-tuning.”
- Olmo 3-Instruct, in 7B, is optimized for instruction following, multi-turn dialogue and tool use.
The company said Olmo 3-Think is the “first-ever fully open 32B thinking model that generates explicit reasoning-chain-style content.” Olmo 3-Think also has a long context window of 65,000 tokens, well suited to longer-running agentic projects or reasoning over longer documents.
Noah Smith, Ai2’s senior director of NLP research, told VentureBeat in an interview that many of its customers, from regulated enterprises to research institutions, want to use models that give them assurance about what went into the training.
“The releases from our friends in the tech world are very cool and super exciting, but there are a lot of people for whom data privacy, control over what goes into the model, how the models train and other constraints on how the model can be used are front of mind,” said Smith.
Developers can access the models on Hugging Face and the Ai2 Playground.
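For developers who want to try the models, a minimal loading sketch with Hugging Face transformers is shown below. The repository ID is a guessed naming pattern and the checkpoints are assumed to be transformers-compatible like earlier OLMo releases; check the allenai organization on Hugging Face for the actual Olmo 3 model IDs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Olmo-3-7B-Instruct"   # assumed ID, verify on Hugging Face before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain the difference between open-weight and fully open language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```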
Transparency and customization
Smith said the company believes any organization using models like Olmo 3 has to be able to control them and mold them in the way that works best for it.
“We don’t believe in one-size-fits-all solutions,” Smith said. “It’s a known thing in the world of machine learning that if you try and build a model that solves all the problems, it ends up not being really the best model for any one problem. There aren’t formal proofs of that, but it’s a thing that old timers like me have kind of observed.”
He added that models with the ability to specialize “are maybe not as flash as getting high scores on math exams” but offer more flexibility for enterprises.
Olmo 3 allows enterprises to essentially retrain the model by adding to the data mix it learns from. The idea is that businesses can bring in their proprietary sources to guide the model in answering specific company queries. To help enterprises during this process, Ai2 added checkpoints from every major training phase.
Demand for model customization has grown as enterprises that cannot build their own LLMs want to create company-specific or industry-focused models. Startups like Arcee have begun offering enterprise-focused, customizable small models.
Models like Olmo 3, Smith said, also give enterprises more confidence in the technology. Since Olmo 3 provides the training data, Smith said enterprises can trust that the model did not ingest anything it shouldn’t have.
Ai2 has always claimed to be committed to greater transparency, even launching a tool called OlmoTrace in April that can track a model’s output directly back to the original training data. The company releases open-sourced models and posts its code to repositories like GitHub for anyone to use.
Competitors like Google and OpenAI have faced criticism from developers for hiding raw reasoning tokens and summarizing reasoning instead; developers say that without that transparency they are left “debugging blind.”
Ai2 pretrained Olmo 3 on the six-trillion-token open-source dataset Dolma 3. The dataset encompasses web data, scientific literature and code. Smith said they optimized Olmo 3 for code, compared to the focus on math for Olmo 2.
How it stacks up
Ai2 claims that the Olmo 3 family of models represents a significant leap for truly open-source models, at least for open-source LLMs developed outside China. The base Olmo 3 model was trained “with roughly 2.5x greater compute efficiency as measured by GPU-hours per token,” meaning it consumed less energy during pre-training and cost less.
The company said the Olmo 3 models outperformed other open models, such as Marin from Stanford, LLM360’s K2, and Apertus, though Ai2 did not provide figures for the benchmark testing.
“Of note, Olmo 3-Think (32B) is the strongest fully open reasoning model, narrowing the gap to the best open-weight models of similar scale, such as the Qwen 3-32B-Thinking series of models across our suite of reasoning benchmarks, all while being trained on 6x fewer tokens,” Ai2 said in a press release.
The company added that Olmo 3-Instruct performed better than Qwen 2.5, Gemma 3 and Llama 3.1.
🔗 Source: venturebeat.com