Archives AI News

PlayStation launches new app for parental controls


PlayStation is launching a new PlayStation Family app for iOS and Android to help parents manage their child’s playtime on PS5 and PS4. Parents can already set parental control features directly on a console, but this app gives parents another way to set limits and keep track of what their kid is playing. “The app […]

Scaling high-performance inference cost-effectively


At Google Cloud Next 2025, we announced new inference capabilities with GKE Inference Gateway, including support for vLLM on TPUs, Ironwood TPUs, and Anywhere Cache. Our inference solution is based on AI Hypercomputer, a system built on our experience running models like Gemini and Veo 3, which serve over 980 trillion tokens a month to more than 450 million users. AI Hypercomputer services provide intelligent and optimized inferencing, including resource management, workload optimization and routing, and advanced storage for scale and performance, all co-designed to work together with industry-leading GPU and TPU accelerators. Today, GKE Inference Gateway is generally available, and we are launching new capabilities that deliver even more value. This underscores our commitment to helping companies deliver more intelligence, with increased performance and optimized costs for both training and serving. Let's take a look at the new capabilities we are announcing.

Efficient model serving and load balancing

A user's experience of a generative AI application depends heavily on both a fast initial response to a request and smooth streaming of the response through to completion. With these new features, we've improved time-to-first-token (TTFT) and time-per-output-token (TPOT) on AI Hypercomputer. TTFT is determined by the prefill phase, a compute-bound process in which a full pass through the model creates a key-value (KV) cache. TPOT is determined by the decode phase, a memory-bound process in which tokens are generated using the KV cache from the prefill stage. We improve both of these in a variety of ways.

Generative AI applications like chatbots and code generation often reuse the same prefix in API calls. To optimize for this, GKE Inference Gateway now offers prefix-aware load balancing. This new, generally available feature improves TTFT latency by up to 96% at peak throughput for prefix-heavy workloads compared to other clouds by intelligently routing requests with the same prefix to the same accelerators, while balancing the load to prevent hotspots and latency spikes.

Consider a chatbot for a financial services company that helps users with account inquiries. A user starts a conversation to ask about a recent credit card transaction. Without prefix-aware routing, when the user asks follow-up questions, such as the date of the charge or the confirmation number, the LLM has to re-read and re-process the entire initial query before it can answer the follow-up question. This re-computation of the prefill phase is inefficient and adds unnecessary latency, with the user experiencing delays between each question. With prefix-aware routing, the system reuses the data from the initial query by routing the request back to the same KV cache. This bypasses the prefill phase, allowing the model to answer almost instantly. Less computation also means fewer accelerators for the same workload, providing significant cost savings.
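The core idea behind prefix-aware routing can be shown with a minimal sketch. The snippet below is purely illustrative and is not GKE Inference Gateway's implementation: it hashes a fixed-length prompt prefix to pick a backend, so requests sharing a prefix land on the replica whose KV cache already holds that prefix. The backend names and prefix length are hypothetical.

import hashlib

# Hypothetical pool of model-server replicas (e.g., vLLM pods behind a gateway).
BACKENDS = ["replica-0", "replica-1", "replica-2", "replica-3"]
PREFIX_LEN = 256  # characters treated as the shared prefix (illustrative choice)

def route(prompt: str) -> str:
    """Send requests with the same prefix to the same replica so its KV cache is reused."""
    prefix = prompt[:PREFIX_LEN]
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

# Follow-up questions that repeat the conversation's prefix hit the same replica.
conversation = "User: I have a question about a credit card charge on my account."
print(route(conversation + " What date was it posted?"))
print(route(conversation + " What is the confirmation number?"))

A production router such as GKE Inference Gateway also weighs replica load to avoid the hotspots mentioned above; this sketch shows only the cache-affinity half of the idea.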
To further optimize inference performance, you can now also run disaggregated serving using AI Hypercomputer, which can improve throughput by 60%. Enhancements in GKE Inference Gateway, llm-d, and vLLM work together to enable dynamic selection of prefill and decode nodes based on query size. This significantly improves both TTFT and TPOT by increasing the utilization of compute and memory resources at scale.

Take the example of an AI-based code completion application, which needs to provide low-latency responses to maintain interactivity. When a developer submits a completion request, the application must first process the input codebase; this is referred to as the prefill phase. Next, the application generates a code suggestion token by token; this is referred to as the decode phase. These tasks place dramatically different demands on accelerator resources: compute-intensive vs. memory-intensive processing. Running both phases on a single node means neither is fully optimized, causing higher latency and poor response times. Disaggregated serving assigns these phases to separate nodes, allowing for independent scaling and optimization of each phase. For example, if the user base of developers submits a lot of requests based on large codebases, you can scale the prefill nodes. This improves latency and throughput, making the entire system more efficient.

Just as prefix-aware routing optimizes the reuse of conversational context, and disaggregated serving enhances performance by intelligently separating the computational demands of model prefill and token decoding, we have also addressed the fundamental challenge of getting these massive models running in the first place. As generative AI models grow to hundreds of gigabytes in size, they can often take over ten minutes to load, leading to slow startup and scaling. To solve this, we now support the Run:ai model streamer with Google Cloud Storage and Anywhere Cache for vLLM, with support for SGLang coming soon. This enables 5.4 GiB/s of direct throughput to accelerator memory, reducing model load times by over 4.9x and resulting in a better end user experience.

Chart: vLLM model load time.
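To put that streaming throughput in perspective, here is a rough back-of-the-envelope calculation. The 140 GiB weight size is a hypothetical example (roughly a 70B-parameter model in bf16), not a figure from the post; only the 5.4 GiB/s and 4.9x numbers come from the text above.

# Rough arithmetic only; the model size is a hypothetical example.
weights_gib = 140.0
streamer_gibps = 5.4            # direct throughput to accelerator memory (from the post)
baseline_speedup = 4.9          # ">4.9x" load-time reduction cited in the post

streamed_load_s = weights_gib / streamer_gibps
baseline_load_s = streamed_load_s * baseline_speedup

print(f"streamed load: ~{streamed_load_s:.0f} s")   # ~26 s
print(f"baseline load: ~{baseline_load_s:.0f} s")   # ~127 s, i.e., roughly two minutes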
Get started faster with data-driven decisions

Finding the ideal technology stack for serving AI models is a significant industry challenge. Historically, customers have had to navigate rapidly evolving technologies, the switching costs that impact hardware choices, and hundreds of thousands of possible deployment architectures. This inherent complexity makes it difficult to quickly achieve the best price-performance for your inference environment. GKE Inference Quickstart, now generally available, can save you time, improve performance, and reduce costs when deploying AI workloads by suggesting the best accelerators, model server, and scaling configuration for your AI/ML inference applications. New improvements to GKE Inference Quickstart include cost insights and benchmarked performance best practices, so you can easily compare costs and understand latency profiles, saving you months of evaluation and qualification.

GKE Inference Quickstart's recommendations are grounded in a living repository of model and accelerator performance data that we generate by benchmarking our GPU and TPU accelerators against leading large language models like Llama, Mixtral, and Gemma more than 100 times per week. This extensive performance data is then enriched with the same storage, network, and software optimizations that power AI inferencing on Google's global-scale services like Gemini, Search, and YouTube.

Let's say you're tasked with deploying a new, public-facing chatbot. The goal is to provide fast, high-quality responses at the lowest cost. Until now, finding the most cost-effective solution for deploying AI models was a significant challenge. Developers and engineers had to rely on a painstaking process of trial and error. This involved manually benchmarking countless combinations of models, accelerators, and serving architectures, with all the data logged into a spreadsheet to calculate the cost per query for each scenario. This manual, weeks- or even months-long project was prone to human error and offered no guarantee that the best possible solution was ever found.

Using Google Colab and the built-in optimizations in the Google Cloud console, GKE Inference Quickstart lets you choose the most cost-effective accelerators for, say, serving a Llama 3-based chatbot application that needs a TTFT of less than 500 ms. These recommendations are deployable manifests, making it easy to choose a technology stack that you can provision from GKE in your Google Cloud environment. With GKE Inference Quickstart, your evaluation and qualification effort goes from months to days.

Views from the Google Colab that helps the engineer with their evaluation.

Try these new capabilities for yourself. To get started with GKE Inference Quickstart, go to Kubernetes Engine > AI/ML in the Google Cloud console and select "+ Deploy Models" near the top of the screen. Use the filter to select Optimized > Values = True; this shows all of the models that have price/performance optimizations to choose from. Once you select a model, you'll see a slider to select latency. The compatible accelerators in the drop-down will change to ones that meet the latency you select, and the cost per million output tokens will also change based on your selections. Then, via Google Colab, you can plot and view the price/performance of leading AI models on Google Cloud. Chatbot Arena ratings are integrated to help you determine the best model for your needs based on model size, rating, and price per million tokens. You can also pull your organization's in-house quality measures into the Colab and join them with Google's comprehensive benchmarks to make data-driven decisions.

Dedicated to optimizing inference

At Google Cloud, we are committed to helping companies deploy and improve their AI inference workloads at scale. Our focus is on providing a comprehensive platform that delivers unmatched performance and cost-efficiency for serving large language models and other generative AI applications. By leveraging a co-designed stack of industry-leading hardware and software innovations, including AI Hypercomputer, GKE Inference Gateway, and purpose-built optimizations like prefix-aware routing, disaggregated serving, and model streaming, we ensure that businesses can deliver more intelligence with faster, more responsive user experiences and lower total cost of ownership. Our solutions are designed to address the unique challenges of inference, from model loading times to resource utilization, enabling you to deliver on the promise of generative AI. To learn more and get started, visit our AI Hypercomputer site.
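Returning to the cost comparison that GKE Inference Quickstart automates: the sketch below shows the underlying arithmetic of ranking candidate accelerators by cost per million output tokens. All prices, throughputs, and names are made-up placeholders, not Google Cloud benchmark data.

# Illustrative only: made-up prices and throughputs, not GKE Inference Quickstart results.
candidates = {
    # name: (hourly accelerator cost in USD, measured output tokens per second)
    "accelerator-A": (3.00, 1200.0),
    "accelerator-B": (9.00, 4500.0),
}

def cost_per_million_output_tokens(hourly_usd: float, tokens_per_s: float) -> float:
    tokens_per_hour = tokens_per_s * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

for name, (price, tps) in candidates.items():
    print(f"{name}: ${cost_per_million_output_tokens(price, tps):.2f} per 1M output tokens")

In a real evaluation you would first filter candidates by your latency target (for example, TTFT under 500 ms), which is what the latency slider in the console does, and only then compare cost.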

Fast and efficient AI inference with new NVIDIA Dynamo recipe on AI Hypercomputer


As generative AI becomes more widespread, it's important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. Traditional GPU-based serving architectures, while great for training models, struggle with the "multi-turn" nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.

At Google Cloud, we're committed to providing customers with the best choices for their AI needs. That's why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency serving platform for a variety of AI models. Disaggregated inference separates the model's processing phases, offering a significant leap in performance and cost-efficiency. Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud's AI Hypercomputer, including Google Kubernetes Engine (GKE), the vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications' latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub. Let's take a look at how to deploy it.

The two phases of inference

LLM inference is not a monolithic task; it's a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model's weights and the KV cache.

In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.

A specialized, disaggregated inference architecture

Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools. Here's how the components work in concert (a minimal sketch of the request flow follows below):

A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools, one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.

NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo's modular front end and KV cache-aware router process incoming requests. Dynamo then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache that's generated in the prefill pool to the decode pool to begin token generation.

vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.

This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn't interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM, and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
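The following is a minimal, purely illustrative sketch of that request flow: a router sends the compute-bound prefill to one worker pool, hands the resulting KV cache to a memory-bound decode pool, and returns tokens from there. It is not Dynamo code; the class names and data structures are invented for illustration.

# Illustrative pseudocode of disaggregated serving; not NVIDIA Dynamo's actual API.
from dataclasses import dataclass

@dataclass
class KVCache:
    # In a real system this lives in GPU memory and is transferred between pools.
    prompt_tokens: list

class PrefillWorker:
    def prefill(self, prompt: str) -> KVCache:
        tokens = prompt.split()               # stand-in for tokenization + one full forward pass
        return KVCache(prompt_tokens=tokens)  # compute-bound phase

class DecodeWorker:
    def decode(self, cache: KVCache, max_tokens: int):
        for i in range(max_tokens):           # memory-bound phase: reuse the KV cache each step
            yield f"<token-{i}>"

class Router:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool, self.decode_pool = prefill_pool, decode_pool

    def serve(self, prompt: str, max_tokens: int = 8):
        cache = self.prefill_pool[0].prefill(prompt)                  # route to the prefill pool
        return list(self.decode_pool[0].decode(cache, max_tokens))   # then hand off to decode

router = Router([PrefillWorker()], [DecodeWorker()])
print(router.serve("what is the meaning of life ?"))

Because the two pools scale independently, a burst of long prompts can be absorbed by adding prefill workers without touching decode capacity, which is the property the recipe exploits.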
Experiment with Dynamo recipes for Google Cloud

The reproducible recipe shows the steps to deploy disaggregated inference with NVIDIA Dynamo on A3 Ultra (H200) VMs on Google Cloud, using GKE for orchestration and vLLM as the inference engine. The single-node recipe demonstrates disaggregated inference on one A3 Ultra node, using four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one A3 Ultra node for prefill and one A3 Ultra node for decode, serving the Llama-3.3-70B-Instruct model. Future recipes will add support for additional NVIDIA GPUs (e.g., A4, A4X) and inference engines, with expanded coverage of models.

The recipe highlights the following key steps:

Perform initial setup: Set up environment variables and secrets. This needs to be done only once.

Install the Dynamo platform and CRDs: Set up the various Dynamo Kubernetes components. This needs to be done only once.

Deploy the inference backend for a specific model workload: Deploy vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.

Process inference requests: Once the model is deployed for inference, incoming queries are processed to provide responses to users.

Once the server is up, you will see the prefill and decode workers along with the frontend pod, which acts as the primary interface for serving requests. You can verify that everything works as intended by sending a request to the server like this (the response is generated and truncated to max_tokens):

curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "what is the meaning of life ?"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }' | jq -r '.choices[0].message.content'

Example response (truncated):

The question of the meaning of life is a complex and deeply philosophical one that has been debated by scholars, theologians, philosophers, and scientists for

Get started today

By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new performance paradigms and significant cost savings for your most demanding generative AI applications. We are excited to see how you will leverage this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe, which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting, and please share your feedback!
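As a small addendum to the verification step above, the same request can be sent from Python against the OpenAI-compatible endpoint shown in the curl example. This is a generic sketch, not code from the recipe, and it assumes the frontend is reachable at localhost:8000 as in that example.

# Minimal Python equivalent of the curl verification request above.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "what is the meaning of life ?"}],
    "stream": False,
    "max_tokens": 30,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Print only the generated text, like the jq filter in the curl example.
print(body["choices"][0]["message"]["content"])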

We tested six smart rings, and there’s a clear winner


Smart rings are having a moment. After years as a niche gadget, regular people are starting to see the appeal. They're thinner, more accurate, and more wearable compared to a decade ago - and for some people, they're a smarter choice than smartwatches. Smartwatches may dominate the wearable landscape, but they don't work for everyone. […]

Blackmagic’s camera dock works with the new iPhone’s professional filmmaking features


It was briefly mentioned by Apple’s Greg Joswiak during the company’s “Awe dropping” event yesterday, but today Blackmagic Design officially announced its new Camera ProDock that “adds professional camera connections to iPhone 17 Pro and iPhone 17 Pro Max.” These include additional USB-C ports for connecting accessories like external storage drives and BNC connectors that […]

Our approach to carbon-aware data centers: Central data center fleet management

Data centers are the engines of the cloud, processing and storing the information that powers our daily lives. As digital services grow, so do our data centers, and we are working to manage them responsibly. Google thinks of infrastructure at the full-stack level, not just as hardware but as hardware abstracted through software, which allows us to innovate. We have previously shared how we're working to reduce the embodied carbon impact of our data centers by optimizing our technical infrastructure hardware. In this post, we shine a spotlight on our "central fleet" program, which has helped us shift our internal resource management system from a machine economy to a more sustainable resource and performance economy.

What is Central Fleet?

At its core, our central fleet program is a resource distribution approach that allows us to manage and allocate computing resources, like processing power, memory, and storage, in a more efficient and sustainable way. Instead of individual product teams within Google ordering and managing their own physical machines, central fleet acts as a centralized pool of resources that can be dynamically distributed to where they are needed most.

Think of it like a shared car service. Rather than each person owning a car they might only use for a couple of hours a day, a shared fleet allows fewer cars to be used more efficiently by many people. Similarly, our central fleet program ensures our computing resources are constantly in use, minimizing waste and reducing the need to procure new machines.

How it works: A shift to a resource economy

The central fleet approach fundamentally changes how we provision and manage resources. When a team needs more computing power, instead of ordering specific hardware, it places an order for "quota" from the central fleet. This makes the computing resources fungible, that is, interchangeable and flexible. For instance, a team asks for a certain amount of processing power or storage capacity, not a particular server model.

This "intent-based" ordering system provides flexibility in how demand is fulfilled. Central fleet can intelligently fulfill requests from existing inventory or procure at scale, which can lower cost and environmental impact. It also facilitates the return of unneeded resources, which can then be reallocated to other teams, further reducing waste. All of this is made possible by our full-stack infrastructure, built on the Borg cluster management system, which abstracts the physical hardware into a single, fungible resource pool. This software-level intelligence allows us to treat our infrastructure as a fluid, optimizable system rather than a collection of static machines, unlocking massive efficiency gains.
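To make the intent-based model concrete, here is a minimal, hypothetical sketch of the idea. The resource names, pool sizes, and fulfillment logic are invented for illustration and are not Google's internal tooling: teams request abstract capacity, and the pool serves it from returned or idle inventory before anything new is procured.

# Hypothetical illustration of intent-based quota fulfillment; not Google's internal system.
free_pool = {"cpu_cores": 12_000, "ram_gib": 48_000, "storage_tib": 900}

def request_quota(team: str, intent: dict) -> str:
    """Fulfill an abstract capacity request from the shared pool before procuring new machines."""
    if all(free_pool.get(r, 0) >= amount for r, amount in intent.items()):
        for r, amount in intent.items():
            free_pool[r] -= amount
        return f"{team}: fulfilled from existing inventory"
    return f"{team}: triggers procurement at fleet scale"

def return_quota(intent: dict) -> None:
    """Unneeded capacity goes back into the pool for reallocation to other teams."""
    for r, amount in intent.items():
        free_pool[r] = free_pool.get(r, 0) + amount

print(request_quota("team-a", {"cpu_cores": 4_000, "ram_gib": 16_000}))
return_quota({"cpu_cores": 2_000})
print(request_quota("team-b", {"cpu_cores": 11_000}))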
The sustainability benefits of central fleet

The central fleet approach aligns with Google's broader dedication to sustainability and a circular economy. By optimizing the use of our existing hardware, we can achieve carbon savings. For example, in 2024, our central fleet program helped avoid the procurement of new components and machines with an embodied impact equivalent to approximately 260,000 metric tons of CO2e. That roughly equates to avoiding 660 million miles driven by an average gasoline-powered passenger vehicle.1

This fulfillment flexibility leads to greater resource efficiency and a reduced carbon footprint in several ways:

Reduced electronic waste: By extending the life of our machines through reallocation and reuse, we minimize the need to manufacture new hardware and reduce the amount of electronic waste.

Lower embodied carbon: The manufacturing of new servers carries an embodied carbon footprint. By avoiding the creation of new machines, we avoid the associated CO2e emissions.

Increased energy efficiency: Central fleet allows for the strategic placement of workloads on the most power-efficient hardware available, optimizing energy consumption across our data centers.

Promotes a circular economy: This model is a prime example of circular economy principles in action, shifting from a linear "take-make-dispose" model to one that emphasizes reuse and longevity.

The central fleet initiative is more than an internal efficiency project; it's a tangible demonstration of embedding sustainability into our core business decisions. By rethinking how we manage our infrastructure, we can meet growing AI and cloud demand while paving the way for a more sustainable future. Learn more at sustainability.google.

1. Estimated avoided emissions were calculated by applying internal LCA emissions factors to the machine and component resources saved through our central fleet initiative in 2024. We input the estimated avoided emissions into the EPA's Greenhouse Gas Equivalencies Calculator to calculate the equivalent number of miles driven by an average gasoline-powered passenger vehicle (accessed August 2025). The data and claims have not been verified by an independent third party.
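As a quick sanity check on the mileage equivalence in footnote 1, using the EPA's commonly cited figure of roughly 400 grams of CO2 per mile for a typical gasoline-powered passenger vehicle (an external number, not stated in the post):

# Back-of-the-envelope check of the "660 million miles" equivalence.
avoided_co2e_metric_tons = 260_000   # from the post
grams_per_mile = 400                 # EPA's approximate figure for a typical passenger vehicle

miles = avoided_co2e_metric_tons * 1_000_000 / grams_per_mile
print(f"~{miles / 1e6:.0f} million miles")  # ~650 million miles, in line with the ~660 million cited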

Reddit is testing a way to read articles without leaving the app


As AI tools gobble up news publishers’ traffic on traditional referral platforms like Google, Reddit is offering publishers another way to share their content — within its app.  On September 10th, Reddit announced a slew of new features available to some publishers that are meant to help them better understand where their stories are being […]

The iPhone 17 is a shockingly great deal


Apple made a lot of noise about its super-slim iPhone Air and super-citrusy iPhone 17 Pro this week, but the company gave the base iPhone 17 some major updates that make it a really great deal - and all without raising its $799 starting price. Let's start with the iPhone 17's display, which is now […]