Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM

Most browser automation runs from the outside. Playwright, Puppeteer, Selenium, and browser-use all drive a browser from an external process. They typically read the page through screenshots or a remote-control protocol such as the Chrome DevTools Protocol or WebDriver.

Alibaba’s Page Agent takes the opposite path. The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts within the real user's session. It needs no headless browser, no screenshots, and no multimodal model.

The project is open source under the MIT license. The codebase is TypeScript-first. It builds on browser-use, from which its DOM processing and prompt are derived.

TL;DR

Page Agent runs inside the page as JavaScript, reading the live DOM as text, not screenshots.

DOM dehydration compresses the page into a FlatDomTree so smaller text models can act precisely.

It is model-agnostic through any OpenAI-compatible endpoint and ships under the MIT license.

Prompt-level safety and single-page scope are real limits; keep server-side validation for risky actions.

Best fit: copilots and form-filling inside apps you own, not external or locked-down sites.

What is Page Agent?

Page Agent is a client-side library for adding agent behavior to a web app. You embed it, then issue commands in natural language. The agent finds elements, clicks buttons, and fills forms from within the page.

Because it runs in the browser session, it inherits the user’s cookies, session, and authentication. There is no separate backend to write. The existing UI validation and security rules stay in place.

The design is model-agnostic. You bring your own large language model through any OpenAI-compatible endpoint. Only text is sent to the model, so a strong text model is enough.
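To make "OpenAI-compatible" concrete, here is a minimal sketch of the kind of text-only request such an endpoint accepts. The route (/chat/completions), the Bearer header, and the model/messages body follow the widely used OpenAI chat convention. The LlmConfig shape, the buildChatRequest helper, and the system prompt wording are hypothetical names chosen for illustration, not Page Agent's actual API.

```typescript
// Hypothetical config shape; field names are illustrative, not Page Agent's API.
interface LlmConfig {
  baseURL: string; // e.g. "https://api.example.com/v1" (placeholder host)
  apiKey: string;
  model: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a text-only request for an OpenAI-style /chat/completions route.
// pageText is the compact text description of the page; no image is attached.
function buildChatRequest(config: LlmConfig, pageText: string, command: string): ChatRequest {
  const messages: ChatMessage[] = [
    { role: "system", content: "You control a web page. Interactive elements:\n" + pageText },
    { role: "user", content: command },
  ];
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: config.baseURL.replace(/\/+$/, "") + "/chat/completions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({ model: config.model, messages }),
  };
}
```

Because the request carries only strings, swapping providers in this sketch comes down to changing baseURL, apiKey, and model.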

How DOM Dehydration Works

The core technique is what the team calls DOM dehydration. A modern page can hold thousands of nodes. Sending raw HTML to a model would be slow and expensive.

When a command arrives, the agent scans the Document Object Model. It identifies every interactive element, such as buttons, links, and input fields. Each element receives an index plus a role and a label.

The live DOM is converted into a FlatDomTree, a clean text map of what matters. Redundant markup is stripped out. The model reads this compact representation, not pixels.
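The idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not Page Agent's implementation: PageNode is a simplified stand-in for a DOM node so the code runs outside a browser, and FlatEntry, the tag list, the role mapping, and the output format are all hypothetical. The real FlatDomTree may look quite different.

```typescript
// Simplified stand-in for a DOM node (assumption: the real agent walks the live DOM).
interface PageNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: PageNode[];
}

// One line of the flat map: an index plus a role and a label.
interface FlatEntry {
  index: number;
  role: string;
  label: string;
}

// Illustrative subset of interactive tags.
const INTERACTIVE_TAGS = new Set(["button", "a", "input", "select", "textarea"]);

// Collects all text beneath a node, used as a fallback label.
function innerText(node: PageNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(innerText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

// Prefers an explicit role attribute, else maps the tag to a rough role name.
function roleOf(node: PageNode): string {
  const explicit = node.attrs?.["role"];
  if (explicit) return explicit;
  if (node.tag === "a") return "link";
  if (node.tag === "input" || node.tag === "textarea") return "textbox";
  if (node.tag === "select") return "combobox";
  return node.tag;
}

function labelOf(node: PageNode): string {
  return node.attrs?.["aria-label"] || node.attrs?.["placeholder"] || innerText(node);
}

// Depth-first walk in document order. Non-interactive wrappers are dropped;
// an interactive element's children only contribute to its label.
function dehydrate(root: PageNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: PageNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      entries.push({ index: entries.length, role: roleOf(node), label: labelOf(node) });
      return;
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return entries;
}

// Renders the entries as the compact text the model would read.
function render(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] ${e.role} "${e.label}"`).join("\n");
}

const samplePage: PageNode = {
  tag: "div",
  children: [
    { tag: "h1", text: "Sign in" },
    { tag: "input", attrs: { placeholder: "Email" } },
    { tag: "div", children: [{ tag: "button", children: [{ tag: "span", text: "Log in" }] }] },
    { tag: "a", text: "Forgot password?", attrs: { href: "/reset" } },
  ],
};

console.log(render(dehydrate(samplePage)));
// [0] textbox "Email"
// [1] button "Log in"
// [2] link "Forgot password?"
```

The heading, the wrapper divs, and the span disappear from the output. Only the three elements a user could act on remain, each addressable by its index.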

The interactive demo on this page mirrors this loop. Watch the “Dehydrated DOM” and “Action trace” panels update as commands run.
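The last step of the loop, turning the model's reply into an action on an indexed element, can be sketched as follows. This assumes the model answers with a small JSON object such as {"action":"click","index":1}. That reply format, the AgentAction type, and the parseAction and describeAction helpers are hypothetical and are not taken from Page Agent's actual protocol.

```typescript
// Hypothetical action vocabulary: click an element, or type text into it.
type AgentAction =
  | { action: "click"; index: number }
  | { action: "input"; index: number; text: string };

// Parses a model reply and rejects anything that does not reference
// one of the elementCount indices handed to the model.
function parseAction(reply: string, elementCount: number): AgentAction {
  const data: unknown = JSON.parse(reply);
  if (typeof data !== "object" || data === null) {
    throw new Error("reply is not an object");
  }
  const { action, index, text } = data as Record<string, unknown>;
  if (typeof index !== "number" || !Number.isInteger(index) || index < 0 || index >= elementCount) {
    throw new Error(`unknown element index: ${String(index)}`);
  }
  if (action === "click") return { action, index };
  if (action === "input" && typeof text === "string") return { action, index, text };
  throw new Error(`unsupported action: ${String(action)}`);
}

// Formats an action as a one-line trace entry.
function describeAction(a: AgentAction): string {
  return a.action === "click" ? `click [${a.index}]` : `input [${a.index}] "${a.text}"`;
}
```

Checking the index against the elements that were actually shown to the model is a cheap client-side guard in this sketch. It does not replace server-side validation for risky actions.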