No ChatGPT/Gemini manuals, no command syntax, no tutorials.
Just say what you want — and your AI does it.
That's the vision we had when building Nova-CUA — a JARVIS-like desktop agent that turns natural language into real actions on your computer.
- "Open LinkedIn, search for ML internships, and save them."
- "Open Firefox and buy me the latest AirPods from Amazon."
- "Launch YouTube and play Barcelona's latest match highlights."
All without lifting a finger.
Together with my teammates Archeet Shah and Peter Bui, I designed Nova-CUA, a desktop automation agent that combines LLMs with GUI grounding. It interprets your instructions, generates a plan, and directly operates your desktop to carry out tasks.
Our workflow included:
- Gemini 2.5 for planning and for generating the code that executes each task.
- InterVL-4B for GUI grounding — to identify and interact with on-screen elements like buttons, icons, and text.
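For the curious, here is a rough sketch of how a plan-then-ground loop like this could be wired together. It is not Nova-CUA's actual code: `Step`, `run_task`, `fake_planner`, and `fake_grounder` are hypothetical names, and the two stubs stand in for the planning model and the grounding model so the example runs without any API calls or real mouse control.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One planned GUI action (hypothetical structure, for illustration only)."""
    action: str      # e.g. "click" or "type"
    target: str      # natural-language description of the on-screen element
    text: str = ""   # text to type, if the action needs any


# A planner maps an instruction to steps; a grounder maps a target
# description to screen coordinates. Both are assumptions of this sketch.
Planner = Callable[[str], List[Step]]
Grounder = Callable[[str], Tuple[int, int]]


def run_task(instruction: str, planner: Planner, grounder: Grounder):
    """Plan the task, ground each step to coordinates, and record the actions.

    A real agent would send each action to the OS; this sketch only
    collects them in a list so the flow is easy to inspect.
    """
    performed = []
    for step in planner(instruction):
        x, y = grounder(step.target)
        performed.append((step.action, x, y, step.text))
    return performed


def fake_planner(instruction: str) -> List[Step]:
    # Stub standing in for the LLM planner: returns a fixed two-step plan.
    return [
        Step("click", "search box"),
        Step("type", "search box", "ML internships"),
    ]


def fake_grounder(target: str) -> Tuple[int, int]:
    # Stub standing in for the grounding model: looks up made-up coordinates.
    known = {"search box": (640, 120)}
    return known[target]


actions = run_task(
    "Open LinkedIn and search for ML internships",
    fake_planner,
    fake_grounder,
)
for action in actions:
    print(action)
```

Keeping the planner and the grounder behind simple function interfaces means either model can be swapped out without touching the loop.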
The most rewarding part for me was learning how multi-agent workflows are structured and how to integrate LLMs with GUI grounding for end-to-end automation.