SoMatic
Summary
Agents that automate desktop UIs spend more time guessing coordinates than doing work — clicking offscreen, missing dynamic elements, failing silently on apps that never exposed an API. SoMatic exists to replace that guessing with a local YOLO model that numbers every interactive element in a screenshot and hands the agent a structured map it can act against.
The core workflow is a CLI command that takes a screenshot, runs element detection locally, and returns numbered marks with coordinates as JSON, so agents target elements by ID rather than by fragile pixel hunts. Every action returns JSON, which means downstream agents can chain steps without parsing unstructured output. The self-hosted, MIT-licensed model runs on your own hardware, so no screenshot data leaves the machine. The limits show up with non-standard or highly dynamic UIs, where YOLO detection misses or mislabels elements; teams handling those cases add a fallback coordinate layer manually. With 18 GitHub stars at the time of scraping, the community is small, which means debugging edge cases happens in the codebase, not a forum.
Bottom line: Pick this for wiring an AI agent into a legacy desktop app or a PDF form workflow that has no API; reconsider when your target UI renders elements dynamically or changes layout frequently enough to break detection confidence.
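The manual fallback coordinate layer mentioned above might look roughly like the sketch below. It is not taken from SoMatic: the mark fields (`label`, `confidence`, `x`, `y`), the `FALLBACKS` table, and the 0.5 threshold are all assumptions made for illustration, since the listing does not document the JSON schema.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mark shape; the real SoMatic JSON schema may differ:
# {"id": int, "label": str, "confidence": float, "x": int, "y": int}
Mark = Dict[str, object]

# Hand-maintained coordinates for controls that detection is known to miss.
# The values are illustrative only.
FALLBACKS: Dict[str, Tuple[int, int]] = {"export_button": (1180, 64)}


def resolve_target(
    marks: List[Mark], label: str, min_confidence: float = 0.5
) -> Optional[Tuple[int, int, str]]:
    """Return (x, y, source) for `label`, preferring a detected mark."""
    candidates = [
        m for m in marks
        if m["label"] == label and m["confidence"] >= min_confidence
    ]
    if candidates:
        # Use the most confident detection when several marks share a label.
        best = max(candidates, key=lambda m: m["confidence"])
        return best["x"], best["y"], "detected"
    if label in FALLBACKS:
        x, y = FALLBACKS[label]
        return x, y, "fallback"
    return None  # Neither detection nor a fallback entry covers this label.
```

Returning the source alongside the coordinates lets a pipeline log how often it relies on fallbacks, which is a rough signal of how far the target UI sits from what the model detects well.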
Pricing Plans
Open Source (Free)
Full access to the SoMatic CLI and MCP server. MIT licensed.
- CLI tool for desktop UI automation
- Local YOLO vision detection
- MCP server support
- JSON output for all commands
- Headless Xvfb support
Pricing may have changed since last verified. Check the official site for current plans.
Pros
- Local YOLO-based element detection returns numbered marks as JSON, so agents target UI elements by stable ID rather than fragile pixel coordinates that break on resize or re-render.
- MCP server is included out of the box, so Claude and other MCP-compatible agents plug in without a custom integration layer — the handoff between agent decision and desktop action is a standard tool call.
- Headless Xvfb support means the same automation pipeline that works on a developer's desktop runs on a server with no display attached, so you do not maintain separate codebases for local and CI environments.
- MIT license and fully self-hosted execution means no screenshot data leaves your infrastructure, so automation against internal or regulated applications does not create a data-handling obligation with a vendor.
- Every CLI command returns JSON, which means agents can chain steps by parsing structured output rather than scraping human-readable text — reducing the failure surface in multi-step workflows.
Cons
- Detection quality depends entirely on the bundled YOLO model's training distribution: UIs with non-standard controls, heavily custom widgets, or frequent layout changes produce missed or mislabeled marks, and there is no documented fine-tuning path for teams whose target apps fall outside the model's coverage. Teams hitting this wall add manual coordinate fallbacks, which reintroduces the fragility SoMatic was meant to eliminate.
- The project is maintained by a single author with 18 stars and zero open issues at the time of scraping — not because everything works perfectly, but because the community debugging surface is nearly nonexistent. Teams that hit a detection edge case or a platform-specific headless failure debug the source directly; there is no forum, no commercial support, and no track record of response time on issues.
- There is no built-in action verification or retry logic described in the docs — the CLI returns JSON coordinates and executes actions, but confirming that a click produced the expected state change is the agent's responsibility. Pipelines that need reliable end-state confirmation build that verification layer themselves, which is the point at which teams with stricter reliability requirements evaluate alternatives like Playwright for browser targets or platform-native accessibility APIs for desktop targets.
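The verification layer that the last point leaves to the agent could be sketched as below. This is a generic act-then-confirm loop, not SoMatic functionality: `action`, `capture`, and `expected` are hypothetical callables the pipeline would supply, for example wrappers around whatever click and screenshot commands it already uses.

```python
import time
from typing import Callable, TypeVar

State = TypeVar("State")


def act_and_verify(
    action: Callable[[], None],
    capture: Callable[[], State],
    expected: Callable[[State], bool],
    retries: int = 3,
    delay_s: float = 0.5,
) -> int:
    """Run `action`, re-capture the UI state, and confirm `expected(state)`.

    Returns the attempt number on which the state was confirmed.
    Raises RuntimeError if the state is never confirmed.
    """
    for attempt in range(1, retries + 1):
        action()
        state = capture()
        if expected(state):
            return attempt
        time.sleep(delay_s)  # Give the UI time to settle before retrying.
    raise RuntimeError(f"state not confirmed after {retries} attempts")
```

Note that this sketch repeats the action on every retry, which is only safe for idempotent actions such as focusing a field. For a submit button, a safer variant would re-capture and re-check several times before acting again.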
About
- Platforms: Linux, macOS, Windows (via npm + Python runtime)
- API Available: Yes
- Self-Hosted: Yes
- Last Updated: 2026-06-09T08:16:09.589Z
Best For
Who it's for
- AI coding agents and agentic frameworks
- Teams building reliable automation infrastructure
- Developers integrating agents with legacy desktop software
- Headless server automation requiring visual understanding
- Browser-native and non-browser UI interaction workflows
What it does well
- Automating cross-platform desktop workflows for AI agents
- Reliable web scraping and interaction for LLM-driven bots
- PDF form filling and document navigation via UI automation
- Testing native applications and complex UI scenarios
- Building MCP-powered automation pipelines for Claude and other agents
Frequently Asked Questions
- Is SoMatic free?
  Yes. SoMatic is fully free to use; there is no paid tier.
- Is SoMatic open source?
  Yes. SoMatic is open source under the MIT license.
- Does SoMatic have an API?
  Yes, in the form of a JSON-returning CLI and an MCP server that run locally; there is no hosted API. See the official documentation at https://github.com/smyan1909/somatic for details.
- Can I self-host SoMatic?
  Yes. SoMatic supports self-hosting on your own infrastructure.
- What platforms does SoMatic support?
  SoMatic is available on Linux, macOS, and Windows (via npm + Python runtime).
Most desktop automation breaks the moment the UI shifts — a new dialog, a resized window, an element that moved two pixels. SoMatic runs a local YOLO model against a screenshot of any native desktop application, detects and numbers every interactive element, and returns a JSON coordinate map the agent can act against. Elements can be targeted by mark ID, by nearest-mark offset, or by direct pixel coordinate. The public binary is `somatic`; install is via npm or pip.
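The three targeting modes can be pictured as one resolver that always ends in a pixel coordinate. The sketch below is an interpretation, not SoMatic's implementation: the `Target` type, the `marks` mapping from mark ID to center point, and the reading of "nearest-mark offset" as a pixel offset from a chosen mark are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Target:
    mark_id: Optional[int] = None   # Target a detected element by its number.
    offset: Point = (0, 0)          # Pixel offset applied relative to that mark.
    point: Optional[Point] = None   # Direct pixel coordinate, bypassing marks.


def to_pixels(target: Target, marks: Dict[int, Point]) -> Point:
    """Resolve a target to (x, y). `marks` maps mark ID to center point."""
    if target.point is not None:
        return target.point
    if target.mark_id is None:
        raise ValueError("target needs a mark_id or a point")
    if target.mark_id not in marks:
        raise KeyError(f"mark {target.mark_id} was not detected")
    x, y = marks[target.mark_id]
    dx, dy = target.offset
    return x + dx, y + dy
```

For example, with `marks = {7: (400, 300)}`, `Target(mark_id=7)` resolves to the mark itself, `Target(mark_id=7, offset=(20, -5))` to a point just beside it, and `Target(point=(5, 6))` ignores the marks entirely.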
The differentiating feature is the Set-of-Marks approach: instead of asking an agent to infer where a button lives, SoMatic assigns each detected element a numbered mark and hands those numbers back as structured data. The agent sees, for example, that element 7 is a submit button at given coordinates, and clicks it by ID. No screen-reading heuristics, no brittle XPath, no browser-only DOM access. This makes it usable against native apps, PDFs, and any UI that does not expose a programmatic interface.
SoMatic ships an MCP server, which means Claude and other MCP-compatible agents can call it as a tool in an automation pipeline — the agent requests a screenshot, receives the mark map, decides which element to interact with, and issues the action, all within a standard MCP handoff. Headless Xvfb support is included for server environments where there is no physical display. Both paths — interactive desktop and headless server — use the same JSON-returning CLI surface, so the integration contract does not change between environments.
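One way to keep a single integration contract across desktop and headless runs is a thin wrapper that executes any JSON-printing command and only varies the `DISPLAY` environment variable. The sketch below assumes an Xvfb server was started separately (for example on `:99`). It deliberately does not show real `somatic` subcommands, because the listing does not specify them; the demo uses a stand-in child Python process instead.

```python
import json
import os
import subprocess
import sys
from typing import Any, List, Optional


def run_json(cmd: List[str], display: Optional[str] = None, timeout_s: float = 30) -> Any:
    """Run a command that prints JSON on stdout and return the parsed value.

    `display` overrides DISPLAY for the child process only, e.g. ":99" for
    an Xvfb server; leave it as None to inherit the desktop session.
    """
    env = dict(os.environ)
    if display is not None:
        env["DISPLAY"] = display
    result = subprocess.run(
        cmd, capture_output=True, text=True, env=env,
        timeout=timeout_s, check=True,  # Non-zero exit raises CalledProcessError.
    )
    return json.loads(result.stdout)


# Stand-in for a real command: a child process that echoes DISPLAY as JSON.
demo_cmd = [
    sys.executable, "-c",
    "import json, os; print(json.dumps({'display': os.environ.get('DISPLAY')}))",
]
print(run_json(demo_cmd, display=":99"))
```

Because the wrapper fails loudly on a non-zero exit code or unparseable output, a chained pipeline stops at the first broken step rather than acting on stale data.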
The project is MIT-licensed and fully open-source, hosted under a single-maintainer GitHub repository. There are no paid tiers, no hosted API, and no telemetry described in the docs. Detection runs locally, so screenshots of sensitive internal apps do not transit a third-party service. The tradeoff is that model quality and detection coverage depend entirely on what the bundled YOLO weights handle — the vendor does not describe fine-tuning paths for custom UI patterns in the available documentation.
