AIDiveForge
License: MIT Any use incl. commercial
Local-run terms: MIT licensed. Users may use, modify, and distribute the code freely for any purpose, including commercial, provided they include a copy of the MIT license.

SoMatic

Free, Open Source, API, Self-Hosted, Agentic

Summary

Agents that automate desktop UIs spend more time guessing coordinates than doing work — clicking offscreen, missing dynamic elements, failing silently on apps that never exposed an API. SoMatic exists to replace that guessing with a local YOLO model that numbers every interactive element in a screenshot and hands the agent a structured map it can act against.

The core workflow is a CLI command that takes a screenshot, runs element detection locally, and returns numbered marks with coordinates as JSON, so agents target elements by ID rather than by fragile pixel hunts. Every action returns JSON, which means downstream agents can chain steps without parsing unstructured output. The self-hosted, MIT-licensed model runs on your own hardware, so no screenshot data leaves the machine. Detection is the limiting factor: on non-standard or highly dynamic UIs, the YOLO model can miss or mislabel elements, and teams handling those cases add a manual fallback coordinate layer. With 18 GitHub stars at the time of scraping, the community is small, so debugging edge cases happens in the codebase, not a forum.
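The chaining point can be illustrated with a small wrapper that runs any JSON-printing command and hands back a parsed object. This is a sketch under assumptions: `run_json_command` is a hypothetical helper, not part of SoMatic, and the demo deliberately calls a child Python process instead of the real CLI because the listing does not document SoMatic's subcommands or output schema.

```python
import json
import subprocess
import sys


def run_json_command(argv, timeout=30.0):
    """Run a command and parse its stdout as one JSON document.

    Hypothetical helper: raises RuntimeError on a non-zero exit code or on
    output that is not valid JSON, so a chained step fails loudly.
    """
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{argv[0]} did not print valid JSON: {exc}") from exc


# Stand-in for a real CLI call: a child Python process that prints JSON.
# The payload shape is invented for this demo and is not SoMatic's schema.
fake_cli = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"ok": True, "marks": [{"id": 7}]}))',
]
result = run_json_command(fake_cli)
first_mark_id = result["marks"][0]["id"]
```

Because each step yields a dictionary rather than free text, the next step can read a field such as a mark ID directly. To use this against the real tool, substitute the actual command line from the project's documentation.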

Bottom line: Pick this for wiring an AI agent into a legacy desktop app or a PDF form workflow that has no API; reconsider when your target UI renders elements dynamically or changes layout frequently enough to break detection confidence.

Pricing Plans

Open Source (Free)

Full access to the SoMatic CLI and MCP server. MIT licensed.

  • CLI tool for desktop UI automation
  • Local YOLO vision detection
  • MCP server support
  • JSON output for all commands
  • Headless Xvfb support

Pricing may have changed since last verified. Check the official site for current plans.

Pros

  • Local YOLO-based element detection returns numbered marks as JSON, so agents target UI elements by stable ID rather than fragile pixel coordinates that break on resize or re-render.
  • An MCP server is included out of the box, so Claude and other MCP-compatible agents plug in without a custom integration layer; the handoff between agent decision and desktop action is a standard tool call.
  • Headless Xvfb support means the same automation pipeline that works on a developer's desktop runs on a server with no display attached, so you do not maintain separate codebases for local and CI environments.
  • The MIT license and fully self-hosted execution mean no screenshot data leaves your infrastructure, so automation against internal or regulated applications does not create a data-handling obligation with a vendor.
  • Every CLI command returns JSON, so agents can chain steps by parsing structured output rather than scraping human-readable text, which reduces the failure surface in multi-step workflows.

Cons

  • Detection quality depends entirely on the bundled YOLO model's training distribution. UIs with non-standard controls, heavily custom widgets, or frequent layout changes produce missed or mislabeled marks, and there is no documented fine-tuning path for teams whose target apps fall outside the model's coverage. Teams hitting this wall add manual coordinate fallbacks, which reintroduces the fragility SoMatic was meant to eliminate.
  • The project is maintained by a single author, with 18 stars and zero open issues at the time of scraping. The empty issue tracker reflects a nearly nonexistent community debugging surface, not a bug-free tool. Teams that hit a detection edge case or a platform-specific headless failure debug the source directly; there is no forum, no commercial support, and no track record of response time on issues.
  • There is no built-in action verification or retry logic described in the docs. The CLI returns JSON coordinates and executes actions, but confirming that a click produced the expected state change is the agent's responsibility. Pipelines that need reliable end-state confirmation build that verification layer themselves, which is the point at which teams with stricter reliability requirements evaluate alternatives like Playwright for browser targets or platform-native accessibility APIs for desktop targets.
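The verification layer mentioned in the last point could look roughly like the following. This is a minimal sketch, not something SoMatic provides: `act_and_verify` is a hypothetical helper, and `act` and `verify` are callables you would supply (for example, "click mark 7" and "a fresh screenshot shows the confirmation dialog").

```python
import time


def act_and_verify(act, verify, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call act(), then verify(); retry until verify() is true.

    Returns the 1-based attempt number that succeeded, or raises
    RuntimeError once all attempts are used up. `sleep` is injectable so
    the helper can be exercised without real waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        act()
        if verify():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise RuntimeError(f"expected state not confirmed after {attempts} attempts")


# Demo with an in-memory stand-in for the desktop: the "dialog" only
# opens on the second click, so the first verification fails.
state = {"clicks": 0}


def fake_click():
    state["clicks"] += 1


def dialog_is_open():
    return state["clicks"] >= 2


succeeded_on = act_and_verify(fake_click, dialog_is_open, sleep=lambda _s: None)
```

Blind retries only suit actions that are safe to repeat. For a non-idempotent action such as submitting a form, it is usually safer to re-check the state before each new attempt than to click again.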

About

Platforms: Linux, macOS, Windows (via npm + Python runtime)
API Available: Yes
Self-Hosted: Yes
Last Updated: 2026-06-09T08:16:09.589Z

Best For

Who it's for

  • AI coding agents and agentic frameworks
  • Teams building reliable automation infrastructure
  • Developers integrating agents with legacy desktop software
  • Headless server automation requiring visual understanding
  • Browser-native and non-browser UI interaction workflows

What it does well

  • Automating cross-platform desktop workflows for AI agents
  • Reliable web scraping and interaction for LLM-driven bots
  • PDF form filling and document navigation via UI automation
  • Testing native applications and complex UI scenarios
  • Building MCP-powered automation pipelines for Claude and other agents

Integrations

Claude Code, Cursor, Continue, 30+ other agents; MCP protocol; Xvfb headless

Frequently Asked Questions

Is SoMatic free?
Yes — SoMatic is fully free to use. There is no paid tier.
Is SoMatic open source?
Yes. SoMatic is open source.
Does SoMatic have an API?
Yes. SoMatic exposes a developer API in the form of a JSON-returning CLI and an MCP server that run locally; there is no hosted API. See the official documentation at https://github.com/smyan1909/somatic for details.
Can I self-host SoMatic?
Yes. SoMatic supports self-hosting on your own infrastructure.
What platforms does SoMatic support?
SoMatic is available on: Linux, macOS, Windows (via npm + Python runtime).

SoMatic

Most desktop automation breaks the moment the UI shifts — a new dialog, a resized window, an element that moved two pixels. SoMatic runs a local YOLO model against a screenshot of any native desktop application, detects and numbers every interactive element, and returns a JSON coordinate map the agent can act against. Elements can be targeted by mark ID, by nearest-mark offset, or by direct pixel coordinate. The public binary is `somatic`; install is via npm or pip.
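The three targeting modes can be pictured as one function that reduces every request to a pixel coordinate. This is an illustrative sketch only: the `Mark` class, the tuple-shaped targets, and the reading of "nearest-mark offset" as "a mark's centre plus a dx/dy offset" are assumptions made for the example, not SoMatic's documented interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """Hypothetical detected element: a numeric ID and a centre point."""
    id: int
    x: int
    y: int


def resolve_target(marks, target):
    """Turn a target description into an (x, y) pixel coordinate.

    Assumed target shapes (invented for this sketch):
      ("mark", mark_id)            -> centre of that mark
      ("offset", mark_id, dx, dy)  -> centre of that mark shifted by (dx, dy)
      ("pixel", x, y)              -> the coordinate as given
    """
    by_id = {mark.id: mark for mark in marks}
    kind = target[0]
    if kind == "mark":
        mark = by_id[target[1]]
        return (mark.x, mark.y)
    if kind == "offset":
        _, mark_id, dx, dy = target
        mark = by_id[mark_id]
        return (mark.x + dx, mark.y + dy)
    if kind == "pixel":
        return (target[1], target[2])
    raise ValueError(f"unknown target kind: {kind!r}")


marks = [Mark(id=6, x=140, y=115), Mark(id=7, x=140, y=215)]
by_mark = resolve_target(marks, ("mark", 7))
by_offset = resolve_target(marks, ("offset", 7, 10, -5))
by_pixel = resolve_target(marks, ("pixel", 5, 6))
```

The offset and pixel modes are the natural place for a manual fallback when detection misses an element, at the cost of bringing back some of the coordinate fragility that mark IDs avoid.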

The differentiating feature is the Set-of-Marks approach: instead of asking an agent to infer where a button lives, SoMatic assigns each detected element a numbered mark and hands those numbers back as structured data. The agent sees, for example, that element 7 is a submit button at given coordinates, and clicks it by ID. No screen-reading heuristics, no brittle XPath, no browser-only DOM access. This makes it usable against native apps, PDFs, and any UI that does not expose a programmatic interface.
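To make the "element 7 is a submit button" idea concrete, here is a sketch of how an agent might consume a mark map. The JSON layout (`marks`, `id`, `label`, `bbox` as `[x1, y1, x2, y2]`) is a made-up schema for illustration; check the project's documentation for the real output format.

```python
import json

# Invented sample payload; not SoMatic's actual output schema.
SAMPLE_MARK_MAP = """
{"marks": [
  {"id": 6, "label": "text field",    "bbox": [40, 100, 240, 130]},
  {"id": 7, "label": "submit button", "bbox": [100, 200, 180, 230]}
]}
"""


def index_marks(payload):
    """Build an ID -> mark lookup from the parsed payload."""
    return {mark["id"]: mark for mark in payload["marks"]}


def bbox_center(bbox):
    """Integer centre of an [x1, y1, x2, y2] bounding box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def find_by_label(marks_by_id, text):
    """IDs of marks whose label contains `text`, in ascending ID order."""
    return sorted(
        mark_id
        for mark_id, mark in marks_by_id.items()
        if text in mark["label"]
    )


marks_by_id = index_marks(json.loads(SAMPLE_MARK_MAP))
submit_ids = find_by_label(marks_by_id, "submit")
click_point = bbox_center(marks_by_id[submit_ids[0]]["bbox"])
```

The agent reasons over IDs and labels; pixel coordinates appear only at the last moment, derived from the map rather than guessed.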

SoMatic ships an MCP server, which means Claude and other MCP-compatible agents can call it as a tool in an automation pipeline — the agent requests a screenshot, receives the mark map, decides which element to interact with, and issues the action, all within a standard MCP handoff. Headless Xvfb support is included for server environments where there is no physical display. Both paths — interactive desktop and headless server — use the same JSON-returning CLI surface, so the integration contract does not change between environments.
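The screenshot, mark map, decision, and action cycle described above can be sketched against an abstract backend. Everything here is hypothetical scaffolding: `DesktopBackend`, `capture_marks`, and `click_mark` are names invented for the example, standing in for whatever the MCP tools or CLI commands are actually called. The point is only that the agent-side loop stays the same whether the backend is an interactive desktop or a headless server.

```python
from typing import Callable, List, Optional, Protocol


class DesktopBackend(Protocol):
    """Hypothetical interface; a real one would wrap MCP tool calls or the CLI."""

    def capture_marks(self) -> List[dict]: ...

    def click_mark(self, mark_id: int) -> dict: ...


def run_step(
    backend: DesktopBackend,
    choose: Callable[[List[dict]], Optional[int]],
) -> dict:
    """One cycle: get the mark map, let the agent choose, issue the action."""
    marks = backend.capture_marks()
    mark_id = choose(marks)
    if mark_id is None:
        return {"ok": False, "reason": "no matching mark"}
    return backend.click_mark(mark_id)


class FakeBackend:
    """In-memory stand-in so the loop can run without a display."""

    def __init__(self) -> None:
        self.clicked: List[int] = []

    def capture_marks(self) -> List[dict]:
        return [{"id": 3, "label": "cancel"}, {"id": 7, "label": "submit"}]

    def click_mark(self, mark_id: int) -> dict:
        self.clicked.append(mark_id)
        return {"ok": True, "clicked": mark_id}


def choose_submit(marks: List[dict]) -> Optional[int]:
    """Toy decision policy: pick the first mark labelled 'submit'."""
    for mark in marks:
        if mark["label"] == "submit":
            return mark["id"]
    return None


backend = FakeBackend()
outcome = run_step(backend, choose_submit)
```

In a real pipeline the `choose` callable would be the model's decision, and the fake backend would be replaced by one that talks to the SoMatic MCP server or CLI.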

The project is MIT-licensed and fully open-source, hosted under a single-maintainer GitHub repository. There are no paid tiers, no hosted API, and no telemetry described in the docs. Detection runs locally, so screenshots of sensitive internal apps do not transit a third-party service. The tradeoff is that model quality and detection coverage depend entirely on what the bundled YOLO weights handle — the vendor does not describe fine-tuning paths for custom UI patterns in the available documentation.
