Thoughts on the LLM and Ui.Vision Integration Test

I’ve spent the last couple of days testing the integration between Ui.Vision and LLMs. Here are my observations and thoughts:

  1. aiComputerUse consumes a significant number of tokens. Integrating with locally hosted open-source LLMs, such as Meta’s Llama or Qwen, might be a more cost-efficient approach (see the sketch after this list).
  2. I’m unsure how to leverage aiComputerUse to manage internal enterprise applications, such as internal forms and CRM systems. However, if Computer Use can automate tasks based on prompts, it could reduce RPA development work and shift the focus to data and prompt preparation. This could be the future of RPA development.
  3. Existing prompt-driven commands, such as aiScreenXY, help handle dynamic web responses and minimize change requests caused by business design changes. I’m excited to explore integrating these features with popular LLMs like OpenAI, Azure OpenAI, Gemini, and DeepSeek. It’s thrilling to test LLMs with RPA and to drive mouse movements, clicks, and form filling.
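To make point 1 concrete, here is a minimal sketch of what routing a prompt to a locally hosted open-weight model could look like, assuming an OpenAI-compatible server such as Ollama is already running; the endpoint, key, and model name are examples, not Ui.Vision settings:

```python
# Sketch: send a prompt to a locally hosted open-weight LLM instead of a
# paid cloud API. Assumes an OpenAI-compatible server (e.g. Ollama or vLLM)
# is already running at the URL below; the model name is an example.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # local servers typically accept any placeholder key
)

response = client.chat.completions.create(
    model="qwen2.5:7b",  # any locally pulled open-weight model
    messages=[{"role": "user", "content": "Extract the invoice number from: ..."}],
)
print(response.choices[0].message.content)
```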

Please continue the excellent work and provide more integration options with different LLMs.

Thanks!


Thanks for testing our LLM integration. Some thoughts:

  1. aiComputerUse uses the Claude Computer Use interface. To my knowledge, Meta, Qwen, DeepSeek, Mistral, etc. have no comparable feature yet. By contrast, aiScreenXY and especially aiPrompt could be connected to our other LLM providers. We will do this when there is enough demand for it and/or as part of our free tech support for Enterprise Customers.

  2. aiComputerUse is very powerful. Its main drawbacks for RPA automation at the moment are (1) reliability/repeatability (it is in Beta, and sometimes the LLM does not do what you expect it to) and (2) cost (as you said).

  3. By contrast, aiScreenXY and aiPrompt are very reliable and do not consume much compute. From our perspective, these two commands are ready for production use.

The recently announced OpenAI Computer Use API appears to offer similar functionality to Claude’s existing offering. Will you be providing API connectivity with OpenAI in the near future?


Maybe Ui.Vision could support OpenRouter, which would open up more options and experimentation?
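For context, OpenRouter exposes an OpenAI-compatible API, so a single integration could fan out to many models. A minimal sketch of the idea (the model ID and key are placeholders; this is not an existing Ui.Vision feature):

```python
# Sketch: one OpenAI-compatible client, many models, via OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
    api_key="sk-or-...",  # your OpenRouter key
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # swap for any model OpenRouter lists
    messages=[{"role": "user", "content": "Summarize this page text: ..."}],
)
print(response.choices[0].message.content)
```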


Another idea to consider is enabling the UI to support access points and API tokens for multiple LLM models. With aiScreenXY and aiPrompt, a UI-level loop could control screen capturing and simulate mouse clicks by moving the cursor to specific screen locations.
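To illustrate, here is a rough standalone sketch of such a loop: capture the screen, ask a vision model for coordinates, then move the cursor and click. Inside Ui.Vision the same roles would be played by aiScreenXY and XClick in a macro loop; the model name and the “x,y” reply format here are assumptions:

```python
# Sketch: screen capture -> vision LLM -> simulated mouse click, in a loop.
# Assumes pyautogui and any OpenAI-compatible vision model; the "x,y" reply
# format is a convention this sketch imposes via the prompt.
import base64, io, time

import pyautogui
from openai import OpenAI

client = OpenAI()  # or point base_url at a local or OpenRouter endpoint

def locate(description: str) -> tuple[int, int]:
    """Ask the vision model for the pixel coordinates of a UI element."""
    shot = pyautogui.screenshot()
    buf = io.BytesIO()
    shot.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Reply with only 'x,y' pixel coordinates of: {description}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    x, y = resp.choices[0].message.content.strip().split(",")
    return int(x), int(y)

for target in ["the Login button", "the Submit button"]:
    x, y = locate(target)
    pyautogui.moveTo(x, y, duration=0.3)  # move the cursor to the reported spot
    pyautogui.click()
    time.sleep(1)  # give the UI time to react before the next capture
```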


Thanks for the suggestions; all noted and added to our LLM to-do list.

In order to implement the best possible generic LLM integration, we would love to understand why the Claude API is not “good enough”. :thinking:

In other words, is there something (a certain task, use case, project,…) that fails with Claude and you hope to get it working by using OpenAI, Gemini, DeepSeek, or Mistral?

Or is this about saving costs by using cheaper LLM APIs or even running the LLM inference locally?

When deploying Computer Use in an enterprise environment, information security is a top consideration. For CU to be viable there, it needs to run against a locally deployed, open-source language model with vision capabilities.

While Claude can be used for POC purposes, it’s not suitable for real-world production in an enterprise setting.

Recently, I’ve deployed Qwen2.5-VL-72B-Instruct in my lab and am eager to explore CU’s capabilities, with the goal of replacing some traditional RPA solutions.
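For anyone reproducing a setup like this: vLLM can serve Qwen2.5-VL behind an OpenAI-compatible endpoint (e.g. `vllm serve Qwen/Qwen2.5-VL-72B-Instruct`), which keeps the client side tiny. A sketch, with host, port, and image URL as examples; nothing leaves the machine, which is the point for enterprise security:

```python
# Sketch: query a locally served Qwen2.5-VL instance through vLLM's
# OpenAI-compatible endpoint. Host, port, and image URL are examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the clickable UI elements in this screenshot."},
            {"type": "image_url",
             "image_url": {"url": "http://localhost:8080/screenshot.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```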


We totally agree. Just as we offer local computer vision and local OCR, the goal is to offer local LLM support as well.

As of today, we have not found a model that has good “Computer Use”-like features and can run on a regular local machine (e.g. a PC with a graphics card, a Mac M4, etc.). But hopefully this will be available soon.

We have not tested Qwen yet, but thanks for the hint. It sounds promising:

* Being agentic: Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools, which is capable of computer use and phone use.

:point_right: :test_tube: For anyone who wants to experiment with local LLMs (e.g. Qwen) before we have the time to do it: the full Ui.Vision RPA source code is on GitHub. The Computer Use logic and the Anthropic integration are in the Scr/Services/AI folder.

Inspired by Tesla’s vision-based Autopilot technology, I predict that vision-enabled LLMs will be integrated with PCs in the future, effectively creating a “real AI PC.”

This setup should support real-time screen capturing and local inference, similar to Tesla’s approach, which captures 2,000 images per second for analysis by the onboard GPU, enabling rapid decision-making and actions.

A user recently asked us an interesting question: “Can I integrate OpenAI’s Operator with your system? My automation system is meant to integrate Operator with Instagram to receive leads interested in scheduling appointments at my clinic. I need an RPA (Robotic Process Automation) tool that interacts with Operator, connects to my clinic’s proprietary system (which doesn’t have an API), checks available time slots in the calendar, offers them to the lead via Operator, and then, once the lead confirms a time, automatically books it and marks that slot as unavailable in my cloud-based proprietary calendar system. Can your system achieve this to make the scheduling process 100% automated?”

Our suggestion is to first try to achieve the desired result using the Anthropic Computer Use integration that is currently available inside Ui.Vision.

Once that works, if needed, the second step would be to optimize the solution for cost (e.g. switch to OpenAI’s Operator + Ui.Vision? We plan to test that soon) or to optimize for privacy and use local LLMs.
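For readers curious what happens under the hood: a rough sketch of the request that starts an Anthropic Computer Use session. Ui.Vision wraps a loop like this internally; the tool type and beta flag follow Anthropic’s published API at the time of writing and may change:

```python
# Sketch: the core request behind a "Computer Use" session. The model replies
# with tool_use blocks (screenshot, mouse_move, left_click, ...) that the
# caller must execute, returning results until no more actions are requested.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # the virtual "computer" tool
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user",
               "content": "Open the calendar and find the next free slot."}],
    betas=["computer-use-2024-10-22"],
)

# Inspect the actions the model wants to take in this first turn.
for block in response.content:
    print(block.type, getattr(block, "input", None))
```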

Another user asked: Are there any plans to make the cheaper DeepSeek API available as an alternative to Anthropic?

Answer: Yes, we plan to add support for more LLM providers later this year.

Further LLM suggestions, questions, or even RPA GitHub pull requests are very welcome.

This use case might not need Computer Use and could be handled with traditional RPA.

DeepSeek does not support vision yet. We might try GPT-4o, Gemini 2.0, or Qwen.