Coin World Report:
Source: New Yuan
Claude 3.5 receives a major late-night upgrade!
As expected, Anthropic finally made a big move this week: the release of Claude 3.5 Haiku and an upgraded Claude 3.5 Sonnet.
However, the “super-sized” Opus has still not made an appearance.
Strikingly, the evolved Claude 3.5 Sonnet has surpassed OpenAI o1 across the board, making it the most powerful reasoning model available.
It has seen significant improvements in all areas, especially its industry-leading coding capability.
Claude 3.5 Haiku, meanwhile, matches the performance of the previous generation’s strongest model, Claude 3 Opus, while retaining Haiku’s cost and speed.
Now, Claude can operate a computer like a human: viewing the screen, moving the cursor, clicking buttons, and typing text!
According to Anthropic’s head of developer relations, “computer use” is the first step in a brand-new paradigm of human-machine interaction, and a new fundamental capability that AI models should possess.
Many start-ups specializing in browser agents became outdated overnight.
Netizens exclaimed, “The agents and workflows are going to change…”
An AI that can use a computer by itself?
In public beta, Anthropic has introduced a groundbreaking capability: computer use. Starting today, developers can direct Claude to use a computer the way a human does, via the API.
Claude 3.5 Sonnet is the first model to offer this feature in public beta.
The capability is still experimental and can be clumsy and error-prone; Anthropic chose to release it early in order to gather feedback from developers and improve it quickly.
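For illustration, here is a minimal sketch of what such a request might look like in Python. The tool type `computer_20241022`, the model identifier, and the beta flag in the comment are assumptions drawn from Anthropic’s beta documentation at release time and may change; verify them against the current docs before use.

```python
# Sketch of a computer-use request payload. Assumption: the tool type,
# model name, and beta flag follow Anthropic's beta docs at release time.

def build_computer_use_request(task: str,
                               width: int = 1024,
                               height: int = 768) -> dict:
    """Assemble a request that lets Claude propose screen actions.

    The computer tool declares the virtual display size so the model
    can reason about pixel coordinates when it asks for clicks.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",   # screenshot / mouse / keyboard tool
            "name": "computer",
            "display_width_px": width,
            "display_height_px": height,
        }],
        "messages": [{"role": "user", "content": task}],
    }

payload = build_computer_use_request("Open the calendar and add an event.")
# With the `anthropic` SDK and an API key, one would send it roughly as:
#   client.beta.messages.create(**payload, betas=["computer-use-2024-10-22"])
```

Note that the model only *proposes* actions in its response; executing them against a real screen is left to the caller.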
Why train AI to operate computers?
Anthropic noted that over the past few years, AI development has passed many milestones, such as complex logical reasoning and the ability to recognize and understand images.
The next breakthrough is AI that operates computers: if models can follow instructions to use any software, rather than relying on purpose-built tools for each task, that points toward the field’s future.
Basic computer operations
In this demo, the researchers at Anthropic gave Claude a challenging task:
“My friend is coming to San Francisco, and I want to watch the sunrise at the Golden Gate Bridge with him tomorrow morning. We will depart from Pacific Heights. Can you help us find a great viewing spot, check the driving time and sunrise time, and schedule a calendar event so that we have enough time to get there?”
Claude opened Google and started searching.
To check how far the Golden Gate Bridge is from the user’s residence, Claude opened a map and looked up the distance.
With the necessary information in hand, Claude opened the calendar and scheduled the event for the user.
Automated coding for website creation
Developers demonstrated how Claude controlled their laptop and smoothly completed a website programming task.
First, Claude navigated to Claude.ai in the developer’s Chrome browser and created a 90s-themed personal homepage for itself.
It entered the URL, typed prompts, and sent a request to another Claude.
Claude.ai returned some code, and the rendered result looked good, but the developer wanted to make some modifications to the website on their local computer.
They instructed Claude to download the file and open it in VS Code. Claude successfully completed these commands.
Then, the developer instructed Claude to start a server to view the file in a browser.
Claude opened the VS Code terminal and tried to start a server, but hit an error: the `python` command was not available on the machine.
By examining the terminal output, however, Claude diagnosed the problem on its own: it retried with `python3` and successfully started the server.
However, the terminal output showed an error, and an icon was missing from the top of the page. The developer asked Claude to identify the error and fix it in the file.
To their surprise, Claude found the line causing the error in VS Code, deleted the entire line, saved the file, and ran the website again.
This time, the website was correct!
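Claude’s improvised recovery, falling back to `python3` when `python` was missing, is the same defensive pattern a script might use. A minimal sketch (the port number is an arbitrary choice for illustration):

```python
import shutil

def pick_server_command(port: int = 8000):
    """Return a command to serve the current directory over HTTP,
    preferring `python` and falling back to `python3`, mirroring
    the retry Claude performed in the demo."""
    interp = shutil.which("python") or shutil.which("python3")
    if interp is None:
        return None  # neither interpreter is installed
    return [interp, "-m", "http.server", str(port)]

cmd = pick_server_command()
# If cmd is not None, subprocess.run(cmd) would start the server.
```

Checking `shutil.which` before launching avoids the "command not found" failure entirely, rather than discovering it from the error output afterward.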
Automated data retrieval for form filling
Suppose we need to fill out a supplier request form from “Ant Device Company,” but the required data is scattered across different parts of the computer. Can Claude help us complete it?
Claude began taking screenshots of the developer’s screen and quickly realized that the data it needed for “Ant Device Company” was not at hand.
It immediately switched to the CRM system and searched for the company. Once found, it scrolled through the records to gather all the required information, filled out the form, and submitted it.
This means that many tedious tasks in our work can be delegated to Claude!
Now, this feature is available in the API.
Companies such as Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already explored Claude’s new potential, allowing it to perform complex tasks consisting of dozens or even hundreds of steps.
For example, Replit is using Claude 3.5 Sonnet’s computer-use and user-interface navigation capabilities to build features for Replit Agent, enabling real-time evaluation during the application-building process.
Below human level, but with a promising future
How does the upgraded Claude 3.5 Sonnet perform at computer use?
In the OSWorld benchmark, it scored 14.9% in the screenshot-only category, well ahead of the next-best AI system (7.8%).
When more steps were allowed to complete each task, Claude’s score rose to 22.0%.
This indicates that repeated interaction between the model and its environment improves task performance.
Although this is a significant improvement over earlier systems, it still falls far short of human performance at 72.36%.
This suggests that there is still much room for improvement for Claude 3.5 Sonnet in the future.
After all, some operations that humans effortlessly perform, such as scrolling, dragging, and zooming, are currently challenging for Claude.
Upgraded Claude 3.5 Sonnet: a coding champion that outperforms o1
The upgraded Claude 3.5 Sonnet shows comprehensive improvements across industry benchmarks, with particularly significant breakthroughs in agentic coding and tool-use tasks.
In coding, its score on SWE-bench Verified improved from 33.4% to 49.0%.
This surpasses all publicly available models, including OpenAI o1-preview and systems purpose-built for agentic coding.
Furthermore, on TAU-bench (a benchmark evaluating agentic tool use), Claude 3.5 Sonnet also performed well:
Its score in the retail domain rose from 62.6% to 69.2%, and in the more challenging airline domain it jumped from 36.0% to 46.0%.
As the table below shows, the new Claude 3.5 Sonnet significantly surpasses GPT-4o on the GPQA (Diamond) benchmark.
Claude 3.5 Sonnet also sets a new industry standard on benchmarks for visual QA, mathematical reasoning, document visual question answering, chart question answering, and science diagrams.
It is worth mentioning that while the performance of the upgraded Claude 3.5 Sonnet has significantly improved, it maintains the same price and processing speed as the previous model.
Feedback from early testers further confirms that the upgraded Claude 3.5 Sonnet represents a qualitative leap in AI-driven coding.
GitLab:
In DevSecOps task testing, GitLab found that Claude 3.5 Sonnet delivered significantly stronger reasoning (up to 10% better in each use case) with no added latency, making it an ideal choice for powering complex software development workflows.
Cognition:
Applying the upgraded Claude 3.5 Sonnet to autonomous AI evaluations, Cognition found substantial gains in coding, planning, and problem-solving over the previous model.
The Browser Company:
Using the model to automate web workflows, The Browser Company found that Claude 3.5 Sonnet outperformed every model they had tested before.
In addition, before deployment, Claude 3.5 Sonnet underwent joint testing by the US AI Safety Institute (US AISI) and the UK AI Safety Institute (UK AISI).
Furthermore, Anthropic conducted its own evaluation of the model under its “Responsible Scaling Policy.”