Coin World News Report:
Source: NewSmart
Claude 3.5 welcomes a major upgrade in the late night!
As expected, Anthropic AI finally made a big move this week – the release of Claude 3.5 Haiku, and the all-new upgraded version, Claude 3.5 Sonnet.
However, the “super-sized” Opus has yet to make an appearance.
What is astonishing is that the evolved Claude 3.5 Sonnet has surpassed OpenAI o1 and is considered the most powerful reasoning model.
It has been significantly improved in all aspects, especially in its industry-leading coding ability.
Meanwhile, Claude 3.5 Haiku is on par with the previous generation’s most powerful Claude 3 Opus in terms of performance, cost, and speed, similar to the previous generation’s Haiku.
In fact, Claude can now operate computers like humans, not only being able to view screens and move cursors but also click buttons and type text!
The Head of Developer Relations at Anthropic stated that “computer usage” is the first step towards a new paradigm of human-computer interaction. It is also a new fundamental capability that AI models should possess.
Many start-ups specializing in browser agents have become outdated overnight.
Netizens have expressed their admiration: Agents and workflows are going to undergo a revolution…
An AI that can use a computer by itself?
During the public beta, Anthropic introduced a groundbreaking new feature: computer usage capability. From today onwards, developers can guide Claude to use computers like humans through an API.
Claude 3.5 Sonnet is the first model to provide this feature during the public beta.
Of course, this feature is still in the experimental stage and may be clumsy and prone to errors. Anthropic chose to release this feature early to obtain developer feedback and improve it quickly.
Why train AI to operate computers?
Anthropic stated that in the past few years, powerful AI development has achieved many milestones, such as complex logical reasoning and image recognition and understanding.
The next breakthrough is AI operating computers! If models can use all software without the need for custom tools, it represents the direction of the future.
Basic computer operations
In this demo, the researchers at Anthropic gave Claude a challenging task:
“My friend is coming to San Francisco, and I want to watch the sunrise together at the Golden Gate Bridge tomorrow morning. We will depart from the Pacific Heights. Can you help us find an excellent viewing spot, check the driving time and sunrise time, and then schedule a calendar event for us, allowing us enough time to get there?”
Claude opened Google and started searching.
How far is the Golden Gate Bridge from the user’s location? Claude will open the map to find the distance.
After obtaining the necessary information, it opened the calendar and scheduled the event for the user.
Automatic coding and website creation
The developers demonstrated how Claude manipulated its own laptop and smoothly completed a website programming task.
First, Claude navigated to Claude.ai in the Chrome browser on the laptop and created a personal homepage with a 90s theme.
It entered the URL by itself, typed in prompts, and sent a request to another Claude.
Claude.ai returned some code, and the rendered image looked good, but the developer wanted to make some modifications to the website on his local computer.
So he asked Claude to download the file and open it in VS Code. Claude successfully completed these instructions.
Then the developer asked Claude to start a server so that he could view the file in the browser.
Claude opened the VS Code terminal and tried to start the server but encountered an error: Python was not installed on the machine.
By checking the terminal output, Claude discovered the problem by itself! It tried again with Python 3 and successfully ran the server.
However, there was an error in the terminal output, and a file icon was missing from the top. The developer asked Claude to identify the error and fix it in the file.
To their surprise, Claude found the line that caused the error in VS Code, deleted the entire line, saved the file, and ran the website again.
This time, the website was correct!
Automatic data extraction and form filling
Suppose we need to fill out a supplier request form from “Ant Device Company,” but the data we need is scattered all over the computer. Can Claude help us with this?
It started by capturing screenshots of the developer’s screen and quickly realized that “Ant Device Company” was not on the form.
Immediately, it switched to the CRM system to search for the company. Once found, it scrolled through the page to find all the necessary information for filling out the form and submitted it.
This means that many tedious tasks we have to do in our work can now be handled by Claude!
Now, this feature is available in the API.
Well-known companies such as Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company are already exploring the new potential of Claude, allowing it to perform complex tasks consisting of dozens or even hundreds of steps.
For example, Replit is using Claude 3.5 Sonnet’s computer usage and user interface navigation capabilities to develop features for Replit Agent, providing real-time evaluations during the application building process.
Lower than humans, but with promising future prospects
How does the upgraded Claude 3.5 Sonnet perform in terms of computer usage?
In OSWorld tests, it scored 14.9% in a task category based solely on screen screenshots, significantly surpassing the second-ranked AI system (7.8%).
When allowed to perform more steps to complete tasks, Claude’s score improved to 22.0%.
This indicates that multiple interactions between the model and the environment can optimize task performance.
Although this result is a significant improvement from before, it still falls far behind the human performance of 72.36%.
This also implies that there is still much room for improvement for Claude 3.5 Sonnet in the future.
After all, some operations that humans effortlessly perform (scrolling, dragging, zooming) are extremely challenging for Claude at the moment.
Upgraded Claude 3.5 Sonnet, the coding champion that outperforms o1
The upgraded Claude 3.5 Sonnet has seen comprehensive improvements in various industry benchmark tests.
In particular, it has made significant breakthroughs in intelligent agent coding and tool usage tasks.
In terms of coding ability, its performance in the SWE-bench Verified test improved from 33.4% to 49.0%.
This surpasses all publicly available models, including OpenAI o1-preview and specialized systems designed for intelligent agent coding.
Furthermore, in the TAU-bench (a benchmark test evaluating intelligent agent tool usage), Claude 3.5 Sonnet also performed remarkably well:
Its score in the retail sector increased from 62.6% to 69.2%, and in the more challenging aviation field, it jumped from 36.0% to 46.0%.
From the table below, it can be seen that the new version of Claude 3.5 Sonnet significantly outperforms GPT-4o in the GPQA (Diamond) reasoning test benchmark.
Claude 3.5 Sonnet’s performance sets a new industry benchmark in visual QA, mathematical reasoning, document visual question answering, chart question answering, and scientific table benchmark tests.
It is worth mentioning that while achieving performance breakthroughs, the upgraded Claude 3.5 Sonnet maintains the same price and running speed as the previous model.
Feedback from some early testing users further confirms the significant leap in “quality” achieved by Claude 3.5 Sonnet in the AI-driven coding field.
GitLab:
In DevSecOps task tests, Claude 3.5 Sonnet’s reasoning ability significantly improved without increasing latency (up to 10% improvement in each use case), making it an ideal choice for driving complex software development processes.
Cognition:
Applying the upgraded Claude 3.5 Sonnet to autonomous AI evaluation has achieved substantial progress in coding, planning, and problem-solving compared to the previous model.
The Browser Company:
When automating web workflows using this model, Claude 3.5 Sonnet’s performance surpassed all previously tested models.
In addition, before deployment, Claude 3.5 Sonnet underwent joint testing at the US AI Security Institute (US AISI) and the UK AI Security Institute (UK AISI).
Furthermore, through self-evaluation, Anthropic demonstrates its commitment to “Responsible Scaling Policy.”As mentioned earlier, the upgraded version of Claude 3.5 Sonnet is now available for use on web pages and terminal apps.
The pricing for the API starts at $3 per million input tokens and $15 per million output tokens.
By using intelligent caching technology, costs can be reduced by up to 90%, and batch processing API can save 50% of costs.
Use Cases:
Claude 3.5 Sonnet can understand subtle instructions and context, identify and correct its own errors, and generate in-depth analysis and insights from complex data. With advanced encoding, visual recognition, and writing abilities, Claude 3.5 Sonnet can be applied to various scenarios.
– Simulating human computer operation
By integrating Claude through API, developers can guide Claude to use computers like humans – by observing the screen, moving the mouse, clicking buttons, and typing text. Claude 3.5 Sonnet is the first cutting-edge AI model that can reliably use computers in this way, although it is still experimental in the public testing phase, its capabilities will continue to improve over time.
– Code generation
Claude 3.5 Sonnet can assist in the entire software development lifecycle – from initial design to bug fixing, system maintenance to performance optimization. It can be directly integrated into products or used as an intelligent coding assistant through the Claude.ai platform.
– Intelligent dialogue systems
With enhanced reasoning abilities and an affable, natural tone, Claude 3.5 Sonnet is well-suited for developing intelligent dialogue systems that require cross-system data connection and execution of operations.
– Intelligent knowledge Q&A
Claude 3.5 Sonnet has the ability to process large-scale context and has a very low hallucination rate, making it an ideal choice for handling large knowledge bases, documents, and code libraries for Q&A tasks.
– Visual information extraction
Claude 3.5 Sonnet can easily extract information from visual materials such as charts, graphics, and complex diagrams, making it an ideal AI model for data analysis and data science tasks.
– Process automation
Claude 3.5 Sonnet can automate repetitive tasks or processes. It has industry-leading instruction execution capabilities and can handle complex processes and operations.
The brand new Claude 3.5 Haiku, an AI model that surpasses its predecessor
Compared to its predecessor, Claude 3.5 Haiku can be considered the “smallest cup”.
It is the fastest model developed by Anthropic.
Not only does it maintain similar operating costs and processing speed as Claude 3 Haiku, but it also improves in all aspects of its skills.
In fact, in multiple intelligent benchmark tests, Claude 3.5 Haiku has surpassed its predecessor, the powerful Claude 3 Opus model.
Similarly, Claude 3.5 Haiku performs exceptionally well in coding tasks.
For example, in the SWE-bench Verified test, it achieved a high score of 40.6%,
surpassing
many AI agents that use state-of-the-art publicly available models, including the original versions of Claude 3.5 Sonnet and GPT-4o.
Claude 3.5 Haiku has three outstanding advantages:
1. Low-latency response
2. More accurate instruction execution capabilities
3. More precise tool usage
These features make the model particularly suitable for user-facing product development, specialized sub-agent tasks, and generating personalized experiences based on massive data (such as purchase records, price information, or inventory data).
At the end of this month, Claude 3.5 Haiku will be launched on multiple platforms, including Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. (Initially, it will be introduced as a pure text model and later include image input capabilities)
The pricing for Claude 3.5 Haiku starts at $0.25 per million input tokens and $1.25 per million output tokens.
By using prompt caching technology, costs can be reduced by up to 90%, and using message batch processing API can save 50% of costs.
Use Cases:
With its fast processing speed, improved instruction execution capabilities, and more precise tool usage, Claude 3.5 Haiku is ideal for user-facing products, specialized auxiliary tasks, and generating personalized experiences from massive data.
– Code autocompletion
Claude 3.5 Haiku can provide fast and accurate code suggestions and completions, effectively speeding up the development workflow. It is particularly suitable for software development teams who want to simplify the coding process and increase productivity.
– Intelligent chatbots
With enhanced conversational abilities and fast response times, Claude 3.5 Haiku performs well in driving responsive chatbots that handle a large volume of user interactions. It is especially valuable for customer service, e-commerce, and educational platforms that require scalable interactive capabilities.
– Data extraction and automatic labeling
Claude 3.5 Haiku efficiently processes and categorizes information, performing exceptionally well in tasks such as rapid data extraction and automatic labeling. This capability is particularly useful for organizations that deal with large amounts of unstructured data in fields such as finance, healthcare, and research.
– Real-time content moderation
Through its improved reasoning and content understanding abilities, Claude 3.5 Haiku provides reliable and real-time content moderation services. This is highly valuable for social platforms, online communities, and media organizations that need to maintain large-scale security and appropriate content.
Teaching Claude to use computers
Anthropic states that operations that humans can easily perform, such as scrolling, dragging, and zooming, are still challenging for Claude.
Regarding risks such as spam, false information, and fraud, the company is seeking strategies from security agencies, such as developing detection systems to detect potential harm.
Research process
Anthropic laid the foundation for AI to recognize and interpret images through its work on tool usage and multimodal tasks.
Based on this, Claude also needs to reason how and when to perform actions based on screen content.
To do this, researchers train Claude to accurately calculate pixels in order to complete commands, as it needs to calculate how many pixels to vertically or horizontally move the mouse pointer to click in the correct position.
During this process, Claude quickly transitions from training on simple software like calculators and text editors to other applications (note that it does not allow internet access during this period).
This training enables Claude to convert user instructions into a series of logical steps and execute tasks. It can even self-correct and retry tasks when encountering obstacles.
Anecdote
Alex Albert, the head of developer relations at Anthropic, shared an interesting story about the team’s development of computer usage capabilities.
At that time, they held an engineer’s bug bash to ensure that all potential issues with the API were discovered.
This involved locking a group of engineers in a room for several hours.
Coincidentally, everyone was hungry. One engineer had a brilliant idea, “Why not let Claude order food for us on DoorDash?”
Surprisingly, about a minute later, Claude ordered pizza for the engineers.
Looking to the future
The ability for AI to operate computers represents a new approach to artificial intelligence development.
So far, LLM developers have been working to adapt tools to models, creating specialized environments for AI to perform various tasks using specially designed tools.
Now, Anthropic is taking a different approach – they choose to adapt the model to the tools. In other words, Claude can integrate seamlessly into our everyday computer environments and use existing software directly, just like humans do.
Although Claude has reached the current highest level, its operations are still relatively slow and prone to errors. Many operations we perform on computers in our daily lives, such as dragging and zooming, are still beyond Claude’s capabilities.
Additionally, the way Claude currently observes the screen is similar to quickly flipping through a “picture book” – by taking continuous screenshots and stitching them together, rather than observing a continuous video stream. This means it may miss some brief actions or notifications.
Interestingly, Anthropic encountered some amusing incidents while recording demos.
For example, in one demonstration, Claude accidentally clicked to stop a long-running screen recording, resulting in the loss of all the recorded footage.
In another coding demo, Claude suddenly became “distracted” and started browsing photos of Yellowstone National Park with great interest.
In summary, Claude’s current performance gives us great expectations for the future: the ability of AI to operate computers will progress rapidly, and one day even software development novices will be able to use it effortlessly.