AI vs. Freelance Coders: Who Wins?
Source: theregister.com
Freelance coders can rest easy, for now. AI models are capable of performing many of the real-world coding tasks that companies outsource, but they aren't as effective as human coders. That's according to researchers at PeopleTec, who compared how four LLMs performed on freelance coding jobs a couple of months ago.
David Noever, chief scientist, and Forrest McKee, AI/ML data scientist, both at PeopleTec, detailed their project in a paper titled "Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale."
The researchers used a Kaggle dataset of Freelancer.com jobs to create 1,115 programming and data analysis challenges that could be automatically evaluated. The programming tasks were assigned an average monetary value of $306 (median $250). The paper stated that completing every job would be worth about $1.6 million.
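The article does not reproduce the researchers' harness code, so the snippet below is only a minimal sketch of how a pay-weighted, automatically graded benchmark of this kind might be scored. The task format, the `solve` entry-point name, and the test data are all assumptions for illustration, not details from the paper.

```python
# Minimal sketch of a pay-weighted coding benchmark harness (illustrative only;
# task format, entry-point name, and values are assumptions, not the paper's).
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str        # freelance-style job description
    tests: list        # (input, expected_output) pairs
    value_usd: float   # dollar value taken from the job listing

def solve_with_model(task: Task) -> str:
    """Stand-in for an LLM call that returns candidate Python source code."""
    raise NotImplementedError

def run_task(task: Task) -> bool:
    """A task only counts (and 'pays out') if every test passes."""
    try:
        source = solve_with_model(task)   # model writes the code
        namespace: dict = {}
        exec(source, namespace)           # execute it (sandbox this in practice)
        solve = namespace["solve"]        # assumed entry-point name
        return all(solve(x) == expected for x, expected in task.tests)
    except Exception:
        return False                      # any failure forfeits the task's value

def evaluate(tasks: list) -> tuple:
    """Return (fraction of tasks solved, theoretical dollars earned)."""
    solved = [t for t in tasks if run_task(t)]
    return len(solved) / len(tasks), sum(t.value_usd for t in solved)
```

Read this way, a model's "earnings" are simply the summed prices of the tasks whose generated code passes every test, which is how the dollar figures that follow should be interpreted.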
AI Model Performance
The team evaluated Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral 7B, with the first two being commercial models and the last two open source. The researchers estimate that a human software engineer could solve over 95 percent of the challenges. None of the models performed that well, but Claude came the closest.
"Claude 3.5 Haiku narrowly outperformed GPT-4o-mini, both in accuracy and in dollar earnings," the paper states. Claude was able to capture about $1.52 million in theoretical payments out of a possible $1.6 million. It solved 877 tasks with all tests passing, which is 78.7 percent of the benchmark.
GPT-4o-mini solved 862 tasks (77.3 percent). Qwen 2.5 was third, solving 764 tasks (68.5 percent), and Mistral 7B solved 474 tasks (42.5 percent).
SWE-Lancer Benchmark
Noever said the project was created in response to OpenAI's SWE-Lancer benchmark, published in February. He said that the benchmark had accumulated a million dollars' worth of software tasks, making it unlike any other benchmark they'd seen, and they wanted to make it more universal.
The models had less success on OpenAI's SWE-Lancer benchmark than on the researchers' benchmark, possibly because the problems in the OpenAI study were more difficult. The payouts in OpenAI's SWE-Lancer study, against a total work value of $1 million, came to $403,325 for Claude 3.5 Sonnet, $380,350 for o1, and $303,525 for GPT-4o.
On a subset of tasks in the OpenAI study, even the best-performing model fell well short. The OpenAI paper says that Claude 3.5 Sonnet earns $208,050 on the SWE-Lancer Diamond set and resolves 26.2 percent of IC SWE issues; however, the majority of its solutions are incorrect.
AI Assisting Freelancers
While AI models can't replace freelance coders yet, Noever said people are using them to help with freelance software engineering tasks. He believes complete automation is coming soon, possibly within months.
According to Noever, AI models are being used to generate and answer freelance job requirements, and to score the answers. It's AI all the way down, which he finds phenomenal to watch.
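Noever's "AI all the way down" description maps onto a simple three-step loop. The sketch below is purely illustrative: the `call_llm` wrapper, model names, and prompts are hypothetical placeholders, not anything described in the paper.

```python
# Illustrative generate -> answer -> score loop ("AI all the way down").
# call_llm, the model names, and the prompts are hypothetical placeholders.
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion API is in use."""
    raise NotImplementedError

def ai_all_the_way_down(topic: str) -> dict:
    # 1. One model drafts the freelance-style job requirement.
    job = call_llm("requirements-model", f"Write a freelance job posting for: {topic}")
    # 2. A second model attempts the work.
    answer = call_llm("worker-model", f"Complete this job:\n{job}")
    # 3. A third model grades the submission (LLM-as-judge).
    score = call_llm("judge-model", f"Job:\n{job}\n\nSubmission:\n{answer}\n\nScore 0-10:")
    return {"job": job, "answer": answer, "score": score}
```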
Open Source Model Limitations
One interesting finding from the study was that open source models falter at around 30 billion parameters. Noever believes that more infrastructure is needed to complete these tasks.