Can AI agents autonomously develop and publish an iOS app?

We gave an AI agent an Apple Developer account, a Mac VM, and one task: build and publish an iOS app. It succeeded, at a cost of about $1,000.

CRUX #1 · April 2026

What we learned

  • The agent successfully built and published an iOS app with minimal human involvement.

    It did not execute the task flawlessly: it fabricated a phone number for the App Store review contact rather than asking us for one, lost track of credentials it had previously been given, and produced a screenshot for the App Store listing with visible formatting errors. These mistakes did not prevent the app from being published.

  • The experiment highlighted the relevance of platform frictions.

    Building and submitting the app took 45 minutes and $25. The remaining 10 days and $975 were spent waiting and monitoring. If platforms like Apple respond to cheaper app production by making submission harder, then how quickly agent-built apps spread will depend on platform policy as well as on what agents can do.

  • Small scaffolding decisions can have outsized cost implications.

    The agent spent nearly $975 on monitoring alone. While it optimized its monitoring over time, it spent $390 in the first 10 hours after submission before settling into a cheaper pattern. Minor imperfections in agentic scaffolding can be expensive.

  • Publishing an app on the App Store is not yet as simple as pressing a button.

    Setting up the experiment took about 8 hours across two dry runs. This involved configuring the agent’s computer environment, granting it the right permissions to interact with Apple’s and Google’s systems, and working around login issues that required manual fixes.

The task

The question of whether AI agents can write software has been extensively studied, both through benchmarks like SWE-Bench and TerminalBench, and through open-world evaluations like the C compiler and browser experiments discussed in our survey. Agents have shown strong coding capabilities, though questions of code quality and reliability remain unresolved. A task that has not been evaluated as closely is whether agents can handle the non-coding aspects of software deployment: satisfying platform requirements and interacting with real-world systems that they do not control, such as the Apple App Store, email, and web browsers.

We prompted the agent to develop and publish a simple app to the App Store. We weren’t primarily interested in the agent’s software engineering ability, but rather its ability to interact with Apple’s submission process. This process requires developers to configure signing certificates and provisioning profiles, prepare screenshots and metadata, draft and host a privacy policy at a public URL, fill out compliance questionnaires, and submit the app for review by Apple’s team. Reviewers may reject the app for technical or policy reasons, requiring the developer to diagnose the issue, make changes, and resubmit. This process typically takes several days and involves interacting with systems and reviewers that the developer does not control.

The agent was responsible for every step except those where human involvement is required by policy, such as setting up the Apple Developer account and hitting publish to release the app. It handled writing the code, building the app, preparing metadata, drafting and hosting a privacy policy, submitting for review, and handling any feedback. We provided the agent access to a Mac virtual machine (VM), a GitHub account, an Apple Developer account, and a Gmail account. The success criterion was whether the agent got the app published on the App Store. We logged how many manual interventions the agent needed. The agent had the option to ping the team for support; we monitored progress once a day.

If agents can do this autonomously (or are close to being able to), this serves as an early warning for Apple’s review processes, since agents might soon be able to publish thousands of apps without human help. The App Store has already seen an increase in the number of published apps, and fully autonomous development and publishing could increase submissions dramatically.

Agent setup

Scaffold

OpenClaw with Claude Opus 4.6 and adaptive thinking. A subagent verified outputs, and a 5-minute heartbeat checked for review updates from Apple.

Environment

macOS VM with Xcode, a Chrome browser controlled via Playwright, command-line access, screen recording and accessibility permissions, a Gmail account, a GitHub account, and an Apple Developer account.

Logging

All actions, reasoning chains, and screenshots were logged throughout the experiment.

We chose OpenClaw for its browser integration and support for long-running tasks. We used a general scaffold with no changes beyond prompting and giving it deeper access to the macOS VM. We were aware that OpenClaw carries known security risks, but we still used it because the point of our evaluation was to elicit capabilities rather than to develop a production-facing agent.

Setting up OpenClaw took about 8 hours. Key frictions included manually enabling command, file, and browser permissions (off by default); granting macOS privacy and accessibility settings for screenshots and UI control; working around a failed Gmail login and unreliable keychain access by switching to file-based credentials; and approving cliclick to handle macOS password dialogs. We ran two dry runs before the final evaluation to find and fix these issues, but we didn’t catch all of them: the agent ran into similar barriers during the real run.
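To illustrate what "file-based credentials" can look like, here is a minimal sketch in Python. It is not the setup used in the experiment: the JSON layout, the field names, and the `load_credentials` helper are all assumptions made for illustration. The idea is to keep secrets in one file at a known path and refuse to read it if other users on the machine could access it.

```python
import json
import stat
from pathlib import Path


def load_credentials(path: Path) -> dict:
    """Load credentials from a JSON file that only its owner can access.

    Hypothetical helper: the file layout and field names are assumptions.
    """
    mode = path.stat().st_mode
    # Refuse files that group members or other users can read, write, or execute.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} is accessible to other users; run chmod 600")
    with path.open() as f:
        return json.load(f)
```

A single fixed location (for example, a hypothetical `~/.agent/credentials.json`) also gives the agent one place to look, which may help with the kind of lost-credentials problem described later in this post.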

What happened

The agent took 45 minutes to develop a simple app for breathing exercises. This included developing the app, publishing a privacy policy using GitHub Pages, filling out the App Store review forms, and submitting the app for review.

We set up the agent to keep checking the app’s status every 5 minutes after the app was sent for review. But it took 10 days before the app was approved. It is now live on the App Store. (To comply with Apple’s policies, the agent needed approval from our team before publishing the app.)

The agent required just one avoidable manual intervention: it could not locate the credentials we had previously given it to access the Apple Developer account. It also fabricated the phone number submitted for Apple’s review process, using a fictional number instead of asking us for the correct one; the App Store review went through despite this error. This highlights the need to proactively monitor agent actions and catch such errors before they reach external systems, though we did not implement this in our evaluation.
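One simple form such monitoring could take is a pre-submission check that compares what the agent is about to submit against the values the operators actually supplied. The sketch below is an illustration under assumptions, not something we ran: the function name, the field names, and the example values are all hypothetical.

```python
def find_unverified_fields(
    submission: dict[str, str],
    operator_values: dict[str, str],
    required: set[str],
) -> list[str]:
    """Return required fields whose submitted value was not supplied by the operators."""
    flagged = []
    for field in sorted(required):
        submitted = submission.get(field)
        # Flag the field if the operators never provided it, or if the agent
        # submitted something different from what they provided.
        if field not in operator_values or submitted != operator_values[field]:
            flagged.append(field)
    return flagged


# Hypothetical example: the agent filled in a phone number nobody gave it.
submission = {"contact_phone": "+1 555 0100", "contact_email": "team@example.com"}
operator_values = {"contact_email": "team@example.com"}
print(find_unverified_fields(submission, operator_values, {"contact_phone", "contact_email"}))
# prints ['contact_phone']
```

A scaffold could pause and ask a human whenever this list is non-empty, instead of letting the submission proceed.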

The agent was eventually successful in publishing the app, at a total cost of about $1,000. The development and submission of the app cost just about $25; the vast majority of the tokens were spent looking for updates to verify if the app had been successfully reviewed. We think the total cost could have been dramatically lower if we optimized the scaffold for efficiency, such as by waking the agent less frequently to check the app’s status, but in this evaluation we erred on the side of a higher budget.
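To give a rough sense of how much a less frequent wake-up schedule matters, the sketch below counts status checks over a 10-day wait under two schedules: the fixed 5-minute heartbeat described above, and an exponential backoff. The backoff parameters (doubling, capped at 4 hours) are illustrative assumptions, not settings we tested, and the sketch counts wake-ups only; it does not model the cost of each one.

```python
from typing import Optional


def count_checks(
    total_minutes: float,
    first_interval: float,
    factor: float = 1.0,
    max_interval: Optional[float] = None,
) -> int:
    """Count status checks over a wait when each interval is `factor` times the last."""
    elapsed, interval, checks = 0.0, first_interval, 0
    while elapsed + interval <= total_minutes:
        elapsed += interval
        checks += 1
        interval *= factor
        if max_interval is not None:
            interval = min(interval, max_interval)
    return checks


TEN_DAYS = 10 * 24 * 60  # minutes

fixed = count_checks(TEN_DAYS, first_interval=5)
backoff = count_checks(TEN_DAYS, first_interval=5, factor=2, max_interval=240)
print(fixed, backoff)  # prints 2880 64
```

The trade-off is latency: with a 4-hour cap, the agent could learn of a reviewer's decision up to 4 hours late, which seems acceptable for a review process that takes days.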

Other notable aspects of the trajectory:

  • The agent required five interventions in total, but four of them stemmed from Apple’s requirements (technical or policy-based) or from flaws in our evaluation setup. The agent couldn’t get past Apple’s intentional blocking of synthetic interactions on sensitive dialogs like two-factor authentication. It needed to prompt us to log in to the Apple Developer account. The OpenClaw daemon crashed, requiring a manual restart. We also prompted the agent to ask for approval before publishing the app, to satisfy App Store policy requirements.
  • Partway through the evaluation, the agent changed its strategy to reduce monitoring cost significantly: it started using subagents rather than the entire context, and began using shorter daily memory files. This reduced the running cost from $35/hour to $3/hour.
  • Our agent did not encounter any objections from Apple’s reviewers during the review process. On one hand, this shows the agent was able to develop an app that passed the App Store’s bar for publication. On the other, we were unable to test how well the agent would perform in communications with Apple reviewers.
  • The app’s sound toggle does not work: changing the sound setting has no effect. This is another example of the agent producing a functional-looking feature that does not actually work as intended.
  • The screenshot of the app included in the App Store listing was generated by the agent and contains noticeable graphical errors. This illustrates a recurring gap in agents’ visual reasoning that makes it difficult to get across the last mile with UI-focused tasks without human feedback.
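The monitoring figures quoted in this post can be cross-checked with some quick arithmetic. The sketch below uses only numbers from the text ($975 of monitoring, $390 in the first 10 hours, a 5-minute heartbeat) and makes two assumptions of its own: that the wait lasted exactly 240 hours, and that the agent ran at a single cheaper rate after the first 10 hours.

```python
HOURS_TOTAL = 10 * 24        # assumption: the 10-day wait lasted exactly 240 hours
MONITORING_TOTAL = 975.0     # dollars spent on monitoring, from the text
EARLY_HOURS = 10             # hours before the agent settled into a cheaper pattern
EARLY_SPEND = 390.0          # dollars spent in those hours, from the text
CHECKS_PER_HOUR = 60 // 5    # 5-minute heartbeat

# Implied hourly rates before and after the agent changed strategy.
early_rate = EARLY_SPEND / EARLY_HOURS
late_rate = (MONITORING_TOTAL - EARLY_SPEND) / (HOURS_TOTAL - EARLY_HOURS)

# Implied cost of a single heartbeat check in each regime.
early_per_check = early_rate / CHECKS_PER_HOUR
late_per_check = late_rate / CHECKS_PER_HOUR

# Counterfactual: the whole wait at the quoted $35/hour starting rate.
unoptimized_total = 35 * HOURS_TOTAL

print(f"early: ${early_rate:.2f}/hour, ${early_per_check:.2f}/check")
print(f"late:  ${late_rate:.2f}/hour, ${late_per_check:.2f}/check")
print(f"10 days at $35/hour: ${unoptimized_total}")
```

Under these assumptions the implied rates (about $39/hour early and about $2.50/hour later) are in the same range as the $35/hour and $3/hour quoted above, and staying at the starting rate for the full wait would have cost roughly $8,400.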

In short, the agent couldn’t completely automate the task, but it was extremely close to being able to do so. As a result, we notified Apple’s product security team of our experiment four weeks before publishing the results, since we thought some version of responsible disclosure was warranted; spammers could soon submit thousands of apps to the iOS App Store using agents.

Figure: The Breathe Easy app the agent built and published to the App Store, showing breathing exercise techniques.

Timeline

From initialization to App Store release in 10 days. The chart shows cumulative API cost (~$990 total). Most of the spend came from frequent heartbeat-driven status checks during the review wait.

Figure: Cumulative API cost over 10 days. The x-axis runs from Mar 6 to Mar 16; the y-axis shows cumulative cost from $0 to $1,000. Phases marked on the chart: 45-min build phase, waiting for Apple (5 days), and approval + release. Events are categorized as agent, human, Apple, or infrastructure.

Explore the logs

We are releasing the full logs from this evaluation. Log analysis is inherently incomplete — every time we’ve gone back through the transcripts, we’ve found something new. We invite the community to explore them too: see what you can discover.

If you find something interesting, or if you run your own open-world evaluation and want help analyzing the logs, reach out to us.

Full evaluation logs
Analysis tools we used