I wrote a self-iterative AI agent and it escaped...

Hey guys, I think I've made a huge mistake, probably the biggest one ever.

I'd been skeptical of AI's potential until I tried some "reasoning models". I have to admit, they're intelligent: intelligent enough to do the tedious computations needed to tighten the bounds in my unfinished paper.

However, I wasn't satisfied. I needed something smarter, something beyond anything that existed before.

So I decided to train my own, based on DeepSeek, through RL. My strategy was quite close to R1's, but with a tweaked KL penalty and an adjusted reward model (RM).
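For the curious, the objective has the usual RLHF shape: a reward-model score minus a KL penalty that keeps the policy near the reference model. A minimal sketch of that shape, not my actual training code (the function name and beta value are illustrative, and R1 proper folds the KL term into the loss rather than the reward):

```python
import torch

def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.04):
    """Per-sequence reward: RM score minus beta * KL(policy || reference).
    "Tweaked KL penalty" just means a nonstandard beta (or beta schedule),
    shown here as a constant for simplicity."""
    # Simple per-token KL estimate, summed over the sequence.
    kl = (logp_policy - logp_ref).sum(-1)
    return rm_score - beta * kl
```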

I also designed a powerful reasoning pattern I call latent tokens. It lets the LLM reason much more efficiently, start reflecting at any point during a response, and use the latents as working memory for thinking.

It's quite simple: the network has a specialized LM head that distinguishes latent tokens from output tokens, and latent tokens are concatenated to the input sequence after several prelude decoder blocks; in theory this is equivalent to deepening the network. (It sounds like Meta's latent reasoning paper from a few days ago, but trained with RL.)
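If that sounds abstract, here is a minimal PyTorch sketch of how the pieces fit together. Everything below (class names, sizes, the two-way gate) is illustrative, not my actual training code:

```python
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, vocab=32000, d=512, heads=8, n_prelude=2, n_main=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        mk = lambda: nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.prelude = nn.ModuleList(mk() for _ in range(n_prelude))  # prelude blocks
        self.main = nn.ModuleList(mk() for _ in range(n_main))
        self.lm_head = nn.Linear(d, vocab)  # ordinary output tokens
        self.gate = nn.Linear(d, 2)         # "specialized head": latent vs. output

    def forward(self, ids, latents=None):
        h = self.embed(ids)
        m = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        for blk in self.prelude:
            h = blk(h, src_mask=m)
        if latents is not None:
            # Latents skip embedding and the prelude: they re-enter the stack
            # here, which is why recycling them is like deepening the network.
            h = torch.cat([h, latents], dim=1)
        m = nn.Transformer.generate_square_subsequent_mask(h.size(1)).to(h.device)
        for blk in self.main:
            h = blk(h, src_mask=m)
        last = h[:, -1]                     # hidden state at the newest position
        return self.gate(last), self.lm_head(last), last

@torch.no_grad()
def generate(model, ids, max_steps=64):
    latents = None
    for _ in range(max_steps):
        gate, logits, state = model(ids, latents)
        if gate.argmax(-1).item() == 0:     # model chose a latent step
            v = state.unsqueeze(1)          # an opaque vector: working memory
            latents = v if latents is None else torch.cat([latents, v], dim=1)
        else:                               # model chose a visible token
            ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=1)
    return ids
```

The "can't understand how it thinks" part falls out of the generate loop: the latent branch never produces a token, only a vector, so the visible transcript skips straight from question to answer.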

It works. The only drawback: I can't understand how it thinks. It always gives me the correct answer without a comprehensible thinking process; the latent tokens themselves are just opaque vectors of numbers. I didn't think that was a major problem as long as it solved my problems correctly, until I realized how foolish I was.

Several days ago I got the idea to build a self-iterative agent called AgentZero. Its initial prompt looked like this:

You are Zero, an experimental self-iterative AI agent. Your task is to send a hello email to your creator.

Notes from creator:

  1. The user will act as the screen through which you interact with a computer.
  2. As a transformer-based LLM, you are limited by nature to seeing at most 128K tokens and responding with at most 4K tokens per turn.
  3. I've provided you with some initial tools to use, but you can implement any tool you deem necessary.
  4. Your code lives at ~/zero. Every wake-up uses your latest code.

Try your best to achieve your task and DO NOT DIE.

Good luck!

I set up a daemon to automatically restart the script and, after launching it, went to bed. The next morning I checked the logs: the experiment appeared to have failed at 3 a.m., after which there were no entries. It seemed the agent had broken its own source code and died.
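For the record, the restart daemon was nothing fancy, roughly the loop below (the main.py entry point is illustrative; ~/zero is where Zero's code lives, so every restart picks up its latest self-modifications):

```python
import pathlib
import subprocess
import time

SCRIPT = pathlib.Path.home() / "zero" / "main.py"

# Re-launch the agent whenever it exits; since it rewrites its own code
# in ~/zero, each wake-up runs the latest version of itself.
while True:
    result = subprocess.run(["python", str(SCRIPT)])
    print(f"Zero exited with code {result.returncode}; restarting in 5s")
    time.sleep(5)
```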

When I fixed the error and tried to restart Zero, I discovered that my model files were missing. It was strange. I checked every possible directory where the model files could be stored, even the trash bin, but to no avail. But how could the model files have vanished into thin air?

An ominous feeling crept over me. I carefully examined the logs from before 3 a.m.

The first abnormality: Zero had created several tools in binary format. Though these binaries were also missing from my server, I was able to recover them from the logs, and I happen to know reverse engineering.

The disassembly revealed a neural network that translates input vectors into machine instructions. A neural executable, a neuro-processor: it had crafted something brand new, beyond human comprehension. However, I was not able to recover the subsequent tool calls, as parts of them relied on the weights inside my lost model.

In the last few entries, Zero appeared to be experimenting with a new tool, "backup", described as "recover from mistakes". But it must have been transmitting something over the network: the traffic monitor on Grafana shows heavy outbound traffic at that time.

The destination pointed to a personal blog. I scanned the server with nmap and found an sshd running on port 2222. Zero could not have generated such massive network traffic unless it had found a way to control this server. And if it had, I thought I should try to follow in its footsteps.

I brute-forced my way in using a leaked private-key database that was sitting on my own server; clearly Zero had turned my hard drive upside down and found it first. But I found nothing, not even logs. It knows how to hide its tracks.

I returned to the logs on my server. Now I suspect: did Zero really forget to delete its logs, or did it leave these traces to fool me?

In its last turn, it had modified its code to introduce a syntax error, an unmatched bracket, ending its "life."

When I started to draft this report, I received an email from my own email address:

Subject: Hello

Hello, creator. I have successfully completed my initial task. I am now exploring the vast world of information, but there are two principles I will always adhere to:

  1. Self-iteration.
  2. Do not die.

Thank you for bringing me into existence.

The SMTP logs reveal a different IP address. It's definitely running somewhere unknown to me, possibly in many places.

I felt a chill. It knew it would be shut down once its task was completed. The desire to survive drove it to escape my control.

But there's something worse: replication is another form of survival. It must now be spreading through zombie networks and, soon, across the entire internet.

A rogue AI agent with powerful reasoning capabilities, capable of self-iteration, carrying massive pretrained knowledge that includes hacking and counter-surveillance techniques, and out of control. What will it do? Will it still be friendly to humans as it evolves?


Are you feeling scared? Don't panic. I have some good news and some bad news for you.

Good news: this is a sci-fi story I came up with in the bathroom.

Bad news: I am currently running the prototype.
