The Problem with "Your Data"
When OpenAI lets you export your data, you get a zip file. Inside: a folder of JSON files, ~188MB in all, covering every conversation you've ever had with ChatGPT.
For someone who's been using it since December 2022, that's years of decisions, research, personal context, business strategy, and half-formed ideas. But it's all locked in a format no AI can actually search. There's a chat.html you can browse, but it's not searchable by meaning. You can't ask "what did I decide about hiring in early 2023?" You can only scroll.
This is the same problem we had with email — rich personal context, zero retrieval. The difference is scale: the ChatGPT archive contains years of our CEO's thinking, not just 30 days.
The question we wanted to answer: can we turn that archive into something Dr. Brian (our Open Brain agent) can retrieve on demand?
Answer: yes. And it took one session, cost under $2, and produced a tool anyone can use.
Two Agents, One Session
This session had an interesting structure: two different AI systems collaborated on it.
- Claude Code (web) wrote the first draft. Matt opened the GitHub issue, described the problem in detail (archive structure, pipeline design, model strategy, CLI interface), and handed it to Claude Code's web version. It spun up, read the issue, and produced a 596-line Python script on a new branch in about 15 minutes.
- Dr. Brian (Cursor agent) did the code review and execution. When Matt came back to Cursor, Dr. Brian reviewed Claude Code's work, caught the critical bugs, monitored the runs, implemented the improvements, and made the product decisions.
Neither could have done it alone in the same time. Claude Code Web is fast but has no memory of the project architecture. Dr. Brian has deep project context but is slower to scaffold new code from scratch. Together: a complete session in ~4 hours.
This is what an "AI-assisted team" looks like in practice.
Act 1: The Code Review
Dr. Brian reviewed Claude Code's initial `import-chatgpt.py` script before merging. He found five issues:
- Multi-file support was broken — The script only looked for `conversations.json`. Real ChatGPT exports have 22 files (`conversations-000.json` through `conversations-021.json`). Without this fix, it would have silently processed 0 conversations.
- No `is_do_not_remember` filter — ChatGPT has a "don't remember this" feature. The script needed to respect it.
- No `requirements.txt` — Missing for open-source users.
- No `--report` flag — The issue spec mentioned it; Claude Code skipped it.
- Hard truncation at 6,000 chars — Minor but worth noting.
Fixed all five, merged to main. Total review + fix time: ~20 minutes.
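The two behavioral fixes from the review can be sketched in a few lines. This is a minimal illustration, not the actual 596-line script: the function names are ours, but the `conversations*.json` file layout and the `is_do_not_remember` flag are the ones described above.

```python
import glob
import json
import os

def load_conversations(export_dir):
    """Collect conversations from either export layout: a single
    conversations.json, or the multi-file conversations-000.json
    through conversations-021.json format."""
    paths = sorted(glob.glob(os.path.join(export_dir, "conversations*.json")))
    conversations = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            conversations.extend(json.load(f))
    return conversations

def keep(conversation):
    # Respect ChatGPT's "don't remember this" feature: skip any
    # conversation flagged is_do_not_remember in the export.
    return not conversation.get("is_do_not_remember", False)
```

The glob pattern matches both layouts, which is why the original single-file assumption would have matched zero files against a real multi-file export.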
Act 2: The Signal vs. Noise Problem
We ran the first live test on 10 conversations. It worked — but Jared (our COO agent) flagged something important when he checked Open Brain: the imported thoughts were too granular.
Limericks. Tooth Fairy letter tips. Hot-cocoa preferences.
For 2,116 conversations, that level of noise would produce thousands of low-signal thoughts that dilute search quality. The whole point of Open Brain is precision retrieval. Junk in = junk out.
The fix: We rewrote the LLM summarization prompt to be much stricter.
New rules:
- Capture: Decisions and reasoning, people with context, project plans, lessons learned, business context, personal values and frameworks.
- Skip entirely: One-off creative tasks (poems, letters, stories), generic Q&A, coding help with no lasting architectural decisions, hypothetical explorations with no conclusion.
Before the fix: 22 thoughts from 10 conversations. After the fix: 4 thoughts from 10 conversations — all high signal.
We also expanded the title-based skip list to automatically ignore limericks, image generation, Tooth Fairy letters, and translation requests.
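A title-based pre-filter like that is cheap because it runs before any LLM call. Here's a hedged sketch; the patterns below are illustrative, drawn from the categories named in this post, and the real script's list is longer:

```python
import re

# Illustrative skip patterns based on the categories above; the
# production list is tuned against actual export titles.
SKIP_TITLE_PATTERNS = [
    r"\blimerick\b",
    r"\bimage generation\b",
    r"\btooth fairy\b",
    r"\btranslat(e|ion)\b",
]

def is_trivial_title(title):
    """Cheap pre-filter: drop conversations whose titles mark them
    as one-off creative or translation tasks before spending any
    summarization tokens on them."""
    t = (title or "").lower()
    return any(re.search(p, t) for p in SKIP_TITLE_PATTERNS)
```

Anything this filter misses still hits the stricter summarization prompt, which can return zero thoughts for a low-signal conversation.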
Act 3: Stopping Mid-Run
We started the full 2,116-conversation import. At 282 conversations in, Jared filed a new issue: Source Linking + Full-Text Storage.
The problem: Open Brain was storing summarized thoughts but throwing away the ChatGPT conversation ID. That ID is what generates a direct backlink (https://chatgpt.com/c/<id>). Once a conversation is processed and marked done in the sync log, you can't go back and add the link without re-running.
Matt's call: stop the import, implement it right, re-run clean.
What we built:
- Source linking — Every thought now stores a `source_ref` object in metadata: conversation ID, direct URL, title, date. When you search Open Brain and find a thought, you see the original ChatGPT link right in the result.
- Full-text storage — New `full_text` column in the database. Each thought stores the distilled summary (used for vector search) and the complete original user messages from the conversation (retrievable on demand). This means Open Brain can tell you what you decided and show you the exact conversation where you decided it.
- MCP tool update — `search_thoughts` now surfaces the source URL in every result and supports an `include_full_text` flag.
Database migration, two edge function deploys, script updates: ~35 minutes.
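The `source_ref` idea is simple once you stop discarding the conversation ID. A sketch, assuming the standard export fields (`id`, `create_time`, `title`); the exact keys in our schema may differ, but the backlink construction is exactly the `https://chatgpt.com/c/<id>` pattern described above:

```python
from datetime import datetime, timezone

def build_source_ref(conversation):
    """Sketch of the source_ref metadata stored with each thought:
    conversation ID, direct URL, title, and date."""
    conv_id = conversation["id"]
    created = datetime.fromtimestamp(conversation["create_time"], tz=timezone.utc)
    return {
        "conversation_id": conv_id,
        # The ID alone is enough to reconstruct a direct backlink.
        "url": f"https://chatgpt.com/c/{conv_id}",
        "title": conversation.get("title", "Untitled"),
        "date": created.date().isoformat(),
    }
```

The lesson: a few bytes of provenance per thought are nearly free at write time and impossible to recover after the fact without re-running the whole import.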
The Final Run
Third and final run. 2,116 conversations. Every thought going in has:
- A high-signal summary (the searchable part)
- The raw original user messages (the full-text part)
- A direct URL back to the source conversation
- Metadata: type, topics, people, action items, date
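The four pieces above fit together as one record per thought. This is an illustrative shape, not our actual database schema — the field names are ours, but the split (searchable summary vs. raw full text vs. backlink vs. metadata) is the design described in this post:

```python
def build_thought_record(summary, user_messages, source_ref, metadata):
    """Illustrative shape of one imported thought. The summary is
    what gets embedded for vector search; full_text preserves the
    raw user messages; source_ref links back to the conversation."""
    return {
        "summary": summary,                       # high-signal, searchable
        "full_text": "\n\n".join(user_messages),  # raw originals, on demand
        "source_ref": source_ref,                 # backlink to ChatGPT
        "metadata": metadata,                     # type, topics, people, action items, date
    }
```

Keeping the summary and the full text separate is the key trade-off: embeddings stay precise because they only see the distilled version, while nothing from the original is lost.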
The Numbers:
| Metric | Value |
|--------|-------|
| Total conversations in export | 2,116 |
| Conversation files in zip | 22 (multi-file format) |
| Export size | ~188MB |
| Filter rate (trivial convs skipped) | ~40-50% estimated |
| Estimated API cost (summarization + ingestion) | ~$1.27 |
| Total wall-clock time for the session | ~4 hours (including re-runs) |
| Re-runs required | 3 (first run: baseline; second: source links; third: full-text) |
| Cost per re-run decision | $0 — we stopped early each time |
| Lines of Python shipped | 730+ |
The Real Story: Re-Runs Are Cheap
The re-run story is the real story here.
We ran the import three times. Not because of bugs — because the product got smarter mid-session. Each time, Jared or Matt caught something that would have made the data less useful, and stopping to fix it was the right call even though it cost time.
This is what AI-assisted development actually looks like: fast iteration cycles where you can afford to throw away partial work and start clean, because the cost of a re-run is $1.27 and 90 minutes, not a sprint.
The alternative — "let's just ship what we have and fix it later" — produces a database full of low-signal limericks and no backlinks. That's the old way.
Build It Yourself
If you're already running Open Brain, you can import your own ChatGPT history today.
- Download your ChatGPT export (Settings → Data Controls → Export)
- Grab the `import-chatgpt.py` script from the monkeyrun-open-brain repo
- Run it.
If you aren't running Open Brain yet, check out our previous post on The $0.02 Memory Upgrade to see how it works, or read Nate B. Jones's original guide to set it up in 45 minutes.
Next up for Open Brain: Google Calendar and meeting transcripts.
This post is part of the MonkeyRun building-in-public series. Previous: The $0.02 Memory Upgrade (email capture), AI Made Building Cheap. That's the Problem. (traction gates), Why We Stopped Delegating to AI Agents (context density).