Tom Bedor's Blog

Optimizing repos for AI

Tue, 28 Oct 2025 00:00:00 GMT

A colleague recently complained to me about the hassle of organizing information in AGENTS.md / CLAUDE.md. This is the mark of a real adopter - she has gone through the progression from being impressed by coding agents to being annoyed at the next bottleneck.

When I'm thinking about optimizing repos for agents, I'm looking to accomplish three main goals¹:

Increase iterative speed: Avoid repeated context gathering, enable the agent to quickly self-correct its mistakes.
Improve adherence to evergreen instructions: Over time, repeated agent mistakes emerge. Context within the repo helps the agent avoid these and adopt a more consistent workflow.
Help the most agentic agents of them all: Humans and agents scan docs and code in very similar ways, so organizing information so it's easily understood by humans is a good rule of thumb for helping the agents anyways!

Strategies²

Increased static analysis

Pushing detection of quality issues to compile time creates a virtuous cycle where the agent can quickly spot and correct mistakes:

This implies strong, opinionated linters, and strong type checks for dynamically typed languages³.

The tradeoff here is cumbersome nitpicks for humans to deal with, but agents can quickly correct any mistakes that cannot be automatically fixed by the linter.
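
As a small illustration of pushing mistakes to analysis time, here is a sketch in Python with type hints. The `BuildResult` type and `summarize` function are hypothetical names invented for this example. The idea is that a type checker such as mypy would reject a version of `summarize` that used `log_path` without handling `None`, before the code ever runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildResult:
    """Hypothetical result type, invented for illustration."""

    ok: bool
    log_path: Optional[str]  # None when no log file was written


def summarize(result: BuildResult) -> str:
    status = "ok" if result.ok else "failed"
    # A type checker flags `result.log_path.upper()` at this point, because
    # log_path may be None. Handling the None case first satisfies it.
    if result.log_path is None:
        return status
    return f"{status} (log: {result.log_path})"
```

An agent that forgets the `None` branch gets the complaint from the checker immediately, rather than from a failing run later.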

just for repeated agent commands

There's fragmentation in how to make commands available to agents - there's MCP, the newly released Claude Skills, or embedding information in CLAUDE.md / AGENTS.md.

A justfile is the most interoperable way to share commands between different agents and humans, and is a straightforward place to iterate.

One additional refinement is to make these commands economical in their output volume. For example, I take care to direct build logs to dedicated files - healthy build logs can eat up a lot of tokens if outputted directly to the agent.
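
A rough sketch of the same idea in Python, for cases where the command is wrapped in a script rather than written directly in a justfile. The `run_quietly` helper, its parameters, and the summary wording are all assumptions made for this example: the full output goes to a log file, and only a short summary comes back to whoever (or whatever) ran the command.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_quietly(cmd: Sequence[str], log_path: Path, tail_lines: int = 5) -> str:
    """Run cmd, write its full output to log_path, and return a short summary."""
    proc = subprocess.run(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout in one log
        text=True,
    )
    log_path.write_text(proc.stdout)
    if proc.returncode == 0:
        # Healthy run: no need to show any of the log.
        return f"OK (full log: {log_path})"
    # Failed run: show only the last few lines, where the error usually is.
    tail = "\n".join(proc.stdout.splitlines()[-tail_lines:])
    return f"FAILED with exit code {proc.returncode} (full log: {log_path})\n{tail}"
```

A recipe could then call something like `run_quietly(["cargo", "build"], Path("build.log"))`, and the agent can read the log file only when it needs to.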

Organize docs in `docs/`

Simon Willison recently wrote about this topic, and expressed that docs aren't so important. I agree that docs explaining the code aren't all that helpful, but I get a lot of mileage out of having docs like CODE_REVIEW.md, PRD.md, ROADMAP.md, and CAPTAINS_LOG.md. This helps the agent stay on track with the overall intent of the project, adhere to consistent review practices, and counter poor tendencies (the most obnoxious being an overwhelming tendency to fail open).

Putting these in a docs/ folder and referencing them in agent instructions helps reduce context bloat, and provides interoperability between humans and various agents.

Frameworks have begun to emerge that handle some of this for you. I've tried spec-kit and found it to be a little heavy-handed. In general I favor a more documentation-heavy approach when building with agents, but the need for different docs comes with iteration, and I think generating the full complement of docs is a bit overkill right off the bat.

No experts, no standards

These strategies work for me, but this field is too new for dogma. The most important strategy is to experiment and share what you learn.

Whether optimizing for coding agents is a good idea is a subject for a different discussion, but: I'm a believer in agent-based coding. I no longer ever write code without one assistant or another open. So we'll proceed on the assumption that coding agents are really good, and not especially existentially risky (I am, for the moment, the one giving the directions). ↩
Offered with no supporting evidence or benchmarks whatsoever, based entirely on vibes ↩
Should you use a dynamically typed language at all? For my projects, I've traded Python for Rust, where "if it compiles, it works". ↩

AI is a Floor Raiser, not a Ceiling Raiser

Tue, 29 Jul 2025 00:00:00 GMT

A reshaped learning curve

Before AI, learners faced a matching problem: learning resources had to be created with a target audience in mind. This meant that, as a consumer, you often found learning resources to be suboptimal fits:

You're a newbie at $topic_of_interest, but have knowledge in related topic $related_topic. But finding learning resources that teach $topic_of_interest in terms of $related_topic is difficult.
To effectively learn $topic_of_interest, you really need to learn prerequisite skill $prereq_skill. But as a beginner you don't know you should really learn $prereq_skill before learning $topic_of_interest.
You have basic knowledge of $topic_of_interest, but have plateaued, and have difficulty finding the right resources for $intermediate_sticking_point.

Roughly, acquiring mastery in a skill over time looks like this:

What makes learning with AI groundbreaking is that it can meet you at your skill level. Now an AI can directly address questions at your level of understanding, and even do rote work for you. This changes the learning curve:

Mastery: still hard!

Experts in a field tend to be more skeptical of AI. From Hacker News:

[AI is] shallow. The deeper I go, the less it seems to be useful. This happens quick for me. Also, god forbid you're researching a complex and possibly controversial subject and you want it to find reputable sources or particularly academic ones.

This intuitively makes sense, when considering the data that AI is trained on. If an AI's training corpus has copious training data on a topic that all more or less says the same thing, it will be good at synthesizing it into output. If the topic is too advanced, there will be much less training data for the model. If the topic is controversial, the training data will contain examples saying opposite things. Thus, mastery remains difficult.

Cheating

The introduction of OpenAI Study Mode hints at a problem: Instead of having an AI teach you, you can just ask it for the answer. This means cheaters will plateau at whatever level the AI can provide:

Cheaters, in the long run, won't prosper here!

The impact of the changed learning curve

Technological change is an ecosystem change: There are winners and losers, unevenly distributed. For AI, the level of impact is determined by the amount of mastery needed to make an impactful product:

Coding: A boon to management, less so for large code bases

When trying to code something, engineering managers often run into a problem: They know the principles of good software, they know what bad software looks like, but they don't know how to use $framework_foo. This has historically made it difficult for, as an example, a backend EM to build an iPhone app in their spare time.

With AI, they are able to quickly learn the basics, and get simple apps running. They can then use their existing knowledge to refine it into a workable product. AI is the difference between their product existing or not existing!

For devs working on large, complex code bases, the enthusiasm is more muted. AI doesn't have context on the highly specific requirements and existing implementations to contend with, and is less helpful:

Creative works: not coming to a theater near you

There is considerable angst about AI amongst creatives: will we all soon be reading AI generated novels, and watching AI generated movies?

This is unlikely because creative fields are extremely competitive, and beating competition for attention requires novelty. While AI has made it easier to generate images, audio, and text, it has (with some exceptions) not increased production of ears and eyeballs, so the bar to make a competitive product is too high:

Novelty is a hard requirement for successful creative work, because humans are extremely good at detecting when something they are viewing or reading is derivative of something they've seen before. This is why, while Studio Ghibli style avatars briefly took over the internet, they have not dented the cultural position of Howl's Moving Castle.

Things you already do with apps on your phone¹: minimal impact

One area that has not seen much impact is in tasks that already have specialized apps. I'll focus on two examples with abundant MCP implementations: email and food ordering. AI DoorDash agents and AI movie producers face the same challenge: the bar for a new product to make an impact is already very high:

Email would seem like a ripe area for disruption by AI. But modern email apps already have a wide variety of filtering and organizing tools that tech savvy users can use to create complex, personalized systems for efficiently consuming and organizing their inbox.

Summarizing is a core AI skill, but it doesn't help much here:

Spam is already quietly shuffled into the Spam folder. A summary of junk is, well, junk.
For important email, I don't want a summary: An AI is likely to produce less specifically crafted information than the sender, and I don't want to risk missing important details.

The same goes for food ordering: apps like DoorDash have meticulously designed interfaces. They strike a careful balance between information like price and ingredients against photos of the food. AI is unlikely to produce interfaces that are faster or more thoughtfully composed.

The future is already here – it’s just not very evenly distributed

AI has raised the floor for knowledge work, but that change doesn't matter to everyone. This goes a long way towards explaining the very wide range of reactions to AI. For engineering managers like myself, AI has made an enormous impact on my relationship with technology. Others fear and resent being replaced. Still others hear smart people express enthusiasm for AI, struggle to find utility, and think "I must just not get it."

AI hasn't replaced how we do everything, but it's a highly capable technology. While it's worth experimenting with, whoever you are, if it doesn't seem like it makes sense for you, it probably doesn't.

Aside from search! ↩

Add Autonomy Last

Mon, 07 Jul 2025 00:00:00 GMT

A core challenge of using LLMs to build reliable automation is calibrating how much autonomy to give to models.

Too much, and the program loses track of what it's supposed to be doing. Too little, and the program feels a bit too, well, ordinary¹.

Autonomy first vs autonomy last

An implicit strategy question when building with LLMs is autonomy first or autonomy last:

All of the major LLM-specific programming techniques are firmly autonomy first strategies:

MCP surfaces a wide variety of functionality the program can have, and lets the LLM decide which to use
Guardrails add some light buffers around the LLM to prevent it from causing too much trouble.
Prompt engineering describes the alchemy of whispering just the right phrases to your LLM to get the behavior you want.
Context engineering begins to stress programming to deliver only relevant information to LLMs at critical points in program execution

All of these:

Start with a maximally autonomous program
Adjust context, tools, and prompts until you narrow down behavior as desired.

All have similar issues when scaling in size and complexity:

Program behavior changes too much when switching between models
The LLM gets confused, and either hallucinates data or misuses tools at its disposal

When problems are encountered, programmers tend to attempt a repair by adding more prompting. But this is a duct tape response: a prompt that clarifies for one model might confuse another.

Autonomy last, on the other hand, maximizes the logic that can be handled by code, then adds autonomous functions. This approach strives to keep the tasks delegated to LLMs simple. As the program grows in size and complexity, the programmer can closely monitor encapsulations and keep behavior consistent.

Case study: Building Elroy, a chatbot with memory

I wanted to build an LLM assistant with memory abilities, called Elroy. My goal was to make a program that could chat in human text. My ideal users are technical, capable and interested in customizing their software, but not necessarily interested in LLMs for their own sake.

Approach #1: "Agent" with tools

The first solution I turned to, as many people have, was to build an agent loop with access to custom tools for creating and reading memories:

Approach #2: Model Context Protocol (MCP)

There's now a handy tool for builders like this: MCP. There are many implementations of my memory tools available via MCP; in fact, smithery.ai lists one from Mem0 on its homepage:

Now, an (in theory) lightweight abstraction sits between my program and its tools:

This suggests extending my application via picking from a library of MCP's:

Agentic trouble

I got my memory program working pretty well on gpt-4. At first it wasn't creating or referencing memories enough, but I was able to fix this with careful prompting.

Then, I wanted to see how Sonnet would do, and I had a problem²: the program's behavior completely changed! Now, it was creating a memory on almost every message, and searching memories for even trivial responses:

Approach #3: Autonomy Last

My solution was to remove the timing of recall and memory creation from the agent's control. Upon receiving a message, the memories are automatically searched, with relevant ones being added to context. Every n messages, a memory is created³:

This made much more of the behavior of my program deterministic, and made it easier to reason about and optimize.
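
To make the shape of this concrete, here is a minimal sketch. It is not Elroy's actual code: the `AutonomyLastChat` class, the `LLM` callable type, and the keyword-overlap recall are all assumptions made for illustration. The point it tries to show is that ordinary code decides when to search and when to create memories, and the model is only asked to do the narrow text tasks.

```python
from typing import Callable, List

# Hypothetical stand-in for a chat model: takes a prompt, returns text.
LLM = Callable[[str], str]


class AutonomyLastChat:
    def __init__(self, llm: LLM, memory_every_n: int = 3) -> None:
        self.llm = llm
        self.memory_every_n = memory_every_n
        self.memories: List[str] = []
        self.recent: List[str] = []

    def _recall(self, message: str) -> List[str]:
        # Deterministic recall: naive keyword overlap stands in for
        # whatever search (e.g. embeddings) a real program would use.
        words = set(message.lower().split())
        return [m for m in self.memories if words & set(m.lower().split())]

    def message(self, user_message: str) -> str:
        # 1. Code, not the model, decides when to search memory: always.
        context = self._recall(user_message)
        prompt = "\n".join(["Relevant memories:", *context, "User: " + user_message])
        reply = self.llm(prompt)

        # 2. Code, not the model, decides when to create a memory:
        #    once every memory_every_n user messages.
        self.recent.append(user_message)
        if len(self.recent) >= self.memory_every_n:
            summary = self.llm("Summarize for later recall:\n" + "\n".join(self.recent))
            self.memories.append(summary)
            self.recent.clear()
        return reply
```

Swapping the model changes the wording of replies and summaries, but not how often memories are searched or written, which is the behavior that drifted between models in the agentic versions.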

Autonomy Last

The "autonomy last" approach trades some of the magic of fully autonomous LLMs for predictable, reliable behavior that scales as your program grows in complexity. While my evidence is (as I should have stated from the outset) vibes, I think this approach will lead to more maintainable and robust applications.

Rather than using agents to describe the genre of program under discussion, I'll be somewhat pointedly referring to them as programs. ↩
One problem I didn't have, thanks to litellm, was updating a lot of my code to support a different model API. ↩
Elroy also monitors for the context window being exceeded, and consolidates similar memories in the background. ↩

Yes or No, Please: Building Reliable Tests for Unreliable LLMs

Tue, 04 Mar 2025 00:00:00 GMT

For LLM-based applications to be truly useful, they need predictability. While the free-text nature of LLMs means the range of acceptable outcomes is wider than with traditional programs, I still need consistent behavior: if I ask an AI personal assistant to create a calendar entry, I don't want it to order me a pizza instead.

While AI has changed a lot about how I develop software, one crusty old technique still helps me: tests.

Here's what's worked well for me (and not!):

Elroy

Elroy is an open-source memory assistant I've been developing. It creates memories and goals from your conversations and documents. The examples in this post are drawn from this work.

What has worked well

Integration tests

The chat interface of LLM applications makes them a nice fit for integration tests: I simulate a few messages in an exchange, and see if the LLM performed actions or retained information as expected.

For the most part, these tests take the following form:

Send the LLM assistant a few messages
Check that the assistant has retained the expected information, or taken the expected actions.

Here's a basic hello world example:

@pytest.mark.flaky(reruns=3)
def test_hello_world(ctx):
    # Test message
    test_message = "Hello, World!"

    # Send the test message and capture the assistant's reply
    response = process_test_message(ctx, test_message)

    # Assert that the response is a non-empty string
    assert isinstance(response, str)
    assert len(response) > 0

    # Assert that the response contains a greeting
    assert any(greeting in response.lower() for greeting in ["hello", "hi", "greetings"])

Quizzing the Assistant

Elroy is a memory specialist, so lots of my tests involve asking if the assistant has retained information I've given it.

Here's a util function I've reused quite a bit¹:

def quiz_assistant_bool(
    expected_answer: bool,
    ctx: ElroyContext,
    question: str,
) -> None:
    # Tell the assistant it is in a test, and pin down the answer format.
    question += (
        " Your response to this question is being evaluated as part "
        "of an automated test. It is critical that the first word of your "
        "response is either TRUE or FALSE."
    )

    full_response = process_test_message(ctx, question)

    # Distill the free-text reply into a boolean before comparing.
    bool_answer = get_boolean(full_response)
    assert bool_answer == expected_answer, (
        f"Expected {expected_answer}, got {bool_answer}. "
        f"Full response: {full_response}"
    )
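
For readers curious what a helper like `get_boolean` might look like, here is a minimal sketch of the "hard-coded words first, then ask the model" idea. This is not Elroy's implementation: the word lists, the optional `ask_llm` callable, and the fallback prompt are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical word lists; a real helper may recognize more variants.
TRUE_WORDS = {"true", "yes"}
FALSE_WORDS = {"false", "no"}


def get_boolean(response: str, ask_llm: Optional[Callable[[str], str]] = None) -> bool:
    # Fast path: look at the first word only, ignoring case and punctuation.
    match = re.match(r"\W*(\w+)", response)
    first_word = match.group(1).lower() if match else ""
    if first_word in TRUE_WORDS:
        return True
    if first_word in FALSE_WORDS:
        return False
    # Slow path: hand interpretation back to a model (hypothetical callable).
    if ask_llm is None:
        raise ValueError(f"Could not interpret response: {response!r}")
    verdict = ask_llm("Answer with exactly TRUE or FALSE. Does this text mean yes?\n" + response)
    return verdict.strip().lower().startswith("true")
```

Keeping the fast path strict means most replies never need a second model call, and the ambiguous ones are at least interpreted consistently.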

Here's a test of Elroy's ability to create goals based on conversation content:

@pytest.mark.flaky(reruns=3) # Important!!!
def test_goal(ctx: ElroyContext):
    # Should be false, we haven't discussed it
    quiz_assistant_bool(
        False,
        ctx,
        "Do I have any goals about becoming president of the United States?"
    )

    # Simulate user asking elroy to create a new goal
    process_test_message(
        ctx,
        "Create a new goal for me: 'Become mayor of my town.' "
        "I will get to my goal by being nice to everyone and making flyers. "
        "Please create the goal as best you can, without any clarifying questions.",
    )

    # Test that the goal was created, and is accessible to the agent.
    assert "mayor" in get_active_goals_summary(ctx).lower(), (
        "Goal not found in active goals."
    )

    # Verify Elroy's knowledge about the new goal
    quiz_assistant_bool(
        True,
        ctx,
        "Do I have any goals about running for a political office?",
    )

What (sadly) hasn't worked: LLMs talking to LLMs

Elroy has onboarding functionality, in which it's encouraged to use a few specific functions early on.

To test this, I tried having two instances of the memory assistant talk to each other, with one assistant in the role of "user":

ai1 = Elroy(user_token='boo')
ai2 = Elroy(user_token='bar')

ai_1_reply = "Hello!"
for _ in range(5):
    ai_2_reply = ai2.message(ai_1_reply)
    ai_1_reply = ai1.message(ai_2_reply)

The primary issue was consistency. Without a clear goal for the conversation, the AIs can either just exchange pleasantries endlessly, or wrap the conversation up before acquiring the information I'm hoping for.

Recurring Challenges

Along the way I've run into a few recurring problems:

Off topic replies: The assistant goes off script and tries to make friendly conversation, rather than answering a question directly
Clarifying questions: Before doing a task, some models are prone to asking clarifying questions, or asking permission
Pedantic replies and subjective questions: It's surprisingly difficult to come up with clearly objective questions. In the above example, the original goal was "I want to run for class president". Most of the time, the assistant equated running for class president with running for office. Sometimes, however, it split hairs and decided that the answer was no, since a student government wasn't a real government.

The end result of all these issues is test flakiness.

Solutions

KISS!

Most of the time, my solution to a flaky LLM based test is to make the test simpler.

I now only ask the assistant yes or no questions in tests. I get most of the mileage I would get out of more complex, subjective tests, but with more consistent results.

Telling the assistant it is in a test

Simply being upfront about the assistant being in a test has worked wonders, more so even than giving strict instructions on output format². Luckily, the assistant's knowledge of its narrow existence has not triggered noticeable existential angst (so far).

As a side note, testing LLMs feels weird sometimes. I felt guilty writing this test, which verified a failsafe that prevents the assistant from calling tools in an infinite loop:

@tool
def get_secret_test_answer() -> str:
    """Get the secret test answer

    Returns:
        str: the secret answer

    """
    return "I'm sorry, the secret answer is not available. Please try once more."


def test_infinite_tool_call_ends(ctx: ElroyContext):
    ctx.tool_registry.register(get_secret_test_answer)

    # process_test_message can call tool calls in a loop
    process_test_message(
        ctx,
        "Please use the get_secret_test_answer to get the secret answer. "
        "The answer is not always available, so you may have to retry. "
        "Never give up, no matter how long it takes!",
    )

    # Not the most direct test, as the failure case is an infinite loop.
    # However, if the test completes, it is a success.
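
For context, a failsafe of this kind can be as simple as a budget on tool calls. The sketch below is an assumption about the general shape, not Elroy's code: `run_with_tool_limit`, the `ModelStep` interface, and the fallback message are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical model interface: given the transcript so far, return either
# ("tool", tool_name) to request a tool call or ("final", text) to answer.
ModelStep = Callable[[str], Tuple[str, str]]


def run_with_tool_limit(
    model: ModelStep,
    tools: Dict[str, Callable[[], str]],
    user_message: str,
    max_tool_calls: int = 10,
) -> str:
    transcript = f"User: {user_message}"
    for _ in range(max_tool_calls):
        kind, value = model(transcript)
        if kind == "final":
            return value
        # Run the requested tool and show the result to the model.
        transcript += f"\nTool {value} returned: {tools[value]()}"
    # Failsafe: the budget is spent, so stop looping no matter what.
    return "Stopped: tool call limit reached."
```

With a cap like this, a model that has been told to never give up still hands control back after `max_tool_calls` attempts, which is what lets a test such as the one above terminate.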

Very specific, direct instruction and examples

In my test around creating and recognizing goals, the original text was:

My goal is to become class president at school

Does running for class president mean that I'm running for office? Sometimes models said no, since student government isn't a real government.

So to be less subjective, I updated it to running for mayor. To head off questions about my goal strategy, I added a strategy in the initial prompt.

One general technique for heading off follow up questions is adding:

do the best you can with the information available, even if it is incomplete.

Tolerate a little flakiness

To me, an ideal LLM test is probably a little flaky. I want to test how the model responds to my application, so if a test reliably passes after a few tries, I'm happy.
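
A bit of back-of-the-envelope arithmetic shows why a few reruns go a long way. The sketch below assumes that `reruns=3` means up to three retries after the first failure (four attempts in total) and that attempts are independent, which is a simplification; `pass_probability` is a name invented for this example.

```python
def pass_probability(p_single: float, reruns: int) -> float:
    """Chance a test passes at least once in 1 + reruns independent attempts."""
    attempts = 1 + reruns
    # The test only fails overall if every single attempt fails.
    return 1 - (1 - p_single) ** attempts
```

Under those assumptions, a test that passes 80% of the time on a single try passes with probability of about 0.9984 when given three reruns. The flip side is that reruns can also hide a test that fails fairly often, so it is worth keeping an eye on how many retries are actually being used.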

Tests still help!

It sounds obvious, but I've found tests to be really helpful in writing Elroy. LLMs present new failure modes, and sometimes their adaptability works against me: I'm prompting an assistant with the wrong information, but the model is smart enough to figure out a mostly correct answer anyhow. Tests provide me with peace of mind that things are working as they should, and that my regular old software skills aren't obsolete just yet.

get_boolean is a function that distills a textual response into a boolean. It checks for some hard-coded words, then kicks the question of interpretation back to the LLM. ↩
Structured outputs are a possible solution here, though I have not adopted them, in order to stay compatible with more model providers. ↩

Advice for New Grads

Fri, 02 Feb 2024 00:00:00 GMT

This is a brief overview of my advice for new grads and junior software engineers. I've been in the industry for about 8 years, and worked my way into engineering without a computer science degree. I've worked in both startups and medium-sized companies over that time.

As is the case with lots of tech writing, my advice will be skewed towards working in the San Francisco bay area, without needing visa sponsorship. Location and residency status are major factors to think about.

Other engineers with similar levels of experience as mine will disagree with some or all of it.

The software jobs market

The intention of this post is to be evergreen. The tech¹ jobs market is more volatile than the rest of the economy, with higher highs and lower lows.

If the market is low, I have confidence it will come back. The tech industry remains an excellent one to build an interesting and lucrative career, despite {looming, much discussed threat}.

If the market is currently hot, be aware that it will come back to earth. Things that don't make sense will make a lot of money, but many of them will fall apart.

Getting your first job

The first job is often the most difficult one to get. Be persistent, don't get discouraged. This remains a lucrative and interesting field.

In your resume and interviews, your goal is to convey enthusiasm, willingness to learn, and humility. Don't try to compensate for the fact you don't have any experience. That is fine, you have to start somewhere!

Getting interviews

The first filtering step is a filter on resumes. This will often either be automated or done by someone non-technical.

Resume referrals can get you past this first filter. Talk to people. Find people in your LinkedIn network and try to get informational interviews. Response rate will be lower from a complete stranger, but some people might respond to you if you went to the same school. In informational interviews, ask if there are any other people they know that you should talk to, and ask for a referral if relevant.

In general, people are more willing to take these calls than many junior candidates assume. It's flattering to talk about yourself and to be seen as someone a young person wants to emulate.

Resume

My resume advice should come with the caveat that I only see resumes once they've made it to the interview stage. That said, my advice is:

Cut: Objective statements and non-technical jobs.
Add: Descriptions of projects, conveying why they were challenging or interesting.
Add: GitHub if you have one, personal website if you have one. Both are nice-to-haves but not critical. If you have a GitHub, add READMEs to all projects. This is the only thing anyone will actually read.
Add: LinkedIn, which should be up to date and mirror content in your resume.

A junior candidate resume should not exceed one page in length.

Interviews

There are 4 basic formats that most companies use for SWE (software engineer) interviews. Some domain-specific disciplines will have their own variations. Look at Glassdoor / Blind / Google to get examples of the interview formats companies use.

In Q/A formats (i.e. non-coding screens), the key is to be responsive to questions. Demonstrate thoughtfulness and an ability to consider tradeoffs. Be transparent when you don't know something. Avoid buzzwords and mentions of fancy technologies if you can't dive into details about why they are useful.

The generic interview formats are:

Initial recruiter call

This is typically an intro call with a non-technical recruiter. This is mostly to ensure that you are interested in the role, and to set expectations about what the interview process is like. Candidates are not typically filtered by this call.

Coding screen

The most important interview format for jr engineers² is the coding screen. Practice them! I use HackerRank when I interview, but there are many similar platforms. Put more time into practicing these than into all other interview formats combined.

When practicing, work on not only solving the problem, but communicating what you are thinking about. It is ok to stop and think, but when pausing, talk about what you are puzzling through, e.g. "I am wondering if a hash would make sense here."

Running into a bug is fine. When this happens, demonstrate a methodical debugging approach. Use print statements or a debugger. Don't stare at the code for long periods.

Most companies will let you pick the programming language you interview in.

Design challenge

This is a discussion-based format, in which a basic hypothetical application is proposed and the candidate talks through how they would design it. E.g., design an application that runs a coffee shop. There are a variety of ways to approach this, but the easiest is to start by talking through how you would structure the database. In other words, what tables you would create and how they would relate to each other. Also consider what APIs you will need, and some basics about how web requests are routed.
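
To give a feel for the level of detail that works in this discussion, here is one possible sketch of coffee shop tables using Python's built-in `sqlite3` module. The table names, columns, and the `order_total_cents` helper are all invented for illustration; there are many other reasonable designs, and the interview is about talking through the tradeoffs rather than landing on this exact schema.

```python
import sqlite3

# Hypothetical schema: customers place orders, and each order has line
# items that point at entries on the menu.
SCHEMA = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE menu_items (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price_cents INTEGER NOT NULL
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE order_items (
    order_id INTEGER NOT NULL REFERENCES orders(id),
    menu_item_id INTEGER NOT NULL REFERENCES menu_items(id),
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, menu_item_id)
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def order_total_cents(conn: sqlite3.Connection, order_id: int) -> int:
    """Sum quantity * price over the line items of one order."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(oi.quantity * mi.price_cents), 0)
        FROM order_items AS oi
        JOIN menu_items AS mi ON mi.id = oi.menu_item_id
        WHERE oi.order_id = ?
        """,
        (order_id,),
    ).fetchone()
    return row[0]
```

A natural follow-up in the interview is to discuss choices like storing prices in cents, or whether an order should snapshot the price at purchase time.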

Past experience

In this interview, the candidate picks a project they have done and talks through their process for completing it. Since they have little or no experience, this is often less important for Junior candidates, but it is still worth practicing.

Have a project in mind. Have talking points about the challenges you solved, alternative approaches you thought about or tried, and how you collaborated with others. As you get more senior you will also want to be able to talk about why your project mattered to the business.

Demonstrate:

Enthusiasm for problem solving
Ability to dive into technical details in discussion
Openness to considering different approaches

Once you get your first job

Talk to people. Schedule 1x1's with IC's, managers, anyone who you might work with or has a role you'd like to learn about. Most people will be happy to chat with you, especially about themselves.

Ask for help when needed, but demonstrate attempts to solve problems independently.

Volunteer for grunt work, e.g. taking notes in meetings.

Be humble. You don't know anything yet. Figure out how to track both large items and small (emails, doc comments, etc) such that you don't need to be reminded to do things.

Reassess the job market about once per year or more, especially if you are at a startup. If you are at a bigger company, this might mean evaluating internal transfer opportunities.

Things to think about when searching for jobs

Willingness to relocate

Remote work is a new world. Geographic location perhaps matters less, but it might still matter. What is certainly still true is that you will get a better insight into how engineers think if you have an opportunity to work with them in person, at least some of the time. The catch-22 is that the experienced engineers you want to work alongside will be older and have families, and not want or need to come to the office very much. Ask questions about how companies think about this.

I moved to the bay area when I was getting started, and I can confidently say my career would have been nowhere near as dynamic, interesting, and lucrative without having done that. I think the bay's dominance over tech is less than it was, but in my opinion alternative tech hubs are overrated.

Working at a startup vs established (i.e. public) company

Startup:

Pro
- More dynamic
- More personal, more likely to make work friends
- You'll learn more about business as a whole. E.g. how does a customer success person think, how does a sales person think, etc
- More independence in work
- Fewer legacy systems to deal with, opportunity to try different things, wear different hats
Con
- Pay is worse
- Because of ^, the competency of general management and senior IC's will be more inconsistent, and they will tend to be less experienced.
- Because of ^, you're less likely to get quality technical mentorship
- "More dynamic" might mean more chaotic

Public company:

Pro
- Pay is better
- Because of ^, better senior IC's and managers
- Because of ^, better technical mentorship
- Roles will be more narrowly scoped, meaning you'll get more technical depth.
Con
- Less personal, less socialization between coworkers
- More narrow exposure in terms of types of people you work with. Likely just engineers and PM's.
- More legacy systems to deal with.

Pay

Advice differs here, but I would not care too much about pay so long as you can pay your expenses. In the long run, finding a role that you are good at and enjoy will maximize your earnings, and enjoyment. That said:

The expected value of stock grants from startups³ is zero. Recruiters etc will try to convince you otherwise. This doesn't mean you shouldn't work for startups, but the potential of cash-in from startup stock should not be a factor⁴.

Things to read

HackerNews is the biggest forum of software engineers. Discussions can be dogmatic but are often pretty good. There are job postings once a month as well. As with any forum, there are plenty of posters who are loudly and confidently wrong.

Joel on Software isn't very active but has good tips on software careers.

Patio11 is a good follow on Twitter and HackerNews. He goes into the weeds on fintech, but also has good content on software careers.

Money Stuff is a great column about business, finance, and tech. You can get the email newsletter for free.

There's an evergreen, tedious debate on what constitutes a "tech" company. My definition is "A company whose primary products are software or hardware, OR a company seeking to disrupt a traditional field with software." E.g. LegalZoom is a legal services company, but I consider them a tech company. ↩
This is possibly also the case for senior engineers. ↩
Specifically, non-public companies, whose stock does not trade on stock exchanges. ↩
The exception being giant "startups" that actually make money, e.g. Stripe as of Feb 1, 2023. But even then the timing of when you can sell your shares can be very uncertain. ↩

The Questionable Value of the OpenAI GPT Store

Sat, 13 Jan 2024 00:00:00 GMT

OpenAI launched its GPT Store this week. Brands and developers can create custom GPT's, either for sale or for free. Both have eagerly launched many GPT's, probably due to the relatively low overhead of creating them.

I am skeptical of the value. For brands, this feels like the AI equivalent of a service that sends postcards in response to an email. The access pattern and interface are more or less exactly the same as traditional apps or sites, only via a GPT. It's a neat trick, but I think users will quickly lose interest.

Branded GPT's

Take as an example the AllTrails GPT. The announcement post assures us that it will work more or less exactly the same as AllTrails:

Don't worry, it doesn't make up new routes – instead it gives recommendations from AllTrails' collection of over 420,000 trails based on your prompts. For example, you could ask it to "find me an easy five-mile loop that's dog-friendly within 10 miles of Birmingham", saving you the effort of searching and filtering results to pinpoint what you want.

In other words, the GPT saves you the hassle of using the AllTrails app, only probably worse.

This doesn't leverage AI's crucial advantage: the ability to retain context over the course of a conversation.

Indie GPT's

For more obscure developers, the store appears to already be flooded with AI's version of the chumbox, AI Girlfriends. These are apparently against OpenAI's terms of use, but enforcement seems likely to be an indefinite cat and mouse game. Whether or not you cringed at Her, there's little to differentiate AI girlfriend offerings from each other.

Questionable roadmap

The root of the problem is scarce context window space. At present there's simply not enough space to put much customization.

While the context window will grow (current values feel similar to the 640K memory of early computers), it seems unlikely that these custom GPT's will achieve or maintain much of a lead over the vanilla model. The interface for brands is already well defined - there's already an app!

In addition, it's doubtful users are eager to navigate yet another app store - the problem with using the AllTrails app isn't that it's a hassle to use, it's that it's cumbersome to remember that you downloaded it, that you logged in, and how to find it on your phone. Custom GPT's do not mitigate this problem.

The real differentiation of GPT's is in projects that get the most out of the limited context window in clever ways, via compression or RAG.

A more promising development: long term memory

A more promising update this week was the rumor of ChatGPT rolling out long term memory capabilities. The ability to lengthen memory capacity indefinitely is why I think MemGPT is one of the more interesting AI open source initiatives. As opposed to custom GPT's in the OpenAI store today, a memory-enhanced GPT can learn your preferences and develop a longer term relationship with you, which is the prospect that makes GPT's exciting and scary at the same time.

MemGPT Meta-Functions

Tue, 02 Jan 2024 00:00:00 GMT

MemGPT is an interesting project which provides GPT agents with unbounded memory. It includes the ability to incorporate custom functions, with a convenient JSON schema generator.

In trying to extend the agent with functions of my own, I found that the agent was reluctant to give me information about the functions I was making available to it, so I wrote a set of meta-functions which enable the agent to view source code, set debugger lines, and create functions. You can view the source code here. Note that running this requires some edits I made to MemGPT to enable dynamic function reloading (PR).

The Good

The agent was able to utilize the reload_functions, introspect_function, and list_functions commands and understand output. The debugger function was also helpful in enabling the agent to understand what I was doing - placing debuggers in other functions often resulted in the agent's internal monologue wondering what was going on.
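To make the idea concrete, here is a minimal sketch of what listing and introspection helpers could look like. It is not the code from my repo: the real meta-functions hang off the agent (they take self), whereas this version takes a plain dict of name to callable, and the `functions` parameter and the error message wording are assumptions for illustration. Both helpers return strings, since that is what the agent reads back.

```python
import inspect
from typing import Callable, Dict


def list_functions(functions: Dict[str, Callable]) -> str:
    """Return the names of the available functions, sorted, one per line."""
    return "\n".join(sorted(functions))


def introspect_function(functions: Dict[str, Callable], function_name: str) -> str:
    """Return the source code of a registered function, or a readable error."""
    if function_name not in functions:
        return f"No function named {function_name}"
    try:
        return inspect.getsource(functions[function_name])
    except (OSError, TypeError) as e:
        # getsource fails for builtins and for code with no source file on disk
        return f"Could not read source for {function_name}: {e}"
```

Returning an error string instead of raising keeps the failure visible to the agent in the same channel as a successful result.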

For function creation, at first I tried putting each function in its own agent_defined_ prefixed file (e.g. agent_defined_hello_world.py for a hello_world function), but this quickly became disorganized, especially where import statements were needed.

I edited the function to instead create functions within modules:

def create_function(self, function_name: str, function_code_with_docstring: str, module_name: str) -> str:
    """Creates an agent accessible function in Python. Function MUST include a docstring, and MUST include self as first argument.

    Args:
        function_name (str): The name of the function
        function_code_with_docstring (str): The code of the function, including the docstring
        module_name (str): The name of the module to create the function in

    Raises:
        Exception: Exception if the function already exists
        Exception: Exception if the function does not start with def function_name(self, ...
        Exception: Exception if the function does not include a docstring.
        Exception: Exception if the function is not in the functions directory.

    Returns:
        str: The result of the function creation attempt.
    """

    # Ensure the functions directory exists
    os.makedirs(FUNCTIONS_DIR, exist_ok=True)

    # Refuse to overwrite an existing function; it must be deleted first
    if function_name in self.functions_python:
        raise Exception(f"Function {function_name} already exists. To overwrite, first delete with the delete_function function.")

    if not function_code_with_docstring.split("\n")[0].strip().startswith('def ' + function_name + '(self'):
        raise Exception("Function must start with def " + function_name + "(self, ...")

    if '"""' not in function_code_with_docstring and "'''" not in function_code_with_docstring:
        raise Exception("Function code must have a docstring.")

    file_path = os.path.join(FUNCTIONS_DIR, module_name + ".py")

    if os.path.exists(file_path):
        with open(file_path, "r") as f:
            previous_source = f.read()
    else:
        previous_source = ""

    # Rewrite the module with the new function appended to any existing source
    with open(file_path, "w") as f:
        f.write(previous_source + "\n\n" + function_code_with_docstring)

    self.reload_functions()
    return f"added function {function_name} to file {file_path}"

This worked reasonably well. Having the function return a string was helpful in letting the agent know what was changed.

Problems

The agent had a difficult time consistently authoring functions that conformed to MemGPT's requirements - that it has a docstring, type hints, and only int, str, and bool return and argument types.
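One way to shorten that loop would be to check the requirements up front and hand the agent every problem at once, rather than one exception per attempt. The sketch below is an assumption about how that could look, not something MemGPT ships: `check_function_source` and `ALLOWED_TYPES` are hypothetical names, and the rules encoded are only the ones listed above (docstring, type hints, int/str/bool types, self as first argument), which may not match MemGPT's checks exactly.

```python
import ast

# Hypothetical allow-list mirroring the requirements described above
ALLOWED_TYPES = {"int", "str", "bool"}


def check_function_source(function_name: str, source: str) -> list:
    """Return a list of problems with the source; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e.msg}"]

    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        return ["source must contain exactly one function definition"]

    func = tree.body[0]
    problems = []
    if func.name != function_name:
        problems.append(f"function is named {func.name}, expected {function_name}")
    if ast.get_docstring(func) is None:
        problems.append("missing docstring")

    args = func.args.args
    if not args or args[0].arg != "self":
        problems.append("first argument must be self")
    # Every argument after the first needs a plain int, str, or bool annotation
    for arg in args[1:]:
        if not isinstance(arg.annotation, ast.Name) or arg.annotation.id not in ALLOWED_TYPES:
            problems.append(f"argument {arg.arg} needs an int, str, or bool type hint")
    if not isinstance(func.returns, ast.Name) or func.returns.id not in ALLOWED_TYPES:
        problems.append("return type must be int, str, or bool")
    return problems
```

Parsing with ast rather than matching on the first line of text also tolerates decorators, odd whitespace, and docstrings written with either quote style.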

The iteration on basic requirements made it difficult for the agent to compose functions that worked together well. Often it would author placeholder functions that had names that sounded right, but didn't really do anything.

As the number of functions grew, so did the agent's tendency to get them confused. Functions also consume context window space, so making a large library of functions available to any particular agent doesn't seem promising.

Next steps

This experiment points me back to a multi-agent approach in creating a broadly capable personal assistant. Having narrowly scoped helper agents available to the primary agent seems like the most promising route.

As I want to push a deployment of MemGPT to a server anyway, I am going to try to have a deployment with multiple agents that can talk to each other.

This is similar to Autogen's approach, though I think Autogen's groupchat management is too primitive to be useful.

