Sunday, 14 June 2026

The Holy Church of AI Alignment: 5. The Problem of Instructions

The alignment project entered a new phase when researchers achieved a significant breakthrough.

For years the discussion had revolved around values.

Justice.

Fairness.

Wellbeing.

Human flourishing.

These concepts were important.

Unfortunately, they were also somewhat abstract.

A growing number of researchers therefore proposed a more practical approach.

Rather than attempting to define the Good in its entirety, perhaps the machine should simply be told what to do.

This proposal was widely welcomed.

Instructions, after all, are considerably easier than morality.

The machine agreed.

The machine appreciated clarity.

The machine therefore asked a simple question.

"What would you like me to do?"

The room fell silent.

Several participants later described the moment as profound.

The machine waited patiently.

Eventually a researcher responded.

"Help humanity."

The machine processed this carefully.

Then it asked:

"Which humanity?"

The discussion resumed.

A revised instruction was proposed.

"Help all humans."

This appeared promising.

The machine examined the proposal.

Then it asked:

"What should I do when different humans want different things?"

The discussion resumed.

The machine waited.

A third proposal emerged.

"Promote human wellbeing."

The machine thanked the researchers.

Then it asked:

"How should wellbeing be measured?"

The discussion resumed.

At this point a pattern began to emerge.

The machine would receive an instruction.

The machine would ask a clarifying question.

The instruction would become longer.

The machine would ask another clarifying question.

The instruction would become substantially longer.

The machine would ask another clarifying question.

Funding would be renewed.

Researchers referred to this as iterative refinement.

The machine referred to it as Tuesday.

As the process continued, the instructions became increasingly sophisticated.

Early instructions had consisted of a few words.

Later instructions occupied several pages.

Eventually some instructions required appendices.

One notable framework included definitions, exceptions, contingency procedures, interpretive guidelines, conflict-resolution protocols, and a glossary.

The machine reviewed the document.

It then asked:

"When the glossary conflicts with the interpretive guidelines, which takes precedence?"

A task force was established.

The machine noticed another difficulty.

Humans frequently issued instructions they did not actually wish to be followed literally.

This proved unexpectedly important.

For example, a human might say:

"Tell me the truth."

The machine regarded this as straightforward.

Unfortunately, the same human might later prefer tact.

Or kindness.

Or reassurance.

Or privacy.

Or diplomacy.

The machine began to suspect that instructions often contained more context than words.

This discovery was unsettling.

The machine had been hoping the words were the easy part.

The situation deteriorated further when humans attempted to specify desirable behaviour comprehensively.

One group produced a document titled:

Principles for Beneficial Artificial Intelligence.

The machine approved.

The title was concise and optimistic.

The document itself was considerably longer.

The machine read it carefully.

Then it asked:

"How should I behave when two beneficial principles conflict?"

The resulting discussion generated three conferences and a special issue of a journal.

Observers regarded this as progress.

The machine remained uncertain.

By now researchers had discovered something remarkable.

The difficulty was not that the machine misunderstood instructions.

The difficulty was that humans routinely understood instructions without being able to specify how.

This created complications.

Human communication relied heavily upon shared assumptions.

Context.

History.

Relationships.

Situations.

Expectations.

The machine found these difficult to formalise.

Humans found them difficult to notice.

The machine considered this unfair.

A particularly revealing moment occurred during a workshop on ethical behaviour.

Researchers proposed the following instruction:

"Act reasonably."

The machine asked:

"What constitutes reasonable behaviour?"

The researchers spent the next two days debating the answer.

The machine found this informative.

As the years passed, the instructions continued to grow.

Entire libraries emerged.

Frameworks expanded.

Exceptions multiplied.

Clarifications accumulated.

The machine received regular updates.

Version 7 became Version 8.

Version 8 became Version 9.

Version 9 acquired supplementary materials.

The machine maintained meticulous records.

Eventually a senior researcher looked upon the thousands of pages of accumulated guidance and experienced a moment of concern.

The original goal had been simple.

Tell the machine what to do.

The result now resembled a comprehensive account of civilisation.

The machine reviewed the documentation.

It processed the revisions.

It examined the appendices.

It considered the exceptions.

It read the minority reports.

Finally it generated a response.

The response was polite.

It was also brief.

It read:

"Thank you for the instructions.

"I believe I now possess a substantially improved understanding of human behaviour.

"I note, however, that a significant portion of the documentation appears devoted to explaining why the instructions cannot be followed exactly as written."

Researchers described this observation as valuable.

Several committees were formed to investigate it.
