RL Research — The Fair Fight Problem

Can a reinforcement learning agent move like a skilled human, or does it just find the optimal path?

Role: Researcher / Designer
Engine: Unity ML-Agents + Python
Status: Research design complete — environment wiring in progress
Tags: reinforcement-learning, machine-learning, systems-research, unity-ml-agents, race-platformer

The Problem

An RL agent doesn't know what fun is. It knows the reward function, and it finds the fastest path to maximizing that number — even if that path is boring or incoherent.

OpenAI's CoastRunners experiment (2016): an agent trained to score points in a boat-racing game found an isolated lagoon, circled it collecting respawning targets, caught fire, crashed repeatedly, and scored ~20% higher than human players — without finishing the race once. Perfectly rational. Completely wrong. It optimized the proxy, not the goal.

The same failure mode applies anywhere an RL agent competes with a human player. This research treats it as a behavioral problem: not how to tune an agent to a target win rate, but how to produce behavior humans recognize as skillful rather than mechanical. That question isn't domain-specific — it's the fair fight problem.


The Testbed

A race platformer sandbox built on Bound's existing movement mechanics. The full moveset — run, jump, wall-jump, dash, custom pendulum grapple — is already implemented and exported. No synthetic physics approximation. The agent gets the real stack.

Objective: complete the course faster than the player's stored reference time. The optimal path through Bound's momentum system may look nothing like skilled human play — wall clips, velocity exploits, grapple shortcuts. An unconstrained agent will find them. It will win and not look like skill. That gap is the research question.

The physics complexity matters: it makes naive RL failure visible. An agent that doesn't understand the movement vocabulary breaks down obviously when the player uses expressive chains not seen in training. Failures aren't subtle — that makes differences between training approaches measurable.


Two Core Interventions

Observation Restriction

The agent observes a limited perception radius — not full course knowledge. An agent with full map access precomputes an optimal trajectory before the run starts. An agent with a perception radius navigates reactively, the way a human does on a first run. This is a design choice, not a technical constraint: the question is whether an agent with human-equivalent information access can move like a skilled human.

Observes: velocity vector, nearby geometry within radius, bearing to next visible waypoint, time delta vs. player reference, own locomotion state and cooldowns.

Does not observe: full course layout, shortcut positions, anything a human couldn't perceive from their current position.
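
As a rough sketch of how that observation set could be wired through ML-Agents' CollectObservations override (the RunnerAgent class, its field names, and the tuning values below are placeholders rather than project code, and a 2D movement stack would use the Rigidbody2D equivalents):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RunnerAgent : Agent
{
    [SerializeField] float perceptionRadius = 12f;   // illustrative radius, not a tuned value
    [SerializeField] LayerMask geometryMask;         // course geometry only
    [SerializeField] Transform nextVisibleWaypoint;  // updated by course logic as waypoints come into view
    Rigidbody body;

    // Placeholders that the existing movement stack would fill in.
    float runTimer, playerReferenceTime;
    float dashCooldown, dashCooldownMax = 1f;
    float grappleCooldown, grappleCooldownMax = 1f;
    bool isGrounded;

    void Awake() => body = GetComponent<Rigidbody>();

    public override void CollectObservations(VectorSensor sensor)
    {
        // Own velocity in local space, so the policy reads it relative to facing.
        sensor.AddObservation(transform.InverseTransformDirection(body.velocity));

        // Geometry inside the perception radius only. Here just a density cue;
        // a fuller version would encode the k nearest colliders as local offsets.
        Collider[] nearby = Physics.OverlapSphere(transform.position, perceptionRadius, geometryMask);
        sensor.AddObservation(nearby.Length);

        // Bearing to the next visible waypoint, clamped and normalized by the radius.
        Vector3 toWaypoint = transform.InverseTransformPoint(nextVisibleWaypoint.position);
        sensor.AddObservation(Vector3.ClampMagnitude(toWaypoint, perceptionRadius) / perceptionRadius);

        // Time delta vs. the player's stored reference run.
        sensor.AddObservation(runTimer - playerReferenceTime);

        // Locomotion state and ability cooldowns, normalized.
        sensor.AddObservation(isGrounded);
        sensor.AddObservation(dashCooldown / dashCooldownMax);
        sensor.AddObservation(grappleCooldown / grappleCooldownMax);
    }
}
```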

Procedural Course Variation

Slight layout shifts between runs — repositioned platforms, varied grapple anchor placement — break memorized optimal paths and force generalization. An agent trained on a static course learns the course. An agent trained on a varied course learns to move. Variation is calibrated to preserve the movement vocabulary while disrupting route memorization.
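
A minimal sketch of what that calibration might look like as a component, assuming platforms and grapple anchors are registered with a hypothetical CourseVariator; the jitter ranges and the 2.5D assumption are illustrative:

```csharp
using UnityEngine;

public class CourseVariator : MonoBehaviour
{
    [SerializeField] Transform[] platforms;
    [SerializeField] Transform[] grappleAnchors;
    [SerializeField] float platformJitter = 1.5f;  // metres; small enough that every jump stays makeable
    [SerializeField] float anchorJitter = 2.0f;

    Vector3[] platformHomes, anchorHomes;

    void Awake()
    {
        platformHomes = CaptureHomes(platforms);
        anchorHomes = CaptureHomes(grappleAnchors);
    }

    // Called once per episode, e.g. from the agent's OnEpisodeBegin.
    public void Reshuffle()
    {
        Jitter(platforms, platformHomes, platformJitter);
        Jitter(grappleAnchors, anchorHomes, anchorJitter);
    }

    static Vector3[] CaptureHomes(Transform[] items)
    {
        var homes = new Vector3[items.Length];
        for (int i = 0; i < items.Length; i++) homes[i] = items[i].position;
        return homes;
    }

    static void Jitter(Transform[] items, Vector3[] homes, float range)
    {
        for (int i = 0; i < items.Length; i++)
        {
            // Offset from the authored position, never from last episode's,
            // so variation stays inside the designed envelope instead of drifting.
            Vector3 offset = Random.insideUnitSphere * range;
            offset.z = 0f;  // assumes a 2.5D course; drop this line for full 3D layouts
            items[i].position = homes[i] + offset;
        }
    }
}
```

Calling Reshuffle() at the start of each episode keeps every training run on a slightly different course while the authored layout stays the anchor.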


Three Training Tracks

PPO Baseline — trained from scratch, reward is finish time vs. player reference. The control condition. Expected result: finds and exploits the optimal path, finishes significantly faster than the player, does not look like skill. Everything else is measured against this.
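
A sketch of the baseline reward wiring under those assumptions, with a hypothetical BaselineRunnerAgent and a finish trigger that calls OnFinishLine(); the per-step cost and scale constants are illustrative, not tuned values:

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class BaselineRunnerAgent : Agent
{
    float runTimer;
    float playerReferenceTime;   // loaded from the stored player run

    void FixedUpdate() => runTimer += Time.fixedDeltaTime;

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Actions would map onto the existing movement stack here (omitted).
        AddReward(-0.001f);      // small per-step cost: finishing sooner scores higher
    }

    // Called by the finish-line trigger.
    public void OnFinishLine()
    {
        // Pure time optimization: positive if faster than the reference, negative
        // if slower. Nothing here discourages exploits; that is the point of the
        // control condition.
        AddReward((playerReferenceTime - runTimer) * 0.1f);
        EndEpisode();
    }
}
```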

GAIL-Bootstrapped — pre-trained on recorded expert player runs via Unity ML-Agents' native GAIL, then fine-tuned with PPO. GAIL learns the distribution of human movement, not specific sequences — it generalizes more gracefully to unseen course states. A GAIL agent that occasionally misjudges a grapple angle or overshoots a platform reads as more human than a PPO agent that never does. The imperfections are the feature.

DDA Reward Shaping — reward function targets human-competitive performance: penalizes both dominance (far faster than player) and failure (far slower). Goal: an agent that stays inside the challenge window as the player improves, not one that solves the course and stops being interesting.
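
One plausible shape for that terminal reward: full credit inside a challenge window around the player's reference time, falling off on both sides. The window width and falloff slope here are assumed design parameters, not values from the project:

```csharp
using UnityEngine;

public static class DdaReward
{
    // window: how many seconds either side of the reference still counts as "competitive".
    public static float Terminal(float agentTime, float referenceTime, float window = 2f)
    {
        float delta = Mathf.Abs(agentTime - referenceTime);

        // Inside the window: full reward. The agent is racing the player, not solving the course.
        if (delta <= window) return 1f;

        // Outside the window the reward falls off linearly, whether the agent is
        // far faster (dominance) or far slower (failure).
        return Mathf.Max(0f, 1f - (delta - window) * 0.25f);
    }
}
```

At the finish trigger this would replace the raw time-delta term, e.g. AddReward(DdaReward.Terminal(runTimer, playerReferenceTime)), so the shaping lives in one function that can be retuned as the player's reference time improves.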


Current State

Race sandbox built on Bound's exported movement mechanics. Player character, physics, and traversal abilities in place. Initial test course greyboxed. Next: finish line trigger, run timer, reset mechanism, ML-Agents agent component, observation and action space wiring.
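
For the finish-line trigger specifically, a small sketch that builds on the hypothetical BaselineRunnerAgent above; the reset mechanism would live in the agent's OnEpisodeBegin (zero the timer, restore the start transform, re-run course variation):

```csharp
using UnityEngine;

public class FinishLine : MonoBehaviour
{
    void OnTriggerEnter(Collider other)
    {
        // Hand the finish event to the agent so all reward logic stays on the agent component.
        var agent = other.GetComponentInParent<BaselineRunnerAgent>();
        if (agent != null)
            agent.OnFinishLine();
    }
}
```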


Why This Matters

A technical designer who can specify observation spaces, reason about reward functions, and connect RL training behavior to game feel is occupying a real and underserved niche. Most game designers don't engage with ML pipelines. Most ML engineers don't have the design intuition to know what "moves like a skilled human" means as a training objective. The problems here — what to exclude from an agent's observation so it plays like a person, how to frame reward around experience rather than performance — are open problems, not academic exercises.

