Project Overview
This project simulates a prey-predator ecosystem using multi-agent reinforcement learning. We apply the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm to learn cooperative behaviors among predators while prey try to escape.
The environment is a 2D world where multiple species interact dynamically: predators hunt, prey evade, and some species can even be both. This setup reflects the complexity of real ecosystems, where survival depends on both competition and cooperation.


Technical Approach
Environment
Agents move in a 2D world that is either bounded by borders or wraps around as a looping torus. Species are assigned predator/prey roles, and interactions are triggered by pairwise distances. Complex food chains can be modeled, where a species is predator to one group and prey to another.
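As an illustration, here is a minimal sketch of how the bordered versus toroidal dynamics and distance-based interactions could look; the world size, catch radius, and function names are assumptions, not the project's actual code.

```python
import numpy as np

WORLD_SIZE = 1.0  # assumed normalized world extent

def wrap_torus(pos):
    """Torus mode: positions wrap around the edges of the world."""
    return pos % WORLD_SIZE

def clip_to_borders(pos):
    """Bordered mode: positions are clamped inside the plane."""
    return np.clip(pos, 0.0, WORLD_SIZE)

def is_captured(predator_pos, prey_pos, catch_radius=0.05):
    """A predator-prey interaction triggers when the pair is within catch_radius."""
    return np.linalg.norm(predator_pos - prey_pos) < catch_radius
```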
Observations & Actions
Each agent observes the relative positions of the other agents plus its own previous action. Actions are continuous 2D vectors in [-1, 1] that define movement directions, scaled by the species' speed.
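A minimal sketch of how such observations and actions could be assembled; the agent attributes (pos, speed) and the time step dt are illustrative assumptions.

```python
import numpy as np

def build_observation(agent, others, last_action):
    """Observation: relative positions of all other agents plus the agent's previous action."""
    rel_positions = [other.pos - agent.pos for other in others]
    return np.concatenate(rel_positions + [last_action])

def apply_action(agent, action, dt=0.1):
    """Actions are continuous 2D vectors in [-1, 1]; displacement is scaled by species speed."""
    action = np.clip(action, -1.0, 1.0)
    agent.pos = agent.pos + agent.speed * action * dt
    return agent.pos
```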
Rewards
The reward system balances survival and cooperation:
- Predator catching prey: +10
- Prey captured: -10
- Evasion: reward proportional to distance from predators
- Cooperation: shared reward for minimizing team distance to prey
- Borders: penalties when crossing limits if borders are enabled
Episodes end when all prey are caught or after a fixed number of steps.
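The sketch below illustrates how this reward scheme could be computed; the cooperation and evasion coefficients and the catch radius are illustrative assumptions, not the values used in the project.

```python
import numpy as np

CATCH_REWARD = 10.0  # +10 for a capture, -10 for being captured (see list above)

def predator_rewards(predators, remaining_prey, catch_radius=0.05, coop_coef=0.1):
    """Shared predator reward: capture bonus plus a cooperation term that grows
    as the team's total distance to the prey shrinks."""
    rewards = np.zeros(len(predators))
    for prey in remaining_prey:
        dists = [np.linalg.norm(p.pos - prey.pos) for p in predators]
        if min(dists) < catch_radius:
            rewards += CATCH_REWARD        # shared by the whole predator team
        rewards -= coop_coef * sum(dists)  # cooperation: minimize team distance to prey
    return rewards

def prey_reward(prey, predators, caught, evade_coef=0.1):
    """Prey reward: penalty when captured, otherwise proportional to distance from predators."""
    if caught:
        return -CATCH_REWARD
    return evade_coef * min(np.linalg.norm(p.pos - prey.pos) for p in predators)
```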
Algorithm: MADDPG
The core idea of MADDPG is centralized training with decentralized execution. Each agent has its own actor network (policy) that maps its local observation to an action, while a centralized critic takes the observations and actions of all agents to estimate the joint action-value function. This architecture lets agents learn cooperative behavior by accounting for the other agents' actions during training.
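A minimal PyTorch sketch of this actor/centralized-critic architecture; the hidden sizes and layer choices are assumptions, not the project's exact networks.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps one agent's local observation to a 2D action in [-1, 1]."""
    def __init__(self, obs_dim, act_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: estimates Q from the observations and actions of all agents."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```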
To exploit invariance within a species, we also experimented with sharing networks across agents of the same species. While promising, this introduced training instability due to conflicting gradients from agents with different experiences.
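Parameter sharing can be sketched as one policy network per species, reused by all of its agents; the function name and dimensions below are assumptions for illustration.

```python
import torch.nn as nn

def make_species_actors(obs_dim, act_dim=2, species=("predator", "prey"), hidden=64):
    """One shared policy per species: every agent of that species uses the same weights,
    so gradients from agents with different experiences update the same parameters."""
    return {
        s: nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, act_dim), nn.Tanh())
        for s in species
    }

shared_actors = make_species_actors(obs_dim=10)
predator_policy = shared_actors["predator"]  # reused by all predator agents
```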
Key Insights
Our experiments revealed several important findings about multi-agent cooperation. Initially, with individual rewards, predators acted selfishly and failed to cooperate effectively. However, introducing shared rewards dramatically improved coordination, leading to successful captures through teamwork.
Complex food chains generated chaotic but realistic dynamics that were highly sensitive to agent initializations, making training more challenging. We found that modeling the world as a torus made prey escape easier and interpretation more difficult, so we switched to a 2D plane with borders for clearer analysis.
While shared networks for species improved representation capacity, they slowed convergence due to unstable credit assignment between agents with different experiences. Most importantly, careful tuning of agent speeds was crucial for observing cooperative behavior. We set prey to be much faster than predators to force the predators to work together rather than compete individually.
Documentation
More details can be found in the report and the poster.
References
- Lillicrap, T.P., et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.
- Lowe, R., et al. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. NeurIPS 30.