Specifically, DAgger proceeds by iterating the following procedure. At the first iteration, a dataset D of trajectories is collected from the expert policy and used to train a first policy π_1 that best fits those trajectories without overfitting them. Then, during iteration i, new trajectories are collected by running the learned policy π_i, the visited states are labeled with the expert's actions, and the resulting state-action pairs are added to the dataset D. Finally, the aggregated dataset D, containing both the new and the old data, is used to train the next policy π_{i+1}.
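To make the loop concrete, here is a minimal sketch in Python. The names `env`, `expert_policy`, and `fit_policy` are assumed interfaces invented for illustration, not part of any real library; the trainer is treated as a black box that fits a policy to state-action pairs. For simplicity the rollouts here use the learned policy directly, whereas the paper mixes in the expert with a decaying probability β_i.

```python
import numpy as np

def dagger(env, expert_policy, fit_policy, n_iters=10, horizon=100):
    """Sketch of the DAgger loop under assumed interfaces:
      env.reset() -> state, env.step(action) -> next state
      expert_policy(state) -> expert action
      fit_policy(states, actions) -> callable policy: state -> action
    """
    states, actions = [], []

    # Iteration 1: seed the dataset D with expert trajectories.
    s = env.reset()
    for _ in range(horizon):
        a = expert_policy(s)
        states.append(s)
        actions.append(a)
        s = env.step(a)
    policy = fit_policy(np.array(states), np.array(actions))

    # Iterations 2..n: visit states under the *learned* policy, but
    # store the expert's action at each visited state, then retrain
    # on the aggregated dataset D (old and new pairs together).
    for _ in range(n_iters - 1):
        s = env.reset()
        for _ in range(horizon):
            states.append(s)
            actions.append(expert_policy(s))  # expert labels the state
            s = env.step(policy(s))           # learner picks the move
        policy = fit_policy(np.array(states), np.array(actions))

    return policy
```

Letting the learned policy drive the rollouts while the expert only supplies the action labels is the key design choice: it makes the dataset cover the states the learner actually reaches, rather than only those along expert trajectories.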
As reported in the DAgger paper (https://arxiv.org/pdf/1011.0686.pdf) ...