AWS DeepRacer Local Training DRFC

Darren Broderick (DBro)
Feb 23, 2023

Training Locally on DRFC -> Troubleshooting

Handy Commands

  • dr-upload-model -b -f -i
  • dr-upload-model -b -f -I "name of model"
  • If you have disk problems do "docker system prune"

So you’re training locally on DRFC for AWS DeepRacer, great!

But it’s not always straightforward. Sometimes it’s easy to forget how to run or update your stack after the initial setup, or you hit new errors when starting training again, especially after a season break.

This article can be used as a supplement to the main DRFC guide.
https://aws-deepracer-community.github.io/deepracer-for-cloud

It’s a list of the commands and steps I follow, plus troubleshooting problems and solutions I’ve faced when training locally.

Hopefully it can help you too, but bear in mind it is tailored to how I run things.

Contents

  1. Handy monthly items
  2. General Training Starting Steps
  3. Virtual DRFC Upload
  4. Physical DRFC Upload
  5. Container Update Links
  6. Open GL Robomaker
  7. New Sagemaker -> M40 Tagging
  8. Log Analysis
  9. Run Second DRFC Instance
  10. Steps for fresh DRFC
  11. Troubleshooting DRFC (List of issues & solutions)
  12. Miscellaneous

Handy monthly items

Latest Robomaker Container (For Training)
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated

All Track Files & Details (For DR_WORLD_NAME & Log Analysis)
https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/tracks

Commands

  1. docker ps -a
  2. docker images
  3. docker service ls

General Training Starting Steps

These are the commands I run when starting from a reboot:

  1. source bin/activate.sh
  2. sudo liquidctl set fan1 speed 30
    (This is my own fan setting)
  3. dr-increment-training -f
  4. dr-update OR dr-update-env (I tend to favour -env)
  5. dr-start-training OR dr-start-training -w
  6. dr-start-viewer OR dr-update-viewer
  7. http://127.0.0.1:8100 OR http://localhost:8100
  8. dr-logs-robomaker (dr-logs-robomaker -n2) for worker 2 etc
  9. dr-logs-sagemaker
  10. nvidia-smi (check temperatures)
  11. htop to check threads and memory usage
    (Try to maximise my worker count, but keep to <75%)
  12. dr-start-evaluation -c & dr-stop-evaluation

Virtual DRFC Upload

  1. aws configure
  2. dr-upload-model -b -f
  3. Uploads best checkpoint to s3

Physical DRFC Upload

  1. dr-upload-car-zip -f
  2. Sagemaker must be running for this to work
  3. Only uses last checkpoint, not best
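Since the two upload paths behave differently (the virtual path uploads the best checkpoint, while the car zip uses the last checkpoint and needs sagemaker still running), a tiny wrapper can make the choice explicit. A sketch with a hypothetical script name:

```shell
# Hypothetical wrapper: choose the DRFC upload flavour explicitly.
cat > /tmp/dr-upload-wrapper.sh <<'EOF'
#!/usr/bin/env bash
set -e
case "$1" in
  virtual)  dr-upload-model -b -f ;;  # best checkpoint -> S3 (console import)
  physical) dr-upload-car-zip -f ;;   # last checkpoint -> car zip (sagemaker must be up)
  *) echo "usage: $0 virtual|physical" >&2; exit 1 ;;
esac
EOF
bash -n /tmp/dr-upload-wrapper.sh && echo "syntax OK"
```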

Container Update Links

Check your version with the command "docker images".

Run "docker service ls" to make sure you see s3_minio.

Open GL Robomaker

https://aws-deepracer-community.github.io/deepracer-for-cloud/opengl.html

  1. example -> docker pull awsdeepracercommunity/deepracer-robomaker:4.0.12-gpu-gl
  2. system.env settings (bullet points below)
  • DR_HOST_X=True; uses the local X server rather than starting one within the docker container.
  • DR_ROBOMAKER_IMAGE; choose the tag for an OpenGL-enabled image - e.g. cpu-gl-avx for an image where Tensorflow will use the CPU, or gpu-gl for an image where Tensorflow will also use the GPU.
  • Run echo $DISPLAY and check the output; it should be :0 but might be :1
  • Make the DR_DISPLAY value in system.env match the echo value
  • dr-reload
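Pulling the settings above together, the relevant system.env lines look roughly like this (the image tag and display number are examples; check echo $DISPLAY on your own machine):

```shell
# system.env (excerpt) - OpenGL robomaker settings; values are examples.
cat > /tmp/system.env.opengl <<'EOF'
DR_HOST_X=True
DR_DISPLAY=:0
DR_ROBOMAKER_IMAGE=awsdeepracercommunity/deepracer-robomaker:4.0.12-gpu-gl
EOF
# env files are plain KEY=value shell syntax, so they can be sourced:
sh -c '. /tmp/system.env.opengl && echo "display: $DR_DISPLAY"'
```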

  1. source utils/setup-xorg.sh
  2. source utils/start-xorg.sh
  3. You should see the Xorg processes in nvidia-smi once the start-xorg.sh script is running
  4. To stop: sudo pkill x11vnc
  5. sudo pkill Xorg

New Sagemaker -> M40 Tagging (redundant from v5.1.1)

With the latest images you don’t need to compile a specific image (such as an -m40 image)

run -> docker tag 2b4e84b8c10a awsdeepracercommunity/deepracer-sagemaker:gpu-m40

Log Analysis

  1. run -> dr-start-loganalysis
  2. The only change needed is model_logs_root,
    e.g. 'minio/bucket/model-name/0'
  3. All Track files & details
    https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/tracks
  4. Might have to upload the new track to tracks folder
  5. Repo for all racer data
    https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/leaderboards

Run Second DRFC Instance

  1. Create 2 different run.env or use 2 folders
  2. The DR_RUN_ID keeps things separate
  3. Only 1 minio should be running
  4. Use a unique model name
  5. Run source bin/activate.sh run-1.env to activate a separate environment
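As a sketch, the second run’s env file mainly needs a distinct DR_RUN_ID and model name (the values and file path below are examples; in a real checkout the file lives next to run.env):

```shell
# Hypothetical second-run env file; DR_RUN_ID separates the docker stacks,
# and a unique model prefix keeps checkpoints apart in the shared minio bucket.
cat > /tmp/run-1.env <<'EOF'
DR_RUN_ID=1
DR_LOCAL_S3_MODEL_PREFIX=second-model
DR_WORLD_NAME=reinvent_base
EOF
# In a second terminal, inside the DRFC checkout:
#   source bin/activate.sh run-1.env
#   dr-start-training
sh -c '. /tmp/run-1.env && echo "run id: $DR_RUN_ID"'
```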

Steps for fresh DRFC

https://aws-deepracer-community.github.io/deepracer-for-cloud/installation.html

  1. ./bin/prepare.sh && sudo reboot
  2. docker start
  3. ARCH=gpu
  4. Run LARS script -> source bin/lars_one.sh
  5. docker swarm init (if there are issues, grab your IP with step 6 and run step 8; see the bottom for an example)
  6. ifconfig -a
  7. docker swarm init
  8. docker swarm init --advertise-addr 000.000.0.000
  9. sudo ./bin/init.sh -a gpu -c local
  10. docker images
  11. docker tag xxxxxxx awsdeepracercommunity/deepracer-sagemaker:gpu-m40
  12. source bin/activate.sh
  13. vim run.env
  14. vim system.env
  15. dr-update
  16. aws configure --profile minio
  17. aws configure
    (use real AWS IAM credentials here to allow uploading models)
  18. dr-reload
  19. docker ps -a
  20. Setup multiple GPU
  21. cd custom-files
  22. vim on the 3 files
  23. dr-upload-custom-files

An alternative editor to vim

gedit

Troubleshooting DRFC (List of issues & solutions)

General Tip

It’s always worth checking whether you are missing anything new that has been added to the default files DRFC expects.

In particular, compare the system.env and template-run.env files with your own.

Troubleshooting Docker Start

Docker failed to start

  • docker ps -a
  • docker service ls
  • sudo service docker status
  • sudo service --status-all
  • sudo systemctl status docker.service

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

  • sudo systemctl stop docker
  • sudo systemctl start docker
  • sudo systemctl enable docker
  • sudo systemctl restart docker
  • sudo service docker restart
  • snap list
  • sudo su THEN apt-get install docker.io
  • Re-run Installing Docker (From Lars)
  • cat /etc/docker/daemon.json
  • apt-cache policy docker-ce
  • sudo tail /var/log/syslog
  • sudo cat /var/log/syslog | grep dockerd | tail

For me it was a missing file.

  • sudo gedit /etc/docker/daemon.json
  • Make /etc/docker/daemon.json look like below:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

  • sudo systemctl stop docker then sudo systemctl start docker
  • test with -> docker images
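The daemon.json above can be written and sanity-checked before restarting docker. This sketch writes to a temp path; on a real machine copy it to /etc/docker/daemon.json as root (a typo in that file is exactly what stops dockerd from starting).

```shell
# Write the nvidia runtime config and validate it is well-formed JSON
# before restarting docker.
cat > /tmp/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json is valid JSON"
# then: sudo cp /tmp/daemon.json /etc/docker/daemon.json && sudo systemctl restart docker
```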

Troubleshooting Docker Swarm

Could not connect to the endpoint URL: "http://localhost:9000/bucket"

Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

You might have to disable IPv6 to stop docker pulling from multiple addresses.

Here’s how to disable IPv6 on Linux if you’re running a Red Hat-based system:

  1. Open the terminal window.
  2. Change to the root user.
  3. Type these commands:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1

To re-enable IPv6, type these commands:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=0
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=0
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=0
sysctl -p
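Note that sysctl -w changes do not survive a reboot. To make the IPv6-off setting stick, the same keys can go into a sysctl drop-in file; the sketch below writes to /tmp (the file name is an example), then you copy it into /etc/sysctl.d/ as root.

```shell
# Persist the IPv6-off setting across reboots via a sysctl drop-in
# (written to /tmp here; copy to /etc/sysctl.d/ as root on a real machine).
cat > /tmp/90-disable-ipv6.conf <<'EOF'
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
# then: sudo cp /tmp/90-disable-ipv6.conf /etc/sysctl.d/ && sudo sysctl --system
grep -c disable_ipv6 /tmp/90-disable-ipv6.conf   # prints 3
```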

run -> ./bin/init.sh (Resets run, system.env, hyperparam, RF & model_metadata)

run -> docker pull minio/minio:RELEASE.2022-10-24T18-35-07Z

Make sure DR_MINIO_IMAGE in system.env is set to:
RELEASE.2022-10-24T18-35-07Z

Other Fixes That Might Work for minio

run -> docker swarm init

Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.

run -> docker swarm leave
run -> docker swarm init

Error response from daemon: could not choose an IP address to advertise since this system has multiple addresses on interface

run -> docker network ls
sagemaker-local should appear in the network

IF THERE’S NO sagemaker-local appearing
There’s a fix script for this called “lars_swarm_fix.sh” in the bin folder.

  1. run -> docker swarm leave --force
  2. run -> source bin/lars_swarm_fix.sh

If the script is not included, here is a copy; just update it to your address:

# create the network sagemaker-local if it doesn't exist
SAGEMAKER_NW='sagemaker-local'
# Update 2a00:23c8:33a3:a205 from the results of docker swarm init
docker swarm init --advertise-addr 2a00:23c8:33a3:a205
SWARM_NODE=$(docker node inspect self | jq .[0].ID -r)
docker node update --label-add Sagemaker=true $SWARM_NODE > /dev/null 2> /dev/null
docker node update --label-add Robomaker=true $SWARM_NODE > /dev/null 2> /dev/null
docker network ls | grep -q $SAGEMAKER_NW
if [ $? -ne 0 ]
then
    docker network create $SAGEMAKER_NW -d overlay --attachable --scope swarm
else
    docker network rm $SAGEMAKER_NW
    docker network create $SAGEMAKER_NW -d overlay --attachable --scope swarm --subnet=192.168.2.0/24
fi

Swarm initialized: current node (wv3eqpslrstc6hm7n65744z) is now a manager.

  1. ifconfig -a
  2. docker network ls (to check sagemaker-local has appeared)
  3. You don’t need to run the "docker swarm join --token" command
  4. docker start minio
  5. dr-upload-custom-files
  6. dr-update-env
  7. dr-start-training

Error troubleshooting from running source bin/lars_swarm_fix.sh

The script might need an address; the error message will say: This node is not a swarm manager. Use "docker swarm init"

run -> docker swarm init (and grab the first addr, example below)

  1. docker swarm init --advertise-addr 2a00:23c8::d6c3:4a71:9adb:87ad

Swarm is a docker concept: you can theoretically connect multiple machines together and run DRFC across them, with sagemaker on one PC and the robomakers spread out. Once you have cloned DRFC you can run bin/init.sh -a gpu -c local.

network "sagemaker-local" is declared as external, but could not be found. You need to create a swarm-scoped network before the stack is deployed

run -> docker network ls
sagemaker-local should appear in the network

ERROR: No Swarm Nodes labelled for placement of Robomaker. Please add Robomaker node.
Example: docker node update --label-add Robomaker=true kok6p1fhpuo36uneiehcalgac

Sagemaker: DoorMan: installing SIGINT, SIGTERM

Training doesn’t start; run the commands below to troubleshoot:

  • docker ps -a
  • docker logs <container_id>

Robomaker logs below

Listening for VNC connections on TCP port 5900
Listening for VNC connections on TCP6 port 5900
listen6: bind: Address already in use
Not listening on IPv6 interface.

ERROR:[NodeMonitor]: Rosnode threw exception. Master node could be dead.

If running Ubuntu desktop, I’ve run into issues with IPv6 being re-enabled after updates, which makes docker swarm fail. Running the following should disable IPv6 and resolve the issue:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1

Issue — Minio kept making new containers every 10 seconds

Issue — Minio containers kept exiting within 7 seconds

  • docker ps -a
  • docker service rm s3_minio
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
  • docker ps
  • ls -l data
  • ls -l
  • Issue was I ran the init script as root
  • Fix -> chown -R dbro:dbro .
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
  • docker ps -a showed there were now 2 minio containers running
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 down
  • docker stack rm s3
  • dr-reload
  • docker ps
  • dr-upload-custom-files
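The root cause above (running the init script as root) leaves files owned by root, which trips up minio. A quick ownership check can confirm this before reaching for chown; the demo path below is an example stand-in for your DRFC data directory.

```shell
# Quick check for the root-owned-files problem: anything under the data
# directory not owned by you is a red flag (path here is an example).
mkdir -p /tmp/drfc-demo/data && touch /tmp/drfc-demo/data/checkpoint
find /tmp/drfc-demo/data ! -user "$(id -un)" -print   # prints nothing when ownership is fine
# fix on a real checkout: sudo chown -R "$(id -un):$(id -gn)" .
```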

Troubleshoot Issue: GPU_0_bfc ran out of memory

Solution

Possibly training is landing on the wrong GPU.

Run "nvidia-smi" to check which GPU number to set, then set DR_SAGEMAKER_CUDA_DEVICES in your system.env (around line 30).
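For example, if nvidia-smi shows the M40 as GPU 1, pinning sagemaker to it is a one-line change in system.env (the device index below is machine-specific):

```shell
# system.env (excerpt): pin sagemaker's tensorflow to one GPU by index,
# matching the numbering nvidia-smi reports.
cat > /tmp/system.env.cuda <<'EOF'
DR_SAGEMAKER_CUDA_DEVICES=1
EOF
sh -c '. /tmp/system.env.cuda && echo "sagemaker GPU: $DR_SAGEMAKER_CUDA_DEVICES"'
```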

Troubleshoot Issue: rl-coach fails

If rl-coach fails, check nvidia-smi.

General Notes

  • The m40 runs sagemaker docker
  • System ram runs robomaker
  • You can offload some of the robomaker to gpu by using the opengl
  • Basically the model lives inside the GPU memory
  • training checkpoints are in -> cd data/minio/bucket

I wouldn’t go any higher than the worker count where "htop" shows you’re at 80% on all threads.

Additional Scripts

Create script "lars_one.sh"

if [[ "${ARCH}" == "gpu" ]];
then
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
    cat /etc/docker/daemon.json | jq 'del(."default-runtime") + {"default-runtime": "nvidia"}' | sudo tee /etc/docker/daemon.json
fi

Miscellaneous

Sensors

  • “FRONT_FACING_CAMERA”
  • “SECTOR_LIDAR”
  • “LIDAR”
  • “STEREO_CAMERAS”

Check temperature commands

  1. nvidia-smi
  2. nvidia-smi -l 60
  3. watch -n900 nvidia-smi (Every 15 minutes auto calls)
  4. sensors

Set fan speed commands

  • sudo liquidctl set fan1 speed 30
  • sudo liquidctl set fan1 speed 0

Check specs / stats commands

  • nvidia-smi -L
  • GeForce GTX 1650 -> nvidia-smi -a -i 0
  • M40 Specs -> nvidia-smi -a -i 1
  • lspci -k | grep -EA3 'VGA|3D|Display'
  • top (checks processors to help see worker limits)
  • free -m
  • htop
  • docker stats
  • docker run --rm --gpus all nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi

Installation commands

  • sudo snap install jupyter
  • sudo apt install git
  • sudo apt install nvidia-cuda-toolkit
  • sudo apt install curl
  • sudo apt install jq
  • sudo pip install liquidctl (to install fan controller globally)
  • sudo apt install net-tools
  • sudo apt install vim
  • sudo apt-get install htop
  • sudo apt install hddtemp
  • sudo apt install lm-sensors
  • pip install --user pipenv
  • sudo apt install pipenv
  • pipenv install jupyterlab

Installing Docker

  1. sudo su (run from root)
  2. curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  3. sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  4. sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io
  5. sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
  6. sudo apt-get upgrade

Steps for Cuda upgrade

First removed existing:

  1. sudo dpkg -P $(dpkg -l | grep nvidia-driver | awk '{print $2}')
  2. sudo apt autoremove

then added new:

then rebooted and ran nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
+-----------------------------------------------------------------------------+

I plan to keep this guide edited / updated should new tips or solutions come my way.

Thank you!
