AWS DeepRacer Local Training DRFC

Training Locally on DRFC -> Troubleshooting

Contents

  1. Handy monthly items
  2. General Training Starting Steps
  3. Virtual DRFC Upload
  4. Physical DRFC Upload
  5. Container Update Links
  6. Open GL Robomaker
  7. New Sagemaker -> M40 Tagging
  8. Log Analysis
  9. Run Second DRFC Instance
  10. Steps for fresh DRFC
  11. Troubleshooting DRFC
  12. Miscellaneous

Handy monthly items

Latest Robomaker Container (For Training)
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated

General Training Starting Steps

These are commands I run if starting from a reboot

  1. source bin/activate.sh
  2. sudo liquidctl set fan1 speed 30
    (This is my own fan setting)
  3. dr-increment-training -f
  4. dr-update OR dr-update-env (I tend to favour -env)
  5. dr-start-training OR dr-start-training -w
  6. dr-start-viewer OR dr-update-viewer
  7. http://127.0.0.1:8100 OR http://localhost:8100
  8. dr-logs-robomaker (dr-logs-robomaker -n2 for worker 2, etc.)
  9. dr-logs-sagemaker
  10. nvidia-smi (check temperatures)
  11. htop to check threads and memory usage
    (Try to maximise worker count, but keep CPU/memory usage below ~75%)
  12. dr-start-evaluation -c and dr-stop-evaluation
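The <75% note in step 11 can be turned into a quick rule of thumb. `suggest_workers` below is a hypothetical helper (not a DRFC command) that takes the machine's thread count and suggests roughly 75% of it as a worker ceiling:

```shell
# Rough heuristic for picking a worker count from the "<75%" rule above.
# Takes a thread count and returns ~75% of it, with a floor of 1 worker.
suggest_workers() {
  local threads=$1
  local n=$(( threads * 3 / 4 ))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

# Typical use on the training box:
#   suggest_workers "$(nproc)"
```

Treat the result as a starting point and confirm with htop that you actually stay under the ceiling while training.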

Virtual DRFC Upload

  1. aws configure
  2. dr-upload-model -b -f
  3. Uploads best checkpoint to s3

Physical DRFC Upload

  1. dr-upload-car-zip -f
  2. Sagemaker must be running for this to work
  3. Only uses last checkpoint, not best

Open GL Robomaker

https://aws-deepracer-community.github.io/deepracer-for-cloud/opengl.html

  1. example -> docker pull awsdeepracercommunity/deepracer-robomaker:4.0.12-gpu-gl
  2. system.env: (Below bullet points)
  • DR_HOST_X=True; uses the local X server rather than starting one within the docker container.
  • DR_ROBOMAKER_IMAGE; choose the tag for an OpenGL-enabled image, e.g. cpu-gl-avx for an image where Tensorflow will use the CPU, or gpu-gl for an image where Tensorflow will also use the GPU.
  • Run echo $DISPLAY and see what it returns; it should be :0 but might be :1
  • Set the DR_DISPLAY value in system.env to match the echo output
  • dr-reload
  3. source utils/start-xorg.sh
  4. You should see the Xorg processes in nvidia-smi once the start-xorg.sh script is running
  5. To stop: sudo pkill x11vnc
  6. sudo pkill Xorg
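Pulled together, the system.env changes above look roughly like this (a sketch; the tag and display values are examples, so match them to your own image and your echo $DISPLAY output):

```shell
# system.env excerpt for OpenGL Robomaker (example values)
DR_HOST_X=True                    # use the local X server, not one in the container
DR_DISPLAY=:0                     # must match `echo $DISPLAY` (might be :1)
DR_ROBOMAKER_IMAGE=4.0.12-gpu-gl  # an OpenGL-enabled tag
```

Run dr-reload after editing.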

New Sagemaker -> M40 Tagging

run -> docker tag 2b4e84b8c10a awsdeepracercommunity/deepracer-sagemaker:gpu-m40

Log Analysis

  1. run -> dr-start-loganalysis
  2. Only change needed is for model_logs_root
    e.g. 'minio/bucket/model-name/0'
  3. Tracks
    https://github.com/aws-deepracer-community/deepracer-simapp/tree/master/bundle/deepracer_simulation_environment/share/deepracer_simulation_environment/routes
  4. Might have to upload the new track to tracks folder
  5. Repo for all racer data
    https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/leaderboards

Run Second DRFC Instance

  1. Create 2 different run.env or use 2 folders
  2. The DR_RUN_ID keeps things separate
  3. Only 1 minio should be running
  4. Use a unique model name
  5. Run source bin/activate.sh run-1.env to activate a separate environment
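A sketch of what the second env file might contain, assuming the standard DRFC variable names DR_RUN_ID and DR_LOCAL_S3_MODEL_PREFIX (check your own run.env for the exact names in your version):

```shell
# run-1.env -- activate with: source bin/activate.sh run-1.env
DR_RUN_ID=1                          # keeps this instance's containers separate
DR_LOCAL_S3_MODEL_PREFIX=my-model-b  # unique model name (my-model-b is an example)
```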

Steps for fresh DRFC

  1. ./bin/prepare.sh && sudo reboot
  2. docker start
  3. ARCH=gpu
  4. Run LARS script -> source bin/lars_one.sh
  5. docker swarm init (if there are issues, grab your IP with ifconfig -a and run step 8; see the example near the bottom)
  6. ifconfig -a
  7. docker swarm init
  8. docker swarm init --advertise-addr 000.000.0.000
  9. sudo ./bin/init.sh -a gpu -c local
  10. docker images
  11. docker tag xxxxxxx awsdeepracercommunity/deepracer-sagemaker:gpu-m40
  12. source bin/activate.sh
  13. vim run.env
  14. vim system.env
  15. dr-update
  16. aws configure --profile minio
  17. aws configure
    (use real AWS IAM credentials here to allow uploading models)
  18. dr-reload
  19. docker ps -a
  20. Setup multiple GPU
  21. cd custom-files
  22. vim on the 3 files (hyperparameters.json, model_metadata.json, reward_function.py)
  23. dr-upload-custom-files

Troubleshooting DRFC

  • docker ps -a
  • docker service ls
  • sudo service docker status
  • sudo service --status-all
  • sudo systemctl status docker.service
  • sudo systemctl stop docker
  • sudo systemctl start docker
  • sudo systemctl enable docker
  • sudo systemctl restart docker
  • sudo service docker restart
  • snap list
  • sudo su THEN apt-get install docker.io
  • Re-run Installing Docker (From Lars)
  • cat /etc/docker/daemon.json
  • apt-cache policy docker-ce
  • sudo tail /var/log/syslog
  • sudo cat /var/log/syslog | grep dockerd | tail
  • sudo gedit /etc/docker/daemon.json
  • Make /etc/docker/daemon.json look like below:
  • sudo systemctl stop docker then sudo systemctl start docker
  • test with -> docker images
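For reference, after the NVIDIA runtime install, /etc/docker/daemon.json typically ends up looking something like this (a sketch of the usual nvidia-docker2 layout; your file may have extra keys):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```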
  1. Open the terminal window.
  2. Change to the root user.
  3. Type these commands to disable IPv6:
    sysctl -w net.ipv6.conf.all.disable_ipv6=1
    sysctl -w net.ipv6.conf.default.disable_ipv6=1
    sysctl -w net.ipv6.conf.tun0.disable_ipv6=1
  4. To re-enable IPv6, type these commands:
    sysctl -w net.ipv6.conf.all.disable_ipv6=0
    sysctl -w net.ipv6.conf.default.disable_ipv6=0
    sysctl -w net.ipv6.conf.tun0.disable_ipv6=0
    sysctl -p

Other Fixes That Might Work for minio

run -> docker swarm init

  1. run -> docker swarm leave --force
  2. run -> source bin/lars_swarm_fix.sh
  1. docker swarm init --advertise-addr 2a00:23c8::d6c3:4a71:9adb:87ad
  1. ifconfig -a
  2. You don’t need to join the token
  3. dr-start-training
  • docker ps -a
  • docker service rm s3_minio
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
  • docker ps
  • ls -l data
  • ls -l
  • Issue was I ran the init script as root
  • Fix -> chown -R dbro:dbro .
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
  • docker ps -a showed there were now two minio containers running
  • docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 down
  • docker stack rm s3
  • dr-reload
  • docker ps
  • dr-upload-custom-files
  • The M40 runs the Sagemaker docker container
  • System RAM runs Robomaker
  • You can offload some of the Robomaker work to the GPU by using the OpenGL image
  • Basically, the model lives inside the GPU memory
  • Training checkpoints are in data/minio/bucket
if [[ "${ARCH}" == "gpu" ]];
then
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
	sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
cat /etc/docker/daemon.json | jq 'del(."default-runtime") + {"default-runtime": "nvidia"}' | sudo tee /etc/docker/daemon.json
fi

Miscellaneous

Sensors

  • "FRONT_FACING_CAMERA"
  • "SECTOR_LIDAR"
  • "LIDAR"
  • "STEREO_CAMERAS"
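These sensor names are set in model_metadata.json (one of the three custom files uploaded with dr-upload-custom-files). A sketch, assuming the usual DeepRacer model_metadata.json shape; the action-space values here are placeholders:

```json
{
    "sensor": ["FRONT_FACING_CAMERA"],
    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
    "action_space": [
        { "steering_angle": -30.0, "speed": 1.0, "index": 0 }
    ]
}
```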
  1. nvidia-smi
  2. nvidia-smi -l 60
  3. watch -n900 nvidia-smi (auto-refreshes every 15 minutes)
  4. sensors
  • sudo liquidctl set fan1 speed 30
  • sudo liquidctl set fan1 speed 0
  • nvidia-smi -L
  • GeForce GTX 1650 -> nvidia-smi -a -i 0
  • M40 Specs -> nvidia-smi -a -i 1
  • lspci -k | grep -EA3 'VGA|3D|Display'
  • top (checks processors to help see worker limits)
  • free -m
  • htop
  • docker stats
  • docker run --rm --gpus all nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi
  • sudo snap install jupyter
  • sudo apt install git
  • sudo apt install nvidia-cuda-toolkit
  • sudo apt install curl
  • sudo apt install jq
  • sudo pip install liquidctl (to install fan controller globally)
  • sudo apt install net-tools
  • sudo apt install vim
  • sudo apt-get install htop
  • sudo apt install hddtemp
  • sudo apt install lm-sensors
  • pip install --user pipenv
  • sudo apt install pipenv
  • pipenv install jupyterlab
  1. sudo su (run from root)
  2. curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  3. sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  4. sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io
  5. sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
  6. sudo apt-get upgrade
  1. sudo dpkg -P $(dpkg -l | grep nvidia-driver | awk '{print $2}')
  2. sudo apt autoremove
Expected nvidia-smi header after install: NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.6

Developer | ML Tinkerer | Runner | Rugby | AWS DeepRacer
