Bachelor Thesis: Investigating Monocular Depth Estimation

Investigation of CNN, Vision Transformer, and Foundation model backbones for monocular depth estimation using PyTorch.

This bachelor thesis compares CNN, Vision Transformer, and DINOv2-based architectures for monocular depth estimation. The results demonstrate the importance of pretrained foundation models for improved depth accuracy and efficiency. The project was implemented in Python using PyTorch.

Highlights

Deep Learning & Computer Vision – Designed and trained multiple monocular depth estimation models with CNN, Vision Transformer (Swin), and DINOv2 foundation-model backbones.
Model Implementation in PyTorch – Implemented, trained, and evaluated models end-to-end in Python with PyTorch, TorchVision, and Transformers, using Weights & Biases for experiment tracking.
Transfer Learning & Model Evaluation – Demonstrated the impact of pretrained weights on depth estimation accuracy through quantitative (RMSE, AbsRel) and qualitative evaluation.
Data Engineering & Experimentation – Processed and augmented NYU Depth V2 and COCO datasets; implemented pseudo-labeling using DepthAnythingV2 to extend training data.
Research & Reproducibility – Conducted systematic ablation studies (encoder depth, skip connections, loss functions) and published reproducible code and results on GitHub.
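
For readers unfamiliar with the two quantitative metrics named above, the following is a minimal sketch of how RMSE and AbsRel are commonly computed for depth maps. It is not taken from the thesis code: the function name `depth_metrics`, the `min_depth` threshold, and the convention of skipping pixels without valid ground truth are assumptions made for illustration. It uses plain Python lists so it runs without dependencies; a real pipeline would more likely apply the same formulas to PyTorch tensors.

```python
import math


def depth_metrics(pred, target, min_depth=1e-3):
    """Return (rmse, abs_rel) over pixels that have valid ground truth.

    pred, target: flat sequences of depth values in the same unit.
    min_depth: hypothetical threshold; targets at or below it are
    treated as missing and excluded (an assumed masking convention).
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # RMSE: square root of the mean squared error.
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    # AbsRel: mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    return rmse, abs_rel


# Toy values, not thesis results; 0.0 marks a missing depth reading.
pred = [1.0, 2.5, 4.0, 3.0]
target = [1.0, 2.0, 5.0, 0.0]
rmse, abs_rel = depth_metrics(pred, target)
print(f"RMSE={rmse:.4f}  AbsRel={abs_rel:.4f}")
```

Lower is better for both metrics. AbsRel normalizes each error by the true depth, so it weights nearby and distant pixels more evenly than RMSE, which is dominated by large absolute errors.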