







Resumen:
Los vehículos aéreos no tripulados (UAV) dependen de la percepción de profundidad para la navegación autónoma y la evasión
de obstáculos. Sin embargo, los modelos entrenados en simulación tienen di�icultades para generalizar debido a la brecha entre
imágenes de profundidad sintéticas y reales, causada por diferencias en el ruido del sensor, la variabilidad del entorno y las
texturas de los objetos, lo que reduce su e�icacia en aplicaciones reales. Este estudio aborda la adaptación de dominio mediante
redes generativas adversariales (GAN) para transformar imágenes de profundidad simuladas en representaciones más realistas.
Se implementan dos enfoques: Pix2Pix, un modelo supervisado que requiere datos emparejados, y CycleGAN, un método no
supervisado que adapta imágenes sin correspondencias directas. Para una evaluación rigurosa, se construye un conjunto de
datos alineado con imágenes sintéticas y reales.Los resultados muestran que Pix2Pix supera a CycleGAN en la replicación de
características de profundidad del mundo real al minimizar errores de intensidad, mientras que CycleGAN, aunque conserva la
geometría, tiene di�icultades para modelar el ruido del sensor. La adaptación adversarial reduce signi�icativamente la brecha
simulación-realidad, mejorando la precisión de la imagen de profundidad para la percepción de UAV. Para validar su
aplicabilidad, las imágenes adaptadas se integran en el Sistema Operativo de Robots (ROS), permitiendo la percepción en tiempo
real. Los hallazgos demuestran que la adaptación de dominio basada en GAN mejora la visión robótica basada en profundidad,
facilitando una navegación más �iable de los UAV en entornos complejos.
Palabras claves: Adaptación de dominio, Imágenes de profundidad, Redes Generativas Adversariales, Brecha de simulación a
realidad.
Abstract:
Unmanned Aerial Vehicles (UAVs) rely on depth perception for autonomous navigation and obstacle avoidance. However, models
trained in simulation often struggle to generalize due to the domain gap between synthetic and real depth images. This gap results
from differences in sensor noise, environmental variability, and object textures, reducing the effectiveness of simulation-trained
models in real-world applications. This paper explores domain adaptation using Generative Adversarial Networks (GANs) to
transform simulated depth images into more realistic counterparts. Two approaches are implemented: Pix2Pix, a supervised model
requiring paired datasets, and CycleGAN, an unsupervised method that adapts images without paired samples. A dataset of aligned
synthetic and real-world depth images is constructed to enable robust evaluation. Results show that Pix2Pix outperforms CycleGAN
in replicating real-world depth characteristics by minimizing depth intensity errors, while CycleGAN, despite preserving object
geometry, struggles to model sensor noise. The adversarial adaptation method signi�icantly reduces the simulation-to-reality gap,
improving depth image accuracy for UAV perception.To validate real-world applicability, the adapted depth images are integrated
into the Robot Operating System (ROS), enabling real-time UAV perception. The �indings demonstrate that GAN-based domain
adaptation enhances depth-based robotic vision, facilitating more reliable UAV navigation in complex environments.
Keywords: Domain Adaptation, Depth Images, Generative Adversarial Networks, Simulation-to-Reality Gap.













Unmanned Aerial Vehicles (UAVs) have emerged as a crucial technology in robotics,
enabling applications in exploration, indoor navigation, object transportation, and
environmental mapping (Al Radi et al., 2024). Their ability to operate in three-dimensional
(3D) environments makes them particularly well-suited for tasks that require autonomous
perception and navigation in unstructured scenarios. However, robust UAV navigation in
real-world environments remains challenging due to sensor limitations, perception
inaccuracies, and the need for reliable adaptation from simulated training environments to
real-world deployment (Xu et al., 2024).
Deep learning techniques have signi�icantly advanced robotic perception, facilitating
improved object detection, scene understanding, and autonomous navigation (Le et al.,
2024; Wu et al., 2019). Many of these methods rely on training in simulated environments,
where large datasets can be generated ef�iciently and safely. However, a persistent issue in
transferring learned models to real-world applications is the domain gap between synthetic
and real data. This gap arises from differences in sensor noise, lighting conditions, object
textures, and environmental variability, leading to poor generalization of simulation-trained
models when deployed in reality (Sadeghi and Levine, 2017; Sampedro et al., 2018).
Addressing this issue is essential to enable the deployment of robust robotic systems
capable of operating reliably across different environments. This study hypothesises that
GAN-based domain adaptation methods, particularly Pix2Pix, can signi�icantly reduce
perceptual errors in depth images, thereby improving UAV navigation in real environments.
The domain adaptation problem has been widely studied in the context of robotic
vision and deep learning. UAV navigation systems heavily rely on perception mechanisms
that integrate depth images to understand spatial constraints and detect obstacles (Jing
Chen et al., 2016; Sikang Liu et al., 2016). While various works have explored domain
adaptation for image classi�ication and semantic segmentation, the problem of adapting
depth images remains underexplored. Prior research has attempted to mitigate the reality
gap through physics-based simulations, domain randomization, and adversarial learning
techniques (Westerski and Teck, 2023). Among these, Generative Adversarial Networks
(GANs) have demonstrated promising capabilities in translating images from one domain to
another while preserving structural consistency (Goodfellow et al., 2014). Early works on
domain adaptation applied GANs to generate realistic textures from synthetic images,
yielding improvements in object detection and scene reconstruction (Xu et al., 2023).
However, the adaptation of depth images, particularly for UAV perception, has not been fully
addressed.
Depth cameras play a fundamental role in UAV perception, providing crucial spatial
awareness for obstacle avoidance and trajectory planning. However, simulated depth
images often fail to capture the full complexity of real-world depth sensors, including noise
artifacts and non-uniform depth distribution (Tzeng et al., 2020). These discrepancies
contribute to the sim-to-real gap, leading to degraded model performance when
transitioning from simulation to real-world applications. Bridging this gap is crucial for
ensuring the reliable deployment of UAVs in dynamic and unknown environments (James et
al., 2019).
One promising approach to mitigating this challenge is domain adaptation, which
enables models trained on synthetic data to generalize more effectively to real-world
conditions. In this work, we investigate domain adaptation techniques based on Generative

Páginas:





Adversarial Networks (GANs) to transform simulated depth images into their real-world
counterparts (James and Johns, 2016). The goal is to reduce the perceptual discrepancy
between synthetic and real depth images, thereby improving the accuracy and robustness of
depth-based perception models for UAV navigation.
This paper presents a comparative study of two domain adaptation approaches for
depth image transformation: Pix2Pix, a supervised method requiring paired depth images
for training (Isola et al., 2017), and CycleGAN, which enables adaptation without the need
for paired datasets(Zhu et al., 2017). To ensure accurate ground truth for training and
evaluation, a dataset of aligned synthetic and real-world depth images is constructed. A key
aspect of this study is the quantitative and qualitative assessment of the adapted depth
images, focusing on how effectively they replicate real-world depth characteristics. Results
indicate that the adversarial-based Pix2Pix model signi�icantly reduces the adaptation error
compared to the reconstruction-based CycleGAN approach, which, while preserving object
geometry, struggles to replicate the noise characteristics inherent to real depth sensors.
Furthermore, to demonstrate the real-world applicability of these adaptation
techniques, the generated depth images are integrated into the Robot Operating System
(ROS), allowing real-time perception for UAVs (Quigley et al., 2019). The performance of the
adapted images is analysed through error distribution comparisons, revealing that
adversarial-based adaptation provides a notable improvement in depth perception
accuracy. By improving the �idelity of depth perception, the proposed methods directly
contribute to more reliable obstacle detection and trajectory planning, which are critical for
safe and autonomous UAV navigation in unstructured environments. This ultimately leads
to a measurable reduction in perception-related navigation errors during real-world
deployment.

Páginas:

Páginas:

This study follows a structured methodology to ensure a controlled and
reproducible evaluation of domain adaptation techniques for depth image re�inement. The
approach involves constructing a high-�idelity simulation environment, carefully modelling
real-world objects, and integrating real-time pose synchronisation to maintain consistency
between simulated and physical setups. By implementing precise depth camera calibration
and leveraging adversarial learning techniques, the study provides a robust framework for
improving UAV perception. The evaluation focuses on the effectiveness of Pix2Pix and
CycleGAN, two generative models with distinct adaptation strategies, in transforming
simulated depth images into realistic counterparts. The methodological design allows for a
direct comparison of their strengths and limitations, offering valuable insights into the
trade-offs between paired and unpaired domain adaptation and their implications for
real-world robotic applications.
Virtual Environment Construction and Depth Image Generation
To facilitate domain adaptation for depth image re�inement, a controlled virtual
environment was developed to closely replicate real-world conditions. The simulation setup
was designed to ensure that the generated depth images remained structurally consistent
with those captured in real-world experiments. This required precise 3D modelling, camera
calibration, and integration into a physics-based simulation environment, as illustrated in
Figure 1.
Three key objects were selected for depth perception evaluation: a wooden cube, a
cylindrical obstacle, and a ramp-like structure. These objects introduce diverse spatial
patterns that enhance model generalisation, particularly for UAV navigation tasks. The
wooden cube, measuring 60 cm, functions as a UAV takeoff and landing station. The
cylindrical obstacle, standing 2 metres tall with a 30.5 cm diameter, mimics structural
elements commonly found in urban and industrial environments. The ramp, composed of
segmented components, introduces inclined surfaces that simulate uneven terrains. These
objects were precisely modelled in Blender using Boolean operations where necessary to
create hollow structures and merge components (Figure 1a). To enhance realism, UV
mapping was used to apply textures that closely match real-world materials, ensuring that
the visual and depth properties remained consistent across domains. The wooden cube and
ramp were assigned wood-textured surfaces, while the cylindrical obstacle was
colour-matched to its real-world counterpart. The �inal models were exported in .dae
format for seamless integration into the simulation.




Páginas:



    

The virtual testing environment, referred to as the Robotics Arena, was developed
within Gazebo, a widely used physics-based simulation platform (Figure 1c). The spatial
con�iguration of this environment was carefully aligned with the real-world experimental
setup, ensuring one-to-one correspondence between the placement of physical and virtual
objects. Structural elements such as safety nets and walls were incorporated to maintain
environmental consistency between the real and simulated setups.
A critical component of the simulation environment was the integration of a depth
sensor to generate synthetic depth images that accurately re�lect real-world depth
perception. The Intel RealSense D435 was selected due to its compact design and maximum
sensing range of 10 metres, making it ideal for UAV applications. Since Gazebo only provides
a default RealSense R200 model, a customised RealSense D435 sensor was implemented to
ensure an accurate simulation of its optical properties. A Gazebo plugin was developed to
replicate the real camera’s intrinsic parameters, including focal length, baseline distance,
and depth sensing characteristics (Figure 1b).
To maintain consistency between real and simulated depth images, the horizontal
�ield of view of the virtual camera was calculated based on the intrinsic parameters of the
RealSense D435. This calibration process ensured that the depth images captured in the
simulation accurately re�lected those obtained in real-world experiments. By integrating a
properly calibrated depth sensor, the study enabled direct comparisons between generated
and real-world depth images, forming the foundation for effective domain adaptation.
Pose Synchronisation and Depth Image Acquisition for Domain Adaptation
The alignment of real and simulated objects was critical for ensuring consistency in
depth image acquisition for domain adaptation. The OptiTrack motion capture system was
employed to track the real-world positions of objects using retrore�lective markers,
enabling precise pose synchronisation between physical and virtual environments. The
Motive software processed these marker-based detections, transmitting rigid body pose
data to the Robot Operating System (ROS) through the vrpn_client_ros package. This
ensured continuous real-time synchronisation, allowing objects in simulation to be
accurately positioned to match their real-world counterparts.




Páginas:
A ROS-based service was implemented to automatically spawn 3D models in Gazebo
using the real-world pose data. By subscribing to OptiTrack updates, this service ensured
that the placement of objects in the virtual environment remained consistent with the
physical setup. This synchronisation was essential for acquiring paired depth images, which
formed the foundation for supervised domain adaptation with Pix2Pix. Without this level of
environmental �idelity, inconsistencies between simulated and real-world conditions would
limit the adaptation capability of the trained models.
The process of depth image acquisition was tailored to support both paired and
unpaired domain adaptation techniques. A dataset of depth images was collected, ensuring
that each simulated image had a directly corresponding real-world counterpart. This was
achieved by continuously tracking the Intel RealSense D435 camera’s pose using OptiTrack
and replicating its position in simulation through the /gazebo/set_model_state service.
Depth images were synchronised and stored using a ROS bag �ile, capturing a range of
experimental conditions.
The dataset was structured to support the different requirements of the domain
adaptation models. Pix2Pix, which relies on supervised learning, was trained using paired
depth images to learn direct pixel-wise transformations between simulated and real depth
domains. In contrast, CycleGAN was trained on unpaired depth images, learning to perform
synthetic-to-real depth translation without explicit one-to-one correspondences. This
distinction in training strategies provided a comparative framework for evaluating the
advantages and limitations of paired versus unpaired adaptation methods.


             
¡
The dataset was structured to support the different requirements of the domain
adaptation models. Pix2Pix, which relies on supervised learning, was trained using paired
depth images to learn direct pixel-wise transformations between simulated and real depth
domains. In contrast, CycleGAN was trained on unpaired depth images, learning to perform
synthetic-to-real depth translation without explicit one-to-one correspondences. This
distinction in training strategies provided a comparative framework for evaluating the
advantages and limitations of paired versus unpaired adaptation methods.



Páginas:


The alignment process, depth acquisition setup, and structured dataset are
illustrated in Figure 2a shows the OptiTrack optical tracking system used for real-time pose
synchronisation. Figure 2b presents the paired real and simulated environments,
demonstrating the accuracy of object placement between domains. Figure 2c displays a
sample of the paired depth image dataset, highlighting the correspondence between
simulated and real-world depth images, which was essential for evaluating the performance
of the adaptation models.
Generative Models for Depth Image Adaptation
To bridge the simulation-to-reality gap, this study implemented two domain
adaptation techniques: Pix2Pix and CycleGAN, both of which leverage Generative
Adversarial Networks (GANs). These models were trained to re�ine synthetic depth images,
making them more representative of real-world sensor outputs. They were selected due to
their effectiveness in image-to-image translation tasks and their contrasting training
paradigms: one supervised and the other unsupervised. Adversarial domain adaptation
methods such as CoGAN (Liu and Tuzel, 2016), SimGAN (Shrivastava et al., 2017), and
PixelDA (Bousmalis et al., 2017) have introduced strategies like coupled discriminators,
self-regularisation, and noise conditioning to enhance the realism of generated images.
However, these models often require more complex architectural con�igurations and may
struggle to preserve structural features that are critical for depth-based robotic perception.
Pix2Pix and CycleGAN were therefore selected to provide a balanced evaluation of paired
versus unpaired training regimes while maintaining architectural simplicity and structural
�idelity.
Pix2Pix is a supervised GAN-based model that learns a direct mapping from
synthetic to real depth images using paired training data. The model architecture consists of
a U-Net-based generator, which transforms simulated depth images into their real-world
counterparts, and a PatchGAN discriminator, which evaluates the realism of the generated
images. The training process optimises adversarial loss, encouraging the generator to
produce more realistic images, and L1 loss, ensuring that translated images remain
structurally consistent with ground truth. This combination allows Pix2Pix to retain �ine
structural details while improving the realism of synthetic images.
In contrast, CycleGAN is an unsupervised GAN-based model that learns bidirectional
mappings between synthetic and real domains without requiring paired training data.
Instead of relying on direct pixel-wise correspondences, CycleGAN introduces cycle
consistency loss, ensuring that an image translated to the target domain and then mapped
back retains its original structure. The model consists of two generators and two
discriminators, learning transformations between synthetic and real domains in both
directions. This �lexibility allows CycleGAN to adapt synthetic images to real-world
characteristics even when paired datasets are unavailable. However, due to the lack of direct
supervision, CycleGAN may introduce structural inconsistencies and fail to replicate certain
depth features.
While both models are based on adversarial learning and aim to reduce the
perceptual gap between synthetic and real domains, they differ signi�icantly in their
underlying mechanisms and training requirements. Pix2Pix relies on supervised learning


Páginas:
¢
with paired data and uses a combination of adversarial and pixel-wise losses to ensure both
realism and structural �idelity. In contrast, CycleGAN adopts an unsupervised approach,
introducing cycle-consistency loss to preserve content in the absence of paired samples.
This distinction not only affects training strategies but also in�luences the models' ability to
replicate sensor-speci�ic noise and geometric accuracy. Theoretical correlation between the
two lies in their shared generative framework, yet their architectural con�igurations and
learning objectives re�lect different trade-offs between data alignment, realism, and
generalisation.
After training, both models were integrated into ROS, enabling real-time
transformation of depth images. The adapted depth images were published as ROS topics
and visualised using rqt_image_view, allowing UAVs operating in simulation to process
depth images that closely resemble real-world sensor outputs.
Experimental Setup and Model Evaluation
A structured experimental framework was designed to evaluate the performance of
Pix2Pix and CycleGAN. The evaluation focused on depth accuracy, noise replication, and
geometric structure preservation. The models’ ability to reduce discrepancies between
simulated and real-world depth images was assessed through quantitative error analysis
and qualitative structural comparisons.
A dataset of 2378 depth images was used to evaluate Pix2Pix, while CycleGAN was
tested on 603 images, with the difference in dataset size attributed to the higher
computational cost of CycleGAN inference. Each dataset consisted of simulated, generated,
and real depth images, with depth pro�iles extracted along the centre row of each image
(640 pixels). The absolute depth errors were computed to quantify the improvements
achieved by each model.
For statistical validation, a Wilcoxon signed-rank test was conducted to determine
whether the observed error reductions were statistically signi�icant. Additionally, Cliff’s
Delta was used to assess the practical signi�icance of these reductions. The evaluation
process combined histogram and boxplot visualisations with qualitative depth image
assessments, providing a comprehensive analysis of each model’s effectiveness in depth
adaptation.



Páginas:



This section evaluates Pix2Pix and CycleGAN in re�ining simulated depth images by
analyzing error reduction and visual realism. Error distributions are quanti�ied using
histograms, boxplots, and statistical tests, including the Wilcoxon signed-rank test and
Cliff’s Delta. Representative images—corresponding to minimum, median, and maximum
MedAE—are assessed through radar charts and depth comparisons for structural accuracy.
A �inal comparison highlights Pix2Pix’s advantage in paired supervision versus CycleGAN’s
reliance on unpaired training, determining which method better re�ines depth maps for
applications in robotics and autonomous systems.
Performance of the Pix2Pix Model
The results of this study are structured into two primary analyses: the evaluation of
error distributions across the dataset and the qualitative assessment of depth image
reconstruction. The �irst analysis examines the extent to which the generative model
reduces the discrepancies between simulated and real depth values, while the second
analysis focuses on the visual realism and structural integrity of the generated depth
images. To quantify the performance of the Pix2Pix model, we employ the Median Absolute
Error (MedAE) as the primary metric, which provides a robust measure of error by
computing the median of the absolute differences between the predicted and real depth
values. MedAE is particularly useful in this context as it mitigates the in�luence of outliers,
ensuring a more stable evaluation of the generative model’s performance.
The boxplots of error distributions, presented in Figure 3d, illustrate the variations
in absolute error across the three selected images corresponding to the minimum, median,
and maximum MedAE cases. A signi�icant reduction in error magnitude is evident when
comparing the Real-Gen errors with the Sim-Real errors, indicating that the Pix2Pix model
effectively re�ines depth information and brings the generated outputs closer to real-world
depth representations.
Across the analysed dataset, the Real-Gen errors remain consistently lower than the
Sim-Real errors, con�irming that the generative model signi�icantly reduces the gap
between the original simulated depth maps and the corresponding real depth images. The
reduction in error is most pronounced in the images associated with the minimum and
median MedAE cases, where the generated depth maps closely approximate the real depth
values. In contrast, the maximum MedAE case demonstrates a less effective correction, with
larger residual errors remaining even after generation.
A statistical summary of the absolute errors further supports these observations. In
the case of the minimum MedAE image, the Real-Gen error exhibits a median of 0.0392
meters, whereas the corresponding Sim-Real error has a median of 1.7255 meters. The
reduction is also evident in the median MedAE image, where the Real-Gen median error is
0.4314 meters, compared to 1.0588 meters for Sim-Real. In the case of the maximum MedAE
image, while the Pix2Pix model still reduces the error, the Real-Gen median error is 4.9804
meters, which remains substantially high. The interquartile range for the Real-Gen errors in
this case is notably larger, indicating a greater degree of variance and con�irming that the
generative model exhibits reduced effectiveness in highly complex scenes.
For the minimum MedAE case, the generated depth pro�iles closely match real data,
effectively correcting distortions, especially in smooth surfaces and well-de�ined
boundaries, making Pix2Pix effective in structured scenes. In the median MedAE case, while
depth estimates remain realistic, discrepancies appear at occlusion boundaries and



Páginas:

high-gradient regions, where the model struggles with depth ambiguities. In the maximum
MedAE case, deviations become pronounced, with over-smoothing in high-gradient
transitions, leading to a loss of �ine geometric details and highlighting the model’s
limitations in handling complex structural changes.
The qualitative assessment of generated depth images provides further insights into
the strengths and weaknesses of the Pix2Pix model. Figure 3e, �igure 3f and �igure 3g
present the simulated, generated, and real depth images corresponding to the minimum,
median, and maximum MedAE cases, respectively.
For the minimum MedAE case, the generated depth pro�iles closely match real data,
effectively correcting systematic distortions in smooth surfaces and well-de�ined object
boundaries, suggesting Pix2Pix performs well in structured scenes. In the median MedAE
case, while the generated depth pro�iles align with real data, discrepancies emerge at
occlusion boundaries and high-gradient regions, with errors increasing near depth
discontinuities, indicating challenges in resolving occlusions and textured surfaces. In the
maximum MedAE case, deviations become more pronounced, with noticeable distortions
and over-smoothing in high-gradient transitions, suggesting over-regularization that leads
to a loss of �ine geometric details, particularly in areas with depth discontinuities.



¡¡¡

    £ ¤      

This highlights a fundamental limitation of the Pix2Pix model in handling highly
complex depth variations, where the generative approach fails to fully reconstruct the �ine
geometric features of the scene.
Performance of the CycleGAN Model
The assessment of the CycleGAN model's performance is structured around two key
analyses: the evaluation of error distributions across the dataset and the qualitative
examination of generated depth images. The �irst analysis aims to quantify the extent to
which the CycleGAN model reduces discrepancies between simulated and real depth values.
The second analysis focuses on the visual realism and structural consistency of the



Páginas:






¡
               

Performance of the CycleGAN Model
The assessment of the CycleGAN model's performance is structured around two key
analyses: the evaluation of error distributions across the dataset and the qualitative
examination of generated depth images. The �irst analysis aims to quantify the extent to
which the CycleGAN model reduces discrepancies between simulated and real depth values.
The second analysis focuses on the visual realism and structural consistency of the
generated depth maps, providing insights into the model’s ability to replicate real-world
depth distributions.
The statistical evaluation of the model’s error distributions is illustrated in �igure 4
(d), where the boxplots of absolute errors corresponding to the minimum, median, and
maximum MedAE cases are presented. The analysis reveals that CycleGAN effectively
reduces error magnitudes, as indicated by the signi�icant decrease in Real-Gen errors when
compared to Sim-Real errors. This suggests that the model successfully re�ines the depth
distributions, aligning the generated outputs more closely with real-world depth
representations.
Despite the improvement in depth estimation, CycleGAN demonstrates varying
levels of effectiveness across different cases. For the image associated with the minimum
MedAE, the Real-Gen median error is 0.0392 meters, contrasting with a Sim-Real median
error of 0.5490 meters. This reduction con�irms that the model achieves a meaningful
re�inement in relatively simple scenarios. In the median MedAE case, the Real-Gen median
error is 0.7843 meters, while the corresponding Sim-Real median error is 0.6275 meters.
Unlike the Pix2Pix model, where a clear reduction in error is evident, CycleGAN does not
consistently outperform the simulated depth maps in all cases. In the maximum MedAE
case, the Real-Gen median error reaches 5.0196 meters, showing that while the model does
reduce error to some extent, it struggles signi�icantly in complex scenarios, often failing to
produce substantial re�inements in regions with intricate depth structures.
generated depth maps, providing insights into the model’s ability to replicate real-world
depth distributions.


Páginas:

The statistical summary of absolute errors con�irms that while CycleGAN can reduce
discrepancies between simulated and real depth distributions, it does not consistently
outperform the direct simulated-to-real comparison. Notably, the interquartile ranges for
Real-Gen errors in all cases remain relatively large, indicating considerable variance in
model performance. This suggests that while the generative process introduces some
corrections, the degree of re�inement is highly scene-dependent, particularly in regions
characterized by complex geometries and occlusions.
The radar charts presented in Figure 4a, Figure 4b, and Figure 4c provide an
in-depth comparative analysis of depth pro�iles extracted from the simulated, generated,
and real images. The results indicate that for cases with lower MedAE, the generated depth
maps exhibit a general alignment with real depth data, particularly in regions with gradual
depth transitions. However, in cases with higher MedAE, the generated depth maps deviate
signi�icantly from the real data, revealing the limitations of CycleGAN in handling abrupt
depth discontinuities.
For the minimum MedAE case, the generated depth pro�iles closely resemble real
depth, correcting distortions and accurately reconstructing smooth depth variations,
making CycleGAN effective in simple scenes. In the median MedAE case, artifacts appear at
occlusion boundaries and high-gradient transitions, as the model struggles with sharp
depth variations, leading to localized distortions. In the maximum MedAE case, deviations
are signi�icant, with over-smoothing in complex geometries, loss of �ine details, and
dif�iculty preserving intricate depth transitions. These �indings highlight a key limitation of
CycleGAN in its ability to generalize across highly complex scenes, where real-world depth
distributions exhibit signi�icant variance.
The qualitative assessment of the generated depth images provides additional
insights into the strengths and weaknesses of CycleGAN.
Figure 4e, Figure 4f, and Figure 4g present the simulated, generated, and real depth
images for the minimum, median, and maximum MedAE cases, respectively. A critical
observation from these results is that while the model does produce depth maps that are
statistically closer to real data, the visual quality of the generated images is noticeably
inferior compared to Pix2Pix.
For the minimum MedAE case, the generated depth image shows structural
consistency with the real depth map but appears blurred, re�ining edges while losing �ine
details. In the median MedAE case, artifacts become more prominent, especially in occluded
areas and high-gradient transitions, as CycleGAN struggles with noise characterization and
fails to preserve �iner geometric structures. In the maximum MedAE case, discrepancies are
most pronounced, with over-smoothing eliminating sharp depth transitions and failing to
reconstruct high-frequency variations, highlighting CycleGAN’s dif�iculty in replicating
complex depth distributions.This qualitative analysis reveals that CycleGAN, while effective
in reducing numerical errors, struggles to produce visually convincing depth
reconstructions. Unlike Pix2Pix, which maintains structural coherence in most cases,
CycleGAN introduces distortions that make the generated depth maps appear unrealistic.
The model appears to prioritize statistical alignment over perceptual accuracy, leading to
results that, while numerically valid, fail to capture the true characteristics of real-world
depth distributions.



Páginas:



Comparative Performance of Pix2Pix and CycleGAN
The comparative analysis between Pix2Pix and CycleGAN provides insight into the
relative strengths and limitations of each model. The error histograms shown in Figure 5a
and Figure 5b illustrate the normalized distributions of Real-Gen errors for both models.
Pix2Pix demonstrates a higher frequency of lower error values, suggesting that its
generated depth images are generally more accurate when compared to real depth data.
CycleGAN, on the other hand, exhibits a broader error distribution, with a noticeable shift
toward higher error magnitudes, indicating that its generated depth images retain greater
discrepancies from real-world measurements.


        ¡     ¡ 
 ¡        £   
¡¡
The statistical analysis presented in Table 1 further con�irms these observations. The
median Real-Gen error for Pix2Pix is 0.7143 meters, whereas CycleGAN has a higher median
Real-Gen error of 1.0509 meters. Similarly, while the Sim-Gen errors are comparable for
both models, the Sim-Real error for Pix2Pix is signi�icantly larger than that of CycleGAN.
This suggests that CycleGAN operates within a more constrained transformation space,
achieving only marginal improvements over the original simulated depth maps. In contrast,
Pix2Pix applies more substantial corrections, leading to a stronger reduction in the
Sim-Real gap.



Model
Median
Error (Real-
Gen) [m]
Median
Error (Sim-
Gen) [m]
Median
Error (Sim-
Real) [m]
Wilcoxon p-
value
Cliff's
Delta
Pix2Pix
0.7143
0.8945
1.1744
4.0893e-189
0.5172
CycleGAN
1.0509
1.0562
1.0706
0.00914
-0.0116


Páginas:

The statistical signi�icance of these differences is supported by the Wilcoxon
signed-rank test, which evaluates whether the observed reduction in error is consistent
across the dataset. Pix2Pix achieves an exceptionally low p-value (4.0893e-189), providing
very strong evidence that the reduction in Real-Gen error is statistically signi�icant. The
effect size, measured using Cliff’s Delta, is 0.5172, indicating a large practical effect and
con�irming that the improvement is not only statistically signi�icant but also meaningful in
real-world applications. In contrast, CycleGAN’s p-value is 0.00914, which, while still
statistically signi�icant, suggests a weaker improvement. The corresponding effect size of
-0.0116 is classi�ied as negligible, further reinforcing the conclusion that CycleGAN does not
meaningfully reduce the Sim-Real error.
Table 2 highlights the practical implications of these differences. Pix2Pix achieves an
average Real-Gen error reduction of 0.46 meters, demonstrating a strong improvement in
depth accuracy. CycleGAN, by contrast, achieves a signi�icantly lower improvement of only
0.20 meters. The practical impact of these �indings is clear: Pix2Pix is far more effective at
re�ining simulated depth maps and reducing the Sim-Real gap, while CycleGAN’s
improvements are relatively minor and statistically weak.

             


These �indings are consistent with prior research in domain adaptation and
sim-to-real transfer for robotic perception. For instance, Sadeghi and Levine (2017)
demonstrated the effectiveness of domain randomisation for UAV control, although their
approach lacked �ine structural preservation compared to adversarial learning models.
James et al. (2019) employed GAN-based adaptation for robotic grasping and reported
statistically signi�icant performance gains when using paired supervision. More recently,
Westerski and Fong (2023) surveyed state-of-the-art synthetic data generation techniques
and emphasised that methods which preserve semantic and spatial structure, such as those
based on photorealistic simulation or structured domain adaptation, tend to outperform
approaches relying solely on randomisation. The results of the present study reinforce this
trend, showing that Pix2Pix, which leverages paired data, offers superior performance in
re�ining depth images. This supports the idea that supervised adaptation strategies provide
measurable advantages when high-�idelity domain alignment is required for UAV
navigation.
Comparison
CycleGAN
Mean Real-Gen Error
Reduction
0.46m 0.20m
Wilcoxon p-value
0.0091 (weak evidence)
Effect Size (Cliff’s Delta)
-0.0116 (negligible effect)
Practical Improvement
Very weak or negligible
improvement



Páginas:



This study evaluated the performance of Pix2Pix and CycleGAN for domain
adaptation in depth image re�inement, focusing on their ability to bridge the
simulation-to-reality gap. The results demonstrate that both models reduce depth
estimation errors, but Pix2Pix consistently outperforms CycleGAN in terms of structural
accuracy, numerical error reduction, and overall realism. The comparison between these
two approaches highlights the trade-offs between paired and unpaired domain adaptation
methods, offering insights into their applicability for real-world UAV perception and
navigation.
A crucial aspect of this study was the meticulous replication of the real-world
environment in simulation, achieved through precise 3D modelling and real-time pose
synchronisation. Objects such as the wooden cube, cylindrical obstacle, and ramp structure
were carefully reconstructed in Blender and integrated into the Gazebo simulation platform,
ensuring that the geometric features encountered in simulation closely matched those in
the real world. This alignment was further reinforced by the OptiTrack motion capture
system, which enabled real-time tracking and pose synchronisation of physical objects with
their virtual counterparts. The ability to track object positions with millimetre accuracy and
replicate their exact placement in simulation was essential for acquiring paired depth
images, forming the foundation for supervised domain adaptation with Pix2Pix. Without
this level of environmental �idelity, the domain adaptation process would lack consistency,
reducing the effectiveness of the trained models.
The error analysis con�irms that Pix2Pix achieves a stronger reduction in Sim-Real
error compared to CycleGAN. The boxplot analysis demonstrates that Real-Gen errors are
substantially lower than Sim-Real errors across all three representative cases—minimum,
median, and maximum MedAE—con�irming that Pix2Pix successfully transforms simulated
depth images into realistic depth distributions. The model exhibits its strongest
performance in structured environments, where depth variations are gradual, and
occlusion boundaries are well-de�ined. However, in highly complex scenes, Pix2Pix
struggles with �ine-scale geometric details, occasionally over-regularising depth transitions
and introducing smoothing artefacts.
CycleGAN, in contrast, presents a different set of strengths and weaknesses. While
the model achieves a measurable reduction in numerical error, its ability to produce visually
coherent depth images is signi�icantly weaker than that of Pix2Pix. The unpaired nature of
CycleGAN training, which relies on cycle consistency loss rather than direct supervision,
leads to inconsistent structural corrections. As a result, the generated depth maps often
contain artefacts and distortions, failing to achieve the �ine-grained depth re�inement
required for high-precision applications. Notably, CycleGAN’s effect size is negligible,
indicating that its improvements over the simulated depth images are minor in practical
terms.
A key �inding of this study is that Pix2Pix bene�its signi�icantly from paired
supervision, allowing it to learn direct mappings between synthetic and real depth images.
The effectiveness of this approach is re�lected in the Wilcoxon signed-rank test results,
which show overwhelmingly strong statistical signi�icance (p-value = 4.0893e-189) and a
large effect size (Cliff’s Delta = 0.5172). This con�irms that the model’s re�inement process
is not only statistically signi�icant but also practically meaningful, making Pix2Pix a highly
suitable approach for depth adaptation in UAV perception tasks.



Páginas:

CycleGAN, despite demonstrating some level of numerical improvement, exhibits
limited effectiveness in depth re�inement. The statistical analysis supports this conclusion,
as its Wilcoxon p-value (0.00914) suggests only weak statistical signi�icance, and its effect
size (-0.0116) is classi�ied as negligible. These �indings suggest that CycleGAN struggles to
meaningfully close the Sim-Real gap, reinforcing the notion that unpaired domain
adaptation alone may not be suf�icient for high-precision depth transformations.
Beyond numerical error reduction, the qualitative assessment of depth images
further underscores the advantages of Pix2Pix. The radar charts reveal that
Pix2Pix-generated depth pro�iles closely follow real data, especially in scenes with low to
moderate complexity. In contrast, CycleGAN fails to reconstruct �ine-grained structures,
often producing depth maps that exhibit unnatural distortions and spatial inconsistencies.
These observations highlight the importance of paired supervision in generative depth
re�inement, as Pix2Pix consistently generates more accurate and visually coherent depth
maps than CycleGAN.
Despite Pix2Pix’s strong performance, challenges remain. The model occasionally
struggles with high-frequency depth variations, particularly in occlusion regions where
depth discontinuities are abrupt. One promising direction is to explore hybrid approaches
that integrate the structural consistency of CycleGAN with the supervised learning
advantages of Pix2Pix, potentially yielding a more balanced and robust depth re�inement
framework.
Another avenue would be to investigate the integration of alternative re�inement
strategies with Pix2Pix to address its possible limitations in handling high-frequency depth
variations and occlusion boundaries. Techniques such as spatially-adaptive normalisation
(Park et al., 2019) and multi-scale feature alignment (Xu et al., 2021) have shown promise in
preserving �ine structural details during image-to-image translation. Godard et al., (2019)
on monocular depth estimation demonstrated that incorporating edge-aware smoothness
and temporal consistency constraints can improve depth accuracy in dynamic scenes.
Combining such architectural enhancements or post-processing methods with Pix2Pix may
further improve its ability to generalise across complex geometries. Additionally,
attention-based mechanisms (Wang et al., 2024) offer a way to prioritise high-gradient
regions and could help mitigate over-smoothing in �ine-grained depth structures.



Páginas:



This study investigated the application of Pix2Pix and CycleGAN for domain
adaptation in depth image re�inement, assessing their ability to transform simulated depth
images into realistic counterparts. The �indings con�irm that Pix2Pix signi�icantly
outperforms CycleGAN, achieving greater reductions in depth estimation error and
generating more visually coherent depth maps. The paired supervision approach used in
Pix2Pix proves highly effective, enabling the model to learn direct mappings that optimise
depth accuracy.
A major strength of this study lies in the high-�idelity replication of the real-world
environment in simulation. The use of precisely modelled objects, OptiTrack motion capture
for real-time pose synchronisation, and Gazebo-based virtual scene construction ensured
that the training and evaluation process remained as consistent as possible across real and
simulated domains. This structured experimental framework facilitated the acquisition of
paired depth images, which were crucial for Pix2Pix’s successful adaptation. The study
highlights that maintaining a high level of environmental consistency is fundamental to
achieving reliable depth re�inement through domain adaptation.
The statistical analysis con�irms the superiority of Pix2Pix, with a highly signi�icant
p-value and a large effect size, indicating that the reduction in Sim-Real error is both
statistically and practically meaningful. CycleGAN, despite offering some numerical
improvements, exhibits weak structural accuracy and negligible practical impact. The
unpaired nature of its training process limits its ability to generate depth maps that
convincingly resemble real-world data.
These �indings underscore the importance of selecting the appropriate domain
adaptation strategy based on speci�ic application requirements. In tasks where
high-precision depth estimation is crucial, Pix2Pix emerges as the preferred model,
providing substantial reductions in error and strong structural consistency. However,
CycleGAN’s ability to learn without paired data may still be useful in scenarios where
labelled datasets are unavailable, though its effectiveness remains limited.
Beyond its immediate contributions to UAV perception and navigation, this research
has broader implications for robotic vision, computer graphics, and sensor simulation. The
ability to generate realistic depth images from synthetic environments can bene�it
applications in autonomous systems, mixed reality, and robotic training simulations.
Additionally, this tool holds potential for educational applications, particularly in settings
where students and researchers lack access to real-world depth cameras. By providing a
cost-effective, simulation-based approach to depth sensing, this framework can serve as a
valuable resource for teaching and research in robotics, computer vision, and AI-driven
perception.
Future research should explore hybrid approaches that combine the structural
consistency of CycleGAN with the depth accuracy of Pix2Pix, potentially leading to a more
�lexible and robust depth re�inement framework. Additionally, investigating alternative
architectures or post-processing techniques could further enhance depth consistency and
mitigate over-smoothing in complex scenes. Expanding this study to other depth sensing
technologies and simulation environments may further improve its generalisation and
practical impact.



Páginas:
¢
Ultimately, this study con�irms that adversarial domain adaptation can signi�icantly
bridge the simulation-to-reality gap in depth perception, enabling autonomous systems to
operate more effectively in real-world environments while also offering new opportunities
for education, research, and real-time robotic applications.
Acknowledgements
The author acknowledges the support of the Secretaría de Educación Superior,
Ciencia, Tecnología e Innovación (SENESCYT) of Ecuador.



Páginas:




Bousmalis K, Silberman N, Dohan D, Erhan D, Krishnan D. Unsupervised Pixel-Level
Domain Adaptation with Generative Adversarial Networks. 2017 IEEE Conf. Comput.
Vis. Pattern Recognit., vol. 2017- Janua, IEEE; 2017, p. 95–104.
Godard C, Aodha O Mac, Firman M, Brostow G. Digging Into Self-Supervised Monocular
Depth Estimation. 2019 IEEE/CVF Int. Conf. Comput. Vis., vol. 2019- Octob, IEEE;
2019, p. 3827–37.
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative
adversarial nets. Adv. Neural Inf. Process. Syst., vol. 3, Wiesbaden: Springer
Fachmedien Wiesbaden; 2014, p. 2672–80.
Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-Image Translation with Conditional Adversarial
Networks. CVPR 2017.
James S, Johns E. 3D Simulation for Robot Arm Control with Deep Q-Learning 2016.
James S, Wohlhart P, Kalakrishnan M, Kalashnikov D, Irpan A, Ibarz J, et al. Sim-To-Real via
Sim-To-Sim: Data-Ef�icient Robotic Grasping via Randomized-To-Canonical Adaptation
Networks 2019:12627–37.
Jing Chen, Tianbo Liu, Shaojie Shen. Online generation of collision-free trajectories for
quadrotor �light in unknown cluttered environments. 2016 IEEE Int. Conf. Robot.
Autom., vol. 2016- June, IEEE; 2016, p. 1476–83.
Le H, Saeedvand S, Hsu CC. A Comprehensive Review of Mobile Robot Navigation Using
Deep Reinforcement Learning Algorithms in Crowded Environments. J Intell Robot
Syst Theory Appl 2024;110:1–22.
Liu MY, Tuzel O. Coupled generative adversarial networks. Adv. Neural Inf. Process. Syst.,
vol. 29, 2016, p. 469–77.
Park T, Liu MY, Wang TC, Zhu JY. Semantic image synthesis with spatially-adaptive
normalization. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2019-
June, IEEE; 2019, p. 2332–41.
Quigley M, Gerkey B, Conley K, Faust J, Foote T, Leibs J, et al. ROS: an open-source Robot
Operating System 2019.
Al Radi M, AlMallahi MN, Al-Sumaiti AS, Semeraro C, Abdelkareem MA, Olabi AG. Progress in
arti�icial intelligence-based visual servoing of autonomous unmanned aerial vehicles
(UAVs). Int J Thermo�luids 2024;21:100590.
Sadeghi F, Levine S. CAD2RL: Real Single-Image Flight Without a Single Real Image. Robot.
Sci. Syst. XIII, vol. 13, Robotics: Science and Systems Foundation; 2017.
Sampedro C, Bavle H, Rodriguez-Ramos A, de la Puente P, Campoy P. Laser-Based Reactive
Navigation for Multirotor Aerial Robots using Deep Reinforcement Learning. 2018
IEEE/RSJ Int. Conf. Intell. Robot. Syst., IEEE; 2018, p. 1024–31.
Shrivastava A, P�ister T, Tuzel O, Susskind J, Wang W, Webb R. Learning From Simulated and
Unsupervised Images Through Adversarial Training 2017:2107–16.
Sikang Liu, Watterson M, Tang S, Kumar V. High speed navigation for quadrotors with
limited onboard sensing. 2016 IEEE Int. Conf. Robot. Autom., vol. 2016- June, IEEE;
2016, p. 1484–91.
Tzeng E, Devin C, Hoffman J, Finn C, Abbeel P, Levine S, et al. Adapting Deep Visuomotor
Representations with Weak Pairwise Constraints. Springer Proc. Adv. Robot., vol. 13,
Springer, Cham; 2020, p. 688–703.


Páginas:

Wang F, Zhang Q, Zhao Q, Wang M, Sun F. Unsupervised image-to-image translation with
multiscale attention generative adversarial network. Appl Intell 2024;54:6558–78.
Westerski A, Teck FW. Synthetic Data for Object Detection with Neural Networks: State of
the Art Survey of Domain Randomisation Techniques. ACM Trans Multimed Comput
Commun Appl 2023;21.
Wu K, Esfahani MA, Yuan S, Wang H. Depth-based Obstacle Avoidance through Deep
Reinforcement Learning. Proc. 5th Int. Conf. Mechatronics Robot. Eng., vol. Part F1476,
New York, NY, USA: ACM; 2019, p. 102–6.
Xu C, Zhou M, Ge T, Jiang Y, Xu W. Unsupervised Domain Adaption With Pixel-Level
Discriminator for Image-Aware Layout Generation 2023:10114–23.
Xu X, Chen Z, Yin F. Multi-Scale Spatial Attention-Guided Monocular Depth Estimation With
Semantic Enhancement. IEEE Trans Image Process 2021;30:8811–22.
Xu Y, Cao H, Xie L, Li X-L, Chen Z, Yang J. Video Unsupervised Domain Adaptation with Deep
Learning: A Comprehensive Survey. ACM Comput Surv 2024;56:36.
Zhu J-Y, Park T, Isola P, Efros AA. Unpaired Image-to-Image Translation Using
Cycle-Consistent Adversarial Networks. 2017 IEEE Int. Conf. Comput. Vis., vol. 2017-
Octob, IEEE; 2017, p. 2242–51.
