
Current Approach to Differentiable Rendering

July 15, 2025

The quest to capture and recreate our three-dimensional world from simple photographs represents one of the most ambitious challenges in computer graphics and computer vision. This problem, known as novel view synthesis, asks a deceptively simple question: given a handful of images of a scene, can we generate photorealistic views from any new viewpoint? The answer has profound implications for virtual reality, gaming, film production, and digital content creation.

For decades, this challenge drove researchers through complex, multi-stage pipelines involving geometric reconstruction, depth estimation, and texture mapping. Traditional approaches required explicit geometric modeling, where practitioners would painstakingly reconstruct 3D meshes or point clouds before attempting to render new views. These methods, while mathematically sound, often struggled with the inherent ambiguity of reconstructing complete 3D information from limited 2D observations.

However, a recent paradigm shift has fundamentally changed how we approach this problem. Instead of treating geometric reconstruction as a separate, explicit preprocessing step, modern methods frame novel view synthesis as an end-to-end optimization problem. By creating differentiable renderers that can compute gradients with respect to scene parameters, we can learn a scene’s representation directly by minimizing the difference between rendered images and real photographs. This approach allows the scene’s intricate geometry and appearance to emerge naturally from the optimization process itself, guided by the powerful supervisory signal of pixel color consistency.

This paradigm has given rise to two dominant and complementary philosophies for scene representation. The first treats scenes as continuous, implicit functions, where geometry and appearance are encoded directly in the weights of neural networks. The second returns to the classical concept of explicit geometric primitives, but reimagines them for a world of gradient-based optimization and differentiable rendering.

This article provides a comprehensive exploration of these two landmark approaches that have defined the modern landscape of differentiable rendering. We begin by examining the world of implicit representations through Neural Radiance Fields (NeRF), a technique that redefined quality standards for novel view synthesis. We then explore the power of explicit, point-based representations through 3D Gaussian Splatting, a method that brought real-time rendering capabilities to state-of-the-art view synthesis. Together, these approaches paint a complete picture of how differentiable rendering has transformed our ability to capture and recreate the visual world.

Neural Radiance Fields: The Power of Implicit Representation

In 2020, researchers from UC Berkeley, Google Research, and UC San Diego introduced a method that produced a step-change in the quality of novel view synthesis: Neural Radiance Fields, commonly known as NeRF [2]. The core innovation was radical in its simplicity: instead of explicitly reconstructing geometric meshes or point clouds, NeRF represents complex 3D scenes using nothing more than a standard, fully-connected neural network.

The Mathematical Foundation: A 5D Function for Scenes

NeRF’s fundamental insight lies in its scene representation. Rather than thinking of scenes as collections of geometric primitives, NeRF models them as continuous, five-dimensional functions. As the original paper elegantly states, it represents scenes as “a 5D vector-valued function whose input is a 3D location $\mathbf{x} = (x, y, z)$ and 2D viewing direction $(\theta, \phi)$, and whose output is an emitted color $\mathbf{c} = (r, g, b)$ and volume density $\sigma$” [2].

This function, denoted as $F_\Theta$, is approximated by a Multi-Layer Perceptron (MLP), where $\Theta$ represents the network’s learnable parameters. The elegance of this representation becomes apparent when we consider what each output component represents:

The volume density $\sigma$ encodes the differential probability that a ray of light terminates at a specific point in space. Critically, this density depends only on the spatial location $\mathbf{x}$, ensuring that the scene’s geometric structure remains consistent regardless of viewing angle. High density values indicate solid surfaces or opaque materials, while low values represent empty space or transparent regions.

The emitted color $\mathbf{c}$ depends on both the spatial location $\mathbf{x}$ and the viewing direction $\mathbf{d}$. This view-dependent color modeling allows the network to capture sophisticated optical phenomena such as specular reflections, glossy surfaces, and other view-dependent appearance effects. The metallic sheen on a car’s surface, for instance, appears different when viewed from various angles, and this directional dependency is crucial for photorealistic rendering.

This implicit representation offers remarkable efficiency. An entire, complex scene with millions of geometric details can be encoded within the weights of a relatively compact neural network, typically requiring only around 5 megabytes of storage. This stands in stark contrast to traditional explicit representations, which might require gigabytes for high-resolution voxel grids or detailed mesh models.
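
To make the shape of this function concrete, here is a minimal PyTorch sketch of such an MLP. The two-branch structure (a position-only trunk for density, with the viewing direction injected only for color) mirrors the design described above, but the layer count, widths, and activations are illustrative assumptions rather than the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of a NeRF-style MLP: (x, d) -> (sigma, rgb)."""
    def __init__(self, pos_dim=3, dir_dim=3, hidden=256):
        super().__init__()
        # Density branch depends only on position, keeping geometry view-independent.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        # Color branch sees both position features and the viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, x, d):
        feat = self.trunk(x)
        sigma = torch.relu(self.sigma_head(feat))                            # non-negative density
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d], dim=-1)))   # colors in [0, 1]
        return sigma, rgb
```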

Volume Rendering: From Function to Image

Having established a 5D function representation, the next challenge involves converting this continuous function into discrete pixel colors. NeRF accomplishes this through a differentiable adaptation of classical volume rendering techniques. The process begins by tracing camera rays through the scene and accumulating color and density information at various points along each ray’s path.

For a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ (where $\mathbf{o}$ represents the ray origin and $\mathbf{d}$ represents the ray direction), the expected color $C(\mathbf{r})$ is computed using the volume rendering integral:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt$$

where $t_n$ and $t_f$ represent the near and far bounds of the scene volume. The crucial component in this equation is the transmittance $T(t)$, which represents the probability that a ray travels from the near bound $t_n$ to point $t$ without being absorbed or scattered by intervening matter. This transmittance is defined as:

$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

To make this continuous integral computationally tractable, NeRF employs numerical quadrature. The ray is discretized into $N$ segments, and samples are taken at regular intervals. The continuous integral becomes a discrete sum:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\,\mathbf{c}_i$$

where $\delta_i$ represents the distance between adjacent samples along the ray and $T_i = \exp\bigl(-\sum_{j=1}^{i-1} \sigma_j \delta_j\bigr)$ is the accumulated transmittance up to the $i$-th sample. The term $(1 - \exp(-\sigma_i \delta_i))$ can be interpreted as the probability that the ray terminates within the $i$-th segment.
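
As a rough sketch, this discrete sum can be written in a few lines of PyTorch, assuming per-ray tensors of densities, colors, and sample spacings (the exclusive cumulative product plays the role of $T_i$):

```python
import torch

def composite_rays(sigmas, rgbs, deltas):
    """Discrete volume rendering for a batch of rays.
    sigmas: (R, N), rgbs: (R, N, 3), deltas: (R, N) -> colors (R, 3)."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # per-segment termination probability
    # Exclusive cumulative product: T_i = prod_{j<i} (1 - alpha_j) = exp(-sum_{j<i} sigma_j delta_j)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = trans * alphas                                    # w_i = T_i (1 - exp(-sigma_i delta_i))
    return (weights[..., None] * rgbs).sum(dim=-2)
```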

This entire rendering pipeline maintains full differentiability. Every operation, from querying the neural network for density and color values to compositing these values into final pixel colors, is differentiable. This means we can compute gradients of the rendering error with respect to the network parameters $\Theta$, enabling direct optimization of the scene representation using standard gradient descent techniques.
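
A minimal training step might then look like the following sketch, which reuses the hypothetical TinyNeRF and composite_rays helpers from above and simply backpropagates an L2 photometric loss through the renderer; the shapes and hyperparameters are illustrative assumptions.

```python
import torch

def train_step(model, optimizer, rays_o, rays_d, t_vals, gt_colors):
    """One gradient step on the photometric loss.
    rays_o, rays_d: (R, 3), t_vals: (R, N) sample depths, gt_colors: (R, 3)."""
    pts = rays_o[:, None, :] + t_vals[..., None] * rays_d[:, None, :]   # (R, N, 3) sample points
    dirs = rays_d[:, None, :].expand_as(pts)                            # same direction per sample
    sigmas, rgbs = model(pts, dirs)                                     # (R, N, 1), (R, N, 3)
    deltas = torch.diff(t_vals, dim=-1)
    deltas = torch.cat([deltas, deltas[..., -1:]], dim=-1)              # pad the last interval
    pred = composite_rays(sigmas.squeeze(-1), rgbs, deltas)             # (R, 3) rendered colors
    loss = ((pred - gt_colors) ** 2).mean()                             # simple L2 photometric loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```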

Critical Innovations for Photorealistic Quality

While the basic NeRF formulation is conceptually elegant, naive implementations struggle to capture the fine, high-frequency details that characterize photorealistic imagery. The original NeRF paper introduced two crucial innovations that proved essential for achieving state-of-the-art results.

Positional Encoding: Overcoming Spectral Bias

Neural networks exhibit a well-documented phenomenon known as spectral bias, where they are “biased towards learning lower frequency functions” [2]. This bias manifests as an inability to represent sharp edges, fine textures, and intricate details that are abundant in real-world photographs. Without addressing this limitation, NeRF would produce overly smooth, blurry renderings that lack the crisp detail necessary for photorealism.

NeRF addresses this fundamental limitation through positional encoding, which maps input coordinates to a higher-dimensional space using high-frequency functions before passing them to the network. For each scalar input coordinate $p$, the positional encoding $\gamma(p)$ is computed as:

$$\gamma(p) = \bigl(\sin(2^0\pi p), \cos(2^0\pi p), \sin(2^1\pi p), \cos(2^1\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\bigr)$$

where $L$ determines the maximum frequency represented in the encoding. This transformation expands each input coordinate from a single scalar to a vector of $2L$ trigonometric functions, effectively providing the network with access to features at multiple frequency scales.
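
A compact sketch of this encoding in PyTorch follows; it applies $\gamma$ independently to every coordinate of the input (in the original work $L = 10$ is used for positions and $L = 4$ for viewing directions, and implementations often concatenate the raw coordinate as well):

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """Map each coordinate of p (..., D) to 2 * num_freqs sinusoidal features, as in gamma(p)."""
    freqs = 2.0 ** torch.arange(num_freqs) * math.pi       # 2^0 pi, 2^1 pi, ..., 2^{L-1} pi
    angles = p[..., None] * freqs                           # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (..., D, 2L)
    return enc.flatten(start_dim=-2)                        # (..., D * 2L)
```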

The impact of this seemingly simple transformation cannot be overstated. It enables the network to represent everything from the subtle grain of wood texture to the sharp geometric edges of architectural structures. Without positional encoding, NeRF would be fundamentally limited to representing smooth, low-frequency variations in scene appearance.

Hierarchical Volume Sampling: Efficient Ray Traversal

The second critical innovation addresses computational efficiency during ray traversal. Uniformly sampling points along each camera ray proves highly inefficient, as many samples inevitably fall in empty space or occluded regions that contribute nothing to the final pixel color. This uniform sampling strategy wastes computational resources and can lead to poor optimization dynamics.

NeRF introduces a hierarchical, coarse-to-fine sampling strategy that adaptively concentrates samples in regions most likely to contain visible content. The approach employs two networks: a coarse network and a fine network.

The process begins with stratified sampling along the ray: the ray is partitioned into evenly spaced bins and the coarse network is evaluated at one sample drawn from each bin. The outputs of this coarse network provide initial estimates of where visible content is likely to be located along the ray. The volume rendering weights from the coarse network are then normalized to create a piecewise-constant probability density function (PDF) along the ray.

A second, more informed set of samples is drawn from this learned distribution, concentrating samples in regions where the coarse network indicates high probability of visible content. The final pixel color is rendered using the fine network, evaluated at the union of both coarse and fine samples.
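
The fine-sample placement amounts to inverse-transform sampling of the piecewise-constant PDF defined by the coarse weights. The sketch below illustrates this for a single ray; the bin boundaries, sample count, and handling of degenerate weights are simplified assumptions.

```python
import torch

def sample_pdf(bin_edges, weights, num_fine):
    """Draw fine sample depths via inverse-transform sampling of a piecewise-constant PDF.
    bin_edges: (N+1,) coarse bin boundaries along the ray, weights: (N,) coarse rendering weights."""
    pdf = weights / (weights.sum() + 1e-10)
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])           # (N+1,)
    u = torch.rand(num_fine)                                              # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(weights))   # bin index per draw
    # Linearly interpolate within the selected bin.
    cdf_lo, cdf_hi = cdf[idx - 1], cdf[idx]
    t_lo, t_hi = bin_edges[idx - 1], bin_edges[idx]
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return t_lo + frac * (t_hi - t_lo)                                    # fine sample depths
```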

As the original paper notes, “This procedure allocates more samples to regions we expect to contain visible content” [2]. This adaptive sampling strategy focuses the model’s representational capacity where it matters most, significantly improving both rendering quality and computational efficiency.

Impact and Limitations

The combination of implicit neural representation, differentiable volume rendering, positional encoding, and hierarchical sampling established NeRF as a transformative breakthrough in novel view synthesis. The method demonstrated unprecedented quality in capturing complex scenes with intricate geometry, realistic lighting, and view-dependent effects.

However, NeRF’s implicit representation comes with inherent computational costs. Training typically requires hours or days of optimization, and rendering new views can take seconds per image. These limitations, while acceptable for offline content creation, preclude real-time applications and interactive use cases.

3D Gaussian Splatting: Explicit Primitives for Real-Time Rendering

While NeRF and its successors demonstrated the remarkable potential of implicit representations, their computational requirements remained a significant barrier to practical deployment. Training times measured in hours or days, combined with rendering speeds of seconds per frame, made real-time applications impossible. In 2023, a team from Inria and Max-Planck-Institut für Informatik introduced a method that achieved state-of-the-art quality while enabling real-time rendering: 3D Gaussian Splatting (3DGS) [1].

This approach marked a fundamental philosophical shift in scene representation. Instead of encoding scenes as continuous, implicit functions, 3DGS returns to explicit geometric primitives. However, these primitives are specifically designed for gradient-based optimization and differentiable rendering, combining the best aspects of classical computer graphics with modern machine learning techniques.

The Primitive of Choice: 3D Gaussians

The central innovation of 3DGS lies in its choice of scene representation primitive: the 3D Gaussian. Rather than representing scenes as neural networks or traditional meshes, 3DGS models scenes as collections of millions of 3D Gaussians, each characterized by several key parameters:

A position (mean) $\boldsymbol{\mu}$ that locates the Gaussian in 3D space. This serves as the primitive’s anchor point and determines where it contributes to the scene.

An anisotropic covariance matrix $\boldsymbol{\Sigma}$ that defines the Gaussian’s shape, scale, and orientation. This matrix allows primitives to be stretched, flattened, and rotated to accurately model diverse surface geometries, from thin lines and flat planes to rounded volumes.

An opacity value $\alpha$ that controls the primitive’s transparency. This parameter enables modeling of semi-transparent materials and provides a mechanism for soft boundaries between objects.

A set of Spherical Harmonic (SH) coefficients that encode the primitive’s color as a function of viewing direction. This representation allows for efficient modeling of view-dependent appearance effects while maintaining compact storage requirements.
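
Collected together, these per-primitive quantities form the entire learnable state of the scene. A sketch of how they might be stored as optimizable tensors is shown below; the field names, log/logit parameterizations, and initial values are illustrative choices, not the reference implementation’s exact layout.

```python
import torch
import torch.nn as nn

class GaussianCloud(nn.Module):
    """Sketch of the per-primitive parameters optimized in a 3DGS-style representation."""
    def __init__(self, init_xyz, sh_degree=3):
        super().__init__()
        n = init_xyz.shape[0]
        num_sh = (sh_degree + 1) ** 2                              # SH coefficients per color channel
        self.xyz = nn.Parameter(init_xyz.clone())                  # positions (means), (N, 3)
        self.log_scale = nn.Parameter(torch.full((n, 3), -4.0))    # anisotropic scales, log-space
        self.rotation = nn.Parameter(torch.zeros(n, 4))            # quaternions
        self.rotation.data[:, 0] = 1.0                             # start at identity rotation
        self.opacity_logit = nn.Parameter(torch.zeros(n, 1))       # opacity before sigmoid
        self.sh = nn.Parameter(torch.zeros(n, num_sh, 3))          # view-dependent color coefficients
```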

The explicit nature of this representation offers several compelling advantages. As the authors observe, “we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space” [1]. Unlike implicit representations that must be queried throughout the entire scene volume, Gaussians exist only where matter is present, naturally avoiding expensive computations in empty regions.

Differentiable Rasterization: From 3D Gaussians to 2D Images

The core technical challenge in 3DGS involves efficiently rendering millions of 3D Gaussians while maintaining differentiability for gradient-based optimization. This is accomplished through a sophisticated, GPU-optimized rasterization pipeline that projects 3D Gaussians to 2D screen space and blends them to form the final image.

The rendering process begins with projection. For each 3D Gaussian viewed from a specific camera, the system projects the 3D covariance matrix $\boldsymbol{\Sigma}$ into a corresponding 2D covariance matrix $\boldsymbol{\Sigma}'$ in screen space. This projection operation transforms the 3D Gaussian into a 2D “splat” that represents how the primitive appears from the current viewpoint.
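
Concretely, the projection uses a local affine approximation of the perspective projection (in the spirit of EWA splatting): if $\mathbf{J}$ is the Jacobian of that projection and the Gaussian has already been transformed into camera coordinates by the viewing transform, the screen-space covariance is approximately $\boldsymbol{\Sigma}' = \mathbf{J}\boldsymbol{\Sigma}\mathbf{J}^T$. The following sketch computes this for a single Gaussian under those assumptions:

```python
import torch

def project_covariance(cov3d, mean_cam, fx, fy):
    """Affine (EWA-style) approximation: 3D covariance in camera space -> 2D splat covariance.
    cov3d: (3, 3), mean_cam: (3,) camera-space mean, fx/fy: focal lengths in pixels."""
    x, y, z = mean_cam[0], mean_cam[1], mean_cam[2]
    zero = torch.zeros((), dtype=cov3d.dtype)
    # Jacobian of the perspective projection (u, v) = (fx * x / z, fy * y / z) w.r.t. (x, y, z).
    J = torch.stack([
        torch.stack([fx / z, zero, -fx * x / z ** 2]),
        torch.stack([zero, fy / z, -fy * y / z ** 2]),
    ])                                                       # (2, 3)
    return J @ cov3d @ J.T                                   # (2, 2) screen-space covariance
```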

The final image is formed through alpha blending of these 2D splats in front-to-back order. The rendering equation for 3DGS can be expressed as:

$$\hat{C} = \sum_{i} T_i\,\alpha_i\,\mathbf{c}_i$$

This has the same form as the volumetric rendering sum used by NeRF, where samples of density $\sigma$, transmittance $T$, and color $\mathbf{c}$ are taken along the ray with intervals $\delta_i$. In 3DGS, however, the alpha values are computed as $\alpha_i = \omega_i\,G(\mathbf{x})$, the product of a learned per-primitive opacity $\omega_i$ and the value of the projected 2D Gaussian $G$ at the pixel, and the transmittance is given by $T_i = \prod_{j=1}^{i-1}(1 - \alpha_j)$.

A typical neural point-based approach would compute the color $\mathbf{c}$ of a pixel by blending $N$ ordered points overlapping the pixel using the standard alpha compositing formula:

$$\mathbf{c} = \sum_{i \in N} \mathbf{c}_i\,\alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j)$$

where $\mathbf{c}_i$ is the color of each point and $\alpha_i$ is given by evaluating a 2D Gaussian with covariance $\boldsymbol{\Sigma}'$ multiplied by a learned per-point opacity. This formulation demonstrates the close relationship between point-based alpha blending and volumetric rendering approaches.
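
Per pixel, the blending itself is a simple front-to-back loop over the depth-sorted splats that cover it, accumulating color while the remaining transmittance decays. The sketch below also includes early termination once a pixel is effectively saturated, as the rasterizer does; the threshold value here is an illustrative assumption.

```python
import torch

def blend_pixel(colors, alphas, t_threshold=1e-4):
    """Front-to-back alpha compositing of depth-sorted splats covering one pixel.
    colors: (N, 3), alphas: (N,), both already sorted near-to-far."""
    pixel = torch.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel = pixel + transmittance * a * c               # accumulate weighted color
        transmittance = transmittance * (1.0 - a)           # remaining light after this splat
        if transmittance < t_threshold:                     # early termination when saturated
            break
    return pixel
```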

The key to 3DGS’s exceptional performance lies in its highly optimized, tile-based rasterization algorithm. This algorithm addresses the primary bottleneck in point-based rendering: the computational cost of sorting and blending millions of primitives for each frame.

The rasterization pipeline divides the screen into regular 16×16 pixel tiles and assigns each Gaussian to the tiles it overlaps. A crucial optimization involves performing a single, global sort of all Gaussian instances per frame using fast GPU-accelerated radix sort. This sort uses a composite key combining the Gaussian’s depth and tile identifier, enabling efficient front-to-back traversal during rendering.

Each screen tile is then processed by a separate GPU thread block, which traverses its pre-sorted list of Gaussians and performs alpha blending. This design maximizes parallelism and avoids the costly per-pixel sorting operations that hindered previous point-based rendering methods.
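
One way to picture the per-frame global sort is as an ordinary sort over composite integer keys, with the tile identifier in the high bits and a quantized depth in the low bits, so that sorting groups instances by tile and orders them front to back within each tile. The sketch below conveys the idea only; the actual implementation packs the view-space depth into the key and runs a fast GPU radix sort, as described above.

```python
import torch

def sort_splats(tile_ids, depths, num_depth_bits=32):
    """Illustrative composite-key sort of (Gaussian, tile) instances.
    tile_ids: (M,) int tile index per instance, depths: (M,) depths normalized to [0, 1]."""
    depth_bits = (depths * (2 ** num_depth_bits - 1)).to(torch.int64)   # quantized depth, low bits
    keys = (tile_ids.to(torch.int64) << num_depth_bits) | depth_bits    # tile id in the high bits
    return torch.argsort(keys)             # grouped by tile, front-to-back within each tile
```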

Critically, this entire pipeline maintains full differentiability. Gradients can flow from final pixel colors back through the blending operations to every parameter of every Gaussian—position, covariance, opacity, and color coefficients. This differentiability enables direct optimization of the scene representation using standard gradient descent techniques.

Optimization Strategy: Adaptive Density Control

Achieving high-quality results with 3DGS requires careful initialization and dynamic control of the Gaussian primitives during optimization. The method employs a sophisticated adaptive density control strategy that automatically adjusts the number and configuration of Gaussians based on reconstruction quality.

Initialization and Parameterization

The optimization process begins with initialization from the sparse point cloud produced by standard Structure-from-Motion (SfM) algorithms. “We initialize the set of 3D Gaussians with the sparse point cloud produced for free as part of the SfM process” [1]. Each point in the sparse reconstruction becomes the center of a 3D Gaussian, with initial covariance parameters set to represent small, isotropic distributions.

Direct optimization of the covariance matrix $\boldsymbol{\Sigma}$ presents numerical challenges, as the matrix must remain positive semi-definite throughout optimization. Geometrically, the covariance matrix of a 3D Gaussian describes the configuration of an ellipsoid, determining how the Gaussian extends and orients in three-dimensional space.

To ensure optimization stability, 3DGS employs a more intuitive parameterization that separates scaling and rotation components. Given a scaling matrix $\mathbf{S}$ and rotation matrix $\mathbf{R}$, the corresponding covariance matrix is constructed as:

$$\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$$

To allow independent optimization of both geometric factors, the system stores them separately: a 3D vector $\mathbf{s}$ for scaling and a quaternion $\mathbf{q}$ for rotation. These parameters can be trivially converted to their respective matrices and combined, provided $\mathbf{q}$ is normalized to a valid unit quaternion. This parameterization guarantees that the resulting covariance matrix remains positive semi-definite throughout optimization while providing intuitive control over the Gaussian’s shape and orientation.
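
A small sketch of this construction, normalizing the quaternion and forming $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$, is given below; it works on batched tensors and stays differentiable end to end.

```python
import torch
import torch.nn.functional as F

def build_covariance(scale, quat):
    """Build a positive semi-definite covariance Sigma = R S S^T R^T.
    scale: (..., 3) per-axis scales, quat: (..., 4) quaternions (w, x, y, z)."""
    S = torch.diag_embed(scale)                              # diagonal scaling matrix
    q = F.normalize(quat, dim=-1)                            # enforce a valid unit quaternion
    w, x, y, z = q.unbind(-1)
    # Rotation matrix from the unit quaternion.
    R = torch.stack([
        torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)], dim=-1),
        torch.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)], dim=-1),
        torch.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)], dim=-1),
    ], dim=-2)
    M = R @ S                                                # R S
    return M @ M.transpose(-1, -2)                           # R S S^T R^T
```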

Densification and Pruning

The sparse initialization provides insufficient geometric detail for representing complex scenes. The system must dynamically add and remove Gaussians during optimization to achieve the necessary level of detail. This is accomplished through periodic densification and pruning operations guided by reconstruction error metrics.

The system identifies candidates for densification by monitoring the magnitude of positional gradients in screen space. Large gradients indicate regions where the current Gaussian configuration inadequately represents the scene content. Two strategies address different types of under-reconstruction:

For regions containing small Gaussians with large positional gradients, the system performs cloning, creating duplicate Gaussians to increase local density. This addresses areas where additional primitives are needed to represent fine geometric detail.

For regions containing large Gaussians with significant positional gradients, the system performs splitting, replacing single large Gaussians with multiple smaller ones. This addresses cases where a single primitive inadequately covers a region with substantial internal variation.

Conversely, Gaussians that become irrelevant during optimization are removed through pruning. This includes primitives with very low opacity (nearly transparent) or those that have grown to excessively large scales. Regular pruning prevents the accumulation of unnecessary primitives and maintains computational efficiency.
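
In code, the decision logic reduces to a few threshold tests per Gaussian, as in the sketch below; the threshold values and the exact statistics used (here, the norm of the accumulated view-space positional gradient and the largest scale component) are illustrative stand-ins for the heuristics tuned in the actual method.

```python
import torch

def densify_and_prune(scales, opacities, grad_norm,
                      grad_thresh=2e-4, scale_thresh=0.01, min_opacity=0.005):
    """Sketch of the adaptive density control decisions (thresholds are illustrative).
    scales: (N, 3), opacities: (N, 1), grad_norm: (N,) accumulated positional-gradient norms.
    Returns boolean masks; the actual update would clone/split/delete the corresponding Gaussians."""
    big = scales.max(dim=-1).values > scale_thresh
    needs_detail = grad_norm > grad_thresh                  # large view-space positional gradients
    clone_mask = needs_detail & ~big                        # small Gaussian, under-reconstructed: clone
    split_mask = needs_detail & big                         # large Gaussian, over-stretched: split
    prune_mask = opacities.squeeze(-1) < min_opacity        # nearly transparent: prune
    return clone_mask, split_mask, prune_mask
```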

Performance and Quality Breakthrough

The combination of explicit Gaussian primitives, optimized differentiable rasterization, and adaptive density control delivers a remarkable breakthrough: 3DGS achieves rendering quality comparable to leading implicit methods while enabling real-time rendering at interactive frame rates. Training times are reduced from hours to minutes, and rendering speeds improve from seconds per frame to dozens of frames per second.

This performance advantage opens entirely new application domains for neural view synthesis, including interactive virtual reality experiences, real-time content creation tools, and gaming applications where immediate visual feedback is essential.

Comparing Paradigms: Implicit vs. Explicit

The emergence of both NeRF and 3D Gaussian Splatting represents more than incremental progress in rendering techniques; it demonstrates two fundamentally different approaches to the same underlying problem. Understanding the trade-offs between implicit and explicit representations provides crucial insights for practitioners choosing between these methods.

Representational Capacity and Quality

Implicit representations like NeRF excel at capturing scenes with complex volumetric phenomena, subtle lighting effects, and intricate semi-transparent materials. The continuous nature of neural functions provides an elegant mathematical framework for representing scenes with unlimited geometric detail, bounded only by the network’s capacity and the quality of the training data.

Explicit representations like 3DGS offer different strengths, particularly for scenes dominated by solid surfaces and opaque materials. The discrete nature of Gaussian primitives provides direct control over geometric detail and enables efficient representation of sharp boundaries and distinct material transitions.

In practice, both methods achieve remarkably similar quality on typical photographic scenes, suggesting that the choice between implicit and explicit representation may depend more on computational constraints and application requirements than on fundamental quality limitations.

Computational Efficiency

The computational profiles of these methods differ dramatically. NeRF’s implicit representation requires extensive neural network evaluation during both training and rendering, leading to significant computational costs. However, the compact representation (typically 5-50 MB) enables efficient storage and transmission of trained models.

3DGS trades storage efficiency for computational speed. The explicit representation requires more memory (hundreds of MB to GB for complex scenes) but enables real-time rendering through efficient rasterization. This trade-off proves particularly advantageous for interactive applications where rendering speed is paramount.

Editability and Control

The explicit nature of 3DGS primitives provides more direct pathways for user editing and scene manipulation. Individual Gaussians can be selected, moved, deleted, or modified to achieve desired visual effects. This explicit control makes 3DGS particularly attractive for content creation workflows.

NeRF’s implicit representation, while compact and elegant, provides limited opportunities for intuitive editing. Modifying specific scene regions requires careful manipulation of network weights, which lacks the direct geometric interpretation available with explicit primitives.

Practical Applications and Future Directions

The development of NeRF and 3D Gaussian Splatting has fundamentally transformed the landscape of 3D content creation, providing accessible tools for generating photorealistic 3D assets from simple photograph collections. These methods have democratized high-quality 3D reconstruction, making it available to practitioners without specialized expertise in traditional 3D modeling.

Current Applications

These techniques have found immediate adoption across diverse domains. In film and television production, they enable rapid creation of digital environments and virtual sets. Game developers use them to generate realistic 3D assets from reference photographs. Architectural visualization benefits from their ability to create immersive virtual walkthroughs from construction documentation.

The real-time capabilities of 3DGS have opened additional applications in virtual and augmented reality, where interactive frame rates are essential for user experience. Educational applications leverage these methods to create immersive historical reconstructions and scientific visualizations.

Technical Frontiers

Current research continues to push the boundaries of what’s possible with differentiable rendering. Dynamic scene reconstruction addresses the challenge of capturing moving objects and changing environments. Real-time training methods aim to reduce the time required for scene optimization. Hybrid representations explore combinations of implicit and explicit elements to capture the advantages of both approaches.

Advanced editing capabilities represent another active area of development. Researchers are developing intuitive interfaces for scene manipulation, semantic editing tools that understand object boundaries and material properties, and automated techniques for content generation and style transfer.

The Broader Impact

The success of NeRF and 3DGS demonstrates the power of differentiable rendering as a general paradigm for 3D content creation. By formulating rendering as an optimization problem, these methods enable direct learning of scene representations from 2D supervision, bypassing the need for explicit 3D ground truth.

This paradigm shift has implications beyond novel view synthesis. Similar principles apply to related problems in computer graphics and vision, including material acquisition, lighting estimation, and geometric reconstruction. The mathematical framework of differentiable rendering provides a unifying approach to these traditionally separate domains.

Conclusion

The journey from implicit neural fields to explicit Gaussian primitives illustrates the rapid evolution and remarkable creativity of the computer graphics and vision communities. Both NeRF and 3D Gaussian Splatting represent significant advances in our ability to capture and recreate the visual world, each offering unique advantages and addressing different aspects of the view synthesis challenge.

NeRF’s implicit representation provides unparalleled flexibility and mathematical elegance, excelling at capturing complex volumetric phenomena and subtle lighting effects. Its compact representation and continuous formulation make it particularly suitable for scenes with intricate geometry and sophisticated material properties.

3D Gaussian Splatting’s explicit approach offers complementary strengths, particularly in computational efficiency and real-time rendering capabilities. Its use of optimized rasterization and adaptive primitive control enables interactive applications while maintaining high visual quality.

The fundamental insight shared by both methods—that differentiable rendering enables direct optimization of scene representations from 2D images—represents a paradigm shift with broad implications for 3D content creation. This principle has already inspired numerous extensions and variations, and will likely continue to drive innovation in the field.

For practitioners, the choice between these approaches depends on specific application requirements. Projects requiring real-time rendering and interactive manipulation favor explicit representations like 3DGS. Applications prioritizing storage efficiency and complex volumetric effects may prefer implicit representations like NeRF.

Looking forward, the most exciting developments may come from hybrid approaches that combine the best aspects of both paradigms. As the field continues to evolve, we can expect to see methods that offer the quality and flexibility of implicit representations with the efficiency and editability of explicit primitives.

The transformation of novel view synthesis from a niche research problem to a practical tool for content creation demonstrates the power of differentiable rendering as a general framework for 3D understanding. By mastering the language of gradients and optimization, the field has unlocked new possibilities for capturing, representing, and manipulating the visual world around us.

These are not endpoints but foundations upon which the next generation of 3D content creation tools will be built. The principles established by NeRF and 3D Gaussian Splatting will continue to influence research and development in computer graphics, computer vision, and related fields for years to come.

References

[1] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics (TOG), 42(4), 1-14.

[2] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In European Conference on Computer Vision (ECCV).