
Scaffold-GS: Do more with less

October 5, 2025
10 min read

Scaffold-GS: Technical Deep Dive

Structured 3D Gaussians for View-Adaptive Rendering

Authors: Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, Bo Dai. Primary reference: Lu et al., Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering [1].

Introduction

We have covered NeRFs [3] and 3D-GS [2] in previous posts. While 3D-GS achieves real-time, high-fidelity rendering, its unstructured per-Gaussian optimization tends to scatter redundant primitives across the scene and to bake view-dependent effects into individual Gaussians, which hurts robustness to viewpoint change and inflates storage.

Scaffold-GS addresses these limitations by introducing a hierarchical, region-aware scaffold of anchor points and by predicting Gaussian attributes on-the-fly in a view-adaptive manner. The method preserves the efficiency of primitive-based rasterization yet recovers robustness and compactness via structure-aware encoding and dynamic decoding of local Gaussians [1].

This blog post adapts and expands the original paper into a publication-ready technical exposition.

Preliminaries (Revision)

A 3D Gaussian used in 3D-GS is an anisotropic Gaussian density centered at $\mu\in\mathbb{R}^3$:

$$G(\mathbf{x}) = \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\mu)^\top \Sigma^{-1}(\mathbf{x}-\mu)\Big), \tag{1}$$

with covariance $\Sigma$ factorized as $\Sigma = R S S^\top R^\top$ so that $\Sigma$ remains positive semidefinite (here $S$ is a diagonal scale matrix and $R$ a rotation). Color is typically represented by SH coefficients or direct RGB, and each Gaussian bears an opacity $\alpha$. These 3D Gaussians are projected to 2D Gaussians on the image plane and rasterized with differentiable tile-based splatting. Pixel colors are composed via an ordered $\alpha$-blending accumulation:

$$C(\mathbf{x}') = \sum_{i\in\mathcal{N}} c_i \sigma_i \prod_{j=1}^{i-1} (1-\sigma_j),\qquad \sigma_i = \alpha_i G'_i(\mathbf{x}'), \tag{2}$$

where $G'_i(\mathbf{x}')$ is the projected 2D Gaussian and $\mathcal{N}$ denotes the Gaussians overlapping pixel $\mathbf{x}'$. The whole pipeline is differentiable and amenable to gradient-based optimization [2].
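For intuition, here is a minimal sketch of Eq. (2) for a single pixel, assuming the overlapping Gaussians are already depth-sorted and their $\sigma_i$ evaluated; the real renderer is a tile-based CUDA rasterizer, not a per-pixel loop:

```python
import torch

def composite_pixel(colors: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha blending (Eq. 2).

    colors: (N, 3) RGB of the depth-sorted Gaussians covering the pixel.
    sigmas: (N,)   effective opacities sigma_i = alpha_i * G'_i(x').
    """
    # Transmittance in front of Gaussian i: prod_{j<i} (1 - sigma_j).
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1, dtype=sigmas.dtype), 1.0 - sigmas[:-1]]), dim=0
    )
    weights = sigmas * transmittance            # per-Gaussian blend weight
    return (weights[:, None] * colors).sum(0)   # final pixel RGB
```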

(References: NeRF background [3]; the canonical 3D-GS work [2].)

Key idea of Scaffold-GS (high level)

Scaffold-GS replaces an unstructured, per-Gaussian optimization with a two-layer hierarchical representation:

  1. Anchor points: a sparse, regularized voxel grid of anchors initialized from SfM point clouds (COLMAP) that roughly encodes where scene content exists; each anchor stores a compact, learnable feature and scale.
  2. Neural Gaussians: spawned on-the-fly from each anchor during rendering. For each visible anchor, dedicated MLPs conditioned on view information (distance and view direction) and the anchor feature predict the attributes (position offsets, opacity, color, scale, rotation) of a small set of $k$ Gaussians.

This design yields three practical effects: (i) geometry-aware distribution of Gaussians (anchors scaffold the coverage), (ii) view-adaptive Gaussians (attributes are decoded on demand, so they can vary with camera pose), and (iii) compactness: the model stores anchors + MLPs instead of millions of independent Gaussian parameters. See Fig. 2 in [1] for a global overview.

Mathematical and algorithmic details

We now step through the major components and reproduce the core equations from the paper.

1. Anchor scaffold initialization

Start from an SfM point cloud $P \in \mathbb{R}^{M\times 3}$ (COLMAP) and voxelize at grid size $\epsilon$:

$$V := \Big\{ \Big\lfloor \frac{P}{\epsilon} \Big\rceil \Big\} \cdot \epsilon, \tag{3}$$

where $\{\cdot\}$ denotes deduplication of voxel center coordinates and $\lfloor \cdot \rceil$ is rounding to the nearest voxel center. Each voxel center $v\in V$ becomes an anchor with parameters:

  • a local context feature $f_v\in\mathbb{R}^{32}$,
  • a learnable anisotropic scaling $l_v\in\mathbb{R}^3$,
  • $k$ learnable offsets $O_v\in\mathbb{R}^{k\times 3}$ (these define the relative positions of spawned neural Gaussians).

This initialization concentrates anchors where SfM believes geometry exists and reduces the irregularity of raw SfM points.
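As a concrete sketch of Eq. (3) and the anchor parameters above, one could initialize the scaffold roughly as follows (tensor layouts and initial values are illustrative, not the reference implementation):

```python
import torch

def init_anchors(sfm_points: torch.Tensor, eps: float, k: int = 4, feat_dim: int = 32):
    """Voxelize SfM points (Eq. 3) and allocate per-anchor parameters.

    sfm_points: (M, 3) point cloud from COLMAP.
    eps:        voxel size, chosen relative to scene scale.
    """
    # Round each point to its nearest voxel center and deduplicate.
    voxel_centers = torch.unique(torch.round(sfm_points / eps), dim=0) * eps

    n = voxel_centers.shape[0]
    anchors = {
        "position": voxel_centers,                                    # x_v
        "feature":  torch.zeros(n, feat_dim, requires_grad=True),     # f_v
        "scaling":  torch.ones(n, 3, requires_grad=True),             # l_v
        "offsets":  torch.zeros(n, k, 3, requires_grad=True),         # O_v
    }
    return anchors
```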

2. Multi-resolution view-dependent anchor feature

To make anchor features view-adaptive and multi-scale, the authors maintain a feature bank per anchor:

$$\mathcal{B}_v = \{f_v,\, f_{v\downarrow 1},\, f_{v\downarrow 2}\},$$

i.e., the base feature and two downsampled/sliced variants. Given a camera position $x_c$ and anchor position $x_v$, compute the relative distance and direction:

$$\delta_{vc} = \|x_v - x_c\|_2,\qquad \hat{\mathbf{d}}_{vc} = \frac{x_v - x_c}{\delta_{vc}}. \tag{4}$$

A tiny MLP $F_w$ maps $(\delta_{vc},\hat{\mathbf{d}}_{vc})$ to a 3-way softmax weight vector:

$$(w, w_1, w_2) = \mathrm{Softmax}\big( F_w(\delta_{vc}, \hat{\mathbf{d}}_{vc}) \big), \tag{5}$$

and the enhanced anchor feature is a weighted sum:

$$\widehat{f}_v = w \cdot f_v + w_1 \cdot f_{v\downarrow 1} + w_2 \cdot f_{v\downarrow 2}. \tag{6}$$

Intuition: the learned weights select an appropriate resolution mixture depending on viewing distance and orientation, enabling coarse vs. fine local detail to be modulated automatically. Implementation uses slicing/repetition to cheaply create multi-resolution features (supplementary details in [1]).
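A minimal sketch of Eqs. (4)-(6); the slicing-and-repetition used to form the bank and the width of $F_w$ are our assumptions based on the description above:

```python
import torch
import torch.nn as nn

class ViewAdaptiveFeature(nn.Module):
    """Blend a per-anchor multi-resolution feature bank with view-dependent weights."""

    def __init__(self):
        super().__init__()
        # F_w: (distance, direction) -> 3 mixing weights (Eq. 5).
        self.f_w = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))

    def forward(self, f_v: torch.Tensor, x_v: torch.Tensor, x_c: torch.Tensor):
        # Relative distance and unit view direction (Eq. 4).
        delta = torch.norm(x_v - x_c, dim=-1, keepdim=True)
        direction = (x_v - x_c) / delta

        # Cheap multi-resolution variants: downsample by slicing, then repeat
        # back to full width (one possible realization of the feature bank).
        f_down1 = f_v[..., ::2].repeat_interleave(2, dim=-1)
        f_down2 = f_v[..., ::4].repeat_interleave(4, dim=-1)

        w = torch.softmax(self.f_w(torch.cat([delta, direction], dim=-1)), dim=-1)
        # Weighted sum over the bank (Eq. 6).
        return w[..., 0:1] * f_v + w[..., 1:2] * f_down1 + w[..., 2:3] * f_down2
```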

3. On-the-fly neural Gaussian derivation (decoding)

For each anchor $v$ visible in the camera frustum, spawn $k$ candidate neural Gaussians. Their positions are computed from learned offsets scaled by the anchor’s per-axis scale:

$$\{\mu_0,\dots,\mu_{k-1}\} = x_v + \{O_0,\dots,O_{k-1}\}\odot l_v, \tag{7}$$

where $\odot$ denotes elementwise scaling. The remaining Gaussian attributes are decoded in a single pass from $\widehat{f}_v$, $\delta_{vc}$ and $\hat{\mathbf{d}}_{vc}$ via small MLPs:

$$\{\alpha_i\}_{i=0}^{k-1} = F_\alpha(\widehat{f}_v, \delta_{vc}, \hat{\mathbf{d}}_{vc}), \tag{8a}$$
$$\{c_i\}_{i=0}^{k-1} = F_c(\widehat{f}_v, \delta_{vc}, \hat{\mathbf{d}}_{vc}), \tag{8b}$$
$$\{q_i\}_{i=0}^{k-1} = F_q(\widehat{f}_v, \delta_{vc}, \hat{\mathbf{d}}_{vc}), \quad \{s_i\}_{i=0}^{k-1} = F_s(\widehat{f}_v, \delta_{vc}, \hat{\mathbf{d}}_{vc}), \tag{8c}$$

where $q_i$ is a quaternion parameterizing orientation, $s_i\in\mathbb{R}^3$ a scale, and $c_i\in\mathbb{R}^3$ the color. Importantly, opacity thresholding prunes trivial Gaussians:

$$\text{keep } i \ \text{iff}\ \alpha_i \ge \tau_\alpha. \tag{9}$$

This dynamic decoding reduces both computation (only decode for visible anchors) and overfitting (Gaussians adapt with view).
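A minimal sketch of Eqs. (7)-(9) for a batch of visible anchors; the MLP widths, activations, and output parameterizations are our assumptions (the paper fuses the heads into a single pass and gives exact architectures in the supplement):

```python
import torch
import torch.nn as nn

class NeuralGaussianDecoder(nn.Module):
    """Decode k neural Gaussians per visible anchor (Eqs. 7-9)."""

    def __init__(self, feat_dim: int = 32, k: int = 4):
        super().__init__()
        self.k = k
        in_dim = feat_dim + 1 + 3  # enhanced feature + distance + view direction
        self.f_alpha = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, k))
        self.f_color = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 3 * k))
        self.f_scale = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 3 * k))
        self.f_quat  = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 4 * k))

    def forward(self, f_hat, x_v, l_v, offsets, delta, direction, tau_alpha: float = 0.0):
        n = f_hat.shape[0]
        # Positions: anchor center + learned offsets scaled per axis (Eq. 7).
        mu = x_v[:, None, :] + offsets * l_v[:, None, :]                  # (n, k, 3)

        z = torch.cat([f_hat, delta, direction], dim=-1)                  # shared input
        alpha = torch.tanh(self.f_alpha(z))                               # (n, k), Eq. 8a
        color = torch.sigmoid(self.f_color(z)).view(n, self.k, 3)         # Eq. 8b
        quat  = nn.functional.normalize(
            self.f_quat(z).view(n, self.k, 4), dim=-1)                    # Eq. 8c
        scale = torch.exp(self.f_scale(z)).view(n, self.k, 3)             # Eq. 8c

        keep = alpha >= tau_alpha  # opacity thresholding (Eq. 9)
        return mu, alpha, color, quat, scale, keep
```

Because all heads share the same input $z$, they can be fused into a single forward pass per batch of visible anchors, which is what keeps per-frame decoding cheap.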

4. Differentiable rasterization & image formation

Each neural Gaussian is projected and rasterized like in 3D-GS; the final pixel color is computed by the standard ordered alpha blend (Eq.2). During training, gradients flow to anchors, offsets, feature banks, and the MLP weights that decode attributes, enabling end-to-end optimization. The architecture leverages tile-based rasterizers for GPU efficiency, matching real-time constraints.
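For reference, the projection step itself is inherited from 3D-GS [2] (and EWA splatting before it): the world-space covariance is mapped to an image-space covariance $\Sigma' = J W \Sigma W^\top J^\top$, where $W$ is the viewing transformation and $J$ the Jacobian of the locally affine approximation of the projective transform. A minimal sketch, ignoring tiling and sorting (shapes and conventions are ours):

```python
import torch

def project_covariance(sigma_3d: torch.Tensor, W: torch.Tensor, J: torch.Tensor) -> torch.Tensor:
    """Image-space covariance Sigma' = J W Sigma W^T J^T for one Gaussian.

    sigma_3d: (3, 3) world-space covariance, Sigma = R S S^T R^T.
    W:        (3, 3) rotational part of the world-to-camera transform.
    J:        (2, 3) Jacobian of the projection, linearized at the Gaussian center.
    """
    return J @ W @ sigma_3d @ W.T @ J.T  # (2, 2)
```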

5. Dynamic anchor refinement: growing & pruning

Initial anchors from SfM can be noisy or sparse. Scaffold-GS refines anchors during training using a neural-Gaussian-based gradient signal:

  • Growing: accumulate the gradients of neural Gaussians grouped per voxel over $N$ iterations; if the accumulated gradient norm exceeds a threshold, spawn new anchors there (improving coverage in under-reconstructed regions, which manifest as large accumulated gradients).
  • Pruning: remove anchors that consistently spawn neural Gaussians with negligible opacity or low importance, controlling memory.

This bi-directional policy maintains compactness while recovering coverage in missed areas. Ablations in the paper show that growing is essential to fidelity while pruning controls size.
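To make the policy concrete, here is a minimal sketch of one refinement step, assuming gradient norms have been accumulated per candidate voxel over the last $N$ iterations and mean opacities are tracked per anchor; thresholds and bookkeeping are illustrative, not the authors' exact code:

```python
import torch

def refine_anchors(anchor_pos: torch.Tensor,        # (A, 3) current anchor positions
                   candidate_voxels: torch.Tensor,  # (V, 3) voxel centers touched by neural Gaussians
                   voxel_grad_norm: torch.Tensor,   # (V,)   gradient norm accumulated over N iterations
                   anchor_mean_opacity: torch.Tensor,  # (A,) running mean opacity of spawned Gaussians
                   grow_thresh: float, prune_thresh: float) -> torch.Tensor:
    """One growing/pruning step of the anchor scaffold."""
    # Growing: add an anchor at any candidate voxel whose accumulated gradient
    # exceeds the threshold (a full implementation would also deduplicate
    # against existing anchors).
    new_anchors = candidate_voxels[voxel_grad_norm > grow_thresh]

    # Pruning: drop anchors whose spawned Gaussians stay near-transparent.
    kept_anchors = anchor_pos[anchor_mean_opacity > prune_thresh]

    return torch.cat([kept_anchors, new_anchors], dim=0)
```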

6. Loss and regularization

Training is supervised by reconstruction and regularization losses. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\ell_1} + \lambda_{\text{SSIM}}\mathcal{L}_{\text{SSIM}} + \lambda_{\text{vol}}\mathcal{L}_{\text{vol}}, \tag{10}$$

where $\mathcal{L}_{\ell_1}$ is the per-pixel L1 loss between the rendered and ground-truth RGB, $\mathcal{L}_{\text{SSIM}}$ is a structural similarity loss [7], and $\mathcal{L}_{\text{vol}}$ is a volume regularizer that discourages excessive occupied volume (e.g., penalizing the total volume implied by Gaussian scales, such as the product of scales across axes). The paper gives implementation details and weight choices in the supplementary material.
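A sketch of Eq. (10), assuming a differentiable `ssim_fn` helper is supplied externally and using a simple product-of-scales volume term; the loss weights below are placeholders, not the paper's values:

```python
import torch

def scaffold_gs_loss(pred: torch.Tensor, gt: torch.Tensor, scales: torch.Tensor,
                     ssim_fn, lambda_ssim: float = 0.2, lambda_vol: float = 0.01):
    """Total training loss of Eq. (10).

    pred, gt: (3, H, W) rendered and ground-truth images.
    scales:   (G, 3) per-axis scales of the active neural Gaussians.
    ssim_fn:  any differentiable SSIM implementation returning a similarity in [0, 1].
    """
    l1 = (pred - gt).abs().mean()              # per-pixel L1
    l_ssim = 1.0 - ssim_fn(pred, gt)           # structural similarity term
    # Volume regularizer: product of per-axis scales approximates each
    # Gaussian's volume; summing discourages overly large primitives.
    l_vol = torch.prod(scales, dim=-1).sum()
    return l1 + lambda_ssim * l_ssim + lambda_vol * l_vol
```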

Implementation notes (practical recipe)

Below are reproducible details distilled from the paper and supplement:

  • Initialization: obtain SfM points and camera poses using COLMAP [4]. Voxelize with $\epsilon$ chosen according to scene scale (see supplement). Deduplicate voxels to get anchors.
  • Anchor feature size: a 32-dim latent per anchor; build a feature bank by slicing and repeating the vector to form multi-resolution variants.
  • k (neural Gaussians per anchor): typically $k=4$, but the paper shows robustness across choices; the number of active Gaussians converges to a stable value via pruning.
  • MLP architectures: small MLPs for $F_w$, $F_\alpha$, $F_c$, $F_q$, $F_s$; attributes are decoded in one forward pass (efficiency). Exact layer widths/depths are provided in the supplement.
  • Filtering for speed: two pre-filters are used: (1) view-frustum culling of anchors, and (2) an opacity threshold ($\tau_\alpha$); see the culling sketch after this list. Filtering dramatically increases FPS with negligible fidelity loss when properly tuned.
  • Training schedule: a mix of multi-scale (coarse → fine) training across datasets; loss weights and optimizer schedules follow the supplementary. The authors report roughly 100 FPS at 1K resolution at inference for typical scenes.
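The two pre-filters above are cheap but important for hitting real-time budgets. A minimal sketch of the anchor-level view-frustum culling (the opacity filter already appears in the decoder sketch); the clip-space convention and margin are assumptions, not the authors' implementation:

```python
import torch

def frustum_cull(anchor_pos: torch.Tensor, view_proj: torch.Tensor, margin: float = 0.05):
    """Keep only anchors whose centers fall inside the camera frustum.

    anchor_pos: (N, 3) anchor positions in world space.
    view_proj:  (4, 4) combined view-projection matrix (column-vector convention).
    """
    homo = torch.cat([anchor_pos, torch.ones(anchor_pos.shape[0], 1)], dim=-1)  # (N, 4)
    clip = homo @ view_proj.T                                                    # (N, 4)
    ndc = clip[:, :3] / clip[:, 3:4].clamp(min=1e-6)
    # Inside the NDC cube (with a small margin), and in front of the camera.
    inside = (ndc.abs() <= 1.0 + margin).all(dim=-1) & (clip[:, 3] > 0)
    return inside  # boolean mask; only these anchors are decoded
```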

Experiments and empirical findings

Scaffold-GS is evaluated across synthetic (Blender), Mip-NeRF360, Tanks & Temples, BungeeNeRF, and VR-NeRF datasets. Key empirical takeaways:

  • Better generalization and robustness: while 3D-GS can achieve slightly higher training PSNR (indicating overfitting to views), Scaffold-GS exhibits higher testing PSNR and consistently better generalization to unseen views, particularly in scenes with varying levels of detail or difficult view-dependent effects.
  • Significant storage reduction: the anchor+MLP representation requires far less storage than a fully optimized Gaussian cloud (e.g., orders-of-magnitude reductions reported for some datasets). Table comparisons in the paper show storage drops from hundreds of MB to a few dozen MB while keeping comparable fidelity.
  • Speed tradeoffs: with frustum culling and opacity filtering, inference speed matches or closely trails 3D-GS, while delivering denser coverage and fewer artifacts.
  • Ablations: the paper carefully ablates k, filtering strategies, and anchor refinement. Results show that (i) filtering primarily impacts speed, not fidelity; (ii) growing is crucial for recovering from poor SfM initializations; and (iii) pruning tames memory growth without hurting quality when combined with growth.

Why it works and where to be careful

Why it reduces redundancy. Anchors impose spatial structure: instead of letting Gaussians drift to fit each view, anchors constrain where Gaussians are spawned. The view-conditioned decoding then lets local appearance change with viewpoint without proliferating primitives. This balances compactness (few anchors + MLPs) and expressiveness (many view-adaptive Gaussians when necessary).

View-adaptivity is key. Many artifacts in 3D-GS stem from baking view effects per Gaussian. By decoding attributes conditioned on camera direction and distance, Scaffold-GS models view-dependent BRDF-like changes implicitly and smoothly.

Limitations and failure modes.

  • Dependence on SfM: anchor initialization relies on reasonable SfM points. Extremely poor SfM may require more aggressive growing or additional priors. The authors mitigate this via gradient-based growing but practitioners should still check SfM quality.
  • MLP overhead: although the MLPs are small and decoding is limited to visible anchors, for extremely dense anchor grids the decoding cost scales up. Filtering strategies are essential to keep real-time targets.
  • Complex materials & lighting: as with most view-synthesis methods that use image supervision, disentangling lighting and view-dependent material responses remains implicit and can lead to hallucinations under strong illumination changes.

Practical tips for researchers & engineers

  1. Start with quality SfM: use COLMAP and visually inspect point clouds. If SfM is noisy, increase growing sensitivity early in training.
  2. Tune $\epsilon$ to scene scale: voxel size controls anchor density; prefer slightly coarse anchors and rely on growing to fill missing areas.
  3. Set the opacity threshold ($\tau_\alpha$) conservatively: too high removes thin structures; too low wastes rendering time. Use the paper's recommended sweep.
  4. Batch decode MLPs: implement the attribute MLPs so that all decoded attribute vectors are produced in a single fused forward pass per anchor to maximize GPU efficiency.
  5. Combine several evaluation metrics: PSNR, SSIM, LPIPS, and storage/FPS together tell a fuller story (the paper reports all).

Conclusion

Scaffold-GS presents a principled and practical strategy to reconcile the competing demands of fidelity, compactness and speed in Gaussian-based neural rendering. Its anchor scaffold enforces geometry-aware allocation of primitives while the view-conditioned MLP decoders provide the flexibility to produce view-dependent, multi-scale appearance without exploding the number of stored primitives. Empirically, Scaffold-GS achieves strong generalization, reduced storage, and real-time capable inference using standard rasterization pipelines [1].

For applied teams building real-time view synthesis systems, Scaffold-GS is worth implementing as a next step beyond vanilla Gaussian splatting: it is especially attractive when storage, robustness to viewpoint change, and multi-scale scene coverage are priorities.

References

[1] Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., & Dai, B. (2023). Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering. arXiv preprint arXiv:2312.00109.

[2] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics (ToG), 42(4), Article 139.

[3] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 65(1), 99-106.

[4] Schönberger, J. L., & Frahm, J.-M. (2016). Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104-4113.

[5] Müller, T., Evans, A., Schied, C., & Keller, A. (2022). Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics (ToG), 41(4), Article 102.

[6] Hedman, P., Philip, J., Price, T., Frahm, J.-M., Drettakis, G., & Brostow, G. (2018). Deep Blending for Free-Viewpoint Image-Based Rendering. ACM Transactions on Graphics (ToG), 37(6), Article 257.

[7] Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4), 600-612.