# Neural Rendering

- Introduction
- Loss functions for neural rendering
- Generative Adversarial Networks with 3D Control
- Neural Scene Representation and Rendering
- Novel View Synthesis for Objects and Scenes
- Neural Volumetric Rendering: NeRF, etc.

**SIGGRAPH 2021 Course: Advances in Neural Rendering** (link)

## Introduction

- Michael Zollhoefer from Facebook Reality Labs Research
- Two alternatives for realistic image synthesis:
- Photo-realistic rendering: Lots of manual work + Full control of scene parameters
- Generative ML: Lots of data + Automatic training + Interactive inference/rendering

- Motivation: Creating photorealistic assets is challenging using classical CG techniques
- What is neural rendering? Three components:
- Generative networks that synthesize raw pixel output
- Controllable by interpretable parameters or by video/audio input
- Illumination, camera, pose, geometry, appearance, or semantic structure can be controlled

- Why neural rendering?:
- Can we learn (part of the) scene representation and/or (part of the) CG function?

- Neural Rendering Zoo
- “Regress it” (GQN): 1D code => 2D generative network => 2D image
- “Make it more real” (DVP, DNR): 3D mesh/points + 1D codes => CG (3D to 2D) => 2D encoder/decoder => 2D image
- “Regress and render” (Neural Volumes): 1D code => 2D generative network => 3D mesh/points + 2D texture + 3D volume => CG (3D to 2D) => 2D image
- “Step, sample and blend” (NeRF; super popular in the community): 3D space => coordinates => MLP => CG (3D to 2D) => 2D image

## Loss functions for neural rendering

- Jun-Yan Zhu, CMU CS
- Problem statement?
- argmin_G Loss(G(x_input), y_output)
- So, what is a good objective (loss) function **L**?

- Designing the loss function?
- L2 regression: tends to average out in L2 distance (blurry results); not good for optimizing each local part of the image
- Classification loss - Cross-entropy objective with a colorfulness term
- Feature/Perceptual loss - Deep-feature-space matching objective (see the sketch below)
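
A minimal sketch of a feature/perceptual loss, assuming PyTorch and torchvision's pretrained VGG16; the layer cutoff (relu3_3) and plain L2 feature distance are illustrative choices, not the exact objective of any particular paper:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """Compare images in a deep feature space instead of pixel space."""
    def __init__(self, layer_idx=16):  # features[:16] ends at relu3_3 in torchvision's VGG16
        super().__init__()
        features = vgg16(pretrained=True).features[:layer_idx].eval()
        for p in features.parameters():
            p.requires_grad = False  # the loss network stays fixed
        self.features = features

    def forward(self, pred, target):
        # L2 distance between deep features of prediction and target
        return torch.mean((self.features(pred) - self.features(target)) ** 2)
```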

- Loss function in Generative Adversarial Networks?
- Distinguish whether an image is real or fake
- Using human annotation is expensive, so replace the human judge with a classifier (the discriminator)!
- Check pix2pix, edges2cats
- What can pix2pix do?
- grayscale => automatic colorization
- sketch => photo
- But it needs paired training data, which is expensive.

- Check cycle-consistent adversarial networks (CycleGAN): horse to zebra, orange to apple!
- How to train with unpaired data (shape of a horse + texture of a zebra with a different shape)?
- Cycle-consistency loss: horse (x) to zebra (G(x)), and zebra back to horse (F(G(x))); see the sketch below
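
A minimal sketch of the cycle-consistency term, assuming PyTorch; G and F_net stand for the two translators (X => Y and Y => X), and the L1 form with weight lam follows the CycleGAN objective:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G, F_net, real_x, real_y, lam=10.0):
    """x -> G(x) -> F(G(x)) should come back to x, and symmetrically for y."""
    rec_x = F_net(G(real_x))   # horse -> zebra -> horse again
    rec_y = G(F_net(real_y))   # zebra -> horse -> zebra again
    # L1 reconstruction error in both directions, weighted by lam
    return lam * (F.l1_loss(rec_x, real_x) + F.l1_loss(rec_y, real_y))
```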

- Patch-based contrastive loss using cosine similarity between small patches of X and G(X) (see the sketch below)
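
A minimal sketch of a patchwise contrastive (InfoNCE-style) loss in the spirit of this idea, assuming PyTorch; how the patch features are extracted and the temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(feat_x, feat_gx, temperature=0.07):
    """feat_x, feat_gx: (num_patches, dim) features of patches at the same
    locations in X and G(X); matching locations are the positive pairs."""
    feat_x = F.normalize(feat_x, dim=1)    # unit vectors => dot product is cosine similarity
    feat_gx = F.normalize(feat_gx, dim=1)
    logits = feat_gx @ feat_x.t() / temperature   # (N, N) similarity matrix
    labels = torch.arange(feat_x.shape[0], device=feat_x.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are the positives
```
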
- Summary

## Generative Adversarial Networks with 3D Control

- Ayush Tewari, Max Planck Institute for Informatics
- GAN?
- Supervised training of GAN for neural rendering?
- Add scene parameters (illumination/pose) to the input data and do supervised training
- Training with synthetic datasets (so, no need lots of labeled data pairs)
- Training with supervised pairs? Add annotation tools, then move along the normal direction of a semantic-property (pose/gender/expression) hyperplane in the latent space (see the sketch after this list).
- Add non-linearity for high-quality control!
- Inverse graphics & gradual 3D control of semantic properties!
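
A minimal sketch of the hyperplane idea, assuming scikit-learn and a set of latent codes with binary attribute labels; the names and the linear editing rule are illustrative (in the spirit of InterFaceGAN):

```python
import numpy as np
from sklearn.svm import LinearSVC

def semantic_direction(latents, labels):
    """Fit a hyperplane separating latents by a binary attribute
    (e.g. pose, expression); its unit normal is the editing direction."""
    svm = LinearSVC().fit(latents, labels)
    n = svm.coef_[0]
    return n / np.linalg.norm(n)

# Hypothetical usage: push a latent code along the normal to edit the attribute;
# alpha controls the editing strength, z is a latent code of the generator.
# z_edit = z + alpha * semantic_direction(latents, labels)
```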

- Unsupervised methods?
- Training generator for controllability
- Projecting real images to latent vectors using optimization-based methods
- Editing the projected latent vector
- Regularization
- Transformation
- Learning-based methods for projection (instead of optimization): use an encoder!

- Challenges?
- What can be edited?: Widen the set of adjustable control parameters!
- What can be projected?: Trade-off between quality of projection, reconstruction, and realistic editing

- 3D GANs?
- View points can be controlled explicitly!

## Neural Scene Representation and Rendering

- Gordon Wetzstein, Stanford EE/CS, www.computationalimaging.org
- Self-supervised scene representation learning approach
- Model a 3D object with a neural network
- Then came NeRF: Mildenhall et al., ECCV 2020
- A neural network is more compact than a 3D voxel grid or mesh
- Use SIREN (periodic activation functions) instead of ReLU (see the sketch after this list)
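
A minimal sketch of one SIREN-style layer, assuming PyTorch, with the sine activation and the initialization scheme from Sitzmann et al.:

```python
import numpy as np
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """y = sin(omega_0 * (Wx + b)); omega_0 = 30 as in the SIREN paper."""
    def __init__(self, in_dim, out_dim, omega_0=30.0, first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():  # SIREN initialization keeps activations well-distributed
            bound = 1.0 / in_dim if first else np.sqrt(6.0 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```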

- Pi-GAN
- Neural volume rendering is slow! (NeRF, Pi-GAN)
- Works by defining a camera, shooting rays through the scene, and calculating an integral along each ray (by approximate quadrature)
- Need a fast and efficient integration technique!
- Not numerical integration; use the anti-derivative!
- AutoInt: train a network so that its derivative matches the integrand; a definite integral is then the difference of the anti-derivative network at the ray-segment endpoints (see the sketch below)
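
A minimal sketch of the AutoInt idea, assuming PyTorch: train a small network so that its derivative matches a 1D integrand; a definite integral then needs only two evaluations of the anti-derivative network. The integrand here is a stand-in, not an actual density field:

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

def integrand(t):  # stand-in for, e.g., volume density along a ray
    return torch.exp(-t) * torch.sin(5 * t)

for step in range(2000):
    t = torch.rand(256, 1, requires_grad=True)
    # dPhi/dt via autograd; create_graph so we can backprop through it
    dphi_dt = torch.autograd.grad(phi(t).sum(), t, create_graph=True)[0]
    loss = ((dphi_dt - integrand(t)) ** 2).mean()  # fit the derivative, not Phi
    opt.zero_grad(); loss.backward(); opt.step()

a, b = torch.zeros(1, 1), torch.ones(1, 1)
integral = phi(b) - phi(a)  # fundamental theorem of calculus
```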

- Neural lumigraph rendering: real-time rendering at inference time
- Summary

## Novel View Synthesis for Objects and Scenes

- Goal: Given two images, generate the scene between them (with arbitrary camera positions).
- If you have only one image, use prior-based reconstruction

- What method?
- Voxel-based methods: DeepVoxels, Neural Volumes, HoloGAN
- Neural implicit approaches: Scene Representation Networks, Differentiable Volumetric Rendering, NeRF, Implicit Differentiable Renderer
- Hybrid implicit/explicit: Neural Sparse Voxel Fields, PIFu, GRF, pixelNeRF, MVSNeRF, Unconstrained Scene Generation with Locally Conditioned Radiance Fields
- Multi-plane images
- Image-based: Stable view synthesis, IBRNet

## Neural Volumetric Rendering: NeRF, etc.

- Ben Mildenhall, Google Research (bmild.github.io)
- What is neural volumetric rendering?
- Rendering?: Querying the radiance value along rays through 3D space
- Volumetric?: Continuous, differentiable, rendering model without concrete ray/surface intersections
- Neural: Using a neural network as a scene representation, rather than a voxel grid of data
- Inputs: sparse, unstructured, photographs of a scene
- Outputs: representation allowing us to render new views of that scene

- Volumetric rendering math
- Traditional method?: using optical physics => adapted for visualizing medical data and linked to alpha compositing => modern path tracers use sophisticated Monte Carlo methods to render volumetric effects
- Volume rendering and ML?: various volume-rendering-esque methods have been devised for 3D shape reconstruction; scaled up to higher-resolution voxel grids, such ML methods can achieve excellent view-synthesis results
- Volumetric formulation for NeRF
- Scene is a cloud of tiny colored particles
- If a ray traveling through the scene hits a particle at **t**, we return its color **c(t)**
- This notion is probabilistic: the chance that the ray stops in a small interval around **t** is **sigma(t)dt**; **sigma** is known as the “volume density”
- To determine if **t** is the first hit, we need to know **T(t)**, the probability that the ray didn’t hit any particles earlier; **T(t)** is called the “transmittance”. We assume **sigma** is known and want to use it to calculate **T**
- P[no hits before **t**] = **T(t)**
- P[hit at **t**] = **sigma(t)dt**
- P[no hits before **t+dt**] = P[no hits before **t**] x P[no hit at **t**], i.e. **T(t+dt) = T(t) x (1 - sigma(t)dt)**; in the limit this gives **T(t) = exp(-∫ sigma(s) ds)**
- Weighting the integrand with the color function **c(t)** gives the rendered color **C = ∫ T(t) sigma(t) c(t) dt**
- Approximating the nested integral? Use quadrature: split the ray into **n** segments with endpoints **{t_1, t_2, …, t_(n+1)}** and lengths **delta_i = t_(i+1) - t_i**
- Assume that volume density and color are roughly constant within each interval
- Note that piecewise-constant density/color *does not* imply constant transmittance! It is important to account for how the early part of a segment blocks the later part when **sigma_i** is high
- Connection to alpha compositing (see the sketch below)
- Next question: how do we store the values of color and sigma at each point in space?
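
A minimal sketch of this quadrature rule, assuming PyTorch; this is the standard alpha-compositing form of the estimator, with illustrative names:

```python
import torch

def render_ray(sigma, color, t):
    """sigma: (n,) densities, color: (n, 3) colors at the samples,
    t: (n+1,) segment endpoints along the ray."""
    delta = t[1:] - t[:-1]                    # segment lengths delta_i
    alpha = 1.0 - torch.exp(-sigma * delta)   # opacity of each segment
    # Transmittance T_i: probability the ray reaches segment i unblocked.
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = T * alpha                       # alpha-compositing weights
    return (weights[:, None] * color).sum(dim=0)  # final RGB for this ray
```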

- Neural networks as representations for spatial data
- Toy problem: storing 2D image data
- Usually we store an image as a 2D grid of RGB color values
- What if we train a simple fully-connected network (MLP) to do this instead?
- Problem: standard coordinate-based MLPs cannot represent high-frequency functions
- Solution: pass input coordinates through a high-frequency mapping first
- Input coordinate mapping?: spatial position (x and y) to positional-encoding features using sin/cos functions
- Scaling the frequency matrix **B** traverses an underfitting-overfitting curve, so the optimal scale lies between the extremes (see the sketch after this list)
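
A minimal sketch of the sin/cos coordinate mapping (random Fourier features), assuming PyTorch; the frequency scale and feature count are illustrative:

```python
import torch

def fourier_features(x, B):
    """x: (N, d) low-dim coordinates, B: (m, d) frequency matrix.
    Returns (N, 2m) high-frequency features."""
    proj = 2.0 * torch.pi * x @ B.t()
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Hypothetical usage for the 2D image toy problem: B ~ N(0, scale^2);
# too small a scale underfits (blurry), too large overfits (noisy).
B = 10.0 * torch.randn(256, 2)
feats = fourier_features(torch.rand(1024, 2), B)
```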

- Neural Radiance Fields (NeRF)
- NeRF = Volume rendering + Coordinate-based network
- Neural network replaces large N-D array
- (x, y, z, theta, phi) => NN => (r, g, b, sigma)
- (theta, phi) to visualize view-dependent effects
- Train network to reproduce input views of scene using gradient descent
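
A toy sketch of such a coordinate-based radiance field, assuming PyTorch; it is much smaller than the actual NeRF MLP, with illustrative layer sizes. Density depends on position only, while color also sees the viewing direction:

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    def __init__(self, pos_dim, dir_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)  # density from position features only
        self.rgb_head = nn.Sequential(nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
                                      nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, pos_feat, dir_feat):
        h = self.trunk(pos_feat)                       # encoded (x, y, z)
        sigma = torch.relu(self.sigma_head(h))         # densities are non-negative
        rgb = self.rgb_head(torch.cat([h, dir_feat], dim=-1))  # view-dependent color
        return rgb, sigma
```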

- NeRF improvements and extensions
- NeRF problems
- Scene representation is not anti-aliased
- Rendering is very slow: KiloNeRF, FastNeRF …
- Network must be retrained for every scene: GRF, IBRNet, pixelNeRF
- Requires many input images
- Needs the scene to be static and the lighting to be fixed
