Neural Radiance Fields (NeRF)

CS5670 Project 4 - Introduction to Computer Vision
Cornell University, Spring 2025

Build a Neural Radiance Field (NeRF) from a set of images for novel view synthesis.

This project implements the groundbreaking NeRF architecture from ECCV 2020, learning to represent 3D scenes as continuous neural radiance fields. Using only 2D images as supervision, we train deep networks to synthesize photorealistic novel views from arbitrary camera positions.

[Figure: 360° novel view synthesis. NeRF learns to render photorealistic views from any camera angle.]

Project Details

Assigned: Thursday, March 20, 2025
Due: Friday, March 28, 2025
Group Size: 2 students
Platform: Google Colab with PyTorch
Key Concepts: Neural networks, ray tracing, volume rendering, positional encoding

Overview

Neural Radiance Fields (NeRF) represent a paradigm shift in 3D scene representation and novel view synthesis. Unlike traditional 3D reconstruction methods that build explicit geometric models, NeRF learns an implicit continuous volumetric representation using neural networks.

The key insight is to represent a scene as a function that maps 3D coordinates and viewing directions to color and volume density, then use differentiable volume rendering to train this function from 2D images alone.

NeRF Architecture and Pipeline

Complete NeRF Pipeline

  1. Camera Ray Generation: Compute ray origins and directions through image pixels
  2. 3D Point Sampling: Sample points along rays in 3D space
  3. Positional Encoding: Encode coordinates with high-frequency functions
  4. Neural Network: Predict color and density for each point
  5. Volume Rendering: Composite colors along rays to form images

Implementation Details

Part 1: Positional Encoding

Positional encoding enables neural networks to learn high-frequency functions by mapping coordinates to higher-dimensional spaces using trigonometric functions.

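Positional Encoding Mathematics:

In the original NeRF formulation, each input coordinate $p$ is mapped to $2L$ values, where $L$ is the number of frequency bands:

$$\gamma(p) = \left(\sin(2^0 \pi p),\ \cos(2^0 \pi p),\ \ldots,\ \sin(2^{L-1} \pi p),\ \cos(2^{L-1} \pi p)\right)$$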

Key Insight: Without positional encoding, MLPs exhibit a spectral bias toward low-frequency functions, leading to oversmoothed outputs that cannot capture fine details.
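The encoding described in this part can be sketched in a few lines. This is a minimal sketch rather than the course's reference solution: the function name, the `include_input` flag, and the factor of π follow the convention of the original NeRF paper and are assumptions about the exact variant used here. It is written in NumPy instead of PyTorch so that it runs without a deep learning install; `num_freqs` plays the role of the number of frequency bands.

```python
import numpy as np

def positional_encoding(x, num_freqs, include_input=True):
    """Map coordinates to sin/cos features at frequencies 2^0*pi ... 2^(L-1)*pi.

    x: array of shape (..., D). Returns shape (..., D * (2 * num_freqs + 1))
    when include_input is True, else (..., D * 2 * num_freqs).
    """
    x = np.asarray(x, dtype=np.float64)
    # Optionally keep the raw coordinates alongside the encoded ones (assumption).
    parts = [x] if include_input else []
    for k in range(num_freqs):
        freq = (2.0 ** k) * np.pi
        parts.append(np.sin(freq * x))
        parts.append(np.cos(freq * x))
    return np.concatenate(parts, axis=-1)

# Encode a batch of five 3D points with six frequency bands.
points = np.zeros((5, 3))
encoded = positional_encoding(points, num_freqs=6)
print(encoded.shape)
```

With six bands, each 3D point grows from 3 values to 3 + 3·2·6 = 39 values, which is what gives the MLP access to high-frequency variation in the input.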

No Positional Encoding: oversmoothed, lacks detail
PE Frequency = 3: improved detail capture
PE Frequency = 6: high-frequency details preserved

Part 2: Ray Tracing and 3D Sampling

For each pixel in the target image, we generate a camera ray and sample points along it to query the neural radiance field.

Camera Ray Generation:

  1. Pixel to Camera: Convert pixel coordinates to camera coordinate system
  2. Ray Origin: Camera center in world coordinates
  3. Ray Direction: Unit vector from camera center through pixel
  4. World Transform: Apply camera-to-world transformation matrix

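Ray Equation:

$$\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$$

where $\mathbf{o}$ is the ray origin, $\mathbf{d}$ is the unit ray direction, and $t$ is the distance along the ray.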
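The four ray-generation steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's starter code: it assumes a pinhole camera with the principal point at the image center, a single focal length in pixels, and a camera that looks down its negative z axis (the convention used by the original NeRF code). The function name `get_rays` and its parameters are hypothetical.

```python
import numpy as np

def get_rays(height, width, focal, c2w):
    """Return ray origins and unit directions, each of shape (height, width, 3).

    c2w is a 4x4 camera-to-world matrix. Assumes a pinhole camera looking
    down -z with the principal point at the image center.
    """
    # Pixel grid: i indexes columns (x), j indexes rows (y).
    i, j = np.meshgrid(
        np.arange(width, dtype=np.float64),
        np.arange(height, dtype=np.float64),
        indexing="xy",
    )
    # Step 1: pixel coordinates -> directions in the camera frame.
    dirs_cam = np.stack(
        [(i - 0.5 * width) / focal, -(j - 0.5 * height) / focal, -np.ones_like(i)],
        axis=-1,
    )
    # Step 4: rotate directions into the world frame.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    # Step 3: normalize so each direction is a unit vector.
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Step 2: every ray starts at the camera center in world coordinates.
    origins = np.broadcast_to(c2w[:3, 3], dirs_world.shape).copy()
    return origins, dirs_world

origins, directions = get_rays(height=4, width=4, focal=2.0, c2w=np.eye(4))
print(origins.shape, directions.shape)
```

With an identity camera-to-world matrix, all origins sit at the world origin and the ray through the image center points straight down the negative z axis.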

3D Point Sampling:

Sample points along each ray using stratified sampling to ensure good coverage of the 3D space while maintaining differentiability.

[Figure: Ray sampling visualization]
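Stratified sampling can be sketched as splitting the depth range into equal bins and drawing one uniform sample per bin. The sketch below is a hedged illustration: the names `stratified_samples` and `points_along_ray`, the `near`/`far` bounds, and the `perturb` flag are assumptions introduced here, not identifiers from the project.

```python
import numpy as np

def stratified_samples(near, far, n_samples, rng=None, perturb=True):
    """Draw one depth per bin from n_samples equal bins covering [near, far]."""
    edges = np.linspace(near, far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    if perturb:
        rng = np.random.default_rng() if rng is None else rng
        u = rng.random(n_samples)        # uniform offset within each bin
    else:
        u = np.full(n_samples, 0.5)      # deterministic bin midpoints
    return lower + (upper - lower) * u

def points_along_ray(origin, direction, t_vals):
    """Evaluate r(t) = o + t * d at every sampled depth; returns (n_samples, 3)."""
    return origin[None, :] + t_vals[:, None] * direction[None, :]

t_vals = stratified_samples(2.0, 6.0, n_samples=8, rng=np.random.default_rng(0))
pts = points_along_ray(np.zeros(3), np.array([0.0, 0.0, -1.0]), t_vals)
print(pts.shape)
```

Because every bin contributes exactly one sample, the whole depth range is covered on each iteration, while the random offsets mean the network sees different depths over the course of training.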

Part 3: Neural Network Architecture

NeRF Network Design:

The NeRF MLP takes a 5D input (3D position + 2D viewing direction) and produces a 4D output (RGB color + volume density).

[Figure: NeRF neural network architecture]

Network Structure:

  1. Positional Encoding: 3D coordinates → high-dimensional encoding
  2. Density Branch: Position → volume density σ
  3. Color Branch: Position + viewing direction → RGB color
  4. Skip Connections: Improve gradient flow for deep networks
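The structure listed above can be illustrated with a forward pass only. This is a deliberately small sketch with random weights, written in NumPy rather than PyTorch so it runs without a deep learning install; the class name `TinyNeRF`, the layer count, the width of 64, and the activation choices are all assumptions for illustration and do not reflect the project's actual network size.

```python
import numpy as np

def init_linear(rng, n_in, n_out):
    """Random weights and zero bias for one dense layer."""
    w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return w, np.zeros(n_out)

def dense(layer, x):
    w, b = layer
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyNeRF:
    """Forward pass of a small NeRF-style MLP (illustrative sizes only)."""

    def __init__(self, pos_dim, dir_dim, width=64, seed=0):
        rng = np.random.default_rng(seed)
        self.l1 = init_linear(rng, pos_dim, width)
        self.l2 = init_linear(rng, width, width)
        # Skip connection: the encoded position is concatenated back in.
        self.l3 = init_linear(rng, width + pos_dim, width)
        self.sigma_head = init_linear(rng, width, 1)
        self.feature = init_linear(rng, width, width)
        self.rgb_hidden = init_linear(rng, width + dir_dim, width // 2)
        self.rgb_head = init_linear(rng, width // 2, 3)

    def forward(self, pos_enc, dir_enc):
        h = relu(dense(self.l1, pos_enc))
        h = relu(dense(self.l2, h))
        h = relu(dense(self.l3, np.concatenate([h, pos_enc], axis=-1)))
        # Density branch: depends on position only, clamped to be non-negative.
        sigma = relu(dense(self.sigma_head, h))[..., 0]
        # Color branch: position features plus the encoded viewing direction.
        feat = dense(self.feature, h)
        c = relu(dense(self.rgb_hidden, np.concatenate([feat, dir_enc], axis=-1)))
        rgb = sigmoid(dense(self.rgb_head, c))
        return rgb, sigma

model = TinyNeRF(pos_dim=39, dir_dim=15)
rgb, sigma = model.forward(np.zeros((10, 39)), np.zeros((10, 15)))
print(rgb.shape, sigma.shape)
```

Keeping the viewing direction out of the density branch means that geometry is shared across viewpoints, while color is free to vary with the viewing angle.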

Part 4: Volume Rendering

Volume rendering composites the colors and densities along each ray to produce the final pixel color using the classic volume rendering equation.

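Volume Rendering Equation:

In the discrete form used by NeRF, with $N$ samples along ray $\mathbf{r}$, predicted colors $\mathbf{c}_i$, densities $\sigma_i$, and sample spacings $\delta_i = t_{i+1} - t_i$:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} w_i \, \mathbf{c}_i$$

Compositing Weights:

$$w_i = T_i \left(1 - e^{-\sigma_i \delta_i}\right), \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$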

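Physical Interpretation: The volume density σ describes how strongly a point blocks light (a higher σ means a more opaque point), while the compositing weights determine how much each sample contributes to the final pixel color. A sample contributes significantly only if it is dense and the ray has not already been blocked by earlier samples.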
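The compositing step can be sketched as follows, using the standard discrete NeRF quadrature: per-sample opacity α_i = 1 − exp(−σ_i δ_i), transmittance T_i as the product of (1 − α_j) over earlier samples, and weights w_i = T_i α_i. The function name `composite_along_ray` and the choice to treat the last interval as effectively infinite are assumptions of this sketch; NumPy is used in place of PyTorch.

```python
import numpy as np

def composite_along_ray(rgb, sigma, t_vals):
    """Alpha-composite samples along rays.

    rgb: (..., N, 3) colors, sigma: (..., N) densities,
    t_vals: (..., N) sorted sample depths.
    Returns (color, depth, weights).
    """
    # Spacing between adjacent samples; the last interval is treated as very long.
    far_pad = t_vals[..., -1:] + 1e10
    deltas = np.diff(t_vals, axis=-1, append=far_pad)
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability the ray reaches sample i without being blocked.
    survive = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate(
        [np.ones_like(survive[..., :1]), survive[..., :-1]], axis=-1
    )
    weights = trans * alpha
    color = np.sum(weights[..., None] * rgb, axis=-2)
    depth = np.sum(weights * t_vals, axis=-1)
    return color, depth, weights

# One ray: empty space, then an opaque red surface, then a hidden green one.
rgb = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sigma = np.array([0.0, 1e3, 1e3])
t_vals = np.array([1.0, 2.0, 3.0])
color, depth, weights = composite_along_ray(rgb, sigma, t_vals)
print(color, depth)
```

In this example the first sample is empty, the second is opaque, and the third is hidden behind it, so the pixel takes the second sample's color and the expected depth equals the second sample's depth. The same weights yield a depth map when applied to the sample depths instead of the colors.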

PyTorch Implementation

Deep Learning Framework

This project provided hands-on experience with PyTorch for computer vision and 3D deep learning.

My Results

Training Results and Metrics

NeRF was trained successfully on the provided scene and converged to high-quality novel view synthesis.

Training Performance:

Iterations: 1000-3000 (10-30 minutes on GPU)
Final PSNR: >20 dB (target quality threshold)
Loss Function: L2 photometric reconstruction
Optimizer: Adam with learning rate decay
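For reference, the L2 photometric loss and the PSNR metric listed above are directly related: for images scaled to [0, 1], PSNR = −10 · log10(MSE). The sketch below is a minimal NumPy illustration; the function names and the `max_val` parameter are assumptions, not the project's code.

```python
import numpy as np

def mse_loss(pred, target):
    """L2 photometric loss: mean squared error over all pixels and channels."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    err = mse_loss(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))

target = np.full((4, 4, 3), 0.5)
pred = target + 0.1          # a uniform error of 0.1 per channel
print(mse_loss(pred, target), psnr(pred, target))
```

A uniform per-channel error of 0.1 gives an MSE of 0.01, which corresponds to 20 dB, so the >20 dB target is equivalent to an MSE below 0.01 on images in [0, 1].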

Training Progression

100 Iterations: early training, blurry reconstruction
1000 Iterations: converged model, sharp details

Novel View Synthesis Results

Novel View 1: photorealistic rendering from an unseen angle
Novel View 2: consistent geometry and lighting

Depth Maps and 3D Understanding

Learned Depth Map: NeRF implicitly learns scene geometry
Volume Density Field: 3D structure representation

360° Video Synthesis

Complete 360° Novel View Synthesis

[Figure: Frame sequence from a 360° rotation around the scene]

Video Generation Process: Generate camera poses along a circular path around the object, render an image from each viewpoint using the trained NeRF, and assemble the frames into a smooth 360° video.

Video Specifications:

Resolution: 800×800 pixels
Frame Count: 40 frames (360° rotation)
Render Time: ~2-3 seconds per frame
Camera Path: Circular orbit around scene center
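The circular camera path can be sketched as a set of look-at poses evenly spaced around the scene center. The frame count of 40 comes from the specifications above; everything else is an assumption of this sketch, including the orbit `radius` and `height`, a world z axis pointing up, a camera that looks down its negative z axis, and the function names `look_at_pose` and `orbit_poses`.

```python
import numpy as np

def look_at_pose(cam_pos, target=None, world_up=None):
    """Build a 4x4 camera-to-world matrix for a camera at cam_pos facing target.

    Assumes the camera looks down its -z axis, with +x right and +y up.
    """
    cam_pos = np.asarray(cam_pos, dtype=np.float64)
    target = np.zeros(3) if target is None else np.asarray(target, dtype=np.float64)
    world_up = (
        np.array([0.0, 0.0, 1.0]) if world_up is None
        else np.asarray(world_up, dtype=np.float64)
    )
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, world_up)
    right = right / np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0] = right
    c2w[:3, 1] = up
    c2w[:3, 2] = -forward      # camera z axis points away from the target
    c2w[:3, 3] = cam_pos
    return c2w

def orbit_poses(n_frames=40, radius=4.0, height=1.0):
    """Poses evenly spaced on a circle around the world z axis, facing the origin."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
    return np.stack([
        look_at_pose([radius * np.cos(a), radius * np.sin(a), height])
        for a in angles
    ])

poses = orbit_poses()
print(poses.shape)
```

Each pose can then be passed to the ray generator and renderer to produce one frame. Excluding the endpoint of the angle range avoids a duplicated first and last frame when the video loops.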

Extra Credit: Custom Dataset

Training NeRF on Personal Photography

Custom Input Images: LLFF-style forward-facing capture
Novel View Results: synthesized views of a personal scene

Data Capture Process: Capture photographs following the LLFF methodology for forward-facing scenes, estimate camera poses with COLMAP, and train NeRF on the custom photographs to demonstrate real-world applicability.

Key Learnings

Computer Vision and Deep Learning Concepts

Technical Skills Developed

Impact and Applications

Revolutionary Applications

Virtual Reality: Photorealistic VR environments from simple photo captures

Film and Media: Novel view synthesis for cinematography and special effects

3D Content Creation: Democratizing 3D modeling through neural representations

Robotics: 3D scene understanding for navigation and manipulation

Cultural Preservation: Digital documentation of historical sites and artifacts

Medical Imaging: 3D reconstruction from sparse medical scans