ARM+FPGA-based Industrial Medical Endoscope Solution

#FPGADev

Introduction

This article is the first installment of a multi-part series on an FPGA-based monocular endoscope positioning system for cardiac surgery simulation. The system uses a camera, SDRAM frame buffering, and real-time image processing on an FPGA to track and localize the tip of a surgical catheter inside a cardiac simulator, giving trainee surgeons coordinate feedback without requiring direct visual inspection of the heart interior. This post covers the project background, FPGA fundamentals, system design rationale, algorithm selection, and hardware module overview.


1. Background and Motivation

1.1 Medical Device Context

Cardiovascular disease treatment is among the most commercially and clinically significant segments of the medical device industry. Cardiac assist devices — whether implantable or external, electronic or mechanical — have steadily decreased in size, making them easier to implant. At the same time, cardiac electrophysiology has emerged as a rapidly evolving subspecialty: it studies the electrical phenomena generated by excitable tissue (action potentials in nerves, cardiac muscle, etc.) and uses that information to guide interventional procedures.

A critical bottleneck in expanding surgical capacity is the shortage of experienced surgeons. Current estimates suggest that only about 6% of cardiac patients who could benefit from surgery actually receive it; many others die waiting, or survive with unmanaged risk. The root cause is that training a competent cardiac surgeon through traditional apprenticeship — operating on real patients under supervision — is slow, expensive, and carries ethical and medico-legal risk, particularly in the current climate of heightened doctor-patient tension. A cardiac surgery simulator that lets trainees practice catheter navigation in a realistic phantom heart, while receiving real-time position feedback, directly addresses this bottleneck.

Image processing is the enabling technology here. The system described in this series captures live endoscope video from inside a cardiac simulator, detects the catheter tip in each frame, outputs its 2-D coordinates for display, and forwards contact-point signals to an external instrument for 3-D cardiac modeling.

1.2 FPGA in Medical Imaging

FPGAs (Field-Programmable Gate Arrays) have long been the preferred platform for real-time video processing because their massively parallel datapath architecture maps naturally to pipeline-oriented image algorithms. In a typical FPGA image pipeline, each processing stage operates on a pixel as it arrives from the sensor, so the end-to-end latency is a handful of clock cycles rather than a full frame period. This makes sustained frame rates of 25 fps straightforward to achieve, with 60 fps attainable on mid-range devices. Early FPGA adoption was concentrated in communications (baseband coding, modulation/demodulation), but the technology has migrated into medical imaging, ECG processing, endoscopy, and other clinical signal-processing domains. The design described here targets catheter-tip localization, but the same FPGA platform can be extended to multi-camera configurations for 3-D coordinate recovery.


2. FPGA Architecture and Design Flow

2.1 How FPGAs Implement Logic

An FPGA achieves programmability through three cooperating mechanisms:

Programmable logic blocks. The fundamental element is a Look-Up Table (LUT), typically 4-input or 6-input with one output, implemented in SRAM or Flash. A 4-input LUT can store the truth table of any 4-input/1-output combinational function; changing the stored bits reconfigures the function. Because any sequential circuit decomposes into combinational logic plus registers, pairing each LUT with a D flip-flop yields a universal building block: the LUT+FF cell.
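As a concrete illustration (a hypothetical sketch, not taken from this project's RTL), the Verilog below shows how a synthesizer maps a small design onto exactly one 4-input LUT plus its companion flip-flop:

    // Any 4-input combinational function fits in one 4-input LUT;
    // the synthesis tool fills the LUT's 16-bit truth table from
    // this expression. The register maps to the paired flip-flop.
    module lut_ff_example (
        input  wire clk,
        input  wire a, b, c, d,
        output reg  q
    );
        wire f = (a & b) | (c ^ d);   // arbitrary 4-input function -> one LUT

        always @(posedge clk)
            q <= f;                   // -> the LUT's companion D flip-flop
    endmodule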

Programmable interconnect. Individual LUT+FF cells are too small to be useful alone. A mesh of programmable routing resources (wire segments, switch matrices, and connection boxes controlled by configuration bits) joins thousands of cells into large, arbitrary logic networks. Setting or clearing configuration bits connects or disconnects individual wire segments, giving the designer full control over how the netlist is routed across the fabric.

Programmable I/O. Every FPGA pin (except dedicated power, ground, and programming pins) can be configured as input, output, or bidirectional. Voltage standard, drive strength, slew rate, and on-chip termination are all software-settable, allowing the device to interface to a wide range of external components without external level-shifting hardware.

2.2 Design Flow

A typical FPGA design moves through the following stages:

  1. RTL entry — HDL source (VHDL or Verilog) is written or generated from IP cores, or captured schematically for small blocks.
  2. Functional simulation — logic behavior is verified in a simulator before synthesis (a minimal testbench sketch follows this list). ModelSim is the most widely used tool for this stage, valued for its simulation speed and accuracy. Cadence Verilog-XL and Synopsys VCS are also common in larger teams; Aldec Active-HDL adds a state-machine debugger that is useful for control-path verification.
  3. Synthesis — the HDL is compiled to a gate-level netlist optimized for area and timing. Synopsys Synplify/Synplify Pro is widely used for its timing-driven synthesis engine and fast runtimes. Xilinx XST (now Vivado Synthesis) and the integrated synthesizer in Altera/Intel Quartus II are the primary vendor alternatives.
  4. Implementation (map + place + route) — the synthesized netlist is mapped to device primitives, placed within the FPGA fabric, and routed. This stage is performed by vendor tools: Xilinx ISE/Vivado or Intel Quartus II.
  5. Static timing analysis — the place-and-route tools report critical paths and the maximum achievable clock frequency. As a routability safeguard, this design also keeps logic utilization below 80% of the available logic cells.
  6. Post-route simulation — gate-level simulation with extracted back-annotation verifies that timing-dependent behavior matches intent.
  7. Bitstream generation and download — a configuration bitstream is generated and programmed into the FPGA (or companion Flash/EEPROM for non-volatile retention).
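To make stage 2 concrete, here is a minimal self-contained simulation example in the style used with ModelSim or any other HDL simulator; the DUT is a stand-in invented for illustration:

    `timescale 1ns / 1ps

    // Trivial example DUT: registers its input with one cycle of delay.
    module dut_example (
        input            clk,
        input            rst_n,
        input      [7:0] din,
        output reg [7:0] dout
    );
        always @(posedge clk or negedge rst_n)
            if (!rst_n) dout <= 8'd0;
            else        dout <= din;
    endmodule

    // Minimal testbench pattern for functional simulation (stage 2).
    module tb_example;
        reg        clk = 0;
        reg        rst_n = 0;
        reg  [7:0] din = 0;
        wire [7:0] dout;

        dut_example dut (.clk(clk), .rst_n(rst_n), .din(din), .dout(dout));

        always #5 clk = ~clk;                    // 100 MHz clock

        initial begin
            #20 rst_n = 1;                       // release reset
            repeat (16) @(posedge clk) din <= din + 8'd1;   // stimulus
            #50 $finish;
        end

        initial $monitor("t=%0t din=%0d dout=%0d", $time, din, dout);
    endmodule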

3. System Design

3.1 Design Task

The goal is to detect the 2-D position of a catheter tip as it moves within the field of view of an endoscope camera. The catheter may enter the frame from any edge; only the tip coordinate is of interest, and the rest of the catheter body is ignored. The FPGA implementation must keep logic utilization below 80% of the available logic cells.

3.2 Algorithm Selection

Three candidate detection strategies were evaluated:

Option A — Color-based segmentation. The catheter tip is painted a distinctive color; the FPGA isolates the tip by thresholding a specific color component or HSV range. This approach is simple but requires pre-processing the catheter (painting it) and is sensitive to background clutter with similar colors. Rejected.

Option B — Frame difference method. Consecutive frames are subtracted; the absolute-value difference image is thresholded to yield a binary motion mask, highlighting the moving catheter while suppressing the static background. The method is insensitive to lighting and background complexity because it responds only to temporal change. Its weaknesses are sensitivity to high-frequency noise (requiring a denoising pre-stage) and parameter sensitivity: if the inter-frame interval is too long relative to catheter speed, the tip appears as two separated blobs; if too short relative to a slow-moving catheter, no motion is detected at all. The resulting binary image also traces edges rather than a solid region, and the method inherently requires buffering a full previous frame. Selected for this design.

Option C — Background subtraction. A clean background frame is stored and subtracted from each live frame; the residual is thresholded to isolate the catheter. The method produces a solid binary region (better fill than frame difference), but the stored background becomes stale as lighting or scene conditions change. Compensating for this requires a background-update algorithm (mean, median, Kalman filter, or Gaussian mixture), which adds considerable FPGA complexity. Implementing a robust background update in hardware is the primary difficulty; this option would have been viable but was judged harder to implement reliably within the logic budget.

The decision matrix is summarized in Table 3.1 (see original figures). Option B (frame difference) was selected based on its combination of acceptable fill quality, low background-complexity requirement, and manageable FPGA implementation effort.

3.3 Key Technical Challenges

Seven problems were identified as technically critical:

  1. Image frame buffering in SDRAM
  2. Image pre-processing (noise removal)
  3. Frame-difference computation
  4. Ping-pong SDRAM buffer management
  5. Pixel format conversion chain
  6. Binary image projection
  7. Correct catheter-tip coordinate extraction from projection data

3.4 Solutions

Frame buffering. The design requires three simultaneous SDRAM accesses: the camera writes incoming pixels to one of two alternating regions, while two read ports simultaneously deliver both buffered frames to the subtraction stage. A dual-buffered SDRAM interface is used, with two FIFOs on the write side and two on the read side, satisfying the three-port concurrency requirement.
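The sketch below shows one plausible port-level shape for such an interface; the module name, port names, and handshake are assumptions for illustration, not the project's actual RTL:

    // Illustrative port structure for the dual-buffered SDRAM interface:
    // one camera write stream in, two frame read streams out. FIFOs
    // decouple the pixel-rate clock domains from the burst engine.
    module sdram_frame_buffer (
        input         sdram_clk,
        // write port: camera pixel stream (crosses clocks via a FIFO)
        input         wr_clk,
        input         wr_en,
        input  [15:0] wr_data,
        // read ports: frame N and frame N-1, each behind its own FIFO
        input         rd_clk,
        input         rd_en,
        output [15:0] rd_data_curr,   // current frame
        output [15:0] rd_data_prev,   // previous frame
        // completion strobe used for ping-pong switching (see below)
        output        frame_wr_done
        // ... SDRAM pins and FIFO internals omitted in this sketch
    );
    endmodule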

Image pre-processing. Morphological operations are applied before differencing: erosion removes isolated noise pixels, and dilation expands the remaining foreground regions to fill gaps and strengthen connectivity.
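One possible hardware mapping of these operations is sketched below: a 3×3 grayscale erosion (minimum filter) built from two line buffers; dilation is the identical structure with maximum in place of minimum. Module structure and parameters are illustrative assumptions.

    // 3x3 grayscale erosion: output = minimum of the 3x3 neighborhood.
    // Two line buffers hold the previous two rows so three rows are
    // visible at once; 3-tap shift registers per row form the window.
    // (Frame borders and exact pipeline alignment omitted for clarity.)
    module erode3x3 #(
        parameter IMG_W = 640,
        parameter PW    = 8
    )(
        input               clk,
        input               px_valid,
        input  [PW-1:0]     px_in,
        output reg [PW-1:0] px_out
    );
        reg [PW-1:0] line1 [0:IMG_W-1];   // row above
        reg [PW-1:0] line2 [0:IMG_W-1];   // row two above
        reg [$clog2(IMG_W)-1:0] col = 0;

        // window taps: tap 0 = newest column of each row
        reg [PW-1:0] w0 [0:2], w1 [0:2], w2 [0:2];
        integer i;

        function [PW-1:0] min2(input [PW-1:0] a, input [PW-1:0] b);
            min2 = (a < b) ? a : b;
        endfunction

        always @(posedge clk) if (px_valid) begin
            // shift the three row taps along by one column
            for (i = 2; i > 0; i = i - 1) begin
                w0[i] <= w0[i-1]; w1[i] <= w1[i-1]; w2[i] <= w2[i-1];
            end
            w0[0] <= px_in;          // current row
            w1[0] <= line1[col];     // row above
            w2[0] <= line2[col];     // row two above

            // rotate the line buffers at this column
            line2[col] <= line1[col];
            line1[col] <= px_in;
            col <= (col == IMG_W-1) ? 0 : col + 1;

            // erosion = minimum of the 9 window pixels (1-cycle lag)
            px_out <= min2(min2(min2(w0[0], w0[1]), min2(w0[2], w1[0])),
                           min2(min2(w1[1], w1[2]),
                                min2(min2(w2[0], w2[1]), w2[2])));
        end
    endmodule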

Frame differencing. The two buffered frames are subtracted pixel-by-pixel; the absolute value of the difference is taken (|Frame_N − Frame_{N−1}|), eliminating dependence on subtraction order.
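A minimal sketch of this stage (the threshold constant is an assumed, empirically tuned parameter):

    // Per-pixel absolute frame difference followed by binarization.
    module frame_diff #(
        parameter PW  = 8,
        parameter THR = 8'd30        // assumed threshold, tuned empirically
    )(
        input           clk,
        input           valid,
        input  [PW-1:0] px_curr,     // pixel from frame N
        input  [PW-1:0] px_prev,     // co-located pixel from frame N-1
        output reg      bin_out      // 1 = motion pixel
    );
        // order-independent: |curr - prev|
        wire [PW-1:0] diff = (px_curr >= px_prev) ? (px_curr - px_prev)
                                                  : (px_prev - px_curr);
        always @(posedge clk)
            if (valid) bin_out <= (diff > THR);
    endmodule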

Ping-pong buffer switching. The camera write alternates between buffer 0 and buffer 1 on successive frames; the read side mirrors this assignment one frame behind. A common mistake is using the capture or output module's "done" signal to trigger the buffer switch — this can cause data accumulated in the FIFOs to be flushed by a reset before it is read, resulting in partial frame loss. The correct approach is to use the SDRAM controller's internal completion signal to drive the switch, as implemented here.
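A sketch of the corresponding switch logic, keyed off the SDRAM controller's completion pulse as described above (signal names are assumptions):

    // Ping-pong bank selection driven by the SDRAM controller's own
    // frame-write-completion strobe, not the capture module's "done".
    module pingpong_ctrl (
        input      clk,
        input      rst_n,
        input      sdram_frame_done,   // one-cycle pulse from the controller
        output reg wr_bank,            // bank the camera writes this frame
        output     rd_bank             // bank the differencing stage reads
    );
        always @(posedge clk or negedge rst_n)
            if (!rst_n)                wr_bank <= 1'b0;
            else if (sdram_frame_done) wr_bank <= ~wr_bank;  // swap on completion

        assign rd_bank = ~wr_bank;     // read side mirrors, one frame behind
    endmodule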

Format conversion. The OV7670 outputs YUV 4:2:2 (YCbCr422). The pipeline converts this to YUV444, then to RGB888 (using a lookup table for the YCbCr→RGB matrix), and finally to RGB565 for VGA output.
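The YCbCr→RGB and RGB565 packing steps might look as follows in fixed point; the coefficients are the standard BT.601 values scaled by 256, the 4:2:2→4:4:4 upsampling stage is omitted, and a table-based implementation (as used in the actual design) would replace the multipliers:

    // YCbCr (full-range) to RGB888 with 8.8 fixed-point coefficients,
    // then truncation to RGB565. Clamping keeps results in [0, 255].
    module ycbcr2rgb565 (
        input             clk,
        input      [7:0]  y, cb, cr,
        output reg [15:0] rgb565
    );
        // chroma centered at 128, kept signed
        wire signed [8:0]  cb_c = $signed({1'b0, cb}) - 9'sd128;
        wire signed [8:0]  cr_c = $signed({1'b0, cr}) - 9'sd128;
        wire signed [18:0] y_sc = {3'b000, y, 8'b0};   // Y scaled by 256

        // R = Y + 1.402 Cr', G = Y - 0.344 Cb' - 0.714 Cr', B = Y + 1.772 Cb'
        // coefficients rounded to /256 fixed point: 359, 88, 183, 454
        wire signed [18:0] r_t = y_sc + 359 * cr_c;
        wire signed [18:0] g_t = y_sc -  88 * cb_c - 183 * cr_c;
        wire signed [18:0] b_t = y_sc + 454 * cb_c;

        function [7:0] clamp(input signed [18:0] v);
            clamp = (v < 0) ? 8'd0 : (v > 19'sd65280) ? 8'd255 : v[15:8];
        endfunction

        wire [7:0] r8 = clamp(r_t), g8 = clamp(g_t), b8 = clamp(b_t);

        always @(posedge clk)
            rgb565 <= {r8[7:3], g8[7:2], b8[7:3]};   // RGB565 packing
    endmodule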

Binary image projection. After thresholding the difference image to binary, horizontal and vertical projections are computed: the horizontal projection sums pixel values along each row (Y axis), and the vertical projection sums along each column (X axis). The resulting 1-D profiles are stored and used to find the bounding box of the motion region.
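One behavioral way to realize the projections (array clear and read-back handling are simplified; a real design would use block RAM):

    // Row/column projection of the binary motion image. Profiles are
    // accumulated as pixels stream through, then scanned by the
    // bounding-box stage via the read-back port.
    module projection #(
        parameter IMG_W = 640,
        parameter IMG_H = 480,
        parameter CW    = 10                 // width of one row/column sum
    )(
        input               clk,
        input               frame_start,     // pulse at first pixel of a frame
        input               px_valid,
        input               bin_px,          // binarized motion pixel
        input      [9:0]    rd_addr,         // read-back address
        output reg [CW-1:0] rd_row_sum,      // horizontal projection (per row)
        output reg [CW-1:0] rd_col_sum       // vertical projection (per column)
    );
        reg [CW-1:0] row_sum [0:IMG_H-1];
        reg [CW-1:0] col_sum [0:IMG_W-1];
        reg [9:0] x = 0, y = 0;
        integer i;

        always @(posedge clk) begin
            if (frame_start) begin
                x <= 0; y <= 0;
                // behavioral clear; real hardware would bank-swap two BRAMs
                for (i = 0; i < IMG_H; i = i + 1) row_sum[i] <= 0;
                for (i = 0; i < IMG_W; i = i + 1) col_sum[i] <= 0;
            end else if (px_valid) begin
                row_sum[y] <= row_sum[y] + bin_px;
                col_sum[x] <= col_sum[x] + bin_px;
                if (x == IMG_W-1) begin
                    x <= 10'd0;
                    y <= (y == IMG_H-1) ? 10'd0 : y + 10'd1;
                end else
                    x <= x + 10'd1;
            end
            rd_row_sum <= row_sum[rd_addr];   // caller keeps rd_addr in range
            rd_col_sum <= col_sum[rd_addr];
        end
    endmodule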

Tip coordinate extraction. The four boundary lines derived from the projection profiles define the bounding rectangle of the detected motion region. The algorithm takes the midpoint of the rectangle's horizontal span (endpoint 7 in Figure 3.1) and compares its distance to the right boundary against its distance to the left boundary to determine which way the catheter points. Points lying on the boundary lines are classified as non-target; the remaining point not on any boundary is identified as the catheter tip, and its coordinates are output.


4. Hardware Module Overview

4.1 Power Module

The system requires multiple supply rails: 5 V (VCC), 3.3 V, 2.8 V, 2.5 V, and 1.2 V. The sub-5 V rails are derived from the 5 V input using AMS1117-family low-dropout linear regulators. AMS1117 devices are available in fixed-output variants (1.2 V, 1.5 V, 1.8 V, 2.5 V, 2.8 V, 3.0 V, 3.3 V) and an adjustable variant, and each includes built-in thermal shutdown and current-limit protection. A 22 µF output capacitor on each regulator output ensures stability and suppresses high-frequency transients.

4.2 Acquisition Module (OV7670 Camera)

The OV7670 CMOS image sensor is selected for its wide availability, well-documented register interface, and straightforward FPGA integration. Key interface signals are:

  • SCCB / I²C bus — used to configure internal registers (resolution, format, exposure, white balance, etc.)
  • XCLK — master clock input to the sensor
  • PCLK — pixel clock output; it runs at twice the pixel rate because each 16-bit YUV 4:2:2 pixel is transferred as two 8-bit bytes
  • VSYNC / HREF — frame-valid and line-valid synchronization signals
  • DOUT[7:0] — 8-bit parallel pixel data output

Additional signals include STROBE (flash sync), STANDBY/PWDN (power management), and reset. The sensor output is configured for YUV 4:2:2 format.
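A sketch of the byte-pairing capture logic implied by the PCLK description above; active-high VSYNC/HREF polarity and the signal names are assumptions set by register configuration:

    // OV7670 capture: two 8-bit bytes per PCLK pair form one 16-bit
    // YUV 4:2:2 pixel. href gates valid pixel data within a line.
    module ov7670_capture (
        input             pclk,         // pixel clock from sensor
        input             vsync,        // frame sync (assumed active high)
        input             href,         // line valid
        input      [7:0]  dout,         // sensor data bus
        output reg [15:0] pixel,        // assembled {byte0, byte1}
        output reg        pixel_valid
    );
        reg       byte_phase = 1'b0;
        reg [7:0] first_byte;

        always @(posedge pclk) begin
            pixel_valid <= 1'b0;
            if (vsync) begin
                byte_phase <= 1'b0;            // restart at frame boundary
            end else if (href) begin
                if (!byte_phase)
                    first_byte <= dout;        // latch first byte of the pair
                else begin
                    pixel       <= {first_byte, dout};
                    pixel_valid <= 1'b1;       // one 16-bit pixel complete
                end
                byte_phase <= ~byte_phase;
            end else
                byte_phase <= 1'b0;
        end
    endmodule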

4.3 Buffer Module (SDRAM)

The selected SDRAM is the Hynix H57V2562GTR: 256 Mb (16 M × 16 bit), stable operation to 200 MHz, 4 banks, 13-bit row address, 16-bit data width. It stores the YCbCr 4:2:2 data stream from the camera. In the video capture context, the SDRAM is configured for page-write burst mode with burst-interrupt support, enabling arbitrary burst lengths and maximizing sustained bandwidth — essential for maintaining real-time frame rates in the processing pipeline.

4.4 Display Module (VGA)

VGA is chosen as the display output for its universality: virtually all monitors and projectors support VGA, and the interface is straightforward to implement in FPGA logic (R/G/B analog signals driven by simple resistor-ladder DACs, plus HSYNC and VSYNC). VGA supports high resolutions, fast refresh rates, and full color depth. Limitations — no hot-plug detection, no embedded audio — are not relevant for this application.
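For reference, a standard 640×480 @ 60 Hz sync generator; the timing constants are the VESA-standard values for this mode, not project-specific:

    // VGA 640x480@60: 800 clocks/line, 525 lines/frame, sync active low.
    module vga_sync (
        input            clk25,      // ~25.175 MHz pixel clock
        output reg       hsync,
        output reg       vsync,
        output           active,     // high in the visible region
        output reg [9:0] x,          // column counter, 0..799
        output reg [9:0] y           // line counter, 0..524
    );
        initial begin x = 0; y = 0; end   // FPGA registers configure to 0

        // horizontal: 640 visible + 16 FP + 96 sync + 48 BP = 800
        // vertical:   480 visible + 10 FP +  2 sync + 33 BP = 525
        always @(posedge clk25) begin
            if (x == 10'd799) begin
                x <= 10'd0;
                y <= (y == 10'd524) ? 10'd0 : y + 10'd1;
            end else
                x <= x + 10'd1;

            hsync <= ~(x >= 10'd656 && x < 10'd752);   // active-low pulse
            vsync <= ~(y >= 10'd490 && y < 10'd492);
        end

        assign active = (x < 10'd640) && (y < 10'd480);
    endmodule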


What's Next

This first installment has covered the project motivation, FPGA architecture fundamentals, algorithm trade-offs, key implementation challenges, and the top-level hardware block diagram. The next installment will present the detailed hardware design: power supply schematics, FPGA peripheral circuits, acquisition circuit, SDRAM buffer circuit, and VGA output circuit. The following installment will then cover the full software (RTL) implementation of each module — acquisition, buffering, frame-difference processing, format decoding, and display — along with simulation waveforms and board-level test results.