Multimedia program development

Multimedia program development refers to the technical field that integrates text, images, audio, video and animation, implementing interactive functionality through programming languages. Its development focuses on hardware acceleration, coding efficiency and smoothness of the user experience.


Core development components

Mainstream development tools and languages

Development area | Commonly used languages | Technical frameworks/tools
Web multimedia | JavaScript / TypeScript | HTML5 Canvas, WebGL, Three.js
Mobile apps/games | C++ / C# / Swift | Unity, Unreal Engine, Metal
Back-end audio/video processing | Python / Go / C++ | FFmpeg, OpenCV, GStreamer

Common development processes

  1. Requirements analysis: Determine media types (such as streaming media, interactive games, educational software).
  2. Resource preparation: material collection and format conversion (optimizing file size and resolution).
  3. Programming: Implement playback logic, filter effects or interactive algorithms.
  4. Performance tuning: Perform memory management and multi-thread optimization to ensure high frame rate operation.
  5. Deployment and testing: Cross-platform compatibility testing to ensure that it can operate under different screen sizes and hardware specifications.
Note: When developing multimedia programs involving heavy computation, hardware decoding should be preferred to reduce CPU load.
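Step 4 above (multi-thread optimization) can be sketched with Python's standard concurrent.futures; `process_frame` here is a hypothetical stand-in for a real per-frame workload such as decode-plus-filter:

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame_id):
    # hypothetical per-frame workload; a real implementation would
    # decode and filter the frame, here we just return a tagged result
    return frame_id * 2

def process_clip(frame_ids, workers=4):
    # distribute frames across a thread pool; pool.map preserves order,
    # so the output frames stay in timeline order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_frame, frame_ids))
```

For CPU-bound filters, swapping in ProcessPoolExecutor avoids the GIL at the cost of pickling overhead.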


DirectX

DirectX is a family of application programming interfaces (APIs) developed by Microsoft that lets software (especially games) communicate directly with hardware such as graphics cards and sound cards. It is a core pillar of multimedia development on Windows and Xbox consoles.


Main API components

DirectX version evolution comparison

Version | Important features | Applicable environment
DirectX 11 | Introduced tessellation and multi-threaded rendering; high stability. | Windows 7 and above
DirectX 12 | Low-level API that greatly reduces CPU overhead and supports multi-core feeding of the GPU. | Windows 10 / 11
DirectX 12 Ultimate | Integrates next-generation technologies such as Ray Tracing and Mesh Shaders. | High-end GPUs and Xbox Series X/S

Development advantages

  1. Hardware abstraction: developers do not need to write vendor-specific code for different brands of graphics cards.
  2. High performance: DirectX 12 lets developers manage GPU resources at a finer granularity and reduce system latency.
  3. Complete ecosystem: tightly integrated with Visual Studio and the Microsoft toolchain, with rich debugging tools (such as PIX).
Note: In modern game development, developers usually call DirectX through engines such as Unity or Unreal Engine rather than writing low-level instructions directly, to improve development efficiency.


Media Foundation

Media Foundation (MF) is Microsoft's multimedia framework introduced with Windows Vista, designed to replace the older DirectShow. It adopts a new pipeline design optimized for high-resolution video, digital rights management (DRM) and more efficient hardware acceleration, and is the core technology for audio/video processing in modern Windows applications.


Core architectural components

Media Foundation breaks down the multimedia processing process into three main levels. This design provides extremely high flexibility of control:

Comparison of technical advantages

Characteristic | Media Foundation | DirectShow (legacy)
High-resolution support | Natively optimized for 4K, 8K and HDR content. | Limited scalability; struggles with ultra-high resolutions.
Hardware acceleration | Deeply integrated with DXVA 2.0; highly efficient. | Depends on the specific filter implementation; performance varies.
Content protection | Built-in Protected Media Path (PMP) supports DRM. | Lacks a unified content-protection mechanism.
Threading model | Asynchronous topologies reduce UI freezes. | Synchronous execution model easily causes interface lag.

Common development interface

  1. Source Reader: a simplified API for developers who only need to obtain decoded frames from a file or camera.
  2. Sink Writer: a quick tool for encoding audio/video data into files of a specific format.
  3. Media Session: a full pipeline controller providing complete control over play, pause, seek and other actions.
Note: Although Media Foundation performs well, its API design is relatively complex and strict. Developers are advised to use Microsoft's MFTrace tool for debugging, to trace the event flow in the media pipeline.


DirectShow

DirectShow is a multimedia framework based on the Component Object Model (COM), mainly used for audio and video capture and playback on the Windows platform. Although Microsoft later launched Media Foundation as its successor, DirectShow is still widely used in industrial cameras, medical imaging, and traditional audio and video software due to its strong compatibility and flexibility.


Filter graph model

The core concept of DirectShow is the Filter Graph, which processes multimedia data by connecting individual filters into a chain:

Core development functions

Function | Description
Media playback | Supports many container formats (such as AVI, WMV, MP4) and codecs.
Image capture | Provides a standard interface to WDM (Windows Driver Model) devices, suitable for USB cameras.
Hardware acceleration | Hardware-accelerated rendering via the Video Mixing Renderer (VMR) or EVR.
Format conversion | Supports resampling, cropping, and color-space conversion (such as YUV to RGB) of real-time video streams.

Development advantages and challenges

  1. Highly modular: developers can write custom filters and insert them into existing filter graphs.
  2. Automatic connection: an Intelligent Connect mechanism can automatically find and combine the required filters.
  3. Learning curve: its deep reliance on COM makes it harder for developers unfamiliar with COM pointers and memory management.
Note: For new development that does not need to support older systems, Microsoft recommends preferring Media Foundation, which has clear advantages in handling high-resolution content and digital rights management (DRM).


Vulkan

Vulkan is a next-generation cross-platform graphics and computing API developed by Khronos Group. Unlike OpenGL, Vulkan is a low-level API designed to provide more direct hardware control, minimize the driver's overhead, and improve the utilization of multi-core processors.


Core design features

Vulkan’s design logic requires developers to assume more management responsibilities in exchange for ultimate performance:

Differences between Vulkan and OpenGL

Characteristic | Vulkan | OpenGL
Driver burden | Very low; most logic is implemented by the developer. | Higher; the driver performs much background management.
Multi-thread support | Native support for parallel command submission. | Mainly single-threaded.
Development complexity | Extremely high; code size is usually several times that of OpenGL. | Medium; friendlier to beginners.
Hardware utilization | High; precise control over GPU compute and memory. | Lower; limited by the API's abstraction level.

key development components

  1. Instance & Physical Device: initialize Vulkan and enumerate the graphics hardware on the system.
  2. Logical Device & Queues: create a logical connection to a physical device and obtain queues that handle graphics, compute or transfer tasks.
  3. Pipeline State Objects (PSO): pre-package render state (such as blend mode and depth test) so that state is not changed dynamically during drawing, which would cause frame drops.
  4. Render Pass: explicitly defines render targets and operation steps, which benefits tile-based rendering optimization on mobile GPUs.
Note: Because of Vulkan's very high entry barrier, it is usually recommended for 3D game-engine cores that need extreme performance (such as id Tech 7) or for cross-platform high-performance computing in scientific simulation.


Machine vision program development

OpenCV

1. What is OpenCV?

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library for real-time image processing and analysis.

2. Supported functions

3. Supported platforms

4. Usage examples

# Read the image and display it
import cv2
image = cv2.imread("image.jpg")
cv2.imshow("Image", image)
cv2.waitKey(0)
cv2.destroyAllWindows()

5. Resources and Documents



cv::imread

1. Basic grammar

In OpenCV, the core function for reading images is cv::imread. It loads an image file into a cv::Mat matrix.

#include <opencv2/opencv.hpp>

// Function prototype
cv::Mat cv::imread(const cv::String& filename, int flags = cv::IMREAD_COLOR);

Commonly used flags:

  1. cv::IMREAD_COLOR: load as a 3-channel BGR image (the default).
  2. cv::IMREAD_GRAYSCALE: load as a single-channel grayscale image.
  3. cv::IMREAD_UNCHANGED: load the file as-is, including any alpha channel.


2. Exception checking and handling mechanism

Key point: when cv::imread fails, it does not throw a C++ exception, so a traditional try-catch is ineffective here. When reading fails (wrong path, unsupported format or insufficient permissions), it returns an empty cv::Mat object.

The correct approach is to check with the empty() member function:

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    std::string path = "data/image.jpg";
    cv::Mat img = cv::imread(path);

    // Must check if the image is loaded successfully
    if (img.empty()) {
        std::cerr << "Error: Unable to read image file!" << std::endl;
        std::cerr << "Please confirm whether the path is correct:" << path << std::endl;
        return -1;
    }

    // Operations to perform after a successful read
    std::cout << "Image width: " << img.cols << " Height: " << img.rows << std::endl;
    return 0;
}

3. Analysis of common failure reasons

If img.empty() is true, it is usually due to one of the following reasons:

Reason | Explanation and countermeasures
Wrong file path | The most common cause. Check whether the relative path is relative to the executable's directory, or use an absolute path.
Unsupported format | OpenCV needs the corresponding decoder (such as libjpeg or libpng); if it was compiled without that support, the file cannot be read.
Non-ASCII (e.g. Chinese) path | On Windows, older versions or certain build configurations of cv::imread handle non-ASCII paths poorly.
Insufficient permissions | The user running the program lacks operating-system permission to read the file.

4. Advanced solution: reading non-ASCII paths

If reading fails because of a non-ASCII (e.g. Chinese) path on Windows, it is recommended to read the file into a memory buffer first and then decode it with cv::imdecode:


#include <fstream>
#include <vector>

cv::Mat imread_unicode(const std::string& path) {
    // std::ifstream opens the path through the C runtime; on MSVC a
    // std::wstring overload is available for true wide-character paths
    std::ifstream fs(path, std::ios::binary | std::ios::ate);
    if (!fs.is_open()) return cv::Mat();

    std::streamsize size = fs.tellg();
    fs.seekg(0, std::ios::beg);

    std::vector<char> buffer(size);
    if (fs.read(buffer.data(), size)) {
        return cv::imdecode(cv::Mat(buffer), cv::IMREAD_COLOR);
    }
    return cv::Mat();
}


Grouping oscillating point sets

When the ordering of a point set (such as screw-thread edges or a sine wave) is scrambled, the points must first be projected onto the direction of a fitted line and sorted; they can then be grouped correctly by their signed distance (positive/negative offset) relative to the line. Below is an implementation that combines OpenCV and standard C++.


Coordinate point definition and distance sorting

First, implement a function that sorts points by their distance to a specified point. This can be used to locate a starting point or a particular feature point.

#include <vector>
#include <array>
#include <algorithm>
#include <opencv2/opencv.hpp>

using Point2D = std::array<float, 2>;
using Points = std::vector<Point2D>;

namespace GeometryPointsUtil {
    bool FindSortedPointsByDistOfPoint(Points& retPoints, const Points& allPoints, const Point2D& aPoint) {
        if (allPoints.empty()) return false;

        retPoints = allPoints;
        std::sort(retPoints.begin(), retPoints.end(), [&aPoint](const Point2D& p1, const Point2D& p2) {
            float dx1 = p1[0] - aPoint[0];
            float dy1 = p1[1] - aPoint[1];
            float dx2 = p2[0] - aPoint[0];
            float dy2 = p2[1] - aPoint[1];
            // Use sum of squares comparison to avoid sqrt operation overhead
            return (dx1 * dx1 + dy1 * dy1) < (dx2 * dx2 + dy2 * dy2);
        });
        return true;
    }
}

Grouping algorithm for out-of-order points along a fitted line

For oscillating point sets, this function fits a line, sorts the points by their projection onto it, and splits them according to which side of the line they fall on.

std::vector<Points> splitOscillatingPoints(const Points& allPoints) {
    if (allPoints.size() < 2) return {allPoints};

    // 1. Straight line fitting
    std::vector<cv::Point2f> cvPts;
    for (const auto& p : allPoints) cvPts.push_back({p[0], p[1]});
    
    cv::Vec4f line; // (vx, vy, x0, y0)
    cv::fitLine(cvPts, line, cv::DIST_L2, 0, 0.01, 0.01);
    float vx = line[0], vy = line[1], x0 = line[2], y0 = line[3];

    // 2. Projection sorting: ensure that the points are arranged along a straight line
    struct ProjectedPoint {
        Point2D original;
        float t; // projection length
        float side; // algebraic distance to straight line
    };

    std::vector<ProjectedPoint> projected;
    float nx = -vy; // normal vector x
    float ny = vx; // normal vector y

    for (const auto& p : allPoints) {
        float dx = p[0] - x0;
        float dy = p[1] - y0;
        float t = dx * vx + dy * vy; // Displacement projected onto a straight line
        float s = dx * nx + dy * ny; // Distance perpendicular to the straight line (including plus and minus signs)
        projected.push_back({p, t, s});
    }

    std::sort(projected.begin(), projected.end(), [](const ProjectedPoint& a, const ProjectedPoint& b) {
        return a.t < b.t;
    });

    // 3. Grouping based on positive and negative sign transitions
    std::vector<Points> segments;
    if (projected.empty()) return segments;

    Points currentGroup;
    bool lastSide = (projected[0].side >= 0);

    for (const auto& item : projected) {
        bool currentSide = (item.side >= 0);

        if (currentSide != lastSide && !currentGroup.empty()) {
            segments.push_back(currentGroup);
            currentGroup.clear();
        }
        
        currentGroup.push_back(item.original);
        lastSide = currentSide;
    }

    if (!currentGroup.empty()) segments.push_back(currentGroup);
    return segments;
}
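For reference, the same projection-and-sign idea in a dependency-free Python sketch; the principal axis of the covariance matrix replaces cv::fitLine here, which is an approximation rather than a literal port:

```python
import math

def split_oscillating_points(points):
    # points: list of (x, y) tuples in arbitrary order
    n = len(points)
    if n < 2:
        return [list(points)]
    # 1. Line fit: principal-axis direction of the 2x2 covariance matrix
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    vx, vy = math.cos(theta), math.sin(theta)   # line direction
    nx, ny = -vy, vx                            # normal direction
    # 2. Project onto the line (t) and the normal (s), sort along the line
    proj = sorted(
        ((p, (p[0] - cx) * vx + (p[1] - cy) * vy,
             (p[0] - cx) * nx + (p[1] - cy) * ny) for p in points),
        key=lambda item: item[1])
    # 3. Cut a new group whenever the signed offset changes side
    groups, current = [], []
    last_side = proj[0][2] >= 0
    for p, _, s in proj:
        side = s >= 0
        if side != last_side and current:
            groups.append(current)
            current = []
        current.append(p)
        last_side = side
    if current:
        groups.append(current)
    return groups
```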

Explanation of implementation points



Halcon

Features

Halcon is a powerful industrial vision software developed by MVTec, specifically designed for image processing and machine vision applications.

Function

Application areas

resource



Video editing program development

Common functions

Common Tools and Library

Application examples



Open source video editing software

1. Shotcut

Shotcut is a free and open source video editing software that supports multiple formats and has many powerful editing tools. Features include:

Applicable platforms: Windows, Mac, Linux

2. OpenShot

OpenShot is an easy-to-use open source video editing tool that is powerful and supports multiple formats. Its main features include:

Applicable platforms: Windows, Mac, Linux

3. Blender

Blender is a well-known open source 3D modeling and animation software with a built-in powerful video editor suitable for video editing and special effects production. Its features include:

Applicable platforms: Windows, Mac, Linux

4. Kdenlive

Kdenlive is a widely used open source video editing software on Linux and also supports Windows. Its main functions include:

Applicable platforms: Windows, Mac, Linux

5. Lightworks

Lightworks offers free and paid versions, with the free version offering basic editing features. Features include:

Applicable platforms: Windows, Mac, Linux

6. Avidemux
7. Cinelerra
8. LiVES
9. Losslesscut
10. Natron
11. Pitivi

The above open source video editing software provides powerful functions that are suitable for different levels of video editing needs, from simple home video editing to professional video production.

Google search volume ranking

Software name | Approximate search volume
OpenShot | 110,000
Kdenlive | 90,500
Shotcut | 49,500
Avidemux | 18,100
Losslesscut | 14,800
Blender VSE | 10,000
Natron | 6,600
Cinelerra | 5,400
Pitivi | 3,600
LiVES | 1,600


Useful program libraries for video editing

FFmpeg

MoviePy (Python)

OpenCV (C++/Python)

GStreamer

AVFoundation (macOS/iOS)

Microsoft Media Foundation (Windows)

Kapwing API / Shotstack / Cloudinary

Adobe Premiere Pro API (Adobe UXP)



OpenShot

Project introduction

OpenShot is a free and open-source video editor; the project repository is OpenShot/openshot-qt, developed mainly in Python with Qt. The project aims to provide an easy-to-use, feature-rich video editing tool suitable for users of all levels.

Features

Technical architecture

OpenShot uses PyQt for the graphical user interface, combined with libopenshot (implemented in C++) to handle the core video-editing logic. Additionally, OpenShot leverages FFmpeg to support decoding and encoding of many formats.

Usage context

OpenShot suits users who want simple yet capable video editing. Whether for amateur video creators or educational purposes, OpenShot provides flexible tools and plug-ins that make editing and creation easy.

Community and contribution

The OpenShot project has an active open source community, and users and developers can contribute code, report issues, or submit new feature suggestions through GitHub. Everyone is welcome to participate and help improve the functionality and stability of OpenShot.

How to get OpenShot

Users can download the source code through the GitHub page, or download the executable file from the OpenShot official website. Detailed installation instructions and documentation are also available on GitHub.



Python Kdenlive automation

How Kdenlive project files (.kdenlive) work

Kdenlive's project files are essentially plain-text files in XML format. To automate opening and importing, the most stable and efficient approach is not to simulate mouse clicks but to use Python to generate or modify the XML directly, then launch Kdenlive to open it. This way the positions of voice, subtitles (SRT) and video on the timeline can be specified precisely.

Automated import script implementation

This script demonstrates how to create a basic Kdenlive project XML structure and write the resource paths you specify into it.
import os
import subprocess

def create_kdenlive_project(project_path, video_path, audio_path, srt_path):
    """
    Create a basic Kdenlive XML project file and import assets
    """
    # Get the absolute path of the file to ensure Kdenlive can read it correctly
    video_abs = os.path.abspath(video_path)
    audio_abs = os.path.abspath(audio_path)
    srt_abs = os.path.abspath(srt_path)

    # Basic Kdenlive MLT structure (simplified version)
    kdenlive_xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<mlt version="7.24.0" title="Auto Generated Project">
  <producer id="video_main" resource="{video_abs}"/>
  <producer id="audio_main" resource="{audio_abs}"/>
  <producer id="subtitle_main" resource="{srt_abs}"/>
  
  <playlist id="main_bin">
    <entry producer="video_main"/>
    <entry producer="audio_main"/>
    <entry producer="subtitle_main"/>
  </playlist>

  <tractor id="main_timeline">
    <multitrack>
      <track name="Video Track">
        <entry producer="video_main" in="0" out="1000"/>
      </track>
      <track name="Audio Track">
        <entry producer="audio_main" in="0" out="1000"/>
      </track>
    </multitrack>
  </tractor>
</mlt>
"""

    with open(project_path, "w", encoding="utf-8") as f:
        f.write(kdenlive_xml)
    print(f"Project file has been generated: {project_path}")

def open_with_kdenlive(project_path, kdenlive_exe_path):
    """
    Start Kdenlive and load the generated project
    """
    try:
        # Use subprocess to open the program and bring in file parameters
        subprocess.Popen([kdenlive_exe_path, project_path])
        print("Starting Kdenlive...")
    except Exception as e:
        print(f"Startup failed: {e}")

if __name__ == "__main__":
    # Set file path
    MY_VIDEO = "input_video.mp4"
    MY_AUDIO = "output_voice.wav"
    MY_SRT = "output_subtitle.srt"
    SAVE_PROJECT = "auto_project.kdenlive"
    
    # Kdenlive executable file path (Windows example, Linux usually uses 'kdenlive' directly)
    KDENLIVE_PATH = r"C:\Program Files\kdenlive\bin\kdenlive.exe"

    # 1. Generate project file
    create_kdenlive_project(SAVE_PROJECT, MY_VIDEO, MY_AUDIO, MY_SRT)
    
    # 2. Start Kdenlive
    open_with_kdenlive(SAVE_PROJECT, KDENLIVE_PATH)

Advanced automation suggestions



MLT multimedia framework

MLT (Media Lovin' Toolkit) core architecture

MLT is an open source multimedia framework and the underlying engine of editing software such as Kdenlive and Shotcut. It adopts non-linear editing (NLE) design, defines video, audio, filters and transitions as XML structures (called MLT XML), and performs real-time preview or rendering through efficient pipelines.

Components of MLT

Using Python to manipulate MLT XML examples

Instead of doing it manually in Kdenlive, you can use Python to generate MLT scripts to automate batch editing.
import subprocess

# Define a simple MLT XML structure
# This XML defines the playback order of two pieces of material.
mlt_xml_content = """<mlt>
  <producer id="clip1" resource="video_part1.mp4" />
  <producer id="clip2" resource="video_part2.mp4" />
  <playlist id="main_track">
    <entry producer="clip1" in="0" out="150" />
    <entry producer="clip2" in="0" out="300" />
  </playlist>
</mlt>
"""

# Write content to file
with open("auto_edit.mlt", "w", encoding="utf-8") as f:
    f.write(mlt_xml_content)

def render_video(mlt_file, output_file):
    """
    Use the melt command line tool to render videos directly (without opening the GUI)
    """
    # melt is the command line interface tool for MLT
    command = [
        "melt",
        mlt_file,
        "-consumer", f"avformat:{output_file}",
        "acodec=aac", "vcodec=libx264", "preset=fast"
    ]
    
    try:
        print(f"Start background rendering: {output_file}...")
        subprocess.run(command, check=True)
        print("Rendering completed!")
    except FileNotFoundError:
        print("Error: The melt executable file cannot be found, please confirm whether the MLT framework is installed.")

if __name__ == "__main__":
    # Perform rendering
    render_video("auto_edit.mlt", "final_result.mp4")

Why choose MLT for automation?



Python clipping automation

This script uses image recognition to locate UI elements. Before running it, capture small screenshots of the "Image and Text into Movie" and "Generate Video" buttons in the Jianying editor interface, save them as btn_start.png and btn_generate.png, and store them in the same directory as the script.


Preparation

Please install the necessary Python libraries first:

pip install pyautogui pyperclip opencv-python

Automation code examples

import os
import time
import pyautogui
import pyperclip

# Set parameters
JIANYING_PATH = r"C:\Users\YourName\AppData\Local\JianyingPro\Apps\JianyingPro.exe" # Please replace it with your actual path
SCRIPT_FILE = "my_script.txt" # Pre-prepared script file
CONFIDENCE_LEVEL = 0.8 # Image recognition accuracy (0-1)

def run_automation():
    # 1. Read the content of the document
    if not os.path.exists(SCRIPT_FILE):
        print("Error: Document file not found")
        return
    with open(SCRIPT_FILE, "r", encoding="utf-8") as f:
        content = f.read()

    # 2. Launch Jianying
    print("Launching Jianying...")
    os.startfile(JIANYING_PATH)
    time.sleep(8) # Wait for the software to fully load

    try:
        # 3. Locate and click the "Picture and Text into Film" button
        start_btn = pyautogui.locateCenterOnScreen('btn_start.png', confidence=CONFIDENCE_LEVEL)
        if start_btn:
            pyautogui.click(start_btn)
            print("You have entered the picture and text interface")
            time.sleep(2)
        else:
            print('Unable to locate the "Picture and Text into Film" button')
            return

        # 4. Process document input
        pyperclip.copy(content) # Copy the document to the clipboard
        pyautogui.click(x=pyautogui.size().width//2, y=pyautogui.size().height//2) # Click the center of the window to ensure focus
        pyautogui.hotkey('ctrl', 'v')
        print("The manuscript has been pasted")
        time.sleep(1)

        # 5. Locate and click "Generate Video"
        gen_btn = pyautogui.locateCenterOnScreen('btn_generate.png', confidence=CONFIDENCE_LEVEL)
        if gen_btn:
            pyautogui.click(gen_btn)
            print("Generating project...")
        else:
            print('Unable to locate the "Generate Video" button')

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    run_automation()

Implementation key details

Step | Description
Image capture | When capturing button images, crop only the text or icon at the center of the button and avoid including background color, to stay robust across UI themes.
Sleep timing | The most common automation failure is clicking before the software has responded. Adjust the time.sleep values to your machine's performance.
Fail-safe | PyAutoGUI has a built-in protection mechanism: quickly moving the mouse to the top-left corner of the screen immediately aborts the program.

Optimization suggestions

  1. Resolution adaptation: if you run the script on a different computer, re-capture the .png button images; recognition may fail when the screen resolution or zoom ratio (DPI) changes.
  2. Window on top: before clicking, it is recommended to use the pygetwindow library to force the Jianying window into the foreground.
  3. Poll instead of fixed waits: rather than a fixed wait time, loop until the button appears and only then click.
Note: UI automation of this kind may break with software updates (interface layout changes). For long-term stable operation, modifying the draft JSON files directly is a more robust approach.
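Suggestion 3 can be factored into a small polling helper; in practice `locate` would wrap a call such as pyautogui.locateCenterOnScreen (a sketch, not part of PyAutoGUI itself):

```python
import time

def wait_for(locate, timeout=10.0, interval=0.5):
    """Poll `locate` until it returns a non-None position or the timeout
    expires. `locate` is any zero-argument callable, e.g.
    lambda: pyautogui.locateCenterOnScreen('btn_generate.png', confidence=0.8)
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pos = locate()
        if pos is not None:
            return pos
        time.sleep(interval)
    return None  # button never appeared
```

The caller then clicks only when `wait_for` returns a position, instead of sleeping a fixed number of seconds.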


FFmpeg

Introduction

Common functions

Advantages

Usage

Official website and download



FFmpeg automatic deployment

In multimedia development, ensuring that the runtime environment has FFmpeg available is a basic requirement. With Python's standard shutil, urllib and zipfile modules, the environment setup can be automated.


Core logic flow

The code divides into two stages: checking the system PATH, then downloading and unpacking a build.

Python implementation example

import os
import shutil
import platform
import urllib.request
import zipfile

def ensure_ffmpeg():
    # 1. Check whether the system PATH already has ffmpeg
    if shutil.which("ffmpeg"):
        print("FFmpeg already exists in the system path.")
        return True

    print("FFmpeg not detected, ready to start downloading...")
    
    # 2. Download information according to operating system settings (taking Windows as an example)
    if platform.system() == "Windows":
        url = "https://www.gyan.dev/ffmpeg/builds/ffmpeg-release-essentials.zip"
        target_zip = "ffmpeg.zip"
        extract_dir = "ffmpeg_bin"
        
        # Download file
        urllib.request.urlretrieve(url, target_zip)
        
        # decompress
        with zipfile.ZipFile(target_zip, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)
            
        # Locate the extracted bin directory and add it to PATH.
        # The top-level folder name varies by release, so search for
        # it instead of hard-coding the path.
        ffmpeg_path = ""
        for root, dirs, _files in os.walk(extract_dir):
            if os.path.basename(root) == "bin":
                ffmpeg_path = os.path.abspath(root)
                break
        os.environ["PATH"] += os.pathsep + ffmpeg_path
        
        print(f"FFmpeg has been deployed to: {ffmpeg_path}")
        return True
    else:
        print("The current example only supports automatic download for Windows, please install manually for other systems.")
        return False

# Perform checks
ensure_ffmpeg()

Development considerations

Item | Description
Permissions | On Linux or macOS, the downloaded binary may need os.chmod(path, 0o755) to become executable.
Version pinning | Download from a reliable source (such as Gyan.dev or BtbN) and confirm the version is compatible with your code.
Network timeouts | FFmpeg archives are large; wrap the download in try-except to handle network interruptions, or use the requests library to show a progress bar.
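The progress-bar suggestion above can also be met with the reporthook parameter of urllib.request.urlretrieve, without adding requests; a minimal sketch:

```python
def percent_done(block_num, block_size, total_size):
    """Clamp the downloaded fraction to 0-100 for display."""
    if total_size <= 0:
        return None  # server did not send Content-Length
    return min(100, block_num * block_size * 100 // total_size)

def report_progress(block_num, block_size, total_size):
    # signature expected by urllib.request.urlretrieve's reporthook
    p = percent_done(block_num, block_size, total_size)
    if p is not None:
        print(f"\rDownloading FFmpeg: {p}%", end="")

# usage with the url/target_zip variables from the example above:
# urllib.request.urlretrieve(url, target_zip, reporthook=report_progress)
```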

Path management suggestions

  1. Priority order: the program should first check a user-defined path, then the system PATH, and only then fall back to automatic download.
  2. Persistence: place the automatically downloaded FFmpeg under the application's AppData (or the project root) to avoid repeated downloads.
  3. Static builds: for shipped applications, bundling a statically compiled FFmpeg executable is usually more reliable than downloading at runtime.
Note: In production, repeatedly downloading large binaries hurts user experience. Prompt the user on first launch, or bundle FFmpeg in the installer.


Python screen recording

The most common and stable way to record the screen in Python is to combine PyAutoGUI (to capture frames), OpenCV (to encode and store the video) and NumPy (to handle the image data).


1. Preparation

First you need to install the necessary packages. Open a terminal and execute the following commands:

pip install opencv-python pyautogui numpy

2. Core implementation examples

The following code will capture the full screen image and save it as an output.mp4 file. Press the q key on your keyboard to stop recording.

import cv2
import pyautogui
import numpy as np

# Get screen resolution
SCREEN_SIZE = tuple(pyautogui.size())

# Define video encoding format (FourCC)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")

# Create VideoWriter object (file name, encoding, frame rate, resolution)
out = cv2.VideoWriter("output.mp4", fourcc, 20.0, SCREEN_SIZE)

print("Recording... Press the 'q' key to stop.")

try:
    while True:
        # Capture screen
        img = pyautogui.screenshot()
        
        # Convert to NumPy array
        frame = np.array(img)
        
        # Convert color from RGB to BGR (OpenCV standard format)
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
        
        # Write frames to video file
        out.write(frame)
        
        # Show a preview window; cv2.waitKey only receives key events
        # while an OpenCV window has focus, so the preview is required
        # for the 'q' key below to work
        cv2.imshow("Preview", frame)

        # Detect keyboard input
        if cv2.waitKey(1) == ord("q"):
            break
finally:
    # Release resources and close the window
    out.release()
    cv2.destroyAllWindows()
    print("The recording is over and the file has been saved.")

3. Description of key components


4. Frequently Asked Questions and Solutions

Problem phenomenon Reasons and suggestions
Video plays too fast The actual recorded FPS is lower than the set value. The writing FPS should be lowered, or a more efficient retrieval library such as mss should be used instead.
Color is abnormal Forgot to do the COLOR_RGB2BGR conversion.
Stuttering when executing the code Capturing a high-resolution screen is very CPU-intensive. It is recommended to lower the screen resolution or record only a specific area.
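The first row of the table can be quantified: if the VideoWriter is configured with a higher FPS than the capture loop actually achieves, playback speed is the ratio of the two. A minimal arithmetic sketch (the function names are illustrative, not part of any library):

```python
def playback_speed_factor(writer_fps: float, actual_capture_fps: float) -> float:
    """How many times faster the video plays back than real time.

    The VideoWriter stamps frames at writer_fps, but frames only arrive at
    actual_capture_fps, so playback is sped up by the ratio of the two.
    """
    return writer_fps / actual_capture_fps

def corrected_writer_fps(measured_frames: int, elapsed_seconds: float) -> float:
    """FPS value to pass to cv2.VideoWriter so playback matches real time."""
    return measured_frames / elapsed_seconds

# Example: 20 FPS was configured but only ~12 frames/s were actually captured.
factor = playback_speed_factor(20.0, 12.0)   # video plays about 1.67x too fast
fps = corrected_writer_fps(360, 30.0)        # 360 frames in 30 s -> write at 12 FPS
```

Measuring the real capture rate once (frames written divided by wall-clock seconds) and re-encoding with that value is usually enough to fix the fast-playback symptom.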


Manim - Python Animation

Manim (Mathematical Animation Engine) is an animation library written in Python, specifically used to create mathematical figures and animations. It can be used to generate high-quality animations that illustrate mathematical concepts, code execution processes, or anything else that can be represented with images and animation.

Main features of Manim

How to use Manim

Manim animation is generally completed by writing Python scripts and then generating video files. Each animation usually contains one or more scenes (Scene), and each scene is composed of different objects (Mobject).

basic example

from manim import *

class MyFirstScene(Scene):
    def construct(self):
        text = Text("Hello, Manim!") # Create a text object
        self.play(Write(text)) # Generate animation

Install Manim

Manim can be installed via pip:

pip install manim


3D graphics and animation program development

OpenGL

OpenGL (Open Graphics Library) is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. It is maintained by the Khronos Group and is widely used in computer-aided design (CAD), virtual reality, scientific visualization, and video game development.


Drawing pipeline process

OpenGL uses a pipeline architecture to convert 3D data into pixels on the screen. Modern OpenGL core mode relies heavily on shaders:

Technical features and advantages

characteristic illustrate
Cross-platform compatibility Runs on Windows, Linux, macOS (via translation layer) and mobile devices (OpenGL ES).
State machine model OpenGL operates like a huge state machine. Developers set the state (such as current color, bound texture) and then execute drawing instructions.
GLSL language Use C-like OpenGL Shading Language to write GPU programs, which has powerful computing capabilities.
Extension mechanism Allow hardware manufacturers to introduce new graphics card functions through Extension without updating the API standard.

Core mode and immediate mode

  1. Immediate Mode:Early versions used glBegin/glEnd instructions, which were easy to learn but extremely inefficient and have now been deprecated.
  2. Core Profile:Modern development standards mandate the use of buffer objects (VBO/VAO) and shaders to maximize the performance of the hardware.
Note: Although Vulkan has been regarded as the successor of OpenGL, providing lower-level hardware control, OpenGL is still the first choice for learning graphics program development due to its relatively simple entry barrier and rich documentation.


ManimGL

Introduction

ManimGL is an efficient variant of Manim for making mathematical animations, focusing on OpenGL acceleration to improve rendering speed.

Install

Install using pip:

pip install manimgl

Or get the latest version from GitHub:

git clone https://github.com/3b1b/manim.git
cd manim
pip install -e .

Basic use

Render a simple scene using ManimGL:

from manimlib import *

class HelloManim(Scene):
    def construct(self):
        text = Text("Hello, ManimGL!")
        self.play(Write(text))
        self.wait(2)

Run command:

manimgl script.py HelloManim

Main features

FAQ

If you encounter installation or operational problems, try:

Related resources



Blender 3D creation software

Blender is an open source and all-in-one 3D creation software that covers a complete pipeline from modeling, animation, rendering to compositing and video editing. Known for its powerful Cycles rendering engine and flexible Python API, it is a core tool for independent developers and small and medium-sized studios.


Core technology module

Blender's architecture is extremely compact and uses multiple dedicated engines to work together:

Comparison of technical characteristics

  characteristic illustrate
  Cross-platform support It natively supports Windows, macOS (Apple Silicon) and Linux, and the file format (.blend) is universal across all platforms.
  Python API Almost the entire UI and feature set can be controlled through Python scripts, making it easy to develop add-ons.
  Integrated pipeline Built-in video editor (VSE) and compositor (Compositor), no need to switch software to complete post-production.

    Developer automation path

    For developers who need to batch process 3D materials or automate modeling, Blender provides a powerful background mode:

    1. Background rendering:Run blender -b -P script.py from the command line to perform automated tasks without opening the graphical interface.
    2. bpy module:Blender's exclusive Python library can operate every vertex, material and animation frame in the scene.
    3. Custom UI:Developers can use Python to write custom panels and toolbars to optimize specific workflows.
    Note: Blender updates very quickly (about one version every three months). When developing scripts, you need to pay attention to API compatibility changes between different versions.


    Blender Python Mods

    The bpy module is a Python API designed specifically for Blender that allows users to create, modify, and manage 3D images and animations through code within Blender.

    What is bpy

    bpy is short for Blender Python: a set of libraries that let Python scripts operate Blender's core functions. Through bpy, users can:

    Main modules and features of bpy

    bpy contains multiple sub-modules, each with a specific purpose:

    Simple example: Create a cube

    The following is a simple example of using bpy to create a cube:

    import bpy
    
    # Delete existing objects
    bpy.ops.object.select_all(action='SELECT')
    bpy.ops.object.delete(use_global=False)
    
    # Add a cube
    bpy.ops.mesh.primitive_cube_add(size=2, enter_editmode=False, align='WORLD', location=(0, 0, 0))
        

    Why use bpy

    Using bpy allows you to automate repetitive tasks and produce complex models, animations and renders. For professionals such as game designers, architects, and animators, bpy provides powerful tools to optimize workflows.

    References

    To learn more about the bpy module, please refer to the official documentation: Blender Python API Documentation



    Game program development

    Unity

    Unity is a powerful game development engine and platform designed for creating 2D and 3D games, interactive applications, and virtual reality (VR) and augmented reality (AR) experiences. It provides an easy-to-use interface and rich tools, suitable for both beginners and professional developers.

    1. Main features of Unity
    2. Unity’s core components
    3. Application scope of Unity
    4. Advantages of Unity

    Unity is a powerful and flexible development engine that provides developers with a wide range of application scenarios and tool support. Whether you are a beginner or a professional developer, you can use Unity to quickly create high-quality 2D and 3D games and interactive applications.



    Cocos game engine

    Cocos is the world's leading open source mobile game development framework, including the early pure code-driven Cocos2d-x and the modern full-featured editor Cocos Creator. Known for its lightweight, efficient and cross-platform support, it is the preferred tool for developing 2D and 3D mobile games and mini-games (such as WeChat mini-games and TikTok mini-games).


    Core product evolution

    The Cocos family is mainly divided into two important development stages to meet the needs of different development habits:

    Technical advantages and features

    characteristic illustrate
    Extremely cross-platform Supports iOS, Android, Windows, Mac and various web browsers and instant game platforms.
    High performance renderer The bottom layer uses the self-developed GFX abstraction layer, which supports multiple graphics backends such as Vulkan, Metal, DirectX and WebGL.
    Lightweight and compact The engine core is small and packaged games start up quickly, making it suitable for platforms with limited network environments or strict load-time requirements.
    TypeScript support Cocos Creator deeply integrates TypeScript, provides complete type checking and syntax prompts, and reduces the difficulty of maintaining large projects.

    Core functional components

    1. Scene management:Using Node and Component architecture, developers can easily manage complex hierarchical relationships.
    2. Physics engine:Built-in support for a variety of physics backends (such as Box2D, Bullet, Cannon.js), which can be switched according to project needs.
    3. UI system:Provides flexible layout components, coordinate conversion and automatic picture combining functions to greatly optimize interface rendering efficiency.
    4. Animation system:Supports skeletal animation (Spine, DragonBones), key frame animation and self-developed Marionette dynamic state machine.
    Note: Cocos Creator has now evolved to version 3.x, which fully integrates the core technologies of 2D and 3D. Developers can mix and produce 2D UI and 3D scenes in the same project.


    Sound program development

    Speech synthesis development

    core development process

    Developing a speech synthesis system is usually divided into three stages. The first is front-end processing, which converts raw text into linguistic features (such as word segmentation, phonetic transcription, prosody prediction); next, the acoustic model maps these features into an acoustic representation (such as a mel spectrogram); finally, the vocoder turns the acoustic representation into audible waveform audio.
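    As a tiny illustration of the front-end stage, the sketch below expands standalone digits into words, one small piece of text normalization that happens before any acoustic modelling. The mapping table and the normalize_text function are illustrative assumptions, not part of any TTS framework:

```python
import re

# Illustrative digit-to-word table for English front-end normalization
_DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_text(text: str) -> str:
    """Expand digits into words, one small piece of TTS front-end work."""
    # Replace each digit with its word form plus a space, then collapse whitespace
    out = re.sub(r"\d", lambda m: _DIGITS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", out).strip()

print(normalize_text("Call 911"))  # -> "Call nine one one"
```

    Real front-ends handle far more (dates, currencies, abbreviations, G2P), but the principle is the same: the acoustic model should never see raw symbols it cannot pronounce.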

    Mainstream development frameworks and libraries

    category Tools/Models Development features
    Open source framework Coqui TTS / ESPnet Modular design supports a large number of pre-trained models and Fine-tuning
    lightweight engine MeloTTS / Kokoro CPU friendly, suitable for edge computing or embedded devices
    Conversation optimization ChatTTS Designed specifically for spoken dialogue, supporting the insertion of laughter, catchphrases and other details
    Research grade model StyleTTS 2 / VITS Based on Generative Adversarial Network (GAN), the sound quality is extremely close to real people

    Custom model training (Fine-tuning)

    To develop a TTS with a specific timbre, you need to prepare a high-quality dataset (usually 1 to 10 hours of recordings with corresponding text). Developers commonly use transfer learning, fine-tuning a large base model, which significantly reduces the amount of data required and improves voice similarity and naturalness.

    API integration development

    For most application developers, calling a mature cloud API directly is the most efficient solution. For example, the ElevenLabs API offers strong emotional expression; the Microsoft Azure Speech SDK provides the most complete SSML (Speech Synthesis Markup Language) support, letting developers precisely control pauses, stress, and tone through tags. In addition, the OpenAI TTS API, with its simple interface and very low inference latency, is popular in real-time interactive applications.

    Technical selection suggestions

    In the early stages of development, it is recommended to balance latency (RTF) against sound quality. For real-time customer service, low-latency streaming is key; for audiobooks, prioritize a model with long-text processing capability and rich prosody. Also check each language's G2P (grapheme-to-phoneme) support, which directly determines pronunciation correctness.
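    The RTF mentioned above is straightforward to compute: it is the ratio of synthesis time to the duration of the audio produced, and streaming is only viable when it stays below 1. A minimal sketch (the function name is illustrative):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    RTF < 1 means the engine generates audio faster than it plays back,
    which is the precondition for streaming use cases.
    """
    return processing_seconds / audio_seconds

rtf = real_time_factor(2.5, 10.0)  # 2.5 s of compute for 10 s of speech -> 0.25
streaming_ok = rtf < 1.0
```

    The same ratio is used later in this document for ASR, where processing must likewise run much faster than the speech it transcribes.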



    CosyVoice 2

    CosyVoice 2 is an advanced version of Alibaba’s open source speech synthesis (TTS) model. Compared with the first generation, it has achieved significant breakthroughs in pronunciation accuracy, fine-grained emotion control, and streaming reasoning latency. It not only supports high-quality tone cloning, but also introduces command-controllable technology to make AI speech more "human".


    Core technology upgrade

    CosyVoice 2 uses "text-speech language model" and "Flow Matching" technology to achieve end-to-end speech generation:

    Functional Features Comparison

    Function CosyVoice 2 Description
    Multi-language support Supports Chinese, English, Japanese, Korean and multiple dialects (Cantonese, Sichuan, Shanghainese, Tianjin, etc.).
    emotion/command control Voice emotion and speaking speed can be controlled through commands (such as "Speak happily", "Speak angrily").
    3 seconds super fast cloning Zero-shot high-fidelity sound reproduction can be achieved with just 3 to 10 seconds of sample audio.
    mixed language synthesis Supports mixing multiple languages such as Chinese and English in the same text while keeping the timbre highly consistent.

    Application scenarios and development suggestions

    1. Intelligent customer service and virtual assistant:Leverage its ultra-low latency of 150ms to create a responsive and emotional dialogue system.
    2. Audiobooks and film dubbing:Through fine-grained tone tag control, the emotional ups and downs and speaking styles of different characters can be simulated.
    3. Education and dialect protection:It has a rich built-in dialect data set that can be used for digital dialect teaching or local cultural content creation.
    Note: When deploying CosyVoice 2 locally, it is recommended to use an NVIDIA graphics card with at least 8GB of video memory and the officially recommended vLLM acceleration framework for the best RTF (real-time factor) performance.


    CosyVoice 2 basic usage

    CosyVoice 2 is developed in Python. Since it involves complex audio processing and deep learning environments, it is strongly recommended to install it in an isolated Conda virtual environment. Linux currently has the best official support; Windows users are advised to deploy through WSL2 or a community-modified build.


    1. Environment preparation and installation

    Before starting, please make sure your system has the NVIDIA driver (recommended 8GB or more video memory) and Conda installed.

    1. Create a virtual environment:
      conda create -n cosyvoice2 python=3.10
      conda activate cosyvoice2
    2. Install key dependencies Pynini:

      Pynini is the core component that handles text normalization and must be installed through conda:

      conda install -y -c conda-forge pynini==2.1.5
    3. Copy the project and install dependencies:
      git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
      cd CosyVoice
      pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

    2. Model download

    CosyVoice 2 requires downloading pre-trained model weights. You can automate the download via a Python script:

    from modelscope import snapshot_download
    # Download 0.5B main model
    snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
    # Download text normalization resources
    snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')

    3. Basic usage

    CosyVoice 2 offers a variety of modes to suit different needs, from quick dubbing to professional cloning:

    usage pattern Operating Instructions Applicable scenarios
    Start WebUI Run python webui.py and open the visual interface in the browser. Manual dubbing and quick effect testing.
    3 seconds extremely fast reproduction Upload 3-10 seconds of reference audio and corresponding text to achieve sound cloning. Personalized voice package, self-media dubbing.
    Cross-language/dialect Input Chinese text and select Cantonese or Sichuan dialect tone output. Localized content production.
    command control Add a command tag before or inside the text (e.g. [laughter], [angry]). Audiobooks, dramatized voice-overs.

    4. Developer API calling examples

    If you want to integrate CosyVoice 2 into your own Python project (such as Kdenlive's automation script):

    from cosyvoice.cli.cosyvoice import CosyVoice2
    import torchaudio
    
    # Initialize the model
    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    
    # Execute inference (taking pre-trained timbre as an example)
    output = cosyvoice.inference_sft('Hello, I am an artificial intelligence voice assistant.', 'Chinese female')
    
    # Save message
    torchaudio.save('output.wav', output['tts_speech'], cosyvoice.sample_rate)
    Note: If you are installing on Windows and encounter sox or compilation errors, please refer to GitHub Issue #1046, or try the one-click installation package.


    CosyVoice 2 long article synthesis

    Text segmentation and grammatical integrity

    The core of long-article generation is preprocessing. Since TTS models usually have an upper limit on inference length (context window), feeding in text that is too long causes garbled or truncated output. The code uses regular expressions to capture end-of-sentence punctuation precisely, ensuring each split point falls on a natural pause and preserving the naturalness of the synthesized speech.

    Tensor splicing technology

    This code joins audio with PyTorch's native torch.cat. Compared with saving each audio segment to a file and merging afterwards, concatenating tensors directly in GPU/CPU memory significantly reduces disk I/O overhead and effectively eliminates the digital noise that can appear between segments.
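    The same splicing idea can be shown with NumPy so it runs without the PyTorch stack: np.concatenate(..., axis=1) plays the role of torch.cat(..., dim=1) for (channels, samples) audio buffers. The shapes below are illustrative:

```python
import numpy as np

# Two mono audio segments shaped (channels, samples), like the model's output tensors
seg_a = np.zeros((1, 22050), dtype=np.float32)  # 1 s of silence at 22050 Hz
seg_b = np.zeros((1, 11025), dtype=np.float32)  # 0.5 s of silence

# Concatenate along the sample axis; torch.cat(segments, dim=1) is the torch equivalent
merged = np.concatenate([seg_a, seg_b], axis=1)
assert merged.shape == (1, 33075)  # 1.5 s total, no intermediate files written
```

    Because the join happens in memory, there is no per-segment encode/decode cycle at the boundaries, which is where click artifacts usually come from.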

    Hardware resources and performance

    The CosyVoice2 model is large, and running it in an environment with an NVIDIA GPU is recommended for the best generation speed. When processing long articles, the system infers segment by segment; if video memory (VRAM) is limited, lower the limit parameter of the segment_text function for a more stable run.
    import os
    import torch
    import torchaudio
    import re
    from cosyvoice.cli.cosyvoice import CosyVoice
    
    # Initialize the CosyVoice2 model
    # Make sure the path points to the folder containing the core weights and configuration files
    cosyvoice = CosyVoice('pretrained_models/CosyVoice2-0.5B')
    
    def segment_text(text, limit=80):
        """
        Divide long articles into segments of appropriate length based on punctuation marks to avoid interruptions in speech generation or memory overflow.
        """
        # Match common Chinese and English end-of-sentence punctuation
        pattern = r'([。!?;.!?\n])'
        parts = re.split(pattern, text)
        
        chunks = []
        current = ""
        for i in range(0, len(parts)-1, 2):
            sentence = parts[i] + parts[i+1]
            if len(current) + len(sentence) <= limit:
                current += sentence
            else:
                if current:
                    chunks.append(current.strip())
                current = sentence
        
        if current:
            chunks.append(current.strip())
        return [c for c in chunks if c]
    
    def run_tts_pipeline(text, spk_id, file_name):
        """
        Perform long text inference and combine information at the Tensor level
        """
        text_list = segment_text(text)
        combined_tensors = []
        
        print(f"Processing, the article has been divided into {len(text_list)} sections")
        
        for idx, segment in enumerate(text_list):
            # Call CosyVoice2 inference interface
            # Can be switched to inference_zero_shot to use reference audio
            result = cosyvoice.inference_sft(segment, spk_id)
            combined_tensors.append(result['tts_speech'])
            print(f"Completed: {idx + 1}/{len(text_list)}")
    
        if combined_tensors:
            # Use torch.cat for seamless splicing
            final_audio = torch.cat(combined_tensors, dim=1)
            # Save as wav using the model's native sampling rate
            torchaudio.save(file_name, final_audio, cosyvoice.sample_rate)
            print(f"Task successful! File saved to: {file_name}")
    
    if __name__ == "__main__":
        long_content = "Paste the content of your long article here, and this code will automatically handle segmentation and merging."
        run_tts_pipeline(long_content, 'Chinese female', 'output_v2.wav')
    


    CosyVoice 2 subtitle synchronization generation

    Timestamp alignment logic

    To generate accurate SRT subtitles, the core is to obtain the accurate length (Duration) of each piece of audio. When PyTorch processes audio tensors, the exact number of seconds can be calculated through the sample rate (Sample Rate) and tensor length, thereby establishing a correspondence between text and the timeline.

    Automated generation process

    While performing speech synthesis, this program will record the start and end time of each text, and automatically format it into a standard SRT file for easy import into Kdenlive or other editing software.
    import torch
    import torchaudio
    import re
    from cosyvoice.cli.cosyvoice import CosyVoice
    
    # Initialize CosyVoice 2
    cosyvoice = CosyVoice('pretrained_models/CosyVoice2-0.5B')
    
    def format_srt_time(seconds):
        """Convert seconds to SRT time format HH:MM:SS,mmm"""
        milliseconds = int((seconds - int(seconds)) * 1000)
        seconds = int(seconds)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return f"{hours:02}:{minutes:02}:{seconds:02},{milliseconds:03}"
    
    def generate_audio_and_srt(full_text, speaker_id, output_wav, output_srt):
        # Split long articles at common Chinese and English end-of-sentence punctuation
        segments = re.split(r'([。!?;.!?\n])', full_text)
        chunks = []
        for i in range(0, len(segments)-1, 2):
            text = (segments[i] + segments[i+1]).strip()
            if text: chunks.append(text)
    
        audio_list = []
        srt_entries = []
        current_time = 0.0
        sample_rate = cosyvoice.sample_rate
    
        print(f"Start processing {len(chunks)} text...")
    
        for i, chunk in enumerate(chunks):
            # Reasoning to generate speech tensor
            output = cosyvoice.inference_sft(chunk, speaker_id)
            audio_tensor = output['tts_speech']
            audio_list.append(audio_tensor)
    
            # Calculate the number of seconds this audio segment lasts (tensor length / sampling rate)
            duration = audio_tensor.shape[1] / sample_rate
            end_time = current_time + duration
    
            # Create SRT entries
            srt_entries.append(
                f"{i+1}\n"
                f"{format_srt_time(current_time)} --> {format_srt_time(end_time)}\n"
                f"{chunk}\n"
            )
    
            current_time = end_time
            print(f"Alignment of segment {i+1} completed")
    
        # Merge and save audio
        combined_audio = torch.cat(audio_list, dim=1)
        torchaudio.save(output_wav, combined_audio, sample_rate)
    
        # Save SRT file
        with open(output_srt, 'w', encoding='utf-8') as f:
            f.write("\n".join(srt_entries))
    
        print(f"Completed! Audio: {output_wav}, subtitles: {output_srt}")
    
    if __name__ == "__main__":
        article = "This is a long article example. [laughter] We can accurately calculate the time of each sentence. In this way, it will be automatically aligned when imported into Kdenlive."
        generate_audio_and_srt(article, 'Chinese female', 'output.wav', 'output.srt')
    

    Things to note when importing into editing software

    When using the generated SRT and WAV files, please note the following points:

    CosyVoice 2 Command Control

    Emotional and nonverbal symbolic control

    CosyVoice 2 supports inserting specific tags into the text to control the emotional expression of the voice or add non-verbal sounds. These tags significantly improve the expressiveness of the synthesized speech, turning the AI from a rigid reader into a voice with emotional ups and downs.

    Core tag list

    When using, please embed the tag directly into the text. It is recommended to leave appropriate spaces before and after the tag to achieve the best connection effect:

    Long article processing and tag embedding

    When dealing with long text, the logic should include preserving these special tags to ensure that the segmentation algorithm does not cut the tags off. The following code shows how to apply these instructions in a long text flow.
    import os
    import torch
    import torchaudio
    import re
    from cosyvoice.cli.cosyvoice import CosyVoice
    
    # Initialize the model
    cosyvoice = CosyVoice('pretrained_models/CosyVoice2-0.5B')
    
    def segment_text_with_tags(text, limit=100):
        """
        Split long text while ensuring tags like [laughter] are not cut
        """
        # Match common Chinese and English punctuation marks and newlines
        pattern = r'([。!?;.!?\n])'
        parts = re.split(pattern, text)
        
        chunks = []
        current = ""
        for i in range(0, len(parts)-1, 2):
            sentence = parts[i] + parts[i+1]
            if len(current) + len(sentence) <= limit:
                current += sentence
            else:
                if current:
                    chunks.append(current.strip())
                current = sentence
        
        if current:
            chunks.append(current.strip())
        return chunks
    
    def generate_expressive_audio(text, spk_id, output_path):
        """
        Generate long speech containing emotional instructions
        """
        segments = segment_text_with_tags(text)
        audio_data = []
    
        for idx, seg in enumerate(segments):
            # Use instruct mode for better tag execution
            # If you use sft mode, basic tags are also supported, but instruct mode is more precise for emotional control.
            output = cosyvoice.inference_instruct(seg, spk_id, 'Control tone and emotion')
            audio_data.append(output['tts_speech'])
            print(f"Processing paragraphs {idx+1}/{len(segments)}")
    
        if audio_data:
            final_wav = torch.cat(audio_data, dim=1)
            torchaudio.save(output_path, final_wav, cosyvoice.sample_rate)
            print(f"Message containing emotion command has been saved: {output_path}")
    
    if __name__ == "__main__":
        #Example: long text embedding emotion tags
        rich_text = "This is great news! [laughter] I can't believe it. [surprise] But if this messes up, [angry] I will be very angry."
        generate_expressive_audio(rich_text, 'Chinese female', 'expressive_output.wav')
    


    Speech recognition development

    Development process and key stages

    Developing an ASR (Automatic Speech Recognition) system usually follows this core path: first, audio preprocessing (such as noise reduction, VAD voice activity detection and feature extraction); then model inference, which converts acoustic signals into text probabilities; finally, post-processing (such as punctuation restoration and inverse text normalization, ITN) produces the final text. Modern development has shifted from traditional HMMs to end-to-end neural network architectures, greatly simplifying development complexity.

    Mainstream ASR development models and frameworks

    category Tools/Models Development features for 2026
    base model OpenAI Whisper (V3) Industry standard, with strong noise immunity and multi-language support, it is most suitable for transcribing long audio files.
    Live streaming NVIDIA Parakeet-TDT Designed for ultra-low latency, supports streaming, and is suitable for AI voice assistants.
    Domestic optimization FunASR / Yating engine It is deeply optimized for Chinese, Chinese-English mixed and Taiwanese accents, and supports timestamp and speaker recognition.
    Deployment framework Faster-Whisper / Sherpa-ONNX Significantly improves inference speed and reduces memory usage, making it suitable for running on edge devices or local servers.

    Technical indicators faced by developers

    When developing an ASR system, focus on monitoring CER (Character Error Rate) to assess accuracy. For real-time applications, RTF (real-time factor) and latency are crucial: speech must be processed much faster than it is spoken. The development focus in 2026 has shifted to "long text memory" and "context awareness", such as integrating an LLM to correct recognition errors in professional terminology or specific industries.
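    CER can be computed as a character-level Levenshtein edit distance divided by the reference length. A minimal, framework-free sketch (the function name is illustrative):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / reference length, at character level."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

# One inserted character out of an 18-character reference -> CER of 1/18
cer = character_error_rate("speech recognition", "speech wrecognition")
```

    English systems are usually benchmarked with WER (word error rate) instead; the computation is identical but over word tokens rather than characters.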

    API and cloud integrated development

    If developers want to launch quickly, they usually call cloud APIs. Deepgram and AssemblyAI are favored in 2026 for their low latency and rich metadata (such as emotion detection and key-point summaries). The Microsoft Azure Speech SDK provides the most complete custom model fine-tuning (Custom Speech) interface, allowing developers to upload domain-specific text data to fix inaccurate recognition of specialized vocabulary in fields such as medicine and law.

    Deployment and environment selection recommendations

    For individual developers, it is recommended to use the Hugging Face Transformers library with PyTorch for quick experiments. If the application involves privacy (such as medical records), use Whisper.cpp or Vosk for a completely offline local deployment. To build a large-scale voice service, Triton Inference Server or Docker containerization enables efficient scheduling and scaling of the ASR model.



    JavaScript drawing

    Canvas and Context

    Basic concepts of Canvas

    The HTML5 <canvas> element is an area that can be drawn on with JavaScript, allowing 2D and 3D images to be rendered on a web page. It is a container for programmatic drawing operations such as lines, shapes and pictures, suitable for applications like games and graphics editors that require real-time rendering.

    The following is the basic syntax of the canvas element:

    <canvas id="myCanvas" width="500" height="500"></canvas>

    The role of getContext

    To draw content on a canvas element, you must use the getContext method, which returns the drawing context; the most commonly used option is "2d". It returns a CanvasRenderingContext2D object that provides many drawing methods.

    For example, the following JavaScript code gets the 2D drawing context of the canvas:

    var canvas = document.getElementById("myCanvas");
    var ctx = canvas.getContext("2d");

    Basic drawing operations

    The drawing context obtained with getContext("2d") can perform basic operations such as drawing lines, drawing rectangles, and filling colors. For example:

    Sample code:

    ctx.fillStyle = "blue";
    ctx.fillRect(50, 50, 100, 100); // Draw a blue rectangle
    ctx.strokeStyle = "red";
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(200, 200);
    ctx.stroke(); // Draw a red line

    Clear Canvas

    To clear images in the canvas, use the clearRect(x, y, width, height) method. For example, the code to clear the entire canvas is:

    ctx.clearRect(0, 0, canvas.width, canvas.height);

    Dynamic Drawing and Animation

    Using requestAnimationFrame() enables smooth animation. A dynamic effect is drawn by clearing the previous frame before each screen update. Here is a simple animation example:

    let x = 0;    // horizontal position of the square
    const y = 50; // vertical position of the square

    function draw() {
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      ctx.fillRect(x, y, 50, 50); // draw a square
      x += 1; // update the position
      requestAnimationFrame(draw);
    }
    draw();

    Things to note when using Canvas

    The size of the canvas should be set with its HTML width/height attributes; resizing it with CSS may distort the image. Also, canvas is not intended to replace high-resolution images, but to be used for real-time generation and dynamic drawing.
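A minimal illustration of the difference (the element ids are illustrative): the first canvas keeps its default 300×150 drawing buffer and lets CSS stretch it, so the pixels are blown up and the image looks blurry; the second sets the buffer size itself and stays crisp.

```html
<!-- Blurry: buffer is 300×150, CSS stretches it to 600×300 -->
<canvas id="blurry" style="width:600px; height:300px;"></canvas>

<!-- Crisp: the drawing buffer itself is 600×300 -->
<canvas id="crisp" width="600" height="300"></canvas>
```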



    style.transform

    Basic concepts

    style.transform is a CSS property that applies 2D or 3D transformations, such as rotation, scaling, translation and skewing, to elements.

    scale() is the scaling function; its syntax is:

    
    transform: scale(sx [, sy]);
    

    where sx is the horizontal scale factor and sy is the optional vertical scale factor; if sy is omitted, it defaults to sx.

    ---

    JavaScript setting method

    const el = document.getElementById("target");
    el.style.transform = "scale(1.5)";      // both x and y enlarged by 1.5 times
    el.style.transform = "scale(1.5, 0.5)"; // 1.5× horizontally, 0.5× vertically (overwrites the line above)
    ---

    It does not change width and height, but it affects the visual size

    scale() is a "visual transformation": it does not change the element's actual DOM layout properties (such as offsetWidth or clientWidth), but it does change the value returned by getBoundingClientRect().

    el.getBoundingClientRect().width // reflects the effect of scale
    el.offsetWidth // original layout width, not affected by scale
    ---

    Common applications

    ---

    example

    
    <style>
      #box {
        width: 100px;
        height: 100px;
        background: skyblue;
        transition: transform 0.3s;
      }
      #box:hover {
        transform: scale(1.5);
      }
    </style>
    
    <div id="box"></div>
    
    ---

    Things to note



    Canvas drawing pie chart

    Example description

    The following example reads data from an HTML <table> and draws a pie chart with the arc() method of the native <canvas> API, without any external packages.

    ---

    HTML structure

    <table id="dataTable" border="1" style="margin:10px auto;">
      <tr><th>Category</th><th>Value</th></tr>
      <tr><td>Apple</td><td>30</td></tr>
      <tr><td>Banana</td><td>15</td></tr>
      <tr><td>Cherry</td><td>25</td></tr>
      <tr><td>Mango</td><td>20</td></tr>
    </table>
    
    <canvas id="pieCanvas" width="400" height="400" style="display:block; margin:auto; border:1px solid #aaa;"></canvas>
    ---

    JavaScript program

    
    const table = document.getElementById("dataTable");
    const canvas = document.getElementById("pieCanvas");
    const ctx = canvas.getContext("2d");
    
    const labels = [];
    const values = [];
    
    for (let i = 1; i < table.rows.length; i++) { // skip the header row
      const row = table.rows[i];
      labels.push(row.cells[0].textContent);
      values.push(parseFloat(row.cells[1].textContent));
    }
    
    // Compute the total
    const total = values.reduce((a, b) => a + b, 0);
    
    // Draw the pie chart
    let startAngle = 0;
    const centerX = canvas.width / 2;
    const centerY = canvas.height / 2;
    const radius = 120;
    
    // Automatic color matching
    const colors = ["#FF6384", "#36A2EB", "#FFCE56", "#4BC0C0", "#9966FF", "#FF9F40"];
    
    for (let i = 0; i < values.length; i++) {
      const sliceAngle = (values[i] / total) * 2 * Math.PI;
      const endAngle = startAngle + sliceAngle;
    
      // Draw the pie slice
      ctx.beginPath();
      ctx.moveTo(centerX, centerY);
      ctx.arc(centerX, centerY, radius, startAngle, endAngle);
      ctx.closePath();
      ctx.fillStyle = colors[i % colors.length];
      ctx.fill();
    
      // Label text
      const midAngle = startAngle + sliceAngle / 2;
      const textX = centerX + Math.cos(midAngle) * (radius + 20);
      const textY = centerY + Math.sin(midAngle) * (radius + 20);
      ctx.fillStyle = "black";
      ctx.font = "14px sans-serif";
      ctx.textAlign = "center";
      ctx.fillText(labels[i], textX, textY);
    
      startAngle = endAngle;
    }
    
    // Title
    ctx.font = "bold 16px sans-serif";
    ctx.textAlign = "center";
    ctx.fillText("Fruit sales proportion", centerX, centerY - radius - 30);
    
    ---

    Explanation

    ---

    Extensions

    You can add mouse events (such as hovering to zoom in or show percentages), or use requestAnimationFrame() to add animation effects.



    SVG

    Concept

    SVG (Scalable Vector Graphics) is an XML-based vector graphics format that can draw lines, shapes and text on web pages, and supports scaling and animation. Unlike bitmaps, SVG is not distorted when zoomed in or out, making it suitable for applications such as charts, icons, maps and flowcharts.

    Features

    Basic syntax example

    <svg width="200" height="100">
      <rect x="10" y="10" width="50" height="50" fill="blue" />
      <circle cx="100" cy="35" r="25" fill="green" />
      <line x1="150" y1="10" x2="190" y2="60" stroke="red" stroke-width="2" />
      <text x="10" y="90" font-size="14" fill="black">This is SVG</text>
    </svg>

    Common elements

    events and interactions

    <svg width="100" height="100">
      <circle cx="50" cy="50" r="40" fill="orange" onclick="alert('You clicked on the circle')" />
    </svg>

    Animations and styles

    Animations can be applied via CSS or the <animate> tag:

    
    <circle cx="30" cy="50" r="20" fill="blue">
      <animate attributeName="cx" from="30" to="170" dur="2s" repeatCount="indefinite" />
    </circle>
    

    Combined with JavaScript

    
    <svg id="mysvg" width="200" height="100">
      <circle id="c1" cx="50" cy="50" r="30" fill="gray" />
    </svg>
    
    <script>
      document.getElementById("c1").setAttribute("fill", "red");
    </script>
    

    Application scope

    Conclusion

    SVG is one of the most important graphics standards for web front-ends. It scales without loss of resolution, supports interactivity and animation, and integrates seamlessly with HTML/CSS/JavaScript. It is suitable for graphics that must remain precise at any scale.



    SVG reuse pattern

    Purpose

    In SVG, a pattern can be defined once with <symbol> or <defs> and then referenced repeatedly elsewhere with <use>, saving code and improving consistency.

    Basic syntax

    
    <svg width="0" height="0" style="position:absolute">
      <symbol id="star" viewBox="0 0 100 100">
        <!-- No fill attribute here, so each <use> can supply its own fill -->
        <polygon points="50,5 61,39 98,39 68,59 79,91 50,70 21,91 32,59 2,39 39,39"
                 stroke="black" stroke-width="2"/>
      </symbol>
    </svg>
    
    <svg width="200" height="100">
      <use href="#star" x="0" y="0" width="50" height="50" fill="gold"/>
      <use href="#star" x="60" y="0" width="50" height="50" fill="red"/>
      <use href="#star" x="120" y="0" width="50" height="50" fill="blue"/>
    </svg>
    

    Display

    Explanation

    Property inheritance

    On <use> you can set fill, stroke and other inheritable presentation properties, but they only take effect where the referenced content does not set those attributes explicitly; attributes written directly on the referenced shape cannot be overridden by <use>.

    Application scope

    compatibility

    Conclusion

    Through <symbol> + <use>, SVG enables componentized, modular graphics development: definitions can be reused, and their styles and positions are convenient to manage. This is well suited to graphic design and data visualization.



    WebGL

    Concept

    WebGL (Web Graphics Library) is a JavaScript API based on OpenGL ES that uses the HTML5 <canvas> element to perform hardware-accelerated 2D and 3D drawing in the browser, without any plug-ins.

    Features

    Simple example

    Draw a colored triangle:

    <canvas id="glCanvas" width="300" height="300"></canvas>
    <script>
      const canvas = document.getElementById('glCanvas');
      const gl = canvas.getContext('webgl');
    
      if (!gl) {
        alert("Your browser does not support WebGL");
      }
    
      const vertexShaderSource = `
        attribute vec2 a_position;
        void main() {
          gl_Position = vec4(a_position, 0, 1);
        }
      `;
    
      const fragmentShaderSource = `
        void main() {
          gl_FragColor = vec4(1, 0, 0, 1); // red
        }
      `;
    
      function createShader(gl, type, source) {
        const shader = gl.createShader(type);
        gl.shaderSource(shader, source);
        gl.compileShader(shader);
        if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
          console.error(gl.getShaderInfoLog(shader)); // report shader compile errors
        }
        return shader;
      }
    
      const vertexShader = createShader(gl, gl.VERTEX_SHADER, vertexShaderSource);
      const fragmentShader = createShader(gl, gl.FRAGMENT_SHADER, fragmentShaderSource);
    
      const program = gl.createProgram();
      gl.attachShader(program, vertexShader);
      gl.attachShader(program, fragmentShader);
      gl.linkProgram(program);
    
      gl.useProgram(program);
    
      const positionBuffer = gl.createBuffer();
      gl.bindBuffer(gl.ARRAY_BUFFER, positionBuffer);
      gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([
        0, 1,
       -1, -1,
        1, -1
      ]), gl.STATIC_DRAW);
    
      const posAttribLoc = gl.getAttribLocation(program, "a_position");
      gl.enableVertexAttribArray(posAttribLoc);
      gl.vertexAttribPointer(posAttribLoc, 2, gl.FLOAT, false, 0, 0);
    
      gl.clearColor(0, 0, 0, 1);
      gl.clear(gl.COLOR_BUFFER_BIT);
      gl.drawArrays(gl.TRIANGLES, 0, 3);
    </script>

    Result

    Application scope

    Commonly used kits

    Conclusion

    WebGL provides web developers with GPU-accelerated 3D graphics rendering and is one of the core technologies behind modern web games, digital art, simulation and visualization. Although native WebGL is relatively low-level, higher-level libraries can be used to simplify development.



    Spirograph

    What is a Spirograph?

    A Spirograph is a drawing technique for creating complex geometric patterns: a small circle rolls inside (or around) a fixed circle, and a pen offset from the rolling circle's center traces looping, wavy curves. Such patterns are often used in art and education to show the geometric beauty of mathematics.

    Implementing Spirograph in HTML5

    Here's an example of implementing a Spirograph using HTML5's <canvas> element and JavaScript:
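A minimal sketch follows (the canvas id "spiroCanvas" and the parameter values are illustrative). The hypotrochoid formula generates the curve's points from the fixed radius R, rolling radius r and pen offset d; the canvas code then traces them.

```javascript
// Hypotrochoid point generator: a circle of radius r rolls inside a fixed
// circle of radius R, with the pen at distance d from the rolling center.
function hypotrochoid(R, r, d, steps) {
  const turns = r / gcd(R, r); // enough revolutions for the curve to close
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = (i / steps) * 2 * Math.PI * turns;
    points.push({
      x: (R - r) * Math.cos(t) + d * Math.cos(((R - r) / r) * t),
      y: (R - r) * Math.sin(t) - d * Math.sin(((R - r) / r) * t)
    });
  }
  return points;
}

function gcd(a, b) { return b === 0 ? a : gcd(b, a % b); }

// Browser-only part: trace the curve on a <canvas id="spiroCanvas">.
if (typeof document !== "undefined") {
  const canvas = document.getElementById("spiroCanvas");
  const ctx = canvas.getContext("2d");
  const pts = hypotrochoid(90, 52, 60, 2000);
  ctx.translate(canvas.width / 2, canvas.height / 2); // center the figure
  ctx.beginPath();
  ctx.moveTo(pts[0].x, pts[0].y);
  for (const p of pts) ctx.lineTo(p.x, p.y);
  ctx.strokeStyle = "purple";
  ctx.stroke();
}
```

Varying R, r and d produces different numbers of loops and petal shapes.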



    Vector Graphics JavaScript Libraries Comparison

    Library name Syntax expressiveness Graphic types Target users Interaction support Animation support
    Mermaid.js Extremely high (using Markdown-like syntax) Flow chart, sequence chart, Gantt chart, ER chart, Class chart Document visualization, rapid prototyping Limited support Partial support
    D3.js Medium (needs to understand data binding and DOM operations) Almost any graphics (extremely customizable) Advanced data visualization developer Full support Full support
    Cytoscape.js High (nodes and edges defined in JSON) Network diagram, flow chart Bioinformatics, social network analysis Full support Partial support
    Vega / Vega-Lite High (use JSON declarative description of chart) Statistical charts (bar charts, scatter charts, etc.) Data Science, Dashboard Design support Partial support
    Graphviz via Viz.js High (DOT syntax is similar to text programming) Flow chart, graph theory structure Academic use, quick architecture diagram Not supported Not supported
    JSXGraph High (geometric semantics are clear) Geometric figures, coordinate diagrams mathematics education support support


    Chart.js

    Overview

    Chart.js is an open-source, lightweight and powerful JavaScript charting library. It draws various interactive charts on the HTML5 <canvas> element. It is known for its simple API, attractive default styles and highly customizable options, and is suitable for quickly visualizing data on websites or applications.

    ---

    Main features

    ---

    Installation and use

    1. CDN loading

    
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    

    2. NPM installation

    
    npm install chart.js
    

    3. Basic usage examples

    <canvas id="myChart"></canvas>
    
    <script>
    const ctx = document.getElementById('myChart').getContext('2d');
    
    new Chart(ctx, {
      type: 'bar',
      data: {
        labels: ['red', 'blue', 'yellow', 'green', 'purple', 'orange'],
        datasets: [{
          label: 'votes',
          data: [12, 19, 3, 5, 2, 3],
          backgroundColor: [
            'rgba(255, 99, 132, 0.6)',
            'rgba(54, 162, 235, 0.6)',
            'rgba(255, 206, 86, 0.6)',
            'rgba(75, 192, 192, 0.6)',
            'rgba(153, 102, 255, 0.6)',
            'rgba(255, 159, 64, 0.6)'
          ],
          borderWidth: 1
        }]
      },
      options: {
        responsive: true,
        scales: {
          y: { beginAtZero: true }
        }
      }
    });
    </script>
    ---

    Common chart types

    Chart type / type value / Description
    Line chart / line / Displays time-series or trend data.
    Bar chart / bar / Compares values across categories.
    Pie chart / pie / Shows the proportional distribution of a whole.
    Donut chart / doughnut / A pie-chart variant whose hollow center can display a title.
    Radar chart / radar / Compares multidimensional data.
    Polar area chart / polarArea / Combines aspects of pie and bar charts.
    ---

    version check

    You can check the version of Chart.js using:

    
    console.log(Chart.version);
    
    ---

    Advantages and Disadvantages

    advantage:

    shortcoming:

    ---

    official resources



    Draw a pie chart - Chart.js

    Example description

    The following example demonstrates how to read data from an HTML <table> and use JavaScript to dynamically draw a pie chart. This example uses Chart.js, which is easy to use and supports automatic coloring and animation.

    ---

    HTML structure

    <!-- Table data -->
    <table id="dataTable" border="1" style="margin:10px auto;">
      <tr><th>Category</th><th>Value</th></tr>
      <tr><td>Apple</td><td>30</td></tr>
      <tr><td>Banana</td><td>15</td></tr>
      <tr><td>Cherry</td><td>25</td></tr>
      <tr><td>Mango</td><td>20</td></tr>
    </table>
    
    <!-- Pie chart container -->
    <canvas id="pieChart" width="400" height="400"></canvas>
    
    <!-- Load Chart.js -->
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    ---

    JavaScript program

    // Read table data
    const table = document.getElementById("dataTable");
    const labels = [];
    const values = [];
    
    for (let i = 1; i < table.rows.length; i++) { // skip the header row
      const row = table.rows[i];
      labels.push(row.cells[0].textContent);
      values.push(parseFloat(row.cells[1].textContent));
    }
    
    // Create the Chart.js pie chart
    const ctx = document.getElementById("pieChart").getContext("2d");
    new Chart(ctx, {
      type: "pie",
      data: {
        labels: labels,
        datasets: [{
          data: values,
          backgroundColor: [
            "rgba(255, 99, 132, 0.7)",
            "rgba(54, 162, 235, 0.7)",
            "rgba(255, 206, 86, 0.7)",
            "rgba(75, 192, 192, 0.7)"
          ],
          borderColor: "white",
          borderWidth: 2
        }]
      },
      options: {
        responsive: true,
        plugins: {
          legend: { position: "bottom" },
          title: { display: true, text: "Fruit sales proportion" }
        }
      }
    });
    
    ---

    Explanation

    ---

    Extended application

    If you want to draw in pure JavaScript (without an external library), you can use CanvasRenderingContext2D.arc() to draw the wedges yourself, as shown in the earlier Canvas pie chart example.



    Example of drawing UML diagram using HTML

    1. Use SVG to draw a simple class diagram

    In HTML you can use <svg> tags to draw basic UML class diagrams. Here's an example of using rectangles and text to represent a simple class.

    <svg width="300" height="200">
        <rect x="50" y="20" width="200" height="30" fill="lightblue" stroke="black"/>
        <text x="60" y="40" font-family="Arial" font-size="16">Class Name</text>
        
        <rect x="50" y="50" width="200" height="50" fill="white" stroke="black"/>
        <text x="60" y="70" font-family="Arial" font-size="14">+ attribute1 : Type</text>
        <text x="60" y="90" font-family="Arial" font-size="14">+ attribute2 : Type</text>
        
        <rect x="50" y="100" width="200" height="50" fill="white" stroke="black"/>
        <text x="60" y="120" font-family="Arial" font-size="14">+ method1() : ReturnType</text>
        <text x="60" y="140" font-family="Arial" font-size="14">+ method2() : ReturnType</text>
    </svg>
    

    2. Customize UML elements using HTML and CSS

    Different UML elements can be defined using HTML and CSS styles. The following example shows how to use <div> elements and CSS to draw a class box and adjust its style to mimic the structure of a UML class diagram.

    <style>
    .class-box {
        width: 200px;
        border: 1px solid black;
        margin: 10px;
    }
    .header {
        background-color: lightblue;
        text-align: center;
        font-weight: bold;
    }
    .attributes, .methods {
        padding: 10px;
        border-top: 1px solid black;
    }
    </style>
    
    <div class="class-box">
        <div class="header">ClassName</div>
        <div class="attributes">
            + attribute1 : Type <br>
            + attribute2 : Type
        </div>
        <div class="methods">
            + method1() : ReturnType <br>
            + method2() : ReturnType
        </div>
    </div>
    

    3. Use mermaid.js to draw more complex UML diagrams

    To draw more complex UML diagrams in HTML, you can use an external JavaScript library such as mermaid.js. It supports a variety of UML diagrams and can be embedded directly in HTML. First reference mermaid.js, then write the UML diagram definitions inside <pre> tags.

    <script type="module">
    import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
    mermaid.initialize({ startOnLoad: true });
    </script>
    
    <pre class="mermaid">
    classDiagram
        Class01 <|-- Class02 : Inheritance
        Class01 : +method1() void
        Class02 : +method2() void
        Class03 : +attribute int
        Class04 : +method() void
    </pre>
    

    With mermaid.js, more complex and clearer UML diagrams can be drawn easily, and different diagram types are supported.



    Mermaid draws complex UML examples

    Example 1: Complex class diagram

    This example shows inheritance, aggregation and association between classes.

    <pre class="mermaid">
    classDiagram
        Animal <|-- Mammal
        Animal <|-- Bird
        Mammal o-- Dog : has-a
        Bird --> Wing : has-a
        class Animal {
            +String name
            +int age
            +eat() void
        }
        class Mammal {
            +hasFur() bool
        }
        class Dog {
            +bark() void
        }
        class Bird {
            +fly() void
        }
        class Wing {
            +wingSpan int
        }
    </pre>
    

    Explanation: This example shows several relationships: inheritance (Animal <|-- Mammal, Animal <|-- Bird), aggregation (Mammal o-- Dog) and association (Bird --> Wing).

    Example 2: Associations and Multiplicity

    This example shows how to represent multiplicity (1..*, 0..1, etc.) and roles between classes.

    <pre class="mermaid">
    classDiagram
        Customer "1" --> "0..*" Order : places
        Order "1" --> "1" Payment : includes
        class Customer {
            +String name
            +String email
            +placeOrder() void
        }
        class Order {
            +int orderId
            +String date
            +calculateTotal() float
        }
        class Payment {
            +float amount
            +String method
            +processPayment() void
        }
    </pre>
    

    Explanation: One Customer places zero or more Orders ("1" --> "0..*" with the role places), and each Order includes exactly one Payment.

    Example 3: Interfaces and abstract classes

    This example shows how to define interfaces and abstract classes in Mermaid.js.

    <pre class="mermaid">
    classDiagram
        class Shape {
            <<abstract>>
            +area() float
            +perimeter() float
        }
        Shape <|-- Rectangle
        Shape <|-- Circle
        class Rectangle {
            +width float
            +height float
            +area() float
            +perimeter() float
        }
        class Circle {
            +radius float
            +area() float
            +perimeter() float
        }
    </pre>
    

    Explanation: Shape is marked <<abstract>> and declares area() and perimeter(); Rectangle and Circle inherit from it and provide concrete implementations.

    Example 4: Complex class and interface implementation

    This example shows a mix of class inheritance and interface implementation.

    <pre class="mermaid">
    classDiagram
        class Flyable {
            <<interface>>
            +fly() void
        }
        class Bird {
            +String species
            +String color
            +sing() void
        }
        class Airplane {
            +String model
            +int capacity
            +takeOff() void
        }
        Bird ..|> Flyable : implements
        Airplane ..|> Flyable : implements
    </pre>
    

    Explanation: Flyable is marked <<interface>>; Bird and Airplane realize it with the dashed realization arrow ..|>.



    Mermaid Test Tool

    Generate results

    flowchart TD
        A[Start] --> B{Do you need to continue?}
        B -- Yes --> C[Perform operation]
        B -- No --> D[End]
        C --> D


    How to check for Mermaid syntax errors

    1. Use Mermaid Live Editor

    Mermaid provides the official Mermaid Live Editor, where you can test and check for syntax errors on the fly. After pasting Mermaid syntax, the editor displays specific error messages for any errors, letting you troubleshoot the problem faster.

    2. Reduce complexity for testing

    If your Mermaid chart is too complex, segmented testing is recommended: first remove some classes or relationships, leaving only the most basic structure, then gradually add elements back. This lets you identify the source of a syntax error more quickly.

    3. Confirm Mermaid.js version

    Different versions of Mermaid.js may have different support for the syntax. Make sure you are using the latest version, or verify in a test environment that your version of Mermaid.js supports the syntax features used.

    4. Check for common errors

    5. Use developer tools to view error messages

    View the JavaScript console in the browser's developer tools. If the Mermaid chart is not generated correctly, specific error messages or tips may be displayed in the console to help you identify syntax errors.

    6. Refer to official documents

    The official Mermaid documentation provides detailed syntax guidelines to help you confirm whether the syntax is used correctly. The documentation is located on the Mermaid.js official website.



    flow chart

    Flowchart overview

    Below is a simple flowchart example illustrating the logical relationship between decisions and actions.

    Flow chart example

    flowchart TD
        A[Start] --> B{Do you need to continue?}
        B -- Yes --> C[Perform operation]
        B -- No --> D[End]
        C --> D

    Example description

    How to use

    Paste the flowchart syntax above into a Mermaid-enabled tool, such as a Markdown editor or the Mermaid online tool, to generate the graph.



    Mermaid.js zoom slider library

    Function description

    This JavaScript library adds zoom-slider functionality to Mermaid.js charts, allowing users to control chart scaling with an <input type="range"> element. The library uses transform: scale() to achieve visual scaling without re-rendering Mermaid.

    Library code (mermaidZoomSlider.js)

    // mermaidZoomSlider.js
    export function setupMermaidZoomSlider({
      sliderId = "zoomSlider",
      diagramContainerId = "mermaidContainer",
      min = 0.1,
      max = 3,
      step = 0.1,
      initial = 1
    } = {}) {
      window.addEventListener("load", () => {
        const slider = document.getElementById(sliderId);
        const container = document.getElementById(diagramContainerId);
    
        if (!slider || !container) {
          console.warn("Mermaid zoom slider: Missing slider or container element");
          return;
        }
    
        // Initialize slider properties
        slider.min = min;
        slider.max = max;
        slider.step = step;
        slider.value = initial;
    
        // Set initial zoom
        container.style.transformOrigin = "top left";
        container.style.transform = `scale(${initial})`;
    
        // Event listener: zoom
        slider.addEventListener("input", () => {
          const scale = parseFloat(slider.value);
          container.style.transform = `scale(${scale})`;
        });
      });
    }

    Usage

    <!-- HTML -->
    <div>
      <input type="range" id="zoomSlider">
    </div>
    <div id="mermaidContainer">
      <pre class="mermaid">
        graph TD;
          A-->B;
          B-->C;
      </pre>
    </div>
    
    <!-- JavaScript module introduction -->
    <script type="module">
      import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs";
      import { setupMermaidZoomSlider } from "./mermaidZoomSlider.js";
    
      mermaid.initialize({ startOnLoad: true });
    
      setupMermaidZoomSlider({
        sliderId: "zoomSlider",
        diagramContainerId: "mermaidContainer",
        min: 0.2,
        max: 3,
        step: 0.1,
        initial: 1
      });
    </script>

    Parameter description

    suggestion

    If you need advanced functions such as dragging and moving, zoom reset, etc., you can further expand this function library, such as integrating mouse drag and zoom reset buttons.



    Various styles of Mermaid.js lines

    Basic line syntax

    In Mermaid.js diagrams, markers such as -->, -.-> and ==> are used to establish connections between nodes. Different symbols represent different line styles.

    Common line styles

    Syntax / Style / Description
    --> / ──> / General solid arrow
    ---> / ───> / Same as --> (the syntax is tolerant of extra dashes)
    -- text --> / ── text ──> / Solid arrow with a text label
    -.-> / -.-> / Dashed arrow
    -. text .-> / -. text .-> / Dashed arrow with text
    ==> / ===> / Thick solid arrow
    == text ==> / == text ==> / Thick arrow with text
    --o / ──○ / Circle endpoint without arrowhead (common in class diagrams)
    --|> / ──▷ / Hollow triangle arrow (inheritance in class diagrams)
    -->|label| / ──> with label / Alternative label syntax using pipes

    Usage examples

    graph TD
      A[Start] --> B[Step 1]
      B -.-> C[Asynchronous processing]
      C ==> D[Strong dependency]
      D -- text --> E[Edge with text]
      E --o F[Circle endpoint]
      F --> G[Solid arrow]

    Other instructions

    Conclusion

    Mermaid.js provides a variety of line syntax styles, allowing users to clearly express processes, logic, and relationships. Through the combination of solid lines, dotted lines, thick lines, and graphic endpoints, simple and well-structured diagrams can be created.



    D3.js

    What is D3.js?

    D3.js (Data-Driven Documents) is an open-source JavaScript library for transforming data into dynamic, interactive visualizations. It uses web standard technologies such as SVG, HTML and CSS, and provides powerful tools for manipulating data and drawing graphics.

    Features of D3.js

    Key features of D3.js

    1. Selecting elements: use CSS-like selectors to select and manipulate DOM elements, e.g. d3.select() and d3.selectAll().
    2. Data binding: bind data to DOM elements and update the view based on the data.
    3. Scales: provides scale functions that conveniently map data values to pixels.
    4. Drawing graphics: use SVG path and shape tools to create circles, rectangles, curves and other diagrams.
    5. Transitions: built-in animation support for smooth data changes.
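The scale idea above can be sketched in plain JavaScript. This is a simplified stand-in written for illustration only, not D3's actual d3.scaleLinear implementation:

```javascript
// Map a data domain [d0, d1] linearly onto a pixel range [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return value => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Map data values 0..100 onto pixel positions 0..500.
const x = linearScale([0, 100], [0, 500]);
console.log(x(0));   // 0
console.log(x(50));  // 250
console.log(x(100)); // 500
```

D3's real scales additionally handle clamping, ticks, inversion and non-linear variants.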

    Application examples

    D3.js is widely used in various data visualization scenarios, such as:

    learning resources

    To learn D3.js, you can refer to the following resources:

    Conclusion

    D3.js is a powerful and flexible data visualization tool for developers who require highly customized charts and interactive effects. Although the learning curve is slightly higher, once mastered, its application potential is endless.



    D3.js tree diagram example

    Example description

    This example uses D3.js to draw a simple tree diagram, showing how to visualize hierarchical data.

    Sample code

    
    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8">
      <title>D3.js Tree Diagram Example</title>
      <script src="https://d3js.org/d3.v7.min.js"></script>
      <style>
        .node circle {
          fill: steelblue;
        }
        .node text {
          font: 12px sans-serif;
        }
        .link {
          fill: none;
          stroke: #ccc;
          stroke-width: 1.5px;
        }
      </style>
    </head>
    <body>
      <script>
        const width = 800;
        const height = 600;
    
        const treeData = {
          name: "CEO",
          children: [
            {
              name: "CTO",
              children: [
                { name: "Engineering Manager" },
                { name: "Product Manager" }
              ]
            },
            {
              name: "CFO",
              children: [
                { name: "Accountant" },
                { name: "Finance Analyst" }
              ]
            }
          ]
        };
    
        const svg = d3.select("body")
          .append("svg")
          .attr("width", width)
          .attr("height", height)
          .append("g")
          .attr("transform", "translate(40,40)");
    
        const treeLayout = d3.tree().size([height - 100, width - 160]);
    
        const root = d3.hierarchy(treeData);
        treeLayout(root);
    
        svg.selectAll(".link")
          .data(root.links())
          .enter()
          .append("path")
          .attr("class", "link")
          .attr("d", d3.linkHorizontal()
            .x(d => d.y)
            .y(d => d.x)
          );
    
        const nodes = svg.selectAll(".node")
          .data(root.descendants())
          .enter()
          .append("g")
          .attr("class", "node")
          .attr("transform", d => `translate(${d.y},${d.x})`);
    
        nodes.append("circle").attr("r", 5);
    
        nodes.append("text")
          .attr("dy", 3)
          .attr("x", d => d.children ? -10 : 10)
          .style("text-anchor", d => d.children ? "end" : "start")
          .text(d => d.data.name);
      </script>
    </body>
    </html>
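To clarify what d3.hierarchy and root.descendants()/root.links() do in the code above, here is a plain-JavaScript walker over the same treeData (illustration only, not D3's implementation; d3.hierarchy additionally computes depth, height, and parent references):

```javascript
// The same hierarchical data used in the tree diagram example
const treeData = {
  name: "CEO",
  children: [
    { name: "CTO", children: [{ name: "Engineering Manager" }, { name: "Product Manager" }] },
    { name: "CFO", children: [{ name: "Accountant" }, { name: "Finance Analyst" }] }
  ]
};

function descendants(node) {
  // Depth-first traversal collecting every node, root included
  const out = [node];
  (node.children || []).forEach(child => out.push(...descendants(child)));
  return out;
}

function links(node) {
  // One link object per parent-child pair
  const out = [];
  (node.children || []).forEach(child => {
    out.push({ source: node, target: child });
    out.push(...links(child));
  });
  return out;
}

console.log(descendants(treeData).length); // 7 nodes
console.log(links(treeData).length);       // 6 links
```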
    
    

    Results display

    After running this code, you will see a horizontal tree diagram: the root node (CEO) on the left, with curved links connecting it to child and leaf nodes toward the right.

    Applications and extensions

    This example can be extended to more complex hierarchies, or the style adjusted to suit different needs. For example:



    Rectangular treemap (Treemapping)

    Concept

    A rectangular treemap is a visualization technique that displays hierarchical data as nested rectangles. The area of each rectangle encodes a numeric value, such as sales or file size, and each rectangle can be subdivided to represent its subcategories.
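The area-proportional idea can be sketched in a few lines of plain JavaScript. sliceLayout is a hypothetical helper implementing the simplest "slice" layout; D3's d3.treemap provides this and better tiling algorithms (squarify, dice, etc.):

```javascript
// Simplest "slice" treemap layout: split a rectangle horizontally,
// giving each child a width (and thus area) proportional to its value
function sliceLayout(rect, values) {
  const total = values.reduce((a, b) => a + b, 0);
  let x = rect.x;
  return values.map(v => {
    const w = rect.width * (v / total);
    const r = { x: x, y: rect.y, width: w, height: rect.height };
    x += w;
    return r;
  });
}

const rects = sliceLayout({ x: 0, y: 0, width: 100, height: 50 }, [2, 1, 1]);
// Widths are 50, 25, 25: each child's area is proportional to its value
```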

    Application scenarios

    Example (made using D3.js)

    Advantages

    Things to note



    Cytoscape.js network diagram example

    Basic usage

    Cytoscape.js is a JavaScript library for drawing network graphs (Graph). Nodes and edges are defined in JSON; the syntax is simple, and it supports interaction and style customization.

    Simple network diagram example
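A minimal sketch of such a network definition (node and edge ids are placeholders; in the browser, the elements array is passed to the cytoscape() constructor):

```javascript
// Element definitions for a two-node, one-edge graph (Cytoscape.js JSON form)
const elements = [
  { data: { id: 'a', label: 'Node A' } },           // node
  { data: { id: 'b', label: 'Node B' } },           // node
  { data: { id: 'ab', source: 'a', target: 'b' } }  // edge from a to b
];

// In the browser (assumes cytoscape has been loaded from a CDN):
// const cy = cytoscape({
//   container: document.getElementById('cy'),
//   elements: elements,
//   layout: { name: 'circle' }
// });

// Edges are distinguished from nodes by having source/target fields
const edges = elements.filter(e => 'source' in e.data);
```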

    Description



    Cytoscape.js different application examples

    1. Circle Layout

    2. Drag and click interaction

    3. Style by Class

    Description



    JavaScript circuit diagram drawing library

    Library Suitability Features Interactive Notes
    JointJS ★★★★★ High drawing freedom; extensible circuit-component symbols ✔️ Can draw logic circuits and flowcharts; the free version is sufficient for most needs
    GoJS ★★★★☆ Powerful graphics and data-model support ✔️ Commercial software with a free trial; often used for production-line and circuit diagrams
    SVG.js ★★★☆☆ Lightweight; supports precise drawing ✔️ Components (resistors, capacitors, etc.) must be designed yourself; suited to fine-grained control
    Konva.js ★★★☆☆ High-performance Canvas-based drawing ✔️ Suited to interactive design tools with dragging and clicking
    ELK.js ★★☆☆☆ Excellent automatic layout ✖️ Layout algorithm only (can be paired with JointJS)


    JointJS displays basic circuit diagram components



    3D equation plotting

    Function description

    This tool takes x and y as independent variables and draws the equation z = f(x, y) as a 3D surface, with interactive mouse controls for rotation, zooming, and panning.

    HTML structure

    <div id="plot3d" style="width:100%; height:600px;"></div>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
    <script type="module">
      // Define the function z = f(x, y) (can be replaced by any equation)
      function computeZ(x, y) {
        return Math.sin(x) * Math.cos(y); // z = sin(x) * cos(y)
      }
    
      const xRange = numeric.linspace(-5, 5, 50);
      const yRange = numeric.linspace(-5, 5, 50);
    
      // Build z data: for a Plotly surface, z[i][j] pairs with y[i] and x[j],
      // so the outer loop must run over y
      const zValues = yRange.map(y =>
        xRange.map(x => computeZ(x, y))
      );
    
      const data = [{
        type: 'surface',
        x: xRange,
        y: yRange,
        z: zValues,
        colorscale: 'Viridis'
      }];
    
      const layout = {
        title: 'z = sin(x) * cos(y)',
        autosize: true,
        scene: {
          xaxis: { title: 'X axis' },
          yaxis: { title: 'Y axis' },
          zaxis: { title: 'Z axis' }
        }
      };
    
      Plotly.newPlot('plot3d', data, layout);
    </script>
    
    <!-- numeric.js provides numeric.linspace; module scripts are deferred, so this classic script still runs before the module above executes -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js"></script>

    Operating Instructions

    Function replaceable example

    Description

    This example uses Plotly.js for interactive 3D visualization, with numeric.js generating the numeric grids. You can freely change the body of the computeZ function to draw any three-dimensional surface.
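For instance, a radially symmetric "sombrero" surface can be substituted for the computeZ body (the guard at r = 0 uses the limit sin(r)/r → 1):

```javascript
// Alternative surface for computeZ: z = sin(r) / r with r = sqrt(x^2 + y^2)
function computeZ(x, y) {
  const r = Math.sqrt(x * x + y * y);
  return r === 0 ? 1 : Math.sin(r) / r; // limit value 1 at the origin
}
```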



    JavaScript library for 3D chemical structure drawing

    3Dmol.js

    3Dmol.js is an open source WebGL chemical molecule visualization library designed specifically for browsers, which can draw molecular structures directly on web pages.

    <div id="viewer" style="width:400px;height:400px;"></div>
    <script src="https://3dmol.org/build/3Dmol-min.js"></script>
    <script>
      const viewer = $3Dmol.createViewer("viewer", { backgroundColor: "white" });
      viewer.addModel("C1=CC=CC=C1", "smi"); // benzene as SMILES; note: some 3Dmol.js builds do not support the "smi" format
      viewer.setStyle({}, {stick: {}, sphere: {scale: 0.3}});
      viewer.zoomTo();
      viewer.render();
    </script>

    ChemDoodle Web Components

    ChemDoodle provides 2D and 3D structure drawings, supports a variety of chemical formats, and is suitable for teaching and web applications.

    JSmol

    JSmol is a JavaScript version of Jmol, suitable for displaying large molecules such as proteins or crystal structures.

    Mol*

    Mol* (MolStar) is a high-order structure visualization tool developed by RCSB PDB, specifically designed for biological macromolecules.

    comparison table

    Library Main purpose Open source License notes
    3Dmol.js General-purpose 3D molecular visualization Yes (BSD) Free to use
    ChemDoodle 2D and 3D teaching and display Partially Commercial license for the full product
    JSmol Academic research and teaching Yes (LGPL) Free to use
    Mol* Protein and biomolecule visualization Yes (MIT) Free to use


    3Dmol.js displays benzene molecules

    Instructions for use

    This example uses 3Dmol.js and defines the atomic coordinates of the benzene molecule in XYZ format, so the 3D molecular structure displays correctly.

    Using the SMILES format ("smi") raises the error "Unknown format: smi", because some 3Dmol.js versions do not support that format.

    Usage steps

    1. Save the following program as benzene.html.
    2. Start a local HTTP server (for example, Python's python -m http.server).
    3. Open http://localhost:8000 in a browser to view the result.

    HTML code

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>3Dmol.js renders benzene molecules</title>
      <script src="https://3dmol.org/build/3Dmol-min.js"></script>
      <style>
        #viewer {
          width: 600px;
          height: 600px;
          position: relative;
          border: 1px solid #aaa;
        }
      </style>
    </head>
    <body>
    
    <h2>3Dmol.js benzene molecule display (XYZ format)</h2>
    <div id="viewer"></div>
    
    <script>
      document.addEventListener("DOMContentLoaded", function () {
        const viewer = $3Dmol.createViewer("viewer", { backgroundColor: "white" });
    
        const xyzData = `
    12
    benzene
    C 0.0000 1.3968 0.0000
    H 0.0000 2.4903 0.0000
    C -1.2096 0.6984 0.0000
    H -2.1471 1.2451 0.0000
    C -1.2096 -0.6984 0.0000
    H -2.1471 -1.2451 0.0000
    C 0.0000 -1.3968 0.0000
    H 0.0000 -2.4903 0.0000
    C 1.2096 -0.6984 0.0000
    H 2.1471 -1.2451 0.0000
    C 1.2096 0.6984 0.0000
    H 2.1471 1.2451 0.0000
    `;
    
        viewer.addModel(xyzData, "xyz");
        viewer.setStyle({}, {stick: {}, sphere: {scale: 0.3}});
        viewer.zoomTo();
        viewer.render();
      });
    </script>
    
    </body>
    </html>
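The XYZ text passed to addModel has a simple line-oriented structure: line 1 is the atom count, line 2 a free-form comment, and each following line is an element symbol with x, y, z coordinates. A minimal parser sketch (illustration only; 3Dmol.js does its own parsing internally):

```javascript
// Parse an XYZ-format string into { count, comment, atoms } (sketch)
function parseXYZ(text) {
  const lines = text.trim().split('\n');
  const count = parseInt(lines[0], 10);  // line 1: number of atoms
  const comment = lines[1];              // line 2: free-form comment
  const atoms = lines.slice(2, 2 + count).map(line => {
    const [elem, x, y, z] = line.trim().split(/\s+/);
    return { elem, x: parseFloat(x), y: parseFloat(y), z: parseFloat(z) };
  });
  return { count, comment, atoms };
}

const sample = parseXYZ(`2
fragment of benzene
C 0.0 1.3968 0.0
H 0.0 2.4903 0.0`);
// sample.atoms[0].elem === "C", sample.atoms[1].y === 2.4903
```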

    Things to note

    3Dmol.js benzene molecule display (XYZ format)



    Custom Google Maps

    Overview

    The Google Maps JavaScript API lets developers embed interactive maps in web pages and dynamically add custom elements such as markers, layers, and text labels through JavaScript. The following example demonstrates how to display a map and add custom markers.

    ---

    Step 1: Apply for a Google Maps API Key

    Go to the Google Cloud Console, enable the Maps JavaScript API, and create an API key.
    After obtaining it, append it to the script URL as ?key=YOUR_API_KEY when loading the API.

    ---

    Step 2: Create HTML structure

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Google Map with Custom Tags</title>
      <style>
        #map {
          width: 100%;
          height: 500px;
        }
      </style>
    </head>
    <body>
    
    <h3>My Map</h3>
    <div id="map"></div>
    
    <!-- Load Google Maps JS API -->
    <script async
      src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&callback=initMap">
    </script>
    
    <script>
    function initMap() {
      //Initialize the map
      const center = { lat: 25.033964, lng: 121.564468 }; // Taipei 101
      const map = new google.maps.Map(document.getElementById("map"), {
        zoom: 14,
        center: center
      });
    
      //Create custom tag
      const myTags = [
        { position: { lat: 25.034, lng: 121.565 }, title: "Mark A", content: "This is point A" },
        { position: { lat: 25.036, lng: 121.562 }, title: "Mark B", content: "This is point B" },
        { position: { lat: 25.032, lng: 121.568 }, title: "Mark C", content: "This is point C" }
      ];
    
      //Create an information window (InfoWindow)
      const infoWindow = new google.maps.InfoWindow();
    
      //Add markers to the map
      myTags.forEach(tag => {
        const marker = new google.maps.Marker({
          position: tag.position,
          map: map,
          title: tag.title,
          icon: {
            url: "https://maps.google.com/mapfiles/ms/icons/blue-dot.png"
          }
        });
    
        // Click to display information
        marker.addListener("click", () => {
          infoWindow.setContent("<b>" + tag.title + "</b><br>" + tag.content);
          infoWindow.open(map, marker);
        });
      });
    }
    </script>
    
    </body>
    </html>
    ---

    Step 3: Expandable functionality

    ---

    Commonly used settings

    Property Use
    center Sets the initial center coordinates of the map.
    zoom Map zoom level (1–20).
    mapTypeId Display style: roadmap, satellite, hybrid, or terrain.
    icon Custom marker icon.
    infoWindow Shows an information window when a marker is clicked.
    ---

    Official resources



    Sound in Web

    Playing Do Re Mi in JavaScript with MIDI sounds

    Description

    To play a specific MIDI instrument sound (such as a guitar) in a browser, you can use the Web MIDI API or, more simply, the Web Audio API together with a SoundFont player such as the soundfont-player library.

    Example: Playing Do Re Mi with a guitar tone

    <script src="https://unpkg.com/[email protected]/dist/soundfont-player.js"></script>
    <button onclick="playDoReMi()">Play Do Re Mi</button>
    
    <script>
    async function playDoReMi() {
      const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
      const player = await Soundfont.instrument(audioCtx, 'acoustic_guitar_nylon');
    
      const now = audioCtx.currentTime;
      player.play('C4', now); // Do
      player.play('D4', now + 0.5); // Re
      player.play('E4', now + 1); // Mi
    }
    </script>

    Description

    Supported guitar sounds

    soundfont-player instrument names follow the General MIDI program names; the guitar family includes acoustic_guitar_nylon, acoustic_guitar_steel, electric_guitar_jazz, electric_guitar_clean, electric_guitar_muted, overdriven_guitar, distortion_guitar, and guitar_harmonics.

    Conclusion

    Using soundfont-player together with the Web Audio API, you can implement MIDI-quality instrument playback without installing any plug-ins. Just specify the timbre and pitch to play a scale melody such as "do re mi".



    Playing Do Re Mi in JavaScript with an AudioContext synthesizer

    Description

    If sounds cannot be played through an external SoundFont, the Web Audio API's OscillatorNode can synthesize the tones directly. The example below plays Do Re Mi and approximates a plucked-guitar feel (short notes with a fast decay).

    Sample program: built-in sound playback Do Re Mi

    <button onclick="playDoReMi()">Play Do Re Mi</button>
    
    <script>
    function playTone(frequency, startTime, duration, context) {
      const osc = context.createOscillator();
      const gain = context.createGain();
    
      osc.type = "triangle"; // Synthetic waveform close to guitar sound, can be changed to "square", "sawtooth"
      osc.frequency.value = frequency;
    
      gain.gain.setValueAtTime(0.2, startTime);
      gain.gain.exponentialRampToValueAtTime(0.001, startTime + duration);
    
      osc.connect(gain);
      gain.connect(context.destination);
    
      osc.start(startTime);
      osc.stop(startTime + duration);
    }
    
    function playDoReMi() {
      const context = new (window.AudioContext || window.webkitAudioContext)();
      const now = context.currentTime;
    
      // Frequency of Do Re Mi (C4, D4, E4)
      playTone(261.63, now, 0.4, context); // C4
      playTone(293.66, now + 0.5, 0.4, context); // D4
      playTone(329.63, now + 1.0, 0.4, context); // E4
    }
    </script>
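The hard-coded frequencies above come from twelve-tone equal temperament: f = 440 × 2^((n − 69) / 12), where n is the MIDI note number and A4 (n = 69) is 440 Hz. A small helper makes the mapping explicit:

```javascript
// MIDI note number to frequency (A4 = MIDI 69 = 440 Hz, equal temperament)
function midiToFreq(n) {
  return 440 * Math.pow(2, (n - 69) / 12);
}

// C4 = 60, D4 = 62, E4 = 64
console.log(midiToFreq(60).toFixed(2)); // "261.63"
console.log(midiToFreq(62).toFixed(2)); // "293.66"
console.log(midiToFreq(64).toFixed(2)); // "329.63"
```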

    Features

    Tips for simulating guitar technique

    Conclusion

    The pure Web Audio API is the most stable and broadly compatible approach. For advanced needs, you can add filters or echo effects, or integrate MIDI sound sources.



    OscillatorNode type values and their timbres

    Built-in osc.type values

    osc.type Waveform Timbre characteristics Typical simulated sounds
    "sine" sine wave Purest tone, no harmonics Pure tones, tuning forks, flutes, synth pads
    "square" square wave Strong odd harmonics, sharp timbre Synthesizers, 8-bit sound effects, electronic keyboards
    "sawtooth" sawtooth wave Contains all harmonics; thick, bright tone Strings, guitar, and brass simulation
    "triangle" triangle wave Odd harmonics only; softer sound Woodwinds, soft electric-guitar tones
    "custom" custom waveform Arbitrary user-defined waveforms Special synthesized or sampled-style tones

    Demonstration of usage

    const audioContext = new (window.AudioContext || window.webkitAudioContext)();
    const osc = audioContext.createOscillator();
    osc.type = "sawtooth"; // can also be "sine", "square", or "triangle" ("custom" is set via setPeriodicWave)
    osc.frequency.value = 440; // A4
    osc.connect(audioContext.destination);
    osc.start();

    Supplement: Custom waveform

    
    const real = new Float32Array([0, 1, 0.5, 0.25]);
    const imag = new Float32Array(real.length);
    const wave = audioContext.createPeriodicWave(real, imag);

    osc.setPeriodicWave(wave); // this sets osc.type to "custom" automatically
    // Note: assigning osc.type = "custom" directly throws an InvalidStateError;
    // use setPeriodicWave() instead.
    

    Conclusion

    Different osc.type values simulate different styles of instrument sound. To approximate a guitar, start with sawtooth or triangle and fine-tune with envelopes, filters, and echo.
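The decaying pluck in playTone relies on exponentialRampToValueAtTime, whose behavior the Web Audio specification defines as v(t) = v0 · (v1 / v0)^((t − t0) / (t1 − t0)). An offline sketch of that curve:

```javascript
// Offline sketch of the exponential gain ramp used in playTone
// (formula from the Web Audio API spec for exponentialRampToValueAtTime)
function expRamp(v0, v1, t0, t1, t) {
  return v0 * Math.pow(v1 / v0, (t - t0) / (t1 - t0));
}

// Gain starts at 0.2 and decays toward 0.001 over 0.4 s
console.log(expRamp(0.2, 0.001, 0, 0.4, 0));   // 0.2
console.log(expRamp(0.2, 0.001, 0, 0.4, 0.4)); // 0.001 (approximately)
```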



    Play the Do Re Mi phrase using WebAudioFont

    Description

    WebAudioFont is a recommended open-source JavaScript library supporting thousands of MIDI instrument sounds, including guitar and other instruments, with good sound quality and easy integration.

    Quick Example: Playing Do Re Mi (using piano or guitar sounds)

    A minimal sketch following WebAudioFont's published usage. The CDN paths and the preset name _tone_0000_JCLive_sf2_file (a piano preset) come from the library's own examples; substitute a guitar preset file from the WebAudioFont catalog for a guitar timbre.

    <script src="https://surikov.github.io/webaudiofont/npm/dist/WebAudioFontPlayer.js"></script>
    <script src="https://surikov.github.io/webaudiofontdata/sound/0000_JCLive_sf2_file.js"></script>
    <button onclick="playDoReMi()">Play Do Re Mi</button>

    <script>
    const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
    const player = new WebAudioFontPlayer();
    player.loader.decodeAfterLoading(audioCtx, '_tone_0000_JCLive_sf2_file');

    function playDoReMi() {
      const now = audioCtx.currentTime;
      // queueWaveTable(context, destination, preset, when, midiPitch, duration)
      player.queueWaveTable(audioCtx, audioCtx.destination, _tone_0000_JCLive_sf2_file, now, 60, 0.5);       // Do (C4)
      player.queueWaveTable(audioCtx, audioCtx.destination, _tone_0000_JCLive_sf2_file, now + 0.5, 62, 0.5); // Re (D4)
      player.queueWaveTable(audioCtx, audioCtx.destination, _tone_0000_JCLive_sf2_file, now + 1.0, 64, 0.5); // Mi (E4)
    }
    </script>

    Key explanation

    Interchangeable sounds

    Summary

    Combining WebAudioFont with the Web Audio API makes it easy to play notes with realistic MIDI instrument sounds (such as guitar), avoiding both the thin sound of a single pure oscillator and the silent-failure cases seen earlier with SoundFont players.



