Hi-VideoSum — A User-Centric YouTube Video Summarization & Highlight Service

Abstract
§1

As short-form consumption becomes the norm, viewers expect summaries that surface not only what a video is about but also how other people reacted to it. Video summarization can no longer remain at the level of "content compression"; it must evolve into a part of the user experience that conveys the social cues people rely on when deciding whether to watch.

Hi-VideoSum builds a resource-efficient, sLLM-based summarization service that never touches video frames. It operates purely on two text signals from Korean YouTube—transcripts and viewer comments. We curate a ~10k-sample dataset spanning 7 categories and ~80 channels, select high-quality comments via an LLM-based three-axis evaluation (informativeness · subjectivity · relevance), and fine tune sLLM that produces a three-paragraph prose summary covering video content, viewer reactions, and standout moments.

Our service strengths are as follows.

/01 — Frameless

No video frames. Just text.

Operating on only two text signals—transcripts and comments—dramatically lowers compute and serving cost while sidestepping the hallucination risk of video-multimodal stacks.

/02 — Viewer-aware

Summaries that carry the viewer's voice.

General comments recover collective sentiment; timestamped comments recover standout moments. The output goes beyond compression to deliver social cues.

/03 — Hallucination-suppressed

One paragraph, one source—by design.

¶1 from transcript, ¶2 from general comments, ¶3 from timestamped comments. Source mixing is forbidden at the system-prompt level.

FIGURE 01 · OUTPUT SPECIMEN ¶

A three-paragraph prose summary, not a bullet list.

500–1,000 characters of friendly prose—no section headers, no bullets. Each paragraph is drawn from one source only.

¶ CONTENT The video walks through a minimal urban camping setup. The host lays out a tent, table, and cooking gear that all fit into a single car, and explains the weight, packing, and price criteria that drove each pick. The second half follows the actual setup and a few wrap-up tips, building a flow that beginners can replicate end-to-end.

¶ REACTION Viewers most often praised the host for being upfront about gear choices, and especially for listing weight and bulk alongside each item. Many called the price points reasonable, and a recurring request was a follow-up on winter setups. The calm, non-promotional tone also drew repeated mention.

¶ HIGHLIGHTS Around 3:24, the tent-pitching cut drew reactions like "no way this is really one minute," and near 7:10 viewers kept asking to re-watch the chair comparison. The late-video 11:48 night-scene setup pulled the most "where is this location?" replies—a clear signal of which moment held attention the longest.

SECTION 02 · METHOD §2

The pipeline runs in six stages.

From channel curation through two filtering passes to prose label generation, resource-efficient LoRA training, and finally a web & extension program deployment.

/ STEP 01

Channel Curation

~80 Korean YouTube channels hand-curated to balance 7 top-level categories and 16 sub-categories.

SHUKA · Sherlock HJ
EBS · Psick · ITSub

/ STEP 02

Raw Collection

Videos 5–30 min, up to 300 per channel. Ordered by comment count; transcripts and general + timestamped comments are collected.

yt-dlp
youtube-transcript-api
youtube-comment-downloader
Webshare proxy

/ STEP 03

iii

Rule-based Filter

Regex separates timestamped comments. A meaningful-character ratio drops noise immediately at collection time.

len ≥ 10
meaningful ratio ≥ 40%
geo_blocked → skip

/ STEP 04

3-Axis LLM Filter

Informativeness, subjectivity, relevance scored 1–3. Only comments with total ≥ 6 pass. Same prompt across all judges.

Gemini 3 Flash Preview
K-EXAONE 238B (Elice)
threshold ≥ 6

/ STEP 05

Prose Label Gen.

500–1,000 chars, three paragraphs, friendly tone. Per-paragraph source isolation suppresses hallucination at the prompt level.

Gemini 3.1 Flash Preview
thinking_level: medium
output: 3-paragraph prose

/ STEP 06

sLLM LoRA Fine-tune

A lightweight baseline validates the full pipeline first, then we scale up to larger Korean sLLMs.

gemma-4-E4B-it
r=32 · α=64 · drop 0.05
H200 · bf16 · seq 20k

SECTION 03 · DATASET §3

≈10k samples across 7 categories and ~80 channels.

Sourced entirely from Korean YouTube. Each video becomes one JSONL row carrying transcript, general comments, and timestamped comments.
Released on HuggingFace Hub as kim586w/hivideosum_training_dataset.

80ch

curated Korean YouTube channels

Channels

7/16

top-level / sub-categories

Paste one URL. Get prose back in 30–120 s.

FastAPI handles intake and polling only; the actual pipeline runs asynchronously inside an arq worker. The two never talk directly—Redis is the mailbox between them.

FastAPI

Intake, status polling, and result delivery only. Does no real work itself.

$ uvicorn api.main:app

arq Worker

Pulls jobs from the queue and runs the collect → filter → summarize pipeline asynchronously.

$ arq worker.runner.WorkerSettings

Redis

Queue + progress hash + result/cache strings. A video_id-keyed cache returns repeats instantly.

$ redis-server

vLLM Server

Serves prose summaries via an OpenAI-compatible API using our fine-tuned LoRA adapter.

$ bash inference/serve.sh

Vertex AI Gemini

Three-axis comment scoring (informativeness · subjectivity · relevance). gemini-3-flash-preview.

vertexai=true · project=hivideosum

YouTube · transcript / comments

yt-dlp · youtube-transcript-api · youtube-comment-downloader, routed via Webshare proxy.

module: worker/steps/collect.py

Hi-VideoSum web version screenshot — Figure 03a · Web version

Hi-VideoSum browser extension screenshot — Figure 03b · Browser extension version

SECTION 05 · RESEARCH §5

The four questions we set out to answer.

Each question is wired to a specific analysis section of the mid-report.

Research Question 01

Can viewer comments serve as a complementary signal for the visual information that transcripts cannot capture?

→ §4.1
Comment–transcript complementarity
cross-modal coverage

Research Question 02

What distinct characteristics and profiles does each domain's data exhibit?

→ §4.2
Per-category profile analysis
information-dense / scene-reactive / audience-resonant

Research Question 03

Does a fine-tuned sLLM outperform a general-purpose LLM in reducing hallucination and reflecting the viewer's voice?

→ §4.3
sLLM vs general-purpose LLM
hallucination · viewer-grounding

Research Question 04

Among LoRA variants, which adapter configuration offers the best compute–quality trade-off?

→ §4.3
rank · alpha · target-module sweep
compute-quality trade-off

SECTION 06 · CITATION §6

Cite / BibTeX.

Please use the entry below to cite this work. A separate dataset citation will be added soon.

@misc{hivideosum2026, title = {Hi-VideoSum: A User-Centric YouTube Video Summarization and Highlight Service for Korean}, author = {Jung, Yeon-hu and Oh, Kyeong-jun and Lee, Yong-ha and Kim, Kipyo}, institution = {Sungkyunkwan University, AI Convergence}, advisor = {Jung, Mina}, course = {Data Science Capstone Project}, year = {2026}, note = {Mid-term report, v0.2}, url = {https://huggingface.co/datasets/kim586w/hivideosum_training_dataset} }