HI-EVERYONE · SKKU · AI Convergence · Data Science Capstone · 2026
Abstract Method Dataset Service RQ Cite
A User-Centric YouTube Summarization & Highlight Service

Hi‑VideoSum A resource-efficient sLLM-based service that summarizes Korean YouTube videos using only two text signals—transcripts and viewer comments.

Yeon-hu Jung Kyeong-jun Oh Yong-ha Lee Kipyo Kim
Sungkyunkwan University · AI Convergence / Data Science Capstone Project / Advisor: Prof Mina Jung
Code HuggingFace · Dataset Mid-Report · PDF Live Demo
Abstract
§1

As short-form consumption becomes the norm, viewers expect summaries that surface not only what a video is about but also how other people reacted to it. Video summarization can no longer remain at the level of "content compression"; it must evolve into a part of the user experience that conveys the social cues people rely on when deciding whether to watch.

Hi-VideoSum builds a resource-efficient, sLLM-based summarization service that never touches video frames. It operates purely on two text signals from Korean YouTube—transcripts and viewer comments. We curate a ~10k-sample dataset spanning 7 categories and ~80 channels, select high-quality comments via an LLM-based three-axis evaluation (informativeness · subjectivity · relevance), and fine tune sLLM that produces a three-paragraph prose summary covering video content, viewer reactions, and standout moments.

Our service strengths are as follows.

/01 — Frameless

No video frames. Just text.

Operating on only two text signals—transcripts and comments—dramatically lowers compute and serving cost while sidestepping the hallucination risk of video-multimodal stacks.

/02 — Viewer-aware

Summaries that carry the viewer's voice.

General comments recover collective sentiment; timestamped comments recover standout moments. The output goes beyond compression to deliver social cues.

/03 — Hallucination-suppressed

One paragraph, one source—by design.

¶1 from transcript, ¶2 from general comments, ¶3 from timestamped comments. Source mixing is forbidden at the system-prompt level.

FIGURE 01 · OUTPUT SPECIMEN

A three-paragraph prose summary, not a bullet list.

500–1,000 characters of friendly prose—no section headers, no bullets. Each paragraph is drawn from one source only.

¶ CONTENT The video walks through a minimal urban camping setup. The host lays out a tent, table, and cooking gear that all fit into a single car, and explains the weight, packing, and price criteria that drove each pick. The second half follows the actual setup and a few wrap-up tips, building a flow that beginners can replicate end-to-end.

¶ REACTION Viewers most often praised the host for being upfront about gear choices, and especially for listing weight and bulk alongside each item. Many called the price points reasonable, and a recurring request was a follow-up on winter setups. The calm, non-promotional tone also drew repeated mention.

¶ HIGHLIGHTS Around 3:24, the tent-pitching cut drew reactions like "no way this is really one minute," and near 7:10 viewers kept asking to re-watch the chair comparison. The late-video 11:48 night-scene setup pulled the most "where is this location?" replies—a clear signal of which moment held attention the longest.

SECTION 02 · METHOD §2

The pipeline runs in six stages.

From channel curation through two filtering passes to prose label generation, resource-efficient LoRA training, and finally a web & extension program deployment.

/ STEP 01
i

Channel Curation

~80 Korean YouTube channels hand-curated to balance 7 top-level categories and 16 sub-categories.

SHUKA · Sherlock HJ
EBS · Psick · ITSub
/ STEP 02
ii

Raw Collection

Videos 5–30 min, up to 300 per channel. Ordered by comment count; transcripts and general + timestamped comments are collected.

yt-dlp
youtube-transcript-api
youtube-comment-downloader
Webshare proxy
/ STEP 03
iii

Rule-based Filter

Regex separates timestamped comments. A meaningful-character ratio drops noise immediately at collection time.

len ≥ 10
meaningful ratio ≥ 40%
geo_blocked → skip
/ STEP 04
iv

3-Axis LLM Filter

Informativeness, subjectivity, relevance scored 1–3. Only comments with total ≥ 6 pass. Same prompt across all judges.

Gemini 3 Flash Preview
K-EXAONE 238B (Elice)
threshold ≥ 6
/ STEP 05
v

Prose Label Gen.

500–1,000 chars, three paragraphs, friendly tone. Per-paragraph source isolation suppresses hallucination at the prompt level.

Gemini 3.1 Flash Preview
thinking_level: medium
output: 3-paragraph prose
/ STEP 06
vi

sLLM LoRA Fine-tune

A lightweight baseline validates the full pipeline first, then we scale up to larger Korean sLLMs.

gemma-4-E4B-it
r=32 · α=64 · drop 0.05
H200 · bf16 · seq 20k
SECTION 03 · DATASET §3

≈10k samples across 7 categories and ~80 channels.

Sourced entirely from Korean YouTube. Each video becomes one JSONL row carrying transcript, general comments, and timestamped comments.
Released on HuggingFace Hub as kim586w/hivideosum_training_dataset.

80ch
curated Korean YouTube channels
Channels
7/16
top-level / sub-categories
Categories
≈10k
three-paragraph training pairs
Samples
5–30m
video length window
Length
Figure 02 · Channel category distribution across the raw collection
Raw category distribution across 80 Korean YouTube channels
SECTION 04 · WEB SERVICE §4

Paste one URL. Get prose back in 30–120 s.

FastAPI handles intake and polling only; the actual pipeline runs asynchronously inside an arq worker. The two never talk directly—Redis is the mailbox between them.

BACKEND · web_service/ Browser /ui · poll every 2s FastAPI api/ · POST /jobs GET /jobs/{id} Redis arq:queue · job:meta job:result · cache:video TTL 24h arq Worker worker/ · pipeline collect → filter → sum YouTube transcript · comments · meta Vertex AI Gemini 3-axis filter · 1–3 vLLM Server gemma-4-E4B-it + LoRA POST poll t=0 submit 10–60s collect 10–30s filter 10–30s summarize done 30–120s
FastAPI

Intake, status polling, and result delivery only. Does no real work itself.

$ uvicorn api.main:app
arq Worker

Pulls jobs from the queue and runs the collect → filter → summarize pipeline asynchronously.

$ arq worker.runner.WorkerSettings
Redis

Queue + progress hash + result/cache strings. A video_id-keyed cache returns repeats instantly.

$ redis-server
vLLM Server

Serves prose summaries via an OpenAI-compatible API using our fine-tuned LoRA adapter.

$ bash inference/serve.sh
Vertex AI Gemini

Three-axis comment scoring (informativeness · subjectivity · relevance). gemini-3-flash-preview.

vertexai=true · project=hivideosum
YouTube · transcript / comments

yt-dlp · youtube-transcript-api · youtube-comment-downloader, routed via Webshare proxy.

module: worker/steps/collect.py
Figure 03a · Web version
Hi-VideoSum web version screenshot
Figure 03b · Browser extension version
Hi-VideoSum browser extension screenshot
SECTION 05 · RESEARCH §5

The four questions we set out to answer.

Each question is wired to a specific analysis section of the mid-report.

Research Question 01
Can viewer comments serve as a complementary signal for the visual information that transcripts cannot capture?
→ §4.1
Comment–transcript complementarity
cross-modal coverage
Research Question 02
What distinct characteristics and profiles does each domain's data exhibit?
→ §4.2
Per-category profile analysis
information-dense / scene-reactive / audience-resonant
Research Question 03
Does a fine-tuned sLLM outperform a general-purpose LLM in reducing hallucination and reflecting the viewer's voice?
→ §4.3
sLLM vs general-purpose LLM
hallucination · viewer-grounding
Research Question 04
Among LoRA variants, which adapter configuration offers the best compute–quality trade-off?
→ §4.3
rank · alpha · target-module sweep
compute-quality trade-off
SECTION 06 · CITATION §6

Cite / BibTeX.

Please use the entry below to cite this work. A separate dataset citation will be added soon.

@misc{hivideosum2026, title = {Hi-VideoSum: A User-Centric YouTube Video Summarization and Highlight Service for Korean}, author = {Jung, Yeon-hu and Oh, Kyeong-jun and Lee, Yong-ha and Kim, Kipyo}, institution = {Sungkyunkwan University, AI Convergence}, advisor = {Jung, Mina}, course = {Data Science Capstone Project}, year = {2026}, note = {Mid-term report, v0.2}, url = {https://huggingface.co/datasets/kim586w/hivideosum_training_dataset} }