Tamkang University Institutional Repository (淡江大學機構典藏): Item 987654321/126358
    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/126358


    Title: A Multimodal Learning Approach for Translating Live Lectures into MOOCs Materials
    Authors: Huang, Tzu-Chia;Chang, Chih-Yuan;Tsai, Hung-I;Tao, Han-Si
    Keywords: Generative MOOCs;Instructional videos;multimodal;Skeleton-based motion classification;Extractive summarization
    Date: 2024-07-09
    Issue Date: 2024-10-07 12:05:57 (UTC+8)
    Abstract: This paper introduces an AI-based solution for the automatic generation of MOOCs, aiming to efficiently create highly realistic instructional videos while ensuring high-quality content. The generated content is intended to preserve content accuracy, video fluidity, and vividness. The paper employs a multimodal model that understands text, images, and sound simultaneously, enhancing the accuracy and realism of video generation. The process involves three stages. First, the preprocessing stage employs OpenAI's Whisper for audio-to-text conversion, supplemented by FuzzyWuzzy and Large Language Models (LLMs) to enhance content accuracy and detect thematic sections. In the second stage, speaker motion prediction begins with skeleton tags; based on these labels, the speaker's motions are classified into different categories. Subsequently, a multimodal model combining BERT and a CNN extracts features from the text and voice diagrams, respectively. From these features, the model learns the speaker's motion categories through the skeleton labels and can then predict the classes of the speaker's motions. The final stage generates the MOOC audiovisuals, converting text into subtitles using LLMs and applying the predicted speaker motions. Finally, a well-known lip-synchronization tool is used to ensure accurate voice and lip synchronization. Based on these approaches, the proposed mechanism guarantees seamless alignment and consistency among the video elements, thereby ensuring that the generated MOOCs are realistic and up to date.
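The abstract names Whisper and FuzzyWuzzy for the stage-1 preprocessing. The following is a minimal sketch of that step under stated assumptions: the input file name, Whisper model size, domain-term list, and similarity threshold are illustrative placeholders, not values reported in the paper, and the LLM-based thematic segmentation is only indicated as a comment.

```python
# Stage-1 sketch: Whisper transcription plus fuzzy-matching cleanup.
# Assumes the openai-whisper and fuzzywuzzy packages are installed.
import whisper
from fuzzywuzzy import process

# Audio-to-text conversion with Whisper.
asr_model = whisper.load_model("base")             # model size is an assumption
result = asr_model.transcribe("live_lecture.mp4")  # hypothetical input recording
transcript = result["text"]
segments = result["segments"]                      # timestamped segments, reusable for subtitles

# Fuzzy matching to repair domain terms the ASR step may have garbled.
# In practice the term list would come from slides or a course glossary.
domain_terms = ["epoch", "gradient", "backpropagation"]   # placeholder glossary

def correct_terms(text, terms, threshold=90):
    """Replace near-miss tokens with the closest known domain term."""
    fixed = []
    for token in text.split():
        best, score = process.extractOne(token, terms)
        fixed.append(best if score >= threshold else token)
    return " ".join(fixed)

cleaned_transcript = correct_terms(transcript, domain_terms)

# An LLM pass for thematic-section detection would follow here; the abstract
# does not name a specific model or API, so it is left as a comment.
```

For stage 2, the abstract describes a multimodal model in which BERT encodes text, a CNN encodes the voice representation, and the fused features are classified into skeleton-labeled motion categories. The sketch below shows one plausible arrangement in PyTorch with Hugging Face transformers; the layer sizes, spectrogram shape, and number of motion classes are assumptions, not the paper's architecture details.

```python
# Stage-2 sketch: text + audio feature fusion for motion-class prediction.
import torch
import torch.nn as nn
from transformers import BertModel

class MotionClassifier(nn.Module):
    def __init__(self, num_motion_classes=5):          # class count is an assumption
        super().__init__()
        # Text branch: BERT pooled sentence embedding (768-d).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Audio branch: small CNN over a 1-channel spectrogram-like input.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion head maps the concatenated features to motion categories
        # supervised by the skeleton labels.
        self.head = nn.Linear(768 + 32, num_motion_classes)

    def forward(self, input_ids, attention_mask, spectrogram):
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        audio_feat = self.cnn(spectrogram).flatten(1)
        return self.head(torch.cat([text_feat, audio_feat], dim=1))
```

In the paper's pipeline, the predicted motion classes then feed the stage-3 rendering together with LLM-generated subtitles and lip synchronization; those steps depend on specific generation tools and are not sketched here.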
    Appears in Collections:[Graduate Institute & Department of Computer Science and Information Engineering] Proceeding

    Files in This Item:

    There are no files associated with this item.

    All items in the Tamkang University Institutional Repository (機構典藏) are protected by copyright, with all rights reserved.

