摘要: | With globalization, English is receiving more and more attention worldwide, especially in non-English-speaking countries, and many learners now read English articles to improve their proficiency. For a typical learner, however, selecting an article that is both interesting and of suitable difficulty is not easy. The purpose of this work is to devise a recommendation mechanism that, given an English news article as input, indicates whether the article is suitable in difficulty for the user. The target users are local senior high school students. Four corpora are used: the six-level vocabulary lists of the General English Proficiency Test (GEPT), articles written by senior high school students on Intelligent Web-based Interactive Language Learning (IWiLL), texts from senior high school English textbooks (SHSETs), and English news articles collected from the Internet (Web News). To find articles suitable for senior high school students, we first compute the document features of the corpora, which are then taken as the only characteristic for classifying an English article. Two approaches are used to compute the document features: the Smoothed Unigram Model and Cosine Similarity. 
Three classifiers are considered: Naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). For performance evaluation, classification accuracy, F-measure, and Brier score are computed, and a confusion matrix is used to present the classification results. Finally, we analyze the results over the different combinations of feature-extraction method and classifier to assess whether the proposed mechanism can select suitable English articles for senior high school students. |
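
As a rough illustration of two pieces named in the abstract, the sketch below computes a cosine similarity between term-frequency vectors and the three evaluation measures from a binary confusion matrix. It is a minimal sketch under stated assumptions, not the implementation used in this work: the tokenization, the binary labels (1 = suitable, 0 = not suitable), and all function names are hypothetical choices made for the example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count words after lowercasing and stripping basic punctuation (assumed tokenization)."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means 'suitable' (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of articles classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)
```

In use, an article's term-frequency vector would be compared against a vector built from each corpus, and the resulting similarities passed to a classifier as features; the evaluation functions then score that classifier's predictions against known labels.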