版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、See discussions, stats, and author profiles for this publication at:Text style analysis using trace ratio criterion patch alignment embeddingARTICLE in NEUROCOMPUTING · JULY 2014pac Fac o : 2.08 DO : 0. 0 6/j. euco.20 4.0 .0 2C A ON1S413 AUTHORS, INCLUDING:bo ZhaoCity University ofTommy W SCity
2、 University of226 PUBL CA ONS 2,620 C38 PUBL CA ONS177 CAONSAONSSEE PROF LESEE PROF LEAvai ab eo : Mi gbo Z aoRe ieved o : 25 Sep e be 20 5Neurocomputing 136 (2014) 201212Contents lists available at ScienceDirectNeurocomputingjournal homepage:Text style analysis using trace ratio criterion patch ali
3、gnment embeddingPeng Tang,bo Zhao n, Tommy W.S.Department of Electronic Engineering, City University ofa r t i c l e i n f o a b s t r a c t An effective algorithm for extracting cues of text styles is proposed in this paper. When processing document collections, the documents are rst converted to a
4、 high dimensional data set with the assistant of a group of style markers. We also employ the Trace Ratio Criterion Patch Alignment Embedding (TR PAE) to obtain lower dimensional representation in a textual space. The TR PAE has some advantages that the inter class separability and intra class compa
5、ctness are well characterized by the special designed intrinsic graph and penalty graph, which are based on discriminative patch alignment strategy. Another advantage is that the proposed method is based on trace ratio criterion, which directly represents the average between class distance and avera
6、ge within class distance in the low dimensional space. To evaluate our proposed algorithm, three corpuses are designed and collected using existing popular corpuses and real life data covering diverse topics and genres. Extensive simulations are conducted to illustrate the feasibility and effectiven
7、ess of our implementation. Our simulations demonstrate that the proposed method is able to extract the deeply hidden information of styles of given documents, and efciently conduct reliable text analysis results on text styles can be provided.Article history:Received 25 June 2013 Received in revised
8、 form 2 January 2014Accepted 6 January 2014 Communicated by X. GaoAvailable online 29 January 2014Keywords:Text style analysisTrace ratio criterion patch alignment embeddingStyle markers Text clustering& 2014 Elsevier B.V.1. Introductiongenre based features in documents are of great help to our
9、research on text analysis. The Term Frequency (TF), as well as Term Frequency Inverse Document Frequency (TF IDF) is applied to mostThe tasks on text analyzing, such as text classication and docu ment categorization, have gained a prominent status in the informa tion systems eld, due to the availabi
10、lity of documents created by the World Wide Web and the increasing demand to retrieval them by exible means 1. In these text analyzing tasks, papers are usually measured and classied according to their contents, or topics, and genres, or types. A third type of text analysis can exist. For example, d
11、ifferent writers can write in different style when they describe the same thing, and experienced Englishers usually have no difculty in telling whether an essay was written by native English speakers or the non native ones by looking at the authors wordings and writing styles which differ from its g
12、enre or topic. Here, the cues to differentiate the native and English as a Secondary Language (ESL) speakers are writing text styles. Therefore, the style cues of documents can be an effective measure in automatic text processing tasks. In this research, we will analyze the text styles using style m
13、arkers and machine learning approaches.Document styles are obviously affected by topics and genres of documents. Consequently, approaches that handle topic based ordocument ms like Vector Space Ms or Probabilistic Ms2,3. It can be noticed that simple textual features, like word (orterm) frequency an
14、d length of sentences, are the most widely used. For instance, word length and sentence length features have beenused to test tre classes and authorship 4 6. Tweedie haspointed out that the richness of vocabulary highly depends on text length and is very unstable 7. Many works have been established
15、with the assistance of POS tagging. For example, POS tagging features are used to detect text genre 8 12. N gram mixed with POS tagging are used to investigate the inuence of syntax structure 12. Feldman et al. extracted text genre features with POS histo grams and machine learning technologies 13.
16、Biber 14 dened “style markers”, regarded as a formal denition of style of texts, as aset of measurable patterns. Kessler identied four generic cues on the bases of style markers 8. It is also believed that writers can be a determining factor for writing habits. There are also research work focusing
17、on identifying the authorship of given documents. Style markers are utilized to dealing with unrestricted text for an authorship based classication, and a 50% or above accuracy has been reported when a 10 author corpus are processed 15. In 15, multiple regression and discriminant analysis are employ
18、ed to analyse genres of documents. Similar approaches are also applied in web document classication 16 18. In these analyses, the stylen Corresponding author. Tel.: þ 852 34427756; fax: þ 852 27887791.addresses: .hk (P. Tang),.hk (M. Zhao), .hk (
19、T.W.S.).0925-2312/$ - see front matter & 2014 Elsevier B.V.202P. Tang et al. / Neurocomputing 136 (2014) 201212markers of text, together with HTML tags and entities are consid ered as textual features. By using regression and discriminate analysis, differences among given documents can be reveal
20、ed. Using the occurrence frequency of the most widely used words from a training corpus as style markers has also been studied 19,20,15. Textual features and self organizing maps are used for text classi cation 1,21. Proximity based information between words to extract extra features of documents is
21、 also widely used for retriev ing information. Petkova and Croft propose a document representalinear projection matrix is used for map extend LE to its linear version.new cosamples andThe aforementioned methods are developed based on the specicknowledge of eld experts for their own purposes. Recentl
22、y, Yan et al.27 demonstrate that several dimension reduction methods (e.g. PCA, LDA, ISOMAP, LLE and LE) can be unied in a graph embedding framework, in which the statistical and geometrical properties of the data are encoded as graph relationships. Zhang et al. 28 further reformulated several dimen
23、sion reduction methods into a unied patch alignment framework (PAF), which consists of two parts: localtion mbased on the proximity between occurrences of entitiesand terms 22. Lv and Zhai propagate the word count using the sopatch construction and whole alignment, and showed that the above methods
24、are different in the local patch construction stage and share an almost identical whole alignment stage. In addition, this general framework, which is also originally used by local tangent space alignment (LTSA) 34, has been widely used in different elds such as correspondence construction 35, image
25、 retrieval 36 and distance metric learning 37 by constructing different patches corre sponding to different applications.In general, most of the above methods are unsupervised and they do not use label information. However, label information is of great importance when handling classication problem.
26、 In addition, though LDA can achieve promising performance as a supervised method, it is developed based on the assumption that the samples in each class follow a Gaussian distribution. In many applications such as text classication problems, samples in a data set, however, may follow a non Gaussian
27、 distribution that cannot satisfy the above assumption. Without this assumption, the separation of different classes may not be well characterized which results in degrading the classication performance 31. To solve this problem, some super vised dimensionality reduction methods have adopted the ide
28、a from the mentioned unsupervised manifold methods for better preserving the discriminative information. These methods usually start from the local structure of data and preserve the geometric information provided by data points and the label information. Typical methods include Supervised Locality
29、Preserving Projectioncalled Positional Language Mto obtain a virtual propagatedword count and applied to other language ms 23. Differentfrom the above method, our proposed method ms a given textas a lexicon of weighted word pairs. In this paper, the weight of word pair, calculated by using proximity
30、 based kernels in manyapplications, refers to theness between the two terms ofthe word pairs. Syntactic features performs better than simple textual statistics such as word frequency and length of sentencesin genre classication 20,14. It is reported, however, that the syntactic dependent features ar
31、e computationally expensive andtime consu8. To balance the computational performanceand the effectiveness of analysis result, we use POS bigrams and trigrams, which can encode useful syntactic information 24.Most above mentioned approaches on document analysis contain two parts. First, they generate
32、 a matrix by using a set of style markers. Second, use regression, discriminate analysis, classiers, or other machine learning methods to evaluate the results. Hence, if we want to improve the text analysis results, two approaches can be applied: (1) by exploring more effective features and (2) byut
33、ilizing movanced machine learning methods that can bettermake use of hidden information in the original data. In this paper, we mainly focus the latter approach. Specically, to better repre sent the features of text or documents, we rst collected several corpuses using real life textual data and exi
34、sting popular corpuses for text analysis in different scenarios. In this way, document(SLPP) 38, Discriminative Locality Alignment (DLA) 28, Stableanalysis is transformed into a feature extraction problem with a high dimensional and non Gaussian data set. We then develop an effective approach to han
35、dle such data set for document analysis.However, dealing with high dimensional data has always been a major problem for pattern recognition. Hence, dimensionality reduction techniques can be used to reduce the complexity of the original data and embed high dimensional data into low dimensional data,
36、 while keemost of the desired intrinsic information 25,26. Over the past decades, many dimensionality reduction methods have been proposed 27,28. PCA pursues theOrthogonal Local Discriminant Embedding (SOLDE) 39, SparseNeighbor Selection and Sparse Representation based Enhancement (SNS SRE) 40, Unsu
37、pervised Transfer Learning based Target Detec tion (UTLD) 41, etc.To better unveil the hidden information in the given high dimensional data created by a group of specied style markers, a fast trace ratio criterion patch alignment embedding (TR PAE) method is introduced. Our proposed method has the
38、advantages that the inter class separability and intra class compactness are well characterized by the special designed intrinsic graph and penalty graph, which are based on discriminative patch alignment method. This strategy is essential for extracting the deeply hidden information about text styl
39、es in document collections. Another advantage is that the proposed method is based on trace ratio criterion, which directly represents the average between class distance and average within class distance in the low dimensional space. This advantage is helpful to directly obtain intuitive text analys
40、is results. In this paper, we have performed extensive study using style marker collections, our col lected corpuses and the proposed TR PAE. The simulation results show that using our method, we can distinguish various styles of documents of different genres. Meanwhile, the styles of British and Am
41、erican writing English, as well as English from Asian areas, can be separated. Moreover, our proposed algorithm can separate news items collected from the same media and of the same genre, but composed in different decades.The rest of this paper is organized as follows. The corpuses and textual feat
42、ures, i.e. the style marker collections, collected anddirection ofum variance for optimal reconstruction 29,30.For linear supervised methods, LDA and its variants nd theoptimal solution thatizes the distance between the meansof the classes while minimizing the variance within each class31. Due to th
43、e utilization of label information, LDA can achieve better classication results than those obtained by PCA if sufcient labeled samples are provided.To nd the intrinsic manifold structure of the data, nonlinear dimensionality reduction methods such as ISOMAP 25, Locally Linear Embedding (LLE) 26, Lap
44、lacian Eigenmap (LE) 32 were developed. These methods preserve the local structures and look for a direct non linearly embedding the data in a global coordinate. For example, ISOMAP aims to preserve global geodesic distances of all pairs of measurements; LLE uses linear coefcients, which reconstruct
45、 a given measurement by its neighbors, to represent the local geometry; LE is able to preserve the proximity relationships by using an undirected weighted graph to indicate neighbor relations of pair wise measure ments. But it is worth noting that all the above methods suffer from the out of sample
46、problem 33. To deal with the problem, He et al.33 developed the Locality Preserving Projections (LPP) in which aused in our study adressed in Section 2. In Section 3, webriey overview the conventional linear discriminant analysis. Our proposed fast trace ratio criterion patch alignment embeddingP. T
47、ang et al. / Neurocomputing 136 (2014) 201212203method is subsequently elaborated. In Section 4, data sets used in visualization and classications, experimental congurations andcollocation information among words and some other textual statistics. In our study, we obtain the listed style markers wit
48、hcorresponding experimental results are algorithm in Section 5.ed. We conclude ourthe assistant ofscripts and thepackage NLTK.Compared to features used in 18,15,42, we removed featuresrelated to topics and genres so that the bias caused by topics and genres can be minimized. Moreover, features conce
49、rning of frequent/function words are enhanced. Such methods also aim to eliminate the bias caused by genres and topics of different documents.2. Corpuses and features usedTo our knowledge, there are few corpuses aifor analyzingstyles of text or documents. Most existing corpuses are for classication
50、by the contents, topics, types, or genres of the documents. To conduct our text style analysis, we collected several corpuses using real life textual data and existing popular corpuses for text analysis in different scenarios. The components of Corpus 1, Corpus 2 and Corpus3. Feature extraction base
51、d on fast trace ratio criterion linear discriminant analysis3 areed in Tables 1 3, respectively. The Corpus 1 consists of a3.1. Related workgroup of materials covering various contents and genres, aitoexamine the performance of our proposed algorithm for analyzing the3.1.1. Review of trace ratio lin
52、ear discriminant analysisLDA uses the within class scatter matrix Sw to evaluate the compactness within each class and between class scatter matrix Sb to evaluate the separability of different classes. The goal of LDA is to nd a linear transformation matrix W A RD d, for which thestyles among differ
53、ent topics and genres. News and reportage items covering different English media, i.e. the British and American Englishmed (formthe native English media, and the English media in the CJK Japaand Korean) area as the non native English media,Corpus 2. This corpus is designed for testing our algotrace
54、of between class scatter matrix isized, while the tracerithm when dealing with documents of the same genre but in composed in different regions. The Corpus 3 includes news and reports from the same English media and covering the same topics. We collect this corpus mainly for validating the feasibili
55、ty of using our algorithm under such harsh conditions.of within class scatter matrix is minimized. Let X ¼ fx1; x2;xlg A RD l be the training set, each xi belongs to a classci ¼ f1; 2; cg . Let li be the number of data points in the ith classand l be the number of data points in all classe
56、s. Then, the between class scatter matrix Sb, within class scatter matrix Sw, and total class scatter matrix St are dened as follows:We choose some popular style markers of documents to construct a high dimensional data set 18,15,42. Here, each style marker is considered as a textual feature concern
57、ing of token level, lexical level, structural level. The 200 style markers used incTSt ¼ ðx Þðx Þx A cii 1our study areed in Table 4, which means we construct a 200dimensional data set by using them. Token level features include the classic term frequencies for the words, numbers, punctuations, and special symbols. Lexical level features consist of both funct ion words and content words, POS tagged tokens and some useful word using statistics that indicate the high and low frequencyTa
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 墓碑制作技术规范及保障方案
- 2024年创业贷款合同模板
- 2024年企业数据共享与安全保护合同
- 2024年全新版短期汽车租赁协议
- 2024年吊车安全操作保证协议
- 2024年LED产品订购协议
- 2024年城市公共交通代运营复杂协议
- 医院防汛应急处理及病患安全方案
- 2024年大型火力发电厂建设与运营合同
- 电子工业清洁生产废气净化方案
- 自闭儿童创业计划书
- 阴道助产并发症的处理
- 幼儿园公开课:中班语言《跑跑镇》课件
- 公共机构节能知识讲座
- 幼小衔接那些事儿
- 山东省临沂市罗庄区2023-2024学年七年级上学期期中数学试题
- 代人贷款免责协议
- 机器人带来的挑战和机遇
- 2年级下册小学语文校本教材(二)
- 文言文实虚词复习语文八年级上册
- 结合实际-谈谈怎样做一名人民满意的公务员
评论
0/150
提交评论