갈루아의 반서재


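colSums() gives the frequency of every word in the corpus by summing each term's column of the document-term matrix, and sort() then orders those counts from most to least frequent.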


> freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)  # as.matrix: explicitly converts dtm to an ordinary matrix; colSums: computes the sum of each matrix column

> head(freq,14) 

        the         and        that         you        have        will        this         but       draft 

         71          56          51          27          15          15          13          11          11 

      about         all strickland:         for        know 

         10          10           9           8           8 
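To make the colSums() and sort() step easier to follow, here is a minimal sketch on a toy matrix. The matrix `m` and its counts are made up for illustration and are not taken from the corpus above; only base R is assumed.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# The counts are hypothetical, chosen only for illustration.
m <- matrix(c(2, 0, 1,
              2, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("the", "and", "draft")))

# colSums() adds up each column, giving one total per term.
totals <- colSums(m)

# sort(..., decreasing = TRUE) puts the most frequent terms first.
freq_toy <- sort(totals, decreasing = TRUE)
print(freq_toy)
```

The result is a named numeric vector, which is why names(freq) can be used in the next step to recover the words.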


> wf <- data.frame(word=names(freq), freq=freq)   # names(): gets or sets the names of an object

> head(wf)

     word freq

the   the   71

and   and   56

that that   51

you   you   27

have have   15

will will   15
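In the output above, each word appears twice because data.frame() reuses the names of freq as row names. If that duplication is unwanted, one option is to pass row.names = NULL. The sketch below uses a hypothetical vector `freq_toy` rather than the real freq.

```r
# Hypothetical named frequency vector standing in for freq.
freq_toy <- c(the = 4, and = 3, draft = 1)

# row.names = NULL asks data.frame() for plain 1..n row names
# instead of reusing the names of the vector.
wf_toy <- data.frame(word = names(freq_toy), freq = freq_toy, row.names = NULL)
print(wf_toy)
```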


> library(ggplot2)


Attaching package: ‘ggplot2’


The following object is masked from ‘package:NLP’:


    annotate


> p <- ggplot(subset(wf, freq>10), aes(word, freq)) # keep only words with frequency greater than 10; x axis = word, y axis = freq

> p <- p + geom_bar(stat="identity") # geom_bar: draws a bar chart; stat="identity": uses the freq values as bar heights instead of the default row counting

> p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))

> p
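ggplot2 lays out a discrete x axis in factor-level order, which is alphabetical for plain character data, so the bars above are not necessarily sorted by frequency. A possible variation, shown here as a sketch, is to wrap the x variable in reorder(). The data frame `wf_toy` is a small stand-in built from the first four rows of the output above, and the name `p_sorted` is hypothetical.

```r
library(ggplot2)

# Small stand-in for wf, using the first four counts shown earlier.
wf_toy <- data.frame(word = c("the", "and", "that", "you"),
                     freq = c(71, 56, 51, 27))

# reorder(word, -freq) orders the words by descending frequency,
# so the tallest bar is drawn first.
p_sorted <- ggplot(wf_toy, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "word") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p_sorted)
```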


