07. Creating a Document-Term Matrix


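A document-term matrix is a matrix whose rows are documents and whose columns are terms; each cell holds the number of times that term occurs in that document. The matrix can be built with DocumentTermMatrix() from the R tm package. For example: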

D1 = "I like databases"
D2 = "I hate databases"
then the document-term matrix would be:

     I  like  hate  databases
D1   1  1     0     1
D2   1  0     1     1

[Source] http://en.wikipedia.org/wiki/Document-term_matrix

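The counting behind the example above can be sketched in base R without any text-mining package. This is only an illustration: the names docs, terms, tokens and toy_dtm are made up here, and the term order is fixed by hand to match the table.

```r
# Two toy documents from the example above
docs <- c(D1 = "I like databases", D2 = "I hate databases")

# Column order chosen by hand to match the table
terms <- c("I", "like", "hate", "databases")

# Split each document into words on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Count each term per document; vapply yields a terms x documents matrix,
# so transpose it to get documents as rows
toy_dtm <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = terms))),
  integer(length(terms))
))
colnames(toy_dtm) <- terms

print(toy_dtm)
```

Each row is one document and each column one term, which is the same layout DocumentTermMatrix() produces.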

Running the following shows that the corpus contains 10 documents and 503 distinct terms.

> dtm <- DocumentTermMatrix(doc)
> dtm
<<DocumentTermMatrix (documents: 10, terms: 503)>>
Non-/sparse entries: 519/4511
Sparsity : 90%
Maximal term length: 16
Weighting : term frequency (tf)
> class(dtm)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
> dim(dtm)
[1] 10 503
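To try the same calls without the corpus prepared in the earlier posts, a minimal sketch with a two-document stand-in might look like this. It assumes the tm package is installed; texts, toy_corpus and toy_dtm are hypothetical names, and the wordLengths control is added only so that one-letter words are not dropped.

```r
library(tm)  # assumption: the tm package is installed

# Hypothetical stand-in for the `doc` corpus used in this post
texts <- c("I like databases", "I hate databases")
toy_corpus <- VCorpus(VectorSource(texts))

# By default tm ignores terms shorter than 3 characters;
# wordLengths = c(1, Inf) keeps one-letter terms as well
toy_dtm <- DocumentTermMatrix(
  toy_corpus,
  control = list(wordLengths = c(1, Inf))
)

dim(toy_dtm)        # number of documents, number of terms
as.matrix(toy_dtm)  # dense view of the sparse matrix
```

as.matrix() is convenient for a toy corpus, but for a large corpus the dense matrix can be much bigger than the sparse representation.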

inspect() shows the same summary, and it can also be used to look at just a subset of the matrix.

First, use colnames() to list the column names (the terms).

> colnames(dtm)
[1] "(laughter)" "(sample" "@coordinator" "@delighted"
[5] "@she" "@two" "10,000." "14th"
[9] "173" "7,000" "8,000," "able"
(output omitted)
[513] "who" "who's" "whom" "will"
[517] "with" "work" "work," "would"
[521] "writing," "wrote" "yes," "yes."
[525] "yet" "york." "you" "you're"
[529] "you," "your" "yourself."

As shown below, you can look up the frequencies of 10 columns across the 10 documents.

As the output shows, most cells are empty: of the 100 cells, only 12 hold a value, hence Sparsity = 88%.

> inspect(dtm[1:10, 521:530])  # 1:10 selects the documents (rows), 521:530 selects the terms (columns)
<<DocumentTermMatrix (documents: 10, terms: 10)>>
Non-/sparse entries: 12/88
Sparsity : 88%
Maximal term length: 8
Weighting : term frequency (tf)
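The sparsity figures can be checked from the "Non-/sparse entries" line alone. The helper below is a hypothetical illustration (sparsity_pct is not a tm function); it assumes the percentage is the share of zero cells, rounded to a whole number.

```r
# Hypothetical helper: share of empty cells, as a rounded percentage
sparsity_pct <- function(non_sparse, sparse) {
  round(100 * sparse / (non_sparse + sparse))
}

# Full matrix: 519 non-zero cells, 4511 empty cells (10 x 503 = 5030 cells)
full_pct <- sparsity_pct(519, 4511)

# 10 x 10 subset: 12 non-zero cells, 88 empty cells
subset_pct <- sparsity_pct(12, 88)

print(c(full = full_pct, subset = subset_pct))
```

The two values agree with the 90% and 88% reported in the outputs above.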

The transposed form, with terms as rows and documents as columns, can be obtained with TermDocumentMatrix().

> tdm <- TermDocumentMatrix(doc)
> tdm
<<TermDocumentMatrix (terms: 503, documents: 10)>>
Non-/sparse entries: 519/4511
Sparsity : 90%
Maximal term length: 16
Weighting : term frequency (tf)
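A small sketch can confirm that the two matrices hold the same counts with rows and columns swapped. It assumes the tm package is installed and uses a hypothetical two-document corpus (toy_corpus) rather than the doc corpus from this post.

```r
library(tm)  # assumption: the tm package is installed

# Hypothetical stand-in corpus
toy_corpus <- VCorpus(VectorSource(c("I like databases", "I hate databases")))
ctrl <- list(wordLengths = c(1, Inf))  # keep one-letter terms

toy_dtm <- DocumentTermMatrix(toy_corpus, control = ctrl)
toy_tdm <- TermDocumentMatrix(toy_corpus, control = ctrl)

# Transposing the document-term matrix should give the term-document matrix
same_counts <- all(as.matrix(t(toy_dtm)) == as.matrix(toy_tdm))
print(same_counts)
```

Which orientation to use is mostly a matter of convenience: some functions expect documents as rows, others expect terms as rows.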


갈루아의 반서재
