갈루아의 반서재

1. Exploring the Corpus

Use inspect() to check whether the document data was loaded correctly.

 

> inspect(docs[2])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

 

[[1]]

NULL
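As a minimal sketch of the same check on a corpus that is easy to reproduce, the snippet below builds a two-document corpus from a character vector and inspects it. The sample strings and the variable names `texts` and `second_text` are assumptions made for illustration; they are not part of the original data.

```r
library(tm)

# Hypothetical sample text, not taken from the original corpus
texts <- c("Good morning.", "It's great to be here.")
docs <- VCorpus(VectorSource(texts))

# Number of documents held by the corpus
length(docs)

# Print the corpus header and the second document's summary
inspect(docs[2])

# Pull out the raw text of the second document
second_text <- content(docs[[2]])
print(second_text)
```

If `content()` returns the expected string, the document was loaded intact.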

2. Preparing the Corpus

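Some text analyses require preprocessing first. getTransformations() lists the transformations that tm ships with: removing numbers, punctuation, or specified words, stemming, and collapsing whitespace. Lowercase conversion is not in that list; it is done by wrapping base R's tolower() in content_transformer().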

 

> getTransformations()

## [1] "removeNumbers" "removePunctuation" "removeWords"

## [4] "stemDocument" "stripWhitespace"

Transformations are applied with tm_map(), as the next section shows.
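The following is a rough sketch of how several of the listed transformations might be chained with tm_map(). The sample sentence and the variable name `cleaned` are assumptions for illustration, and the order of the steps is one reasonable choice rather than a prescribed one.

```r
library(tm)

# Hypothetical sample sentence, not taken from the original corpus
texts <- c("We have 173 Schools,   and 25 elementary schools!")
docs <- VCorpus(VectorSource(texts))

# tolower() is a base R function, so it is wrapped in content_transformer()
docs <- tm_map(docs, content_transformer(tolower))

# Built-in tm transformations can be passed to tm_map() directly
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

# Collapse the runs of whitespace left behind by the removals
docs <- tm_map(docs, stripWhitespace)

cleaned <- content(docs[[1]])
print(cleaned)
```

Running stripWhitespace last cleans up the gaps left where numbers and punctuation were deleted.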

 

 

3. Simple Transforms


The following example gives a brief walkthrough.

 

> library(NLP)

> library(tm)  # load tm

> con <- file("./corpus/txt/corpustext.txt") 

> lines <- readLines(con)

> close(con)

> rm(con)

> head(lines, 10) # show the first 10 lines

 [1] "(Sample 1)"                                                                                                                                                                                                  

 [2] "STRICKLAND: Good morning@."                                                                                                                                                                                  

 [3] "Marsha is on her way. @She called from the car phone I think. It sounded like the car phone, to let us know that she would be delayed."                                                                      

 [4] "I would like to welcome @two people who haven't been with us before."                                                                                                                                        

 [5] "Suzanne Clewell, we're @delighted to have you with us today. Suzanne, would you tell us a little bit about what you do?"                                                                                     

 [6] "CLEWELL: Yes. I'm the @Coordinator for Reading Language Arts with the Montgomery County Public Schools which is the suburban district surrounding Washington. We have 173 schools and 25 elementary schools."

 [7] "It's great to be here."                                                                                                                                                                                      

 [8] "STRICKLAND: And I'll skip over to another member of the committee, but for her, this is her first meeting, too, Judith Langer. I think we all know her work, if we didn't know her."                         

 [9] "Judith."                                                                                                                                                                                                     

[10] "LANGER@: Hello. I'm delighted to be here."              # the example below removes the @ characters visible in this output

> lines <- head(lines, 10)

> doc <- Corpus(VectorSource(lines))

> summary(doc)

   Length Class             Mode

1  2      PlainTextDocument list

2  2      PlainTextDocument list

3  2      PlainTextDocument list

4  2      PlainTextDocument list

5  2      PlainTextDocument list

6  2      PlainTextDocument list

7  2      PlainTextDocument list

8  2      PlainTextDocument list

9  2      PlainTextDocument list

10 2      PlainTextDocument list

> doc[[1]]

<<PlainTextDocument (metadata: 7)>>

(Sample 1)

> doc[[2]]

<<PlainTextDocument (metadata: 7)>>

STRICKLAND: Good morning@. # <- the @ is visible here

> doc[[3]]

<<PlainTextDocument (metadata: 7)>>

Marsha is on her way. @She called from the car phone I think. It sounded like the car phone, to let us know that she would be delayed.

> doc[[4]]

<<PlainTextDocument (metadata: 7)>>

I would like to welcome @two people who haven't been with us before.

> inspect(doc[1])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

(Sample 1)


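> toSpace <- content_transformer(function(x, pattern) gsub(pattern, "", x)) # wraps gsub() so that it edits each document's content; despite the name, the replacement is "", so matches are deleted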

> doc <- tm_map(doc,toSpace, "@") 

> inspect(doc[2])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

STRICKLAND: Good morning. # the @ is gone
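The transformer in the transcript replaces the pattern with an empty string, so the match is deleted rather than turned into a space. The base R sketch below contrasts the two replacements. The first input line comes from the sample output above; the string `"car@phone"` and all variable names are assumptions for illustration.

```r
# Line taken from the sample output above
line <- "STRICKLAND: Good morning@."

# Replacement "" deletes the match (what the transcript does)
deleted <- gsub("@", "", line)

# Replacement " " substitutes a space (what the name toSpace suggests)
spaced <- gsub("@", " ", line)

print(deleted)
print(spaced)

# Hypothetical input where the choice matters: deletion fuses the two words
glued <- gsub("@", "", "car@phone")
split <- gsub("@", " ", "car@phone")
print(c(glued, split))
```

Deleting works well when the stray character sits next to punctuation or a space, as it does here. When it separates two words, substituting a space and then applying stripWhitespace is the safer choice.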