04. Preparing the Corpus - Basic Transformations


1. Conversion to Lower Case

> inspect(doc[2])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Good morning.

> doc <- tm_map(doc, content_transformer(tolower))
> inspect(doc[2])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
strickland: good morning. # uppercase letters are now lowercase (e.g. G -> g)
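
As a minimal sketch of what this step does to the text itself: tolower is a base R function that works on character vectors, and content_transformer wraps it so that tm_map can apply it to the content of each document and still return a document object. The example below skips the corpus and calls tolower directly. The texts vector is a hypothetical stand-in for the document contents.

```r
# Hypothetical stand-ins for the text of two documents in the corpus.
texts <- c("STRICKLAND: Good morning.", "CLEWELL: Yes.")

# tolower() is vectorised, so every element is lowercased in one call.
lowered <- tolower(texts)
print(lowered)
```

Punctuation and spacing are untouched at this stage; only the letter case changes.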

2. Remove Numbers

> inspect(doc[6])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
clewell: yes. i'm the coordinator for reading language arts with the montgomery county public schools which is the suburban district surrounding washington. we have 173 schools and 25 elementary schools.

> doc <- tm_map(doc, removeNumbers)
> inspect(doc[6])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
clewell: yes. i'm the coordinator for reading language arts with the montgomery county public schools which is the suburban district surrounding washington. we have schools and elementary schools. # 173 and 25 have been removed
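
The effect of removing numbers can be approximated in base R with a single regular expression. This is a rough sketch of the idea, not necessarily how removeNumbers is implemented internally. The sentence is taken from the transcript above.

```r
# Part of the sentence shown in the transcript above.
text <- "we have 173 schools and 25 elementary schools."

# Delete every run of digits. The spaces on both sides of each number stay.
no_digits <- gsub("[[:digit:]]+", "", text)
print(no_digits)

# Count the double spaces left behind where the numbers used to be.
double_spaces <- lengths(regmatches(no_digits, gregexpr("  ", no_digits, fixed = TRUE)))
print(double_spaces)
```

Each deleted number leaves two adjacent spaces behind, which is one reason to strip whitespace as the last step.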

3. Remove Punctuation

> inspect(doc[4])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
i would like to welcome two people who haven't been with us before.

> doc <- tm_map(doc, removePunctuation)
> inspect(doc[4])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
i would like to welcome two people who havent been with us before # the period and the apostrophe (haven't -> havent) have been removed
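
Removing punctuation also strips apostrophes, which matters for the stop word step that follows: the stop word list contains "haven't" but not "havent", so a contraction that has lost its apostrophe no longer matches. The sketch below approximates the step with a base R regular expression (an assumption about the behaviour, not the package's own code) and checks the match.

```r
# The sentence shown in the transcript above.
text <- "i would like to welcome two people who haven't been with us before."

# Delete every punctuation character, including the apostrophe.
no_punct <- gsub("[[:punct:]]+", "", text)
print(no_punct)

# Split into words and look for the stop word "haven't".
words <- strsplit(no_punct, " ", fixed = TRUE)[[1]]
still_matches <- "haven't" %in% words
print(still_matches)
```

This is why words such as "didnt" and "ill" survive stop word removal in section 4. If contractions should be dropped, one option is to remove stop words before removing punctuation.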

4. Remove Stop Words

Stop words are very common words, such as for, very, and, of, and are, that carry little meaning on their own and are usually dropped before analysis.

> length(stopwords("english")) # the English stop word list has 174 entries
[1] 174
> stopwords("english") # the full list is shown below
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"

> inspect(doc[8])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
strickland and ill skip over to another member of the committee but for her this is her first meeting too judith langer i think we all know her work if we didnt know her # the stop words in this text (and, over, to, of, the, but, for, her, ...) are removed in the output below

> doc <- tm_map(doc, removeWords, stopwords("english"))
> inspect(doc[8])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
strickland ill skip another member committee first meeting judith langer think know work didnt know
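
Word removal can be sketched as a regular expression that matches any listed word between word boundaries. The helper remove_words below is hypothetical and only loosely modelled on what removeWords does. The short stops vector is a hand-picked subset of the list above.

```r
# Hypothetical helper: delete whole-word occurrences of any word in `words`.
remove_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Start of the sentence shown in the transcript above.
text <- "strickland and ill skip over to another member of the committee"

# A hand-picked subset of the English stop word list.
stops <- c("and", "over", "to", "of", "the")

removed <- remove_words(text, stops)

# Collapse the gaps left by the deleted words so the result is easier to read.
cleaned <- gsub("[[:space:]]+", " ", removed)
print(cleaned)
```

The word boundaries keep the "and" inside "strickland" and the "the" inside "another" from being touched.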


5. Remove Your Own Stop Words

> inspect(doc[5])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
suzanne clewell delighted us today suzanne tell us little bit

> doc <- tm_map(doc, removeWords, c("tell"))
> inspect(doc[5])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
suzanne clewell delighted us today suzanne us little bit # tell has been removed
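
A custom word is removed only where it stands alone as a whole word. The sketch below assumes the same word-boundary matching as a plain regular expression, and the sentence is a hypothetical variation on the one above with "storytelling" added to show the difference.

```r
# Hypothetical sentence: "tell" appears alone and inside "storytelling".
text <- "suzanne tell us little bit about storytelling"

# Match "tell" only between word boundaries.
removed <- gsub("\\btell\\b", "", text, perl = TRUE)
print(removed)
```

To drop every form of a word, list each form explicitly or stem the text first.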


6. Strip Whitespace

> inspect(doc[3])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
marsha way called car phone think sounded like car phone let us know delayed

> doc <- tm_map(doc, stripWhitespace)
> inspect(doc[3])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
marsha way called car phone think sounded like car phone let us know delayed
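
The before and after texts above look the same, most likely because the extra spaces are not visible in the pasted output. A minimal base R sketch of the idea (an approximation, with a hypothetical input string) makes the change visible: every run of whitespace collapses to a single space.

```r
# Hypothetical input with runs of spaces, a tab, and padding at both ends.
text <- "  marsha   way called\tcar phone  "

# Collapse each run of whitespace into a single space.
collapsed <- gsub("[[:space:]]+", " ", text)
print(collapsed)

# Collapsing does not trim, so one space remains at each end; trimws() removes it.
trimmed <- trimws(collapsed)
print(trimmed)
```

Running this step last cleans up the gaps left by the earlier removals.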



License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)

