갈루아의 반서재


1. Conversion to Lower Case



> inspect(doc[2])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

STRICKLAND: Good morning.


> doc <- tm_map(doc, content_transformer(tolower))

> inspect(doc[2])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

strickland: good morning. // Uppercase letters have been converted to lowercase (for example, G -> g).
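The call above wraps `tolower` in `content_transformer()` because `tolower` is a base R function that returns a plain character vector, while `tm_map()` expects each result to remain a document. The following is a minimal sketch on a toy corpus. The corpus name `toy` and the second sentence are made up for illustration, and the sketch assumes the `tm` package is installed.

```r
library(tm)

# Toy corpus; the name and the second sentence are illustrative only
toy <- VCorpus(VectorSource(c("STRICKLAND: Good morning.",
                              "Welcome To The Meeting.")))

# content_transformer() applies tolower to the text of each document
# and keeps the document object (and its metadata) intact
toy <- tm_map(toy, content_transformer(tolower))

# content() extracts the text of a single document
first_doc <- content(toy[[1]])
second_doc <- content(toy[[2]])
print(first_doc)
print(second_doc)
```

Checking one or two documents with `content()` after each step is a lighter alternative to `inspect()` when only the text matters.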


2. Remove Numbers


> inspect(doc[6])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

clewell: yes. i'm the coordinator for reading language arts with the montgomery county public schools which is the suburban district surrounding washington. we have 173 schools and 25 elementary schools.


> doc <- tm_map(doc, removeNumbers)

> inspect(doc[6])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

clewell: yes. i'm the coordinator for reading language arts with the montgomery county public schools which is the suburban district surrounding washington. we have  schools and  elementary schools. // 173 and 25 have been removed.
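Note that only the digits disappear; the spaces on either side of each number stay, which leaves a double space behind. A small sketch on a fragment of the sentence above, applied to a plain character vector rather than a corpus (the variable names `x` and `y` are illustrative):

```r
library(tm)

# Fragment of the sentence shown above
x <- "we have 173 schools and 25 elementary schools."

# removeNumbers() also accepts a plain character vector
y <- removeNumbers(x)
print(y)

# TRUE if a double space was left where a number used to be
has_gap <- grepl("  ", y, fixed = TRUE)
print(has_gap)
```

These leftover gaps are what the whitespace step at the end of this post cleans up.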


3. Remove Punctuation


> inspect(doc[4])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

i would like to welcome two people who haven't been with us before.


> doc <- tm_map(doc, removePunctuation)

> inspect(doc[4])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

i would like to welcome two people who havent been with us before // The period has been removed, and so has the apostrophe in "haven't".


4. Remove Stop Words


Stop words are very common words such as for, very, and, of, and are. They carry little meaning on their own, so they are usually removed before analysis.


> length(stopwords("english")) # the English stop word list contains 174 words

[1] 174

> stopwords("english") # the words are listed below

  [1] "i"          "me"         "my"         "myself"     "we"        

  [6] "our"        "ours"       "ourselves"  "you"        "your"      

 [11] "yours"      "yourself"   "yourselves" "he"         "him"       

 [16] "his"        "himself"    "she"        "her"        "hers"      

 [21] "herself"    "it"         "its"        "itself"     "they"      

 [26] "them"       "their"      "theirs"     "themselves" "what"      

 [31] "which"      "who"        "whom"       "this"       "that"      

 [36] "these"      "those"      "am"         "is"         "are"       

 [41] "was"        "were"       "be"         "been"       "being"     

 [46] "have"       "has"        "had"        "having"     "do"        

 [51] "does"       "did"        "doing"      "would"      "should"    

 [56] "could"      "ought"      "i'm"        "you're"     "he's"      

 [61] "she's"      "it's"       "we're"      "they're"    "i've"      

 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     

 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      

 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   

 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    

 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    

 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     

 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    

[101] "who's"      "what's"     "here's"     "there's"    "when's"    

[106] "where's"    "why's"      "how's"      "a"          "an"        

[111] "the"        "and"        "but"        "if"         "or"        

[116] "because"    "as"         "until"      "while"      "of"        

[121] "at"         "by"         "for"        "with"       "about"     

[126] "against"    "between"    "into"       "through"    "during"    

[131] "before"     "after"      "above"      "below"      "to"        

[136] "from"       "up"         "down"       "in"         "out"       

[141] "on"         "off"        "over"       "under"      "again"     

[146] "further"    "then"       "once"       "here"       "there"     

[151] "when"       "where"      "why"        "how"        "all"       

[156] "any"        "both"       "each"       "few"        "more"      

[161] "most"       "other"      "some"       "such"       "no"        

[166] "nor"        "not"        "only"       "own"        "same"      

[171] "so"         "than"       "too"        "very"      


> inspect(doc[8])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

strickland and ill skip over to another member of the committee but for her this is her first meeting too judith langer i think we all know her work if we didnt know her // Stop words such as and, over, to, of, the, and her are removed in the output below.


> doc <- tm_map(doc, removeWords, stopwords("english"))

> inspect(doc[8])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

strickland  ill skip   another member   committee       first meeting  judith langer  think   know  work   didnt know   
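In the output above, "ill" and "didnt" survive even though "i'll" and "didn't" are on the stop word list (entries 75 and 90). The apostrophes were already stripped in step 3, so these tokens no longer match the list. The sketch below compares the two possible orders on a made-up sentence. The sentence and the names `punct_first` and `stop_first` are illustrative, and which order is preferable depends on the analysis.

```r
library(tm)

# Illustrative sentence containing two contractions from the stop word list
x <- "if we didn't know her work we haven't met"

# Order used in this post: punctuation first, then stop words
punct_first <- removeWords(removePunctuation(x), stopwords("english"))

# Alternative order: stop words first, then punctuation
stop_first <- removePunctuation(removeWords(x, stopwords("english")))

# Collapse the gaps so the two results are easier to compare
print(stripWhitespace(punct_first))
print(stripWhitespace(stop_first))
```

With punctuation removed first, "didnt" and "havent" remain as tokens; with stop words removed first, the contractions are dropped as whole words.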



5. Remove Own Stop Words



> inspect(doc[5])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

suzanne clewell  delighted     us today suzanne   tell us  little bit    


> doc <- tm_map(doc, removeWords, c("tell"))

> inspect(doc[5])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

suzanne clewell  delighted     us today suzanne    us  little bit    // "tell" has been removed.
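Steps 4 and 5 can also be combined by appending custom words to the built-in list and calling `removeWords` once. This is a sketch under assumptions: the vector name `my_stopwords`, the extra word "today", and the sample sentence are hypothetical and not part of the original walkthrough.

```r
library(tm)

# Built-in English list plus hypothetical project-specific words
my_stopwords <- c(stopwords("english"), "tell", "today")

# Illustrative sentence, already lowercased
x <- "suzanne tell us a little bit about your work today"

# removeWords() matches whole words only and is case-sensitive,
# so the text should be lowercased first
y <- stripWhitespace(removeWords(x, my_stopwords))
print(y)
```

Keeping the custom words in one vector makes it easy to reuse the same list across several corpora.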



6. Strip Whitespace


> inspect(doc[3])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

marsha    way  called   car phone  think  sounded like  car phone  let us know     delayed


> doc <- tm_map(doc, stripWhitespace)

> inspect(doc[3])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>


[[1]]

<<PlainTextDocument (metadata: 7)>>

marsha way called car phone think sounded like car phone let us know delayed
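The six steps can be collected into one helper so that the same cleaning is applied consistently. The function below is a minimal sketch, not part of the `tm` package: the name `clean_corpus`, its `extra_words` argument, and the sample sentence are all hypothetical. It applies the transformations in the same order as this post.

```r
library(tm)

# Hypothetical helper applying steps 1 to 6 in the order used in this post
clean_corpus <- function(corpus, extra_words = character(0)) {
  corpus <- tm_map(corpus, content_transformer(tolower))       # 1. lower case
  corpus <- tm_map(corpus, removeNumbers)                      # 2. numbers
  corpus <- tm_map(corpus, removePunctuation)                  # 3. punctuation
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 4. stop words
  if (length(extra_words) > 0) {
    corpus <- tm_map(corpus, removeWords, extra_words)         # 5. own words
  }
  tm_map(corpus, stripWhitespace)                              # 6. whitespace
}

# Illustrative one-document corpus
toy <- VCorpus(VectorSource(
  "STRICKLAND: We have 173 schools, and I would like to tell you about them."
))

cleaned <- clean_corpus(toy, extra_words = "tell")
result <- content(cleaned[[1]])
print(result)
```

Stripping whitespace last matters, because every earlier removal step can leave extra spaces behind. `stripWhitespace` collapses runs of spaces into one but does not trim a leading or trailing space; base R's `trimws()` can be applied afterwards if that matters.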



