Posts tagged 'corpus'

corpus

갈루아의 반서재

06. Stemming

2014. 11. 22.

Stemming. Stemming uses an algorithm that removes common English word endings such as "es", "ed", and "s". We use the wordStem() function from the SnowballC package (Bouchet-Valat, 2014). Many data analyses need words reduced to their stems; for example, "example" and "examples" can both be said to derive from the same stem, "exampl". Compare the output below before and after stemming.
> doc[[3]]
STRICKLAND: All right. So it will be prior to August 14th or whatever date it is.
> doc[[6]]
STRICKLAND: Way prior..
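The stemming step described above can be tried on a plain character vector before it is applied to a corpus. This is a minimal sketch that assumes the SnowballC package is installed; the word list is made up for illustration and is not taken from the transcript.

```r
library(SnowballC)

# Hypothetical input words, chosen only to illustrate the idea.
words <- c("example", "examples", "schools", "surrounding")

# wordStem() uses the Porter stemmer unless another language is given.
stems <- wordStem(words)

# Show each word next to its stem.
print(data.frame(word = words, stem = stems))
```

"example" and "examples" end up with the same stem, which is why stemming lets an analysis count them as one term.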

05. Preparing the Corpus - Specific Transformations

2014. 11. 22.

An example of a specific transformation. In the second output, "washington" has been replaced with "WA".
> toString
> inspect(doc[6])
[[1]]
clewell yes im coordinator reading language arts montgomery county public schools suburban district surrounding washington schools elementary schools
> doc
> inspect(doc[6])
[[1]]
clewell yes im coordinator reading language arts montgomery county public schools suburban district surrounding WA schools elementary schools
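The commands in the excerpt above are cut off, so the following is only a sketch of one way such a replacement could be written with tm, not necessarily the code of the original post. It assumes the tm package is installed; the helper name replaceText and the sample sentence are hypothetical.

```r
library(tm)

# Illustrative one-document corpus; the sentence is made up for this sketch.
docs <- VCorpus(VectorSource("schools in the district surrounding washington"))

# content_transformer() wraps a plain function so that tm_map() can apply it
# to each document and pass the extra arguments (from, to) through to it.
replaceText <- content_transformer(
  function(x, from, to) gsub(from, to, x, fixed = TRUE)
)

docs <- tm_map(docs, replaceText, "washington", "WA")

result <- content(docs[[1]])
print(result)
```

A name other than toString is used for the helper here because base R already has a function called toString().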

04. Preparing the Corpus - Basic Transformations

2014. 11. 22.

1. Conversion to Lower Case
> inspect(doc[2])
[[1]]
STRICKLAND: Good morning.
> doc
> inspect(doc[2])
[[1]]
strickland: good morning.
// Note that G has been converted to g.
2. Remove Numbers
> inspect(doc[6])
[[1]]
clewell: yes. i'm the coordinator for reading language arts with the montgomery county public schools which is the suburban district surrounding washington. we have 173 schools and 25 el..
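The two basic transformations above can be reproduced on a small corpus. This is a sketch that assumes the tm package is installed; the sample sentence is invented for the example, and the exact commands used in the original post are not visible in the excerpt.

```r
library(tm)

# Illustrative one-document corpus; the text is made up for this sketch.
docs <- VCorpus(VectorSource("STRICKLAND: We have 173 schools."))

# 1. Lower-case conversion: tolower() is a base R function, so it is wrapped
#    with content_transformer() before being handed to tm_map().
docs <- tm_map(docs, content_transformer(tolower))

# 2. Number removal: removeNumbers() is one of tm's own transformations.
docs <- tm_map(docs, removeNumbers)

result <- content(docs[[1]])
print(result)
```

removeNumbers() deletes only the digits, so the spaces around a removed number remain; a later stripWhitespace pass is one way to collapse them.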

03. Exploring the Corpus - Preprocessing and Simple Transforms

2014. 11. 21.

1. Exploring the Corpus
inspect() can be used to check that the documents were loaded correctly.
> inspect(docs[2])
[[1]]
NULL
2. Preparing the Corpus
Depending on the case, text analysis may require preprocessing, such as converting the text to lower case or removing numbers. The transformations that tm provides are listed below.
> getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
Transformations are applied with tm_map(), as covered below.
3. Simple Transforms ..
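As a rough illustration of how tm_map() applies the listed transformations, the sketch below chains two of them on a tiny corpus. It assumes the tm package is installed; the two sample sentences are made up for the example.

```r
library(tm)

# Illustrative two-document corpus; the text is made up for this sketch.
docs <- VCorpus(VectorSource(c("Good   morning.",
                               "All right, so it will be prior.")))

# List the transformations that the installed version of tm provides.
print(getTransformations())

# Apply two of them in sequence; each call returns a new corpus.
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

cleaned <- c(content(docs[[1]]), content(docs[[2]]))
print(cleaned)
```

The list printed by getTransformations() may differ from the one shown in the post, depending on the tm version.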

02. Loading a Corpus (txt, pdf, word)

2014. 11. 18.

1. Loading a Corpus
The documents to be analyzed come in many formats, and the tm package we will be using supports quite a few of them, including text, PDF, Microsoft Word, and XML.
2. Corpus Sources and Readers
1) sources
> getSources()
[1] "DataframeSource" "DirSource" "URISource" "VectorSource" "XMLSource"
2) readers - documents in formats such as the following can be read
> getReaders()
[1] "readDOC" "readPDF" "readPlain" "readRCV1"
[5] "readRCV1asPlain" "readReut21578XML" "readReut2..
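To show how a source and a reader fit together, the sketch below loads plain-text files from a directory. It assumes the tm package is installed; the directory, file names, and file contents are created only for this example, and VCorpus() is used here as one possible constructor rather than the one the original post necessarily used.

```r
library(tm)

# Create a throwaway directory with two small text files (hypothetical data).
corpus_dir <- file.path(tempdir(), "corpus_txt")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("Good morning.", file.path(corpus_dir, "doc1.txt"))
writeLines("All right.", file.path(corpus_dir, "doc2.txt"))

# DirSource() is the source (where the documents live);
# readPlain is the reader (how each file is parsed).
docs <- VCorpus(DirSource(corpus_dir, pattern = "\\.txt$"),
                readerControl = list(reader = readPlain, language = "en"))

n_docs <- length(docs)
first <- content(docs[[1]])
print(n_docs)
print(first)
```

For PDF or Word files, the reader would be swapped for readPDF or readDOC, which typically depend on external tools being available on the system.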


