06. Stemming



Stemming uses an algorithm that strips common English word endings such as "es", "ed", and "s".

This relies on the wordStem() function from the SnowballC package (Bouchet-Valat, 2014).
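As a minimal sketch of what wordStem() does on its own, the snippet below stems three sample words, one for each ending mentioned above. The sample words are hypothetical and do not come from the corpus used in this series; the language = "english" argument selects Snowball's English stemmer.

```r
library(SnowballC)

# Hypothetical sample words, one per ending ("es", "ed", "s")
words <- c("boxes", "walked", "walks")

# wordStem() takes a character vector and returns the stem of each element
stems <- wordStem(words, language = "english")

print(data.frame(word = words, stem = stems))
```

Note that wordStem() expects one word per element, so a sentence has to be split into words before it is stemmed.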

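Data analysis often requires reducing words to their stems. For example, "example" and "examples" both reduce to the same stem, "exampl", so they can be treated as a single term. The output below compares a few documents before and after stemming.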
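Before turning to the corpus, the "example"/"examples" case can be checked directly. The sketch below assumes a small made-up word list and shows why this matters for analysis: counting raw words keeps the singular and the plural apart, while counting stems merges them.

```r
library(SnowballC)

# Made-up word list for illustration only
words <- c("example", "examples", "example", "examples", "date")
stems <- wordStem(words, language = "english")

# Raw counts treat "example" and "examples" as different terms
raw_counts <- table(words)

# Stem counts merge both forms under "exampl"
stem_counts <- table(stems)

print(raw_counts)
print(stem_counts)
```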

> doc[[3]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: All right. So it will be prior to August 14th or whatever date it is.
> doc[[6]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Way prior, yes. Way prior.
> doc[[10]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Anything else?

> library(SnowballC)
> doc <- tm_map(doc, stemDocument)
> inspect(doc[3])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: All right. So it will be prior to August 14th or whatev date it is. // compare with doc[[3]] above: "whatever" became "whatev"

> inspect(doc[6])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Way prior, yes. Way prior.

> inspect(doc[10])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Anyth else?
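The transcript above depends on a doc corpus built in the earlier posts of this series. For readers who want to reproduce the effect without it, here is a self-contained sketch that builds a tiny corpus from two sentences taken from the output above (speaker labels omitted) and applies the same tm_map(doc, stemDocument) step. The variable names corpus, stemmed, before, and after are illustrative, not part of the original code.

```r
library(tm)
library(SnowballC)

# Two sentences borrowed from the documents shown above
texts <- c(
  "So it will be prior to August 14th or whatever date it is.",
  "Anything else?"
)

# Build an in-memory corpus and stem every document in it
corpus  <- VCorpus(VectorSource(texts))
stemmed <- tm_map(corpus, stemDocument)

# Pull the text back out to compare before and after
before <- unname(sapply(corpus, as.character))
after  <- unname(sapply(stemmed, as.character))

print(cbind(before, after))
```

Keeping the original corpus in a separate variable, rather than overwriting it as the transcript does, makes this kind of side-by-side comparison possible after stemming.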


