[Python] Analyzing an English Word Corpus


원-스텝 2022. 10. 8. 15:05

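Analyzing an English Word Corpus

In this project, we analyze and visualize a word list from the British National Corpus (BNC), which pairs English words with their frequencies.

Analyze the most frequently used English words using corpus.txt
Draw a bar chart of per-word usage frequency with matplotlib

After the analysis, we compare the word counts in the storybook "Alice in Wonderland" with the BNC data.

Distribution of the most frequently occurring words
Most frequently used words once stopwords are excluded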
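
As a rough sketch of the second comparison, stopwords can be dropped from a corpus before ranking it. The STOPWORDS set, the remove_stopwords name, and every sample count except ('the', 487) are made up here for illustration; the project itself may use a different stopword list.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {'the', 'of', 'and', 'a', 'to', 'in'}


def remove_stopwords(corpus, stopwords=STOPWORDS):
    """Return the (word, frequency) tuples whose word is not a stopword."""
    # Lowercase the word before the lookup so 'The' is treated like 'the'.
    return [(word, freq) for word, freq in corpus if word.lower() not in stopwords]


# Made-up sample data in the same (word, frequency) shape the project uses.
sample = [('the', 487), ('Alice', 120), ('and', 300), ('Rabbit', 30)]
print(remove_stopwords(sample))  # [('Alice', 120), ('Rabbit', 30)]
```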

Before we write the code together in the live class, take a look at the corpus.txt file and the skeleton code in main.py.

Functions to write

import_corpus(filename)
create_corpus(filenames)
filter_by_prefix(corpus, prefix)
most_frequent_words(corpus, number)

Implementation details

1. import_corpus(filename)

Loads a single file of word and frequency data and returns a list of (word, frequency) tuples.
In other words, this function reads a corpus file and converts it into a list.

Example return value

[('zoo', 768), ('zones', 1168), ...
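
To make the conversion concrete, here is a minimal sketch that parses lines of the form word,frequency. The comma-separated layout is taken from the finished code further down; the parse_corpus_lines name and the in-memory stand-in for corpus.txt are assumptions for illustration.

```python
import io


def parse_corpus_lines(lines):
    """Turn 'word,frequency' lines into a list of (word, frequency) tuples."""
    corpus = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on them.
            continue
        word, freq = line.split(',')
        corpus.append((word, int(freq)))
    return corpus


# In-memory stand-in for corpus.txt; the two rows mirror the return example above.
sample = io.StringIO("zoo,768\nzones,1168\n")
print(parse_corpus_lines(sample))  # [('zoo', 768), ('zones', 1168)]
```

A file object opened with open() can be passed in the same way, because iterating over it also yields one line at a time.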

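2. create_corpus(filenames)

Loads several text files at once and returns a list of (word, frequency) tuples.
In other words, this function reads the text files, counts how often each word occurs, and builds a list of tuples.

Example return value

[('Down', 3), ('the', 487), ('RabbitHole', 1), ...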
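
The counting step can be sketched on a plain string as follows. This version strips punctuation with str.translate instead of a loop of replace() calls, which is a swapped-in technique rather than what the course skeleton asks for; the count_words name and the sample sentence are assumptions.

```python
from collections import Counter
from string import punctuation


def count_words(text):
    """Strip punctuation from text and return (word, frequency) tuples."""
    # str.maketrans('', '', punctuation) builds a table that deletes every
    # punctuation character, so 'Rabbit-Hole.' becomes 'RabbitHole'.
    cleaned = text.translate(str.maketrans('', '', punctuation))
    # Counter keeps words in first-seen order, and counting is case-sensitive.
    return list(Counter(cleaned.split()).items())


print(count_words("Down the Rabbit-Hole. Down, down, down."))
# [('Down', 2), ('the', 1), ('RabbitHole', 1), ('down', 2)]
```

Because nothing is lowercased, 'Down' and 'down' are counted as separate words, which matches the capitalized 'Down' in the return example above.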

3. filter_by_prefix(corpus, prefix)

Given corpus, a list of (word, frequency) tuples, returns a list containing only the entries whose word starts with the string prefix.

Example call

filter_by_prefix(corpus, "qu")

Returns only the entries in the given corpus whose word starts with the string "qu".

Example return value

[('quotes', 700), ('quoted', 2663), ('quote', 1493), ...
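
One possible way to write this filter is a list comprehension built on str.startswith. The filter_by_prefix_sketch name and the small sample list are assumptions for illustration; the ('zoo', 768) entry is there only to show something being filtered out.

```python
def filter_by_prefix_sketch(corpus, prefix):
    """Keep only the (word, frequency) tuples whose word starts with prefix."""
    return [(word, freq) for word, freq in corpus if word.startswith(prefix)]


sample = [('quotes', 700), ('zoo', 768), ('quoted', 2663), ('quote', 1493)]
print(filter_by_prefix_sketch(sample, "qu"))
# [('quotes', 700), ('quoted', 2663), ('quote', 1493)]

# Every string starts with the empty string, so prefix='' keeps the whole corpus.
print(len(filter_by_prefix_sketch(sample, "")))  # 4
```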

4. most_frequent_words(corpus, number)

Keeps only the number most frequent entries in corpus, ordered from highest to lowest frequency.

Example call

most_frequent_words(corpus, 3)

Example return value

[('the', 6187927), ('of', 2941790), ('and', 2682878)]
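
Besides sorting the whole list and slicing it, the top entries can also be picked with heapq.nlargest from the standard library. This is an alternative sketch, not the approach the course prescribes; the top_n name is an assumption, and ('zoo', 768) is borrowed from the first return example to pad the sample.

```python
import heapq
from operator import itemgetter


def top_n(corpus, number):
    """Return the number entries with the highest frequency, largest first."""
    # itemgetter(1) reads the frequency, the second item of each tuple.
    return heapq.nlargest(number, corpus, key=itemgetter(1))


sample = [('and', 2682878), ('zoo', 768), ('the', 6187927), ('of', 2941790)]
print(top_n(sample, 3))
# [('the', 6187927), ('of', 2941790), ('and', 2682878)]
```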

# Starter code
# Import the packages the project needs.
from operator import itemgetter
from collections import Counter
from string import punctuation
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

from elice_utils import EliceUtils
elice_utils = EliceUtils()


def import_corpus(filename):
    # Create the list that will hold the tuples.
    corpus = []
    
    # Open and read the file passed in as the parameter.
    
        # Store each line of the text file in corpus as a (word, frequency) tuple.
        
            
            
    
    return corpus


def create_corpus(filenames):
    # Create the list that will hold the words.
    words = []
    
    # Store every word that appears in the files in words.
    
        
            
            # Remove all special characters, including punctuation marks. Use punctuation, imported at the top of the file.
            for symbol in punctuation:
                content = None
            words = None
    
    # Convert the data in the words list into corpus form. Look up how to use Counter().
    corpus = Counter(words)
    return list(corpus.items())


def filter_by_prefix(corpus, prefix):
    return None


def most_frequent_words(corpus, number):
    return None
    

def draw_frequency_graph(corpus):
    # Declare pos, which determines the position of each bar in the bar chart.
    pos = range(len(corpus))
    
    # Split corpus, a list of tuples, into a list of words and a list of frequencies.
    words = [tup[0] for tup in corpus]
    freqs = [tup[1] for tup in corpus]
    
    # Set a font so that the Korean title and label render properly.
    font = fm.FontProperties(fname='./NanumBarunGothic.ttf')
    
    # Make the height of each bar equal to the frequency.
    plt.bar(pos, freqs, align='center')
    
    # Label each bar with its word.
    plt.xticks(pos, words, rotation='vertical', fontproperties=font)
    
    # Set the graph title ('단어 별 사용 빈도' means 'usage frequency by word').
    plt.title('단어 별 사용 빈도', fontproperties=font)
    
    # Add a label to the Y axis ('빈도' means 'frequency').
    plt.ylabel('빈도', fontproperties=font)
    
    # Adjust the margins so the words are not cut off.
    plt.tight_layout()
    
    # Save the graph to a file and send it to the Elice platform for display.
    plt.savefig('graph.png')
    elice_utils.send_image('graph.png')


def main(prefix=''):
    # Build the list of tuples with import_corpus().
    corpus = import_corpus('corpus.txt')
    
    # Keep only the words that start with prefix.
    prefix_words = filter_by_prefix(corpus, prefix)
    
    # Sort the words that start with prefix by descending frequency and keep the first 10.
    top_ten = most_frequent_words(prefix_words, 10)
    
    # Plot the frequency of each word.
    draw_frequency_graph(top_ten)
    
    # Turn the words of the book 'Alice in Wonderland' into a corpus.
    alice_files = ['alice/chapter{}.txt'.format(chapter) for chapter in range(1, 6)]
    alice_corpus = create_corpus(alice_files)
    
    top_ten_alice = most_frequent_words(alice_corpus, 10)
    draw_frequency_graph(top_ten_alice)


if __name__ == '__main__':
    main()

# Finished code
# Import the packages the project needs.
from operator import itemgetter
from collections import Counter
from string import punctuation
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

from elice_utils import EliceUtils
elice_utils = EliceUtils()


def import_corpus(filename):
    # Create the list that will hold the tuples.
    corpus = []
    
    # Open and read the file passed in as the parameter.
    with open(filename) as file:
        # Store each line of the text file in corpus as a (word, frequency) tuple.
        for line in file:
            word, num = line.split(',')
            # strip() removes the trailing newline before the conversion to int.
            num = int(num.strip())
            corpus.append((word, num))
    
    return corpus

def create_corpus(filenames):
    # Create the list that will hold the words.
    words = []
    
    # Store every word that appears in the files in words.
    for filename in filenames:
        with open(filename) as file:
            content = file.read()

            # Remove all special characters, including punctuation marks, using the imported punctuation string.
            for symbol in punctuation:
                content = content.replace(symbol, '')
            # Split on whitespace and add this file's words to the running list.
            words += content.split()
    
    # Convert the data in the words list into corpus form with Counter().
    corpus = Counter(words)
    return list(corpus.items())


def filter_by_prefix(corpus, prefix):
    # Equivalent one-liner using filter() and a lambda:
    # return list(filter(lambda item: item[0].startswith(prefix), corpus))
    words = []
    for word, num in corpus:
        # Keep the entry only if its word starts with prefix.
        if word.startswith(prefix):
            words.append((word, num))
    return words


def most_frequent_words(corpus, number):
    # Sort by frequency (index 1 of each tuple) in descending order and keep the first number entries.
    return sorted(corpus, key=lambda item: item[1], reverse=True)[:number]
    

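def draw_frequency_graph(corpus):
    # Declare pos, which determines the position of each bar in the bar chart.
    pos = range(len(corpus))
    
    # Split corpus, a list of tuples, into a list of words and a list of frequencies.
    words = [tup[0] for tup in corpus]
    freqs = [tup[1] for tup in corpus]
    
    # Set a font so that the Korean title and label render properly.
    font = fm.FontProperties(fname='./NanumBarunGothic.ttf')
    
    # Make the height of each bar equal to the frequency.
    plt.bar(pos, freqs, align='center')
    
    # Label each bar with its word.
    plt.xticks(pos, words, rotation='vertical', fontproperties=font)
    
    # Set the graph title ('단어 별 사용 빈도' means 'usage frequency by word').
    plt.title('단어 별 사용 빈도', fontproperties=font)
    
    # Add a label to the Y axis ('빈도' means 'frequency').
    plt.ylabel('빈도', fontproperties=font)
    
    # Adjust the margins so the words are not cut off.
    plt.tight_layout()
    
    # Save the graph to a file and send it to the Elice platform for display.
    plt.savefig('graph.png')
    elice_utils.send_image('graph.png')
    
    # Close the figure so the next call does not draw on top of this graph.
    plt.close()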


def main(prefix=''):
    # Build the list of tuples with import_corpus().
    corpus = import_corpus('corpus.txt')
    
    # Keep only the words that start with prefix.
    prefix_words = filter_by_prefix(corpus, prefix)
    
    # Sort the words that start with prefix by descending frequency and keep the first 10.
    top_ten = most_frequent_words(prefix_words, 10)
    
    # Plot the frequency of each word.
    draw_frequency_graph(top_ten)
    
    # Turn the words of the book 'Alice in Wonderland' into a corpus.
    alice_files = ['alice/chapter{}.txt'.format(chapter) for chapter in range(1, 6)]
    alice_corpus = create_corpus(alice_files)
    
    top_ten_alice = most_frequent_words(alice_corpus, 10)
    draw_frequency_graph(top_ten_alice)


if __name__ == '__main__':
    main()

There is a part in the middle that uses a lambda function, and I had no idea how to use one..!

I referred to the blog below.

Source: https://limjun92.github.io/ai_%EC%8B%9C%EC%9E%91/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EC%8B%A4%EC%A0%84-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%B6%84%EC%84%9D/#%EC%98%81%EC%96%B4-%EB%8B%A8%EC%96%B4-%EB%AA%A8%EC%9D%8C-%EB%B6%84%EC%84%9D%ED%95%98%EA%B8%B0
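
For anyone else stuck on the lambda: as far as I understand it, a lambda is just a small unnamed function, so the sort key can be written three equivalent ways. The pairs list and the frequency helper below are assumptions made up for this comparison, not part of the project code.

```python
from operator import itemgetter

pairs = [('zoo', 768), ('the', 6187927), ('quote', 1493)]


def frequency(pair):
    """Named function: return the frequency stored at index 1."""
    return pair[1]


# 1) A named function as the sort key.
by_def = sorted(pairs, key=frequency, reverse=True)
# 2) The same function written inline as a lambda.
by_lambda = sorted(pairs, key=lambda pair: pair[1], reverse=True)
# 3) operator.itemgetter(1), which also returns the item at index 1.
by_itemgetter = sorted(pairs, key=itemgetter(1), reverse=True)

print(by_lambda)  # [('the', 6187927), ('quote', 1493), ('zoo', 768)]
print(by_def == by_lambda == by_itemgetter)  # True
```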
