Natural Language Processing (NLP) in Python with 8 Projects - Sentence Segmentation
Natural Language Processing (NLP) in Python with 8 Projects - Table of Contents
- Tokenization Basics
- Stemming and Lemmatization
- Stop Words
- Vocabulary and Matching
- POS Tagging
- Named Entity Recognition
- Sentence Segmentation
- Spam Message Classification
- Tfidf Vectorizer
- Restaurant Reviews Classification with NLTK
- Applying Restaurant Reviews Classification with NLTK
- IMDB and Amazon Review Classification with SpaCy
- Text summarization
- Spam Detection with CNN
- Spam Detection with RNN
- Text Generation with TensorFlow Keras and LSTM
Sentence Segmentation
A Doc object’s sentences are available via the Doc.sents property. To view a Doc’s sentences, you can iterate over the Doc.sents, a generator that yields Span objects. You can check whether a Doc has sentence boundaries by calling Doc.has_annotation with the attribute name “SENT_START”.
# Example 1
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
This is a sentence.
This is another sentence.
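Since Doc.sents yields Span objects, each sentence also carries its start and end token indices. A minimal sketch of inspecting them (same en_core_web_sm pipeline and text as Example 1; the exact indices depend on tokenization):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")

for sent in doc.sents:
    # each item is a spacy.tokens.Span covering tokens [sent.start, sent.end)
    print(type(sent).__name__, sent.start, sent.end, sent.text)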
# Example 2
s1 = "This is a sentence. This is second sentence. This is last sentence."
s2 = "This is a sentence; This is second sentence; This is last sentence."
import spacy
nlp = spacy.load(name='en_core_web_sm')
doc1 = nlp(s1)
spaCy splits the single string at each sentence ending with '.' and prints the sentences one by one:
for sent in doc1.sents:
    print(sent.text)
This is a sentence.
This is second sentence.
This is last sentence.
s3 = "This is a sentence. This is second U.K. sentence. This is last sentence."
doc3 = nlp(s3)
for sent in doc3.sents:
    print(sent.text)
This is a sentence.
This is second U.K. sentence.
This is last sentence.
doc2 = nlp(s2)
for sent in doc2.sents:
    print(sent.text)
This is a sentence; This is second sentence; This is last sentence.
s2
'This is a sentence; This is second sentence; This is last sentence.'
How can the string 'This is a sentence; This is second sentence; This is last sentence.' be split into multiple sentences?
Reference: https://johnnn.tech/q/using-pos-and-punct-tokens-in-custom-sentence-boundaries-in-spacy/
from spacy.language import Language

custom_EOS = ['.', ',', '!', ';']    # characters treated as end of sentence
custom_conj = ['then', 'so']         # conjunctions that should start a new sentence

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

# Note: this helper is defined here as in the reference but is not added to the pipeline below.
def set_sentence_breaks(doc):
    for token in doc:
        if token.pos_ == "SCONJ":  # SCONJ: subordinating conjunction
            doc[token.i].is_sent_start = True
    return doc
def main():
    text = "In the add user use case, we need to consider speed and reliability, so use of a relational DB would be better than using SQLite. Though it may take extra effort to convert @Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Sentences: ['In the add user use case,', 'we need to consider speed and reliability,', 'so use of a relational DB would be better than using SQLite.', 'Though it may take extra effort to convert @Bot']
- Applying the approach from the reference site splits the string into ['This is a sentence;', 'This is second sentence;', 'This is last sentence.']:
def main():
    text = s2
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Sentences: ['This is a sentence;', 'This is second sentence;', 'This is last sentence.']
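As an alternative to a custom component, spaCy also ships a rule-based sentencizer whose punct_chars setting controls which characters end a sentence. A minimal sketch, assuming that splitting on '.' and ';' is the desired behavior; the parser is excluded so that only the sentencizer sets the boundaries:
import spacy

s2 = "This is a sentence; This is second sentence; This is last sentence."

# Exclude the dependency parser so the rule-based sentencizer alone
# decides where sentences start.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("sentencizer", config={"punct_chars": [".", ";", "!", "?"]})

doc = nlp(s2)
print([sent.text for sent in doc.sents])
This should split s2 at the semicolons as well, matching the result produced by the custom boundary component above.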