[NLP-tensorflow] Part 3 Training a model to recognize sentiment in text
https://www.youtube.com/watch?v=Y_hzMnRXjhI&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S&index=3
Classify whether or not a sentence is sarcastic with a classifier. The dataset fields:
is_sarcastic: 1-sarcastic, 0-otherwise
headline: the headline of the news article
article_link: link to the original news article. Useful in collecting supplementary data
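Putting those fields together, each record in sarcasm.json is an object like the following (the values here are invented, for illustration only):

{
  "is_sarcastic": 1,
  "headline": "example headline text",
  "article_link": "https://www.example.com/article"
}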
Since the file is in JSON format, it has to be converted into Python lists:
import json

# Open the file and close it automatically when done
with open("sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
Some of the words in the word index:

Output of the code above:
However, the code above does not separate the data into training and test sets, so we split the data first and tokenize each set separately (see the sketch below):
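The variables referenced below (vocab_size, oov_tok, max_length, and so on) come from a setup step like this; a minimal sketch with hyperparameter values along the lines of those used in the video (tune them as needed):

# Hyperparameters and train/test split (example values; adjust as needed)
vocab_size = 10000       # keep only the 10,000 most frequent words
embedding_dim = 16       # size of each word's embedding vector
max_length = 100         # pad/truncate every headline to this length
trunc_type = 'post'      # cut off the end of over-long sequences
padding_type = 'post'    # add zeros at the end of short sequences
oov_tok = "<OOV>"        # token for out-of-vocabulary words
training_size = 20000    # first 20,000 headlines for training, rest for testing

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

With the data split, the tokenizer below is fit only on the training sentences, so the test set stays unseen and any test word missing from the word index falls back to the <OOV> token.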
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length,
                                padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding=padding_type, truncating=trunc_type)
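One practical detail before training: model.fit in TensorFlow 2 works with NumPy arrays, and the label lists built earlier are plain Python lists, so a conversion step like this is usually needed (a small sketch):

import numpy as np

# Convert lists to arrays so Keras can consume them directly
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)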
So how does the model learn to tell sarcastic sentences from non-sarcastic ones?

If we classify words as good or bad, we can plot them on a two-dimensional graph, and the direction of each word's vector tells us which side it leans toward; a sentence can then be judged by the combined direction of its words' vectors.

This process of mapping words to such vectors is called embedding. In Keras it is handled by an Embedding layer, as in the model below.
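A toy illustration of this idea (the numbers are invented, not from the video): give each word a 2D vector whose first axis runs roughly from bad to good, average the vectors of a sentence's words, and read off the direction of the result. This is essentially what the Embedding + GlobalAveragePooling1D combination in the model below learns to do.

import numpy as np

# Hypothetical 2D word embeddings: x-axis runs roughly from "bad" to "good"
embedding = {
    "good": np.array([1.0, 0.3]),
    "bad": np.array([-1.0, 0.2]),
    "really": np.array([0.1, 0.8]),
}

def sentence_vector(words):
    # Average the word vectors, like GlobalAveragePooling1D over an Embedding
    return np.mean([embedding[w] for w in words], axis=0)

print(sentence_vector(["really", "good"]))  # x > 0: leans positive
print(sentence_vector(["really", "bad"]))   # x < 0: leans negative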
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

num_epochs = 30
history = model.fit(training_padded, training_labels,
                    epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels),
                    verbose=2)
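The history object returned by fit records accuracy and loss per epoch; a quick way to compare the training and validation curves (a sketch along the lines of the companion notebook, assuming matplotlib is available):

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
    # Plot a training metric and its validation counterpart over the epochs
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel('epochs')
    plt.legend([metric, 'val_' + metric])
    plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

Training accuracy rising while validation accuracy falls is the usual sign of overfitting to watch for.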
Applying the model to new sentences:
sentences = [
    "granny starting to fear spiders in the garden might be real",
    "the weather today is bright and sunny"
]
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length,
                       padding=padding_type, truncating=trunc_type)
print(model.predict(padded))
The first sentence gets a probability close to 1 (very likely sarcastic); the second gets a probability close to 0.
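Since the output layer is a single sigmoid unit, model.predict returns one probability per sentence; a simple cutoff turns those into labels (a sketch, assuming 0.5 as the threshold):

predictions = model.predict(padded)
for sentence, p in zip(sentences, predictions):
    # p is a length-1 array holding the sarcasm probability
    label = "sarcastic" if p[0] > 0.5 else "not sarcastic"
    print(f"{p[0]:.3f}  {label}  -  {sentence}")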