Overview of the Data Cleaning, Preprocessing, and Analysis Workflow
The figure shows a spatial visualization of the word vectors in two-dimensional space. Each point represents a word vector, and the distribution of the points reflects relative semantic similarity: words whose points lie close together are likely to have similar meanings or contexts, while words whose points lie far apart are more dissimilar.
This visualization illustrates how the word vectors, derived from high-dimensional embeddings, capture nuanced relationships between words based on their usage in the review texts.
Following this visualization, we delve into a comprehensive overview of the data cleaning, preprocessing, and analysis workflow for sentiment analysis tasks. This process ensures that the dataset is of high quality and suitable for training robust models. The steps include data filtering, text preprocessing, sentiment annotation, word embedding representation, and dataset splitting, all of which are crucial for building an effective sentiment analysis model.
1 Data Source
This study collected extensive customer review data from the well-known restaurant chain Jijihong Hotpot (季季红).
2 Data Filtering
Ensuring dataset quality before conducting the research is crucial for obtaining accurate and reliable results. This study uses OpenRefine for efficient data cleaning and filtering to keep the dataset consistent. The detailed filtering steps are as follows:
2.1 Selecting Relevant Fields
The first step in data filtering is identifying and extracting fields directly related to sentiment analysis tasks. In the customer reviews dataset of Jijihong Hotpot, the two key fields are:
- Review Details: Contains detailed descriptions and sentiment expressions of customers' dining experiences.
- Review Tags: Indicates overall customer satisfaction with the service or food.
These two fields serve as the main inputs and labels for model training.
2.2 Removing Duplicates and Missing Values
Ensuring data completeness is crucial for analysis quality, so records with missing or incomplete review details were removed. In the review details field, 893 duplicate records were also identified and removed, leaving 10,033 valid records.
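A minimal pandas sketch of this step (the DataFrame df and the column name 'text' are assumptions for illustration):
df = df.dropna(subset=['text'])           # drop records with missing review details
df = df.drop_duplicates(subset=['text'])  # drop duplicate review texts
print(len(df), 'valid records remain')    # 10,033 records remain after this filtering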
2.3 Removing Outliers
Texts that are unusually short or unclear, such as single emojis or words, are removed to ensure substantive content for analysis. A minimum text length standard is set to enhance model training effectiveness.
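As a sketch, a minimum-length filter might look like this (the threshold of 5 characters is illustrative; the study does not report the exact standard):
min_length = 5                                # illustrative minimum character count
df = df[df['text'].str.len() >= min_length]  # keep only reviews with substantive content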
After these data filtering steps, a cleaner and more standardized dataset is obtained, laying a solid foundation for subsequent preprocessing, feature extraction, and model training.
3 Text Preprocessing
The review texts undergo standardized text preprocessing, including removing irrelevant characters, tokenization, stop word filtering, part-of-speech tagging, and stemming.
3.1 Environment Setup
Use a Python environment and install the necessary libraries, such as jieba, HanLP, pandas, and numpy.
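For example, the libraries can be installed from the command line (package names as published on PyPI):
pip install jieba hanlp pandas numpy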
3.2 Text Encoding
Ensure all text data uses a uniform encoding format (UTF-8) to avoid encoding errors during text processing.
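For instance, the raw reviews can be loaded with the encoding declared explicitly (a minimal sketch; the file name is a placeholder):
import pandas as pd
df = pd.read_csv('raw_reviews.csv', encoding='utf-8')  # read the raw reviews as UTF-8; file name is illustrative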
3.3 Removing Irrelevant Characters
Use regular expressions to remove HTML tags, URLs, numbers, and special characters, reducing noise during model training.
df['text'] = df['text'].str.replace(r'<[^>]+>', '', regex=True) # Remove HTML tags
df['text'] = df['text'].str.replace(r'http\S+', '', regex=True) # Remove URLs
df['text'] = df['text'].str.replace(r'\d+', '', regex=True) # Remove numbers
3.4 Privacy Information Removal
Identify and replace or remove possible personal information such as phone numbers and email addresses.
df['text'] = df['text'].str.replace(r'\b\d{11}\b', ' ', regex=True) # Replace 11-digit phone numbers with blank
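Email addresses mentioned in the text can be removed in the same way, for example:
df['text'] = df['text'].str.replace(r'\S+@\S+\.\S+', ' ', regex=True)  # replace email addresses with blank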
3.5 Text Tokenization
Add proper nouns and domain-specific terms to the jieba dictionary before tokenization to ensure correct segmentation of brand names and other special terms.
import jieba
jieba.add_word('季季红', freq=100) # Add brand name with frequency adjustment
df['text'] = df['text'].apply(lambda x: ' '.join(jieba.cut(x))) # Tokenize text
3.6 Stop Word Removal
Use the Harbin Institute of Technology stop words list to filter out irrelevant words in text analysis.
stopwords = set(open('stopwords_hit.txt', 'r', encoding='utf-8').read().split())
df['text'] = df['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stopwords))
3.7 Synonym Processing and Standardization
Create a synonym dictionary to unify different expressions with the same meaning, enhancing consistency in the dataset.
synonym_dict = {'满意': '喜欢', '好吃': '美味'}
df['text'] = df['text'].apply(lambda x: ' '.join(synonym_dict.get(word, word) for word in x.split()))
3.8 Text Length Standardization
Ensure text length uniformity according to model requirements: excessively long texts are trimmed to a standard length of 512 tokens here, while padding of shorter texts is handled later by the BERT tokenizer (padding=True in 5.3).
max_length = 512
df['text_trimmed'] = df['text'].apply(lambda x: ' '.join(x.split()[:max_length]))
4 Sentiment Annotation
After basic text preprocessing, sentiment annotation assigns a sentiment label (positive, negative, or neutral) to each text.
4.1 Automated Sentiment Analysis
Perform initial sentiment tagging with automated tools, followed by manual correction for domain-specific texts.
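A minimal sketch of the automated pass, assuming the SnowNLP library is used for initial tagging (the study does not name the tool, and the 0.4/0.6 thresholds are illustrative):
from snownlp import SnowNLP

def auto_label(text):
    score = SnowNLP(text).sentiments   # probability that the text is positive, in [0, 1]
    if score >= 0.6:
        return 'positive'
    if score <= 0.4:
        return 'negative'
    return 'neutral'

df['sentiment'] = df['text'].apply(auto_label)  # initial labels, to be corrected manually afterwards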
5 Word Embedding Representation
Word embedding converts textual data into numerical forms understandable by machines. This study uses the BERT model for deep semantic understanding of text data.
5.1 Environment Preparation
Ensure installation of necessary libraries like TensorFlow and transformers.
pip install tensorflow transformers   # run from the command line, not inside Python

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
5.2 Loading Pretrained Model
Load the 'bert-base-uncased' model and tokenizer from the transformers library. (Note that 'bert-base-uncased' is an English checkpoint; for Chinese review texts, the corresponding pretrained checkpoint is 'bert-base-chinese'.)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
5.3 Text Preparation and Tokenization
Import preprocessed text data and use the BERT tokenizer to convert texts into token sequences.
import pandas as pd

df = pd.read_csv("data_syn.csv", encoding='utf-8')  # preprocessed reviews after synonym standardization
texts = df['text'].astype(str).tolist()
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
5.4 Generating Word Embeddings
Use the BERT model to generate word embeddings, capturing the semantic information of each word in its context.
outputs = model(encoded_input)
last_hidden_states = outputs.last_hidden_state
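For review-level sentiment analysis, one common choice (an assumption here, not specified by the study) is to take the hidden state of the [CLS] token as the representation of each review:
cls_embeddings = last_hidden_states[:, 0, :]  # shape (num_reviews, 768): one [CLS] vector per review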
5.5 Evaluation and Application
Store the generated embeddings for sentiment analysis and other tasks, evaluating their quality through model performance.
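For example, the embeddings can be saved with NumPy so that downstream experiments do not need to rerun BERT (the file name is a placeholder):
import numpy as np
np.save('bert_embeddings.npy', cls_embeddings.numpy())  # persist the review embeddings to disk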
6 Data Loader
Due to the large data volume, a data loader is designed to improve model training efficiency and effectiveness. TensorFlow's data pipeline API is used to create a data loader that batches, shuffles, and prefetches data, ensuring efficient data transmission during training.
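A minimal sketch of such a pipeline with the tf.data API (the batch size, buffer size, and integer label encoding are illustrative assumptions):
labels = df['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2}).values  # encode labels as integers
dataset = tf.data.Dataset.from_tensor_slices((dict(encoded_input), labels))
dataset = dataset.shuffle(buffer_size=10000).batch(32).prefetch(tf.data.AUTOTUNE)  # shuffle, batch, prefetch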
7 Dataset Splitting
Proper dataset splitting is crucial for model training: the model learns from diverse examples, while its performance is validated on data it has not seen.
7.1 Training Set
Consists of approximately 70% of the data for model learning.
7.2 Validation Set
Occupies about 15% of the data for performance evaluation and hyperparameter tuning.
7.3 Test Set
The remaining 15% of the data is used for unbiased performance evaluation after training.
The dataset is split before the experiments, considering class balance to avoid model bias.
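A sketch of such a split with scikit-learn, stratified on the sentiment label to preserve class balance (scikit-learn, the 'sentiment' column name, and the random seed are assumptions; the study does not name its splitting tool):
from sklearn.model_selection import train_test_split

# 70% training; the remaining 30% is split evenly into validation and test sets
train_df, temp_df = train_test_split(df, test_size=0.30, stratify=df['sentiment'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df['sentiment'], random_state=42)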
This overview summarizes the comprehensive process of data cleaning, preprocessing, and analysis, ensuring the preparation of a high-quality dataset for sentiment analysis.
A pie chart illustrates the dataset split: 70% training set, 15% validation set, and 15% test set.