Build an email spam detector using Amazon SageMaker | Amazon Web Services

Spam emails, also known as junk mail, are sent to a large number of users at once and often contain scams, phishing content, or cryptic messages. Spam emails are sometimes sent manually by a human, but most often they are sent by a bot. Examples of spam emails include fake ads, chain emails, and impersonation attempts. A particularly well-disguised spam email may still land in your inbox, and it can be dangerous if you click a link in it. It’s important to take extra precautions to protect your device and sensitive information.

As technology improves, detecting spam emails becomes more challenging because spam keeps changing. Spam is quite different from other types of security threats. It may at first appear to be an annoying message rather than a threat, but it has an immediate effect. Spammers also often adopt new techniques. Organizations that provide email services want to minimize spam as much as possible to avoid any damage to their end customers.

In this post, we show how straightforward it is to build an email spam detector using Amazon SageMaker. The built-in BlazingText algorithm offers optimized implementations of Word2vec and text classification algorithms. Word2vec is useful for various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is essential for applications like web searches, information retrieval, ranking, and document classification.

Solution overview

This post demonstrates how you can set up an email spam detector and filter spam emails using SageMaker. Let’s first look at how a spam detector typically works.

Emails are sent through a spam detector. An email is sent to the spam folder if the spam detector detects it as spam. Otherwise, it’s sent to the customer’s inbox.
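
The routing described above can be sketched in a few lines of Python. This is only an illustration of the flow, not the model built later in this post: route_emails and keyword_detector are hypothetical names, and the keyword rule is a stand-in for a trained classifier.

```python
from typing import Callable, Dict, List


def route_emails(emails: List[str], is_spam: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Place each email in the spam folder or the inbox based on the detector's verdict."""
    folders: Dict[str, List[str]] = {"inbox": [], "spam": []}
    for email in emails:
        folders["spam" if is_spam(email) else "inbox"].append(email)
    return folders


def keyword_detector(email: str) -> bool:
    """Hypothetical stand-in detector: flags any message that contains the word 'win'."""
    return "win" in email.lower().split()
```

Any callable that takes an email and returns True or False can be passed as is_spam, so the trained model can later replace the keyword rule without changing the routing.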

We walk you through the following steps to set up our spam detector model:

  1. Download the sample dataset from the GitHub repo.
  2. Load the data in an Amazon SageMaker Studio notebook.
  3. Prepare the data for the model.
  4. Train, deploy, and test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

  1. Set up an AWS account.
  2. Set up a SageMaker domain.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket. For instructions, see Create your first S3 bucket.

Download the dataset

Download email_dataset.csv from GitHub and upload the file to the S3 bucket.

The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
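
If your data is spread over several text files, the concatenation step could look like the following minimal sketch. The function name concatenate_text_files is hypothetical, and collapsing runs of whitespace into single spaces is an assumption about what "space-separated tokens" requires. The sketch assumes each input line already holds one sentence.

```python
from pathlib import Path
from typing import Iterable


def concatenate_text_files(sources: Iterable[Path], destination: Path) -> int:
    """Write every non-empty line of the source files into one file and return the line count."""
    written = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as src:
                for line in src:
                    # Collapse tabs and repeated spaces so tokens are separated by one space.
                    sentence = " ".join(line.split())
                    if sentence:
                        out.write(sentence + "\n")
                        written += 1
    return written
```

The resulting single file is what you would then upload to the channel's S3 location.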

Load the data in SageMaker Studio

To load the data, complete the following steps:

  1. Download the spam_detector.ipynb file from GitHub and upload it to SageMaker Studio.
  2. In Studio, open the spam_detector.ipynb notebook.
  3. If you are prompted to choose a kernel, choose the Python 3 (Data Science 3.0) kernel and choose Select. If not, verify that the right kernel has been automatically selected.

  4. Import the required Python libraries and set the role and the S3 bucket. Specify the S3 bucket and prefix where you uploaded email_dataset.csv.

  5. Run the data load step in the notebook.

  6. Check whether the dataset is balanced based on the Category labels.

We can see our dataset is balanced.
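
The load and balance check can be approximated with the Python standard library, as in the sketch below. It is not the notebook's own code. The Category and Message column names come from this post; the helper names and the four sample rows are hypothetical, and the sample only stands in for email_dataset.csv.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO


def load_emails(handle: TextIO) -> List[Dict[str, str]]:
    """Read CSV rows that have Category and Message columns into a list of dicts."""
    return list(csv.DictReader(handle))


def label_counts(rows: List[Dict[str, str]]) -> Counter:
    """Count how many rows carry each Category value."""
    return Counter(row["Category"] for row in rows)


# Hypothetical sample standing in for email_dataset.csv.
SAMPLE_CSV = """Category,Message
ham,See you in the office on Friday.
spam,Best summer deal here
spam,"Click on below link, provide your details and win this award"
ham,Lunch at noon?
"""

rows = load_emails(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
```

For a real file, pass open(path, newline="", encoding="utf-8") instead of the in-memory sample. Roughly equal counts per label indicate a balanced dataset.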

Prepare the data

The BlazingText algorithm expects the data in the following format:

__label__<label> "<features>"

The following is an example:

__label__0 "This is HAM"
__label__1 "This is SPAM"

For more information, see Training and Validation Data Format for the BlazingText Algorithm.
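
A quick sanity check on prepared lines can catch formatting mistakes before training. The following sketch only checks the shape described above (a __label__ prefix, a space, then text). LINE_PATTERN and is_valid_line are hypothetical names, and this check is an illustration, not the validation BlazingText itself performs.

```python
import re

# A "__label__<label>" prefix, at least one space, then at least one non-space character.
LINE_PATTERN = re.compile(r"^__label__\S+ +\S.*$")


def is_valid_line(line: str) -> bool:
    """Return True if the line has a label prefix followed by some text."""
    return LINE_PATTERN.match(line.rstrip("\n")) is not None
```

For example, is_valid_line('__label__1 "This is SPAM"') is accepted, while a line without the prefix or without any text after the label is rejected.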

Now run the data preparation step in the notebook.

  1. First, you need to convert the Category column to an integer. The following cell replaces the SPAM value with 1 and the HAM value with 0.

  2. The next cell adds the prefix __label__ to each Category value and tokenizes the Message column.

  3. The next step is to split the dataset into train and validation datasets and upload the files to the S3 bucket.
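
The three preparation steps can be sketched as follows. This is a simplified stand-in for the notebook cells, under several assumptions: the regex tokenizer replaces whatever tokenizer the notebook uses, the 80/20 split ratio and the seed are arbitrary, the quotation marks shown in the format example are left out, and all function names are hypothetical. Uploading the two resulting files to S3 is not shown.

```python
import random
import re
from typing import List, Sequence, Tuple

# Step 1: SPAM becomes 1 and HAM becomes 0, as described in the post.
LABELS = {"spam": 1, "ham": 0}


def tokenize(message: str) -> List[str]:
    """Lowercase the message and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", message.lower())


def to_blazingtext_line(category: str, message: str) -> str:
    """Step 2: prefix the integer label with __label__ and append the tokens."""
    label = LABELS[category.strip().lower()]
    return f"__label__{label} {' '.join(tokenize(message))}"


def train_validation_split(
    lines: Sequence[str], validation_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Step 3: shuffle reproducibly, then hold out a fraction for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * validation_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Writing each returned list to a text file, one line per example, gives the train and validation files for the two data channels.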

Train the model

To train the model, complete the following steps in the notebook:

  1. Set up the BlazingText estimator and create an estimator instance passing the container image.

  2. Set the learning mode hyperparameter to supervised.

BlazingText has both unsupervised and supervised learning modes. Our use case is text classification, which is supervised learning.

  3. Create the train and validation data channels.

  4. Start training the model.

  5. Get the accuracy on the train and validation datasets.
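
The training steps above might look like the following sketch with the SageMaker Python SDK (version 2). It is not the notebook's exact code. The instance type, epoch count, S3 layout (train, validation, and output folders under one prefix), and function names are all assumptions. The SDK is imported inside the function so the hyperparameter helper can be used without it.

```python
from typing import Dict


def blazingtext_hyperparameters(epochs: int = 10) -> Dict[str, object]:
    """Hyperparameters for text classification; mode="supervised" is the key setting."""
    # The epochs and early_stopping values are illustrative, not tuned.
    return {"mode": "supervised", "epochs": epochs, "early_stopping": True}


def train_blazingtext(role: str, bucket: str, prefix: str, instance_type: str = "ml.c5.2xlarge"):
    """Create the estimator and data channels, then start training (requires AWS access)."""
    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    # Look up the built-in BlazingText container image for the current Region.
    image_uri = image_uris.retrieve("blazingtext", session.boto_region_name, version="1")
    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        output_path=f"s3://{bucket}/{prefix}/output",
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(**blazingtext_hyperparameters())
    channels = {
        "train": TrainingInput(f"s3://{bucket}/{prefix}/train", content_type="text/plain"),
        "validation": TrainingInput(f"s3://{bucket}/{prefix}/validation", content_type="text/plain"),
    }
    # With logs=True, the training log is streamed to the notebook,
    # including the accuracy that BlazingText reports for both channels.
    estimator.fit(inputs=channels, logs=True)
    return estimator
```

In a Studio notebook, the role argument would typically come from sagemaker.get_execution_role().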

Deploy the model

In this step, we deploy the trained model as an endpoint. Choose your preferred instance type.
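
Deployment could be wrapped in a small helper like the one below. This is a sketch under assumptions: deploy_classifier is a hypothetical name, the default instance type is only an example, and a JSON serializer is used because the endpoint is called with a JSON payload in the next section.

```python
def deploy_classifier(estimator, instance_type: str = "ml.m5.xlarge", serializer=None):
    """Deploy a trained estimator to a real-time endpoint and return the predictor."""
    if serializer is None:
        # Imported lazily so that this helper can be defined without the SDK installed.
        from sagemaker.serializers import JSONSerializer

        serializer = JSONSerializer()
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        serializer=serializer,
    )
```

The endpoint incurs charges for as long as it runs, so a smaller instance type may be a reasonable choice for testing.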

Test the model

Let’s provide an example of three email messages that we want to get predictions for:

  • Click on below link, provide your details and win this award
  • Best summer deal here
  • See you in the office on Friday.

Tokenize the email message and specify the payload to use when calling the REST API.

Now we can predict the email classification for each email. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
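
Tokenizing, building the payload, and reading the predictions can be sketched as follows. The regex tokenizer is an assumed stand-in for the notebook's tokenizer and should match whatever was used for training. The function names are hypothetical. The code assumes the predictor was deployed with a JSON serializer and that the endpoint returns a JSON list with one label and prob entry per instance.

```python
import json
import re
from typing import Dict, List, Tuple


def tokenize(message: str) -> List[str]:
    """Lowercase the message and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", message.lower())


def build_payload(messages: List[str]) -> Dict[str, List[str]]:
    """Join each message's tokens with spaces and wrap them in an 'instances' list."""
    return {"instances": [" ".join(tokenize(message)) for message in messages]}


def parse_predictions(response) -> List[Tuple[str, float]]:
    """Turn the endpoint's JSON response into (label, probability) pairs."""
    results = json.loads(response)
    return [(item["label"][0].replace("__label__", ""), item["prob"][0]) for item in results]


def classify(predictor, messages: List[str]) -> List[Tuple[str, float]]:
    """Send the messages to the endpoint and return one prediction per message."""
    return parse_predictions(predictor.predict(build_payload(messages)))
```

With the label mapping used in this post, a returned label of 1 means spam and 0 means ham.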

Clean up

Finally, you can delete the endpoint to avoid unexpected costs.

Also, delete the data file from the S3 bucket.
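
Both cleanup actions can be combined in one helper, as in this sketch. The name clean_up is hypothetical. The helper assumes predictor is the object returned when the model was deployed and s3_client is a boto3 S3 client, for example boto3.client("s3").

```python
from typing import Iterable


def clean_up(predictor, s3_client, bucket: str, keys: Iterable[str]) -> None:
    """Delete the endpoint, then delete the listed objects from the S3 bucket."""
    # Removing the endpoint stops the hosting charges.
    predictor.delete_endpoint()
    for key in keys:
        s3_client.delete_object(Bucket=bucket, Key=key)
```

Pass the keys of every file you uploaded, such as email_dataset.csv and the train and validation files under your prefix.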

Conclusion

In this post, we walked you through the steps to create an email spam detector using the SageMaker BlazingText algorithm. With the BlazingText algorithm, you can scale to large datasets. BlazingText is used for textual analysis and text classification problems, and has both unsupervised and supervised learning modes. You can use the algorithm for use cases like customer sentiment analysis and text classification.

To learn more about the BlazingText algorithm, check out BlazingText algorithm.


About the author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
