Transfer Learning For TensorFlow Text Classification Models In Amazon SageMaker

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

This post is the third in a series on the new built-in algorithms in SageMaker. In the first post, we showed how SageMaker provides a built-in algorithm for image classification. In the second post, we showed how SageMaker provides a built-in algorithm for object detection. Today, we announce that SageMaker provides a new built-in algorithm for text classification using TensorFlow. This supervised learning algorithm supports transfer learning for many pre-trained models available in TensorFlow Hub. It takes a piece of text as input and outputs the probability for each of the class labels. You can fine-tune these pre-trained models using transfer learning even when a large corpus of text isn't available. It's available through the SageMaker built-in algorithms, as well as through the SageMaker JumpStart UI in Amazon SageMaker Studio. For more information, see Text Classification and the example notebook Introduction to JumpStart – Text Classification.

Text Classification with TensorFlow in SageMaker provides transfer learning on many pre-trained models available in TensorFlow Hub. A classification layer, sized according to the number of class labels in the training data, is attached to the pre-trained TensorFlow Hub model. The classification layer consists of a dropout layer followed by a dense (fully connected) layer with an L2 (2-norm) regularizer, and it is initialized with random weights. Model training exposes hyperparameters for the dropout rate of the dropout layer and the L2 regularization factor of the dense layer. Then, either the whole network, including the pre-trained model, or only the top classification layer can be fine-tuned on the new training data. Because training starts from pre-trained weights, this transfer learning mode works even with a smaller dataset.
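To make the shape of that classification head concrete, the following is a minimal pure-Python sketch of its forward pass: dropout, then a dense layer with one output per class, then a softmax, plus the L2 term that would be added to the training loss. It is an illustration only, not the code the algorithm runs. The function names, the four-element stand-in embedding, the initialization scale, and the rate and factor values are all assumptions made for this example.

```python
import math
import random

def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def dense(vector, weights, bias):
    """Fully connected layer: one output logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into class probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_penalty(weights, factor):
    """L2 (2-norm) regularization term for the dense layer's weights."""
    return factor * sum(w * w for row in weights for w in row)

def init_head(embedding_dim, num_classes, rng):
    """Randomly initialize the dense layer; its output size is the number of classes."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(embedding_dim)] for _ in range(num_classes)]
    bias = [0.0] * num_classes
    return weights, bias

rng = random.Random(0)
embedding = [0.5, -1.0, 0.25, 2.0]  # stand-in for the pre-trained model's text embedding
weights, bias = init_head(embedding_dim=4, num_classes=3, rng=rng)

# Dropout is disabled at inference time, so the embedding passes through unchanged.
probs = softmax(dense(dropout(embedding, rate=0.2, rng=rng, training=False), weights, bias))
penalty = l2_penalty(weights, factor=1e-4)
print(probs, penalty)
```

Because the head starts from random weights, the probabilities are close to uniform until fine-tuning updates the dense layer (and, optionally, the pre-trained model beneath it).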

How to use the new TensorFlow text classification algorithm

This section describes how to use the TensorFlow text classification algorithm with the SageMaker Python SDK. For information on how to use it from the Studio UI, see SageMaker JumpStart.

The algorithm supports transfer learning for the pre-trained models listed in TensorFlow models. Each model is identified by a unique model_id. The following code shows how to fine-tune the BERT base model identified by model_id tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2 on a custom training dataset. For each model_id, to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you must fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all of the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters, and it is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from TensorFlow and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, so that the training job runs in network isolation. See the following code:

import sagemaker
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

# Session, role, and Region used by the snippets in this post
# (assumed setup; adjust to your own environment)
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="training", instance_type=training_instance_type,
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")
# Retrieve the pre-trained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tensorflow-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

With these model-specific training artifacts, you can construct an object of the Estimator class:

# Create SageMaker Estimator instance
# Note: `hyperparameters` is the dictionary built with
# hyperparameters.retrieve_default in the next snippet of this section.
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,  # maximum training time, in seconds
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters, which are listed in Hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. Note that the default values of some of the hyperparameters differ between models. For large models, the default batch size is smaller and the train_only_top_layer hyperparameter is set to True. The train_only_top_layer hyperparameter defines which model parameters change during the fine-tuning process. If train_only_top_layer is True, then the parameters of the classification layers change and the rest of the parameters remain constant during the fine-tuning process. On the other hand, if train_only_top_layer is False, then all of the parameters of the model are fine-tuned. See the following code:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"
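If you override several values at once, a small helper can keep the defaults intact and catch misspelled names before the training job starts. The following is a sketch under stated assumptions, not part of the SageMaker SDK: override_hyperparameters is a hypothetical helper, and the example_defaults dictionary is a made-up stand-in for whatever retrieve_default returns for your model. The helper converts every value to a string to match the snippet above, which sets "epochs" as a string.

```python
def override_hyperparameters(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied, all values as strings."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # Assumption: an unknown name is more likely a typo than a new setting.
        raise KeyError(f"Unknown hyperparameters: {unknown}")
    merged = dict(defaults)
    merged.update(overrides)
    return {name: str(value) for name, value in merged.items()}

# Hypothetical defaults; the real names and values depend on the chosen model_id.
example_defaults = {"epochs": "3", "batch_size": "32", "train_only_top_layer": "True"}

tuned = override_hyperparameters(
    example_defaults,
    {"epochs": 5, "train_only_top_layer": False},
)
print(tuned)
```

Rejecting unknown names is a design choice made for this sketch; drop that check if you need to pass additional settings through.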

We provide SST2 as a default dataset for fine-tuning the models. The dataset contains positive and negative movie reviews. It has been downloaded from TensorFlow under the Apache 2.0 License. The following code provides the default training dataset hosted in S3 buckets.

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

Finally, to launch the SageMaker training job for fine-tuning the model, call .fit on the object of the Estimator class, while passing the Amazon S3 location of the training dataset:

# Launch a SageMaker Training job by passing s3 path of the training data
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the new SageMaker TensorFlow text classification algorithm for transfer learning on a custom dataset, deploy the fine-tuned model, run inference on the deployed model, and deploy the pre-trained model as is without first fine-tuning on a custom dataset, see the following example notebook: Introduction to JumpStart – Text Classification.

Input/output interface for the TensorFlow text classification algorithm

Each of the pre-trained models listed in TensorFlow Models can be fine-tuned to any given dataset made up of text sentences with any number of classes. The pre-trained model attaches a classification layer to the Text Embedding model and initializes the layer parameters to random values. The output dimension of the classification layer is determined based on the number of classes detected in the input data. The objective is to minimize classification errors on the input data. The model returned by fine-tuning can be further deployed for inference.

The following instructions describe how the training data should be formatted for input to the model:

Input – A directory containing a data.csv file. Each row of the first column should have integer class labels between 0 and the number of classes. Each row of the second column should have the corresponding text data.
Output – A fine-tuned model that can be deployed for inference or further trained using incremental training. A file mapping class indexes to class labels is saved along with the models.

The following is an example of an input CSV file. Note that the file should not have any header. The file should be hosted in an S3 bucket with a path similar to the following: s3://bucket_name/input_directory/. Note that the trailing / is required.

|0 |hide new secretions from the parental units|
|0 |contains no wit , only labored gags|
|1 |that loves its characters and communicates something rather beautiful about human nature|
|...|...|
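The input layout described above can be produced with Python's standard csv module. The following is a minimal sketch, assuming your examples are already available as (label, text) pairs; write_training_csv is a hypothetical helper name, and uploading the resulting directory to Amazon S3 is left out. The sample rows reuse the sentences from the example file.

```python
import csv
import tempfile
from pathlib import Path

def write_training_csv(rows, directory):
    """Write (label, text) pairs to <directory>/data.csv with no header row."""
    for label, _ in rows:
        if not isinstance(label, int) or label < 0:
            raise ValueError(f"class labels must be non-negative integers, got {label!r}")
    target = Path(directory) / "data.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for label, text in rows:
            # First column: integer class label. Second column: the text.
            writer.writerow([label, text])
    return target

sample_rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = write_training_csv(sample_rows, tmp_dir)
    print(csv_path.read_text(encoding="utf-8"))
```

The csv writer quotes any text field that contains a comma, so sentences such as the second sample row stay in a single column.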

Inference with the TensorFlow text classification algorithm

The generated models can be hosted for inference and support text as the application/x-text content type. The output contains the probability values, class labels for all of the classes, and the predicted label corresponding to the class index with the highest probability encoded in the JSON format. The model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

accept: application/json;verbose
{"probabilities": [prob_0, prob_1, prob_2, ...],
 "labels": [label_0, label_1, label_2, ...],
 "predicted_label": predicted_label}

If accept is set to application/json, then the model only outputs probabilities. For more details on training and inference, see the sample notebook Introduction to JumpStart – Text Classification.
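On the client side, a verbose response can be decoded with the standard json module. The following sketch assumes the response body has already been read from the endpoint as a string and follows the three-field layout shown above. parse_verbose_response is a hypothetical helper, and the example body contains made-up numbers.

```python
import json

def parse_verbose_response(body):
    """Decode a verbose JSON response and return (predicted label, its probability)."""
    payload = json.loads(body)
    probabilities = payload["probabilities"]
    labels = payload["labels"]
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    # Index of the class with the highest probability.
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Fall back to the highest-probability label if predicted_label is absent.
    predicted = payload.get("predicted_label", labels[best_index])
    return predicted, probabilities[best_index]

# Hypothetical response body for a two-class model.
example_body = json.dumps(
    {"probabilities": [0.12, 0.88], "labels": [0, 1], "predicted_label": 1}
)
label, confidence = parse_verbose_response(example_body)
print(label, confidence)
```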

Use SageMaker built-in algorithms through the JumpStart UI

You can also use SageMaker TensorFlow text classification and any of the other built-in algorithms with a few clicks via the JumpStart UI. JumpStart is a SageMaker feature that lets you train and deploy built-in algorithms and pre-trained models from various ML frameworks and model hubs through a graphical interface. Furthermore, it lets you deploy fully-fledged ML solutions that string together ML models and various other AWS services to solve a targeted use case.

The following two videos show how you can replicate the same fine-tuning and deployment process we just went through with a few clicks via the JumpStart UI.

Fine-tune the pre-trained model

Here is the process to fine-tune the same pre-trained text classification model.

Deploy the fine-tuned model

After the model training is finished, you can directly deploy the model to a persistent, real-time endpoint with one click.

Conclusion

In this post, we announced the launch of the SageMaker TensorFlow text classification built-in algorithm. We provided example code for how to do transfer learning on a custom dataset using a pre-trained model from TensorFlow hub using this algorithm.

For more information, check out the documentation and the example notebook Introduction to JumpStart – Text Classification.

About the authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.