跳转至

Handle Numeric features#

This feature is a experimental feature

https://github.com/BrikerMan/Kashgari/issues/90

Some time, except the text, we have some additional features like text formatting (italic, bold, centered), position in text and more. Kashgari provides NumericFeaturesEmbedding and StackedEmbedding for this kine data. Here is the details.

If you have a dataset like this.

token=NLP       start_of_p=True    bold=True     center=True     B-Category
token=Projects  start_of_p=False   bold=True     center=True     I-Category
token=Project   start_of_p=True    bold=True     center=False    B-Project-name
token=Name      start_of_p=False   bold=True     center=False    I-Project-name
token=:         start_of_p=False   bold=False    center=False    I-Project-name

First, numerize your additional features. Convert your data to this. Remember to leave 0 for padding.

text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-Project-name', 'I-Project-name', 'I-Project-name']

Then you have four input sequence and one output sequence. Prepare your embedding layers.

import kashgari
from kashgari.embeddings import NumericFeaturesEmbedding, BareEmbedding, StackedEmbedding

import logging
logging.basicConfig(level='DEBUG')

text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-ProjectName', 'I-ProjectName', 'I-ProjectName']

text_list = [text] * 100
start_of_p_list = [start_of_p] * 100
bold_list = [bold] * 100
center_list = [center] * 100
label_list = [label] * 100


SEQUENCE_LEN = 100

# You can use WordEmbedding or BERTEmbedding for your text embedding
text_embedding = BareEmbedding(task=kashgari.LABELING, sequence_length=SEQUENCE_LEN)
start_of_p_embedding = NumericFeaturesEmbedding(feature_count=2,
                                                feature_name='start_of_p',
                                                sequence_length=SEQUENCE_LEN)

bold_embedding = NumericFeaturesEmbedding(feature_count=2,
                                                feature_name='bold',
                                                sequence_length=SEQUENCE_LEN)

center_embedding = NumericFeaturesEmbedding(feature_count=2,
                                                feature_name='center',
                                                sequence_length=SEQUENCE_LEN)

# first one must be the text embedding
stack_embedding = StackedEmbedding([
    text_embedding,
    start_of_p_embedding,
    bold_embedding,
    center_embedding
])

x = (text_list, start_of_p_list, bold_list, center_list)
y = label_list
stack_embedding.analyze_corpus(x, y)

# Now we can embed with this stacked embedding layer
print(stack_embedding.embed(x))

Once embedding layer prepared, you could use all of the classification and labeling models.

# We can build any labeling model with this embedding

from kashgari.tasks.labeling import BLSTMModel
model = BLSTMModel(embedding=stack_embedding)
model.fit(x, y)

print(model.predict(x))
print(model.predict_entities(x))

This is the struct of this model.