Document clustering

This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors, and finally cluster the documents based on their numerical representation.
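The bag-of-words step above can be sketched in a few lines of plain Python. This is an illustrative sketch of the transformation, not the KNIME node's API; the function name and the toy documents are assumptions:

```python
from collections import Counter

def bag_of_words(docs):
    """Turn preprocessed documents (lists of terms) into term-frequency
    vectors over a shared, sorted vocabulary, as the document-vector
    step of the workflow does."""
    vocab = sorted({t for d in docs for t in d})
    vectors = [[Counter(d)[t] for t in vocab] for d in docs]
    return vocab, vectors

docs = [["romeo", "loves", "juliet"],
        ["juliet", "loves", "romeo", "romeo"]]
vocab, vecs = bag_of_words(docs)
# vocab → ['juliet', 'loves', 'romeo']
# vecs  → [[1, 1, 1], [1, 1, 2]]
```

The resulting numerical vectors are what a distance-based clustering algorithm (e.g. k-means) consumes in the final step of the workflow.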

Document Classification

This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors and finally build a predictive model to classify the documents. It also contains the corresponding deployment workflow.
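As a rough illustration of the kind of predictive model built on document vectors, here is a minimal multinomial Naive Bayes classifier in plain Python. It is a sketch under assumed toy data, not the learner node the workflow actually uses:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model on tokenized documents
    and return a predict function (Laplace smoothing, log-space)."""
    class_docs = defaultdict(list)
    for d, y in zip(docs, labels):
        class_docs[y] += d
    vocab = {t for d in docs for t in d}
    priors = {y: labels.count(y) / len(labels) for y in class_docs}
    counts = {y: Counter(toks) for y, toks in class_docs.items()}
    totals = {y: sum(c.values()) for y, c in counts.items()}

    def predict(tokens):
        def logp(y):
            return math.log(priors[y]) + sum(
                math.log((counts[y][t] + 1) / (totals[y] + len(vocab)))
                for t in tokens)
        return max(priors, key=logp)
    return predict

predict = train_nb([["win", "cash"], ["meeting", "notes"]],
                   ["spam", "ham"])
print(predict(["cash"]))  # → 'spam'
```

The deployment workflow mentioned above corresponds to reusing the trained `predict` step on new, unlabeled documents.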

Sentiment Classification with NGrams

This workflow shows how to import text from a CSV file, convert it to documents, preprocess the documents, and transform them into numerical document vectors consisting of single word and 2-gram features.
Finally, two predictive models are trained on the vectors to predict the sentiment class of the documents. The two models are then compared via a ROC curve.
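Combining single words with 2-grams can be sketched as follows; the function name is illustrative and does not correspond to a KNIME node:

```python
def ngram_features(tokens, n_max=2):
    """Build single-word and n-gram features from a token list,
    mirroring the feature-extraction step of the workflow."""
    feats = list(tokens)  # unigrams
    for n in range(2, n_max + 1):
        feats += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return feats

print(ngram_features(["not", "a", "good", "movie"]))
# → ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

2-grams let the model see negations such as "not good" that unigram features alone would miss, which is why they help in sentiment tasks.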

epub JPEG Romeo Juliet

The challenge here is to blend together text and image data. Text data is in epub format while images are in JPEG format. The goal is to build the network of interactions in one of Shakespeare's most famous tragedies: Romeo and Juliet. The network of interactions is then displayed as a graph, where each node represents a character. Each node then displays the character's JPEG image. epub with JPEG. Will they blend? ... and yes! They blend.

Topic Detection LDA

This workflow extracts topics from the "Romeo & Juliet" epub book using the Topic Extractor (Parallel LDA) node. It reads textual data from a table and converts it into documents. The documents are then preprocessed, i.e. tagged, filtered, lemmatized, etc. After pre-processing, the Topic Extractor node can be applied to the documents; note that it requires users to specify the number of topics to extract beforehand. Finally, a tag cloud is created to visualize the topics' terms.
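To illustrate why the number of topics must be fixed up front, here is a minimal collapsed Gibbs sampler for LDA in plain Python. This is a toy sketch of the algorithm family the node wraps, not its actual (parallel) implementation; corpus, hyperparameters, and names are assumptions:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs.
    Returns the top-3 terms per topic. n_topics must be given up front
    because it fixes the size of every count table below."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * n_topics for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                        # tokens per topic
    z = []                                     # topic of each token
    for d, doc in enumerate(docs):             # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(n_iter):                    # resample each token's topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    return [[vocab[i] for i in
             sorted(range(V), key=lambda i: -nkw[k][i])[:3]]
            for k in range(n_topics)]

topics = lda_gibbs([["love", "love", "hate"], ["love", "hate", "hate"],
                    ["sword", "duel", "sword"], ["duel", "duel", "sword"]],
                   n_topics=2)
```

The top terms per topic returned here correspond to the terms the workflow visualizes in the tag cloud.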

Sentiment Analysis Lexicon Based Approach

This workflow shows how to perform lexicon-based sentiment analysis on the IMDB reviews dataset. The dataset contains movie reviews, previously labelled as positive/negative. The lexicon-based approach assigns a sentiment to each word in a text based on dictionaries of positive and negative words. A sentiment score is then calculated for each document as: (number of positive words - number of negative words) / total number of words.
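The score formula above is simple enough to compute directly. The tiny dictionaries below are assumptions for illustration; real sentiment lexicons contain thousands of entries:

```python
POSITIVE = {"good", "great", "excellent", "love"}   # toy lexicon
NEGATIVE = {"bad", "boring", "awful", "hate"}       # toy lexicon

def sentiment_score(tokens):
    """(#positive - #negative) / total tokens, as in the workflow."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

review = ["a", "great", "movie", "with", "an",
          "awful", "ending", "but", "great", "acting"]
print(sentiment_score(review))  # → (2 - 1) / 10 = 0.1
```

A positive score classifies the document as positive, a negative score as negative, which can then be compared against the original labels.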

NER Tagger Model Training

This workflow shows how to train a model for named-entity recognition. The model is created with the StanfordNLP NE Learner node, which builds a conditional random field (CRF) model. To train the model, a document training set and a dictionary of known named entities are needed. Because the model generalizes over word patterns, the tagger can use it to find new named entities in unknown documents. A Scorer node for model evaluation is also available.
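A CRF learner is too involved for a short sketch, but the dictionary input the Learner node consumes can be illustrated with a plain dictionary matcher. This is a deliberate simplification, not the CRF itself: unlike the trained model, it cannot generalize to unseen entities. All names below are illustrative:

```python
def tag_entities(tokens, dictionary, tag="PERSON"):
    """Mark tokens that appear in a known-entity dictionary; every
    other token gets the 'O' (outside) tag used in NER datasets."""
    return [(t, tag if t in dictionary else "O") for t in tokens]

known = {"Romeo", "Juliet", "Tybalt"}
print(tag_entities(["Romeo", "slays", "Tybalt"], known))
# → [('Romeo', 'PERSON'), ('slays', 'O'), ('Tybalt', 'PERSON')]
```

The CRF goes beyond this by learning contextual word-pattern features from such tagged training data, which is what lets it recognize entities absent from the dictionary.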