
How to easily extract Text from anything using spaCy

Here's a new framework that our AI Developer just unearthed - with this framework you can now extract text in a jiffy and also do a load of other cool stuff. Read on and find out how!

Karthik Kamalakannan / 21 November, 2017


NAME	DESCRIPTION
Tokenization	Segmenting text into words, punctuation marks etc.
Part-of-speech (POS) Tagging	Assigning word types to tokens, like verb or noun.
Dependency Parsing	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Lemmatization	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
Sentence Boundary Detection (SBD)	Finding and segmenting individual sentences.
Named Entity Recognition (NER)	Labelling named "real-world" objects, like persons, companies or locations.
Similarity	Comparing words, text spans and documents and how similar they are to each other.
Text Classification	Assigning categories or labels to a whole document, or parts of a document.
Rule-based Matching	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
Training	Updating and improving a statistical model's predictions.
Serialization	Saving objects to files or byte strings.

Here, we are going to look at Rule-based Matching, which helps us with text/entity extraction.

Here's one simple example to get started with:

import spacy
from spacy.matcher import Matcher

# Load the English model and its vocabulary
# ('en' is the spaCy 2 shortcut; spaCy 3 needs a full model name such as 'en_core_web_sm')
nlp = spacy.load('en')
# Create a matcher object that shares the model's vocabulary
matcher = Matcher(nlp.vocab)
sentence = u"Completed my Engineering in 1876"
doc = nlp(sentence)

# One single-token pattern per label
patterns = {
  "year": [{'IS_DIGIT': True}],
  "is_engineering": [{"LOWER": "engineering"}]
}

for label, pattern in patterns.items():
  # spaCy 2 signature: add(key, on_match, *patterns); None means no callback
  matcher.add(label, None, pattern)

matches = matcher(doc)

for match in matches:
  # each match is a tuple of (match_id, start, end); start and end are token indices
  print(doc[match[1]:match[2]])

What else can you really do with this Matching? That was my first question too when I was trying to understand what spaCy could do!
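
One answer is that a pattern can span several tokens, which is where the regular-expression comparison starts to make sense. The snippet below is only a minimal sketch under a few assumptions: it uses a blank English pipeline (tokenizer only) so that no model download is needed, it assumes spaCy 2.2.2 or later, where matcher.add() takes the patterns as a list, and the label ENGINEERING_YEAR is a made-up name for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because the pattern
# only relies on lexical attributes (LOWER, IS_DIGIT), not on a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical three-token pattern: "engineering", then "in", then digits.
pattern = [{"LOWER": "engineering"}, {"LOWER": "in"}, {"IS_DIGIT": True}]

# Assumption: spaCy 2.2.2+ signature, with the patterns passed as a list.
matcher.add("ENGINEERING_YEAR", [pattern])

doc = nlp("Completed my Engineering in 1876")

# Each match is (match_id, start, end); slice the doc to get the matched span.
phrases = [doc[start:end].text for _match_id, start, end in matcher(doc)]
print(phrases)
```

Each dictionary in the list describes exactly one token, so the whole pattern only matches when the three tokens appear next to each other in that order.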

One thing I admire about spaCy is its documentation and its code. Both are beautifully written, and any noob can understand them just by reading. No complicated adapters or exceptions.

P.S.: For beginners, spaCy 2 was a big leap from spaCy 1.x, so you might need to get to grips with some new functions and some renamed ones. But it's worth investing the time.

There are a few attrs (token attributes) that make it easier to extract text from a sentence. They help us build custom patterns that are very stable.

spaCy defines these in its attrs file. You can see that they are very simple and helpful attrs like LIKE_URL, LIKE_EMAIL etc., and the best part is that you can define your own flags and attrs for special cases.
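
To give a feel for how attrs like LIKE_EMAIL and LIKE_URL are used in a pattern, here is a rough sketch. It assumes a blank English pipeline and spaCy 2.2.2 or later (patterns passed as a list); the labels EMAIL and URL and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: LIKE_EMAIL and LIKE_URL are lexical attributes, so a blank
# pipeline (tokenizer only) is enough and no trained model is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One single-token pattern per hypothetical label.
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
matcher.add("URL", [[{"LIKE_URL": True}]])

doc = nlp("Write to hello@example.com or visit https://example.com today")

# Group the matched text by label; match_id is a hash, so look the
# label string up in the vocabulary's string store.
found = {}
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]
    found.setdefault(label, []).append(doc[start:end].text)

print(found)
```

Because these attrs are computed per token from its text alone, patterns built on them don't depend on the tagger or parser getting the sentence right.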

The matcher.add() function also takes an on_match callback function as its second parameter. Whenever a pattern matches, spaCy calls this callback with four arguments: the matcher, the doc, the index of the current match, and the list of all (match_id, start, end) match tuples.

A sample of the working:

def on_match(matcher, doc, i, matches):
  # matches[i] is the (match_id, start, end) tuple that triggered this call
  print("Matched")
  # the remaining workflow.

matcher.add("Checking", on_match, [{"LOWER": "checking"}])
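
For a slightly fuller picture, the sketch below has the callback record what was matched instead of just printing. It is only an illustration under some assumptions: a blank English pipeline, spaCy 2.2.2 or later (where on_match is passed as a keyword argument and the patterns as a list), and a made-up list named seen for collecting results.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline, since the pattern only uses the LOWER attribute.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical container for whatever the callback sees.
seen = []

def on_match(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple for the current match.
    match_id, start, end = matches[i]
    seen.append((doc.vocab.strings[match_id], doc[start:end].text))

# Assumption: spaCy 2.2.2+ signature with on_match as a keyword argument.
matcher.add("Checking", [[{"LOWER": "checking"}]], on_match=on_match)

# Running the matcher over a doc triggers the callback once per match.
matcher(nlp("Just checking whether the callback fires"))
print(seen)
```

The callback runs once per match, so it is a handy place to put side effects such as logging or tagging the matched span.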

I hope you now understand the basic operations that can be done using spaCy. spaCy 2 is the bleeding-edge version, and it's getting loaded with lots and lots of features that every NLP enthusiast has ever dreamt of. There are even other libraries, like textacy, that have been built on top of spaCy.

Okay guys, until we meet next time, I wish you a good time with spaCy's magic!

Last updated: January 23rd, 2024 at 1:50:36 PM GMT+0
