The most important concepts, explained in simple terms
Whether you’re new to spaCy, or just want to brush up on some NLP basics and implementation details – this page should have you covered. Each section will explain one of spaCy’s features in simple terms and with examples or illustrations. Some sections will also reappear across the usage guides as a quick introduction.
What’s spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
What spaCy isn’t
- spaCy is not a platform or “an API”. Unlike a platform, spaCy does not provide software as a service, or a web application. It’s an open-source library designed to help you build NLP applications, not a consumable service.
- spaCy is not an out-of-the-box chat bot engine. While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.
- spaCy is not research software. It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.
- spaCy is not a company. It’s an open-source library. Our company publishing spaCy and other software is called Explosion.
Features
In the documentation, you’ll come across mentions of spaCy’s features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.
Name | Description |
---|---|
Tokenization | Segmenting text into words, punctuation marks etc. |
Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
Lemmatization | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
Named Entity Recognition (NER) | Labelling named “real-world” objects, like persons, companies or locations. |
Entity Linking (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
Similarity | Comparing words, text spans and documents and how similar they are to each other. |
Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
Training | Updating and improving a statistical model’s predictions. |
Serialization | Saving objects to files or byte strings. |
Statistical models
While some of spaCy’s features work independently, others require trained pipelines to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include. The package you choose always depends on your use case and the texts you’re working with. For a general-purpose use case, the small, default packages are always a good start. They typically include the following components:
- Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
- Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- Data files like lemmatization rules and lookup tables.
- Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- Configuration options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.
Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.
Once you’ve downloaded and installed a trained pipeline, you can load it via spacy.load. This will return a Language object containing all components and data needed to process text. We usually call it nlp. Calling the nlp object on a string of text will return a processed Doc.
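A minimal sketch, assuming the small English pipeline en_core_web_sm has been installed (e.g. via python -m spacy download en_core_web_sm):

```python
import spacy

# Load the installed pipeline package and process a text
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print([token.text for token in doc])
```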
Even though a Doc is processed – e.g. split into individual words and annotated – it still holds all information of the original text, like whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you’ll never lose any information when processing text with spaCy.
Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them.
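A sketch, again assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
```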
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |
First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.
If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English or German, that loads in lists of hard-coded data and exception rules.
Part-of-speech tags and dependencies Needs model
After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.
Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name.
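A sketch that produces the annotations in the table below, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # underscore attributes return the readable string, not the hash
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
```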
Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
---|---|---|---|---|---|---|---|
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False |
is | be | AUX | VBZ | aux | xx | True | True |
looking | look | VERB | VBG | ROOT | xxxx | True | False |
at | at | ADP | IN | prep | xx | True | True |
buying | buy | VERB | VBG | pcomp | xxxx | True | False |
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False |
startup | startup | NOUN | NN | dobj | xxxx | True | False |
for | for | ADP | IN | prep | xxx | True | True |
$ | $ | SYM | $ | quantmod | $ | False | False |
1 | 1 | NUM | CD | compound | d | False | False |
billion | billion | NUM | CD | pobj | xxxx | True | False |
Using spaCy’s built-in displaCy visualizer, you can render our example sentence and its dependency parse.
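A sketch, assuming en_core_web_sm:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# serves the visualization on a local web server;
# use displacy.render(doc, style="dep") in a Jupyter notebook instead
displacy.serve(doc, style="dep")
```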
Named Entities Needs model
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents property of a Doc.
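A sketch that produces the entity annotations in the table below, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```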
Text | Start | End | Label | Description |
---|---|---|---|---|
Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
$1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
Using spaCy’s built-in displaCy visualizer, you can render our example sentence and its named entities.
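The same approach works for entities, using style="ent":

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.serve(doc, style="ent")  # highlights the ORG, GPE and MONEY spans
```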
Word vectors and similarity Needs model
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
banana.vector
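The actual value is a long array of floats. A sketch for inspecting it, assuming a pipeline with vectors such as en_core_web_md:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # medium pipeline, ships with word vectors
doc = nlp("banana")
print(doc[0].vector.shape)  # (300,) – 300 dimensions per word
print(doc[0].vector[:5])    # the first few float values
```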
Pipeline packages that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.
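A sketch, assuming en_core_web_md or another pipeline with built-in vectors:

```python
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
```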
The words “dog”, “cat” and “banana” are all pretty common in English, so they’re part of the pipeline’s vocabulary, and come with a vector. The word “afskfsd” on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger pipeline packages or loading in a full vector package, for example, en_core_web_lg, which includes 685k unique vectors.
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.
Each Doc, Span, Token and Lexeme comes with a .similarity method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether two words, spans or documents are similar really depends on how you’re looking at it. spaCy’s similarity implementation usually assumes a pretty general-purpose definition of similarity.
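A sketch, assuming en_core_web_md – the small pipelines don’t ship with word vectors, so their similarity scores are less meaningful:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# compare two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

# compare a span ("salty fries") with a token ("hamburgers")
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```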
What to expect from similarity results
Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:
- There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” are similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
- The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
- Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
Pipelines
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
Name | Component | Creates | Description |
---|---|---|---|
tokenizer | Tokenizer | Doc | Segment text into tokens. |
processing pipeline | |||
tagger | Tagger | Token.tag | Assign part-of-speech tags. |
parser | DependencyParser | Token.head, Token.dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
ner | EntityRecognizer | Doc.ents, Token.ent_iob, Token.ent_type | Detect and label named entities. |
lemmatizer | Lemmatizer | Token.lemma | Assign base forms. |
textcat | TextCategorizer | Doc.cats | Assign document labels. |
custom | custom components | Doc._.xxx, Token._.xxx, Span._.xxx | Assign custom attributes, methods or properties. |
The capabilities of a processing pipeline always depend on the components, their models and how they were trained. For example, a pipeline for named entity recognition needs to include a trained named entity recognizer component with a statistical model and weights that enable it to make predictions of entity labels. This is why each pipeline specifies its components and their settings in the config.
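To see which components a pipeline defines, you can inspect it from Python – a sketch assuming en_core_web_sm; the exact component list depends on the package version:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print(nlp.config["nlp"]["pipeline"])  # the same list, as stored in the config
```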
The statistical components like the tagger or parser are typically independent and don’t share any data between each other. For example, the named entity recognizer doesn’t use any features set by the tagger and parser, and so on. This means that you can swap them, or remove single components from the pipeline without affecting the others. However, components may share a “token-to-vector” component like Tok2Vec or Transformer. You can read more about this in the docs on embedding layers.
Custom components may also depend on annotations set by other components. For example, a custom lemmatizer may need the part-of-speech tags assigned, so it’ll only work if it’s added after the tagger. The parser will respect pre-defined sentence boundaries, so if a previous component in the pipeline sets them, its dependency predictions may be different. Similarly, it matters if you add the EntityRuler before or after the statistical entity recognizer: if it’s added before, the entity recognizer will take the existing entities into account when making predictions. The EntityLinker, which resolves named entities to knowledge base IDs, should be preceded by a pipeline component that recognizes entities such as the EntityRecognizer.
The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only really be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer, though. nlp.tokenizer is writable, so you can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.
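For illustration, a sketch of swapping in a naive whitespace-only tokenizer function – whitespace_tokenizer is a hypothetical name, and a real replacement would also need to handle punctuation:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

def whitespace_tokenizer(text):
    # hypothetical replacement: split on single spaces only
    words = text.split(" ")
    return Doc(nlp.vocab, words=words)

nlp.tokenizer = whitespace_tokenizer
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])  # note: "What's" stays one token
```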
Architecture
The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.
Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
Container objects
Name | Description |
---|---|
Doc | A container for accessing linguistic annotations. |
DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
Lexeme | An entry in the vocabulary. It’s a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
Span | A slice from a Doc object. |
SpanGroup | A named collection of spans belonging to a Doc. |
Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc. |
Processing pipeline
The processing pipeline consists of one or more pipeline components that are called on the Doc in order. The tokenizer runs before the components. Pipeline components can be added using Language.add_pipe. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
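A sketch of assembling a pipeline from a blank Language object – the lowercase_logger component is hypothetical and exists purely for illustration:

```python
import spacy
from spacy.language import Language

@Language.component("lowercase_logger")
def lowercase_logger(doc):
    # a toy rule-based component: inspect the Doc, then pass it on unchanged
    print([token.lower_ for token in doc])
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                  # built-in rule-based component
nlp.add_pipe("lowercase_logger", last=True)  # custom component, appended last
doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names)  # ['sentencizer', 'lowercase_logger']
```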
Name | Description |
---|---|
AttributeRuler | Set token attributes using matcher rules. |
DependencyParser | Predict syntactic dependencies. |
EditTreeLemmatizer | Predict base forms of words. |
EntityLinker | Disambiguate named entities to nodes in a knowledge base. |
EntityRecognizer | Predict named entities, e.g. persons or products. |
EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
Lemmatizer | Determine the base forms of words using rules and lookups. |
Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. |
SentenceRecognizer | Predict sentence boundaries. |
Sentencizer | Implement rule-based sentence boundary detection that doesn’t require the dependency parse. |
Tagger | Predict part-of-speech tags. |
TextCategorizer | Predict categories or labels over the whole document. |
Tok2Vec | Apply a “token-to-vector” model and set its outputs. |
Tokenizer | Segment raw text and create Doc objects from the words. |
TrainablePipe | Class that all trainable pipeline components inherit from. |
Transformer | Use a transformer model and set its outputs. |
Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. |
Matchers
Matchers help you find and extract information from Doc objects based on match patterns describing the sequences you’re looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.
Name | Description |
---|---|
DependencyMatcher | Match sequences of tokens based on dependency trees using Semgrex operators. |
Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
PhraseMatcher | Match sequences of tokens based on phrases. |
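A sketch of the token-based Matcher; the “HelloWorld” rule name and pattern are just examples:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "hello", optional punctuation, then "world" (case-insensitive)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "Hello, world" and "Hello world"
```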
Other classes
Name | Description |
---|---|
Corpus | Class for managing annotated corpora for training and evaluation data. |
KnowledgeBase | Abstract base class for storage and retrieval of data for entity linking. |
InMemoryLookupKB | Implementation of KnowledgeBase storing all data in memory. |
Candidate | Object associating a textual mention with a specific entity contained in a KnowledgeBase. |
Lookups | Container for convenient access to large lookup tables and dictionaries. |
MorphAnalysis | A morphological analysis. |
Morphology | Store morphological analyses and map them to and from hash values. |
Scorer | Compute evaluation scores. |
StringStore | Map strings to and from hash values. |
Vectors | Container class for vector data keyed by string. |
Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |
Vocab, hashes and lexemes
Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, “coffee” has the hash 3197928453018144401. Entity labels like “ORG” and part-of-speech tags like “VERB” are also encoded. Internally, spaCy only “speaks” in hash values.
If you process lots of documents containing the word “coffee” in all kinds of different contexts, storing the exact string “coffee” every time would take up way too much space. So instead, spaCy hashes the string and stores it in the StringStore. You can think of the StringStore as a lookup table that works in both directions – you can look up a string to get its hash, or a hash to get its string.
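A sketch; any pipeline works here, en_core_web_sm is assumed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])             # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'
```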
Now that all strings are encoded, the entries in the vocabulary don’t need to include the word text themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called Lexeme, contains the context-independent information about a word. For example, no matter if “love” is used as a verb or a noun in some context, its spelling and whether it consists of alphabetic characters won’t ever change. Its hash value will also always be the same.
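A sketch that prints the lexeme attributes shown in the table below, again assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_,
          lexeme.suffix_, lexeme.is_alpha, lexeme.is_digit)
```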
Text | Orth | Shape | Prefix | Suffix | is_alpha | is_digit |
---|---|---|---|---|---|---|
I | 4690420944186131903 | X | I | I | True | False |
love | 3702023516439754181 | xxxx | l | ove | True | False |
coffee | 3197928453018144401 | xxxx | c | fee | True | False |
The mapping of words to hashes doesn’t depend on any state. To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for “coffee” will always be the same, no matter which pipeline you’re using or how you’ve configured spaCy.
However, hashes cannot be reversed and there’s no way to resolve 3197928453018144401 back to “coffee”. All spaCy can do is look it up in the vocabulary. That’s why you always need to make sure all objects you create have access to the same vocabulary. If they don’t, spaCy might not be able to find the strings it needs.
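A sketch demonstrating the problem with a fresh, empty Vocab, and how sharing the vocabulary fixes it:

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401

empty_doc = Doc(Vocab())  # new Doc with an empty Vocab
# empty_doc.vocab.strings[3197928453018144401] would raise an error here

empty_doc.vocab.strings.add("coffee")  # add "coffee" and generate the hash
print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee'

new_doc = Doc(doc.vocab)  # create a new Doc sharing the first doc's vocab
print(new_doc.vocab.strings[3197928453018144401])  # 'coffee'
```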
If the vocabulary doesn’t contain a string for 3197928453018144401, spaCy will raise an error. You can re-add “coffee” manually, but this only works if you actually know that the document contains that word. To prevent this problem, spaCy will also export the Vocab when you save a Doc or nlp object. This will give you the object and its encoded annotations, plus the “key” to decode it.
Serialization
If you’ve been modifying the pipeline, vocabulary, vectors and entities, or made updates to the component models, you’ll eventually want to save your progress – for example, everything that’s in your nlp object. This means you’ll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.
All container classes, i.e. Language (nlp), Doc, Vocab and StringStore have the following methods available:
Method | Returns | Example |
---|---|---|
to_bytes | bytes | data = nlp.to_bytes() |
from_bytes | object | nlp.from_bytes(data) |
to_disk | - | nlp.to_disk("/path") |
from_disk | object | nlp.from_disk("/path") |
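A sketch of both round trips, assuming en_core_web_sm; the path is just an example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# bytes round trip: the receiving object should be built from the same config,
# e.g. another copy of the same pipeline package
data = nlp.to_bytes()
nlp2 = spacy.load("en_core_web_sm")
nlp2.from_bytes(data)

# disk round trip
nlp.to_disk("./my_pipeline")
nlp3 = spacy.load("./my_pipeline")
```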
Training
spaCy’s tagger, parser, text categorizer and many other components are powered by statistical models. Every “decision” these components make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction based on the model’s current weight values. The weight values are estimated based on examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.
Training is an iterative process in which the model’s predictions are compared against the reference annotations in order to estimate the gradient of the loss. The gradient of the loss is then used to calculate the gradient of the weights through backpropagation. The gradients indicate how the weight values should be changed so that the model’s predictions become more similar to the reference labels over time.
When training a model, we don’t just want it to memorize our examples – we want it to come up with a theory that can be generalized across unseen data. After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company – we want it to learn that “Amazon”, in contexts like this, is most likely a company. That’s why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.
This also means that in order to know how the model is performing, and whether it’s learning the right things, you don’t only need training data – you’ll also need evaluation data. If you only test the model with the data it was trained on, you’ll have no idea how well it’s generalizing. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation.
Training config and lifecycle
Training config files include all settings and hyperparameters for training your pipeline. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train. This also makes it easy to integrate custom models and architectures, written in your framework of choice. A pipeline’s config.cfg is considered the “single source of truth”, both at training and runtime.
Trainable components
spaCy’s Pipe class helps you implement your own trainable components that have their own model instance, make predictions over Doc objects and can be updated using spacy train. This lets you plug fully custom machine learning components into your pipeline that can be configured via a single training config.
Language data
Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organized in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, emoticons and single-letter abbreviations. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German. The values are defined in the Language.Defaults.
Name | Description |
---|---|
Stop words (stop_words.py) | List of most common words of a language that are often useful to filter out, for example “and” or “I”. Matching tokens will return True for is_stop. |
Tokenizer exceptions (tokenizer_exceptions.py) | Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”. |
Punctuation rules (punctuation.py) | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
Character classes (char_classes.py) | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
Lexical attributes (lex_attrs.py) | Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like “ten” or “hundred”. |
Syntax iterators (syntax_iterators.py) | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. |
Lemmatizer (lemmatizer.py, spacy-lookups-data) | Custom lemmatizer implementation and lemmatization tables. |
Community & FAQ
We’re very happy to see the spaCy community grow and include a mix of people from all kinds of different backgrounds – computational linguistics, data science, deep learning, research and more. If you’d like to get involved, below are some answers to the most important questions and resources for further reading.
Help, my code isn’t working!
Bugs suck, and we’re doing our best to continuously improve the tests and fix bugs as soon as possible. Before you submit an issue, do a quick search and check if the problem has already been reported. If you’re having installation or loading problems, make sure to also check out the troubleshooting guide. Help with spaCy is available via the following platforms:
- Stack Overflow: Usage questions and everything related to problems with your specific code. The Stack Overflow community is much larger than ours, so if your problem can be solved by others, you’ll receive help much quicker.
- GitHub discussions: General discussion, project ideas and usage questions. Meet other community members to get help with a specific code implementation, discuss ideas for new projects/plugins, support more languages, and share best practices.
- GitHub issue tracker: Bug reports and improvement suggestions, i.e. everything that’s likely spaCy’s fault. This also includes problems with the trained pipelines beyond statistical imprecisions, like patterns that point to a bug.
How can I contribute to spaCy?
You don’t have to be an NLP expert or Python pro to contribute, and we’re happy to help you get started. If you’re new to spaCy, a good place to start is the help wanted (easy) label on GitHub, which we use to tag bugs and feature requests that are easy and self-contained. We also appreciate contributions to the docs – whether it’s fixing a typo, improving an example or adding additional explanations. You’ll find a “Suggest edits” link at the bottom of each page that points you to the source.
Another way of getting involved is to help us improve the language data – especially if you happen to speak one of the languages currently in alpha support. Even adding simple tokenizer exceptions, stop words or lemmatizer data can make a big difference. It will also make it easier for us to provide a trained pipeline for the language in the future. Submitting a test that documents a bug or performance issue, or covers functionality that’s especially important for your application is also very helpful. This way, you’ll also make sure we never accidentally introduce regressions to the parts of the library that you care about the most.
For more details on the types of contributions we’re looking for, the code conventions and other useful tips, make sure to check out the contributing guidelines.
I’ve built something cool with spaCy – how can I get the word out?
First, congrats – we’d love to check it out! When you share your project on Twitter, don’t forget to tag @spacy_io so we don’t miss it. If you think your project would be a good fit for the spaCy Universe, feel free to submit it! Tutorials are also incredibly valuable to other users and a great way to get exposure. So we strongly encourage writing up your experiences, or sharing your code and some tips and tricks on your blog. Since our website is open-source, you can add your project or tutorial by making a pull request on GitHub.
If you would like to use the spaCy logo on your site, please get in touch and ask us first. However, if you want to show support and tell others that your project is using spaCy, you can grab one of our spaCy badges.