Reference

excel_ngrams.console

Command-line interface.

excel_ngrams.grammer

Return dataframe of ngrams from list of words.

class excel_ngrams.grammer.Grammer(terms_list)

Class that returns n-grams from text as a list of strings.

Words are delineated by white space and punctuation. Uses Spacy’s NLP pipe and NLTK’s ngrams function to generate ngrams within a given word length range and outputs them to a Pandas DataFrame for writing to an output file.

term_list

List of text as strings (one or more).

_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.
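The shared-but-lazily-loaded pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: the class name LazyShared, the load_count counter and the placeholder values are all hypothetical, and the real class presumably loads a Spacy pipeline and a stopword list instead.

```python
class LazyShared:
    """Hypothetical stand-in for a class with expensive shared resources."""

    # Class attributes: one copy shared by every instance.
    _nlp = None
    _stopwords = None
    load_count = 0  # illustration only: counts how often loading ran

    def __init__(self, terms_list):
        self.terms_list = terms_list
        cls = type(self)
        # Load on first construction only; later instances reuse the result.
        if cls._nlp is None:
            cls._nlp = object()  # placeholder for a real NLP pipeline
            cls._stopwords = frozenset({"the", "a", "of"})  # placeholder set
            cls.load_count += 1


first = LazyShared(["the quick fox"])
second = LazyShared(["a lazy dog"])
print(LazyShared.load_count)                  # 1
print(first._stopwords is second._stopwords)  # True
```

Importing the module costs nothing under this arrangement; the expensive load happens once, when the first instance is created.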

combine_dataframes(dataframes)

Creates single multi-column dataframe.

Takes the terms and frequency values for dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.

Parameters

dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.

Returns

Single combined dataframe from list of dataframes.

Return type

pd.DataFrame
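To show the side-by-side layout without requiring pandas, the sketch below uses plain lists. The helper name combine_columns and the sample data are hypothetical. In pandas, pd.concat(dataframes, axis=1) gives a comparable layout, though this reference does not say which call the method uses.

```python
from itertools import zip_longest


def combine_columns(tables):
    """Place (terms, values) column pairs side by side as rows.

    Shorter columns are padded with None, much as pandas pads with NaN.
    """
    columns = []
    for terms, values in tables:
        columns.extend([terms, values])
    # zip_longest transposes the columns into rows and pads missing cells.
    return [list(row) for row in zip_longest(*columns)]


unigrams = (["cat", "dog", "fish"], [5, 3, 1])
bigrams = (["black cat", "big dog"], [2, 1])

for row in combine_columns([unigrams, bigrams]):
    print(row)
# ['cat', 5, 'black cat', 2]
# ['dog', 3, 'big dog', 1]
# ['fish', 1, None, None]
```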

df_from_terms(ngram_tuples)

Creates DataFrame from lists of terms and values as tuple.

Calls terms_to_columns on ngram_tuples to unpack them.

Parameters

ngram_tuples (list) – list of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.

Return type

pd.DataFrame

get_ngrams(n, top_n_results=250, stopwords=True)

Create tuple with terms and frequency from list.

List of terms is tokenised using Spacy’s NLP pipe, set to lowercase and ngrams are calculated with NLTK’s ngrams function.

Parameters
  • n (int) – The length of phrases to analyse.

  • top_n_results (int) – The number of results to return. Default is 250.

  • stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

List of tuples containing term(s) and values.

Return type

list of tuple[tuple[str, …], int]
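The shape of the return value is easiest to see with a small, dependency-free sketch. It is an approximation under stated assumptions: whitespace splitting stands in for Spacy tokenisation, a zip-based helper stands in for NLTK's ngrams function, the STOPWORDS set is hypothetical, and stopwords are dropped before n-grams are formed, which may differ from the library's behaviour.

```python
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "and"})  # hypothetical, not the library's list


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(texts, n, top_n_results=250, stopwords=True):
    """Return the most frequent n-grams as (terms, frequency) tuples."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # stand-in for Spacy tokenisation
        if stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        counts.update(ngrams(tokens, n))
    return counts.most_common(top_n_results)


texts = ["The black cat", "A black cat and the dog"]
print(count_ngrams(texts, n=2))
# [(('black', 'cat'), 2), (('cat', 'dog'), 1)]
```

Each result pairs a tuple of words with its count, matching the return type listed above; single terms come back as one-element tuples.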

in_stop_words(spacy_token_text)

Check if word appears in stopword set.

Parameters

spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.

Returns

Whether text is present in stopwords.

Return type

bool
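A check of this kind usually reduces to a set membership test. The sketch below assumes a small hypothetical stopword set and a standalone function; the real method reads the stopwords shared by the class.

```python
STOPWORDS = frozenset({"the", "a", "of"})  # hypothetical stopword set


def in_stop_words(token_text, stopwords=STOPWORDS):
    """Return True when the token text is in the stopword set."""
    return token_text in stopwords


print(in_stop_words("the"))  # True
print(in_stop_words("cat"))  # False
print(in_stop_words("The"))  # False
```

Membership in this sketch is case-sensitive, so text has to be lowercased first for "The" to match.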

ngram_range(max_n, n=1, top_n_results=250, stopwords=True)

Gets ngram terms and outputs for a range of phrase lengths.

Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.

Parameters
  • max_n (int) – The longest phrase length desired in output.

  • n (int) – The minimum term length. Default is 1 (single term).

  • top_n_results (int) – The number of rows of results to return. Default is 250.

  • stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

Combined dataframe of all results from various term lengths to desired maximum.

Return type

pd.DataFrame
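The loop over phrase lengths can be sketched in plain Python. The function names here are hypothetical, the input is assumed to be already tokenised, max_n is assumed to be inclusive, and the result is a dict rather than the combined DataFrame the real method returns.

```python
from collections import Counter


def top_ngrams(tokens, n, top_n_results):
    """Count runs of n consecutive tokens and keep the most frequent."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top_n_results)


def ngram_range_sketch(tokens, max_n, n=1, top_n_results=250):
    """Map each phrase length from n to max_n (inclusive) to its top n-grams."""
    return {
        length: top_ngrams(tokens, length, top_n_results)
        for length in range(n, max_n + 1)
    }


tokens = "to be or not to be".split()
results = ngram_range_sketch(tokens, max_n=2)
print(list(results))  # [1, 2]
print(results[2][0])  # (('to', 'be'), 2)
```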

remove_escaped_chars(text)

Remove newline and tab chars from string list.

Parameters

text (List of str) – Terms list to be cleaned of specific chars.

Returns

Terms list without specific chars.

Return type

list of str
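A cleaning step like this can be written with str.translate. The sketch assumes newlines and tabs are replaced by single spaces so that adjacent words stay separated; the real method may delete them instead. The function name is hypothetical.

```python
def strip_newlines_and_tabs(texts):
    """Return a new list with newline and tab characters replaced by spaces."""
    table = str.maketrans({"\n": " ", "\t": " "})
    return [text.translate(table) for text in texts]


print(strip_newlines_and_tabs(["black\ncat", "big\tdog"]))
# ['black cat', 'big dog']
```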

terms_to_columns(ngram_tuples)

Returns term/value tuples as two lists.

Parameters

ngram_tuples (list) – list of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

Tuple of two lists: term_col (list of str), the terms, with multi-word terms concatenated into a single string; and value_col (list of int), the term frequencies.

Return type

tuple of (list of str, list of int)
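The unpacking into two parallel lists can be sketched as below. The function name is hypothetical, and joining multi-word terms with a single space is an assumption; the reference only says they are concatenated into a single string.

```python
def split_terms_and_values(ngram_tuples):
    """Split (terms, frequency) tuples into a term list and a value list."""
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


pairs = [(("black", "cat"), 2), (("cat", "dog"), 1)]
term_col, value_col = split_terms_and_values(pairs)
print(term_col)   # ['black cat', 'cat dog']
print(value_col)  # [2, 1]
```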