Reference¶

excel_ngrams.console
excel_ngrams.grammer

excel_ngrams.console¶

Command-line interface.

excel_ngrams.grammer¶

Return dataframe of ngrams from list of words.

class excel_ngrams.grammer.Grammer(terms_list)¶

Class that returns n-grams from text as a list of strings.

Words are delineated by white space and punctuation. Uses spaCy’s NLP pipe and NLTK’s ngrams function to generate n-grams within a given word-length range and outputs them to a pandas DataFrame for writing to an output file.

term_list¶: List of text as strings (one or more).

_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.
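The shared-but-lazily-loaded behaviour described above can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in rather than the library's code: the class name `LazySharedResources`, the helper `_load_stopwords` and the tiny stop-word set are assumptions made for illustration only.

```python
class LazySharedResources:
    """Hypothetical stand-in showing class-level state loaded on first use."""

    # Class attribute: a single copy shared by every instance.
    _stopwords = None

    def __init__(self, terms_list):
        self.term_list = terms_list
        # Load the shared resource only when the first instance is built.
        if LazySharedResources._stopwords is None:
            LazySharedResources._stopwords = self._load_stopwords()

    @staticmethod
    def _load_stopwords():
        # Placeholder for an expensive load (assumed here, not taken from the library).
        return frozenset({"the", "a", "of", "and"})


first = LazySharedResources(["the quick fox"])
second = LazySharedResources(["a lazy dog"])
# Both instances see the very same object, loaded once.
print(first._stopwords is second._stopwords)
```

Importing a module written this way costs nothing until an instance is actually constructed, which appears to be the motivation stated above.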

combine_dataframes(dataframes)¶

Creates single multi-column dataframe.

Takes the terms and frequency values for dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.

Parameters: dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.
Returns: Single combined dataframe from list of dataframes.
Return type: pd.DataFrame
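As a rough illustration of a side-by-side merge, the sketch below uses `pd.concat` with `axis=1`. This is one plausible way to get the described result and is not confirmed to be what the method does internally; the column names are invented for the example.

```python
import pandas as pd

# Hypothetical per-length results: one frame for single terms, one for bigrams.
unigrams = pd.DataFrame({"1 word": ["data", "model"], "1 word freq": [12, 9]})
bigrams = pd.DataFrame({"2 words": ["data model"], "2 words freq": [4]})

# Place the frames next to each other; shorter frames are padded with NaN.
combined = pd.concat([unigrams, bigrams], axis=1)
print(combined)
```

Because frames for longer phrases usually have fewer rows, the padded cells show up as missing values in the combined output.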

df_from_terms(ngram_tuples)¶

Creates a DataFrame from a list of term and value tuples.

Calls terms_to_columns on ngram_tuples to unpack them.

Parameters

ngram_tuples (list) – list of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

pandas DataFrame comprising a column of terms and a column of frequency values for those terms.

Return type

pd.DataFrame

get_ngrams(n, top_n_results=250, stopwords=True)¶

Create tuple with terms and frequency from list.

The list of terms is tokenised using spaCy’s NLP pipe and set to lowercase, then n-grams are calculated with NLTK’s ngrams function.

Parameters

n (int) – The length of phrases to analyse.
top_n_results (int) – The number of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

List of tuples containing term(s) and values.

Return type

list of tuple[tuple[str, …], int]
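The tokenise, lowercase, window and count steps can be sketched with the standard library alone. This is an approximation under stated assumptions: a regular expression stands in for spaCy's tokeniser, a sliding window stands in for NLTK's ngrams function, and stop words are assumed to be filtered before the window is applied. The function name `sketch_get_ngrams` is hypothetical.

```python
import re
from collections import Counter


def sketch_get_ngrams(terms, n, top_n_results=250, stopwords=None):
    """Return the most frequent n-grams as (tuple_of_words, count) pairs."""
    stopwords = stopwords or set()
    counts = Counter()
    for text in terms:
        # Lowercase, then keep runs of letters, digits and apostrophes as tokens.
        tokens = [
            token
            for token in re.findall(r"[a-z0-9']+", text.lower())
            if token not in stopwords
        ]
        # Slide a window of width n across the tokens of this string.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts.most_common(top_n_results)


terms = ["Red apple pie", "red apple tart", "green apple pie"]
print(sketch_get_ngrams(terms, 2, top_n_results=2))
```

The return value has the same shape as the documented return type: a list of pairs, each holding a tuple of words and its frequency.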

in_stop_words(spacy_token_text)¶

Check if word appears in stopword set.

Parameters: spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.
Returns: Whether text is present in stopwords.
Return type: bool

ngram_range(max_n, n=1, top_n_results=250, stopwords=True)¶

Gets ngram terms and outputs for a range of phrase lengths.

Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.

Parameters

max_n (int) – The longest phrase length desired in output.
n (int) – The minimum term length. Default is 1 (single term).
top_n_results (int) – The number of rows of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

Combined dataframe of all results from various term lengths up to the desired maximum.

Return type

pd.DataFrame

remove_escaped_chars(text)¶

Remove newline and tab characters from a list of strings.

Parameters

text (List of str) – Terms list to be cleaned of specific chars.

Returns

Terms list without the specified characters.

Return type

list of str
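A minimal sketch of this kind of cleaning is shown below. It assumes each newline or tab is replaced by a single space so that neighbouring words stay separate. The library may instead delete the characters outright, and the function name is hypothetical.

```python
def sketch_remove_escaped_chars(text):
    """Return a new list with newlines and tabs replaced by spaces."""
    # Assumption: substitute a space rather than deleting the character.
    return [item.replace("\n", " ").replace("\t", " ") for item in text]


print(sketch_remove_escaped_chars(["first\nline", "tab\tseparated"]))
```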

terms_to_columns(ngram_tuples)¶

Returns term/value tuples as two lists.

Parameters

ngram_tuples (list) – list of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

term_col (list of str): Terms, concatenated into a single string for multi-word terms, returned as a list.

value_col (list of int): Term frequencies as a list. Both lists are returned together as a tuple.

Return type

tuple of (list of str, list of int)
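The unpacking can be sketched as follows. Joining multi-word terms with a single space is an assumption, since the separator is not stated here, and `sketch_terms_to_columns` is a hypothetical name.

```python
def sketch_terms_to_columns(ngram_tuples):
    """Split (words, count) pairs into a list of terms and a list of counts."""
    # Assumption: words of a multi-word term are joined with one space.
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


term_col, value_col = sketch_terms_to_columns([(("red", "apple"), 2), (("pie",), 1)])
print(term_col, value_col)
```

The two lists line up index by index, which makes them easy to use as the two columns of a DataFrame.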