Reference

excel_ngrams.console

Command-line interface.

excel_ngrams.grammer

Client to get ngram values from Excel document.

class excel_ngrams.grammer.FileHandler(file_path, sheet_name=0, column_name='Keyword')

Class to handle reading, data extraction, and writing to files.

file_path

The path to Excel file to be read.

Type

str

sheet_name

The name or number of the sheet to read from.

Type

int or str

column_name

The name of the column to be read from. Defaults to ‘Keyword’.

Type

str

term_list

A list of terms (read from the Excel column).

Type

list

get_destination_path()

Creates path to write output csv file to.

Derives the output path from the path of the input Excel file: the output file name mirrors the input file name, with the datetime and n-grams appended.

Returns

Path to write output file to.

Return type

str
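The naming scheme described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name `build_destination_path`, the timestamp layout, and the `_n_grams` suffix are all assumptions.

```python
from datetime import datetime
from pathlib import Path


def build_destination_path(file_path: str, now: datetime) -> str:
    """Hypothetical helper: derive a csv output path from the input path."""
    source = Path(file_path)
    # Assumed timestamp layout; the format used by the package is not documented here.
    stamp = now.strftime("%Y%m%d%H%M%S")
    # Keep the input file's directory and stem, swap the extension for .csv.
    return str(source.with_name(f"{source.stem}_{stamp}_n_grams.csv"))


output_path = build_destination_path("keywords.xlsx", datetime(2021, 3, 4, 5, 6, 7))
print(output_path)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the sketch deterministic and easy to test.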

get_file_path()

Getter method that returns the Excel document's file path.

Return type

str

get_terms()

Getter method that returns term_list.

Return type

List[str]

set_terms(file_path, sheet_name, column_name)

Sets term_list attribute from Excel doc.

Uses Pandas DataFrame as an intermediate to generate list.

Parameters
  • file_path (str) – The path to Excel file to read terms from.

  • sheet_name (int or str) – The name or number of the sheet containing terms. Defaults to 0 (first sheet when sheets are unnamed).

  • column_name (str) – The name of the column header containing terms. Defaults to Keyword.

Returns

Terms from the Excel column as a Python list.

Return type

list
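The DataFrame-to-list step can be sketched as follows. This is an illustration under stated assumptions, not the method itself: `terms_from_frame` is a hypothetical helper, the frame is built in memory instead of being read from a file, and dropping empty cells is an assumption about the desired behaviour.

```python
import pandas as pd


def terms_from_frame(frame: pd.DataFrame, column_name: str = "Keyword") -> list:
    """Hypothetical helper: pull one column out of a DataFrame as a list of str."""
    # Assumption: empty cells are skipped rather than kept as NaN.
    return frame[column_name].dropna().astype(str).tolist()


# In-memory stand-in for a sheet with a "Keyword" header and one empty cell.
frame = pd.DataFrame({"Keyword": ["red shoes", None, "blue shoes"]})
terms = terms_from_frame(frame)
print(terms)
```

With a real workbook, the frame would typically come from `pd.read_excel(file_path, sheet_name=sheet_name)`, which requires an Excel engine such as openpyxl to be installed.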

write_df_to_file(df)

Writes DataFrame to csv file.

Gets path from get_destination_path method and uses Pandas to_csv function to write DataFrame to csv file.

Parameters

df (pd.DataFrame) – Dataframe of terms and values columns for ngrams.

Returns

Path to which csv file was written.

Return type

str

class excel_ngrams.grammer.Grammer(file_handler)

Class to get n-grams from list of terms.

Uses spaCy’s NLP pipe and NLTK’s ngrams function to generate ngrams within a given range and outputs them to a Pandas DataFrame for writing to an output file.

file_handler

FileHandler obj with input file path.

Type

FileHandler

term_list

Term list from FileHandler attribute.

_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.

combine_dataframes(dataframes)

Creates single multi-column dataframe.

Takes the terms and frequency values from dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.

Parameters

dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.

Returns

Single combined dataframe from list of dataframes.

Return type

pd.DataFrame
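A side-by-side merge of this kind can be sketched with `pd.concat` along the column axis. Whether the method uses `pd.concat` is an assumption, and the function name `combine_side_by_side` and the column names below are hypothetical.

```python
import pandas as pd


def combine_side_by_side(dataframes: list) -> pd.DataFrame:
    """Hypothetical helper: place each DataFrame's columns next to the others."""
    # axis=1 aligns rows by index and appends columns left to right.
    return pd.concat(dataframes, axis=1)


# Illustrative column names, not necessarily those produced by the package.
unigrams = pd.DataFrame({"1 gram": ["shoes", "red"], "1 gram frequency": [3, 2]})
bigrams = pd.DataFrame({"2 gram": ["red shoes"], "2 gram frequency": [2]})

combined = combine_side_by_side([unigrams, bigrams])
print(combined)
```

Because the frames can have different lengths, shorter ones are padded with missing values in the combined result.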

df_from_terms(ngram_tuples)

Creates DataFrame from lists of terms and values as tuple.

Calls terms_to_columns on ngram_tuples to unpack them.

Parameters

ngram_tuples (list) – List of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.

Return type

pd.DataFrame

get_ngrams(n, top_n_results=250, stopwords=True)

Creates a list of tuples of terms and frequencies from the term list.

The list of terms is tokenised using spaCy’s NLP pipe and set to lowercase, then ngrams are calculated with NLTK’s ngrams function.

Parameters
  • n (int) – The length of phrases to analyse.

  • top_n_results (int) – The number of results to return. Default is 250.

  • stopwords (bool) – Flag to indicate removal of stopwords. Default is True.

Returns

List of tuples containing term(s) and values.

Return type

list of tuple[tuple[str, …], int]
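The shape of the return value is easiest to see with a small standard-library sketch. It is not the method's implementation: whitespace splitting stands in for spaCy tokenisation, a `zip`-based sliding window stands in for NLTK's ngrams function, stopword removal is omitted, and the name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(terms, n, top_n_results=250):
    """Hypothetical helper: count n-word phrases across a list of terms."""
    counts = Counter()
    for term in terms:
        # Stand-in for spaCy tokenisation: lowercase, then split on whitespace.
        tokens = term.lower().split()
        # Sliding window of length n; yields nothing if the term is too short.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    # Most frequent first, limited to the requested number of results.
    return counts.most_common(top_n_results)


terms = ["Red running shoes", "red running socks", "blue shoes"]
bigrams = count_ngrams(terms, 2)
print(bigrams[0])
```

Each result pairs a tuple of words with its frequency, matching the return type listed above.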

in_stop_words(spacy_token_text)

Check if word appears in stopword set.

Parameters

spacy_token_text (str) – The text attribute of the spaCy token being passed to the method.

Returns

Whether text is present in stopwords.

Return type

bool

ngram_range(max_n, n=1, top_n_results=250, stopwords=True)

Gets ngram terms and outputs for a range of phrase lengths.

Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.

Parameters
  • max_n (int) – The longest phrase length desired in output.

  • n (int) – The minimum term length. Default is 1 (single term).

  • top_n_results (int) – The number of rows of results to return. Default is 250.

  • stopwords (bool) – Flag to indicate removal of stopwords. Default is True.

Returns

Combined dataframe of all results from various term lengths to desired maximum.

Return type

pd.DataFrame

output_csv_file(df)

Write dataframe to csv file.

Parameters

df (pd.DataFrame) – Dataframe with columns of terms and values.

Returns

The path the csv file was written to.

Return type

str

Raises

ClickException – Writing to csv file failed.

remove_escaped_chars(text)

Remove newline and tab chars from string list.

Parameters

text (List of str) – Terms list to be cleaned of specific chars.

Returns

Terms list without specific chars.

Return type

list of str
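A cleaning step of this kind might look like the sketch below. The helper name `strip_newlines_and_tabs` is hypothetical, and replacing each newline or tab with a single space (so adjacent words do not fuse) is an assumption; the method may simply delete the characters.

```python
def strip_newlines_and_tabs(terms):
    """Hypothetical helper: clean newline and tab characters from each term."""
    # str.split() with no arguments splits on any whitespace run, including
    # "\n" and "\t"; joining with a space also trims leading and trailing gaps.
    return [" ".join(term.split()) for term in terms]


cleaned = strip_newlines_and_tabs(["red\tshoes\n", "blue\nshoes"])
print(cleaned)
```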

terms_to_columns(ngram_tuples)

Returns term/value tuples as two lists.

Parameters

ngram_tuples (list) – List of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

Tuple of two lists: term_col (list of str), the terms, concatenated into a single string for multi-word terms; and value_col (list of int), the term frequencies.

Return type

tuple[list of str, list of int]
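The unpacking described above can be sketched in a few lines. The function name `split_into_columns` is hypothetical, and joining multi-word terms with a single space is an assumption about the separator.

```python
def split_into_columns(ngram_tuples):
    """Hypothetical helper: split (words, count) pairs into two parallel lists."""
    # Join each tuple of words into one string, e.g. ("red", "shoes") -> "red shoes".
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


term_col, value_col = split_into_columns([(("red", "shoes"), 2), (("blue",), 1)])
print(term_col, value_col)
```

The two lists stay aligned by position, so they can be used directly as the two columns of a DataFrame.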