Reference¶

excel_ngrams.console
excel_ngrams.grammer

excel_ngrams.console¶

Command-line interface.

excel_ngrams.grammer¶

Client to get ngram values from Excel document.

class excel_ngrams.grammer.FileHandler(file_path, sheet_name=0, column_name='Keyword')¶

Class to handle reading, data extraction, and writing to files.

file_path¶

The path to Excel file to be read.

Type: str

sheet_name¶

The name or number of the sheet to read from.

Type: int or str

column_name¶

The name of the column to be read from. Defaults to ‘Keyword’.

Type: str

term_list¶

A list of terms (read from the Excel column).

Type: list

get_destination_path()¶

Creates path to write output csv file to.

Uses the path of the input Excel file to create an output path that mimics the input file name but is appended with the datetime and n-grams.

Returns: Path to write output file to.
Return type: str
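
As a rough illustration of the naming scheme described above, the sketch below derives an output path from an input path. The helper name `destination_path_sketch`, the timestamp layout, and the `_n-grams` suffix are all assumptions made for illustration; the method's real filename format may differ.

```python
from datetime import datetime
from pathlib import Path


def destination_path_sketch(file_path: str, now: datetime) -> str:
    """Hypothetical helper: build a csv path next to the input Excel file."""
    source = Path(file_path)
    # Assumed timestamp layout; the real method may format this differently.
    stamp = now.strftime("%Y%m%d%H%M%S")
    # Keep the input file's stem, then append the timestamp and a suffix.
    return str(source.with_name(f"{source.stem}_{stamp}_n-grams.csv"))


print(destination_path_sketch("data/keywords.xlsx", datetime(2021, 1, 2, 3, 4, 5)))
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the sketch deterministic and easy to test.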

get_file_path()¶

Getter method; returns the Excel document file path.

Return type: str

get_terms()¶

Getter method; returns term_list.

Return type: List[str]

set_terms(file_path, sheet_name, column_name)¶

Sets term_list attribute from Excel doc.

Uses Pandas DataFrame as an intermediate to generate list.

Parameters

file_path (str) – The path to Excel file to read terms from.
sheet_name (int or str) – The name or number of the sheet containing terms. Defaults to 0 (first sheet when sheets are unnamed).
column_name (str) – The name of the column header containing terms. Defaults to Keyword.

Returns

Terms from the Excel column as a Python list.

Return type

list

write_df_to_file(df)¶

Writes DataFrame to csv file.

Gets path from get_destination_path method and uses Pandas to_csv function to write DataFrame to csv file.

Parameters: df (pd.DataFrame) – Dataframe of terms and values columns for ngrams.
Returns: Path to which csv file was written.
Return type: str

class excel_ngrams.grammer.Grammer(file_handler)¶

Class to get n-grams from list of terms.

Uses Spacy’s NLP pipe and NLTK’s ngrams function to generate ngrams within a given range and outputs them to a Pandas DataFrame for writing to an output file.

file_handler¶

FileHandler obj with input file path.

Type: FileHandler

term_list¶

Term list from the FileHandler attribute.

_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.

combine_dataframes(dataframes)¶

Creates single multi-column dataframe.

Takes the terms and frequency values from dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.

Parameters: dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.
Returns: Single combined dataframe from list of dataframes.
Return type: pd.DataFrame

df_from_terms(ngram_tuples)¶

Creates DataFrame from lists of terms and values as tuple.

Calls terms_to_columns on ngram_tuples to unpack them.

Parameters

ngram_tuples (list) – List of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.

Return type

pd.DataFrame

get_ngrams(n, top_n_results=250, stopwords=True)¶

Create tuple with terms and frequency from list.

The list of terms is tokenised using Spacy’s NLP pipe and set to lowercase, and ngrams are calculated with NLTK’s ngrams function.

Parameters

n (int) – The length of phrases to analyse.
top_n_results (int) – The number of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

List of tuples containing term(s) and values.

Return type

list of tuple[tuple[str, …], int]
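
The counting step can be approximated without Spacy or NLTK. The following is a minimal sketch, assuming plain whitespace tokenisation in place of Spacy’s pipe and a `zip`-based sliding window in place of NLTK’s ngrams function. The names `count_ngrams` and `STOPWORDS` are hypothetical, and filtering stopwords before forming n-grams is an assumption, since the order is not stated here.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical stopword set for illustration only.
STOPWORDS = {"a", "the", "for", "of"}


def count_ngrams(
    terms: List[str], n: int, top_n_results: int = 250, stopwords: bool = True
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count n-grams across terms, most frequent first."""
    counts: Counter = Counter()
    for term in terms:
        # Stand-in for Spacy tokenisation: lowercase and split on whitespace.
        tokens = term.lower().split()
        if stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        # Sliding window of length n over the token list.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n_results)


terms = ["Red shoes for men", "red shoes", "blue shoes for women"]
print(count_ngrams(terms, n=2, top_n_results=1))  # [(('red', 'shoes'), 2)]
```

The result has the same shape as the documented return type: a list of pairs, each holding a tuple of words and its frequency.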

in_stop_words(spacy_token_text)¶

Check if word appears in stopword set.

Parameters: spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.
Returns: Whether text is present in stopwords.
Return type: bool

ngram_range(max_n, n=1, top_n_results=250, stopwords=True)¶

Gets ngram terms and outputs for a range of phrase lengths.

Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.

Parameters

max_n (int) – The longest phrase length desired in output.
n (int) – The minimum term length. Default is 1 (single term).
top_n_results (int) – The number of rows of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.

Returns

Combined dataframe of all results from various term lengths up to the desired maximum.

Return type

pd.DataFrame
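
The loop over phrase lengths can be sketched in plain Python. This is an illustrative approximation only: `ngram_range_sketch` is a hypothetical name, it tokenises on whitespace rather than with Spacy, it skips stopword handling, and it returns a dictionary keyed by phrase length instead of the combined Pandas DataFrame the real method returns.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngram_range_sketch(
    terms: List[str], max_n: int, n: int = 1, top_n_results: int = 250
) -> Dict[int, List[Tuple[str, int]]]:
    """Hypothetical helper: top n-grams for each length from n to max_n."""
    results: Dict[int, List[Tuple[str, int]]] = {}
    for length in range(n, max_n + 1):
        counts: Counter = Counter()
        for term in terms:
            tokens = term.lower().split()
            # Join each window of tokens into one space-separated phrase.
            counts.update(
                " ".join(tokens[i : i + length])
                for i in range(len(tokens) - length + 1)
            )
        results[length] = counts.most_common(top_n_results)
    return results


for length, rows in ngram_range_sketch(["red shoes", "red hat"], max_n=2).items():
    print(length, rows)
```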

output_csv_file(df)¶

Write dataframe to csv file.

Parameters: df (pd.DataFrame) – Dataframe with columns of terms and values.
Returns: The path the csv file was written to.
Return type: str
Raises: ClickException – Writing to csv file failed.

remove_escaped_chars(text)¶

Remove newline and tab chars from string list.

Parameters

text (List of str) – Terms list to be cleaned of specific chars.

Returns

Terms list without the specified characters.

Return type

List of str
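
A minimal sketch of this kind of cleaning is shown below. The helper name `strip_newlines_and_tabs` is hypothetical, and the sketch assumes the characters are deleted outright rather than replaced with spaces, which the description above does not specify.

```python
from typing import List


def strip_newlines_and_tabs(text: List[str]) -> List[str]:
    """Hypothetical helper: drop newline and tab characters from each term."""
    # Assumption: the characters are deleted rather than replaced by spaces.
    table = str.maketrans("", "", "\n\t")
    return [term.translate(table) for term in text]


print(strip_newlines_and_tabs(["red shoes\n", "\tblue hat"]))  # ['red shoes', 'blue hat']
```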

terms_to_columns(ngram_tuples)¶

Returns term/value tuples as two lists.

Parameters

ngram_tuples (list) – List of tuple[tuple[str, …], int]. Results from get_ngrams.

Returns

term_col (list of str): Terms, concatenated into a single string for multi-word terms, returned as a list.

value_col (list of int): Term frequencies as a list.

The two lists are returned together as a tuple.

Return type

tuple[list of str, list of int]
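
The unpacking described above can be sketched as follows. `terms_to_columns_sketch` is a hypothetical re-implementation for illustration, and joining multi-word terms with a single space is an assumption about the separator.

```python
from typing import List, Tuple


def terms_to_columns_sketch(
    ngram_tuples: List[Tuple[Tuple[str, ...], int]]
) -> Tuple[List[str], List[int]]:
    """Hypothetical re-implementation: split (terms, count) pairs into two lists."""
    # Assumption: multi-word terms are joined with a single space.
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


terms, values = terms_to_columns_sketch([(("red", "shoes"), 2), (("hat",), 1)])
print(terms)   # ['red shoes', 'hat']
print(values)  # [2, 1]
```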