Reference¶
excel_ngrams.console¶
Command-line interface.
excel_ngrams.grammer¶
Client to get ngram values from Excel document.
-
class
excel_ngrams.grammer.FileHandler(file_path, sheet_name=0, column_name='Keyword')¶ Class to handle reading, data extraction, and writing to files.
-
file_path¶ The path to Excel file to be read.
- Type
str
-
sheet_name¶ The name or number of the sheet to read from.
- Type
int or str
-
column_name¶ The name of the column to be read from. Defaults to ‘Keyword’.
- Type
str
-
term_list¶ A list of terms (read from the Excel column).
- Type
list
-
get_destination_path()¶ Creates path to write output csv file to.
Uses the path of the input Excel file to create an output path that mimics the input file name but is appended with the datetime and n-grams.
- Returns
Path to write output file to.
- Return type
str
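The exact naming scheme is not given here, so the following is only a minimal sketch of how such a path could be derived. The function name `destination_path`, the timestamp format, and the `_n-grams` suffix are all assumptions for illustration, not the library's actual behaviour.

```python
from datetime import datetime
from pathlib import Path


def destination_path(input_path: str, now: datetime) -> str:
    """Build a csv path beside the input file (hypothetical naming scheme)."""
    source = Path(input_path)
    # Assumed timestamp format; the real format is not documented here.
    stamp = now.strftime("%Y%m%d%H%M%S")
    # Keep the input file's directory and stem, swap the extension for .csv.
    return str(source.with_name(f"{source.stem}_{stamp}_n-grams.csv"))


print(destination_path("reports/keywords.xlsx", datetime(2021, 3, 4, 15, 30, 0)))
```

Passing the timestamp in as an argument keeps the sketch deterministic; a real implementation would more likely call `datetime.now()` internally.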
-
get_file_path()¶ Getter method; returns the Excel document's file path.
- Return type
str
-
get_terms()¶ Getter method; returns term_list.
- Return type
List[str]
-
set_terms(file_path, sheet_name, column_name)¶ Sets term_list attribute from Excel doc.
Uses Pandas DataFrame as an intermediate to generate list.
- Parameters
file_path (str) – The path to Excel file to read terms from.
sheet_name (int or str) – The name or number of the sheet containing terms. Defaults to 0 (first sheet when sheets are unnamed).
column_name (str) – The name of the column header containing terms. Defaults to Keyword.
- Returns
Terms from Excel as a Python list.
- Return type
list
-
write_df_to_file(df)¶ Writes DataFrame to csv file.
Gets path from get_destination_path method and uses Pandas to_csv function to write DataFrame to csv file.
- Parameters
df (pd.DataFrame) – Dataframe of terms and values columns for ngrams.
- Returns
Path to which csv file was written.
- Return type
str
-
-
class
excel_ngrams.grammer.Grammer(file_handler)¶ Class to get n-grams from list of terms.
Uses Spacy’s NLP pipe and NLTK’s ngrams function to generate ngrams within a given range and outputs them to a Pandas DataFrame for writing to an output file.
-
file_handler¶ FileHandler obj with input file path.
- Type
FileHandler
-
term_list¶ Term list from FileHandler attribute.
_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.
-
combine_dataframes(dataframes)¶ Creates single multi-column dataframe.
Takes the terms and frequency values from dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.
- Parameters
dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.
- Returns
Single combined dataframe from list of dataframes.
- Return type
pd.DataFrame
-
df_from_terms(ngram_tuples)¶ Creates DataFrame from lists of terms and values as tuple.
Calls terms_to_columns on ngram_tuples to unpack them.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.
- Return type
pd.DataFrame
-
get_ngrams(n, top_n_results=250, stopwords=True)¶ Create tuple with terms and frequency from list.
List of terms is tokenised using Spacy’s NLP pipe, set to lowercase and ngrams are calculated with NLTK’s ngrams function.
- Parameters
n (int) – The length of phrases to analyse.
top_n_results (int) – The number of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
List of tuples containing term(s) and values.
- Return type
list
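As a rough illustration of the counting step, here is a standard-library sketch that does not use Spacy or NLTK. Whitespace splitting stands in for Spacy tokenisation, a `zip` over staggered slices stands in for NLTK’s ngrams function, and the tiny `STOPWORDS` set is made up. The function name `count_ngrams` and the choice to build ngrams per term (rather than across the whole list) are assumptions.

```python
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword set; a real run would use a full stopword list.
STOPWORDS = {"a", "for", "in", "of", "the", "to"}


def count_ngrams(
    terms: List[str], n: int, top_n_results: int = 250, stopwords: bool = True
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the most frequent n-grams as ((word, ...), frequency) tuples."""
    counts: Counter = Counter()
    for term in terms:
        # Lowercase, then split on whitespace (stand-in for a tokeniser).
        tokens = term.lower().split()
        if stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        # Zipping n staggered slices yields each run of n consecutive tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n_results)


terms = ["running shoes for men", "running shoes", "Trail Running Shoes"]
print(count_ngrams(terms, n=2, top_n_results=1))
```

With these three terms the bigram `("running", "shoes")` occurs once in each term, so it is the top result with a frequency of 3.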
-
in_stop_words(spacy_token_text)¶ Check if word appears in stopword set.
- Parameters
spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.
- Returns
Whether text is present in stopwords.
- Return type
bool
-
ngram_range(max_n, n=1, top_n_results=250, stopwords=True)¶ Gets ngram terms and outputs for a range of phrase lengths.
Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.
- Parameters
max_n (int) – The longest phrase length desired in output.
n (int) – The minimum term length. Default is 1 (single term).
top_n_results (int) – The number of rows of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
Combined dataframe of all results from various term lengths to desired maximum.
- Return type
pd.DataFrame
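The loop over phrase lengths and the side-by-side layout can be sketched without Pandas. In this hedged example, plain lists stand in for DataFrame columns, and `top_ngrams`, `ngram_table`, and the column headers are hypothetical names. It also assumes that max_n is inclusive and omits stopword handling for brevity.

```python
from collections import Counter
from itertools import zip_longest


def top_ngrams(terms, n, top_n_results):
    """Count n-grams per term using whitespace tokens (stand-in tokeniser)."""
    counts = Counter()
    for term in terms:
        tokens = term.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n_results)


def ngram_table(terms, max_n, n=1, top_n_results=250):
    """Return (header, rows) with one term/frequency column pair per length."""
    header, columns = [], []
    for length in range(n, max_n + 1):  # assumes max_n is inclusive
        results = top_ngrams(terms, length, top_n_results)
        header += [f"{length}-gram", f"{length}-gram frequency"]
        columns.append([" ".join(words) for words, _ in results])
        columns.append([count for _, count in results])
    # Pad shorter columns with empty strings so every row has the same width.
    rows = [list(row) for row in zip_longest(*columns, fillvalue="")]
    return header, rows


header, rows = ngram_table(["red shoes", "red hat"], max_n=2)
print(header)
for row in rows:
    print(row)
```

Longer phrase lengths usually produce fewer results than shorter ones, which is why the shorter columns need padding before they can sit side by side.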
-
output_csv_file(df)¶ Write dataframe to csv file.
- Parameters
df (pd.DataFrame) – Dataframe with columns of terms and values.
- Returns
The path the csv file was written to.
- Return type
str
- Raises
ClickException – Writing to csv file failed.
-
remove_escaped_chars(text)¶ Remove newline and tab chars from string list.
- Parameters
text (List[str]) – Terms list to be cleaned of specific chars.
- Returns
Terms list without specific chars.
- Return type
List[str]
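A minimal sketch of this kind of cleaning, assuming newline and tab characters are replaced with a single space so adjacent words stay separated. The reference does not say whether the library replaces or deletes them, and `strip_newlines_and_tabs` is a hypothetical name.

```python
from typing import List


def strip_newlines_and_tabs(text: List[str]) -> List[str]:
    """Replace newline and tab characters in each term with a space."""
    cleaned = []
    for term in text:
        # Swap each escaped char for a space, then trim leading/trailing spaces.
        cleaned.append(term.replace("\n", " ").replace("\t", " ").strip())
    return cleaned


print(strip_newlines_and_tabs(["red\tshoes\n", "blue\nhat"]))
```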
-
terms_to_columns(ngram_tuples)¶ Returns term/value tuples as two lists.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
term_col (list of str): Terms, concatenated into a single string for multi-word terms, returned as list.
value_col (list of int): Term frequencies as list. Lists are returned together as a tuple containing both lists.
- Return type
tuple
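The unpacking described above can be sketched in a few lines. The name `split_columns` is hypothetical, and joining multi-word terms with a single space is an assumption about the separator.

```python
from typing import List, Tuple


def split_columns(
    ngram_tuples: List[Tuple[Tuple[str, ...], int]]
) -> Tuple[List[str], List[int]]:
    """Split ((word, ...), frequency) tuples into a term list and a value list."""
    # Join the words of each n-gram into one string (assumed separator: space).
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


term_col, value_col = split_columns([(("running", "shoes"), 3), (("trail",), 1)])
print(term_col)
print(value_col)
```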
-