Reference¶
excel_ngrams.console¶
Command-line interface.
excel_ngrams.grammer¶
Return dataframe of ngrams from list of words.
-
class excel_ngrams.grammer.Grammer(terms_list)¶
Class that returns n-grams from text as a list of strings.
Words are delineated by white space and punctuation. The class uses spaCy’s NLP pipe and NLTK’s ngrams function to generate n-grams within a given word-length range, and outputs them to a Pandas DataFrame for writing to an output file.
-
term_list¶
List of text as strings (one or more).
_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.
-
combine_dataframes(dataframes)¶
Creates a single multi-column dataframe.
Takes the terms and frequency values from dataframes constructed from n-grams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.
- Parameters
dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.
- Returns
Single combined dataframe from list of dataframes.
- Return type
pd.DataFrame
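The side-by-side merge described above could be sketched with `pd.concat` along the column axis. This is a minimal illustration, not necessarily how `combine_dataframes` is implemented; the column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical per-length results: one frame for single terms, one for bigrams.
unigrams = pd.DataFrame({"1 word": ["car", "red"], "1 count": [3, 2]})
bigrams = pd.DataFrame({"2 words": ["red car"], "2 count": [2]})

# axis=1 places the frames side by side; shorter frames are padded with NaN.
combined = pd.concat([unigrams, bigrams], axis=1)
print(combined)
```

Because frames for longer phrases usually hold fewer rows, the padded cells appear as missing values in the combined output.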
-
df_from_terms(ngram_tuples)¶
Creates a DataFrame from a list of term and value tuples.
Calls terms_to_columns on ngram_tuples to unpack them.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.
- Return type
pd.DataFrame
-
get_ngrams(n, top_n_results=250, stopwords=True)¶
Create tuple with terms and frequency from list.
List of terms is tokenised using Spacy’s NLP pipe, set to lowercase and ngrams are calculated with NLTK’s ngrams function.
- Parameters
n (int) – The length of phrases to analyse.
top_n_results (int) – The number of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
List of tuples containing term(s) and values.
- Return type
list of tuple[tuple[str], int]
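The counting step described above can be sketched with the standard library alone. This is an assumption-laden stand-in, not the class's code: whitespace splitting replaces spaCy's tokeniser, a `zip`-based sliding window replaces NLTK's ngrams function, and `STOPWORDS` and `count_ngrams` are hypothetical names.

```python
from collections import Counter

# Hypothetical stand-in for the stop-word set used by the real class.
STOPWORDS = {"the", "a", "of", "and", "in"}


def count_ngrams(terms, n, top_n_results=250, stopwords=True):
    """Return the most common n-grams as (tuple_of_words, count) pairs."""
    counts = Counter()
    for text in terms:
        # Naive whitespace tokenisation, lowercased.
        tokens = [t.lower() for t in text.split()]
        if stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        # Sliding window of length n over the token list.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n_results)


result = count_ngrams(["The red car", "A red car in the rain"], n=2)
print(result)
```

Each result pairs a tuple of words with its frequency, which matches the shape that df_from_terms and terms_to_columns expect as input.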
-
in_stop_words(spacy_token_text)¶
Check if word appears in stopword set.
- Parameters
spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.
- Returns
Whether text is present in stopwords.
- Return type
bool
-
ngram_range(max_n, n=1, top_n_results=250, stopwords=True)¶
Gets ngram terms and outputs for a range of phrase lengths.
Gets ngrams from single terms as default up to desired maximum phrase length and creates Pandas DataFrame from results.
- Parameters
max_n (int) – The longest phrase length desired in output.
n (int) – The minimum term length. Default is 1 (single term).
top_n_results (int) – The number of rows of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
Combined dataframe of all results from various term lengths to desired maximum.
- Return type
pd.DataFrame
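As a rough illustration of iterating over phrase lengths, the sketch below collects results for every length from `n` to `max_n`. It assumes the range includes `max_n`, works on a pre-tokenised list, and returns a plain dict rather than a DataFrame; both function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n, top_n_results=250):
    """Most common n-grams of length n as (tuple_of_words, count) pairs."""
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows).most_common(top_n_results)


def ngram_counts_for_range(tokens, max_n, n=1, top_n_results=250):
    # Assumption: the range is inclusive of max_n.
    return {k: ngram_counts(tokens, k, top_n_results) for k in range(n, max_n + 1)}


by_length = ngram_counts_for_range(["red", "car", "red", "car"], max_n=2)
print(by_length)
```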
-
remove_escaped_chars(text)¶
Remove newline and tab chars from string list.
- Parameters
text (List of str) – Terms list to be cleaned of specific chars.
- Returns
Terms list without specific chars.
- Return type
List of str
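A cleaning step of this kind might look like the following. It assumes the characters are literal newlines and tabs and that each is replaced with a space so adjacent words do not fuse; the actual method may simply delete them. The function name is hypothetical.

```python
def strip_newlines_and_tabs(texts):
    """Return a copy of texts with newline and tab characters replaced."""
    # Assumption: replace with a space rather than delete outright.
    table = str.maketrans({"\n": " ", "\t": " "})
    return [t.translate(table) for t in texts]


cleaned = strip_newlines_and_tabs(["red\ncar", "fast\tboat"])
print(cleaned)
```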
-
terms_to_columns(ngram_tuples)¶
Returns term/value tuples as two lists.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
term_col (list of str): Terms, concatenated into a single string for multi-word terms, returned as list.
value_col (list of int): Term frequencies as list.
Lists are returned together as a tuple containing both lists.
- Return type
tuple of (list of str, list of int)
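The unpacking described here can be sketched in a few lines. Joining multi-word terms with a single space is an assumption, and `split_terms_and_counts` is a hypothetical name rather than the library's own.

```python
def split_terms_and_counts(ngram_tuples):
    """Unpack (words, count) pairs into a list of terms and a list of counts."""
    # Assumption: multi-word terms are joined with a single space.
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


terms, values = split_terms_and_counts([(("red", "car"), 2), (("car", "rain"), 1)])
print(terms, values)
```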
-