Prediction using Markov Chain¶

This module contains the code necessary to make a recommendation based on a Markov Chain.

Functions¶

get_latest_file_with_path(path, *paths):

Method to get the full path name for the latest file for the input parameter in paths. This method uses the os.path.getctime function to get the most recently created file that matches the filename pattern in the provided path.

Parameters:

pathstring
Root pathname for the files.

*pathsstring list
These are the var args field, the optional set of strings to denote the full path to the file names.

Returns:

latest_filestring
Full path name for the latest file provided in the paths parameter.

Class¶

MCpredict¶

Functions¶

get_inner_list(self, in_list):

Backtracking code to recursively obtain the item name from the hierachial output list.

Parameters:

in_listlist/ tuple
Either a list object or tuple whose data is retreived.

Returns:

list
Condensed hierarchial version of the list without probabilities.

pretty_output(self, output_list):

Get the item list without the probabilities.

Parameters:

output_listlist
Output List after complete processing..

Returns:

out_dictdict
Ordered Dict with level as the key and value as the condensed list for each level.

Example:

input: [[(-1.0286697494934511, ‘Wood’)], [(-1.8312012012793524, ‘Trsgi’)],
[[(-2.5411555001556785, ‘NA’), (-6.618692944061398, ‘Wood’), (-6.618692944061398, ‘MXD’), (-6.618692944061398, ‘LakeSediment’), (-6.618692944061398, ‘Composite’)]]]

output: {‘0’: [‘Wood’], ‘1’: [‘Trsgi’], ‘2’: [‘NA’, ‘Wood’, ‘MXD’, ‘LakeSediment’, ‘Composite’]}

get_max_prob(self, temp_names_set, trans_dict_for_word, prob):

Find the maximimum items from a list stream using heapq. We will only pick those items that belong to the category we are interested in. Example : only recommend values in Units for Units.

Parameters:

temp_names_setset
Set containing the items in the category.

trans_dict_for_worddict
Transition probability dict for the start word.

probfloat
The probability of the start word.

Returns:

list
Contains the top 5 recommendation for the start word.

back_track(self, data, name_list_ind, sentence = None):

Function to get top 5 items for each item in sequence

Parameters:

datalist/str
Input sequence.

name_list_ind: int
Index for names_list dict. Used to predict only proxyObservationType after Archive, and not give recommendations from other category.

Returns:

list
Output list for the input sequence.

get_ini_prob(self, sentence):

Method to find the transition probability for the given sentence. For the first word we use the initial probability and for the rest of the sentence we use the transition probability for getting the next word.

Parameters:

sentencestr
Input string sequence for which we have to predict the next sequence.

Returns:

output_listlist
Output list containing the probability and word for each stage of the sequence.

sentencelist
Sentence strip and split on space and returned for further use.

predict_seq(self, sentence, isInferred = False):

Predict the top 5 elements at each stage for every item in the chain There are 2 chain types:

archive -> proxyObservationType -> units,

archive -> proxyObservationType -> interpretation/variable, interpretation/variableDetail ->inferredVariable -> inferredVarUnits

We do not include inferredVariableType and inferredVarUnits in the sequential prediction, but provide the recommendation after the interpretation/variableDetail has been selected.

If isInferred == True, then we will choose the top value in prediction for the chain given the archiveType example:

archiveType = MarineSediment

proxy = D180

interpretation/variable = NA

interpretation/variableDetail = NA

then based on this generate the top 5 predictions for inferredVariable

Parameters:

sentencestr
Input sequence.

Returns:

output_listdict
Dict in hierarchial fashion containing top 5 predictions for value at each level.

Example:

input: ‘Wood’ intermediate output:

[[(-1.0286697494934511, ‘Wood’)], [[(-2.8598709507728035, ‘Trsgi’), (-3.519116579657067, ‘ARS’), (-3.588109451144019, ‘EPS’), (-3.701438136451022, ‘SD’), (-3.701438136451022, ‘Core’)]], [[ [(-3.5698252496491296, ‘NA’), (-7.647362693554849, ‘Wood’), (-7.647362693554849, ‘MXD’), (-7.647362693554849, ‘LakeSediment’), (-7.647362693554849, ‘Composite’)], [(-4.628778704511761, ‘NA’), (-8.029976086173917, ‘Wood’), (-8.029976086173917, ‘MXD’), (-8.029976086173917, ‘LakeSediment’), (-8.029976086173917, ‘Composite’)], [(-4.744541310700955, ‘NA’), (-8.076745820876159, ‘Wood’), (-8.076745820876159, ‘MXD’), (-8.076745820876159, ‘LakeSediment’), (-8.076745820876159, ‘Composite’)], [(-4.936909607836329, ‘NA’), (-8.15578543270453, ‘Wood’), (-8.15578543270453, ‘MXD’), (-8.15578543270453, ‘LakeSediment’), (-8.15578543270453, ‘Composite’)], [(-4.971198681314961, ‘NA’), (-6.803780145063271, ‘NotApplicable’), (-8.190074506183162, ‘Wood’), (-8.190074506183162, ‘MXD’), (-8.190074506183162, ‘Composite’)] ]]]

final output: {‘0’: [‘Trsgi’, ‘ARS’, ‘EPS’, ‘SD’, ‘Core’]}

Usage¶

MCpredict.py module is used for accuracy calculation in the /accuracy_calc/markovchain directory. For more information check out the documentation for Accuracy Calculation.

Accuracy Calculation for Markov Chains