Model re-training for dynamic graphs
Abstract
In machine learning, a critical assumption is that the training and test datasets follow similar distributions. A model is effective when new test data resembles the data on which it was trained; if the two differ substantially, the algorithm's predictions become inaccurate. In many applications the data is dynamic, changing over time, and as its distribution drifts the model must eventually be retrained. In this research I study the dynamic behavior of graph data: as the data changes, nodes and edges are added to or deleted from the graph. Because we deal with large graph datasets, we train and test on embedding vector spaces. The embedding space differs at each timestamp, and retraining the model every time the data changes is expensive. To address these challenges, we use the dfs_dynode2vec algorithm, in which the embedding vectors for the current timestamp are initialized from the previous timestamp's embeddings. At each timestamp the data may change significantly or insignificantly. We propose a statistical model, 'Significant testing', that determines whether the model should be retrained. If the change is insignificant, the model is not retrained and no embedding vectors are generated for that timestamp. We consider several aspects in determining the statistical significance of a change, including edge centrality, betweenness centrality, and norm calculations.
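As a rough illustration of one of the aspects mentioned above (norm calculations), the sketch below implements a hypothetical norm-based change detector: it compares consecutive embedding snapshots and flags retraining only when the relative change exceeds a cutoff. The function name `should_retrain` and the `threshold` value are assumptions for illustration, not the thesis's actual test, which also incorporates centrality measures.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of row vectors."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def should_retrain(prev_emb, curr_emb, threshold=0.1):
    """Decide whether the change between two embedding snapshots is
    large enough to warrant retraining.

    prev_emb / curr_emb: embedding matrices (one vector per node),
    restricted to nodes present in both timestamps.
    threshold: hypothetical relative-change cutoff.
    """
    diff = [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev_emb, curr_emb)]
    rel_change = frobenius_norm(diff) / max(frobenius_norm(prev_emb), 1e-12)
    return rel_change > threshold

# A small perturbation of the embeddings should not trigger retraining.
prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[1.01, 0.0], [0.0, 0.99]]
print(should_retrain(prev, curr))  # → False
```

In a full pipeline this check would run once per timestamp: when it returns False, the previous embeddings are reused; when it returns True, dfs_dynode2vec is re-run, warm-started from the previous vectors.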
Collections
- OSU Theses [15752]