Spatio-temporal VLAD encoding for human action recognition in videos