Micro-expression recognition (MER) has recently attracted considerable attention because of its applications in fields such as criminal trials and psychotherapy. However, the short duration and subtle facial muscle movements of micro-expressions make their features difficult to extract. In this article, we propose a Dual Flow Fusion Convolutional Network (DFFCN) that combines learning flow and optical flow to capture spatiotemporal features. Specifically, a trainable Learning Flow Module extracts frame-level motion characteristics, which are fused with a mask generated from hand-crafted optical flow and then used to predict the micro-expression. Additionally, to mitigate the scarcity and class imbalance of training samples, we propose a data augmentation strategy based on a Generative Adversarial Network (GAN). Comprehensive experiments are conducted on three public micro-expression datasets, CASME II, SAMM, and SMIC, using Leave-One-Subject-Out (LOSO) cross-validation. The results demonstrate that our method achieves competitive performance compared with existing approaches, with a best unweighted F1-score (UF1) of 0.8452 and a best unweighted average recall (UAR) of 0.8465.
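For readers unfamiliar with the evaluation protocol named above, the following minimal sketch illustrates how UF1 and UAR are commonly computed: as the unweighted (macro) averages of per-class F1-score and per-class recall. It assumes that predictions from all LOSO folds are pooled before the metrics are computed, which may differ from the protocol actually used here. The function names (`loso_predictions`, `uf1_uar`) and the `fit_predict` callable are hypothetical and serve only as illustration.

```python
def loso_predictions(subjects, labels, fit_predict):
    """Pool predictions over Leave-One-Subject-Out folds.

    fit_predict(train_idx, test_idx) is a hypothetical callable that trains
    on the samples at train_idx and returns one predicted label per test_idx.
    """
    y_true, y_pred = [], []
    for held_out in sorted(set(subjects)):
        # All samples of one subject form the test fold; the rest are training data.
        train_idx = [i for i, s in enumerate(subjects) if s != held_out]
        test_idx = [i for i, s in enumerate(subjects) if s == held_out]
        y_true.extend(labels[i] for i in test_idx)
        y_pred.extend(fit_predict(train_idx, test_idx))
    return y_true, y_pred


def uf1_uar(y_true, y_pred):
    """Return (UF1, UAR): macro-averaged F1-score and recall over classes."""
    classes = sorted(set(y_true))
    f1_scores, recalls = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Each class appears in y_true, so tp + fn > 0 and both ratios are defined.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
        recalls.append(tp / (tp + fn))
    # Every class contributes equally, regardless of its sample count.
    return sum(f1_scores) / len(classes), sum(recalls) / len(classes)


if __name__ == "__main__":
    uf1, uar = uf1_uar([0, 0, 0, 0, 1, 2], [0, 0, 0, 1, 1, 2])
    print(f"UF1={uf1:.4f}, UAR={uar:.4f}")
```

Because every class is weighted equally, both metrics penalize a model that ignores minority classes, which is why they are preferred over plain accuracy on imbalanced micro-expression datasets.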