Random Forests lithology prediction method for imbalanced data sets
WANG Guangyu1, SONG Jianguo1, XU Fei1, ZHANG Wen2, LIU Jiong3, CHEN Feixu4
1. School of Geosciences, China University of Petroleum (East China), Qingdao, Shandong 266580, China; 2. School of Earth and Space Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China; 3. SINOPEC Petroleum Exploration and Production Research Institute, Beijing 100083, China; 4. Research Institute of Petroleum Exploration and Development, PetroChina Tarim Oilfield Company, Korla, Xinjiang 841000, China
Abstract: In lithology prediction based on a supervised machine learning classifier, a data set with too few samples of the target lithology and too many samples of non-target lithology biases the trained classifier toward the non-target lithology, so the target lithology is predicted poorly. To address this problem, a Random Forests lithology prediction method for imbalanced data sets is proposed. First, a lithology data set is constructed, with lithological logging data as sample labels and the seismic attributes and rock elastic parameters at the uphole trace as sample features. Second, the NM-SMOTE algorithm, which combines near miss (NM) under-sampling with the synthetic minority over-sampling technique (SMOTE), is used to balance the lithology data set. Third, a Random Forests classifier is trained on the balanced data set to learn the nonlinear relationship between lithology and the seismic attributes and elastic parameters. Finally, the seismic attributes and elastic parameters of the target exploration area are input into the trained Random Forests classifier, which predicts lithology from the relationship learned during training. Tests on field data show that an excess of non-target lithology samples degrades the Random Forests classifier: the prediction accuracy of lithology is only 38%. After the training data set is balanced with the NM-SMOTE algorithm, the prediction accuracy rises to 83%, and the resulting lithology data volume is more consistent with the seismic data.
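The balancing step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how NM and SMOTE are combined, so the sketch assumes the simplest arrangement, in which NearMiss-1 shrinks the majority (non-target) class and SMOTE grows the minority (target) class until both reach a common size. The function names, the neighbour count `k`, the common size `n_target`, and the toy two-feature data are all hypothetical.

```python
import math
import random


def nearest(point, pool, k):
    """Return the k points in pool closest to point (Euclidean distance)."""
    return sorted(pool, key=lambda q: math.dist(point, q))[:k]


def near_miss(majority, minority, n_keep, k=3):
    """NearMiss-1: keep the n_keep majority samples whose mean distance
    to their k nearest minority samples is smallest."""
    def score(p):
        near = nearest(p, minority, k)
        return sum(math.dist(p, q) for q in near) / len(near)
    return sorted(majority, key=score)[:n_keep]


def smote(minority, n_new, k=3, seed=0):
    """SMOTE: build n_new synthetic samples, each on the segment between a
    minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [q for q in minority if q is not base]
        mate = rng.choice(nearest(base, others, k))
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


def nm_smote(majority, minority, n_target, k=3, seed=0):
    """Assumed combination: under-sample the majority class and
    over-sample the minority class to the same size n_target."""
    kept = near_miss(majority, minority, min(n_target, len(majority)), k)
    extra = smote(minority, max(0, n_target - len(minority)), k, seed)
    return kept, list(minority) + extra


# Toy data: 60 non-target samples and 6 target samples with two features each.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(6)]

kept, grown = nm_smote(majority, minority, n_target=20)
print(len(kept), len(grown))  # prints: 20 20
```

In the workflow described above, the two balanced lists would then be labelled and passed to a Random Forests classifier for training. In practice, a library implementation of NearMiss, SMOTE, and Random Forests would likely replace this sketch.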