Computational Intelligence to aid Text File Format Identification

Kuppili Venkata, Santhilata and Green, Alex. Computational Intelligence to aid Text File Format Identification, 2019 (Unpublished) [Preprint]

English abstract

One challenge in digital preservation is identifying the type of a file that can be opened in a simple text editor but whose extension is unknown. The problem is harder when the file is human-readable yet gives no indication of how it should be used. The Text File Format Identification (TFFI) project was initiated at The National Archives to identify file types from plain-text file contents using computational intelligence models. A methodology that uses AI and machine learning to automate the process was successfully tested and implemented on the test data. The prototype, developed as a proof of concept, achieved up to 98.58% accuracy in detecting five file formats.

Item type: Preprint
Keywords: File format identification, Digital Preservation
Subjects: J. Technical services in libraries, archives, museum.
J. Technical services in libraries, archives, museum. > JH. Digital preservation.
Depositing user: Dr Santhilata Kuppili Venkata
Date deposited: 17 Sep 2019 08:31
Last modified: 17 Sep 2019 08:31
URI: http://hdl.handle.net/10760/38969

References

[1] DROID. http://droid.sourceforge.net/, 2013.

[2] TrID. http://mark0.net/soft-trid-e.html.

[3] Nasser S. Alamri and William H. Allen. A taxonomy of file-type identification techniques. In Proceedings of the 2014 ACM Southeast Regional Conference, ACM SE ’14, page 49:1–49:4, New York, NY, USA, 2014. ACM.

[4] William C. Calhoun and Drue Coles. Predicting the types of file fragments. Digit. Investig., 5:S14–S20, September 2008.

[5] Irfan Ahmed, Kyung suk Lhee, Hyunjung Shin, and ManPyo Hong. Content-based file-type identification using cosine similarity and a divide-and-conquer approach. IETE Technical Review, 27(6):465, 2010.

[6] Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, and Man-Pyo Hong. Fast content-based file type identification. In Advances in Digital Forensics VII, page 65–75. Springer Berlin Heidelberg, 2011.

[7] Rainer Poisel and Simon Tjoa. A comprehensive literature review of file carving. In 2013 International Conference on Availability, Reliability and Security. IEEE, sep 2013.

[8] Rainer Poisel, Marlies Rybnicek, and Simon Tjoa. Taxonomy of data fragment classification techniques. In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, page 67–85. Springer International Publishing, 2014.

[9] John Daniel Evensen, Sindre Lindahl, and Morten Goodwin. File-type detection using naive bayes and n-gram analysis. In 2014: NISK 2014, 2014.

[10] Siddharth Gopal, Yiming Yang, Konstantin Salomatin, and Jaime Carbonell. Statistical learning for file-type identification. In 2011 10th International Conference on Machine Learning and Applications and Workshops. IEEE, dec 2011.

[11] Erich Feodor Wilgenbus. The file fragment classification problem : a combined neural network and linear programming discriminant model approach. Master’s thesis, N, 2013.

[12] Konstantinos Karampidis, Ergina Kavallieratou, and George Papadourakis. Comparison of classification algorithms for file type detection a digital forensics perspective. Polibits, 56:15–20, 2017.

[13] Konstantinos Karampidis and Giorgos Papadourakis. File type identification - computational intelligence for digital forensics. The Journal of Digital Forensics, Security and Law, 2017.

[14] Mason McDaniel and M.Hossain Heydari. Content based file type detection algorithms. In 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the. IEEE, 2003.

[15] M. McDaniel. Automatic file type detection algorithm. Master’s thesis, 2001.

[16] W. J. Li, S. J. Stolfo, and B. Herzog. Fileprints: identifying file types by n-gram analysis. In Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop, page 64–71, June 2005.

[17] J. G. Dunham and J. C. R. Tseng. Classifying file type of stream ciphers in depth using neural networks. In The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005., page 97–, Jan 2005.

[18] M. Karresand and N. Shahmehri. File type identification of data fragments by their binary structure. In 2006 IEEE Information Assurance Workshop, page 140–147, June 2006.

[19] L. Zhang and G. B. White. An approach to detect executable content for anomaly based network intrusion detection. In 2007 IEEE International Parallel and Distributed Processing Symposium, page 1–8, March 2007.

[20] Mehdi Chehel Amirani, Mohsen Toorani, and Sara Mihandoost. Feature-based type identification of file fragments. Security and Communication Networks, 6(1):115–128, apr 2012.

[21] J. Mitlöhner, S. Neumaier, J. Umbrich, and A. Polleres. Characteristics of open data csv files. In 2016 2nd International Conference on Open and Big Data (OBD), page 72–79, Aug 2016.

[22] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, us ed edition, May 2005.

[23] H. Ramchoun, M. A. Janati Idrissi, Y. Ghanou, and M. Ettaouil. Multilayer perceptron: Architecture optimization and training with mixed activation functions. In Proceedings of the 2nd International Conference on Big Data, Cloud and Applications, BDCA’17, page 71:1–71:6, New York, NY, USA, 2017. ACM.

[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.

