Image File Extensions in the Wide Web Scrape

Percentage share of the three leading file formats in various TLDs. For example, in China (.cn), JPGs are almost 90% of image extensions; in .gov they are about 50%.

A bit of an aside post. I’ve been extracting images from WARC files so that I can try some computer vision techniques on them. To do that, I needed to know which file extensions to look for. I began with the usual suspects (initially JP*Gs), and after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History,” I came up with a good list of extensions to extract.

The data here come from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)' (where TLD represents the top-level domain I am searching for, e.g. CA or COM), helped find the extensions that I was looking for. A variation of this script extracted images and numbered them, preserving duplicates and identically-named files.
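As a rough sketch of how an expression like this could be applied, the Python below compiles it for one TLD and tallies the matched extensions over a list of URLs. The function names, the case-insensitive matching, and the zero-padded naming scheme in `numbered_name` are assumptions made for illustration; they are not taken from the script used for this post.

```python
import re
from collections import Counter

# Extensions from the regular expression in the post.
EXTENSIONS = ["gif", "jpg", "tif", "jpeg", "tiff", "png", "jp2", "j2k",
              "bmp", "pict", "wmf", "emf", "ico", "xbm"]


def build_pattern(tld):
    """Compile the expression for one top-level domain, e.g. "ca"."""
    # Python tries alternatives left to right, so longer extensions go
    # first; otherwise ".tiff" would be reported as ".tif".
    ordered = sorted(EXTENSIONS, key=len, reverse=True)
    alternatives = "|".join(r"\." + ext for ext in ordered)
    # Assumption: case-insensitive matching, so ".JPG" counts as jpg.
    return re.compile(
        r".+\." + re.escape(tld) + r"/.+(" + alternatives + ")",
        re.IGNORECASE,
    )


def count_extensions(urls, tld):
    """Tally the matched image extension for each URL under the TLD."""
    pattern = build_pattern(tld)
    counts = Counter()
    for url in urls:
        match = pattern.match(url)
        if match:
            counts[match.group(1).lstrip(".").lower()] += 1
    return counts


def numbered_name(index, url):
    """Hypothetical naming scheme: prefix a running number so that
    identically named files from different pages do not collide."""
    basename = url.rsplit("/", 1)[-1]
    return f"{index:06d}-{basename}"


sample_urls = [
    "http://example.ca/img/a.jpg",
    "http://example.ca/b.JPG",
    "http://example.ca/c.tiff",
    "http://example.com/d.png",
    "http://example.ca/index.html",
]
print(count_extensions(sample_urls, "ca"))
```

Note that the pattern is not anchored at the end of the URL, so an address with a query string after the extension still matches.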

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.

The findings were interesting in terms of which top-level domain contained what. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step, of course, is to analyze these images.

Table One: Absolute Appearances of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape.

Top-Level Domain      ca       cn      com     fr    gov   mil     net     org
jpg                79265   958388  1837836  21947   8150  1831  188309  131848
gif                23227   125892   538415   4300   7040   307   57407   47197
png                15491    15321   196377   3182    815    39   13618   17811
ico                  781     1971    26421    343     68     1    2030    1425
jpeg                 223      742    13684    143    122     0    1151     961
bmp                   51        0     1982     15     14     0     143     161
wmf                   17       25       39      0      0     0       0       3
tif                   14       12       72      0      4     0       1      26
emf                    1        0        1      0      0     0       0       0
tiff                   0        0       16      0      1     0       0       3
xbm                    0        0        0      0      0     0       0       2
total             119070  1102351  2614843  29930  16214  2178  262659  199437

Table Two: Relative Frequency of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape

Top-Level Domain      ca      cn     com      fr     gov     mil     net     org
jpg               66.57%  86.94%  70.28%  73.33%  50.27%  84.07%  71.69%  66.11%
gif               19.51%  11.42%  20.59%  14.37%  43.42%  14.10%  21.86%  23.67%
png               13.01%   1.39%   7.51%  10.63%   5.03%   1.79%   5.18%   8.93%
ico                0.66%   0.18%   1.01%   1.15%   0.42%   0.05%   0.77%   0.71%
jpeg               0.19%   0.07%   0.52%   0.48%   0.75%   0.00%   0.44%   0.48%
bmp                0.04%   0.00%   0.08%   0.05%   0.09%   0.00%   0.05%   0.08%
wmf                0.01%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
tif                0.01%   0.00%   0.00%   0.00%   0.02%   0.00%   0.00%   0.01%
emf                0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
tiff               0.00%   0.00%   0.00%   0.00%   0.01%   0.00%   0.00%   0.00%
xbm                0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
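
The percentages in Table Two appear to be each count in Table One divided by its column total. A minimal sketch of that arithmetic for the .gov column follows; the function name and the rounding to two decimals are assumptions chosen to match the table's presentation.

```python
# Counts for the .gov column, copied from Table One.
GOV_COUNTS = {
    "jpg": 8150, "gif": 7040, "png": 815, "ico": 68, "jpeg": 122, "bmp": 14,
    "wmf": 0, "tif": 4, "emf": 0, "tiff": 1, "xbm": 0,
}


def relative_frequency(counts):
    """Return each format's share of the column total, as a percentage."""
    total = sum(counts.values())
    return {ext: round(100 * n / total, 2) for ext, n in counts.items()}


shares = relative_frequency(GOV_COUNTS)
print(f"jpg: {shares['jpg']:.2f}%")  # jpg: 50.27%
print(f"gif: {shares['gif']:.2f}%")  # gif: 43.42%
```

The same function applied to any other column of Table One should give the matching column of Table Two.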

 
