This post is a bit of an aside. I’ve been extracting images from WARC files so that I can try some computer vision techniques on them. To do that, I needed to know which file extensions to look for. I began with the usual suspects (initially JP*Gs), but after some crowdsourcing and a pointer to Andy Jackson’s “Formats Over Time: Exploring UK Web History,” I came up with a good list of extensions to extract.
The data come from the 2011 Wide Web Scrape web archive.
The following regular expression found the extensions I was looking for:
'.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)'
Here TLD stands for the top-level domain being searched (e.g. CA or COM). A variation of this script extracted the images and numbered them, preserving duplicates and identically named files.
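To make the pattern concrete, here is a minimal Python sketch of how it could be applied to a list of URLs to count extensions for one top-level domain. This is not the script used for the tables below: the function names (`build_pattern`, `count_extensions`) and the sample URLs are hypothetical, and the case-insensitive matching is my own assumption. One deliberate difference from the pattern as written is that longer extensions are tried first, because otherwise `.tiff` would be captured as `.tif`.

```python
import re
from collections import Counter

# Extensions taken from the regular expression in the post.
IMAGE_EXTENSIONS = ["gif", "jpg", "tif", "jpeg", "tiff", "png", "jp2",
                    "j2k", "bmp", "pict", "wmf", "emf", "ico", "xbm"]


def build_pattern(tld):
    """Build the post's pattern for one TLD (hypothetical helper)."""
    # Try longer extensions first so ".tiff" is not captured as ".tif".
    ordered = sorted(IMAGE_EXTENSIONS, key=len, reverse=True)
    alternation = "|".join(r"\." + ext for ext in ordered)
    # Assumption: match case-insensitively, so "photo.JPG" counts as jpg.
    return re.compile(r".+\." + re.escape(tld) + r"/.+(" + alternation + ")",
                      re.IGNORECASE)


def count_extensions(urls, tld):
    """Count image extensions among URLs that match the TLD pattern."""
    pattern = build_pattern(tld)
    counts = Counter()
    for url in urls:
        match = pattern.match(url)
        if match:
            counts[match.group(1).lstrip(".").lower()] += 1
    return counts


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    sample = [
        "http://example.ca/images/photo.JPG",
        "http://example.ca/scan.tiff",
        "http://example.ca/index.html",
        "http://example.com/logo.png",
    ]
    print(count_extensions(sample, "ca"))
```

Like the original pattern, this sketch does not require the extension to sit at the very end of the URL, so a URL with a query string after the file name still matches.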
The results were interesting in terms of which formats turned up in which top-level domains. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step, of course, is to analyze the images themselves.
Table One: Absolute Appearances of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape.
Top-Level Domain | ca | cn | com | fr | gov | mil | net | org |
jpg | 79265 | 958388 | 1837836 | 21947 | 8150 | 1831 | 188309 | 131848 |
gif | 23227 | 125892 | 538415 | 4300 | 7040 | 307 | 57407 | 47197 |
png | 15491 | 15321 | 196377 | 3182 | 815 | 39 | 13618 | 17811 |
ico | 781 | 1971 | 26421 | 343 | 68 | 1 | 2030 | 1425 |
jpeg | 223 | 742 | 13684 | 143 | 122 | 0 | 1151 | 961 |
bmp | 51 | 0 | 1982 | 15 | 14 | 0 | 143 | 161 |
wmf | 17 | 25 | 39 | 0 | 0 | 0 | 0 | 3 |
tif | 14 | 12 | 72 | 0 | 4 | 0 | 1 | 26 |
emf | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
tiff | 0 | 0 | 16 | 0 | 1 | 0 | 0 | 3 |
xbm | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
total | 119070 | 1102351 | 2614843 | 29930 | 16214 | 2178 | 262659 | 199437 |
Table Two: Relative Frequency of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape.
Top-Level Domain | ca | cn | com | fr | gov | mil | net | org |
jpg | 66.57% | 86.94% | 70.28% | 73.33% | 50.27% | 84.07% | 71.69% | 66.11% |
gif | 19.51% | 11.42% | 20.59% | 14.37% | 43.42% | 14.10% | 21.86% | 23.67% |
png | 13.01% | 1.39% | 7.51% | 10.63% | 5.03% | 1.79% | 5.18% | 8.93% |
ico | 0.66% | 0.18% | 1.01% | 1.15% | 0.42% | 0.05% | 0.77% | 0.71% |
jpeg | 0.19% | 0.07% | 0.52% | 0.48% | 0.75% | 0.00% | 0.44% | 0.48% |
bmp | 0.04% | 0.00% | 0.08% | 0.05% | 0.09% | 0.00% | 0.05% | 0.08% |
wmf | 0.01% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
tif | 0.01% | 0.00% | 0.00% | 0.00% | 0.02% | 0.00% | 0.00% | 0.01% |
emf | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
tiff | 0.00% | 0.00% | 0.00% | 0.00% | 0.01% | 0.00% | 0.00% | 0.00% |
xbm | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
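Table Two follows from Table One by dividing each count by its column total. As a small worked sketch, here is that arithmetic for the ca column, using the counts from Table One; the function name `relative_frequencies` is hypothetical.

```python
# Counts for the ca column, copied from Table One.
CA_COUNTS = {
    "jpg": 79265, "gif": 23227, "png": 15491, "ico": 781, "jpeg": 223,
    "bmp": 51, "wmf": 17, "tif": 14, "emf": 1, "tiff": 0, "xbm": 0,
}


def relative_frequencies(counts):
    """Return each format's share of the column total, as a percentage."""
    total = sum(counts.values())
    return {ext: 100.0 * n / total for ext, n in counts.items()}


if __name__ == "__main__":
    for ext, share in relative_frequencies(CA_COUNTS).items():
        print(f"{ext} | {share:.2f}%")
```

The same function works for any other column once its counts are entered as a dictionary.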