Image File Extensions in the Wide Web Scrape

Percentage share of the three leading file formats in various TLDs. For example, in China (.cn), JPGs make up about 87% of image extensions; in .gov they make up about 50%.

A bit of an aside post. I’ve been extracting images from WARC files so that I can try some computer vision techniques on them. To extract the images, I first needed to know which file extensions to look for. I began with the usual suspects (initially JP*Gs). After some crowdsourcing, and after being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History,” I came up with a good list of extensions to extract.

The data come from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)', where TLD stands for the top-level domain I am searching (e.g. CA or COM), found the extensions I was looking for. A variation of my script extracted the images and numbered them, preserving duplicates and identically-named files.
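As a rough illustration of how an expression like this can be used to count extensions, here is a minimal Python sketch. It is not the script I ran: the function names (build_pattern, count_extensions) and the sample URLs are made up for the example. It also makes two assumptions the expression above does not: matching is case-insensitive, and the extension must sit at the very end of the URL (so that .tiff is not counted as .tif). URLs with query strings are therefore skipped.

```python
import re
from collections import Counter

# Extensions taken from the regular expression above.
EXTENSIONS = ["gif", "jpg", "tif", "jpeg", "tiff", "png", "jp2", "j2k",
              "bmp", "pict", "wmf", "emf", "ico", "xbm"]


def build_pattern(tld):
    """Build the expression for one top-level domain (hypothetical helper)."""
    alternatives = "|".join(r"\." + ext for ext in EXTENSIONS)
    # Assumption: case-insensitive matching, which the original does not specify.
    return re.compile(r".+\." + re.escape(tld) + r"/.+(" + alternatives + r")",
                      re.IGNORECASE)


def count_extensions(urls, tld):
    """Count image extensions among URLs that match the given TLD."""
    pattern = build_pattern(tld)
    counts = Counter()
    for url in urls:
        # Assumption: fullmatch anchors the extension at the end of the URL.
        match = pattern.fullmatch(url)
        if match:
            counts[match.group(1).lstrip(".").lower()] += 1
    return counts


# Made-up URLs, purely for illustration.
sample_urls = [
    "http://example.ca/images/logo.PNG",
    "http://example.ca/a/photo.jpeg",
    "http://example.ca/scan.tiff",
    "http://example.com/pic.jpg",
    "http://example.ca/index.html",
]
ca_sample_counts = count_extensions(sample_urls, "ca")
```

In this toy run, only the three .ca image URLs are counted; the .com image and the HTML page are ignored.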

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.

The findings were interesting in terms of which top-level domain contained which formats. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step, of course, is to analyze the images themselves.

Table One: Absolute Appearances of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape.

Top-Level Domain	ca	cn	com	fr	gov	mil	net	org
jpg	79265	958388	1837836	21947	8150	1831	188309	131848
gif	23227	125892	538415	4300	7040	307	57407	47197
png	15491	15321	196377	3182	815	39	13618	17811
ico	781	1971	26421	343	68	1	2030	1425
jpeg	223	742	13684	143	122	0	1151	961
bmp	51	0	1982	15	14	0	143	161
wmf	17	25	39	0	0	0	0	3
tif	14	12	72	0	4	0	1	26
emf	1	0	1	0	0	0	0	0
tiff	0	0	16	0	1	0	0	3
xbm	0	0	0	0	0	0	0	2
total	119070	1102351	2614843	29930	16214	2178	262659	199437

Table Two: Relative Frequency of File Formats Across Samples of Various Top-Level Domains, Wide Web Scrape

Top-Level Domain	ca	cn	com	fr	gov	mil	net	org
jpg	66.57%	86.94%	70.28%	73.33%	50.27%	84.07%	71.69%	66.11%
gif	19.51%	11.42%	20.59%	14.37%	43.42%	14.10%	21.86%	23.67%
png	13.01%	1.39%	7.51%	10.63%	5.03%	1.79%	5.18%	8.93%
ico	0.66%	0.18%	1.01%	1.15%	0.42%	0.05%	0.77%	0.71%
jpeg	0.19%	0.07%	0.52%	0.48%	0.75%	0.00%	0.44%	0.48%
bmp	0.04%	0.00%	0.08%	0.05%	0.09%	0.00%	0.05%	0.08%
wmf	0.01%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%
tif	0.01%	0.00%	0.00%	0.00%	0.02%	0.00%	0.00%	0.01%
emf	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%
tiff	0.00%	0.00%	0.00%	0.00%	0.01%	0.00%	0.00%	0.00%
xbm	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%
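The percentages in Table Two are each count in Table One divided by its column total. A minimal Python sketch of that arithmetic for the .ca column follows; the function name relative_frequencies is just for illustration, and the counts are copied from Table One.

```python
# Absolute counts for the .ca column of Table One.
ca_counts = {
    "jpg": 79265, "gif": 23227, "png": 15491, "ico": 781, "jpeg": 223,
    "bmp": 51, "wmf": 17, "tif": 14, "emf": 1, "tiff": 0, "xbm": 0,
}


def relative_frequencies(counts):
    """Return each extension's share of the column total, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty sample.
        return {ext: 0.0 for ext in counts}
    return {ext: 100 * n / total for ext, n in counts.items()}


ca_total = sum(ca_counts.values())
ca_percent = relative_frequencies(ca_counts)

# Print in the same shape as Table Two, rounded to two decimals.
for ext, pct in ca_percent.items():
    print(f"{ext}\t{pct:.2f}%")
```

The same function can be applied to any other column of Table One.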

Ian Milligan
