How to use Google Sheets to scrape online data

Quick Links

What is data scraping?

How does data scraping work?

What data can I scrape?

How to scrape data using Google Sheets

Scrape data in seconds, not hours

Finding the data you need can be a challenge, but when it comes to visualizing large quantities, you'll need the help of additional tools. Google Sheets has everything you need to format large amounts of data into a suitable format.

We show you how to scrape data from the internet using three methods. We break down how each works and when you should use them. What's the best thing about them? Because they're on Google Sheets, you can scrape data from anywhere with just a budget Chromebook.

What is data scraping?

Data scraping, in this sense, is the process of extracting data from a website and displaying it in a human-readable output.

A successful data scrape saves hours of work by collating information scattered across one or multiple web pages and displaying it in a format that a human can quickly read. While the term in its most general sense can refer to any program-to-program scrape, we cover the process of scraping data from a website into Google Sheets.

When should I scrape data?

Data scraping is used when an established data viewing method is unavailable. As the process relies on HTML and XML tags, most data from websites can be scraped with the correct formula.

For example, data scraping is the easiest method for exporting a table on Wikipedia for easy searching and ordering (as we'll do later in this guide).

How does data scraping work?

There are three methods to scrape data, which should be chosen based on the complexity and type of the data being scraped. These are HTML, XML, and RSS (with no Python needed).

Each method involves a different formula but follows the same fundamental rules. Point the formula toward the data you want to scrape with the appropriate tags, and it scrapes the data and places it into your table. The skill is identifying the tags you need and compensating for each website's source code.

What are tags?

If you use Google Chrome or most desktop browsers, you can view a webpage's source code with right-clicking on the page and selecting View page source from the drop-down menu. This opens a separate tab showing the website's HTML source code. Don't get alarmed if this seems overwhelming. All you need to scrape data successfully is to identify a few tags.

The source code for Android Police's web page

The HTML source code for Android Police's homepage

Tags come as pairs and look like this in the source code:

Anything placed between the tags is displayed as specified by the chosen tags. So in the example above, the text between these tags is formatted as a list. Tags can be placed within tags to specify further details about how the text is displayed.

Depending on the method you use, you'll look out for different tags.

What data can I scrape?

The short answer is pretty much anything. Scraping from tables and lists is the easiest, but you can scrape anything corresponding to a particular tag with the right know-how. It's best to pick a method after you identify your data. There's no point messing with a complicated XML formula for a simple HTML list.

What data can I scrape with the HTML method?

The HTML method can scrape lists and tables. Check the page's source code, and search for the data you want to scrape. If it's between <table>, <ol>, <li>, or <ul> tags, you can use this method.

What data can I scrape with the XML method?

Instead of clicking View page source, click Inspect from the drop-down menu. This displays the page's source code in XML.

Scraping data with the XML method involves finding the XPath. This is more precise than the HTML method, as you can search for a specific spot in the source code. Use the XML method if you're scraping data that isn't in a list or table format or want to scrape a part of a table.

What data can I scrape with the RSS method?

This method is used for scraping RSS feeds. It's a great way to create your own tool for scraping news, job listings, or regularly updated data.

How to scrape data using Google Sheets

Now that you have a basic understanding of scraping data, you can try it in action.

How to scrape data using the HTML method

The HTML method requires a straightforward formula:

=IMPORTHTML("URL", "element", location)

We show you how to scrape data from this Wikipedia page of best-selling books. As you can see from the page, there are multiple tables here. We'll scrape data from the second table that includes books that have sold between 50 million and 100 million copies.

We used the Inspect tool rather than View Source. For finding HTML tags, both methods work, but Inspect has the benefit of highlighting corresponding sections on the page.

By inspecting the source code, we see that this is a table, not a list. So we use "table" for the element component. It is the second table on the page, so we use "2" as the location. The resultant formula is:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_best-selling_books", "table", 2)

That's it! Now you can organize the data as you wish within Google Sheets. However, you may run into problems. Here are some common problems and their solutions:

You receive an Error message. Recheck the formula. Ensure you're using straight quotes instead of curly quotes and that the position component is not within quotes. The URL and element components should appear in green, and the position should be blue.
Your formula does not scrape the correct table. Try different numbers. This involves some trial and error, as what appears to be the first table on the page may not be the first table in the code.
You can't find the tag. Use Ctrl + F and search "table" or one of the list tags. Hovering over the element in the Source window highlights the relevant section on the web page.

How to scrape data using the XML method

If the HTML method doesn't work or the scraped data isn't precise enough, the XML method should be your next port of call. This method requires the following formula:

=IMPORTXML("URL", "XPath")

The URL component is self-explanatory, but the XPath component can be complicated. This tutorial from w3schools does a great job explaining the structure of an XPath query, but we break down the basics here.

For this example, we'll scrape all the book titles from the same Wikipedia page in the HTML example. In this scenario, the correct formula would be:

=IMPORTXML("https://en.wikipedia.org/wiki/List_of_best-selling_books", "//tbody/tr/td/i")

Above, you can see the result. So how did we arrive at "//tbody/tr/td/i" for the XPath query?

The first step involved finding an example of the data we wanted. In this case, we had to burrow into the tags before finding the element containing the book title within the table.

As you can see, it's nested in multiple tags. It's nested within <i></i>, then <td></td>, and so on. The XPath query requests data found within the <i> tag. However, if you use "//i" for the XPath query, you'll get this:

Checking the Wikipedia page, we see that the formula has returned all data in italics, which is what the <i> tag represents. However, we only wanted the text within the table. Therefore, we use "//tbody/tr/td/i" to narrow the search. The resultant formula only returns text found in this specific place, which is the book titles.

XPath commands aren't an exact science, as every web page is different. In this example, someone could pull the table they wanted because it had a class that no other table on the page had. Figuring out what XPath you need depends on the web page.

This isn't a foolproof method. In this example, a separate XML script had to be written to scrape the data, and this was due to bad HTML practices on the source site. So if everything you do fails, blame the source code.

How to scrape data using the RSS method

Scraping RSS data is more akin to the HTML method than the XML method. It's just extremely limited in its scope. The formula is as follows:

=IMPORTFEED("URL")

If we use Android Police as an example URL (so, =IMPORTFEED("https://www.androidpolice.com/feed/") ), we get this result, precisely what we wanted.

But you can customize it further by using the following parameters in your formula:

=IMPORTFEED(url, [query], [headers], [num_items])

Scraped data from an RSS feed displayed in Google Sheets

A full breakdown of these parameters can be found on Google's support page for the formula. Using these parameters, you can create a tidier feed, such as the example above, which returns just the title and URL.

Scrape data in seconds, not hours

Scraping data in Google Sheets is a challenging concept to wrap your head around, but after some practice, you can scrape massive amounts of data in seconds. Still, you'll need an understanding of Google Sheets, but these tips and tricks can help you sort your data without a headache.