This document is for people who want to use any of the Airbnb city data that I have collected since November 2013. It describes how the data is collected, looks at the completeness and accuracy of the data, and notes some areas for improvement. The source code was originally published on GitHub; its current availability is described in the Source code section below.
In this document, a “survey” is an automated collection of data from the Airbnb web site for a specified city (“search area”) on or around a specific date. The survey may take between one and three days, depending on the number of listings in the city.
There have been significant changes to how I do the collection since July 2015: this document reflects those changes.
The main conclusions are:
- The data collection gives a count of Airbnb listings in a city that is usually within 10% of the correct number. This level of accuracy provides a solid foundation for most policy and social impact discussions. If listings are missed, it is probably because they are fully booked for the near future.
- Within that count, listing type (“Entire home/apt”, “Private room”, or “Shared room”) and host statistics (number of listings per host) are reported unambiguously, and so provide a solid foundation for discussion. One caveat is that some hosts may create multiple identities (“sockpuppets”) and so appear on the site as several separate hosts.
- The data collection includes latitude and longitude values for each listing. Street addresses of individual listings cannot be reliably inferred from these values. Combined with municipal geographical maps (eg, ESRI shapefiles of neighborhood boundaries), the data can be used to analyze listings by neighborhood within a city.
- Price data is collected in $US. The value is usually per night and aggregated pricing information is therefore reliable.
- Estimates of the proportions of listings and of revenue can be made, using the number of reviews as a proxy for the relative number of visits, and the nightly price as a proxy for the relative income. Comparison with published data from Berlin shows that these estimates are good enough to be useful in policy and social impact discussions.
This section describes how a survey of an individual search area is carried out.
- Source code
Source code was originally available at Github, but as Airbnb have made it more difficult to search their site I have stopped making it available. If you want a copy, please send me an email. I’m generally happy to give a copy to people who are doing critical research, but not to people who want to set up a business on Airbnb.
The source code is written in Python 3. It scrapes data from the Airbnb web site for a city (labelled a search area), and stores the result in a database. Each collection of a single city is called a survey. A single database holds many separate surveys, including repeat surveys of the same city.
The database system used is now PostgreSQL.
Most of the Python modules used are standard or are easily installed using one of the Python package management systems. An exception is the lxml package, which can be tricky to install and is available from here. Linux and Mac users can install it with the system package manager.
- Data collection outline
This section describes the mechanics of collecting data. Each survey takes a number of steps. The instructions are given for Linux.
- Step One: Define a search area
Open a terminal, cd to the directory holding the script, and run the following command:
./airbnb.py -asa "City Name"
This step is equivalent to going to the Airbnb web site and searching for “City Name” (for example, New York, Rome, or Montreal) without entering any dates. The resulting web page contains a list of neighborhoods for the city, and these are downloaded into the database.
Database tables used:
- search_area: lists the city name and gives it an integer search_area_id
- neighborhood: lists the neighborhood names for the city, and has a search_area_id foreign key to search_area
- Step Two: Define a survey
Run the following command:
./airbnb.py -asv "City Name"
This step makes an entry in a table that lists the distinct surveys in the database, including a short description (city name and date).
Database tables used:
- survey: assigns a survey_id value and has a description and a foreign key to search_area.
- Step Three: Search for rooms
Run the following command:
./airbnb.py -s survey_id
This step is equivalent to going to the Airbnb search page for a city and carrying out a separate search for each of the following:
- loop over room_types
- Each Airbnb listing is categorized as “Shared Room”, “Private Room” or “Entire Home/Apt”.
- loop over neighborhoods
- For each room_type, loop over all the neighborhoods in the neighborhood table for this search_area.
- loop over the number of people accommodated
- for each room_type and neighborhood, loop over the number of people accommodated. Most searches put a maximum of four people on Private Room and Shared Room types, and a maximum of 16 on the Entire Home/Apt listings.
- loop over pages
- for each of these combinations, loop over the pages returned by the search until no more listings are returned, to a maximum of 100 pages.
Each Airbnb listing is identified by a unique room_id integer value. For each page found, the script collects the room_id value and the room_type, and enters them in a table in the database along with the survey_id. This table, which holds most of the data in the collection, is named room.
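The nested loops above can be sketched roughly as follows. This is an illustration of the loop structure only, not the actual script: `search_survey`, `fetch_page` and `fake_fetch` are hypothetical names, and the real search request is replaced by a stand-in function so that the example runs without touching the Airbnb site.

```python
from typing import Callable, Iterator, Sequence

ROOM_TYPES = ("Entire home/apt", "Private room", "Shared room")
MAX_PAGES = 100  # page ceiling stated in the text

# Hypothetical signature: (room_type, neighborhood, guests, page) -> room_ids
FetchPage = Callable[[str, str, int, int], list]


def max_guests(room_type: str) -> int:
    """Guest ceiling per room type: 16 for entire homes, 4 otherwise."""
    return 16 if room_type == "Entire home/apt" else 4


def search_survey(neighborhoods: Sequence[str], fetch_page: FetchPage) -> Iterator[tuple]:
    """Yield (room_id, room_type) for each listing found by the nested search."""
    for room_type in ROOM_TYPES:
        for neighborhood in neighborhoods:
            for guests in range(1, max_guests(room_type) + 1):
                for page in range(1, MAX_PAGES + 1):
                    room_ids = fetch_page(room_type, neighborhood, guests, page)
                    if not room_ids:
                        break  # this combination is exhausted
                    for room_id in room_ids:
                        yield room_id, room_type


def fake_fetch(room_type: str, neighborhood: str, guests: int, page: int) -> list:
    """Stand-in for a real search request: two pages of one listing each."""
    if page > 2:
        return []
    return [ROOM_TYPES.index(room_type) * 100 + page]


# The same listing turns up under many combinations, so collect into a set.
found = set(search_survey(["Neighborhood A", "Neighborhood B"], fake_fetch))
print(len(found))
```

Because the same room_id is returned by many overlapping searches, whatever stores the results needs to tolerate duplicates; the set here plays the role that a unique key on (room_id, survey_id) would play in a database.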
- Step Four: Fill in room details
Run the following command:
./airbnb.py -f survey_id
This step repeatedly selects a room_id value that has not yet been filled out, and goes to that room’s page on the Airbnb web site, which has a URL http://airbnb.com/rooms/<room_id>. It collects data from that page, and then selects another room_id that has not yet been filled, and repeats this until all listings have been visited.
In recent surveys, the script has been updated to also visit room_id values from previous listings, in case the search missed them.
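A minimal sketch of this fill loop is shown below, under several assumptions: SQLite stands in for PostgreSQL so that the example is self-contained, the column names (`host_id`, `price`, `deleted`, `last_modified`) are illustrative, and `fetch_details` is a hypothetical stand-in for downloading and parsing a listing page.

```python
import sqlite3
from typing import Callable, Optional

ROOM_URL = "http://airbnb.com/rooms/{room_id}"  # URL pattern given in the text


def fill_rooms(conn: sqlite3.Connection, survey_id: int,
               fetch_details: Callable[[str], Optional[dict]]) -> int:
    """Visit every unfilled room in a survey; return the number visited."""
    visited = 0
    while True:
        row = conn.execute(
            "SELECT room_id FROM room"
            " WHERE survey_id = ? AND last_modified IS NULL LIMIT 1",
            (survey_id,),
        ).fetchone()
        if row is None:
            return visited  # every room_id has been filled
        room_id = row[0]
        details = fetch_details(ROOM_URL.format(room_id=room_id))
        if details is None:
            # Page did not appear: mark the listing as deleted.
            conn.execute(
                "UPDATE room SET deleted = 1, last_modified = datetime('now')"
                " WHERE room_id = ? AND survey_id = ?",
                (room_id, survey_id),
            )
        else:
            conn.execute(
                "UPDATE room SET host_id = ?, price = ?, deleted = 0,"
                " last_modified = datetime('now')"
                " WHERE room_id = ? AND survey_id = ?",
                (details["host_id"], details["price"], room_id, survey_id),
            )
        visited += 1


def fake_details(url: str) -> Optional[dict]:
    """Stand-in for fetching and parsing a listing page."""
    if url.endswith("/2"):
        return None  # pretend this listing has been removed
    return {"host_id": 42, "price": 100.0}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE room (room_id INTEGER, survey_id INTEGER, host_id INTEGER,"
    " price REAL, deleted INTEGER, last_modified TEXT,"
    " PRIMARY KEY (room_id, survey_id))"
)
conn.executemany("INSERT INTO room (room_id, survey_id) VALUES (?, 1)",
                 [(1,), (2,), (3,)])
visited = fill_rooms(conn, 1, fake_details)
deleted = conn.execute("SELECT COUNT(*) FROM room WHERE deleted = 1").fetchone()[0]
print(visited, deleted)
```

Using the timestamp column as the "not yet filled" marker means an interrupted run can simply be restarted: rooms already visited are skipped.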
- Data collected for each listing
The script parses the data on that page to collect the following data:
- Host: Each Airbnb host has a unique host_id value.
- Room type: One of the three room types listed above.
- Country: For example, “United States” or “Italy”.
- City: As entered in the page. This value may not match the search_area. For example, many listings have Brooklyn as the city even though they appear in the New York search_area. The city field is not used much in any analysis.
- Neighborhood: This value may not match the neighborhood in the search, as search results commonly include listings from other neighborhoods. It is difficult to extract reliably from the Airbnb pages, and should not be taken too seriously.
- Address: The street address (which generally does not include a street number) is also difficult to extract reliably from the web page, as it is entered by the host. It should not be taken too seriously.
- Reviews: The number of reviews for the listing. Unfortunately, the individual reviews are not collected.
- Overall satisfaction: For listings that have at least three reviews, this is the overall satisfaction (on a one-to-five star rating, including half-stars). Individual ratings (eg cleanliness) are not collected.
- Accommodates: The number of people that the listing accommodates, according to the host.
- Bedrooms: If listed, the number of bedrooms.
- Bathrooms: If listed, the number of bathrooms.
- Price: The price listed by the host. This is almost always the price per night, but there are listings for which it is a price per week. Unfortunately, the script does not make this distinction. The price is collected in the currency displayed on the web page, which depends on the location from which the search is carried out. All my surveys record prices in $Cdn, for example.
- Deleted: If the page does not appear, the listing is marked as deleted by entering a 1 in this column.
- Minimum stay: The minimum number of nights that the listing is available for, according to the host.
- Collection time: A timestamp recording when the script collected the data.
- Latitude: The latitude, as listed in the web page source.
- Longitude: The longitude, as listed in the web page source. For individual listings it is not clear that these values are always accurate. In general, however, they are close to the listing location.
- Survey: The survey being recorded. A single room may be visited several times in repeat surveys.
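For readers who load the data into Python, the fields above might be modelled along the following lines. This is only a sketch: the attribute names and types are assumptions chosen to mirror the descriptions in the list, not a statement of the actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RoomRecord:
    """One listing as seen in one survey (attribute names are illustrative)."""
    room_id: int
    survey_id: int
    host_id: Optional[int] = None
    room_type: Optional[str] = None
    country: Optional[str] = None
    city: Optional[str] = None           # as entered by the host
    neighborhood: Optional[str] = None   # unreliable, see the list above
    address: Optional[str] = None        # unreliable, see the list above
    reviews: Optional[int] = None
    overall_satisfaction: Optional[float] = None  # needs three or more reviews
    accommodates: Optional[int] = None
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    price: Optional[float] = None        # nightly and weekly not distinguished
    deleted: int = 0                     # 1 if the listing page did not appear
    minstay: Optional[int] = None
    last_modified: Optional[datetime] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None

    def has_rating(self) -> bool:
        """True when the page showed an overall satisfaction value."""
        return self.overall_satisfaction is not None


example = RoomRecord(room_id=1, survey_id=1, reviews=2, price=80.0)
print(example.has_rating())
```

Nearly every field is optional because a listing enters the table with only its room_id, room_type and survey_id, and the rest is filled in later or may be missing from the page.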
- Unavailable data
Among the data that is not included in the surveys is:
- occupancy rates: no occupancy data is collected.
- host income: without occupancy rates, host income is not available.
- specific addresses: exploration with available reverse geocoding databases suggests that the latitude and longitude values do not map reliably to specific addresses.
- guest information: no guest information is collected.
Rough income and occupancy distributions can be estimated from the data, as described below, but these estimates are of relative values, not absolute.
This section demonstrates that the total number of listings in a city is accurate to within 10% or 20% most of the time. Airbnb rarely releases raw data, so that accuracy can only be assessed by comparison to other studies or to occasional public statements from Airbnb. One exception is the Attorney General’s report on New York, which is based on Airbnb internal data.
The search methodology does not enter “trip dates” in any of the searches, so that the web site does not exclude listings that are marked as unavailable. However, we know from individual cases that some listings (eg, ones that are heavily booked, or otherwise not available) are not found in the search.[1] For this reason, the script has been updated to also visit room_id values from previous listings, in case the search missed them.
Boundaries are another source of uncertainty. A search in many major cities returns nearby listings that may not be within the city boundaries. Searches in “San Francisco”, for example, return results in Oakland and throughout the greater metropolitan area.
The availability of latitude and longitude values means that, if geo-spatial data is available from an individual city, more precise numbers can be reported (SQL Anywhere includes the ability to import ESRI shape files, commonly available on government open data web sites, and to carry out geo-spatial queries). For a few purposes, and for a few cities, such queries have been carried out. It is, however, a labour-intensive process, as each city makes its neighborhood data available in its own way.
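To give a sense of what such a boundary query does, here is a minimal pure-Python sketch that keeps only the listings falling inside a polygon, using the standard ray-casting test. It is not the database query used for the surveys; the square "boundary" and the listing coordinates are made up, and real work would load the boundary from a shapefile with a GIS library or a spatial database.

```python
def point_in_polygon(lon: float, lat: float, polygon: list) -> bool:
    """Ray-casting test; polygon is a list of (lon, lat) vertices.

    Treats coordinates as planar, which is a reasonable approximation
    at the scale of a single city.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Made-up square boundary and listings, for illustration only.
boundary = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
listings = [
    {"room_id": 1, "latitude": 2.0, "longitude": 2.0},
    {"room_id": 2, "latitude": 2.0, "longitude": 5.0},
]
inside_ids = [
    listing["room_id"]
    for listing in listings
    if point_in_polygon(listing["longitude"], listing["latitude"], boundary)
]
print(inside_ids)
```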
All listings found in any survey do exist on the Airbnb site. There is evidence from Airbnb statements that the listings are fairly complete. Here are some statements made by Airbnb or by others who have access to internal Airbnb data, and some notes.
- New York: Attorney General’s Report
At the conclusion of the Airbnb dispute in New York, the Attorney General’s office gained access to Airbnb listings, and their analysis of those listings was published in a report called “Airbnb in the City” (link). The report says:
the number of unique units booked for private short-term rentals through Airbnb has exploded, rising from 2,652 units in 2010 to 16,483 in just the first five months of 2014.
A survey of New York that I carried out in May 2014 showed a total of 19,006 listings, of which 13,173 had one or more reviews. The 16,483 number of units with visitors falls in the middle of these numbers, as would be expected (not all visitors leave reviews).
The Attorney General’s report breaks down the listings by room_type:
72% of unique units used as private short-term rentals on Airbnb during the Review Period involved the rental of an “entire/home apartment” for less than 30 days in either (1) a “Class A” multiple dwelling or (2) a non-residential building.
While my data set does not break down rentals by dwelling type, the May 2014 survey of New York shows that 62% of listings with one or more reviews were “Entire Home/Apt” and 86% of all listings were “Entire home/apt”. Again, the Attorney General report’s number is between these two values, as would be expected.
The Attorney General’s report lists the number of hosts:
25,463 hosts offered private short-term rentals in New York City during the Review Period [January 1, 2010 through June 2, 2014].
A survey carried out in November 2013 and a survey carried out in May 2014 showed, when combined, 22,309 distinct hosts who had offered listings on Airbnb. As some will have come and gone during the review period, this number suggests that the surveys are picking up most of the listings in the city.
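Combining surveys in this way amounts to taking the union of the host_id values seen in each one. A small sketch, with made-up (room_id, host_id) pairs standing in for two surveys:

```python
def distinct_hosts(*surveys) -> set:
    """Union of host_id values across surveys of (room_id, host_id) pairs."""
    hosts = set()
    for survey in surveys:
        hosts.update(host_id for _room_id, host_id in survey)
    return hosts


# Made-up data: host 11 appears in both surveys and is counted once.
first_survey = [(1, 10), (2, 10), (3, 11)]
second_survey = [(3, 11), (4, 12)]
combined = distinct_hosts(first_survey, second_survey)
print(len(combined))
```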
- Paris
According to Bloomberg (link), Airbnb claimed to have 25,000 listings in Paris in 2014. A survey in September 2014 showed 23,044 and a survey in December showed 24,261.
- Amsterdam
An Airbnb report claimed that “In 2012-2013, Amsterdam’s 2,400 hosts ensured that over 62,000 Airbnb guests”. The first survey I did of Amsterdam, in May 2014, showed 4,617 hosts, of whom 3,228 had at least one review. The numbers reflect a fast growth of Airbnb listings in the city: a year later there are 6,433 hosts, so extrapolating back suggests the survey is in the right ballpark.
- San Francisco
In May 2014, the San Francisco Chronicle carried out their own survey of listings in the city (link). The Chronicle reported “4,798 properties listed in the city. Almost two-thirds — 2,984 — were entire houses or apartments. Of the remainder, 1,651 were private rooms and 163 were shared rooms.”
A survey I carried out in May 2014 showed 6,609 listings in the San Francisco search area. Some of these listings were in areas such as Oakland that are not part of San Francisco proper and which were excluded from the Chronicle’s study. Limiting the listings to the city proper (by using GIS data), I find 4,776 listings: almost exactly the same as the Chronicle.
- February 2015 Claims
In February 2015, Airbnb CEO Brian Chesky claimed that “we’ve grown from just 4,000 listings in Paris in 2012 to over 40,000 today.” (link, also the Airbnb statement). A survey I carried out in February 2015 showed 31,385 listings. This is the first occasion in which the number of listings I find is significantly less than a number quoted by Airbnb.
The article also presented (presumably from Airbnb sources) several other data points:
- “More than 90 per cent of hosts in Paris have only one listing”. My survey shows that 93% of hosts have only one listing. (Aside: Paris is unusual in how high this figure is).
- “New York has 34,000 listings”: A survey I carried out in March 2015 showed 28,796, again somewhat lower than Airbnb’s claim.
- “London has 23,000 listings”: My survey from January 2015 showed 21,298 listings, in good agreement.
- “Rio de Janeiro has 18,000 active listings”: My survey from January 2015 showed 13,395 listings.
- “Barcelona has 15,000 listings”: My survey from January 2015 showed 12,496.
In short, the numbers claimed by Airbnb for Paris and New York are the first to show a significant gap between the company’s claims and the data I find from the web site. Given that the occasion of their announcement was to promote the annual Airbnb host event, being held in Paris in 2015, one has to wonder if they were boosting their numbers a little, perhaps by stretching the boundaries that they are considering.
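The size of each gap can be worked out directly from the figures quoted above. The snippet below does only that arithmetic; the dictionary layout and function name are illustrative. Note that the Paris claim was "over 40,000", so the Paris figure is a lower bound on the gap.

```python
# (claimed by Airbnb, found in my survey), as quoted in the text above.
CLAIMS = {
    "Paris": (40_000, 31_385),
    "New York": (34_000, 28_796),
    "London": (23_000, 21_298),
    "Rio de Janeiro": (18_000, 13_395),
    "Barcelona": (15_000, 12_496),
}


def shortfall_percent(claimed: int, surveyed: int) -> float:
    """How far the survey count falls below the claim, as a percentage."""
    return 100.0 * (claimed - surveyed) / claimed


gaps = {city: shortfall_percent(c, s) for city, (c, s) in CLAIMS.items()}
for city, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{city}: {gap:.1f}% below the claimed figure")
```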
Because Airbnb rarely makes quantitative statements about the number of listings in particular cities, and when it does make such statements it does not define key terms, matching the survey data against Airbnb’s own data is difficult. There are, however, data points to support the claims that:
- In most cases, total listings (and therefore hosts) in my surveys agree with Airbnb data to within a few percent. The recent Paris and New York statements are outliers.
- The distribution of multiple-listing hosts claimed by Airbnb and of room types is well reflected in my surveys.
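The host statistics referred to here, such as the share of hosts with only one listing, come from counting listings per host_id. A minimal sketch with made-up host_id values (one entry per listing):

```python
from collections import Counter


def single_listing_share(host_ids: list) -> float:
    """Fraction of hosts that have exactly one listing."""
    listings_per_host = Counter(host_ids)
    if not listings_per_host:
        raise ValueError("no listings supplied")
    single = sum(1 for count in listings_per_host.values() if count == 1)
    return single / len(listings_per_host)


# Made-up data: host 1 has two listings; hosts 2, 3 and 4 have one each.
share = single_listing_share([1, 1, 2, 3, 4])
print(share)
```

This counts host identities, so a host operating under several identities would be counted as several single-listing hosts.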
Estimating visits and income
As mentioned above, the only information related to visits is the number of reviews and the price per night on a listing. Can this information be used to estimate relative incomes and number of visits within a city?
Absolute numbers of visits and incomes cannot realistically be estimated, but relative numbers of visits should be reliable so long as the ratio of reviews to visits is the same within each group analysed.
In 2013, Airbnb published a report about their business in Berlin (link). It included a listing of the number of guests (= visits) and the hosts’ income grouped by neighbourhood.
Figure 1 shows the aggregated number of reviews per neighborhood (based on reported neighborhood from the Airbnb listing pages) on the x-axis, and the total number of guests as reported by Airbnb on the y-axis. The relative number of visits as estimated by this method provides a good foundation for policy and social impact discussions.
According to the Figure, the number of reviews is 55% of the number of guests reported by Airbnb (which is presumably greater than the number of bookings, as each visit may include multiple people).
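A ratio of this kind can be obtained by fitting a straight line through the origin to the per-neighborhood totals. The sketch below shows that calculation in its simplest least-squares form; it is an assumption about how such a ratio might be computed, and the numbers are made up for illustration, not the Berlin data.

```python
def proportional_slope(x: list, y: list) -> float:
    """Least-squares slope k of the line y = k * x through the origin."""
    sum_xx = sum(value * value for value in x)
    if sum_xx == 0:
        raise ValueError("x values are all zero")
    return sum(a * b for a, b in zip(x, y)) / sum_xx


# Made-up per-neighborhood totals (not the Berlin figures).
guests = [1000, 2500, 4000]
reviews = [500, 1400, 2300]
ratio = proportional_slope(guests, reviews)
print(f"reviews are about {ratio:.0%} of guests")
```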
Figure 2 takes the estimation one step further. If the average length of a visit is the same between groups, then the product of the number of reviews and the per-night price should give a good relative picture of the incomes within groups. In this case, the data is grouped by neighborhood within Berlin.
The figure shows that the estimated income is proportional to the reported absolute income to sufficient accuracy to be used for public policy and social impact discussions. Absolute “income” estimates are, of course, meaningless.
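The estimate described above (reviews multiplied by nightly price, summed by neighborhood) can be sketched as follows. The neighborhood names and numbers are made up, and the result is expressed as shares of the total to emphasise that only relative values are meaningful.

```python
from collections import defaultdict


def relative_income_shares(listings: list) -> dict:
    """Share of estimated income per neighborhood.

    Each listing is (neighborhood, reviews, nightly_price); reviews * price
    is used as a relative proxy for income, as described in the text.
    """
    totals = defaultdict(float)
    for neighborhood, reviews, price in listings:
        totals[neighborhood] += reviews * price
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no reviews, so no relative estimate is possible")
    return {name: value / grand_total for name, value in totals.items()}


# Made-up listings: (neighborhood, number of reviews, nightly price).
sample = [
    ("Centre", 10, 100.0),
    ("Centre", 5, 60.0),
    ("North", 20, 50.0),
    ("South", 10, 70.0),
]
shares = relative_income_shares(sample)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.1%}")
```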
For all its flaws, the data collected by scraping the Airbnb web site for individual cities is reliable enough for policy and social impact discussions.
[1] My thanks to Murray Cox for discussion of this point.