Google is getting ready to release its StreetLearn dataset for training machine-learning models to navigate cities without a map.
The StreetLearn environment relies on images from Google Street View and has been used by Google DeepMind to train a software agent to navigate various western cities without reference to a map or GPS co-ordinates, using only visual clues such as landmarks as it wanders the streets.
The StreetLearn environment encompasses multiple regions within the centers of the cities of London, Paris and New York. It is made up of cropped 360-degree panoramic images of street scenes from Street View, each measuring 84 x 84 pixels. Each panoramic image is a node in a larger network or graph of images, with up to 65,000 nodes per 5km city region, and multiple regions per city. Each region has a distinct urban setting, for instance differing amounts of construction and varying numbers of parks and bridges. For example, in New York the four distinct environments used for training were Harlem, Central Park, Midtown, and Greenwich Village.
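To picture the "panoramas as graph nodes" layout, here is a minimal sketch in Python. The `Panorama` class, the node IDs and the coordinates are all made up for illustration and are not taken from the StreetLearn release; the sketch only shows panoramas stored as nodes that link to their neighbours.

```python
from dataclasses import dataclass, field


@dataclass
class Panorama:
    """One graph node: a street-level panorama (hypothetical structure)."""
    pano_id: str
    lat: float
    lng: float
    neighbours: set = field(default_factory=set)  # IDs of adjacent panoramas


def link(graph, a, b):
    """Add an undirected edge between two panoramas."""
    graph[a].neighbours.add(b)
    graph[b].neighbours.add(a)


def build_toy_region():
    """Build a four-node region; the article cites up to 65,000 per region."""
    graph = {
        pid: Panorama(pid, lat, lng)
        for pid, lat, lng in [
            ("a", 40.7580, -73.9855),  # invented coordinates
            ("b", 40.7585, -73.9850),
            ("c", 40.7590, -73.9845),
            ("d", 40.7586, -73.9860),
        ]
    }
    link(graph, "a", "b")
    link(graph, "b", "c")
    link(graph, "b", "d")
    return graph


region = build_toy_region()
print(sorted(region["b"].neighbours))
```

In this toy region, node "b" is a junction: an agent standing there could move on to any of its three neighbours.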
Raia Hadsell, a research scientist on the Deep Learning team at DeepMind, said Google is “going to release” StreetLearn for other researchers to use, “probably in November”.
Speaking at the recent REWORK Deep Learning Summit in London, Hadsell compared the way the DeepMind agent used StreetLearn to how humans learn how to navigate a city simply by looking around them.
“We’re able to learn how to navigate the entire way across New York, Paris and London,” she said.
“Initially when the agent first starts learning, it’s only going to be given targets that are nearby in its own neighbourhood.
“Gradually those targets get further and further away until they’re covering the entire city.”
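The curriculum Hadsell describes, with targets starting nearby and moving further away, could be sketched roughly as follows. The linear schedule, the 500m starting radius, the 5km cap and the function names are assumptions chosen for illustration, not details confirmed by DeepMind.

```python
import math
import random


def curriculum_radius(step, total_steps, start_m=500.0, max_m=5000.0):
    """Grow the allowed goal distance linearly from start_m to max_m.

    The linear schedule and both distances are assumed values.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_m + progress * (max_m - start_m)


def sample_goal(agent_xy, candidates, radius_m, rng):
    """Pick a random candidate goal within radius_m of the agent.

    Falls back to the nearest candidate if none are in range.
    """
    def dist(p):
        return math.hypot(p[0] - agent_xy[0], p[1] - agent_xy[1])

    in_range = [p for p in candidates if dist(p) <= radius_m]
    if not in_range:
        return min(candidates, key=dist)
    return rng.choice(in_range)


rng = random.Random(0)
goals = [(100.0, 0.0), (1500.0, 0.0), (4000.0, 0.0)]  # metres from origin

# Early in training only the closest goal is eligible.
early = sample_goal((0.0, 0.0), goals, curriculum_radius(0, 1000), rng)

# By the end of training every goal in the region is eligible.
late_radius = curriculum_radius(1000, 1000)
```

At step 0 only the goal 100m away falls inside the radius; by the final step the radius covers all three candidates.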
The system learns to traverse the cities using deep reinforcement learning, a process that uses a series of multi-layered neural networks, mathematical models based very loosely on the structure of the human brain.
In reinforcement learning, a software agent learns which actions it should take to maximize a reward, for example how to move the paddle in the video game Breakout to maximize its score. In the case of StreetLearn, the agent’s goal is to get as close as possible to a given landmark. In London, for instance, the instruction might be ‘Go east of The Shard by 200m’. By randomly rotating left, rotating right or going straight forward, the system eventually finds itself at its target destination, and learns what the streets and landmarks along that path look like, much as a tourist unfamiliar with a city might. The Google agent was trained using A3C (asynchronous advantage actor-critic learning).
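The three-action setup can be illustrated with a toy example. The sketch below is not A3C and is not DeepMind's code: it uses a plain grid instead of Street View imagery, a purely random policy, and an assumed sparse reward of 1.0 for reaching the goal. It only shows how rotating and moving forward change the agent's state and when a reward is paid out.

```python
import random

# Headings as (dx, dy) unit steps, ordered clockwise: N, E, S, W.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
ACTIONS = ["rotate_left", "rotate_right", "forward"]


def step(state, action, goal):
    """Apply one action; return (next_state, reward, done).

    state is (x, y, heading_index). Reward is 1.0 on reaching the goal
    and 0.0 otherwise (an assumed, deliberately sparse reward).
    """
    x, y, h = state
    if action == "rotate_left":
        h = (h - 1) % 4
    elif action == "rotate_right":
        h = (h + 1) % 4
    elif action == "forward":
        dx, dy = HEADINGS[h]
        x, y = x + dx, y + dy
    else:
        raise ValueError(f"unknown action: {action}")
    done = (x, y) == goal
    return (x, y, h), (1.0 if done else 0.0), done


def random_rollout(start, goal, max_steps, rng):
    """Act uniformly at random until the goal is hit or steps run out."""
    state, total = start, 0.0
    for t in range(max_steps):
        state, reward, done = step(state, rng.choice(ACTIONS), goal)
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


total_reward, steps_taken = random_rollout(
    (0, 0, 0), (2, 1), 10_000, random.Random(0)
)
```

A learning algorithm such as A3C would replace the random choice with a policy that is adjusted over many episodes so that rewarded action sequences become more likely.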
There are three neural networks in the agent trained on StreetLearn: a convolutional neural network that handles image recognition and feeds data to two Long Short Term Memory (LSTM) networks, a type of recurrent neural network that serves as a form of memory, allowing the wider system to consider contextual data.
Of the two LSTMs, one is a policy network that decides the action the agent should take next based on its current reward state. In the case of StreetLearn, that means deciding whether the software agent should rotate left, rotate right or go straight ahead.
The other LSTM is a network that is implicitly tasked with memorising the local environment, as well as learning a representation of “here”, the current position of the agent, and of “there”, the location of the goal.
Google used this three-network structure to create an agent that was able to transfer what it had already learned from city to city.
“We didn’t want to have an agent that just memorizes a single city. A taxi driver in London is able to go to Paris and learn to drive there as well,” Hadsell said.
“What does that taxi driver have to do? They just need to go to Paris and relearn where the landmarks are, where the river is, what the best bridges are. But they don’t need to relearn what making a left turn is or what going straight feels like.”
Google was able to transfer the learnings between cities by freezing the training of most of the neural networks used by the agent (the convolutional neural network and the policy LSTM) so they weren’t retrained with each new city. Instead, only the locale-specific LSTM was trained afresh when moving to a new city.
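The freezing idea can be shown with a small sketch. The module names, the toy weights and the plain gradient-descent update are all illustrative assumptions; real networks hold millions of weights and DeepMind's training code is not shown in the source. The point is only that an update step skips the frozen modules and changes the locale-specific one.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the modules marked trainable; frozen ones are untouched."""
    updated = {}
    for name, weights in params.items():
        if trainable[name]:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated


# Hypothetical module names and toy weights standing in for real networks.
params = {
    "conv_net": [0.5, -0.2],
    "policy_lstm": [0.1, 0.3],
    "locale_lstm": [0.0, 0.0],  # started afresh for the new city
}
grads = {name: [1.0, 1.0] for name in params}

# Transfer to a new city: freeze the shared modules, train the locale LSTM.
trainable = {"conv_net": False, "policy_lstm": False, "locale_lstm": True}
new_params = sgd_step(params, grads, trainable)
```

After the step, `conv_net` and `policy_lstm` keep their old weights, while `locale_lstm` has moved in response to the new city's gradients.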
“It works, and it works better the more cities you learn in,” said Hadsell.