Travelling with Style Feedback Thread

Discussion in 'GPGPU Technology & Programming' started by AlexV, Jun 17, 2012.

  1. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    I have a little bit time now, so I'll just start...
    My solution uses five different algorithms depending on the size of the data set and ATSP versus TSP (that's where the tweaking came in):

    1. Floyd-Warshall with path construction
    2. Nearest Neighbor matrix scanning
    3. 2-opt
    4. GA (Generational Algorithm)
    5. TSP to ATSP to TSP converter

    and diving into more details:


    1. Floyd-Warshall with path construction

    I took that AMP C++ sample, which uses Floyd to detect Transitive Closures and redesigned it for detecting Hamilton Cycles (TSP!) with path reconstruction:

    Transitive Closure Sample in C++ AMP,
    http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/08/transitive-closure-sample-in-c-amp.aspx

    The standard Floyd algorithm doesn't reconstruct paths. But as always, Wikipedia was my friend:
    http://en.wikipedia.org/wiki/Floyd–Warshall_algorithm#Path_reconstruction


    2. Nearest Neighbor matrix scanning

    Once you have a Floyd optimized matrix the nearest-neighbor algorithm is fairly trivial: start with a random row, find the minimum column, mark the column as visited, make that column the next row, repeat until all columns are marked.


    3. 2-opt

    Now, I implemented two different kinds of 2-opt algorithms. The first one is from a standard text book:
    I converted some Pascal code that was published in M.M.Syslo, N.Deo, J.S.Kowalik, Discrete Optimization
    Algorithms with Pascal Programs, Prentice-Hall, Englewood Cliffs 1983.

    The second one is pretty remarkable (and is perhaps the reason why I often managed to be a little bit faster). Some guys at the Technical University of Crete (TUC) wrote a paper and explained a really clever way of running multiple 2-opt swaps at the same time!

    Hardware Implementation of 2-Opt Local Search Algorithm for the Traveling Salesman Problem
    http://dl.acm.org/citation.cfm?id=1263915
    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4228483&tag=1

    I called that algorithm 2-opt-tuc. It is not as precise as 2-opt-standard, but super fast (about 10x fater than 2-opt-standard for large data sets). Making 2-opt-tuc work with AMP C++ was a bit tricky. But I figured it out.


    4. GA (Generational Algorithm)

    I added a Generational Algorithm solver in the last days. If there is any time left until the timeout is reached I run a Generational Algorithm based on this paper (which is otherwise completely useless and misleading):

    A highly-parallel TSP solver for a GPU computing platform
    http://dl.acm.org/citation.cfm?id=1945726
    http://www.springerlink.com/content/n20570265p081734/fulltext.pdf

    The GA part is not essential but it's better than doing nothing.


    5. TSP to ATSP to TSP converter

    I started working on a TSP to ATSP to TSP converter at around Thursday after it became clear to me that my original idea was just not working any longer after I optimized too many other parts related to solving TSP.

    At Wikipedia I found this interesting section:
    http://en.wikipedia.org/wiki/Travelling_salesman_problem#Solving_by_conversion_to_symmetric_TSP

    I thought: "So if you can convert an ATSP to a TSP matrix you can run any TSP solver on it? That would work for me!"

    I implemented it and it did work! Well, the solutions were all very good. But the paths were also all scrambled after the matrix transformation. I was looking at a situation like with Floyd with no path reconstruction. As far as I know nobody had come up with a way to reconstruct paths from converted ATSP to TSP matrices. But that's what I tried to do in the remaining time mostly on Sat and Sun. My algorithm seems to work for small data sets but miserably fails for bigger ones. I had reached my limits.


    The rest was tweaking. I had cpu and gpu versions of Floyd, 2-opt-standard, and 2-opt-tuc in my tool chest, that I mixed up freely to keep up with your solution. Some of the cpu versions are faster than my gpu solutions.

    If I come out as the winner of this contest, it will be 2-opt-tuc that saved my butt.
    If you will, Greece has bailed me out!

    Cheers,

    - Bernd
     
    #161 Bernd Paradies, Oct 30, 2012
    Last edited by a moderator: Oct 30, 2012
  2. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    This will be long and I'm rather sleepy already, so please, do bear with me.

    To make it short, a rather detailed description on what was the baseline for “my” solver, please see at Kamil Rocki’s page at http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/projects.html and a somewhat shorter introduction at http://gpuscience.com/articles/logo-gpu-accelerated-travelling-salesman-problem-tsp-solver/. I did of course ask for Kamil’s permission to make a port of his solver and ended up with something that wasn’t quite a straight port. My original contest submission notes and the disclaimer at \headers\opt2kernel.h (see when the code becomes available) have also appropriate texts and links, but perhaps a bit more explanation is in order before I explain what more there is in my code.

    When I learned about the competition on C++ AMP blog, I had just recently decided to teach myself GPU programming and use C++ AMP as a vehicle to do it. At the same time, it was also a good opportunity to check the new things coming in the new C++ standard – as my serious C++ programming dates to C++98 and earlier. In any event, besides writing an algorithm to solve a bitonic TSP at a university algorithms course sufficiently long time ago so that I really didn’t recall any specifics, I made a few baseline searches on the problem domain. The main constraints in my search were that the approaches I commit myself to needs to be something I understand (sufficiently so) and the approaches would be appropriate to the context of this contest, i.e. problem domain size and symmetricy. Some more constraints were that I should time to implement the code even as we were having a baby during the contest and there’s regular, daily work to do.

    I found out quite a few articles on ant colonies, genes and even particle swarms and so on (you catch the drift, I presume), but I had to discard these choices as most of the time I couldn’t understand what would make them tick and what would be the strategy to make them really fast (just using a GPU won’t cut it and particle swarms seemed just plain inefficient). The different opt approaches seemed to be a common theme and then iterated local search another, both of which I could understand and feel they are “tweakable”, and most of all, contained tight loops I could see how to turn into GPU code. Yeah! Something that must really eat cycles on the CPU and is an essential component!

    It was then when I found Kamil Rocki’s and Reiji Suda’s poster draft being showcased on a HPC conference promising insight on the very issue I was contemplating to take on. The trouble was, back then, it was very much a preliminary poster draft and it even cost money to download it! So, being a bit cautious as research papers are always too good to be true (I think we agree on this somewhat), I sent e-mail to Kamil and asked more about the draft before I’d pay for it. He replied and we exchanged a few thoughts and he asked if I liked to see the then code in progress. With a bit of hesitation, I wrote that sure, I just write it then to the submission notes and to the code and when I probably get disqualified, I’ve got what I was after the most anyway: experience and knowledge. Now, it turns out, this “brute-forcing” really gave a good run for your algorithm!

    Now to the differences in my solver to the one, Logo, Kamil wrote. You need to read Kamil’s material to understand how the core ideas work. Logo is written in CUDA having two kernels, one for small problems and one for potentially huge ones. I ported only the one for small kernels, which is at most for 4096 cities when using integer coordinates and cost matrix calculated on the fly on the GPU. I contemplated of transforming the distance matrix in symmetric case to coordinates and then using a straight port, but then for various reasons I decided against it (amongst them that I have an AMD Radeon HD 7850, so difficult to run the original code and weed out the lingering bugs in Logo code or mine and that I had to already use time to learn both C++ AMP and CUDA). Currently just I copy whole NxN matrices to GPU even in symmetric cases. I’m very much aware data transfer costs are considerable in this approach.

    But anyway. Then I also modified the “small kernel” so that it would accommodate different sizes of problems better (the Logo version is/was -- I used older code than is currently available -- a bit rigid in this regard) cost matrix and route data within GPU memory limits and changed the indexing scheme used to support this case. This meant forgoing significantly computing power in form of not using tile_static. There were other issues too, like C++ AMP and CUDA handle interlocked exchanges or other more subtle things, but the main gist on the GPU was as just described.

    Then on the CPU side of things I added support for the different problem sizes in \headers\kernel_context.h (which is a bit messy currently). I also implemented a simple FNV-1a hasher to preclude routes from re-computation, which would happen easily especially on smaller problem domains. You see in the code some traces of other ideas too, like using multiple GPUs or CPUs in parallel. One was to run ant colonies on the CPU and 2-opt on the GPU, but I run out of time. I also experimented shortly with the idea of using std::permutation on some base route, append permuted routes next to each other and calculate 2-opts for all routes with the same transfer cost (this was a bit more elaborate, but that was the gist). This worked somewhat well on some problems (not all), but ultimately I ejected the code since it was ad-hocish, fragile and prevented quick tweaking.

    You may be aware that 2-opt won’t work on asymmetric cases straight away. For a moment I was a bit loss how to solve that one. As it was, I was brutally short of hours during the whole project (one has to prioritize and family is always first) and then it dawned to me that I could just do the usual 2-opt and check that also the “reverse” shortens the route. That way, on average, the change should work both ways. Depending on the “asymmetricy degree” (for the lack of better term, but I read domination numbers and stuff regarding this :)), this would work well or not so well. If I used even the “positive change” 2-opt moves, I’d get this to work with somewhat high CPU cost on even the more difficult instances, but the route I went is to use some of the ones that 2-opt deems to shorten the “symmetric route” and then I just check the reverse does the same. For some added efficiency, I implemented a 3-opt on CPU (eh, a bit crummy one, but I just tried to pull it through), albeit a reduced one which doesn’t reverse the route. In hindsight I could have made it a “full 3-opt”, as I precondition the ATSP instances with it. In the code I provide some pondering how one could try to have 2-opt on ATSP with lesser computation burden, but I simply run out of time and steam to follow my own pondering. Hmm, did I write on using two routes, being reverses of each other? Oh well, it's implicit in the notes, I believe, but I digress...

    One of the difficult issues was finding a perturbation that did help out from local optima, but not too far out. As you may know already and I discovered from the research papers during this contest, good routes usually come near each other. Ants or genes (that is, find routes and crunch them straight with 2-opts) would have been one way to handle this, but time budget was there again. I did add some papers and notes on how to do a “perfect perturbation”, but that was the furthest I could get with my skills and knowledge.

    As to your idea on ATSP to TSP transformation, I thought about that too and for the reasons you discovered, decided against trying to implement it. I also thought about running Floyd-Warshall/Dijikstra from multiple nodes, but didn’t quite reach the conclusion how I could make it run fast with my already fast 2-opt and especially within my time budget (which was essentially thinking stuff during going work and coming back and frantically cranking code for an hour or two during an evening/midnight).

    That was it, I think. I urge you to check Kamil’s papers to get a better idea. The GPU science link gives you also some numbers, but here’s something Kamil told me: "For example, in pla7397 using MF heuristic I get an initial tour of 115.88% and after one 2-opt run (which takes 0.87s) it's 106.33% (see the paper)" (this made me suspect, by the way, that I miss some 2-opt exchanges, since my code obviously won't converge that fast, or then MF just gets into a superb starting route). At one point I also implemented a NN heuristic. I avoided looking at how it was done in Logo solver and ended up with something a bit different. K-d trees, Voronoi diagrams, don’t look bits and all the jazz were also on the table, but I didn’t quite have the time to study and implement them either. So, essentially, my solver is pretty much brute-forcing.

    Some of my code is better (maybe even good), some is crummier. I hope most of it is clear enough to follow through. As I wrote in my original submission notes, someone ought to lure Kamil from academia to earn big bucks and and do the same what he's doing now, good research. He is very smart (obviously) and a forthcoming and courteous man.
     
    #162 Naurava kulkuri, Oct 30, 2012
    Last edited by a moderator: Oct 30, 2012
  3. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    And to add on tweaking and stuff, I thought about doing the warm-ups and stuff before starting calculate and hope they would be excluded from timing, but then I didn't bother. As an added note, the timer I use is made with C++11, which on Windows shows longer times. The header has some MS Connect links in this regard.
     
  4. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    Kamil Rocki? He is the guy I was mentioning, who "borrowed" in my opinion the indexing scheme from Noriyuki Fujimoto and Shigeyoshi Tsutsui's paper: Fast QAP Solving by ACO with 2-opt Local Search on a GPU - see figure #5.

    Rocki's source code has a bug. The Greeks have the same, but the impact is less.
    The bug is that Rocki and the Greeks intuitively reverse the nodes between the candidates. Instead the reversal has to be done inside out. Here is a quick test that Rocki fails: in the testing function he calculates a delta. If the delta is < 0 then the tour will be reduced. Now, if you rearrange the nodes he does then the delta is not showing up as the gain. In other words:

    (oldTourLength + delta) != newTourLength; (delta is < 0, so newTourLength should be smaller than oldTourLength).

    The Pascal code I took does it right. I was first also doing it like Rocki and the Greeks in my gpu version of 2-opt. But then I realized the bug. The correct 2-opt algorithm should always yield

    oldTourLength + delta == newTourLength;

    By doing so you can save a lot of time, because you don't have to recalculate the new tour length.

    newTourLength = oldTourLength + delta;

    instead of:

    newTourLength = getTourLength(tour);

    At least that's what I found out.

    Gotta run to the airport!
    Will check in a few days.

    Thanks for sharing the details on your solution!

    Cheers,

    - Bernd
     
  5. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    (While my wife lets the dog out...)
    Two more things:
    - Rocki's source code is GPL licensed, which made me nervous, because my code had to be MIT.
    - the net result of Rocki's way of doing 2-opt is that his chains are not as optimal as the ones that I calculated with newTourLength = oldTourLength + delta.

    My wife is getting the car - gotta run!

    Cheers,

    - Bernd
     
  6. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    This was something I was wondering. On the very early versions, before the 1st submission, I had it differently for a while. I became unsure if I had it correct (this was still rather new to me and especially the stochastic arguments regarding path finding went quickly over my head) and read some old paper from the nineties, from some IBM folks, I recall that did it it likewise and I just followed the flow. I did profile it though (before I ran out of the Ultimate 30 days experimental license) and according to it, it was very far down the list in some promilles of computataion time if even that, I remember.

    Maybe there's even a note left in my code, I can't remember now. For what's worth it, I e-mailed to Kamil on this regard and he replied that in the Logo code there is a function to calculate the swap effect without going through the whole route. I didn't make use of it. I had decided that in this second, extended round, I wouldn't even take a peek on the Logo solver in order not to take anything from there (even accidentally, to have a fairer fight and to have a feeling I'd accomplished something by myself :)). In some sense I was in luck I used the older version and I hadn't gone through much else than the simple kernel part.


    Likewise. Say, did you wonder about how to use those 2-opt move candinates to facilitate a 3-opt run? I can't say using more than a few hours to think about this, but as 3-opt is in some regard a series of 2-opt exchanges, it feels like being doable. I had two courses to follow in my mind: add an extra loop to GPU or take out the move candinates and loop them through on the CPU -- in some way.

    I came too late to this idea, but last week (or I think so), I discovered Google's JavaScript solver does something like that. See at https://groups.google.com/forum/?fromgroups#!topic/google-maps-tsp-solver/pplU8TNr8sY and the actual implementation at http://code.google.com/p/google-maps-tsp-solver/source/browse/trunk/BpTspSolver.js.
     
    #166 Naurava kulkuri, Oct 31, 2012
    Last edited by a moderator: Oct 31, 2012
  7. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    Sometimes you give, other times are given. :)

    I'd like to win too, of course, but being this close and finishing on second place in a competition like this against a competitor like you wouldn't be a shame either. This wasn't an easy compeition, I believe, as the early choices were the most crucial ones and there were a lot of technicalities along the way. Though this was the first competition of this sort I've ever took part in.

    When you wrote about using "optimizers" and a bit later when it became clear our performance signatures are a close match and I wrote we can "target each other's solutions", I was rather sure this could end up either way as our performance part (for the lack of better term) is basically the same.

    Judging from what you wrote, your solution feels more elaborate. I went for a simpler route and just preclude already known paths and tuned perturbation. But that we both know already.
     
    #167 Naurava kulkuri, Oct 31, 2012
    Last edited by a moderator: Oct 31, 2012
  8. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Apologies gentlemen, my life has been placed on a plane for the past day, so I was a bit detached from the world of the internetz. I see you had a very good exchange about your respective solutions.

    Veikko: curious, is your work machine by chance Windows 7? Also, what driver revision are you using? The "breakage" has been verified on a Cayman powered rig as well (still Win8 with latest drivers though) - as has the correct behaviour for the prior submissions. So it's either a case of something quirky happening between the Win8 infrastructure and the drivers and the hardware, or it's a case of faster hardware breaking something iterative (f.e. you iterate past a critical point and instead of continuing to go in the direction of an optimum, you shoot off; if stopping conditions aren't robust enough, this can happen (has happened to me)). Note that there has been no change in the testing set whatsoever.

    On a more general scope, we are compiling the final grades et al. and will hopefully be done by Sunday. Since you gentlemen seemed to both agree on there being an arbitrarily set time limit that you must abide, we shall take a final round of measurements for the final grade, under imposed time constraints. Cheers!
     
  9. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    Both Windows 7 and Windows 8 (Consumer Preview, or something, the one before official release) and Visual Studio 2012 Ultimate RC/Express and now also with VS 2012 RTM Premium at work, with latest drivers on AMD Radeon HD 7850 (these were "beta" on Windows 8, if I remember correctly) and AMD and a AMD 6800 series card. Both are equipped with Intel processors.

    The stopping condition is in every case the number of cities in route and iterating through that. The only reason I can think of is that the route cost gets something weird and consequently the tour won't get altered -- and cost changed. If you want to get to bottom of this, you can e-mail me if you have further information, so that I can run furter checks. Also, there's something I can say you can change in the code and see if it has an impact. It's a bit strange, though, since ATSP uses the same routine as TSP (there are only a few places they differ) and likely the TSP case should be broken too -- that'd narrow the cases considerably.

    Did Bernd's code get broken by a similar, yet undetermined reason or because of last momement tweaks?
     
  10. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    I too may be somewhat disconnected from the Internet until next week or so.
     
  11. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    Gentlemen, it looks like I have a stable Internet connection. I am now reading through the emails I missed...
    Hey Veikko, I am kind of in your neighborhood: Hamburg, Germany. My best friend is turning 50 and I am visiting him and his family for a few days.

    Alex, where do you live?

    - Bernd
     
  12. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    @Veikko: I honestly don't know who is ahead at this point. I may have a more elaborate setup for TSP. But I think you came up with a much more robust solution for solving ATSP. Yes, I ran into the same thing: 2-opt can't be used for ATSP. Not sure whether I understand how you got 2-opt to work with ATSP. Sounds fascinating!
    Same here: I won't feel bad if I make second place. It was a tough fight and I think we both gave our best.

    Let's see and wait until the verdict comes in.

    Cheers,

    - Bernd
     
  13. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    bachelor party

    Pretty big kind of regarding our respective whereabouts. :) Here's something more to wonder, my last name "translated" in English would mean pretty much (the) Eve as in the one in paradise. :smile:

    Otherwise it doesn't look like I'd have friends having their fifties, but some stag parties (bachelor parties) very soon.
     
  14. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    Ha, that's funny! My last name is pretty rare in German. I decided to extend my stay here in Hamburg for another week. My company has an office here.

    My Internet connection is great, too. GMT+1 is not that bad either.

    Cheers,

    - Bernd
     
  15. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Sorry to keep you gentlemen in suspense, still working on it! If you will stick around (please do!), you'll notice that nothing is precisely on time in the land of B3D:smile:
     
  16. Bernd Paradies

    Newcomer

    Joined:
    Sep 3, 2012
    Messages:
    54
    Likes Received:
    0
    No worries, I am quite busy, too.

    Cheers,

    Bernd
     
  17. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    Busy is a good word. Unix and Oracle databases, ugh.
     
    #177 Naurava kulkuri, Nov 6, 2012
    Last edited by a moderator: Nov 9, 2012
  18. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    What is the situation?
     
  19. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Update time: performance results are neatly compiled, primed and arranged. We ultimately (and sadly) decided to drop the ATSP from the final scoring since it was reasonably across the board over the entire run of the contest, and it also wouldn't have caused an upset whilst having the potential to muddy the waters. Be warned though, the next contest will be ATSP only:razz:

    At this point in time we are waiting for some final referee input, since his ended up being an overall bad timing, with people being highly caught up (BUILD2012 happened, AMD decided it was time to shake like a mambo dancer etc.). The second those are in, winner can break out the champagne. I will say that this has been a rather tight battle.
     
  20. Naurava kulkuri

    Newcomer

    Joined:
    Aug 12, 2012
    Messages:
    67
    Likes Received:
    0
    I'll pop the cork out in any event! :lol:


    Say, just as a minor thing, regardless how things are and shall develop, did the the scoring on my part use the latest submission, submission number six, or the second latest, submission number five?
     
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...